Model of Speech Production

曹限东, Speech Signal Processing
School of Electronic Information, Wuhan University
Chapter 2 Model of Speech Production
This chapter introduces the basic human speech production system and its digital model, which have influenced research on the digital analysis of speech signals and applications in speech coding, synthesis, and recognition.
2.1 Speech Production
• Speech Production Physiology
• Speech Production Mechanism
• Speech Production System – A Simplified Model
2.2 Digital Model of Speech Production
2.3 Speech Perception
• Sound Pressure Level
• Physiology of the Ear
• Frequency Analysis
• Masking Effect
2.1 Speech Production
Speech is produced by air-pressure waves emanating from the mouth and the nostrils of a speaker.
We restrict ourselves to a schematic view of only the major articulators (发音器官), as shown in the figure above. The gross components of the speech production apparatus are the lungs, trachea (气管), larynx (喉, the organ of voice production), pharyngeal cavity (咽腔, throat), and oral and nasal cavities (口腔和鼻腔). The pharyngeal and oral cavities are typically referred to as the vocal tract, and the nasal cavity as the nasal tract. As illustrated in the figure, the human speech production apparatus consists of:
• Lungs: the source of air during speech.
• Vocal cords (larynx): when the vocal folds are held close together and oscillate against one another during a speech sound, the sound is said to be voiced. When the folds are too slack or too tense to vibrate periodically, the sound is said to be unvoiced. The place where the vocal folds come together is called the glottis (声门).
• Velum (soft palate, 软腭): operates as a valve, opening to allow passage of air (and thus resonance) through the nasal cavity. Sounds produced with the flap open include m and n.
• Hard palate: a long, relatively hard surface at the roof of the mouth which, when the tongue is placed against it, enables consonant articulation.
• Tongue: a flexible articulator, shaped away from the palate for vowels and placed close to or on the palate or other hard surfaces for consonant articulation.
• Teeth: another place of articulation, used to brace the tongue for certain consonants.
• Lips: can be rounded or spread to affect vowel quality, and closed completely to stop the oral air flow in certain consonants (p, b, m).
The vocal cords, velum (soft palate), tongue, teeth, and lips move to different positions to produce various speech sounds and are sometimes referred to as articulators.
Speech production can be thought of as an acoustic filtering operation:
• System (filter): the vocal and nasal tracts.
• Excitation (input): provided by the organs below the vocal tract (e.g., larynx, lungs).
• Articulators change the shape of the filter, hence producing different speech sounds.
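The filtering view above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each vocal-tract resonance is approximated by a two-pole digital resonator, and the function names (`resonator`, `vocal_tract`), the sampling rate, and the resonance frequencies and bandwidths are hypothetical values chosen for the example, not taken from these notes.

```python
import math

def resonator(x, freq_hz, bandwidth_hz, fs):
    """Two-pole digital resonator: a rough stand-in for one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / fs)      # pole radius set by the bandwidth
    theta = 2.0 * math.pi * freq_hz / fs            # pole angle set by the center frequency
    a1, a2 = -2.0 * r * math.cos(theta), r * r
    y1 = y2 = 0.0
    y = []
    for s in x:
        out = s - a1 * y1 - a2 * y2                 # y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
        y.append(out)
        y2, y1 = y1, out
    return y

def vocal_tract(excitation, formants, fs):
    """Pass the excitation (input) through a cascade of resonators (the filter)."""
    signal = list(excitation)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, fs)
    return signal

FS = 8000                                            # assumed sampling rate in Hz
FORMANTS = [(500, 100), (1500, 100), (2500, 100)]    # illustrative (frequency, bandwidth) pairs
impulse = [1.0] + [0.0] * 255                        # a single excitation pulse
response = vocal_tract(impulse, FORMANTS, FS)        # impulse response of the "tract"
```

Changing the entries of `FORMANTS` plays the role of moving the articulators: the excitation stays the same, but the filter, and therefore the output sound, changes.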
The most fundamental distinction between sound types is the voiced/unvoiced distinction, which depends on the type of excitation.
• Voiced sounds are produced by forcing air through the glottis, the opening between the vocal folds. The vocal folds vibrate in this case, producing quasi-periodic puffs of air that excite the vocal tract. An example of a voiced sound is the vowel “ee” in “sees”.
• Unvoiced sounds are generated by forming a constriction at some point along the vocal tract and forcing air through it to produce turbulence. The vocal folds do not vibrate in this case. An example of an unvoiced sound is the “s” in “sees”.
• A sound can also be simultaneously voiced and unvoiced (mixed). An example of a mixed sound is the “z” in “sees”.
• Another type is silence.
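The four excitation types listed above can be imitated with very simple signals. The sketch below is an idealization, not a physiological model: voiced excitation is reduced to a train of unit pulses (real glottal puffs are smoother and only quasi-periodic), and turbulence is reduced to Gaussian white noise. The function name `excitation`, the sampling rate, the F0 value, and the noise level are all assumptions made for the example.

```python
import random

def excitation(kind, n_samples, fs=8000, f0=100.0, seed=0):
    """Return an idealized excitation signal of the requested kind."""
    rng = random.Random(seed)
    period = int(round(fs / f0))                     # samples between glottal pulses
    pulses = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    noise = [rng.gauss(0.0, 0.1) for _ in range(n_samples)]
    if kind == "voiced":                             # periodic pulses, as for "ee"
        return pulses
    if kind == "unvoiced":                           # turbulence noise, as for "s"
        return noise
    if kind == "mixed":                              # both at once, as for "z"
        return [p + q for p, q in zip(pulses, noise)]
    if kind == "silence":
        return [0.0] * n_samples
    raise ValueError("unknown excitation kind: " + kind)

voiced = excitation("voiced", 800)                   # 0.1 s at the assumed 8 kHz rate
unvoiced = excitation("unvoiced", 800)
mixed = excitation("mixed", 800)
```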
Vocal fold vibration
• The vocal folds open and close in a periodic pattern. For that reason, voiced sounds are quasi-periodic (类周期的).
• The frequency at which the vocal folds open and close is called the fundamental frequency, F0.
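Because a voiced frame repeats roughly every 1/F0 seconds, F0 can be estimated by finding the lag at which the frame best matches a shifted copy of itself. The sketch below uses a plain autocorrelation search; it is one common approach, not necessarily the method used in this course. The function name `estimate_f0` and the 60 to 400 Hz search range are assumptions made for the example.

```python
import math

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 in Hz from the autocorrelation peak; return None if no peak is found."""
    min_lag = int(fs / f0_max)                       # shortest period considered
    max_lag = min(int(fs / f0_min), len(frame) - 1)  # longest period considered
    best_lag, best_value = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with itself shifted by `lag` samples.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag if best_lag else None

FS = 8000                                            # assumed sampling rate in Hz
# Synthetic "voiced" frame: a 125 Hz fundamental plus its second harmonic.
frame = [math.sin(2 * math.pi * 125 * n / FS) + 0.5 * math.sin(2 * math.pi * 250 * n / FS)
         for n in range(512)]
f0 = estimate_f0(frame, FS)
```

On real speech this simple search is prone to octave errors and should only be applied to frames already known to be voiced.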
The fundamental frequency (F0) ranges of different sources are as follows.
[Figure: waveform of the word “sees”, with voiced and unvoiced segments marked]
Voiced sounds, including vowels, have a roughly regular pattern in their time and frequency structure that unvoiced sounds, such as the consonant s, lack. Voiced sounds also typically have more energy, as shown in the figure above. The figure shows the waveform of the word “sees”, which consists of three phonemes: an unvoiced consonant /s/, a vowel /i:/, and a voiced consonant /z/ (a mix of voiced and unvoiced).
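The energy difference described above can be measured frame by frame. The sketch below computes the short-time energy of a frame and, as an additional cue that is not discussed in these notes, the zero-crossing rate, which tends to be low for regular voiced frames and high for noise-like unvoiced frames. The function names and the two synthetic test frames are assumptions made for the example.

```python
import math
import random

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

FS = 8000                                            # assumed sampling rate in Hz
rng = random.Random(0)
# Synthetic stand-ins: a strong 125 Hz tone for a voiced frame, weak noise for an unvoiced one.
voiced_like = [math.sin(2 * math.pi * 125 * n / FS) for n in range(512)]
unvoiced_like = [rng.gauss(0.0, 0.1) for _ in range(512)]

energy_voiced = short_time_energy(voiced_like)
energy_unvoiced = short_time_energy(unvoiced_like)
zcr_voiced = zero_crossing_rate(voiced_like)
zcr_unvoiced = zero_crossing_rate(unvoiced_like)
```

A practical voiced/unvoiced detector would compare these features against thresholds tuned on real recordings; no threshold values are suggested here.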
We can obtain a simplified model of speech production as follows.
Since the glottal wave is periodic, consisting of a fundamental frequency (F0) and a number of harmonics (integer multiples of F0), it can be analyzed as a sum of sine waves, as discussed in the Digital Signal Processing course. The resonances (共鸣) of the vocal tract are excited by the glottal energy. Suppose, for simplicity, that we regard the vocal tract as a straight tube of uniform cross-sectional area, closed at the glottal end and open at the lips. When the shape of the vocal tract changes, the resonances change as well. Harmonics near the resonances are emphasized, and, in speech, the resonances of the cavities that are typical of particular articulator configurations (e.g., the different vowel timbres) are called formants (共振峰). The vowels in an actual speech waveform can be viewed from a number of different perspectives, emphasizing either a cross-sectional view of the harmonic responses at a single moment or a longer-term view of the evolution of the formant tracks over time.
Formants in Speech
