Draw a speech: Software Automatic Mouth

I concluded last year’s article “We draw sound” with an admission: “Is it possible to draw sound from a blank sheet, without tracing the spectrogram of an existing recording? Frankly, I didn’t succeed.” But recently I learned about SAM: released in 1982 by Don't Ask Software, it was the first commercially successful speech-synthesis program for personal computers. In the mid-2000s, the German demosceners Tobias Korbmacher and Sebastian Macke took a disassembled listing of SAM for the Commodore 64 and converted it into unreadable but working C code; then, in 2014, the Briton Vidar Hokstad tried to bring the C code into readable form, manually giving the variables meaningful names and replacing goto with loops and branches; and finally, in 2017, another German, Christian Schiffler, rewrote the code from C to JavaScript. You can try it in action as a “black box” at discordier.imtqy.com/sam .

In my opinion, a primitive JavaScript speech synthesizer is the most convenient experimental model for anyone who wants to understand how speech synthesis works in general. My fork of SAM, with substantially cleaned-up code and comments, is available at github.com/tyomitch/sam . Unfortunately, the previous authors have long since lost interest in SAM, and reviewing pull requests for a hobby project from many years ago is not a priority for them.

SAM consists of four functional components:

  1. Reciter translates English text into a phonemic record: for example, “A LITTLE TOO LOW” (an example from the demo program shipped with SAM) turns into “AH LIHTUL TUW5 LOW”.
  2. Parser turns the phonemic record into a phonetic one: “AH LIHTUL TUW5 LOW” becomes “AH, ,L,IH,DX,AX,LX, ,T,*,*,UX,WX, ,L,OW,WX”. For each phone in the output, Parser also sets a duration and a tone.
  3. Renderer builds arrays of frequencies, amplitudes and other acoustic characteristics from the phonetic record.
  4. The last, anonymous component (the function ProcessFrames) turns the arrays of frequencies and amplitudes into a PCM stream for audio output.

In this article, I will analyze all four components in turn.

Reciter


Reciter shipped with SAM as a separate program: its creators claimed that the 469 pronunciation rules built into Reciter correctly transcribed about 90% of English words. This means that the transcription of every tenth word needed manual editing before being fed to the input of the following components.

SAM uses its own transcription system, where English phonemes are denoted by single characters from the set [A-Z/] or by pairs of such characters:
| Phoneme | Designation | Phoneme | Designation | Phoneme | Designation | Phoneme | Designation |
|---|---|---|---|---|---|---|---|
| /b/ | B | /p/ | P | /v/ | V | /f/ | F |
| /d/ | D | /t/ | T | /z/ | Z | /s/ | S |
| /dʒ/ | J | /tʃ/ | CH | /ʒ/ | ZH | /ʃ/ | SH |
| /g/ | G | /k/ | K | /h/ | /H | /ð/ | DH |
| /m/ | M | /n/ | N | /ŋ/ | NX | /θ/ | TH |
| /l/ | L | /r/ | R | /j/ | Y | /w/ | W |
| /æ/ | AE | /ɛ/ | EH | /ɪ/ | IH | /i/ | IY |
| /ʌ/ | AH | /ɔ/ | AO | /ʊ/ | UH | /u/ | UX |
| /ɒ/ | OH | /ɑ/ | AA | /ə/ | AX | /ɜ/ | ER |
| /eɪ/ | EY | /aɪ/ | AY | /ɔɪ/ | OY | /aʊ/ | AW |
| /oʊ/ | OW | [l̩] | UL | [m̩] | UM | [n̩] | UN |
In addition to phonemes, the digits 1–8 are used in SAM transcription to indicate stress and tone: 1 means “very emotional” stress, 4 normal stress, 6 neutral tone, and 8 an “extreme drop in tone”.

Reciter is arranged quite simply: context-sensitive rules from a list are applied one by one to the input string. For example, the rule "(IR)#=AYR" replaces the text ⟨ir⟩ before a vowel with /aɪr/; the rule ".(S) =Z" replaces ⟨s⟩ between a voiced consonant and a space (the end of a word) with /z/; the rule "(U)^^=AH5" replaces ⟨u⟩ before two consonants in a row with /ʌ/ and makes the syllable stressed. It is worth noting that in many words Reciter does not stress any vowel at all, while in some it marks several vowels at once: for example, the word ⟨provoking⟩ turns into "PRUW4VOW5KIHNX", i.e. /ˈpruˈvoʊkɪŋ/. An attentive reader will notice that the superfluous stress is not the only mistake in this transcription.
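To illustrate how such context-sensitive rules work, here is a toy matcher. This is a hypothetical sketch, not SAM's actual code: the real Reciter supports more context symbols (e.g. "." for a voiced consonant) and hundreds of rules, while this sketch handles only "#" (any vowel), "^" (any consonant), spaces and literal characters.

```python
VOWELS = set("AEIOUY")

def parse_rule(rule):
    """Split 'prefix(match)suffix=output' into its four parts."""
    pattern, output = rule.split("=")
    prefix, rest = pattern.split("(")
    match, suffix = rest.split(")")
    return prefix, match, suffix, output

def matches_context(text, pos, pattern, step):
    """Check a prefix (step=-1) or suffix (step=+1) pattern at pos."""
    chars = pattern[::-1] if step < 0 else pattern
    for c in chars:
        if not 0 <= pos < len(text):
            return False
        ch = text[pos]
        if c == "#":                      # '#' = any vowel letter
            if ch not in VOWELS:
                return False
        elif c == "^":                    # '^' = any consonant letter
            if not ch.isalpha() or ch in VOWELS:
                return False
        elif ch != c:                     # literal character (or space)
            return False
        pos += step
    return True

def transcribe(word, rules):
    text = f" {word.upper()} "            # spaces mark word boundaries
    out = []
    i = 1
    while i < len(text) - 1:
        for prefix, match, suffix, output in map(parse_rule, rules):
            if text[i:i + len(match)] != match:
                continue
            if not matches_context(text, i - 1, prefix, -1):
                continue
            if not matches_context(text, i + len(match), suffix, +1):
                continue
            out.append(output)
            i += len(match)
            break
        else:
            i += 1                        # no rule matched: skip the letter
    return "".join(out)

# A made-up mini rule list, just enough for one word
rules = ["(IR)#=AYR", "(I)=IH", "(R)=R", "(E) =", "(F)=F"]
print(transcribe("fire", rules))   # -> FAYR
```

Note how rule order matters: "(IR)#=AYR" must be tried before the bare "(I)=IH", exactly as in Reciter's real rule lists, where more specific rules precede more general ones.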

I decided that transcription is the least interesting part of a speech synthesizer; and given the relatively low quality of Reciter's output, I decided to rely on an external service instead. There are several freely available Internet services for transcribing excerpts of English text; instead of heuristic rules, these services use rather large dictionaries. In my experience, the best transcription quality comes from tophonetics.com and photransedit.com; the latter, however, has a number of drawbacks: it uses a not quite standard phoneme notation, marks stress even in monosyllabic words, and, most inconveniently, is written in ASP.NET and requires correct __VIEWSTATE and __EVENTVALIDATION values in POST requests, which complicates its use from third-party sites. Therefore, in my demonstration of SAM's internals, available at tyomitch.imtqy.com , I used transcription through https://cors-anywhere.herokuapp.com/https://tophonetics.com/

Parser


Of the component names, only Reciter comes from SAM's creators; Parser and Renderer were named by the German reverse engineers, so these names do not quite accurately reflect what the components do.

Parser has three main tasks:

  1. «» (, ) . ( ) «-» UL, UM, UN, [l̩, m̩, n̩]. , /əl, əm, ən/; Parser , AXL, AXM, AXN .
  2. , .. . «AH LIHTUL TUW LOW» , /t/ [ɾ] (DX) [t] (T,*,*) . ( .) , /l/ [ɫ] (LX) , [l] (L) .
  3. .

SAM supports 81 phones, of which 61 have names and can be used in the phonemic input to “outwit” Parser and directly request the desired sound. The remaining 20 phones are nameless; 18 of them can appear only as a result of Parser's work, while the phones with codes 46 and 47 cannot appear at all, and probably remained unused through an oversight of SAM's developers.

Phones with codes 0–4 (the space and the punctuation marks .?,-) correspond to silence; the rest are summarized in the following table:
| Code | Designation | Sound | Code | Designation | Sound |
|---|---|---|---|---|---|
| 5 | IY | [i] | 42 | CH | [t] as part of /tʃ/ |
| 6 | IH | [ɪ] | 43 | * | [ʃ] as part of /tʃ/ |
| 7 | EH | [ɛ] | 44 | J | [d] as part of /dʒ/ |
| 8 | AE | [æ] | 45 | * | [ʒ] as part of /dʒ/ |
| 9 | AA | [ɑ] | 48 | EY | ~ [ɜ] as part of /eɪ/ |
| 10 | AH | [ʌ] | 49 | AY | ~ [ɑ] as part of /aɪ/ |
| 11 | AO | [ɔ] | 50 | OY | [ɔ] as part of /ɔɪ/ |
| 12 | UH | [ʊ] | 51 | AW | [ɑ] as part of /aʊ/ |
| 13 | AX | [ə] | 52 | OW | [ɔ] as part of /oʊ/ |
| 14 | IX | shorter [ɪ] | 53 | UW | ~ [u] |
| 15 | ER | [ɜ] | 54 | B | [b] |
| 16 | UX | [u] | 55 | * | |
| 17 | OH | [o] | 56 | * | |
| 18 | RX | [ɹ] | 57 | D | [d] |
| 19 | LX | [ɫ] | 58 | * | |
| 20 | WX | short [ʊ] in diphthongs | 59 | * | |
| 21 | YX | short [ɪ] in diphthongs | 60 | G | [gʲ] |
| 22 | WH | longer [w] | 61 | * | |
| 23 | R | [ɹ̠] | 62 | * | |
| 24 | L | [l] | 63 | GX | [g] |
| 25 | W | [w] | 64 | * | |
| 26 | Y | [j] | 65 | * | |
| 27 | M | [m] | 66 | P | [p] |
| 28 | N | [n] | 67 | * | |
| 29 | NX | [ŋ] | 68 | * | |
| 30 | DX | [ɾ] | 69 | T | [t] |
| 31 | Q | [ʔ] | 70 | * | |
| 32 | S | [s] | 71 | * | |
| 33 | SH | [ʃ] | 72 | K | [kʲ] |
| 34 | F | [f] | 73 | * | |
| 35 | TH | [θ] | 74 | * | |
| 36 | /H | [ç] | 75 | KX | [k] |
| 37 | /X | [h] | 76 | * | |
| 38 | Z | [z] | 77 | * | |
| 39 | ZH | [ʒ] | 78 | UL | [l̩] |
| 40 | V | [v] | 79 | UM | [m̩] |
| 41 | DH | [ð] | 80 | UN | [n̩] |

The actions performed by Parser consist of seven steps:

  1. Parsing proper: the input string is converted into a list of phone codes and a parallel list of tones, given by the digits in the input.
  2. Applying a set of two dozen rules to the list of phones: for example, the substitutions /t/ + /r/ → [tʃ] + [ɹ̠] and /k/ + /non-front vowel/ → [k] + [vowel]. (/k/ before front vowels remains unchanged and corresponds to the phone [kʲ].)
  3. CopyStress: the tone set for stressed vowels is extended to the consonants that precede them.
  4. SetPhonemeLength: each phone is assigned a duration (in nominal “frames”). Two tables of phone lengths are used, one for stressed syllables and one for unstressed ones.
  5. AdjustLengths: applying a set of seven rules that adjust phone durations. For example, vowels before voiced consonants are lengthened by half, and consecutive plosive consonants are shortened by half.
  6. ProlongPlosiveStopConsonants: plosive consonants before vowels, liquids and fricatives are split into triples of phones. The first phone of the triple corresponds to reduced sound intensity, the second to full intensity, the third to silence.
  7. InsertBreath: the phrase is divided by the “silent” phones (.?,-) into “exhalations” of up to 232 frames each (about 2½ seconds). In the SAM implementations for retro computers this partitioning was needed to save memory; in the JavaScript version it serves no purpose, and in my fork it is removed.
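To make steps 3 and 4 above concrete, here is a hedged Python sketch. The tables and tone arithmetic are invented for illustration (SAM's real tables are indexed by phone code and its stress handling has more cases); only the overall shape of CopyStress and SetPhonemeLength is reproduced.

```python
# Hypothetical per-phone data, NOT SAM's real tables
IS_VOWEL = {"AH": True, "L": False, "IH": True, "T": False}
LEN_STRESSED   = {"AH": 11, "IH": 11, "L": 8, "T": 7}   # frames, made up
LEN_UNSTRESSED = {"AH": 8,  "IH": 6,  "L": 6, "T": 5}

def copy_stress(phones, tones):
    """A stressed vowel passes its tone to the consonant before it."""
    tones = list(tones)
    for i in range(1, len(phones)):
        if IS_VOWEL[phones[i]] and tones[i] and not IS_VOWEL[phones[i - 1]]:
            tones[i - 1] = tones[i] + 1
    return tones

def set_phoneme_length(phones, tones):
    """Pick each phone's duration from the stressed or unstressed table."""
    return [(LEN_STRESSED if t else LEN_UNSTRESSED)[p]
            for p, t in zip(phones, tones)]

phones = ["L", "IH", "T", "AH"]
tones  = [0, 4, 0, 0]              # the vowel IH carries stress level 4
tones  = copy_stress(phones, tones)        # -> [5, 4, 0, 0]
print(set_phoneme_length(phones, tones))   # -> [8, 11, 5, 8]
```

The parallel-list representation (phones, tones, durations marching in step) is exactly the shape of data that Parser hands to Renderer.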

Parser outputs three parallel lists: phone codes, their tones, and their durations.

Renderer


This component is responsible for speech synthesis in the narrow sense of the word. At the input, it receives the list of phones with their tones and durations, as well as the parameters that shape the synthesized voice. At the output, it produces eight parallel lists: the frequencies of the formants F1–F3, their intensities (amplitudes), the fundamental frequency F0 (the tone of voice), and the sampledConsonant values, which will be described in more detail below.

The SAM manual offers the following examples of voice-parameter values:
| Voice | Speed | Pitch | Throat | Mouth |
|---|---|---|---|---|
| Elf | 72 | 64 | 110 | 160 |
| Little Robot | 92 | 60 | 190 | 190 |
| Stuffy Guy | 82 | 72 | 110 | 105 |
| Little Old Lady | 82 | 32 | 145 | 145 |
| Extra-Terrestrial | 100 | 64 | 150 | 200 |
| SAM | 72 | 64 | 128 | 128 |
| Dalek | 120 | 100 | 100 | 200 |
It is worth noting that the Speed parameter is used not in Renderer but at the audio-generation stage: the duration of the sound generated for one frame depends on it. Besides Speed, the frame duration also depends on the type of sound, as will be explained below.

Formant speech synthesis is based on the fact that each phone is associated with the frequencies and amplitudes of its first few formants. For synthesizing vowels, two formants suffice; for example, here is a chart of formant frequencies typical of English vowels, taken from the website of the University of Manitoba:


To synthesize consonants, additional formants are needed. Moreover, as I mentioned in last year’s article , noisy consonants are characterized by “bursts” in a wide frequency band:



These “bursts” cannot be obtained by pure formant synthesis, so SAM reproduces the sounds of noisy consonants from a sample table. The sampledConsonant values mentioned above select the part of the table corresponding to the particular noisy consonant.

The actions performed by Renderer consist of five steps:

  1. SetMouthThroat: for vowels and sonorant phones (codes 5–29 and 48–53), the tabulated values of the frequencies F1 and F2 are multiplied by the Mouth and Throat parameters, respectively.
  2. CreateFrames: each phone is expanded into as many frames as its duration specifies, and the frames are filled with the tabulated frequencies and amplitudes. The tone digits (1–8) are converted into offsets of the Pitch parameter (1 → −32, 6 → 0, 8 → +12). In addition, over some 30 frames before a punctuation mark, the tone smoothly rises (before a question mark) or falls (before a period).
  3. CreateTransitions: at the boundaries between adjacent phones, the frequencies F0–F3 and the amplitudes of F1–F3 are linearly interpolated, so that one sound glides smoothly into the next. The length of a transition depends on which of the two neighboring phones dominates.
  4. The fundamental frequency F0 is made dependent on F1 (the “pitch contour”): half of the F1 value is subtracted from the pitch of each frame.
  5. Finally, the resulting parallel lists of per-frame values are handed over to the audio generator, which converts them into a PCM stream.
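The interpolation idea behind CreateTransitions can be sketched in a few lines; the frame data and the `blend` helper below are invented for illustration and interpolate just one of the eight per-frame lists.

```python
def blend(frames, boundary, width):
    """Linearly interpolate `frames` across `width` frames around a
    phone boundary, so one value glides into the next."""
    lo, hi = boundary - width // 2, boundary + width // 2
    start, end = frames[lo], frames[hi]
    for i in range(lo, hi):
        frames[i] = start + (end - start) * (i - lo) // (hi - lo)
    return frames

# F1 frequency per frame: an [a]-like phone followed by an [i]-like one,
# meeting at frame 6; the transition is spread over 4 frames
f1 = [700] * 6 + [300] * 6
print(blend(f1, 6, 4))
# -> [700, 700, 700, 700, 700, 600, 500, 400, 300, 300, 300, 300]
```

In the real Renderer the same smoothing is applied to all eight lists at every phone boundary, which is what removes the audible "steps" between phones.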


From the physical point of view, speech is a sequence of glottal pulses created by the vocal cords (see figure), which on their way out pass through the mouth and nose (the vocal tract), and these, acting as resonators, amplify certain harmonics of the glottal wave. The frequency of the glottal wave is the fundamental frequency of the voice, F0. As a rule, its values lie between 100 and 400 Hz: lower for men, higher for women, higher still for children. The voice model used in formant synthesis applies several band-pass filters to the glottal wave, each of which extracts one formant. The width of the extracted band depends on the formant frequency and, according to experimental data, reaches 200 Hz. In my SAM demonstration on tyomitch.imtqy.com







this approach is used: with the default value of the Bandwidth = 3 parameter, each formant contributes to the resulting audio signal the harmonics of F0 lying within ±5.9% of the formant frequency. This roughly corresponds to the graphs above: a formant at 3 kHz is given a band 177 Hz wide. The classic SAM implementation approached the generation of the required harmonics more inventively: for each formant a single wave is generated, but the phase of this wave is reset to zero at the frequency F0. In my demo, you can switch to a mode that synthesizes one wave per formant (without zeroing the phase) by unchecking the Pitch checkbox.
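The classic trick just described, one wave per formant with its phase reset at the rate F0, can be sketched in a few lines of Python. The frequencies and amplitudes below are illustrative, not taken from SAM's tables, and the real SAM also adds a square wave for F3 and renders into 4-bit output.

```python
import math
import array

RATE = 22050                      # output sample rate, Hz

def synth_vowel(f0, formants, amps, seconds):
    """One sine per formant; all phases reset at the start of every
    glottal period (frequency f0), which is what makes the result
    sound voiced rather than like a chord of pure tones."""
    period = int(RATE / f0)       # glottal period, in samples
    out = array.array("f")
    for n in range(int(seconds * RATE)):
        t = (n % period) / RATE   # time since the last phase reset
        s = sum(a * math.sin(2 * math.pi * f * t)
                for f, a in zip(formants, amps))
        out.append(s)
    return out

# An [i]-like vowel: F1 = 300 Hz, F2 = 2300 Hz, voice pitch F0 = 120 Hz
samples = synth_vowel(120, [300, 2300], [1.0, 0.5], 0.5)
print(len(samples))   # 11025 samples = half a second at 22050 Hz
```

Commenting out the `n % period` reset (using `t = n / RATE` instead) reproduces the "Pitch unchecked" mode of my demo: the same formants, but no harmonics of F0.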

The function ProcessFrames in the classic SAM processes voiceless and voiced noisy consonants separately from all the other phones:

  • For voiceless noisy consonants, a sample from the table is played back; its duration does not depend on Speed. The sample of a fricative (such as [s]) lasts 105 ms; that of a plosive ([p] or [t]), 10.4 ms.
  • For all the other phones, the PCM stream is generated as the sum of three waves: sinusoids at the frequencies F1 and F2, and a square wave at the frequency F3. The amount of sound generated per frame is set by Speed: at the default Speed=72, one frame corresponds to about 10.6 ms.
  • For voiced noisy consonants, the synthesized wave alternates with the sample within each glottal period: the beginning of the period is generated as for ordinary voiced phones, and the remainder is taken from the sample table; the duration of the sampled part does not depend on Speed. At the default Pitch=64, the sampled part takes about 1.6 ms out of a glottal period of about 9.5 ms.

For noisy consonants, five sample tables are used: one for the alveolar ([t, s, z]), one for the postalveolar ([ʃ, ʒ]), one for the labial and dental ([p, f, v, θ, ð]), and one each for [ç] and [h]. Samples belonging to the same table differ from each other only in duration and intensity.

In my demo, for the sake of simplicity, sound of equal duration is generated for all frames, and that duration depends only on the Speed parameter: at its default value, one frame corresponds to 10.4 ms of sound. As experiments show, this matches the classic SAM “on average”, although individual sounds in the synthesized phrase may shift forward or backward by a few milliseconds relative to it.
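Under the proportionality assumption my demo makes (frame duration scales linearly with Speed, with 10.4 ms at the default Speed = 72), converting Speed to milliseconds per frame is a one-liner; `frame_ms` is my name for it, not an identifier from SAM.

```python
def frame_ms(speed):
    """Milliseconds of sound per frame, assuming linear scaling
    anchored at the default: Speed = 72 -> 10.4 ms."""
    return speed * 10.4 / 72

print(round(frame_ms(72), 1))    # 10.4  (the SAM/Elf default)
print(round(frame_ms(120), 1))   # 17.3  (the Dalek preset from the table)
```

This is only an approximation of the classic behavior: as described above, the real ProcessFrames spends a Speed-independent amount of time on sampled consonants.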

In conclusion, here are three spectrograms of the welcome phrase, produced by the classic SAM audio generator and by my audio generator with tone synthesis turned on and off:



As you can see, turning off the tone synthesis achieves a compromise between the sound quality and the visibility of the formants on the spectrogram.
