The Advancement of AI Speech Technology: From Mechanical Synthesis to Photorealistic Conversation Replication

  Since the end of 2022, the “AI treasure box” opened by ChatGPT has kept pouring out new magic. Recently, videos such as “Guo Degang performs crosstalk in English” and “Taylor Swift speaks fluent Chinese” have gone viral on social platforms. In these videos, the public figures’ foreign-language speech is accurate in pronunciation, idiomatic in grammar, and well matched in lip movement. Even the tone of voice closely resembles the real person, so the clips can almost be mistaken for the real thing.
  This is the “new gameplay” brought by AIGC: HeyGen, a one-click AI video translation tool from a Chinese company called Shiyun Technology. After logging into its website, free users can upload a video of up to 5 minutes and simply select a language to generate a high-quality foreign-language dubbed video within tens of seconds to a few minutes. At the peak of the Guo Degang video’s popularity, tens of thousands of generation tasks were queued on the website, and the appeal of AI speech synthesis was demonstrated once again.
  The birth of language was one of the most important turning points in human society. The human voice itself is remarkably diverse: no two people’s voices are exactly the same. Add the variety of languages, accents, habits, and emotional expression, and synthesizing human speech becomes no easy task for a machine.
  Speech synthesis can be judged at three levels: intelligible, natural, and expressive (with cadence and emotion). The earliest attempts can be traced back to the 18th and 19th centuries, when scientists mainly used mechanical devices to simulate the human voice. In 1791, for example, the Viennese inventor Wolfgang von Kempelen built a machine that imitated the organs humans need to speak: a pair of bellows simulated the lungs, a vibrating reed served as the vocal cords, and animal skins simulated the throat, tongue, and lips. By controlling the shape of the tube and the position of the tongue and lips, the machine could produce some consonants and vowels, but not complete words.
  The human vocal system is clearly exquisite and complex, and it is difficult to imitate mechanically. In 1939, Bell Labs unveiled the first electronic speech synthesizer, named VODER, which used electronic equipment to simulate the resonances of the voice. It was a fairly complex machine, with 14 piano-like keys, a wrist-controlled joystick, and a foot pedal. Operators needed a long period of training to master it. For example, to pronounce the word “concentration”, they had to produce 13 different sounds in succession, move the wrist joystick up and down 5 times, and press the pedal 3 to 5 times.
  In the 1980s, with the development of integrated circuit technology, more sophisticated electronic synthesizers appeared. A representative example is the series/parallel hybrid formant synthesizer published by the American scientist Dennis Klatt in 1980. Its principle is to model the three stages of human voice production, namely the vibration source, the vocal cords, and the vocal tract, with separate mathematical formulas, and then connect them in series to simulate speech.
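  To make the "formulas connected in series" idea concrete, here is a minimal sketch, not Klatt's actual synthesizer. It assumes a much simpler setup: a pulse train stands in for the voiced source, and three second-order resonators chained one after another stand in for the vocal tract. The sample rate, the formant frequencies and bandwidths for an "a"-like vowel, and all function names are illustrative assumptions.

```python
import math

SAMPLE_RATE = 10000  # Hz; assumed value for this sketch


def impulse_train(f0_hz, n_samples):
    """Crude voiced source: one unit pulse per pitch period."""
    period = round(SAMPLE_RATE / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def resonator(signal, freq_hz, bandwidth_hz):
    """Second-order digital resonator modelling a single formant.

    Difference equation: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
    """
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2 * math.pi * bandwidth_hz * t)
    b = 2 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain to 1 at 0 Hz
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(f0_hz, formants, n_samples):
    """Pass the source through each formant resonator in series."""
    signal = impulse_train(f0_hz, n_samples)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    return signal


# Illustrative (frequency, bandwidth) pairs in Hz for an "a"-like vowel.
VOWEL_A = [(700, 60), (1220, 90), (2600, 150)]
wave = synthesize_vowel(100, VOWEL_A, 1000)  # 0.1 s at a 100 Hz pitch
```

  Changing the formant list changes the vowel, while changing `f0_hz` changes the pitch. That separation of source and filter is what made parametric synthesis compact, and also part of why it sounded mechanical.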
  In the 1990s, researchers found that parametric synthesis had hit a performance ceiling no matter how it was refined, so they turned to a more direct method: waveform concatenation. Taking Chinese as an example, there are more than 1,400 tonal Pinyin syllables. Record dozens of samples of each syllable, then pick the most suitable samples and splice them together to form speech. The method is crude, but it is quite effective.
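  The selection-and-splicing idea can be sketched in a few lines. This is a toy illustration under loud assumptions: sine bursts stand in for recorded syllables, the "most suitable" sample is taken to be the one whose pitch is closest to the previously chosen unit, and selection is greedy (real systems use richer cost functions and search over the whole utterance). The inventory, the names, and the numbers are all hypothetical.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16000  # Hz; assumed value for this sketch


@dataclass
class Unit:
    pitch_hz: float
    samples: list


def fake_recording(pitch_hz, n_samples=800):
    """Stand-in for a recorded syllable: a short sine burst."""
    samples = [math.sin(2 * math.pi * pitch_hz * i / SAMPLE_RATE)
               for i in range(n_samples)]
    return Unit(pitch_hz, samples)


# Hypothetical inventory: each tonal syllable has several recorded takes.
INVENTORY = {
    "ni3": [fake_recording(200), fake_recording(260)],
    "hao3": [fake_recording(300), fake_recording(210)],
}


def crossfade(a, b, overlap):
    """Splice b onto a, linearly blending the last `overlap` samples of a."""
    head, tail = a[:-overlap], a[-overlap:]
    mixed = [tail[i] * (1 - i / overlap) + b[i] * (i / overlap)
             for i in range(overlap)]
    return head + mixed + b[overlap:]


def synthesize(syllables, overlap=80):
    """Greedily pick the take closest in pitch to the previous one, then splice."""
    chosen = []
    for syllable in syllables:
        candidates = INVENTORY[syllable]
        if not chosen:
            unit = candidates[0]  # no context yet: take the first recording
        else:
            prev = chosen[-1]
            unit = min(candidates, key=lambda u: abs(u.pitch_hz - prev.pitch_hz))
        chosen.append(unit)
    wave = chosen[0].samples
    for unit in chosen[1:]:
        wave = crossfade(wave, unit.samples, overlap)
    return chosen, wave


chosen, wave = synthesize(["ni3", "hao3"])
```

  Here the second take of "hao3" is selected because its pitch is nearer to the chosen "ni3", which is the kind of smoothness criterion that keeps spliced speech from sounding jumpy.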
  Since 2014, deep neural networks have also been applied to speech synthesis, greatly improving its quality. From this stage on, AI speech has been not only easy to hear and understand; its mechanical flavor has gradually faded, and it has become more and more natural. Speech synthesis is now developing toward greater realism and interactivity, closer to real spoken language, and is becoming an important way for people to communicate with AI.
  Not long ago, ChatGPT launched a voice feature that is surprisingly realistic. It adjusts its tone to the context, adds emotional inflection, and inserts filler words such as “emmm” in mid-sentence. It picks out key points and varies its speaking speed between words. You can even hear slight breathing, slurring, and minor imperfections in retroflex and nasal sounds.
  This realism also brings new risks. To guard against fraud, for example, many people who receive a text message asking for a transfer or a loan will make a phone call to confirm that the sender really is who they claim to be. That method is clearly no longer reliable. With advances in computing power and algorithms, scammers need only a few seconds of video or audio from a person’s social media to imitate their voice. With real-time face-swapping tools such as deepfakes, even a video call is not necessarily genuine.
  Artificial intelligence can be a force for good, but it can also be turned to bad ends. Until more mature regulation and detection technologies are developed, remember: seeing is not necessarily believing, so always stay vigilant.
