
      VALL-E: Microsoft Unveils Audio AI That Can Simulate Any Voice From 3-Second Prompts

      Microsoft researchers recently announced VALL-E, a new text-to-speech AI model that can accurately mimic a person’s voice when given a three-second audio sample. Once it has learned a specific voice, VALL-E can synthesise audio of that person saying anything, while attempting to retain the speaker’s emotional tone.

      When combined with other generative AI models like GPT-3, VALL-E’s creators believe it could be used for high-quality text-to-speech applications, audio content creation, and speech editing, in which a recording of a person could be edited and altered via a text transcript, making them say something they did not actually say.

      According to Microsoft, VALL-E is primarily a “neural codec language model,” and is based on EnCodec, which Meta revealed in October 2022.

      Rather than manipulating waveforms, as other text-to-speech methods typically do to synthesise speech, VALL-E generates discrete audio codec codes from text and acoustic prompts.

      It processes how a person sounds, breaks the relevant data down into discrete components (referred to as “tokens”) using EnCodec, and then uses training data to match what it “knows” about how that voice might sound if it spoke other phrases beyond the three-second sample.
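      The idea of turning continuous audio into discrete tokens can be illustrated with a toy vector quantizer. The sketch below is purely conceptual: the random codebook and frame dimensions are made-up stand-ins, and EnCodec itself uses a learned convolutional encoder with residual vector quantization rather than this single nearest-neighbour lookup.

```python
import numpy as np

# Toy illustration of audio tokenization: map each continuous "frame"
# of encoded audio to the index of its nearest codebook vector.
# The codebook here is random; a real neural codec learns it from data.

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 1024   # number of distinct discrete tokens
FRAME_DIM = 128        # dimensionality of each encoded audio frame

codebook = rng.normal(size=(CODEBOOK_SIZE, FRAME_DIM))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Return one integer token ID per frame (nearest codebook entry)."""
    # (num_frames, 1, dim) vs (1, codebook_size, dim) -> distances (num_frames, codebook_size)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

frames = rng.normal(size=(10, FRAME_DIM))   # stand-in for encoder output
tokens = quantize(frames)
print(tokens)  # ten discrete token IDs, each in [0, CODEBOOK_SIZE)
```

      A language model like VALL-E is then trained over sequences of such token IDs, much as a text model is trained over word tokens, before a decoder converts predicted tokens back into a waveform.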

      Microsoft trained VALL-E’s speech synthesis capabilities using Meta’s LibriLight audio library, which includes 60,000 hours of English-language speech from over 7,000 speakers, sourced primarily from LibriVox public-domain audiobooks.

      For VALL-E to produce a good result, the voice in the three-second sample should closely resemble a voice in its training data.

      Microsoft offers dozens of audio examples of the AI model in action on the VALL-E example website.

      The “Speaker Prompt” is the three-second audio sample given to VALL-E, which it must try to emulate.


      The “Ground Truth” is a previously recorded version of that same speaker saying a specific phrase, for comparative purposes (something like the “control” in an experiment).

      The “Baseline” sample is generated by a traditional text-to-speech synthesis method, while the “VALL-E” sample is generated by the VALL-E model.

      A block diagram of VALL-E, as shown on the example website by Microsoft researchers.

      To produce these results, the researchers supplied VALL-E with only the three-second “Speaker Prompt” sample and a text string specifying what they wanted the voice to say.

      Some VALL-E results sound computer-generated, but others could be mistaken for human speech, which is the model’s goal.

      Because of VALL-E’s potential to fuel mischief and deception, Microsoft has not made VALL-E’s code available for others to explore.

      The researchers appear to be aware of the potential social harm that this technology may cause.

      The researchers write in the paper’s conclusion:

      “Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”
