Adobe Voco - awesome tech or awful Pandora's box?

Adobe Voco allows users to type out what they want someone's (recorded) voice to say, and it comes out sounding remarkably accurate. What could possibly go wrong?
ForgedReality says...

Blah. Not impressed. The trickery is in what he's not showing. The software is treating the entire audio clip as a smart object, and it's referring to that for waveforms that it can use or manipulate to be close. Notice how he didn't show us the entire audio clip. I guarantee, he says "Jordan" and "three times" later in the audio. It's merely referencing that index where it detected those words before (speech recognition, in itself, an ancient technology, so not all that impressive), and simply copying them into the new clip. You can't just type in anything willy-nilly and expect results this good. If he typed "motherfucker caterpillar penis", it would have been nothing like this example, if it worked at all.

Payback says...

There are only 44 phonemes in English. It probably searches for and grabs all 44 (disjointed English guy says 20 minutes of speech is needed). I think he could get it to say "motherfucker caterpillar penis".
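To make the phoneme argument concrete, here is a toy sketch of a coverage check: given the words someone actually recorded, which sounds would a new phrase still be missing? This is only an illustration of the guess above, not how Voco works. The lexicon, its ARPAbet-style symbols, and the function names are all made up for the example.

```python
# Toy pronunciation lexicon (hypothetical entries, ARPAbet-style symbols).
LEXICON = {
    "jordan": ["JH", "AO", "R", "D", "AH", "N"],
    "three": ["TH", "R", "IY"],
    "times": ["T", "AY", "M", "Z"],
    "dog": ["D", "AO", "G"],
    "tiny": ["T", "AY", "N", "IY"],
}


def phonemes_in(words, lexicon=LEXICON):
    """Collect the set of phonemes that occur in a list of words."""
    found = set()
    for word in words:
        found.update(lexicon[word.lower()])
    return found


def missing_phonemes(recorded_words, target_words, lexicon=LEXICON):
    """Return phonemes the target needs that the recording never contains."""
    return phonemes_in(target_words, lexicon) - phonemes_in(recorded_words, lexicon)
```

With this toy lexicon, a recording of "Jordan three times" already covers every sound in "tiny", even though that word was never spoken. Real speech also depends on how neighbouring sounds blend together, so covering every phoneme is at best a necessary condition.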


ChaosEngine says...

Speech generation is pretty old tech. Modulating it to someone's particular voice is reasonably clever, but I'd like to know how it deals with inflection, emphasis, etc?

Where I see this being useful is in video games. I can see an AI engine generating verbal responses to situations without needing a voice actor to record several million lines ("he's on/in/by/near the roof/car/doorway", etc.).
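A rough sketch of why the line count blows up: even the small template in that example expands to a dozen separate recordings. The word lists come from the comment; the function name and sentence format are just assumptions for illustration.

```python
from itertools import product

# Slots taken from the example line "he's on/in/by/near the roof/car/doorway".
PREPOSITIONS = ["on", "in", "by", "near"]
PLACES = ["roof", "car", "doorway"]


def callouts(prepositions=PREPOSITIONS, places=PLACES):
    """Expand the template into every preposition/place combination."""
    return [
        f"He's {prep} the {place}."
        for prep, place in product(prepositions, places)
    ]
```

Four prepositions times three places gives 12 lines, and each extra slot (who, how many, which direction) multiplies the total again. Not every combination reads naturally ("in the roof"), so a real system would need to filter them.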

Esoog says...

From the full presentation: It takes 20 minutes of prerecorded speech from the person, then you can make them say whatever you want, even if they have never said it before.
