Phoneme Extreme

I forget where I learned this, but a good reminder for current voice cloning: 11Labs isn’t actually CLONING your voice as much as it’s editing a premade voice to sound like yours, which is why some clones don’t seem to come out right.

Again, I forgot where I heard this, but apparently the technical explanation for why voice cloning technology seems to turn all voices into generic Americans with a few very standard British speakers, without any further vocal flourishes or effects, is quite literally because the technology doesn’t actually clone your voice but rather fits the closest premade voice to the samples you provide. As a result, at least for version 1, you’ll find those imperfections. A colleague of mine noticed that, despite a particular voice sounding 95% perfect, there was just a single flourish to that voice that didn’t translate at all. If you weren’t paying attention or swapping between the original voice and the cloned voice fairly quickly, you wouldn’t notice it. But a keener ear picked it up and now neither of us can unhear it.

Furthermore, this also explains why voices with flourishes that don’t radically change the formants and timbre of the voice can be translated, but more radical acts of voice acting won’t be translated at all (such as a very gravelly, raspy voice or a very, very squeaky one all being defaulted to the same “flat” voice).

We also cloned so many voices that we started picking up that some “shared the same voice actor” and only occasionally shifted back into sounding like the cloned voice.

Some of the characters we clone are children; others are very heavily-accented foreigners. The kids almost always sound like either a single kid doing a very slight variation to his voice, or a woman not even trying to sound like a kid. And there is quite literally no possible way to clone a baby’s voice: 11Labs freaks out and turns it into a mechanical demon or super-ethereal elf woman instead. The foreigners either speak straight standard American English or with a very, very standard accent (helped by using foreign words to trigger the accent but sometimes naturally rolled). At the very least, with the addition of the new multilingual tool, we’re able to get just about every voice to speak another language and accent now.

There are roughly enough voices to mask these limitations unless you’re trying to create a massive cast of characters for a serial like we are, so most people probably have never realized this. But once you do, you definitely start to feel the technology’s constraints. And that’s on top of lacking a proper emotion director, voice changer, or temperature setter.

Looking at the voice cloning option, I see that you can professionally and “perfectly” clone a voice, so long as it’s your voice (at least for right now; it’s implied that, in the future, you’ll be able to perfectly clone others’ voices). Personally, the only added utility I see out of that is to add those previously unattainable flourishes, because as mentioned, the voices can be so close that if you’re not listening closely for them, you really couldn’t spot the differences. But a perfect voice cloner is definitely welcome, so long as this technology is limited to fan projects and purely consensual and licensed stuff. Besides, the greater utility will come from both a proper vocal director and a voice changer. A vocal director to add specific emotions and paralinguistic vocalizations would solve pretty much 70% of my current issues, because on top of the instant voice cloning reducing everything to a standard voice, it also struggles to emote. I can type “AHHHHHHHHHHHH!!!!!!!!!!!” all I want, and even if I reroll it 50 times, the best I might get is a half-hearted “ahh…!(bizarro airy noise)”. Literally better to find a stock scream and edit it a bit in Audacity.

The lack of any way to manipulate the temperature of a roll directly is also a bit annoying. I can tell that some rolls have a higher temperature than others; you can often tell that a particular output will be “perfect” or “good enough” not even a few words in. This seems to be a separate variable from the Stability and Clarity sliders, and one we’re not given access to. If I’m wrong, please correct me.
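For anyone unsure what “temperature” means here: in sampling-based generative models in general, it is a number that rescales the model’s scores before an option is picked at random. The sketch below shows only that general idea; whether ElevenLabs uses anything like it internally is an assumption, and the function name and example scores are made up for illustration.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate ways of rendering the same syllable.
scores = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(scores, temperature=0.5)  # favors the top option
hot = softmax_with_temperature(scores, temperature=2.0)   # spreads the odds out
```

A “cold” roll would stick close to the most likely rendering, while a “hot” roll would wander more, which is at least consistent with some rolls sounding wilder than others.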

Thank you for this post, you definitely have fascinating insights into 11Labs and the entire field!

11 definitely has its limits and shortcomings at the moment. It generates interesting results for me but not in every situation. I made 2 different clones of my own voice to talk to each other in ChatGPT-generated scripts on a podcast, and that’s a scenario where the oddities of 11 sometimes work very well to serve the weirdness. But for replacing voice actors, I’m sure you’re right and we’re not there just yet.

May I ask what you do, having cloned so many pro voices?

I work at a small dubbing studio, located in Israel. Been trying to implement cloned replacements for mostly informative VO work in English and German, which might work well. I don’t think local voice actors here in Hebrew have anything to worry about for a long time, though. Niche language.

Either way, even here I’m sure we’d need the talents to legally approve using their voices like that. And with people like Tara Strong (must have been fun to hear so much of her work!) their approval would be a must, I imagine, and I don’t know how many would be fine with it. There are advantages in it for them, though, such as opening up opportunities where they’re not available or even other languages they do not speak, especially when the cloning gets better.

Making a cartoon (right now an audio fan-dramedy until AI animation gets better) using AI. Technically it’s a Loud House fanfiction so there’s good reason to mimic the canon voices. But there are so many characters to voice that yeah we totally ran into the voices starting to sound suspiciously similar to each other. Two entirely separate characters who have very distinct voices in the original show were cloned with ElevenLabs and wound up having almost the exact same cadences and timbres except for a few instances and words where you could kinda hear the way the original voice was supposed to sound.

Because a lot of the characters in this show are young girls/young women (both in the original show and in the fanfic show that has a mostly different cast) we QUICKLY ran into the realization that ElevenLabs only has a VERY limited pool of base voices and the one we kept hearing seemed to be one of two “Generic Young Female” voices that was manipulated to sound either sorta or mostly like the original voices. In fact in at least a couple cases it’s SO obvious that I pretty much have to rewrite the audio-dramedy around these limitations for the time being because some of the plans I had for it literally CAN’T work with characters that lack their distinctive traits or emotiveness or sound too similar to another.

Honestly I think at least some of the problem would be rectified by having at LEAST a few more base voices. It wouldn’t solve ALL the problems but it would at least help expand the range.

“But for replacing voice actors, I’m sure you’re right and we’re not there just yet.”

While I know the technology could easily do that with a “true” cloner and voice director, am I the only one who DOESN’T want voice actors to be replaced? I’d rather they license out their likeness for use in commercial projects. I guess there’s nothing you can do about private use or ultra-anonymous forums like 4chan, but if you’re using a voice actor’s voice for something that is intended to be widely viewed and shared, then I think there should be some regulations in place so that actors and actresses can at least license out their talents. Some way for them to profit as well as keep new talent coming in.

Since I have many friends who are VO actors, I certainly do not want to see the day when they are widely replaced. And I think human performance will remain unique for things that require uniqueness. If not forever then at least for many years to come. I wouldn’t really use the synthetic voices for performance driven dialog that requires precise acting and directing. Not just yet, anyway.

You did say you know the technology could do that and I’m curious to see where it goes. 11 is by far the most natural sounding I have heard as it is.

Either way, if an actor’s voice is cloned, they should most definitely be compensated for every use of it and it should be regulated.

I think you probably took this from one of the many articles about Amazon’s speech cloning technology announced last year, the dead grandma thing. This technology may well not work like that. Personally, I was surprised by just how well it captured my accent. I’m British, but my accent is somewhat unique due to years at boarding school with people from all over the country. Also, depending on how I adjust the sliders, sometimes I can get it to reproduce my stammer, or my weird little nervous laugh, without any sort of prompt. Obviously the better your samples, the better the result; I never give mine anything less than five minutes of clear audio.

Never read that article. Actually, I’ve been following synthetic media since 2017 now. I’ve been saying for years now that voice cloning tech has the potential to completely destroy our trust-based society and very negatively affect voice actors (who are not the gold-pavement-walking superstars a lot of people think they are).

Whether reality actually matches up to the idealized dystopia in my head remains to be seen, because reality is often disappointingly mundane. Humans cannot easily visualize exponential growth, but it’s exponentially harder to visualize logarithmic growth because that requires both accepting exponential growth and exponential slowdown of that growth, and we just aren’t well equipped to handle either. Similarly, we can visualize a lot of nifty, dramatic sci-fi scenarios in our head of things going wrong, but we have a tendency to vastly overlook the raw, basic mundane humanity of day to day life and the fact 99% of people don’t live high-octane lives on the bleeding edge or care about all this stuff like some of us do.

That being said, I’ve definitely gotten 11Labs to reproduce a lot of flourishes, especially by playing with the characters, formatting, spelling, and so on. I’ve probably burned through a million characters at this point. The part about some flourishes and details not carrying over remains true, though.

The guy I’m helping, he gave me a monstrous number of files. Very clear files, that leave absolutely nothing unclear, of a character voiced by Tara Strong. Probably an hour’s worth of files altogether.

Was about as effective as the first attempt with about a dozen 5-second long clips. Didn’t even come close to matching her voice except for a tiny few instances where, if you were generous, it sounded somewhat close.

It’s just a limitation of the tech at this point. I’m fine with it, since it’s just version 1, but it can be frustrating when other characters are replicated almost perfectly. Or in that one case of a character who was replicated literally almost perfectly, except for that one very minor and yet very crucial detail of how the voice worked that really “sold” the character. It’s the little edge cases like that which keep 11Labs from being the overwhelming threat it could be in its current form.

I was so confused by why ElevenLabs nailed some voices bang on but completely collapsed at others almost arbitrarily. Like, Character A has an extremely distinctive voice but ElevenLabs copies it bang on. Character B actually has a more generic-sounding voice, but ElevenLabs doesn’t even attempt to mimic it. Character C has a distinctive voice but with a specific trait (e.g. a rasp, lisp, or accent) and ElevenLabs manages to copy the pitch and timbre nearly perfectly, but not the vocal detail. And so on, until we realized that the program is simply editing the base voices to sound like the uploaded voices, and if the uploaded voices are too different from the base voices, it doesn’t work.
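To make that theory concrete, here is a toy sketch of “pick the nearest base voice, then nudge it toward the upload”. It is purely an illustration of the hypothesis in this thread, not ElevenLabs’ actual method: the base voices, the three features (pitch, brightness, rasp, each scaled 0 to 1), the numbers, and the `max_shift` limit are all invented.

```python
import math

# Hypothetical base voices as (pitch, brightness, rasp), each on a 0..1 scale.
BASE_VOICES = {
    "generic_young_female_1": (0.70, 0.60, 0.10),
    "generic_young_female_2": (0.80, 0.70, 0.10),
    "generic_adult_male": (0.30, 0.40, 0.20),
}

def nearest_base_voice(target, base_voices=BASE_VOICES):
    """Return the name of the base voice closest to the target, and its distance."""
    name = min(base_voices, key=lambda n: math.dist(target, base_voices[n]))
    return name, math.dist(target, base_voices[name])

def blend_toward(base, target, max_shift=0.2):
    """Move each feature toward the target, but by at most max_shift (assumed limit)."""
    return tuple(
        b + max(-max_shift, min(max_shift, t - b)) for b, t in zip(base, target)
    )

# Two made-up characters: one plain, one with a heavy rasp.
plain_character = (0.70, 0.60, 0.10)
raspy_character = (0.72, 0.65, 0.90)

plain_base, _ = nearest_base_voice(plain_character)
raspy_base, raspy_distance = nearest_base_voice(raspy_character)
raspy_clone = blend_toward(BASE_VOICES[raspy_base], raspy_character)
```

In this toy setup both characters land on the same base voice, and the raspy one keeps its pitch and brightness but ends up with only a fraction of its rasp, which is the Character C pattern described above.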

It does seem to really struggle with certain voices, and I’m not sure why. It doesn’t seem to be entirely related to accent either, since I know of a few very stereotypical Americans who haven’t had much success with it. My voice is probably easier to clone than many because I’m fairly soft spoken and conscious of how I sound when talking to people, and I used to do a bit of voice acting on the side, like a lot of the people this tech was probably trained on. I can definitely see this having a very negative effect on voice actors and society as a whole. I remember Lyrebird being all the rage a few years ago before they were bought up by Descript, and that thing sounded terrible and required hours and hours of very precise recordings. I still can’t quite believe how far we’ve come in such a short time.

Eleven can be expressive, but not necessarily what and where you’d predict, expect or want. That’s fine if you’re willing to embrace the oddities, but for kids’ stories that’s not really a plus, hehe. But by lowering stability and varying the similarity, my own clones have been yelling, whispering, almost crying, stuttering, being romantic or sarcastic, a bunch of fun stuff… That doesn’t always fit, but for what I’m working on now, it actually adds a special spice.

I can share what I’ve done if you want, it actually has some poem reading, albeit not quite kids stuff.

Hey, just looking through this sub and took a brief listen to one of your episodes. I think with a little bit of editing you can get a lot of separation between the voices. Try panning one character a bit left, another right, leave one centered. You could also experiment with EQ. Keep more low end in one, more mids in another, etc. If you do this consistently (don’t mix and match) that should help distinguish the voices from one another. You could even drop some reverb on a voice or two to add depth and even more separation. Have fun and good luck!
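If you’d rather script the panning than do it by hand in an editor, here is a minimal sketch of constant-power panning in plain Python. It assumes each line of dialogue is already loaded as a list of mono samples (floats); the character names and pan values are placeholders, not anything from the actual project.

```python
import math

def pan_gains(pan):
    """Constant-power gains for pan in [-1.0 (hard left), 1.0 (hard right)]."""
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be between -1.0 and 1.0")
    angle = (pan + 1.0) * math.pi / 4.0  # maps -1..1 onto 0..pi/2
    return math.cos(angle), math.sin(angle)

def pan_mono(samples, pan):
    """Turn mono samples into (left, right) pairs at the given pan position."""
    left_gain, right_gain = pan_gains(pan)
    return [(s * left_gain, s * right_gain) for s in samples]

# Placeholder characters: one fixed position each, reused in every scene.
CHARACTER_PAN = {"character_a": -0.4, "character_b": 0.0, "character_c": 0.4}

def place_character(name, samples):
    """Pan a character's line to that character's fixed stereo position."""
    return pan_mono(samples, CHARACTER_PAN[name])

stereo_line = place_character("character_a", [0.0, 0.5, -0.5])
```

Keeping the positions in one table is what makes the “don’t mix and match” part easy: every line for a character goes through the same pan value. EQ and reverb would need a proper audio library and are left out here.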
