Imagine the possibilities opened up if you could create an AI voice from a minority language that was previously unwritten! I’m planning to update this post as I learn more.

At the moment (March 2023), this is beginning to look like a possibility in the near to medium future.

What can be done already?

1. It’s possible to clone a voice with a short piece of audio recording.
2. It’s possible to generate any output with that voice given the (written) input language.

What steps would be required to generate a voice from a previously unwritten language, without having an extensive training corpus?

I still need to investigate the details, but the process would likely include:

1. Codifying the language and creating a written script, perhaps using IPA at first.
2. Recording a voice that can produce all the required phonemes of the language.
3. Writing text in the minority language in a suitable script that can be used as the input for the voice generator.
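Step 1 above could start as something as simple as a lookup table. As a toy sketch (the phoneme inventory and spelling rules below are entirely hypothetical, invented only for illustration), here is how IPA phonemes might map to a provisional Latin working orthography:

```python
# A toy sketch of codifying an unwritten language: a provisional mapping
# from IPA phonemes to a working Latin orthography. The inventory and
# spelling choices here are hypothetical, purely for illustration.
IPA_TO_ORTHOGRAPHY = {
    "tʃ": "ch",  # voiceless postalveolar affricate
    "ʃ": "sh",   # voiceless postalveolar fricative
    "ŋ": "ng",   # velar nasal
    "ɛ": "e",
    "a": "a",
    "k": "k",
}

def transcribe(phonemes):
    """Convert a list of IPA phonemes into the provisional orthography."""
    missing = [p for p in phonemes if p not in IPA_TO_ORTHOGRAPHY]
    if missing:
        # Flag any phoneme we have not yet assigned a spelling to.
        raise ValueError(f"No orthography rule for: {missing}")
    return "".join(IPA_TO_ORTHOGRAPHY[p] for p in phonemes)

print(transcribe(["tʃ", "a", "ŋ", "k", "ɛ"]))  # → changke
```

In practice the mapping would of course need linguists and native speakers to settle the phoneme inventory first, but text produced this way (step 3) is exactly the kind of written input a voice generator needs.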


Current AI voice generators
I scanned these current voice generator websites, but none of them offers anything like this…
murf.ai, play.ht, www.resemble.ai, typecast.ai, uberduck.ai, beta.elevenlabs.io

Murf.ai does have some information on diversity in text-to-speech, but it's about accent diversity. Still interesting, though.

Meta’s Massively Multilingual Speech Project

Meta’s MMS model is probably the best there is so far. They say:

“Producing good-quality machine learning models for these tasks requires large amounts of labeled data — in this case, many thousands of hours of audio, along with transcriptions. For most languages, this data simply does not exist. For example, existing speech recognition models only cover approximately 100 languages — a fraction of the 7,000+ known languages spoken on the planet. In the Massively Multilingual Speech (MMS) project, we overcome some of these challenges by combining wav2vec 2.0, our pioneering work in self-supervised learning, and a new dataset that provides labeled data for over 1,100 languages and unlabeled data for nearly 4,000 languages. Some of these, such as the Tatuyo language, have only a few hundred speakers, and for most of these languages, no prior speech technology exists.”

Read the full article here; crucially, the model is now open source and available to download. I’m hoping to experiment with it in the future.
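Since the model weights are downloadable, a first experiment could go through the Hugging Face `transformers` port of MMS text-to-speech. This is only a minimal sketch under stated assumptions: it assumes `transformers`, `torch`, and `scipy` are installed, and that the language you want (here English, code `eng`) is among the ~1,100 codes MMS ships TTS checkpoints for.

```python
# Minimal sketch: synthesising speech with an MMS TTS checkpoint via the
# Hugging Face transformers port. Assumes transformers/torch/scipy are
# installed; "eng" is the ISO 639-3 code for the English checkpoint.

def mms_tts_checkpoint(lang_code):
    """Build the Hugging Face model id for an MMS TTS language checkpoint."""
    return f"facebook/mms-tts-{lang_code}"

if __name__ == "__main__":
    import torch
    from transformers import VitsModel, AutoTokenizer

    model_id = mms_tts_checkpoint("eng")  # swap in another language code
    model = VitsModel.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer("Hello from a minority-language pipeline.",
                       return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (1, num_samples)

    # Save as a mono WAV at the model's native sampling rate.
    import scipy.io.wavfile
    scipy.io.wavfile.write(
        "mms_output.wav",
        rate=model.config.sampling_rate,
        data=waveform.squeeze().numpy(),
    )
```

A genuinely unwritten language wouldn’t have a checkpoint yet, of course; the interesting question is whether MMS could be fine-tuned on a new language once steps 1–3 above produced a small transcribed corpus.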

Apart from this, the only other thing I can find at the moment is MIT’s older attempt at the same thing.