Teaching with Technology

Voice Cloning for Education

By John Orlando
October 28, 2024

Simple and inexpensive software has made it easy for students and teachers to create video and audio for learning content and projects. But this still leaves many students and teachers struggling with the stylistic elements that make multimedia interesting and engaging. In particular, it is hard to avoid falling into a monotone when recording audio, especially when reading from a script. Speaking extemporaneously is more natural, but it’s hard to do without making mistakes and needing to rerecord the session multiple times.

One solution is voice cloning. Modern AI software allows the user to feed samples of their voice into the app and then provide a written script that the app uses to create an audio recording in the user’s voice. While early text-to-audio programs produced robotic-sounding outputs, modern software can reproduce the user’s voice with remarkable fidelity. Moreover, these systems can generate the voice inflections and emotions that help sustain listener interest.

Not only can voice cloning be used to produce better-quality audio in less time than manual recording, but today’s apps also come with a library of voices that users can choose from. A literature professor can draw upon a British voice to create a podcast in the guise of King Lear describing his relationship with his daughters. Students can be assigned to create a hypothetical speech for a historical figure from recordings of that figure’s speeches, such as Winston Churchill talking about his plans for postwar Britain. Furthermore, students who are self-conscious due to speech impediments can create audio free of the issue, helping advance the cause of equity in education.

Creating a voice clone

While voice cloning sounds difficult, it is actually remarkably easy and well within the reach of users with limited technical skills. ElevenLabs, probably the best-known and most respected tool on the market, serves as a good example of how these systems work.
It allows users to choose from a library of over 300 voices and can generate output in 29 different languages. The free version allows users to produce up to 10 minutes of audio per month in voices chosen from the library. The Starter version ($5 per month) allows users to clone their own voice and generate around 30 minutes of audio per month, while the Pro version ($99 per year) generates higher-quality output and allows for up to 10 hours of audio per month.

After registering for an account, the user comes to the homepage, where they can start by surveying the available voice profiles. Each profile comes with a name and description, such as Grandpa Spuds Oxley, a “friendly grandpa who knows how to enthrall his audience with tall tales and fun adventures,” and Mila, a “female, 20s–30s, opinionated and confident, yet soft and empathetic, with a relaxed creak at times. Native German.” Each also comes with a voice sample.

A Starter subscription or higher allows the user to create their own voice profile. In fact, they can create up to 10 different profiles by uploading up to 25 voice samples for each profile. Then the user goes to the Text to Speech function, where they copy and paste a passage of up to 5,000 characters into the text field. They can also upload or record an audio sample of the script and have the system convert it into the voice profile of their choosing. Once they have uploaded the text or audio, the user picks a voice profile from the library or one of their own voice profiles and clicks Generate Speech. The audio is generated within seconds, and the user can then download it.

The user can listen to the audio and make adjustments by fiddling with the settings. For instance, the Stability setting does two things. First, it introduces some variation into the output even when the same input and settings are used multiple times. This allows the user to compare the outputs and choose the one they like best.
Second, it governs how much emotion comes through: more stability creates less variation between outputs, while less stability creates more emotion. The Style Exaggeration setting amplifies the style of the original speaker. Both settings are on sliders, allowing the user to choose anything from 0 to 100 percent. The user plays with the settings and keeps regenerating the audio until they like what they hear, and they can save the settings for future projects.

With higher-level subscriptions, the user can choose different settings and voices for different parts of the same recording. This would be helpful if someone is creating a dialogue between two or more people.

Other tools

There are multiple voice cloning tools on the market, and a full review of them all is beyond the scope of this article, but there are some relevant differences to consider when choosing a product. Briana Brownell found that Descript produced more robotic audio than other systems but that the process of creating an output was fast. Still, there was a real learning curve to get to the point where the process came naturally. Brownell liked how ElevenLabs allowed the user to play with the output to create more expressive audio, but she noted that this capacity can be taken too far, to the point where the system inserts “ums,” pauses, breathing, and laughter into the output. Meanwhile, Play.ht allowed the user to generate the audio in multiple parts to work on individually and stitch together at the end, but it also produced more errors than the others.

It should be noted that all these systems have teams of programmers constantly improving them, and so today’s problems may be tomorrow’s memories. Plus, they are constantly adding features. It is better to treat these reviews as a way to identify features to look for in a voice cloning system rather than the final word on how well they work.
Users should weigh price, quality, output control, and ease of use to find the system that best combines the features most important to them. Once found, a voice cloning system can help create better multimedia in less time.
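
For readers comfortable with a little scripting, the 5,000-character limit on the Text to Speech field mentioned earlier can be worked around by splitting a longer script into pieces and pasting each piece in separately. The following is only a minimal Python sketch of that idea and is not part of ElevenLabs or any other product. The function name split_script and the rough rule for where a sentence ends are assumptions made for illustration.

```python
import re

MAX_CHARS = 5000  # per-passage limit described in the article


def split_script(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    Hypothetical helper for illustration; the sentence rule is deliberately rough.
    """
    # Treat ., !, or ? followed by whitespace as the end of a sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut into hard slices.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Add the sentence to the current chunk if it still fits.
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Calling split_script on the full text of a lecture script returns a list of passages, each short enough to paste into the text field on its own. Breaking at sentence ends rather than at a fixed character count keeps each generated clip from starting or stopping mid-sentence, which should make the clips easier to stitch together afterward.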