Big Voice: The Voice Industry Reality Check

Steve Sammartino explores how new technologies can process human voices, despite their complexity, and produce a new Alexa, reconstruct facial images from audio and clone the words and sounds of a lost loved one.

By Steve Sammartino · 12 Jul 2022


Language is humanity’s killer app. We are the only species that can communicate complex ideas, document knowledge in written and digital form, pass it down through the generations, and add informational layers, building on what previous generations uncovered. It’s often said that our ability to communicate in such a manner has put us atop the food chain. Big Tech, as we’ve come to know it, is basically in the language business.

My article last week explored the battle for our biometric data - its weaponisation for security versus marketing purposes. It’s clear that biometric data provides a unique insight into every person. What is less known is what can be gleaned from it, especially from our voices. The insights we can garner from biometric surveillance run far deeper than we might imagine.

The Verbal Reveal

Although our hearing is inferior to that of many other animals, we humans can deduce a surprising amount of information from listening to each other speak. We can make pretty good guesses at someone’s gender, age and perhaps even level of education.

However, machines can infer far more than a human ever could from a voice. In developing the branch of AI that focuses on speech, much has been achieved, even in the early days of understanding sentences and how to respond. Known as Natural Language Processing (NLP), this field of AI can discern with significant accuracy someone’s age, gender, ethnicity, education levels, socioeconomic status, and even uncover health conditions.

What we tend to forget is that the voice recognition system is doing more than just listening to words. It is also cross-referencing what has been said with other information that can be extracted from the device delivering the voice data.

In essence, NLP does what humans do. When we meet with someone, we look at our surrounds to provide clues on how to navigate the conversation. We look at what people are wearing and the venue we are in – everything visual and contextual that might support the verbal interaction. Conversation that comes with context nearly always has better outcomes.

This is part of the reason telephone customer service lines are so often a poor end-user experience. When an NLP engine interacts with enough voices for long enough, it can create new forms of pattern recognition. Sourced from vast data sets incorporating 65 per cent of the world’s population, NLP can match data points to voice prints far beyond what any human ever could. What NLP may uncover is almost limitless.

The Mirror of the Brain

Using our voices is not as simple as it seems. When we speak, a complex process is launched, activating both physical and mental faculties. It involves the lungs, voice box, throat, nose, mouth, lips, sinuses and jaw. Speaking activates more than 100 muscles every time we do it. Additionally, your voice is a reflection of the brain - what it knows, what it believes and how it responds to audio stimuli. As the MIT Media Lab voice researcher Rébecca Kleinberger says: “It is very much the brain.”

Voice to Face

A single voice can now reveal unfathomable volumes of information to an AI engine. Researchers have even been able to generate images of faces based on information ascertained from individuals’ voice data. A 2019 Cornell University research study was able to reconstruct facial images of people using short audio recordings of their speech. The facial image reconstruction was produced through training a deep neural network, utilising millions of YouTube videos of people conversing naturally and without a script.

The network training methodology looked for correlations and co-occurrences of faces and voices. It matched the probability of voice patterns with pixel patterns to guess what an unknown person’s face might look like. The results are quite astounding, given how nascent this technology is.

Everlasting Conversations

At Amazon’s recent re:MARS conference for Machine Learning, Automation, Robotics and Space, a new Alexa feature was unveiled to great excitement. Amazon’s AI assistant can now impersonate the voices of users’ dead relatives. The demonstration featured a child asking his deceased grandmother to read out a bedtime story, whereupon her voice obligingly poured out of a nearby speaker. As you can imagine, this feature was met with both admiration and outrage. Many called it plain creepy.

While we might think that it would take a deep longitudinal data set for Alexa to learn how to mimic a real human voice, Amazon claims that its AI system can learn to imitate a voice from less than a minute of recorded audio. Given how prevalent recordings of people are now on both video and audio, it wouldn’t be difficult to create a voice clone of a loved one. Or pretty much anyone who has ever been on the internet.

While Amazon hasn’t given an indication whether this feature will be rolled out, the technology will surely leak across the web, as it always does. The website Fakeyou.com is a veritable “Choose Your Own Adventure” of actors, celebrities, singers, cartoon characters and public figures whose voices can be manipulated to say whatever you desire. It's possible to upload your own voice and even clone that.

The opportunities bubbling out of this are intriguing. When we have to make that dreaded phone call to the bank or tax department, we may be able to choose Scarlett Johansson or Ryan Gosling as our customer service operator to make solving the issue a little more pleasant. If we combine already existing music AI systems with voice cloning to create new Beatles tunes, John Lennon may sing again. Voice actors for long-running animations like The Simpsons may well be handed a pink slip. Radio hosts from years gone by could join the podcast bandwagon to cater to more senior audiences.

The Voice Cloning Industry

Voice is the latest in a long line of emergent cloning marketplaces. This creates new value, revenue sources, investment and, of course, potential legal conflict. The voice and speech recognition industry (I’m calling it ‘Big Voice’ -- you heard it here first) is estimated to exceed $US20 billion by 2026.

If we thought it was already difficult to distinguish between what is real and what is fake on the internet, it’s only going to get a lot more challenging. We’ve entered a world where everything can be replicated, from social interactions and virtual worlds to voices and people. In many ways, it is a new form of manufacturing where something can be made once and sold many times. It may also be one of the next technology categories we invest in.


Frequently Asked Questions about this Article…

What is the significance of voice recognition technology in the investment world?

Voice recognition technology, often referred to as 'Big Voice,' is becoming a significant player in the investment world due to its potential to create new value and revenue sources. With the industry expected to exceed $20 billion by 2026, it presents opportunities for investors to tap into a rapidly growing market that intersects with AI and biometric data.

How does Natural Language Processing (NLP) enhance voice recognition capabilities?

Natural Language Processing (NLP) enhances voice recognition by allowing machines to infer detailed information from a person's voice, such as age, gender, ethnicity, and even health conditions. This advanced AI capability goes beyond human abilities, making it a powerful tool for data analysis and pattern recognition.

What are the ethical considerations surrounding voice cloning technology?

Voice cloning technology raises ethical concerns, particularly regarding privacy and consent. The ability to replicate voices, including those of deceased individuals, can lead to potential misuse and legal conflicts. It's crucial for investors and companies to consider these implications as the technology becomes more prevalent.

How might voice cloning impact industries like entertainment and customer service?

Voice cloning could revolutionize industries such as entertainment and customer service by enabling the creation of new content and enhancing user experiences. For instance, it could allow for the continuation of iconic voices in animation or provide personalized customer service experiences using celebrity voices, potentially increasing engagement and satisfaction.

What are the potential risks of investing in the voice and speech recognition industry?

Investing in the voice and speech recognition industry comes with risks, including technological challenges, ethical concerns, and potential legal issues. As the industry grows, distinguishing between real and fake voices may become more difficult, leading to trust and security challenges that investors need to navigate carefully.