Google has said that initially, Duplex will work on three tasks:
• Asking a business about its holiday hours
• Booking a table at a restaurant
• Making a hair appointment
Google plans to start with what it calls a very limited trial of the first of these three before expanding to the others. Making calls to other kinds of businesses could follow, though each task would require separate training, using real people to coach the system until it is ready to manage calls on its own.
The Google Duplex system is capable of carrying out sophisticated conversations and it completes the majority of its tasks fully autonomously, without human involvement. The system has a self-monitoring capability, which allows it to recognize the tasks it cannot complete autonomously (e.g., scheduling an unusually complex appointment). In these cases, it signals to a human operator who can complete the task.
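The self-monitoring step can be pictured as a confidence check on each task. The sketch below is purely illustrative; the function names and the 0.8 cutoff are assumptions, not Duplex internals:

```python
# Illustrative sketch: complete a task autonomously when the system is
# confident, otherwise signal a human operator to finish it.
# The names and the 0.8 threshold are hypothetical, not Duplex internals.

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff for autonomous handling

def handle_task(task_name, confidence, human_operator):
    """Complete the task autonomously, or hand it to a human operator."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"autonomous: {task_name}"
    # Self-monitoring detected a task it cannot finish on its own.
    return human_operator(task_name)

def operator(task_name):
    """Stand-in for the human operator who completes escalated tasks."""
    return f"operator completed: {task_name}"

print(handle_task("book table for 4", 0.95, operator))
print(handle_task("schedule multi-stylist appointment", 0.4, operator))
```

The key property is that the escalation decision is made by the system itself, before the call fails, rather than by a user noticing something went wrong.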
How Does It Work?
To train the system in a new domain, Google uses real-time supervised training. This is comparable to training practices in many disciplines, where an instructor supervises a student on the job, providing guidance as needed and making sure the task is performed at the instructor's level of quality.
In the Duplex system, experienced operators act as the instructors. By monitoring the system as it makes phone calls in a new domain, they can affect the behavior of the system in real-time as needed. This continues until the system performs at the desired quality level, at which point the supervision stops and the system can make calls autonomously.
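That handoff from supervised to autonomous operation can be sketched as a simple quality gate. Everything here is an assumption for illustration: the per-call quality scores, the 0.95 target, and the three-call window are invented, not Google's actual criteria:

```python
# Illustrative sketch: operators supervise calls in a new domain until
# the system's recent call quality reaches a target, after which
# supervision stops. Scores, target, and window size are hypothetical.

def calls_until_autonomous(scores, target=0.95, window=3):
    """Return how many supervised calls occur before the average
    quality over the last `window` calls first reaches `target`.
    `scores` is a list of per-call quality scores in [0, 1]."""
    for i in range(window, len(scores) + 1):
        if sum(scores[i - window:i]) / window >= target:
            return i  # supervision can stop after this many calls
    return len(scores)  # target never reached: keep supervising

# Quality improves as operators coach the system call by call:
print(calls_until_autonomous([0.6, 0.8, 0.95, 0.99, 1.0, 1.0]))
```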
At the core of Duplex is a recurrent neural network (RNN), built using TensorFlow Extended (TFX). To achieve its high precision, Google trained Duplex's RNN on a corpus of anonymized phone conversation data.
The network uses the output of Google’s automatic speech recognition (ASR) technology, as well as features from the audio, the history of the conversation, the parameters of the conversation (e.g. the desired service for an appointment, or the current time of day) and more. Google trained the understanding model separately for each task but leveraged the shared corpus across tasks.
Finally, Google used hyperparameter optimization from TFX to further improve the model. Incoming sound is processed through the ASR system, producing text that is analyzed together with context data and other inputs to produce a response text, which is then read aloud by the text-to-speech (TTS) system.
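The pipeline just described (audio → ASR text → understanding with context → response text → TTS) can be sketched as a chain of functions. Every function below is a stand-in stub under assumed names; the real Duplex components are not public:

```python
# Minimal sketch of the described pipeline:
# audio -> ASR text -> understanding (text + context) -> response -> TTS.
# All functions are hypothetical stubs, not Duplex's actual components.

def asr(audio):
    """Stub ASR: pretend the audio decodes to its transcript."""
    return audio["transcript"]

def understand(text, context):
    """Stub understanding model: combine the ASR text with conversation
    parameters (e.g. desired service, desired time) into a response."""
    if "when" in text.lower():
        return f"Could we do {context['desired_time']}?"
    return "I'd like to book an appointment, please."

def tts(text):
    """Stub TTS: mark the response text as synthesized speech."""
    return f"<speech>{text}</speech>"

def duplex_turn(audio, context):
    """One conversational turn through the whole pipeline."""
    text = asr(audio)
    reply = understand(text, context)
    return tts(reply)

context = {"desired_service": "haircut", "desired_time": "Friday at 3pm"}
print(duplex_turn({"transcript": "When would you like to come in?"}, context))
# -> <speech>Could we do Friday at 3pm?</speech>
```

The point of the sketch is the data flow: each stage consumes the previous stage's output plus shared context, which is why the understanding model can be trained per task while reusing the shared corpus.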
How Does It Sound So Natural?
Google uses a combination of a concatenative TTS engine and a synthesis TTS engine (using Tacotron and WaveNet) to control intonation depending on the circumstance.
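One way to picture choosing between the two engines per utterance; the selection criteria here are an invented assumption, since Google has not published how Duplex decides:

```python
# Hypothetical selector between the two TTS engines described above.
# The mapping from utterance kind to engine is an assumption, not
# Duplex's actual (unpublished) logic.

def choose_engine(utterance_kind):
    """Pick a TTS engine for the given kind of utterance."""
    # Assumed split: short fillers/acknowledgements use prerecorded
    # concatenative clips; full expressive sentences use neural
    # synthesis (Tacotron/WaveNet-style) for controllable intonation.
    if utterance_kind in {"filler", "acknowledgement"}:
        return "concatenative"
    return "synthesis"

print(choose_engine("filler"))         # -> concatenative
print(choose_engine("full_sentence"))  # -> synthesis
```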
One key factor in managing people's expectations is latency. For example, after saying something simple, e.g., "hello?", people expect an instant response and are more sensitive to latency. When low latency is required, Google uses faster, low-confidence models (e.g., for speech recognition or endpointing).
In extreme cases, the system doesn't even wait for the RNN, instead using faster approximations (usually coupled with more hesitant responses, as a person would if they didn't fully understand their counterpart). This allows it to respond with less than 100 ms of latency in these situations.
Interestingly, in some situations it was actually helpful to introduce more latency to make the conversation feel natural, for example when replying to a really complex sentence.
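The latency policy described in the last three paragraphs can be sketched as a simple dispatcher. The word-count thresholds and pause values below are invented for illustration; Duplex's actual heuristics are not public:

```python
# Illustrative latency policy (all thresholds assumed): use a fast
# approximation for short utterances that demand an instant reply,
# the full model otherwise, and deliberately pause before answering
# a very complex sentence so the exchange feels natural.

def plan_response(utterance):
    """Return which model to use and how long to pause, based on a
    crude complexity proxy (word count). Values are hypothetical."""
    n_words = len(utterance.split())
    if n_words <= 2:            # e.g. "hello?" -> instant fast path
        return {"model": "fast_approximation", "pause_ms": 0}
    if n_words >= 15:           # very complex sentence -> add latency
        return {"model": "full_rnn", "pause_ms": 600}
    return {"model": "full_rnn", "pause_ms": 100}

print(plan_response("hello?"))
```

A real system would gate on model confidence and acoustic endpointing rather than word count, but the shape of the trade-off is the same: spend latency only where listeners expect it.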
While the feature is still in testing, Google has reportedly been fielding offers from businesses that want to put Duplex to work in call centers. The next customer service rep you speak to could be a machine. Some people are horrified by this future but nevertheless, it is coming.
While it brings certain challenges around what it means to be human, what human-AI interactions should look like, and whether machines should pretend to be human in order to get work done as agents for their creators, it also brings great opportunity.