Designing For Voice

8 min read · Oct 15, 2020

The growth of digital assistants has led to a surge in organizations working on voice user interfaces as a form of interaction. There’s been a particular focus on making the interaction natural whilst achieving the user’s objective. Although best practice around user experience still applies, dealing with the medium of voice has some unique challenges. Our experimentation led us on a journey of discovery around how to plan the dialogue, consider context, build adaptability, and efficiently execute intent.

Many years ago, I worked for a technology division of the BBC that eventually spun off to become a standalone digital agency. Their offices were such a contrast to the bland corporate workspaces I had worked in previously. Meeting rooms were converted into green screen studios, and large playout control rooms delivered 60+ channels to live television. Amongst all the visual pizazz was a team responsible for audio describing what was happening on a television show. They used IBM ViaVoice to caption and explain what was happening on live TV. It was my first encounter with speech-to-text in action, solving a real-world problem. Yet for all its technological sophistication, I was amazed to observe that the speakers had adapted their voices to sound robotic to improve accuracy.

A playout control room where voice services play a vital role.

In contrast, I observed another voice service in action in the playout control room of BBC One. They had a person sitting quietly off to the side who, between programmes, would put their coffee and newspaper down, lean into a microphone and in a silky-smooth voice say, “Coming up next on BBC One….” I remember wondering why they didn’t have a pre-recorded collection of phrases for that simple task. Why couldn’t it be digitized? In reality, technology wasn’t quite up to the job back then, nor could it have synthesized a smooth and silky digital voice.

Almost a decade later, I found myself working with the latest in natural language processing. No robotic voice required, just natural speech. We were on a journey of discovery, attempting to understand the nuances of interaction between humans and digital assistants. We set up a series of experiments in which users carried out tasks by interacting with our digital assistant.

Keeping A Narrow Scope

The BBC announcers were seasoned professionals who understood the context of what was aired and what they had to introduce next. A reasonably narrow scope. Today we’re expecting digital assistants to do a wide range of things based on natural communication. As humans, we underestimate how complex this is. Dialogue structure, context, tone, body language, eye contact and pauses are just some of what our brains use in everyday communication. Deep learning is providing some progress, but narrowing the purpose and outcomes provides a more realistic strategy. Narrowing the scope of our experiments to focus on a few distinct tasks allowed us to reduce the variables, the dialogue, and therefore the potential for errors.

Increasing scope can increase complexity exponentially for voice interfaces.

“Happy Path” Dialogue

Planning the interaction before cutting code was key to reducing rework later in the experiments. I like Amazon’s use of the term “happy path” for a dialogue sketch of the journey from intent to outcome. We used this approach to give us a baseline to build from. We then drew a lightweight flow diagram of the dialogue, taking context into consideration. This progression allowed us to determine whether we had a narrow enough scope and could deal with all the different variables to achieve the outcome. The creation of the interaction model followed, with detailed intents and utterances (human phrases) that would define the variables throughout the user journey.

Start with the “happy path” and build up from there.
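To make the idea of an interaction model more concrete, here is a minimal sketch of intents, sample utterances, and the variables (slots) they capture. It is not the model we used in the experiments; the intent names, the timer domain, and the `match` helper are all hypothetical, and real platforms use trained language models rather than the simple pattern matching shown here.

```python
import re
from dataclasses import dataclass


@dataclass
class Intent:
    name: str
    utterances: list  # sample phrases; {slot} marks a variable to capture

    def patterns(self):
        compiled = []
        for utterance in self.utterances:
            # re.split with a capture group alternates literal text and slot names.
            parts = re.split(r"\{(\w+)\}", utterance)
            regex = "".join(
                f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
                for i, part in enumerate(parts)
            )
            compiled.append(re.compile(regex, re.IGNORECASE))
        return compiled


# Hypothetical, deliberately narrow scope: two intents only.
INTENTS = [
    Intent("SetTimerIntent", [
        "set a timer for {minutes} minutes",
        "set a {minutes} minute timer",
    ]),
    Intent("CancelTimerIntent", ["cancel the timer", "stop the timer"]),
]


def match(text, intents=INTENTS):
    """Return (intent name, slots) for the first utterance that fits, else (None, {})."""
    cleaned = text.strip().rstrip(".!?")
    for intent in intents:
        for pattern in intent.patterns():
            found = pattern.fullmatch(cleaned)
            if found:
                return intent.name, found.groupdict()
    return None, {}


print(match("Set a timer for 10 minutes"))
```

Listing the utterances this way is a quick check on scope: if the list for a single intent keeps growing, the intent is probably doing too much.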

Adaptive Dialogue

The initial dialogue in the experiment was designed with the intent of making users feel at ease while conversing with our digital assistant. It was friendly and helpful, but wordy. As participants were trying to get a task done, we had to balance providing enough dialogue to remain useful and guide the journey against getting the job done quickly. This required more consideration of the proficiency of the user, the context, and the minimum information needed to achieve the task. After observing the frustration the initial attempts caused, we went back to basics with our communication algorithms, changing the approach to provide only the minimum dialogue required to achieve the task. The result was a rapid reduction in errors, and participants were less likely to interrupt the assistant. To balance the shorter responses, we relegated the pleasantries to before and after the core journeys. For example, she might start the interaction with, “Good morning, how can I help you today?” Likewise, she might end the exchange with something like, “All done. If you need anything further, just ask.”

Example of the experience flow we used to develop the dialogue.
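One way to picture this adaptive approach is the sketch below. It is an illustration rather than our actual algorithm: the prompt wording for the core step, the `Session` fields, and the threshold of three completed tasks are assumptions made for the example. Only the greeting and closing lines come from the dialogue described above.

```python
from dataclasses import dataclass

# Pleasantries live outside the core journey.
GREETING = "Good morning, how can I help you today?"
CLOSING = "All done. If you need anything further, just ask."

# Hypothetical prompts for one step, in a guided and a minimal form.
PROMPTS = {
    "ask_duration": {
        "guided": "Sure, I can help with that. How many minutes should the timer run for?",
        "minimal": "How many minutes?",
    },
}


@dataclass
class Session:
    completed_tasks: int = 0  # rough proxy for user proficiency (assumption)


def verbosity(session, proficient_after=3):
    """Pick the shortest dialogue the user is likely to cope with."""
    return "minimal" if session.completed_tasks >= proficient_after else "guided"


def respond(step, session):
    return PROMPTS[step][verbosity(session)]


def wrap_journey(core_responses):
    """Keep the core journey terse and add pleasantries only at the edges."""
    return [GREETING, *core_responses, CLOSING]


print(wrap_journey([respond("ask_duration", Session(completed_tasks=5))]))
```

Keeping the two forms of each prompt side by side also makes it easy to review whether the minimal version still carries everything the user needs.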

Conversational Etiquette

A conversation is not as simple as turns, where “you speak, then I speak.” The reality is that we sometimes speak over each other or respond at exactly the right time. The right time is assessed by our many sensors, which consider cues like context, the conversation structure, and body language. In our experiments, we observed our participants often speaking over the digital assistant, either because they had made an error, she had made an error, or her response was too wordy. We realized that reducing dialogue reduced some of the frustration, but mistakes were always going to happen. To combat this, we introduced interruption and navigational cues. Participants could say her name, “Andrea”, or simply “stop” to make her stop talking and listen for the next response. Likewise, we introduced “go back”, allowing users to step back in their journey, and “start over” to begin the task again. These navigational utterances returned a level of control to the users and were highly effective.
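The interruption and navigational cues can be sketched as a small state machine over the steps of a journey. This is a simplified illustration, not our implementation: the `Journey` class, the step names, and the rule that any other utterance advances the journey are assumptions for the example.

```python
class Journey:
    """Tracks where the user is in a task and reacts to control phrases."""

    INTERRUPT = {"andrea", "stop"}

    def __init__(self, steps):
        self.steps = list(steps)
        self.index = 0
        self.speaking = True

    def handle(self, utterance):
        text = utterance.strip().lower()
        if text in self.INTERRUPT:
            # Stop talking and listen; the position in the journey is unchanged.
            self.speaking = False
            return "listening"
        self.speaking = True
        if text == "go back":
            self.index = max(0, self.index - 1)
        elif text == "start over":
            self.index = 0
        else:
            # Assumption: anything else answers the current step and moves on.
            self.index = min(self.index + 1, len(self.steps) - 1)
        return self.steps[self.index]


journey = Journey(["ask_task", "ask_duration", "confirm"])
print(journey.handle("set a timer"))  # moves on to the next step
print(journey.handle("go back"))      # steps back one
```

Checking for the control phrases before anything else means they work at every point in the journey, which is what gives the user a dependable way out of a mistake.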

Confirmation Responses

As we attempt to reduce dialogue and efficiently execute a task, we often overlook the value of confirmation. When I say, “Alexa, set a 90-minute oven timer”, it’s crucial that she confirms back to me what she interpreted. Without this, I have to guess whether my meal is in safe hands, and conversely, Alexa has no error feedback loop. It’s the equivalent of pressing the “OK” button on a graphical user interface. In our experiment, we had the luxury of multiple confirmation paths, which included voice and display. We were successful in reaching a balance in using both methods to achieve confirmation whilst maintaining a short and snappy user journey.

Voice confirmation is akin to the “OK” button for a GUI.
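As a rough sketch of balancing voice and display confirmation, the function below echoes back the interpreted value in both cases but keeps the spoken part shorter when a screen can carry the detail. The function name, the wording, and the rule for splitting content between the two channels are assumptions for illustration, not a record of what we built.

```python
def confirm_timer(minutes, label, has_display):
    """Build a confirmation that always repeats the value the assistant understood."""
    detail = f"{minutes}-minute {label} timer"
    if has_display:
        # The screen shows the full detail, so speech can stay short.
        return {"speech": f"{minutes} minutes, set.", "display": f"Timer set: {detail}"}
    # Voice only: speech has to carry the whole confirmation.
    return {"speech": f"OK, {detail} set.", "display": None}


print(confirm_timer(90, "oven", has_display=False)["speech"])
print(confirm_timer(90, "oven", has_display=True))
```

In both branches the number the assistant heard is spoken aloud, so the user can catch a misheard “90” even without looking at the screen.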

Tonality

Reducing words in dialogue can also have a negative impact. Short answers can sometimes feel abrupt or rude. Finding balance in the dialogue of a digital assistant is as much art as science. Providing an intentionally friendly vocal tone can also help balance the conversation. As we shortened the dialogue for our digital assistant, we spent a considerable amount of time reworking her tone to sound happy, friendly and optimistic.

Understanding human tone and body language can take this to the next level. Our digital assistant had a camera that could see the upper half of the person’s body. Most digital assistants don’t have computer vision, so tone comprehension becomes all the more essential. Consider how natural language processing could adapt responses based on the tone of the user. Alexa has been working on understanding tone for a while now. Detection is one thing, but the response is equally vital. I regularly get frustrated when I add an item to the shopping list and she gets it wrong. Consider the following example:

Human 🙂 - “Alexa, add Cheerios to the shopping list.”

Alexa - “I’ve added chairs to the shopping list.”

Human 😠 - “No, I said to add Cheerios to the shopping list!”

Alexa - “I’ve added Cheerios to your shopping list.”

Rather than repeating the same confirmation, it would be better for her to say, “Got it, Cheerios,” or “Done, Cheerios.” As a regular user, this would be such an improvement for me. Siri seems to handle this better, which may be a result of it being built initially in the context of a mobile-first interaction.
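The suggested behaviour could look something like the sketch below. It does not describe how Alexa or any other assistant works; the function names are hypothetical, and the check for a correction is a crude text heuristic standing in for real tone detection.

```python
def looks_like_correction(utterance):
    """Crude stand-in for tone detection: did the user open by pushing back?"""
    text = utterance.strip().lower()
    return text.startswith(("no,", "no ", "i said"))


def confirm_added(item, is_correction=False):
    """Shorten the confirmation when the user has just had to repeat themselves."""
    if is_correction:
        return f"Got it, {item}."
    return f"I've added {item} to your shopping list."


retry = "No, I said to add Cheerios to the shopping list!"
print(confirm_added("Cheerios", is_correction=looks_like_correction(retry)))
```

The short form still names the item, so the confirmation loop stays intact while the response acknowledges that something went wrong the first time.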

Role Play

The complexity of forming voice user journeys can be challenging for even the most seasoned teams. We found role-playing useful for testing and developing a dialogue that was natural whilst capturing subtle variations in how different humans might ask for the same thing. I also spent some time testing other digital assistants, including Google Assistant, Alexa, and Siri, to determine how other voice teams handled particular types of dialogue. Using these approaches and feedback from the experiments themselves, we were able to refine the dialogue and overall experience to be more effective.

Testing on other digital assistants can provide insights into how others have solved the same problem.

Takeaways

Natural language processing is a fascinating field that is progressing rapidly. We have the opportunity to guide this by considering the nuances of human communication. Consider the following as you embark on your voice journey:

· Keep the scope of the skill narrow.

· Use the graduated approach of happy path dialogue, experience flow and interaction models to plan before cutting code.

· Build dialogues that adapt to the proficiency of the user, context, and variations that naturally exist between humans.

· Provide a level of control for errors and general usage that allows the user and digital assistant to correct their journey.

· Confirm essential variables via voice, display, or both to reinforce the journey path.

· Consider tonality as you play with the dialogue to maintain a friendly interaction.

· Test, test, and test some more through role-play and experimentation.

By considering the components of effective communication, the personas of our users, and their intent, we can start building digital assistants that feel more natural and achieve the task more efficiently.