A Failure Story in Designing a Voice User Interface #1

Sun Kim · UX Planet · 7 min read · Oct 21, 2017

(The images in this article were downloaded from the web and may not be used for commercial purposes.)

I have been designing UIs for 9 years. Whether that is a short or a long time depends on your point of view. This article walks through the breakdown of a UI for a single function that I have been designing for those 9 years.

It is a shameful story, painful and regrettable, but I hope people in the relevant industry will look back on it at least once.

Designer of a sophisticated in-vehicle Voice User Interface

Although the hot topics have now shifted to AI, machine learning, and IoT, I design UIs for the vehicle, which was a hot field for quite a while. Among those UIs, I design the strangest one of all: the voice user interface. Isn't it great? What an attractive function (interaction) to have in a vehicle!

The job is actually nice: you predict and define the words people will say, and design scenarios that judge various conditions and carry out the function the user wants when those words are spoken.

Voice User Interface. Doesn't it look nice?

What Voice User Interface are people expecting?

Let's think about the essence (UX) of a voice interface. Many people will think of 'Jarvis' from the movie 'Iron Man', 'Samantha' from the film 'Her', and 'KITT' from the old TV series 'Knight Rider'. What they have in common is that they all serve the protagonist superbly in the role of 'secretary'.

They understand what is said, faithfully carry out instructions, and make few errors. Their only shortcoming is that they sometimes have to ask the master to explain again because they could not understand. On top of that, they prepare things in advance by predicting the protagonist's behavior.

'Jarvis' from the film 'Iron Man' (left); 'Samantha' from the film 'Her' (right)

Technically speaking, recognizing speech is a separate domain from providing a suitable result. Speech recognition simply transcribes the voice as it is, and its performance is close to 100%. 'Providing a suitable result', on the other hand, is the domain of artificial intelligence (AI), like AlphaGo: it analyzes large amounts of data and the user's characteristics, and depending on the person its performance can fall below 10%.

In any case, the 'Voice User Interface' people imagine is a mixture of these two things.

Voice User Interface in reality

If you have an iPhone, long-press the Home button in the middle. A screen with a microphone-shaped button appears. Say anything and it gives an answer. This is 'Siri', Apple's voice assistant. Whatever you say, the answer is sometimes good and sometimes not.

People carrying Android phones can also long-press the Home button to launch something similar. Samsung phones provide a voice assistant called Bixby; LG phones provide Google's assistant, which is similar to Apple's Siri. (Beyond these, there are also Amazon's Echo and Microsoft's Cortana.)

Apple Siri, Samsung Bixby, Google Assistant

If you curiously try this and that, you will find that these services are not perfect either. Most results come from analyzing server-side data and the user's information; they do not announce the single most suitable result in a pleasant voice like in the movies.

The Voice User Interface I designed

The voice recognition system I have been designing for the past 9 years is more limited than a smartphone's. Say anything to a smartphone and it answers, whether or not it understood you. With my voice user interface, however, the user must say a predefined word or sentence (called a 'command'). Worse still, if the user says something that is not a predefined command, the system cannot even tell that it isn't one. The reason is this: the system must work even without an internet connection, so it cannot send voice data to a server, and the hardware lacks the performance to make that judgment on its own.
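To make the limitation concrete, here is a minimal sketch (my own illustration, not the author's actual system) of an offline, fixed-grammar recognizer. The command strings and action names are invented for the example.

```python
# Illustrative sketch of a fixed-grammar, offline command recognizer.
# Only utterances registered in the grammar are ever understood;
# everything else is silently unknown, with no way to explain why.

GRAMMAR = {
    "call john smith": "dial_contact",
    "play the radio": "start_radio",
    "navigate home": "start_navigation",
}

def recognize(utterance):
    """Return the action for a predefined command, or None.

    There is no server round-trip and no semantic model, so an
    unregistered phrasing of the same intent simply fails.
    """
    return GRAMMAR.get(utterance.strip().lower())

print(recognize("Call John Smith"))   # dial_contact
print(recognize("Phone John Smith"))  # None: same intent, wrong phrasing
```

The second call shows the core problem: a perfectly reasonable phrasing fails because it was never registered.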

The UI has to be designed under technical limitations this serious.

The voice interface screen I designed

We have to improve while admitting the technical limits.

The parts that ran into technical limits, natural-looking though they were in some ways, drew heavy criticism from users after the product launched. That is only natural: people had already pictured a voice user interface at the level of Jarvis from Iron Man. In this reality, requests to fix users' complaints (usability) came rushing in.

Whatever the improvement, it starts with benchmarking. Since the voice user interface was for vehicles, I zealously benchmarked competitors' cars. I strove day and night in the faint hope of discovering something (benchmarking 12 cars in a week is truly arduous). I tried really hard; it was my self-esteem as a UI designer that drove me, rather than anyone's instruction. (Yet every car I benchmarked had the same problems.)

First improvement:
People all speak differently. Make it work like everyday language.

Although I cannot share the benchmarking findings because they are a company secret, the first improvement was 'like everyday language'. There is no fixed pattern in how people speak; everyone phrases things differently. This is called 'natural language': for example, 'Call John Smith', 'John Smith, make a call', 'Would you call John Smith?', and 'Phone call John Smith'.

As explained before, the system is designed to recognize only predefined words/sentences. So to make it handle 'everyday language', every expected phrasing must be registered as a predefined command. The number of commands grows enormously, and the system that must handle them all slows down remarkably.
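The growth is multiplicative. As a hypothetical illustration (the phrasings and phonebook size are invented, not from the article), supporting just a few phrasings across a modest phonebook already inflates the grammar:

```python
from itertools import product

# Hypothetical numbers: to accept "everyday language" in a fixed-grammar
# system, every phrasing of every contact must become its own command.
phrasings = [
    "call {}",
    "{} making a call",
    "would you call {}",
    "phone call {}",
]
contacts = ["contact {}".format(i) for i in range(500)]  # a modest phonebook

# Every (phrasing, contact) pair is a separate grammar entry.
commands = [p.format(c) for p, c in product(phrasings, contacts)]
print(len(commands))  # 2000 entries, just for placing calls
```

Every entry must be searched at recognition time, which is why response latency ballooned on the embedded hardware.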

Imagine you told someone to 'Call John Smith' and he placed the call 5 seconds later. How would you feel?

Waiting to run…

In the end, this improvement failed. Most of the commands that had ballooned were deleted again.

Second improvement:
Gather the commands people can use, and let users know what they are.

With the first improvement failed, the 'natural language' problem remained. The system cannot help recognizing only the predefined words, yet people actually speak in natural language (naturally…). So the next improvement was to let users know which commands are available. The idea was rational; the problem was how to realize it.

I argued that the usable commands should be displayed on the screen; others argued they should be announced by voice guidance. I claimed strongly that they should appear on the screen, as someone who knows the nature of visual and voice information (visual information: attention is badly distracted while driving; voice information: it is volatile and hard to remember). Nevertheless, for several reasons I cannot put on paper, it was decided to announce the commands by voice guidance.

The opposing opinion at the time was stifling. They claimed that in-car voice recognition must never be shown on the screen, because it has to be usable without looking at the screen while driving. So the voice guidance was built out in earnest, as follows:

“You can say A, B, C, D, E, and F. Please say a command.”

The driver must remember A through F while driving and say one of them. How is that different from taking a listening test while driving?

I don't want to take a listening test while driving!!

Third improvement:
Explain by voice the points the driver must note when using the voice interface.

Still frustrated by the second improvement, I was ordered to explain, in the same way, the cautions for speaking commands and how to operate the voice interface. Again, it had to be done by voice guidance.

Normally, detailed guidance like this is delivered through a user manual or a YouTube video. Its usage frequency is very low, and users seek it out only when they face a problem they cannot solve on their own; delivering this guidance while driving was a problem in itself.

In the end, I had to produce a voice guidance explanation that talks for 3 to 5 minutes. (You must say only the given commands; you had better close the windows when speaking; and you must speak only after pressing the voice recognition button and hearing the beep.)

I don’t want to hear it!!

Fourth improvement:
The recognition rate of the user's voice is low. Add a step to confirm whether the device recognized it correctly.

The improvements so far were painful, but the fourth was the worst. Even though the system announced the commands and how to use them, users kept complaining: "It doesn't recognize my voice properly." The committee judged that the system was recognizing something other than what the user said and acting on whimsical results, so I had to build a defensive device. (There is no point asking users 'which part did not work'; however closely you ask, they don't know the cause.) When the user speaks, the defensive device asks back whether what it heard is correct, and it acts only when the answer is 'yes'.

If every time you said 'Call John Smith' it asked back, 'Do you want me to call John Smith?', and you had to answer 'yes', would you use it?
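The confirmation loop described above can be sketched as follows. This is my own illustration of the idea, with invented function names, not the author's implementation:

```python
# Hypothetical sketch of the "defensive device": every recognized command
# is read back to the user, and nothing executes until the user
# explicitly answers "yes".

def confirm_and_execute(command, ask_user, execute):
    """Ask for confirmation; execute the command only on a 'yes'."""
    answer = ask_user("Do you want me to {}?".format(command))
    if answer.strip().lower() == "yes":
        execute(command)
        return True
    return False

executed = []
# Simulated driver reply: the driver must say "yes" on every interaction.
ok = confirm_and_execute(
    "call John Smith",
    ask_user=lambda prompt: "yes",
    execute=executed.append,
)
print(ok, executed)  # True ['call John Smith']
```

The sketch makes the cost visible: a one-utterance task has become a three-turn dialog (command, read-back, confirmation) on every single use.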

Won't it just annoy people?

A new voice user interface design, but…

to be continued

(You can read the Korean version via this link.)

In-vehicle HMI designer,
MirrorLink Occlusion Test Subject Matter Expert

anitooni@gmail.com


In-vehicle HMI architect and VUI designer at Hyundai MOBIS. Likes 'proper' interface design, not excessive; follows 'Deploy or Die'; loves tech and trends.