What Makes a Conversation 'Usable'?

AI × UX Summit
by StratMinds  
Rebecca Evanhoe
Immerse yourself in a one-of-a-kind experience, where great AI meets amazing UX!
  • Full Transcript
    What Makes a Conversation 'Usable'?
    Rebecca Evanhoe

    Hey, good morning everybody. I'm really excited to be our first speaker of the day. Yesterday, after all the great talks, I had to tweak my talk so that it's more in conversation with all the cool stuff that we learned and talked about.

    I'm Rebecca (Becca), my pronouns are she/her, and I'm a conversation designer. I'll tell you exactly what that means in the course of this talk. My background is weird. Chemistry and fiction writing: makes perfect sense that I would end up in conversational AI, right? But I have worked in this field for 12 years.

    This is a formal title that I held to give you some cred, some context for me. I was at AWS, I was the first conversation designer on their conversational AI team. But I no longer work at AWS. I work at Slang and I'm their conversation design strategist. So I'm their conversation designer and I also own the roadmap for where our conversation is going to go in the future. And I teach at Pratt, which is a design school in Brooklyn. And I co-authored Conversations with Things, which is a conversation design book.

    I will call your attention to the publication date, 2021. In November 2022, it got a little existential because LLMs became visible to the public. I'm really happy to say that our book was based on principles. There is mention of specific kinds of technology, but we feel, and what other people are saying confirms, that our book still applies because it's about the principles of conversation and conversation design. So everything I'm going to say applies both to kind of classic NLU and speech recognition technology and to new gen AI.

    In preparation for this conference, I talked with Summer and Anton and their guiding question for this talk and what they wanted us to explore was how to make exceptional user experiences with AI. So I'm going to start by talking about times where I think I noticed something exceptional. In the course of my career, I have made many mediocre things, but I have also made some good things. So I'm going to give some examples of times where I can tell that our product is delighting people.

    This is Tina Jones. I worked for a company called Shadow Health that eventually was acquired by Elsevier. I spent five years working on this product that was training nursing students how to interview people better. So it was a conversation based simulation, but it was for training them how to talk to people. We developed all these virtual characters with rich health histories and Tina Jones was one of them.

    We saw delight when users interacted with Tina. It was an early stage product, so I also saw some hellish things. But eventually we did see people really connecting with Tina. And here's how I knew that. Everybody knew Tina's birthday because, you know, when we go to the doctor and the nurse asks you, they have to confirm several times your date of birth as part of like confirming your identity. So students in the simulation had asked Tina her birthday and classes would throw birthday parties for Tina. They would send us pictures and they'd bring in cupcakes and celebrate Tina.

    This is true. Tina, the first interaction with her, you learn that she has a foot wound. That's the first time she comes into the doctor. She comes in other times over the course of like eight modules. Students really got to know her health history, but also her life. You know, she's very close to the family. She was super hardworking. She had a full time job and she was in school. People really empathize with that. So people got to know her socially and they also made fan art of her foot wound.

    This is actual Tina. I was not in the graphics department, and this was 2012, so I know those are a little dated. But this is Tina and her foot wound, and people would make fan art of the foot wound and email it to us. So all this to say, graphics aside, people were still connecting with this character and this narrative.

    Fast forward about a decade. Now I work at Slang, and we make a voice AI that answers phones at restaurants. A little bit less emotionally taxing than the previous things that I've worked on. But I really love how concrete this experience is. When you call a restaurant that has our voice agent answering phones, you can book a reservation, start to finish. You can ask questions like, are you seating now? Can we walk in? You can say things like, I have food allergies. Do you have vegetarian stuff? Or I'm having trouble finding you. You can get directions and orientation, things like that.

    I also see signs of delight in this interaction as well. We see a lot of side talk. So somebody will be on their cell phone, often on speaker phone, and you can hear that they are saying to another human, it's like a robot thing, but it worked. So we hear that kind of surprise. We see people say to the agent, thank you, robot. Kind of like, you know, this incredulity in their voice. And we see people say things like, wow, when they actually find that they got a reservation booked, they get a text to confirm.

    Delight is possible, but it's not easy to come by. I think most interactions with things that you call on the phone are not delightful. The reason that it's so hard to get right is that conversation is such a remarkable thing that we do. I mean, we participate in it every day, and it is astounding.

    I'm going to define conversation and just spend a little bit of time talking about conversation itself. I define it as the turn-based exchange of language between two parties. So we classically would say between two people, but now we have AI in the mix. And that turn-based part is really significant. We're not just talking at each other simultaneously, right? It's a really sequenced, delicate act that we learn to participate in really instinctively. And we're also not just blasting messages at each other. There is this kind of implied goal in conversation that we're trying to understand others and make ourselves understood to others.

    When we think about the turns of a conversation, we're thinking about how not just the language we're exchanging unfolds over time, but there's also lots of timing components and things like that. So conversation is this rich exchange of data. It's not just the words in what we're saying. It's our facial expressions, our gestures. It's the pitch of our voice. It's the timbre, the quality of our voice. Do we sound tired? Do we sound angry? A million things. There's visual data that we're pulling in, and there's contextual data like power dynamics. All of these things come into play, and we're synthesizing all of this together as we're trying to make meaning.

    Conversation is very behavioral, so that delicate timing of the turns is cultural and social. But it's also things like when is it okay to interrupt someone? How long should we wait before we respond? Those things add meaning or add power dynamics to the conversation. Conversations are directional, and I don't mean that every conversation is transactional. I just mean that even when you're chatting and you're catching up with your best friend and you haven't seen them in a month and you're talking for three hours, there are vectors, there are sequences, there are segments of those conversations.

    Conversations are repairable. People repair a conversation on average within four seconds when you're talking to another person. So repairing and readjusting the conversation is very natural. It's not an edge case. It's part of the fabric of how we do it. Conversations have a point of closure, usually. Have you ever had the experience where the conversation should be over and someone continues to talk and you feel that mounting panic of, like, I'm trying to go? That's because if someone is violating the natural point of closure, we feel that, we experience that.

    So when we're thinking about getting people and AI to talk, over here we have the human who has been practicing this behavior and this meaning-making since birth. This has become instinct for us. We're not even consciously aware how good we are at turn-taking and all of these things. We have trained our existing neurological toolkit even further to be excellent at conversation. And then over on the other side, we have AI, which can only do a small fraction of the stuff that we're doing. So the goal is to kind of figure out the power of the one and the semi-power of the other and try to get the conversations as successful as we can.

    Conversation is really different, and I'm just going to give one example of why it's such a different interaction than other kinds of interactions with tech. To use the example of search: search uses language. So if I'm searching something in Google, I type in my question in language. I can even hit the little microphone icon and speak it, and I can put in something that's pretty natural like, how much did it rain last night? And it outputs language. So these are words arranged in grammatical patterns that I can interpret. So it's giving me language back. But it's not turn-taking, and my expectations are ultimately that I poke around and I will know when my question is answered. So it's kind of on me to finish the interaction. That's my expectation for search.

    But as soon as we start mimicking conversational interaction, it changes what we expect. I do not want Alexa to tell me some possible answers for the weather. I need the answer for what the weather is. And I also have the expectation that I can ask a follow-up question if I have one. I have another example, but it's about food poisoning. So it's just really different. People and users expect completely different things. The bar is so high. And conversation designers are people who work in any role in conversational tech.

    It's really easy to focus on the design of the bot side. What is it going to say? What's its personality? And that's part of it, but it's really like only half. So it's really thinking about the system. But for one second, let's focus on the tech and I'll just give you a snapshot of how I think about it.

    To me, conversational AI is this whole collection of technologies that imitate talking and listening. And I have to put those in quotes because the tech is imitating it. It's not doing the same thing that we're doing. But it's imitating us pretty well. And then to go a little further with definition, that collection of technologies includes devices and algorithms. And that they do this imitation thing.

    So when we think about devices, as a conversation designer, I would make different design decisions depending on the device the conversation is occurring on. If it's something multimodal, we have a lot that we can do getting the visuals and voice inputs and outputs to be very flexible, to be very natural in the user's terms. But people also interact with AI through their phones, through websites, even the humble landline. You can pick it up, call about a million numbers and talk to AI. So I think about the device that the person is using to interact with the algorithm.

    And then I also think about the algorithm side because the algorithms are imitating the language part of human intelligence. That's the AI of the AI. And if the language part of human intelligence sounds a little vague, I'll unpack it. We interact with language not just through conversation. If we break it down into some components, our thoughts are electrical impulses. And there's an area of the brain that puts those into words. There's also a different part of the brain that takes the words we want to say and helps our body physically produce speech.

    When you're producing speech, it's all wild to me because you're taking electrical impulses, translating it into words, and then your body is like, your muscles are moving to produce a sound wave that's coming out and is going to travel through the air. And it's going to contain sounds that represent the words you're trying to say. That is the wildest thing I've ever heard of. It blows my mind every time. And then somebody else is interpreting the sound wave.

    And we also interact with language through reading and writing. We can have synchronous conversations through texts. But we also read books written by people from a million years ago. We have this idea of this discourse, which is sort of like the conversation our species is having with itself. And we can participate in that through reading and writing.

    There are a bunch of different things that imitate the language part of human intelligence. There's natural language understanding. That's a whole category of algorithms. Classically, we would work with intent classification. That's the most familiar thing to a lot of conversation designers who have been working in the past. There's text-to-speech, which takes text and turns it into the sound wave of a synthetic voice. There's speech recognition, which takes the sound wave and tries to turn it into the probable words. And then there's natural language generation. LLMs are a subset of natural language generation. NLG is old, too. So LLMs and gen AI are kind of like the newest in that camp.
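    The stack above can be pictured as a pipeline: sound wave in, sound wave out, with interpretation and generation in between. Here's a minimal Python sketch; every function is a stub standing in for a real model, and all the names and responses are illustrative, not any real product's API.

    ```python
    # Toy sketch of the conversational AI stack: ASR -> NLU -> NLG -> TTS.
    # Each function is a placeholder for a real model or service.

    def speech_recognition(sound_wave: bytes) -> str:
        """ASR: turn a sound wave into the most probable words."""
        return "can we walk in right now"  # stubbed transcription

    def intent_classification(utterance: str) -> str:
        """NLU: map the words onto an intent label."""
        return "ask_walk_in" if "walk in" in utterance else "fallback"

    def generate_response(intent: str) -> str:
        """NLG: produce the response text (an LLM is one way to do this)."""
        responses = {
            "ask_walk_in": "Yes, we're seating walk-ins right now.",
            "fallback": "Sorry, could you rephrase that?",
        }
        return responses[intent]

    def text_to_speech(text: str) -> bytes:
        """TTS: turn the response text back into a sound wave."""
        return text.encode("utf-8")  # stand-in for audio synthesis

    def handle_turn(sound_wave: bytes) -> bytes:
        """One conversational turn: listen, interpret, respond, speak."""
        words = speech_recognition(sound_wave)
        intent = intent_classification(words)
        reply = generate_response(intent)
        return text_to_speech(reply)
    ```

    The point of the sketch is the shape, not the stubs: each stage imitates one piece of the language part of human intelligence, and the design work happens at every seam.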

    So given we have this whole stack of technology, we have amazing human capabilities, how do we actually get these things working together? How do we get this close to a human conversational standard? So a usable conversational AI, a conversational AI that's well-designed, generally does these things. It's a good listener, meaning it's like a good interpreter. It can sort of get the point of what the person's trying to say. And it's a good responder. So that means lots of things. That means it answers the question correctly. It means it gives the right amount of information. And it also means that the text is organized in a way that's easy to read or listen to.

    And being a good listener and a good responder are sort of falsely separate, because part of the way you show you're a good listener is by being a good responder. The turn-taking behavior should participate on par with human patterns. Some limitations and constraints there. But you don't want long pauses. You don't want interruption. A usable conversational AI will set expectations. And this, from Sujin's talk yesterday, was really meaningful. Sometimes we set expectations that it is a system. Sometimes it's useful to make it very clear it is not a person. So we're setting the proper expectations for what the system can actually do. We're also setting boundaries. If there is an edge to the conversation, or something that's not known, it should be clear. And it should account for context, meaning that the responses are nuanced and precise and very much in tune with the user's context and the specificity of their question.

    So it's easy, right? So I talked about all this stuff. I want to tell you very concretely what conversation designing is. People are like, I know we need to hire a conversation designer, but what are you going to do when you get here? So I'm looking at this system. I'm trying to get it to work. It is a form of UX. So I'm doing the UX stuff. I'm understanding people. Who's the audience? Are there segments of the audience? What are their needs, their pain points, user journeys, stuff like that? But we're also looking at their language patterns. How are they talking about the thing? How are they expressing themselves? Is there specialized language they're using, or that we should use? Are there many languages or dialects to accommodate? All that kind of stuff.

    We're thinking about the device. How do users input? Is it voice? Is it text? Is it both? Can they type? Can they tap? All that. We're thinking about the output. Is it voice, text? If you say, Alexa, turn on my lights, the output is actually the light coming on. So we think about the outputs. We think about when these interactions take place in time and where they take place. Are you in the car? Are you in public? Are you in private? And then we think about what these algorithms are up to. What do they do? What are they good at? What does it look like when they're doing their job the best they can? And then what kind of training do they need for that optimal performance?

    So this is a lot of thinking. And then most concretely, in my role, I own all of this stuff. So there's a component of personality design. Personality should fit the use case. So if your use case is for something really efficient, then you don't want a chatty personality, for example. Personality has a job to do. And the voice should support the personality. So if there's an out-loud synthetic voice that you hear, the voice carries a lot of the personality. People perceive a lot in what they hear. So we're using that to set user expectations.

    In my role, I write the responses. But if I'm working with LLMs, I'm potentially more evaluating their responses. And I'm looking for accuracy, precision, clarity. I'm making sure that the answer isn't too general if the question was specific. I'm looking for grammar structures that improve comprehension. So we're looking at a lot of components.

    I think about all the pathways we want to support. Sometimes that's finite and sometimes that's infinite. So we try to get a representative sample. I think about the training data, no matter what algorithm is doing the NLU or LLM processing part. So I work a lot with training for intent classification. But if we're working with LLMs, I'm working on the prompt, and I'm also evaluating the output and running trials on that. We're also thinking about knowledge databases and structured content, if that's a part of the equation; we're thinking all of that through.

    Logic and context is a huge part of my job. So I'm doing all this stuff, but as I'm planning out the pathways, I'm thinking about all these different conditions where data might branch the conversation. So are they authenticated? What device are they using? Is it a text-only device? Is it a multimodal device? In a restaurant use case, if somebody says, can we walk in right now? We have to account for whether the restaurant takes walk-in parties, whether they take reservations, if they have a preference. There's all kinds of conditions that would serve up different responses that would be accurate and precise.
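    The walk-in example can be made concrete. Here's a small sketch of that kind of branching logic; the `RestaurantPolicy` fields and the response wording are hypothetical stand-ins, not Slang's actual data model or copy.

    ```python
    from dataclasses import dataclass

    # Hypothetical policy object: the conditions the designer said
    # the agent must account for before answering "can we walk in?"
    @dataclass
    class RestaurantPolicy:
        takes_walk_ins: bool
        takes_reservations: bool
        prefers_reservations: bool

    def answer_walk_in_question(policy: RestaurantPolicy) -> str:
        """Branch the response on the restaurant's policy so the answer
        is accurate and precise for this particular restaurant."""
        if policy.takes_walk_ins and policy.prefers_reservations:
            return "Walk-ins are welcome, but we recommend booking a reservation."
        if policy.takes_walk_ins:
            return "Yes, feel free to walk in."
        if policy.takes_reservations:
            return "We don't take walk-ins, but I can book you a reservation."
        return "Let me connect you with the restaurant for that."
    ```

    One user question, four accurate answers: the design work is enumerating the conditions, not just writing a single reply.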

    So in my role, I build. I use Dialogflow CX and I build the conversation. I'm making the pages and the pathways, collaborating with my developer partners on the data that gets passed back and forth. But I am the builder. Actually, this is the first role where I've been the sole builder, and it's been so exciting and incredible because I don't have developers telling me that something is too complicated. I can just build it, and I love it.

    And then we also think a lot about the turn-taking. So if we notice our system is cutting people off, or people are interrupting it, then we have a problem that we need to fix on the behavioral side. So we're working with the timeout windows. If people are interrupting the bot, it's probably too long of a response. So we're kind of troubleshooting if we see anything like that.
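    That troubleshooting maps onto two symptoms and two fixes, which can be sketched as a tiny triage function. The rate metrics and the 5% threshold are illustrative assumptions, not real tuning values.

    ```python
    def diagnose_turn_taking(cutoff_rate: float, interruption_rate: float,
                             threshold: float = 0.05) -> list:
        """Toy triage of the two turn-taking symptoms described above.
        Rates are fractions of turns; the threshold is an assumed value."""
        fixes = []
        if cutoff_rate > threshold:
            # System stops listening too early: widen the end-of-speech timeout.
            fixes.append("increase end-of-speech timeout")
        if interruption_rate > threshold:
            # Users barge in mid-response: the bot's replies are probably too long.
            fixes.append("shorten responses")
        return fixes
    ```

    In practice these numbers would come from call analytics, but the logic captures the pairing: cutoffs point at the timeout window, interruptions point at response length.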

    So you can tell that this job is a lot of different things. And if you're a nerd who likes really hard puzzles, come my way. So the way that I've learned to be a conversation designer, and the way that I'm able to push towards exceptional experiences, is through looking at the people. What are people doing? Is it working for them? So we do usability testing, where we watch people use the thing in front of us. That's really valuable. To me, it's irreplaceable. Obviously, you do that at a much smaller volume, but we tend to get such depth of learning from watching and talking to people about what they experienced. I also look at a lot of transcripts. So that contains the conversation and a lot of data about it. It's less information than watching somebody do it, but it's still really valuable. I look at transcripts myself.

    I like to get the lay of the land, what's going on in the kingdom. But we, of course, have some automations in this process. So we can put in little flags where if certain conversational patterns emerge, we know it's good. We don't need to look at it. If there's other collection of data factors, that might be an indication that it's a problem transcript and we do look at it.
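    Those flags can be sketched as a simple triage rule: known-good patterns get skipped, known-problem signals get surfaced for review, and the rest gets spot-checked. The pattern strings below are made up for illustration; a real system would use its own signals.

    ```python
    def triage_transcript(transcript: list) -> str:
        """Toy version of the flagging described above: skip transcripts
        matching known-good patterns, surface likely problems for a human.
        The signal phrases are illustrative, not a real rule set."""
        text = " ".join(transcript).lower()
        problem_signals = ("sorry, could you rephrase", "talk to a person")
        good_signals = ("reservation is booked",)
        if any(p in text for p in problem_signals):
            return "review"   # likely a problem transcript; a human looks
        if any(g in text for g in good_signals):
            return "skip"     # known-good pattern; no need to look
        return "sample"       # otherwise, spot-check a sample
    ```

    The design choice is that problem signals win over good signals: a booked reservation that also hit a repair loop is still worth a human's eyes.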

    So there is, I think a lot of people go like, that's not scalable. Why are you looking at all these transcripts? And it's like, that's how we know all the problems and then we fix them, right? So to me, that's also a really valuable process. Even if it's an LLM, how are those conversations going? Are they good? You don't know until you look.

    Of course, we look at analytics too. We have a bunch of different measures for an individual session and overall system performance. So we know if like, oh, our match rates are dipping or people are getting a lot more of this one type of conversational error, we're looking at that on a system level and then we can kind of dive in.

    So these are all the pieces of data that my company set up processes around when I joined, so that my whole team and I can push towards excellence. And it's so exciting to have this much data. I know some of you feel the same way about data, but I've never been so empowered. I can look into anything. I can listen to any call. I can see any transcript. I can pull any number based on any set of factors. So I can really diagnose what our problems are.

    So I'll give you a final thought here. Conversation design in a nutshell. When someone says something here, no matter what it is, I think especially in a GenAI LLM world, people think, oh, it just kind of needs to say something. Something sort of on topic, right? But that is not what I'm doing, what we're doing. This needs to be something very special. It needs to be very precise. It needs to be nuanced. It needs to be contextual. And it also needs to continue moving the conversation towards the shared goal. All while kind of seeming chill enough in conversation, like we are, right? So that's a very hard thing. But thinking about what the most excellent, useful, shared-goal-oriented conversation response is that goes here: that's how you get really good conversations.

    So 21 minutes. We have nine minutes for questions. We'll go with Rob and then somebody else should pick. Rob, maybe you want to kick us off.

    Q&A

    Yeah, I've actually got a few. I'll just go with the one. One of the things I was thinking about is conversational design becomes sort of an IP, right? So your service is really effective at having conversations, right? And you can almost trade on that to kind of expand your service portfolio. And that IP in a way is built on data that's being recorded from customer conversations. Is that a potential privacy issue? Or is there a specific way in which you ask permission to use those conversations in your conversational design?

    Yeah, for our case, since it's phone, there's a precedent for that. So we say, just so you know, this is a recorded call. And people can say, talk to a person, and then get to a person where it's not recorded. So in phone, it's a little bit more straightforward. If you're dealing with an Alexa device, or, I know people are working on therapy use cases, obviously that's a much more intense concern. We can get access to a lot of data, but it's anonymized. There are other data privacy people on the team who know more than I do. But also, working on the restaurant use case, because it's low risk, nobody's sharing secrets, so the data feels a little bit lower stakes. And that's kind of nice; it lets me off the hook a little bit.

    So, a little bit of a controversial question. You've shown the key governance. And we have so many things to figure out, first of all, in this blind award. And I really appreciate you kind of structuring everything for us to have a conceptual framework. But I find it fascinating how little we talk about the culture and the actual other languages. And unfortunately, you know, we're kind of experiencing English centric AI. Yes. And, you know, for example, in Hawaiian language, similar to other indigenous languages like Malapol, there's more concepts about interconnectedness, right? There's nature, there's community, and it's completely different from English. And, you know, language is more than just a tool for communication. And you know, being Ukrainian, I know how cultural identity is a very important thing. And we're seeing kind of like a wash of that identity with all of these English languages. And, you know, my question to you is, how do we design this conversational experience that supports the multilinguality and actual representation of the culture, Hawaiian culture, Malapol culture, Ukrainian culture, without being so English centric?

    I love this question. This is a very important question. And I totally agree. It's very English centric. And I have only lived in the US. I only speak English. I can order food in French, but that's about it. But there are huge biases. If you speak English very fluently, grew up in, you know, the US or the UK, but because of your family group you have an accent, recognition goes way down. There are studies that show that for Black Americans who are speaking African American Vernacular English, recognition goes down. And yeah, there's this global issue of it all being very English centric.

    I think your question is really important. I don't have a good answer. Do you want to work together? Let's figure it out. But we know these biases exist. And it's also a very hard linguistics problem. Like, you know, if people say "aks" instead of "ask," for example, that's just one tiny example of how pronunciation differences are hard. It's hard to get the sound wave to resolve to the same thing. I'm not excusing it. I'm just saying it's a complicated problem. And I think it's compounded by where the money is, which is the big US tech companies. So we're over-indexing on this one language. I think you maybe had something to add.

    Yeah. So one thing that we're doing at Google right now: we know that there's a huge disparity between white speech and Black speech. And my team currently is actually partnering with Howard University, which is an HBCU out in DC. Howard University is going out and collecting Black speech across the United States. We're trying to hit certain numbers, and we're going to be using that data to actually train our models for ASR. So that's just one example of how you can start solving these types of problems. Once we have this data and start training the models, we're going to use this as a use case and work together as a team to figure out how we can scale it to other dialects within the US and globally as well.

    Will we share that data? The data is actually going to be owned by Howard University. It's not Google owned. That was very intentional. It's a partnership between Howard and Google. It's external news; we've talked about it before. But it's going to be up to Howard as to how they want to share the data.

    Actually, can you introduce yourself? People should know. My name is Aicha. Aicha, cha, cha. I work at Google with Soojin and Pete. I work at Smart Design with Richard, a design consultancy based in New York. I also was on Instagram for about six years. But she's doing the responsible AI. Yeah, I work on responsible AI in the US.

    I want to add something. Yesterday we talked about it, but it's a really important question and something that I emphasize a lot within Google. Right now we're developing AI that's so US-centric. It's not even US-centric; it's Silicon Valley-centric. How do we make it a more global view? One of the activities the user research team is doing is actually traveling a lot. One example: yesterday we talked a lot about collaborative AI. What does collaboration mean in each country? We traveled to five different countries and really learned very different things.

    Also, in a lot of cases companies feel that internationalization is translating English into their local language. But it's not. The perception and the quality are very different. For example, in Korea there's honorific conversation versus casual conversation. Even with perfect content, if it starts to initiate casual conversation, people feel that: do you look down on me? Are you above me? There are a lot of things where different cultures have different meanings in terms of the perception of quality. I hear a lot in Japan, they're saying that it's smart and it speaks good Japanese, but it talks like a foreigner talking to me. I think there are a lot of nuances we have to think about.

    That's one thing within Google: we try hard with engineers and those algorithmic people on how we bring in this culture. It's because language generation is completely different from language translation. People ask me, can't you just translate with ChatGPT? It doesn't work that way. The way you think, the way you build sentences, the way you represent the goals is different. Even the way you do conversations is different.

    You guys may have seen the research paper that came out earlier this month. It talks about the fact that LLMs, although most of them are currently leading the charts, are trained with dominant English data, such that even when you ask one to translate French to German, it goes to English and comes back to German. It never goes straight to German. That's the nativeness of LLM processing. I think the paper was called LLMs Think in English. This was anticipated for a couple of reasons. One is that LLMs are dominantly developed in English-speaking countries, as some people mentioned.

    Secondly, the degree of digitization and the degree of global conversations are both dominant in English. This is part of the reason why companies like SoftBank are building Japanese LLMs, and why Dubai worked on their versions of it, including Falcon. Korea's SK invested $100-plus million in Anthropic to specifically develop multilingual LLMs. These are things you need to be aware of because, knowingly or unknowingly, we are viewing the world through an English lens and translating sentiments, and there are elements that just don't come across right.

    Regarding business with Japan, none of our partners speak Japanese fluently, despite the fact that half of the partners did learn Japanese up to college at some point. We said something to the effect of, "Thank you very much for taking the meeting with us. I hope the meeting was helpful to you." The translation actually ended up being, "I hope the meeting was helpful to me," because the subject and objects are altered, not repeated. If you repeat it, it makes them feel like they're stupid, like "I know you're talking about the meeting." There are certain elements of those things, and there are elements that are very explicit.

    I think the Japanese way of saying thank you has like 60-plus variations. And if you just say thank you without understanding whether you're speaking to someone above you, below you, peers, and so forth, it also sends a signal as to how sophisticated you are in terms of your education and so forth. So yeah, there are a lot of things that you can figure out.

    We published a peer-reviewed paper on fine-tuning LLMs to speak Ukrainian, and it's insane; it's so hard. The problem is you can't solve it within one large foundation model. You have to build almost specific models for specific types of languages, specific cultures. I think we should consider taking this as one of the unconference topics. There are some really interesting companies working in this space.

    Join Swell
Why attend Swell AI?

At StratMinds, we stand by the conviction that the winners of the AI race will be determined by great UX.

As we push the boundaries of what's possible with AI, we're laser-focused on thoughtfully designing solutions that blend right into the real world and people's daily lives - solutions that genuinely benefit humans in meaningful ways.

Who is it for?

Builders

Builders, founders, and product leaders actively creating new AI products and solutions, with a deep focus on user empathy.

Leaders

UX leaders and experts - designers, researchers, engineers - working on AI projects and shaping exceptional AI experiences.

Investors

Investors and VC firms at the forefront of AI.

Organizers

AI × UX
Summit by:
StratMinds

Sponsors

Who is Speaking?

We've brought together a unique group of speakers, including AI builders, UX and product leaders, and forward-thinking investors.