Conversational Assistants and Quality with Watson Assistant – Revisited

By Daniel Toczala

Originally posted on Medium on February 11, 2020 at https://medium.com/@dtoczala/conversational-assistants-and-quality-with-watson-assistant-revisited-123fb3bb9f1f.

Note: I updated the original Conversational Assistants and Quality blog post in February 2020 to add a link to a much better testing notebook that I discovered, and to do a slight rewrite of that section. This blog post is a complete update to that original post – and it restates a lot of what I highlighted in the original post. The BIG difference is the new Python testing notebook – which is located out on GitHub, as CSM-Bot-Kfold-Test.

In early February of 2020 I was informed of this great blog post and Python notebook, on How to Design the Training Data for an AI Assistant. I REALLY liked this Python notebook MUCH better than my original k-fold notebook (from August of 2019). The other nice thing is that you can discover this Python notebook in the catalog in Watson Studio, and just apply it and have it added to your Watson Studio project. The only big difference with this notebook is that you need to have your testing data in a separate CSV file – it doesn’t break up “folds” based on your training data. It didn’t even do folds – just straight training and testing data.

I wasn’t a big fan of that approach, I liked my basic approach of pointing at only a Watson Assistant instance, and using all of the training data in a series of k-fold tests. Nobody wants to manage this data, that data, this file, that file….. it’s an opportunity to screw things up. Most of my customers are NOT AI experts, they just want a suite of tools that they can point at their chatbot engine that will allow them to do some automated testing of their chatbot. I have also noticed that many will use ALL of their training data, and not hold back some as test data. Doing k-fold testing using all of the training data in an existing Watson Assistant instance addresses this.

However, I really liked some of the analysis that they had done of the training data, and some of the other insights that they provided. So I decided to dive in and spend a little time merging the best of both of these approaches together. First, let’s start with some basic “rules” that you should be following if you are developing a chatbot.

Getting Started with Your Conversational Assistant

Back in July of 2019, I was working with a group of like-minded people inside of IBM, and we decided to create an IBM internal chatbot that would capture a lot of the “institutional knowledge” that some of our more experienced members knew, but that didn’t seem to be captured anywhere. We wanted our newer team members to be as effective as our more seasoned members. 

We spent a week or two coming to a common vision for our chatbot.  We also mapped out a “growth path” for our chatbot, and we agreed on our roles.  I cannot begin to stress how important this is – Best Practice #1 – Know the scope and growth path for your chatbot.  We had a good roadmap for the growth of our chatbot.  We mapped out the scope for a pilot, where we wanted to be to release it to our end users, and a couple of additional capabilities that we wanted to add on once we got it deployed.

My boss graciously agreed to be our business sponsor – his role is to constantly question our work and our approach.  “Is this the most cost-effective way to do this?”, and, “Does that add any value to your chatbot?”, are a couple of the questions he constantly challenges us with.  As a technical guy, it’s important to have someone dragging us back to reality – it’s easy to get focused on the technology and lose sight of the end goal.

Our team of “developers” also got a feel for the roles we would play.  I focused on the overall view and dove deeper on technical issues, some of my co-workers served primarily as testers, some as knowledge experts (SME’s), and others as served as UI specialists, focusing on the flow of conversation.  This helped us coordinate our work, and it turned out to be quite important – Best Practice #2 – Know your roles – have technical people, developers, SME’s, architects, and end users represented.  If you don’t have people in these roles, get them.

Starting Out – Building A Work Pipeline

As we started, we came together and worked in a spreadsheet (!?!), gathering the basic questions that we anticipated our chatbot being able to answer.  We cast a pretty wide net looking for “sample” questions to get us kickstarted.  If you are doing something “new”, you’ll have to come up with these utterances yourself.  If you’re covering something that already exists, there should be logs of end user questions that you can use to jumpstart this phase of your project.

Next, we wanted to make sure that we had an orderly development environment.  Since our chatbot was strictly for internal deployment, we didn’t have to worry too much about the separation of environments, so we could use the versioning capabilities of Watson Assistant.  Since our chatbot was going to be deployed on Slack, we were able to deploy our “development” version on Slack, and also deploy our “test” and “production” versions on Slack as well.  These are all tracked on the Versions tab of the Watson Assistant Skill UI.  This gives us the ability to “promote” tested versions of our skill to different environments.  All of this allowed us to have a stable environment that we could work and test in – which leads us to Best Practice #3 – Have a solid dev/test/prod environment set up for your Conversational assistant or chatbot.

How Are We Doing? – K- Fold Testing

As we started out, we began by pulling things together and seeing how our conversational assistant was doing in real-time, using the “Try It” button in the upper right-hand corner of the Watson Assistant skills screen.  Our results were hit and miss at first, so we knew that we needed a good way to test out our assistant. 

We started out with some code from a Joe Kozhaya blog post on Training and Evaluating Machine Learning Models.  I ended up modifying it a little bit, and posting it on my Watson Landing Page GitHub repo.  We also read some good stuff from Andrew Freed (Testing Strategies for Chatbots) and from Anna Chaney (Data DevOps Rules of Engagement),  and used some of those ideas as well.

In February of 2020 I was informed of this great blog post and Python notebook, on How to Design the Training Data for an AI Assistant. I liked this Python notebook MUCH better than my old K-fold notebook, but I liked my approach better. So I went to work combining the best of both worlds into a new Python notebook. My new Python notebook does this – and provides some great insight into your chatbot. Go and find it on GitHub, where it is stored as CSM-Bot-Kfold-Test.

This highlights our next best practice – Best Practice #4 – Automate Your AI Testing Strategy.

Using Feedback

As we let our automated training process take hold, we noted that our results were not what we had hoped, and that updating things was difficult.  We also learned that taking time each week to review our Watson Assistant logs was time well spent. 

It was quite difficult to add new scope to our conversation agent, so we looked at our intents and entities again.  After some in-depth discussions, we decided to try a slightly different focus on what we considered intents.  It allowed us to make better use of the entities that we detected, and it gave us the ability to construct a more easily maintained dialog tree.  We needed to change the way that we were thinking about intents and entities.

All of this brings us to our next piece of wisdom – Best Practice #5 – Be Open-Minded About Your Intents and Entities.  All too often I see teams fall into one of either two traps. 

  • Trap 1 – they try to tailor their intents to the answers that they want to give.  If you find yourself with intents like, “how_to_change_password” and “how_to_change_username”, then you might be describing answers, and not necessarily describing intents. 
  • Trap 2 – teams try to have very focused intents.  This leads in an explosion of intents, and a subsequent explosion of dialog nodes.  If you find yourself with intents like, “change_password_mobile”, “change_password_web”, “change_password_voice”, then you have probably fallen into this trap.

We found that by having more general intents, and then using context variables and entities to specify things with more detail, that we have been able to keep our intents relatively well managed, our dialog trees smaller and better organized, and our entire project is much easier to maintain.  So, if our intent was “find_person”, then we will use context variables and entities to determine what products and roles the person should have.  Someone asking, “How do I find the program manager for Watson Assistant?”, would return an intent of “find_person”, with entities detected for “program manager” and “Watson Assistant”.  In this way, we can add additional scope without adding intents, but only by adding some entities and one dialog node. 

Why K-Fold Isn’t Enough

One thing that we realized early on was that our k-fold results were just one aspect of the “quality” of our conversational assistant.  They helped quantify how well we were able to identify user intents, but they didn’t do a lot for our detection of entities or the overall quality of our assistant.  We found that our k-fold testing told us when we needed to provide additional training examples for our classifier, and this feedback worked well.

We also found that the “quality” of our assistant improved when we gave it some personality.  We provided some random humorous responses to intents around the origin of the assistant, or more general questions like, “How are you doing today?”.  The more of a personality that we injected into our assistant, the more authentic and “smooth” our interactions with it began to feel.  This leads us to Best Practice #6 – Inject Some Personality Into Your Assistant

Some materials from IBM will break this down into greater detail, insisting that you pay attention to tone, personality, chit-chat and proactivity.  I like to keep it simple – it’s all part of the personality that your solution has.  I usually think of a “person” that my solution is – say a 32-year old male from Detroit, who went to college at Michigan, who loves sports and muscle cars, named Bob.  Or maybe a 24-year-old recent college graduate named Cindy who grew up in a small town in Ohio, who has dreams of becoming an entrepreneur in the health care industry someday.  This helps me be consistent with the personality of my solution.

We also noticed that we often needed to rework our Dialog tree and the responses that we were specifying.  We used the Analytics tab in the skill we were developing.  On that Analytics tab, we would often review individual user conversations and see how our skill was handling user interactions.  This led us to make changes to the wording that we used, as well as to the things we were looking for (in terms of entities) and what we were storing (in terms of conversation context).  Very small changes can result in a big change in the end-user perception.  Something as simple as using contractions (like “it’s” instead of “it is”), will result in a more informal conversation style.

The Analytics tab in Watson Assistant is interesting.  It provides a wealth of information that you can download and analyze.  Our effort was small, so we didn’t automate this analysis, but many teams DO automate the collection and analysis of Watson Assistant logs.  In our case, we just spent some time each week reviewing the logs and looking for “holes” in our assistant (questions and topics that our users needed answers for that we did not address), and trends in our data.  It has helped guide our evolution of this solution.

Summary

This blog post identifies some best practices for developing a chatbot with IBM Watson Assistant – but these apply to ANY chatbot development, regardless of technology.

  • Best Practice #1 – Know the scope and growth path for your chatbot
  • Best Practice #2 – Know your roles – have technical people, developers, SME’s, architects, and end users represented
  • Best Practice #3 – Have a solid dev/test/prod environment set up for your Conversational assistant or chatbot
  • Best Practice #4 – Automate Your AI Testing Strategy
  • Best Practice #5 – Be Open Minded About Your Intents and Entities
  • Best Practice #6 – Inject Some Personality Into Your Assistant

Now that you have the benefit of some experience in the development of a conversational assistant, take some time to dig in and begin building a solution that will make your life easier and more productive.

Conversational Assistants and Quality with Watson Assistant

By Daniel Toczala

Note: I updated this blog post in February 2020 to add a link to a much better testing notebook that I discovered, and to do a slight rewrite of that section.

Recently my team inside of IBM decided that we needed to capture some of the “institutional knowledge” that some of our more experienced members knew, but that didn’t seem to be captured anywhere so our newer team members could be as effective as our more seasoned members.  We also wanted an excuse to get our hands dirty with our own technology, some of us had been focused on some of the other Watson technologies and needed to get reintroduced to the Watson Assistant since the migration to the “skills based” UI.

I went looking for something good about developing a chatbot (or conversational assistant), with Watson Assistant, and some good lessons learned.  I found some good information, but not one spot with the kind of experience and tips that I was looking for.  So I thought it might be good to capture it in my own blog post.

Getting Started with Your Conversational Assistant

We spent a week or two coming to a common vision for our chatbot.  We also mapped out a “growth path” for our chatbot, and we agreed on our roles.  I cannot begin to stress how important this is – Best Practice #1 – Know the scope and growth path for your chatbot.  We had a good roadmap for the growth of our chatbot.  We mapped out the scope for a pilot, where we wanted to be to release it to our end users, and a couple of additional capabilities that we wanted to add on once we got it deployed.

My boss graciously agreed to be our business sponsor – his role is to constantly question our work and our approach.  “Is this the most cost-effective way to do this?”, and, “Does that add any value to your chatbot?”, are a couple of the questions he constantly challenges us with.  As a technical guy, it’s important to have someone dragging us back to reality – it’s easy to get focused on the technology and lose sight of the end goal.

Our team of “developers” also got a feel for the roles we would play.  I focused on the overall view and dove deeper on technical issues, some of my co-workers served primarily as testers, some as knowledge experts (SME’s), and others as served as UI specialists, focusing on the flow of conversation.  This helped us coordinate our work, and it turned out to be quite important – Best Practice #2 – Know your roles – have technical people, developers, SME’s, architects, and end users represented.  If you don’t have people in these roles, get them.

Starting Out – Building A Work Pipeline

As we started, we came together and worked in a spreadsheet (!?!), gathering the basic questions that we anticipated our chatbot being able to answer.  We cast a pretty wide net looking for “sample” questions to get us kickstarted.  If you are doing something “new”, you’ll have to come up with these utterances yourself.  If you’re covering something that already exists, there should be logs of end user questions that you can use to jumpstart this phase of your project.

Next, we wanted to make sure that we had an orderly development environment.  Since our chatbot was strictly for internal deployment, we didn’t have to worry too much about the separation of environments, so we could use the versioning capabilities of Watson Assistant.  Since our chatbot was going to be deployed on Slack, we were able to deploy our “development” version on Slack, and also deploy our “test” and “production” versions on Slack as well.  These are all tracked on the Versions tab of the Watson Assistant Skill UI.  This gives us the ability to “promote” tested versions of our skill to different environments.  All of this allowed us to have a stable environment that we could work and test in – which leads us to Best Practice #3 – Have a solid dev/test/prod environment set up for your Conversational assistant or chatbot.

How Are We Doing? – K- Fold Testing

As we started out, we began by pulling things together and seeing how our conversational assistant was doing in real-time, using the “Try It” button in the upper right-hand corner of the Watson Assistant skills screen.  Our results were hit and miss at first, so we knew that we needed a good way to test out our assistant. 

We started out with some code from a Joe Kozhaya blog post on Training and Evaluating Machine Learning Models.  I ended up modifying it a little bit, and you can find the modified notebook in my Watson Landing Page GitHub repo, under notebooks, in a Python notebook stored as ANYBOT_Test-and-Deploy.ipynb.  We also read some good stuff from Andrew Freed (Testing Strategies for Chatbots) and from Anna Chaney (Data DevOps Rules of Engagement),  and used some of those ideas as well.  This led me to create that modified Python notebook, which I used to provide automated k-fold testing of our assistant implementation. 

In February of 2020 I was informed of this great blog post and Python notebook, on How to Design the Training Data for an AI Assistant. I really like this Python notebook MUCH better than my own K-fold notebook mentioned above. The other nice thing is that you can discover this Python notebook in the catalog in Watson Studio, and just apply it and have it added to your Watson Studio project. The only big difference with this notebook is that you need to have your testing data in a separate CSV file – it doesn’t break up “folds” based on your training data. This highlights Best Practice #4 – Automate Your AI Testing Strategy.

After all of this was in place, our team fell into a predictable rhythm of work and review of our work.  Since this was a side project for all of us, some of us contributed some weeks, and didn’t contribute on other weeks.  We were a small team of people (less than 10), so it was easy to have our team manage itself.

Using Feedback

As we let our automated training process take hold, we noted that our results were not what we had hoped, and that updating things was difficult.  We also learned that taking time each week to review our Watson Assistant logs was time well spent. 

It was quite difficult to add new scope to our conversation agent, so we looked at our intents and entities again.  After some in-depth discussions, we decided to try a slightly different focus on what we considered intents.  It allowed us to make better use of the entities that we detected, and it gave us the ability to construct a more easily maintained dialog tree.  We needed to change the way that we were thinking about intents and entities.

All of this brings us to our next piece of wisdom – Best Practice #5 – Be Open-Minded About Your Intents and Entities.  All too often I see teams fall into one of either two traps.  Trap 1 – they try to tailor their intents to the answers that they want to give.  If you find yourself with intents like, “how_to_change_password” and “how_to_change_username”, then you might be describing answers, and not necessarily describing intents.  Trap 2 – teams try to have very focused intents.  This leads in an explosion of intents, and a subsequent explosion of dialog nodes.  If you find yourself with intents like, “change_password_mobile”, “change_password_web”, “change_password_voice”, then you have probably fallen into this trap.

We found that by having more general intents, and then using context variables and entities to specify things with more detail, that we have been able to keep our intents relatively well managed, our dialog trees smaller and better organized, and our entire project is much easier to maintain.  So, if our intent was “find_person”, then we will use context variables and entities to determine what products and roles the person should have.  Someone asking, “How do I find the program manager for Watson Assistant?”, would return an intent of “find_person”, with entities detected for “program manager” and “Watson Assistant”.  In this way, we can add additional scope without adding intents, but only by adding some entities and one dialog node. 

Why K-Fold Isn’t Enough

One thing that we realized early on was that our k-fold results were just one aspect of the “quality” of our conversational assistant.  They helped quantify how well we were able to identify user intents, but they didn’t do a lot for our detection of entities or the overall quality of our assistant.  We found that our k-fold testing told us when we needed to provide additional training examples for our classifier, and this feedback worked well.

We also found that the “quality” of our assistant improved when we gave it some personality.  We provided some random humorous responses to intents around the origin of the assistant, or more general questions like, “How are you doing today?”.  The more of a personality that we injected into our assistant, the more authentic and “smooth” our interactions with it began to feel.  This leads us to Best Practice #6 – Inject Some Personality Into Your Assistant

Some materials from IBM will break this down into greater detail, insisting that you pay attention to tone, personality, chit-chat and proactivity.  I like to keep it simple – it’s all part of the personality that your solution has.  I usually think of a “person” that my solution is – say a 32-year old male from Detroit, who went to college at Michigan, who loves sports and muscle cars, named Bob.  Or maybe a 24-year-old recent college graduate named Cindy who grew up in a small town in Ohio, who has dreams of becoming an entrepreneur in the health care industry someday.  This helps me be consistent with the personality of my solution.

We also noticed that we often needed to rework our Dialog tree and the responses that we were specifying.  We used the Analytics tab in the skill we were developing.  On that Analytics tab, we would often review individual user conversations and see how our skill was handling user interactions.  This led us to make changes to the wording that we used, as well as to the things we were looking for (in terms of entities) and what we were storing (in terms of conversation context).  Very small changes can result in a big change in the end-user perception.  Something as simple as using contractions (like “it’s” instead of “it is”), will result in a more informal conversation style.

The Analytics tab in Watson Assistant is interesting.  It provides a wealth of information that you can download and analyze.  Our effort was small, so we didn’t automate this analysis, but many teams DO automate the collection and analysis of Watson Assistant logs.  In our case, we just spent some time each week reviewing the logs and looking for “holes” in our assistant (questions and topics that our users needed answers for that we did not address), and trends in our data.  It has helped guide our evolution of this solution.

Summary

This blog post identifies some best practices for developing a chatbot with IBM Watson Assistant – but these apply to ANY chatbot development, regardless of technology.

  • Best Practice #1 – Know the scope and growth path for your chatbot
  • Best Practice #2 – Know your roles – have technical people, developers, SME’s, architects, and end users represented
  • Best Practice #3 – Have a solid dev/test/prod environment set up for your Conversational assistant or chatbot
  • Best Practice #4 – Automate Your AI Testing Strategy
  • Best Practice #5 – Be Open Minded About Your Intents and Entities
  • Best Practice #6 – Inject Some Personality Into Your Assistant

Now that you have the benefit of some experience in the development of a conversational assistant, take some time to dig in and begin building a solution that will make your life easier and more productive.