Is My Chatbot Good Enough?
Published by
Larisa Kolesnichenko & Panagiotis Chatzichristodoulou
13.10.2023

Hi! 👋 We are a team of machine learning (ML) engineers at Kindly.

We want to share the hands-on experience we have on best practices in chatbot development (many of them have come from pain and denial, dear friends). We believe that sharing ideas might help other developers. Let’s dive into the world of chatbots.

Best evaluation practices ✅

When it comes to chatbots, what we want to avoid most is wrong answers. Nothing is worse than disappointing a customer! Wrong answers lead to fewer conversions and a bad user experience, and ultimately they cause service abandonment. Technically speaking, what kind of problem setting are we operating in?

Let's define the setting 🏕

To build a bot, we need a large number of question-answer pairs; these pairs will lead the bot to the correct replies once it gets deployed out in the wild. A universal rule here: the more quality data is added at this step, the better for us and the model. These query-answer pairs are called Dialogues. Further down the road, the Kindly platform comes up with automatic suggestions on how to enhance the bot and add better training samples.
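For illustration, you can think of a Dialogue as a class label plus its sample queries and the reply the bot should serve. The structure below is a hypothetical sketch, not the actual Kindly data format:

```python
# Hypothetical sketch of Dialogue training data (not the actual Kindly format):
# each Dialogue is a class label with sample user queries and the reply to serve.
dialogues = [
    {
        "label": "shipping_time",
        "samples": ["How long does delivery take?", "When will my order arrive?"],
        "reply": "Standard delivery takes 2-4 business days.",
    },
    {
        "label": "return_policy",
        "samples": ["Can I return an item?", "How do refunds work?"],
        "reply": "You can return any item within 30 days of purchase.",
    },
]
```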

So, what happens on the machine learning side? We treat every Dialogue as a class with its own label and define the problem as a multi-class classification task: a user writes a query in the chatbot, and we try to predict which Dialogue class this query belongs to. If we don’t reach a certain confidence threshold, the bot goes to Fallback, meaning it lets the user know that an answer to their inquiry couldn’t be found. The pipeline can be decomposed into three basic steps:

  • Preprocessing: bringing the data into a form that can be used as input to our models
  • Training: the training phase of the model
  • Evaluation: where we measure the performance of the trained models

*This article is about evaluating and choosing the best configuration of a trained model.
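To make the threshold-and-Fallback logic concrete, here is a minimal sketch. The `predict_proba` call, the label list, and the threshold value are assumptions for illustration, not the production implementation:

```python
import numpy as np

FALLBACK = "fallback"

def answer(model, labels, query_features, threshold=0.7):
    """Return the predicted Dialogue label, or FALLBACK if the model is not
    confident enough. A sketch: the model API and threshold are assumed."""
    probs = model.predict_proba([query_features])[0]  # one probability per Dialogue class
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return FALLBACK                               # tell the user no answer was found
    return labels[best]
```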

Choosing the right metric to catch it all 🎯

Here the main question comes in: when we say our chatbot is good enough, what is the best metric to capture its performance? Standard metrics such as the F1-score, or plain precision, are not representative enough on their own.

Tackling data imbalance can be tricky and, most importantly, we also want our metric to estimate the percentage of user queries for which our bots have not been able to provide the right answer: this is often referred to as the Uncertainty of the bot, or, more simply, the percentage of Fallbacks.

To calculate the Uncertainty metric, we iterate over every candidate threshold and, for each datapoint in the test dataset, check the model’s prediction probability:

  • If the prediction probability is below the threshold, the datapoint gets filtered out, meaning that the model is not confident enough about its prediction on this datapoint.
  • If the prediction probability is above the threshold, the datapoint is included in the calculation of the model’s precision for this threshold.

After this, we have enough to calculate the optimal confidence threshold for the bot.
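A minimal sketch of that sweep might look like the following, assuming we already have the true label, the predicted label, and the model’s confidence for every test datapoint (the names and structure are ours, not the production code):

```python
import numpy as np

def precision_and_fallback(y_true, y_pred, y_prob, thresholds):
    """For each confidence threshold, compute the precision on the datapoints
    the bot would answer and the share it would send to Fallback.
    y_true, y_pred: true/predicted Dialogue labels; y_prob: prediction confidence."""
    y_true, y_pred, y_prob = map(np.asarray, (y_true, y_pred, y_prob))
    results = []
    for t in thresholds:
        answered = y_prob >= t                     # confident enough to answer
        fallback_rate = 1.0 - answered.mean()      # the Uncertainty of the bot
        if answered.any():
            precision = (y_pred[answered] == y_true[answered]).mean()
        else:
            precision = float("nan")               # nothing answered at this threshold
        results.append((t, precision, fallback_rate))
    return results

# Example sweep: precision_and_fallback(y_true, y_pred, y_prob, np.linspace(0.1, 0.9, 17))
```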


This is, of course, not the only metric that can suit the needs of chatbot performance estimation, but it is clear enough and captures two essential components of bot performance: how well our chatbot predicts the Dialogue class and how often it goes to Fallback.

So far so good! What else have we forgotten to mention?

Not all classes are equal 🥇

When estimating the performance of a chatbot, it is important to remember that while some wrong chatbot replies can go unnoticed and be harmless, others can lead to fewer purchases and fewer service bookings, which is crucial for the business. Hence, when evaluating, we need to prioritize the classes that are core to the business. Essentially, it is important to estimate the cost of each error and prioritize accordingly.
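One simple way to encode “not all classes are equal” is to weight misclassifications by a per-Dialogue business cost. The labels and cost values below are purely illustrative assumptions:

```python
# Illustrative per-Dialogue error costs (assumed values, not real business numbers):
# a wrong answer on "book_service" hurts much more than one on "opening_hours".
error_cost = {"book_service": 10.0, "cancel_order": 5.0, "opening_hours": 1.0}

def weighted_error(y_true, y_pred, costs, default_cost=1.0):
    """Sum of business costs over misclassified datapoints."""
    return sum(
        costs.get(true_label, default_cost)
        for true_label, pred_label in zip(y_true, y_pred)
        if pred_label != true_label
    )
```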


Okay. Seems like we’re now confident enough in what we want to evaluate and how we want to evaluate it.

Now it is time to go big. It is high time for:

Setting up the evaluation pipeline 🧰

For our chatbot factory to run smoothly, we need MLOps infrastructure around the models and chatbot evaluation. (MLOps is a set of practices that aims to deploy and maintain ML models in production reliably and efficiently; the word is a compound of “machine learning” and the continuous development practice of DevOps in the software field.) This infrastructure facilitates running experiments, getting the results in a usable form, preserving experiment results for auditing, handling retention policies, and more.

A simplified Performance Evaluation pipeline might look like the following:

  1. Collect the data of a bot we want to evaluate
  2. Preprocess the data to be model-ready
  3. Take care of class imbalance
  4. Take care of data leakage on all parts of our system (ElasticSearch, Configuration files, Cloud instances, etc.)
  5. Retrain our model using the carefully preprocessed dataset as input
  6. Fire up all the system components using the updated model. If only a subset of components needs to be tested, take care of that, and don’t forget to adjust the metrics if your ML pipeline consists of sequential components that need to be evaluated both separately and as a whole flow.
  7. Store the predictions with a preferred retention policy.
  8. Plot predictions and errors with a dashboard (e.g. Weights and Biases).
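In code, such a pipeline often ends up as a thin orchestration layer over your own infrastructure. Every helper in the sketch below is a placeholder for whatever your stack provides, not a real API:

```python
def run_evaluation(bot_id, thresholds):
    """Skeleton of the evaluation pipeline above; each helper is a placeholder
    you would implement against your own data stores and model-serving setup."""
    raw = collect_bot_data(bot_id)                   # 1. gather the bot's Dialogues
    train, test = preprocess_and_split(raw)          # 2-4. clean, balance, avoid leakage
    model = retrain_model(train)                     # 5. retrain on the clean train set
    predictions = predict_all(model, test)           # 6. run the system on the test data
    store_predictions(bot_id, predictions)           # 7. persist with a retention policy
    log_dashboard_metrics(bot_id, predictions, thresholds)  # 8. plot precision vs. fallbacks
```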

What you might also want to set up as a regular run:

Listen to the experience of the Delivery team 📦

When testing new bot configurations before deployment, it is best to ask the Delivery team to share their feedback and impressions of what is going on. They might have examples of particularly weird bot replies, or assumptions and questions about the bot’s behavior.

Their advice will be very valuable. Like everything, though, take it with a pinch of salt: sometimes the Delivery team’s impression of a new method is based on a single observation rather than on automated mass testing.

Avoid Data Leakage 🚰

In order to gauge a model’s performance correctly, the model should be trained and evaluated on different sets of data: a train set for training and a validation set for evaluation. This correctly emulates the scenario where the model makes predictions, and gets evaluated, on unseen data.

Data leakage occurs during the training phase, when the model is accidentally exposed to validation data. This can be a very subtle issue with serious consequences for model performance: the model knows information it shouldn’t during training, making the evaluation of its predictions invalid.

In our current use case, for example, sanitizing the Elasticsearch indexes by deleting test data before evaluating model predictions is one necessary step; a practical step that is not written anywhere in the literature. To summarize: take care that your model doesn’t “see” validation data during its training phase.
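As a minimal example of the general principle, here is a sketch using scikit-learn: split first, then fit any preprocessing only on the training portion, so nothing from the validation set leaks into training. The `queries` and `labels` variables are assumed to hold your user query texts and their Dialogue labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# queries, labels: assumed lists of user query texts and their Dialogue labels.
# Split BEFORE any fitting, so validation data never influences training.
X_train, X_val, y_train, y_val = train_test_split(
    queries, labels, test_size=0.2, stratify=labels, random_state=42
)

# Fit the vectorizer on the training texts only; only transform the validation texts.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

model = LogisticRegression(max_iter=1000).fit(X_train_vec, y_train)
print("validation accuracy:", model.score(X_val_vec, y_val))
```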

Thank you for reading this far. Thank you for your attention and good luck with your Natural Language Processing projects! 😉
