Monitoring unstructured data for LLM and NLP

A code tutorial on using text descriptors

Image by Author.

Once you deploy an NLP or LLM-based solution, you need a way to keep tabs on it. But how do you monitor unstructured data to make sense of the pile of texts?

There are a few approaches here, from detecting drift in raw text data and embedding drift to using regular expressions to run rule-based checks.

In this tutorial, we’ll dive into one particular approach — tracking interpretable text descriptors that help assign specific properties to every text.

First, we’ll cover some theory:

- What is a text descriptor, and when to use them.
- Examples of text descriptors.
- How to select custom descriptors.

Next, get to code! You will work with e-commerce review data and go through the following steps:

- Get an overview of the text data.
- Evaluate text data drift using standard descriptors.
- Add a custom text descriptor using an external pre-trained model.
- Implement pipeline tests to monitor data changes.

We will use the Evidently open-source Python library to generate text descriptors and evaluate changes in the data.

Code example: If you prefer to go straight to the code, here is the example notebook.

What is a text descriptor?

A text descriptor is any feature or property that describes objects in the text dataset. For example, the length of texts or the number of symbols in them.

You might already have helpful metadata to accompany your texts that will serve as descriptors. For example, e-commerce user reviews might come with user-assigned ratings or topic labels.

Otherwise, you can generate your own descriptors! You do this by adding “virtual features” to your text data. Each helps describe or classify your texts using some meaningful criteria.

Image by Author.

By creating these descriptors, you basically come up with your own simple “embedding” and map each text to several interpretable dimensions. This helps make sense of the otherwise unstructured data.

You can then use these text descriptors:

- To monitor production NLP models. You can track the properties of your data in time and detect when they change. For example, descriptors help detect text length spikes or drift in sentiment.
- To test models during updates. When you iterate on models, you can compare the properties of the evaluation datasets and model responses. For example, you can check that the lengths of the LLM-generated answers remain similar, and they consistently include words you expect to see.
- To debug data drift or model decay. If you detect embedding drift or directly observe a drop in the model quality, you can use text descriptors to explore where it comes from.

Examples of text descriptors

Here are a few text descriptors we consider good defaults:

Text length

Image by Author. Text overview metric visualization using the Evidently Python library.

An excellent place to start is simple text statistics. For example, you can look at the length of texts measured in words, symbols, or sentences. You can evaluate average and min-max length and look at distributions.

You can set expectations based on your use case. Say, product reviews tend to be between 5 and 100 words. If they are shorter or longer, this might signal a change in context. If there is a spike in fixed-length reviews, this might signal a spam attack. If you know that negative reviews are often longer, you can track the share of reviews above a certain length.

There are also quick sanity checks: if you run a chatbot, you might expect non-zero responses or that there is some minimum length for the meaningful output.
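
To make the idea concrete, here is a minimal sketch of length descriptors in plain Python. It is not how Evidently computes them: the sentence splitting is deliberately naive, and the helper name, the sample reviews, and the 5 to 100 word range are assumptions for illustration.

```python
import re

def length_descriptors(text: str) -> dict:
    """Return simple length statistics for one text."""
    words = text.split()
    # Naive sentence split: ., ! or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    return {"symbols": len(text), "words": len(words), "sentences": len(sentences)}

# Hypothetical sample reviews, including an empty one.
reviews = [
    "Love this dress! Fits perfectly.",
    "Too small.",
    "",
]
stats = [length_descriptors(r) for r in reviews]
word_counts = [s["words"] for s in stats]

# Share of texts outside the (assumed) expected range of 5 to 100 words.
MIN_WORDS, MAX_WORDS = 5, 100
out_of_range = sum(not (MIN_WORDS <= n <= MAX_WORDS) for n in word_counts)
out_of_range_share = out_of_range / len(word_counts)

print(stats[0])
print(out_of_range_share)
```

Tracking a share like this over time is one way to turn the "reviews tend to be between 5 and 100 words" expectation into a number you can alert on.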

Out-of-vocabulary words

Image by Author. The mean OOV share in the example reference dataset is 5.378%.

Evaluating the share of words outside the defined vocabulary is a good “crude” measure of data quality. Did your users start writing reviews in a new language? Are users talking to your chatbot in Python, not English? Are users filling the responses with “ggg” instead of actual words?

This is a single practical measure to detect all sorts of changes. Once you catch a shift, you can then debug deeper.

You can shape expectations about the share of OOV words based on the examples from “good” production data accumulated over time. For example, if you look at the corpus of previous product reviews, you might expect OOV to be under 10% and monitor if the value goes above this threshold.
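
As a rough illustration, the OOV share can be sketched in a few lines of plain Python. The toy vocabulary and the simple tokenizer below are assumptions; in practice you would use a full word list, and a library implementation may tokenize differently.

```python
import re

def oov_share(text: str, vocabulary: set) -> float:
    """Percentage of word tokens in `text` that are not in `vocabulary`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    unknown = sum(token not in vocabulary for token in tokens)
    return 100.0 * unknown / len(tokens)

# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"this", "dress", "is", "very", "nice", "i", "love", "it"}

print(oov_share("This dress is very nice", VOCAB))
print(oov_share("ggg ggg dress", VOCAB))
```

The second text is mostly keyboard noise, so two of its three tokens fall outside the vocabulary.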

Non-letter characters

Related, but with a twist: this descriptor will count all sorts of special symbols that are not letters or numbers, including commas, brackets, hashes, etc.

Sometimes you expect a fair share of special symbols: your texts might contain code or be structured as a JSON. Sometimes, you only expect punctuation marks in human-readable text.

Detecting a shift in non-letter characters can expose data quality issues, like HTML codes leaking into the texts of the reviews, spam attacks, unexpected use cases, etc.
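
A minimal sketch of this descriptor, following the definition above (anything that is not a letter or a number). Excluding whitespace from the count is an assumption made here; a library implementation may draw the line differently.

```python
def non_letter_share(text: str) -> float:
    """Percentage of characters that are not letters, digits, or whitespace."""
    if not text:
        return 0.0
    special = sum(not (ch.isalnum() or ch.isspace()) for ch in text)
    return 100.0 * special / len(text)

# A regular review versus one with leaked HTML tags.
print(round(non_letter_share("Great dress, love it!"), 2))
print(round(non_letter_share("<br/>Nice<br/>"), 2))
```

The leaked markup pushes the share far above what ordinary punctuation produces, which is the kind of jump this descriptor is meant to surface.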


Sentiment

Image by Author. Baseline “sentiment” distribution in the example e-commerce reviews dataset.

Text sentiment is another indicator. It is helpful in various scenarios: from chatbot conversations to user reviews and writing marketing copy. You can typically set an expectation about the sentiment of the texts you deal with.

Even if the sentiment “does not apply,” this might translate to the expectation of a primarily neutral tone. The potential appearance of either a negative or positive tone is worth tracking and looking into. It might indicate unexpected usage scenarios: is the user using your virtual mortgage advisor as a complaint channel?

You might also expect a certain balance: for example, there is always a share of conversations or reviews with a negative tone, but you’d expect it not to exceed a certain threshold or the overall distribution of review sentiment to remain stable.

Trigger words

Image by Author. Reviews consistently mention dresses: no distribution drift is detected.

You can also check whether the texts contain words from a specific list or lists and treat this as a binary feature.

This is a powerful way to encode multiple expectations about your texts. You need some effort to curate lists manually, but you can design many handy checks this way. For example, you can create lists of trigger words like:

- Mentions of products or brands.
- Mentions of competitors.
- Mentions of locations, cities, places, etc.
- Mentions of words that represent particular topics.

You can curate (and continuously extend) lists like this that are specific to your use case.

For example, if an advisor chatbot helps choose between products offered by the company, you might expect most of the responses to contain the names of one of the products from the list.
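
Here is a minimal sketch of a trigger-word check as a binary feature. It matches whole words exactly and ignores case; it does not lemmatize, so "dresses" would not match "dress". The function name and the sample texts are assumptions for illustration.

```python
import re

def contains_trigger_word(text: str, words: list) -> bool:
    """True if `text` contains any of `words` as a whole word (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

DRESS_WORDS = ["dress", "gown"]
texts = ["This GOWN is lovely", "Nice shirt", "I dressed up for the party"]

flags = [contains_trigger_word(t, DRESS_WORDS) for t in texts]
share_with_trigger = sum(flags) / len(flags)
print(flags, share_with_trigger)
```

Because the result is a 0/1 feature, you can monitor its share over time or compare its distribution between two datasets like any other categorical column.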

RegExp matches

The inclusion of specific words from the list is one example of a pattern you can formulate as a regular expression. You can come up with others: do you expect your texts to start with “hello” and end with “thank you”? Include emails? Contain known named elements?

If you expect the model inputs or outputs to match a specific format, you can use regular expression match as another descriptor.
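
A small sketch of regular-expression descriptors using Python's built-in re module. Both patterns are simplified assumptions: the email pattern is far from a complete validator, and the greeting pattern only encodes the "starts with hello, ends with thank you" example.

```python
import re

# Simplified patterns, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
GREETING_RE = re.compile(
    r"^\s*hello\b.*\bthank you[.!]?\s*$", re.IGNORECASE | re.DOTALL
)

def regex_descriptors(text: str) -> dict:
    """Map one text to two binary, regex-based descriptors."""
    return {
        "has_email": bool(EMAIL_RE.search(text)),
        "greeting_format": bool(GREETING_RE.match(text)),
    }

print(regex_descriptors("Hello! Your order has shipped. Thank you."))
print(regex_descriptors("Contact me at jane.doe@example.com"))
```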

Custom descriptors

You can extend this idea further. For example:

- Evaluate other text properties: toxicity, subjectivity, the formality of the tone, readability score, etc. You can often find open pre-trained models to do the trick.
- Count specific components: emails, URLs, emojis, dates, and parts of speech. You can use external models or even simple regular expressions.
- Get granular with stats: you can track very detailed text statistics if they are meaningful to your use case, e.g., track average lengths of words, whether they are upper or lower case, the ratio of unique words, etc.
- Monitor personally identifiable information: for example, when you do not expect it to come up in chatbot conversations.
- Use named entity recognition: to extract specific entities and treat them as tags.
- Use topic modeling to build a topic monitoring system. This is the most laborious approach but powerful when done right. It is useful when you expect the texts to stay mostly on-topic and have the corpus of previous examples to train the model. You can use unsupervised topic clustering and create a model to assign new texts to known clusters. You can then treat assigned classes as descriptors to monitor the changes in the distribution of topics in the new data.

Image by Author. Example of a summary drift report for multiple descriptors.

Here are a few things to keep in mind when designing descriptors to monitor:

- It is best to stay focused and try to find a small number of suitable quality indicators that match the use case rather than monitor all possible dimensions. Think of descriptors as model features. You want to find a few strong ones rather than generate a lot of weak or unhelpful features. Many of them are bound to be correlated: language and share of OOV words, length in sentences and symbols, etc. Pick your favorite!
- Use exploratory data analysis to evaluate text properties in existing data (for example, logs of previous conversations) to test your assumptions before adding them to model monitoring.
- Learn from model failures. Whenever you face an issue with production model quality that you expect to reappear (e.g., texts in a foreign language), consider how to develop a test case or a descriptor to add to detect it in the future.
- Mind the computation cost. Using external models to score your texts by every possible dimension is tempting, but this comes at a cost. Consider it when working with larger datasets: every external classifier is an extra model to run. You can often get away with fewer or simpler checks.

Step-by-step tutorial

To illustrate the idea, let’s walk through the following scenario: you are building a classifier model to score reviews that users leave on an e-commerce website and tag them by topic. Once it is in production, you want to detect changes in the data and model environment, but you do not have the true labels. You need to run a separate labeling process to get them.

How can you keep tabs on the changes without the labels?

Let’s take an example dataset and go through the following steps:

Code example: head to the example notebook to follow all the steps.

💻 1. Install Evidently

First, install Evidently. Use the Python package manager to install it in your environment. If you are working in Colab, run !pip install evidently. In the Jupyter Notebook, you should also install nbextension. Check out the instructions for your environment.

You will also need to import a few other libraries like pandas and specific Evidently components. Follow the instructions in the notebook.

🔡 2. Prepare the data

Once you have it all set, let’s look at the data! You will work with an open dataset from e-commerce reviews.

Here is how the dataset looks:

Image by Author.

We’ll focus on the “Review_Text” column for demo purposes. In production, we want to monitor changes in the texts of the reviews.

You will need to specify the column that contains texts using column mapping:

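from evidently import ColumnMapping

# Tell Evidently which columns are numerical, categorical, and raw text
column_mapping = ColumnMapping(
    numerical_features=['Age', 'Positive_Feedback_Count'],
    categorical_features=['Division_Name', 'Department_Name', 'Class_Name'],
    text_features=['Review_Text', 'Title'],
)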

You should also split the data into two: reference and current. Imagine that “reference” data is the data for some representative past period (e.g., previous month) and “current” is the current production data (e.g., this month). These are the two datasets that you will compare using descriptors.

Note: it’s important to establish a suitable historical baseline. Pick the period that reflects your expectations about how the data should look in the future.

We selected 5000 examples for each sample. To make things interesting, we introduced an artificial shift by selecting the negative reviews for our current dataset.

reviews_ref = reviews[reviews.Rating > 3].sample(n=5000, replace=True, ignore_index=True, random_state=42)
reviews_cur = reviews[reviews.Rating < 3].sample(n=5000, replace=True, ignore_index=True, random_state=42)

📊 3. Exploratory data analysis

To better understand the data, you can generate a visual report using Evidently. There is a pre-built Text Overview Preset that helps quickly compare two text datasets. It combines various descriptive checks and evaluates overall data drift (in this case, using a model-based drift detection method).

This report also includes a few standard descriptors and allows you to add descriptors using lists of Trigger Words. We’ll look at the following descriptors as part of the report:

- Length of texts
- Share of OOV words
- Share of Non-letter symbols
- The sentiment of the reviews
- Reviews that include either words “dress” or “gown”
- Reviews that include either words “blouse” or “shirt”

Check out the Evidently docs on Descriptors for details.

Here is the code you need to run this report. You can assign custom names to each descriptor.

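# Import paths assume the Report-based Evidently API used in this tutorial
from evidently.report import Report
from evidently.metric_preset import TextOverviewPreset
from evidently.descriptors import (OOV, NonLetterCharacterPercentage, TextLength,
    SentenceCount, WordCount, Sentiment, TriggerWordsPresence)

text_overview_report = Report(metrics=[
    TextOverviewPreset(column_name="Review_Text", descriptors={
        "Review texts - OOV %": OOV(),
        "Review texts - Non Letter %": NonLetterCharacterPercentage(),
        "Review texts - Symbol Length": TextLength(),
        "Review texts - Sentence Count": SentenceCount(),
        "Review texts - Word Count": WordCount(),
        "Review texts - Sentiment": Sentiment(),
        "Reviews about Dress": TriggerWordsPresence(words_list=['dress', 'gown']),
        "Reviews about Blouses": TriggerWordsPresence(words_list=['blouse', 'shirt']),
    })
])
# Compare the current data against the reference data
text_overview_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
text_overview_report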

Running a report like this helps explore patterns and shape your expectations about particular properties, such as text length distribution.

The distribution of the “sentiment” descriptor quickly exposes the trick we did when splitting the data. We put reviews with a rating above 3 in the “reference” dataset and reviews with a rating below 3 in the “current” dataset. The results are visible:

Image by Author.

The default report is very comprehensive and helps look at many text properties at once. It even lets you explore correlations between descriptors and other columns in the dataset!

You can use it during the exploratory phase, but this is probably not something you’d need to go through all the time.

Luckily, it’s easy to customize.

Evidently Presets and Metrics. Evidently has report presets that quickly generate the reports out of the box. However, there are a lot of individual metrics to choose from! You can combine them to create a custom report. Browse the presets and metrics to understand what’s there.

📈 4. Monitor descriptors drift

Let’s say that based on exploratory analysis and your understanding of the business problem, you decide to track only a small number of properties: the length of the reviews in words, their sentiment, and whether they mention dresses.

You want to notice when there is a statistical change: the distributions of these properties differ from the reference period. To detect it, you can use drift detection methods implemented in Evidently. For example, for numerical features like “sentiment,” it will, by default, monitor the shift using Wasserstein distance. You can also choose a different method.
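
To show what a Wasserstein-based drift check measures, here is a simplified sketch in plain Python. It is not Evidently's implementation. It assumes both samples have the same size (as in this tutorial, with 5000 rows each), and both the normalization by the reference standard deviation and the 0.1 threshold are assumptions chosen for illustration.

```python
from statistics import pstdev

def wasserstein_1d(reference, current):
    """1-D Wasserstein (earth mover's) distance for two equal-size samples."""
    if len(reference) != len(current):
        raise ValueError("this simplified version needs equal-size samples")
    # For equal-size samples, the distance is the mean gap between sorted values.
    pairs = zip(sorted(reference), sorted(current))
    return sum(abs(r - c) for r, c in pairs) / len(reference)

def normalized_drift_score(reference, current):
    """Distance expressed in units of the reference standard deviation."""
    scale = pstdev(reference)
    distance = wasserstein_1d(reference, current)
    return distance / scale if scale > 0 else distance

# Hypothetical sentiment scores for a reference and a current batch.
ref_sentiment = [0.6, 0.8, 0.7, 0.9, 0.5]
cur_sentiment = [-0.2, 0.1, 0.3, -0.4, 0.0]

DRIFT_THRESHOLD = 0.1  # assumed threshold, tune for your data
score = normalized_drift_score(ref_sentiment, cur_sentiment)
print(score, score > DRIFT_THRESHOLD)
```

Normalizing by the reference spread makes the score comparable across descriptors with different scales, such as word counts and sentiment scores.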

Here is how you can create a simple drift report to track changes in the three descriptors.

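from evidently.metrics import ColumnDriftMetric

descriptors_report = Report(metrics=[
    ColumnDriftMetric(WordCount().for_column("Review_Text")),
    ColumnDriftMetric(Sentiment().for_column("Review_Text")),
    ColumnDriftMetric(TriggerWordsPresence(words_list=['dress', 'gown']).for_column("Review_Text")),
])
descriptors_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)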

Once you run the report, you will get combined visualizations for all chosen descriptors. Here is one:

Image by Author.

The dark green line is the mean sentiment in the reference dataset. The green area covers one standard deviation from the mean. You can notice that the current distribution (in red) is visibly more negative.

Note: In this scenario, it also makes sense to monitor the output drift: by tracking shifts in the predicted classes. You can use categorical data drift detection methods, like JS divergence. We do not cover this in the tutorial, as we focus only on inputs and do not generate predictions. In practice, prediction drift is often the first signal to react to.
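
Although the tutorial does not generate predictions, the idea behind a JS-based check on predicted classes can be sketched in plain Python. The topic labels below are hypothetical, and this computes the Jensen-Shannon divergence with a base-2 logarithm; a library may report a related quantity (such as its square root) and apply its own threshold.

```python
from collections import Counter
from math import log2

def js_divergence(reference_labels, current_labels):
    """Jensen-Shannon divergence (base 2) between two label distributions."""
    categories = sorted(set(reference_labels) | set(current_labels))
    ref_counts, cur_counts = Counter(reference_labels), Counter(current_labels)
    p = [ref_counts[c] / len(reference_labels) for c in categories]
    q = [cur_counts[c] / len(current_labels) for c in categories]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with zero probability contribute nothing to the sum.
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical predicted topic tags for two periods.
ref_topics = ["fit", "fit", "quality", "price"]
cur_topics = ["quality", "quality", "quality", "price"]

print(js_divergence(ref_topics, ref_topics))
print(js_divergence(ref_topics, cur_topics))
```

With a base-2 logarithm the value ranges from 0 (identical distributions) to 1 (no overlap at all).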

😍 5. Add an “emotion” descriptor

Let’s say you decided to track one more meaningful property: the emotion expressed in the review. The overall sentiment is one thing, but it also helps distinguish between “sad” and “angry” reviews, for example.

Let’s add this custom descriptor! You can find an appropriate external open-source model to score your dataset. Then, you will work with this property as an additional column.

We will take the DistilBERT model from Hugging Face, which classifies the text by six emotions: sadness, joy, love, anger, fear, and surprise.

You can consider using any other model for your use case, such as named entity recognition, language detection, toxicity detection, etc.

You must install transformers to be able to run the model. Check the instructions for more details. Then, apply it to the review dataset:

from transformers import pipeline

# Load the pre-trained emotion classifier and keep only the top label per text
classifier = pipeline("text-classification", model='bhadresh-savani/distilbert-base-uncased-emotion', top_k=1)
prediction = classifier("I love using evidently! It's easy to use")
print(prediction)

Note: this step will score the dataset using the external model. It will take some time to execute, depending on your environment. To understand the principle without waiting, refer to the “Simple Example” section in the example notebook.
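
The notebook contains the actual scoring code. As a rough sketch of the principle, the helper below applies any classifier to a list of texts in batches and keeps the top label for each. The helper names are hypothetical, and a stand-in classifier is used so that the example runs without downloading a model; its output mimics the nested list-of-dicts shape that a text-classification pipeline with top_k=1 is assumed to return.

```python
def top_label(result):
    """Extract the label from one classifier result.

    Accepts either a single {"label", "score"} dict or a list of such dicts
    (in which case the first, highest-ranked entry is used).
    """
    if isinstance(result, list):
        result = result[0]
    return result["label"]

def score_emotions(texts, classifier, batch_size=32):
    """Return one predicted label per text, scoring the texts in batches."""
    labels = []
    for start in range(0, len(texts), batch_size):
        batch = [str(t) for t in texts[start:start + batch_size]]
        labels.extend(top_label(r) for r in classifier(batch))
    return labels

def fake_classifier(batch):
    """Stand-in for the real model: a hypothetical keyword rule."""
    return [
        [{"label": "joy" if "love" in t.lower() else "sadness", "score": 1.0}]
        for t in batch
    ]

emotions = score_emotions(["I love it", "It fell apart"], fake_classifier, batch_size=1)
print(emotions)
```

With the real model, you would pass the classifier created above instead of the stand-in and store the returned labels in a new “emotion” column for both the reference and current datasets.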

After you add the new column “emotion” to the dataset, you must reflect this in Column Mapping. You should specify that it is a new categorical variable in the dataset.

column_mapping = ColumnMapping(
    numerical_features=['Age', 'Positive_Feedback_Count'],
    categorical_features=['Division_Name', 'Department_Name', 'Class_Name', 'emotion'],
    text_features=['Review_Text', 'Title'],
)

Now, you can add the “emotion” distribution drift monitoring to the Report.

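descriptors_report = Report(metrics=[
    ColumnDriftMetric(WordCount().for_column("Review_Text")),
    ColumnDriftMetric(Sentiment().for_column("Review_Text")),
    ColumnDriftMetric(TriggerWordsPresence(words_list=['dress', 'gown']).for_column("Review_Text")),
    ColumnDriftMetric('emotion'),
])
descriptors_report.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)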

Here is what you get!

Image by Author.

You can see a significant increase in “sad” reviews and a decrease in “joy.”

Does it appear helpful to track over time? You can continue running this check by scoring new data as it comes.

🏗️ 6. Run pipeline tests

To perform regular analysis of your data inputs, it makes sense to package the evaluations as tests. You get a clear “pass” or “fail” result in this scenario. You probably do not need to look at the plots if all tests pass. You’re only interested when things change!

Evidently has an alternative interface called Test Suite that works this way.

Here is how you create a Test Suite to check for statistical distribution drift in the same four descriptors:

from evidently.test_suite import TestSuite
from evidently.tests import TestColumnDrift

descriptors_test_suite = TestSuite(tests=[
    TestColumnDrift(column_name='emotion'),
    TestColumnDrift(column_name=WordCount().for_column("Review_Text")),
    TestColumnDrift(column_name=Sentiment().for_column("Review_Text")),
    TestColumnDrift(column_name=TriggerWordsPresence(words_list=['dress', 'gown']).for_column("Review_Text")),
])
descriptors_test_suite.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)
descriptors_test_suite

Note: we go with defaults, but you can also set custom drift methods and conditions.

Here is the result. The output is neatly structured so you can see which descriptors have drifted.

Image by Author.

Detecting statistical distribution drift is one of the ways to monitor changes in the text property. There are others! Sometimes, it is convenient to run rule-based expectations on the descriptor’s min, max, or mean values.

Let’s say you want to check that all review texts contain at least two words. If at least one review is shorter than two words, you want the test to fail and see the number of short texts in the response.

Here is how you do that! You can pick a TestNumberOfOutRangeValues() check. This time, you should set a custom boundary: the “left” side of the expected range is two words. You must also set a test condition: eq=0. This means you expect the number of objects outside this range to be 0. If it is higher, you want the test to return a fail.

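from evidently.tests import TestNumberOfOutRangeValues

descriptors_test_suite = TestSuite(tests=[
    # Fail if any review has a word count below the left boundary of 2
    TestNumberOfOutRangeValues(column_name=WordCount().for_column("Review_Text"), left=2, eq=0),
])
descriptors_test_suite.run(reference_data=reviews_ref, current_data=reviews_cur, column_mapping=column_mapping)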

Here is the result. You can also see the test details that show the defined expectation.

Image by Author.

You can follow this principle to design other checks.
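
To show the logic behind such a rule-based check outside of any library, here is a minimal plain-Python sketch. The function names and the sample word counts are hypothetical; the check counts values outside an expected range and compares the count with an expected value, returning a pass or fail status.

```python
def count_out_of_range(values, left=None, right=None):
    """Count values below `left` or above `right` (either bound is optional)."""
    def outside(v):
        return (left is not None and v < left) or (right is not None and v > right)
    return sum(outside(v) for v in values)

def run_check(name, observed, expected=0):
    """Return a small pass/fail record for one rule-based expectation."""
    status = "PASS" if observed == expected else "FAIL"
    return {"test": name, "observed": observed, "status": status}

# Hypothetical word counts for a batch of reviews.
word_counts = [12, 45, 1, 8, 0]

result = run_check("word count >= 2", count_out_of_range(word_counts, left=2))
print(result)
```

The same pattern extends to other expectations, for example an upper bound on length or a maximum share of texts with a negative tone.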

Support Evidently

Enjoyed the tutorial? Star Evidently on GitHub to contribute back! This helps us continue creating free, open-source tools and content for the community.

Summing up

Text descriptors map text data to interpretable dimensions you can express as a numerical or a categorical attribute. They help describe, evaluate, and monitor unstructured data.

In this tutorial, you learned how to monitor text data using descriptors.

You can use this approach to monitor the behavior of NLP and LLM-powered models in production. You can customize and combine your descriptors with other methods, such as monitoring embedding drift.

Are there other descriptors you consider universally useful? Let us know! Join our Discord community to share your thoughts.

Originally published on June 27, 2023. Thanks to Olga Filippova for co-authoring the article.

Monitoring unstructured data for LLM and NLP was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

