Practical Introduction to Transformer Models: BERT

Photo by Alex Padurariu on Unsplash

Hands-on tutorial on how to build your first sentiment analysis model using BERT

Preface: This article presents a summary of information about the given topic. It should not be considered original research. The information and code included in this article may have been influenced by things I have read or seen in the past in various online articles, research papers, books, and open-source code.

Table of Contents

Introduction to BERT
Pre-Training and Fine-Tuning
Hands On: Using BERT for sentiment analysis
Interpreting Results
Closing Thoughts

In NLP, the transformer architecture has been a revolutionary advance that greatly enhanced the ability of models to understand and generate text.

In this tutorial, we are going to dig deep into BERT, a well-known transformer-based model, and provide a hands-on example of fine-tuning the base BERT model for sentiment analysis.

Introduction to BERT

BERT, introduced by researchers at Google in 2018 [1], is a powerful language model built on the transformer architecture [2]. Going beyond earlier architectures such as LSTM and GRU, which were either unidirectional or sequentially bidirectional, BERT considers context from both the left and the right of each word simultaneously. This is made possible by the attention mechanism, which lets the model weigh the importance of every other word in a sentence when building the representation of each word.
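
To make the idea of weighing words concrete, here is a minimal sketch of scaled dot-product self-attention as described in [2], written in plain Python. The toy vectors and the function names are illustrative assumptions. Real BERT applies learned query, key, and value projections and uses multiple heads, all of which this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (toy setup).

    Every position attends to every position, left and right, at once.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(d)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in vectors]
        weights = softmax(scores)
        # Output is the attention-weighted average of all vectors
        out = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical 2-dimensional token embeddings
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence, including those that come after it, which is what "bidirectional" means in practice.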

The BERT model is pre-trained on the following two NLP tasks:

Masked Language Model (MLM)
Next Sentence Prediction (NSP)

and is generally used as the base model for various downstream NLP tasks, such as sentiment analysis, which we will cover in this tutorial.

Pre-Training and Fine-Tuning

The power of BERT comes from its two-step process:

Pre-training is the phase where BERT is trained on large amounts of unlabeled text. It learns to predict masked words in a sentence (MLM task) and to predict whether one sentence follows another (NSP task). The output of this stage is a pre-trained NLP model with a general-purpose “understanding” of the language.

Fine-tuning is where the pre-trained BERT model is further trained on a specific task. The model is initialized with the pre-trained parameters, and the entire model is trained on a downstream task, allowing BERT to adapt its understanding of language to the specifics of the task at hand.
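
As a rough illustration of the MLM objective, the sketch below corrupts a token sequence using the masking recipe described in [1]: roughly 15% of positions are selected, and of those 80% become [MASK], 10% become a random token, and 10% are left unchanged. The tiny vocabulary, the example sentence, and the helper name mask_tokens are hypothetical. Selecting each position independently is also a simplification, and the real pipeline works on WordPiece token IDs rather than whole words.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for a toy masked-language-model example.

    labels[i] holds the original token at selected positions and None elsewhere,
    so a loss would only be computed where the model has to make a prediction.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")           # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels

vocab = ["the", "movie", "was", "great", "boring", "plot", "acting"]
sentence = "the movie was great and the acting was great too".split()
# A higher mask_prob than BERT's 0.15 so that a short sentence shows some masking
corrupted, labels = mask_tokens(sentence, vocab, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```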

Hands On: Using BERT for sentiment analysis

The complete code is available as a Jupyter Notebook on GitHub.

In this hands-on exercise, we will train the sentiment analysis model on the IMDB movie reviews dataset [4] (license: Apache 2.0), in which each review is labeled as positive or negative. We will load the model using Hugging Face’s transformers library [3].

Let’s load all the libraries:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve, auc
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# Variables to set the number of epochs and samples
num_epochs = 10
num_samples = 100 # set this to -1 to use all data

First, we need to load the dataset and the model tokenizer.

# Step 1: Load dataset and model tokenizer
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Next, we’ll create a plot to see the distribution of the positive and negative classes.

# Data Exploration
train_df = pd.DataFrame(dataset["train"])
sns.countplot(x="label", data=train_df)
plt.title("Class distribution")
plt.show()

Fig 1. Class distribution of the training dataset

Next, we preprocess our dataset by tokenizing the texts. We use BERT’s tokenizer, which will convert the text into tokens that correspond to BERT’s vocabulary.

# Step 2: Preprocess the dataset
def tokenize_function(examples):
    # Pad or truncate every review to the model's maximum input length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
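
BERT's tokenizer is based on WordPiece, which splits words that are missing from the vocabulary into smaller known pieces. The following is a simplified sketch of the greedy longest-match-first splitting step. The five-entry vocabulary and the function name are made up for illustration, and the real tokenizer also handles lowercasing, punctuation, and special tokens such as [CLS] and [SEP].

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"enjoy", "##ed", "##able", "un", "##believ"}
print(wordpiece("enjoyed", toy_vocab))       # ['enjoy', '##ed']
print(wordpiece("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```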

After that, we prepare our training and evaluation datasets. Remember, if you want to use all the data, you can set the num_samples variable to -1.

if num_samples == -1:
    # Use the full train and test splits
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)
else:
    # Use a random subset of num_samples examples from each split
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(num_samples))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(num_samples))
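
Shuffling before selecting matters whenever a split is stored in label order, because taking the first few rows would then yield a single class. The sketch below illustrates this with plain Python lists. The label layout and the helper name take_subset are assumptions made for the example, not a description of how the datasets library works internally.

```python
import random

def take_subset(labels, n, shuffle, seed=42):
    """Pick the first n example indices, optionally after shuffling, and count labels."""
    indices = list(range(len(labels)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    counts = {0: 0, 1: 0}
    for i in indices[:n]:
        counts[labels[i]] += 1
    return counts

# Hypothetical split stored sorted by label: all negatives, then all positives
labels = [0] * 500 + [1] * 500
print(take_subset(labels, 100, shuffle=False))  # only one class is present
print(take_subset(labels, 100, shuffle=True))   # both classes are present
```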

Then, we load the pre-trained BERT model. We’ll use the AutoModelForSequenceClassification class, which loads BERT with a classification head on top.

For this tutorial, we use the ‘bert-base-uncased’ version of BERT, which is trained on lower-case English text.

# Step 3: Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Now, we’re ready to define our training arguments and create a Trainer instance to train our model.

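# Step 4: Define training arguments
# Note: newer transformers releases rename evaluation_strategy to eval_strategy
# and no_cuda to use_cpu; adjust the names to match your installed version.
training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch", no_cuda=True, num_train_epochs=num_epochs)

# Step 5: Create Trainer instance and train
trainer = Trainer(
    model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset
)
trainer.train()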

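Without a metrics function, the Trainer reports only the loss at each evaluation. A minimal sketch of one is shown here. It assumes the function receives a (logits, labels) pair, which is how the Trainer's evaluation predictions unpack, and the function body is an illustration rather than part of the original notebook.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from raw logits and integer labels."""
    logits, labels = eval_pred
    # Predicted class is the index of the largest logit in each row
    predicted = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return {"accuracy": correct / len(predicted)}

# Toy check with hand-written logits for four reviews
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.7]]
toy_labels = [0, 1, 0, 0]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

It could then be passed to the Trainer through its compute_metrics argument, so that accuracy is logged alongside the loss after every epoch.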

Interpreting Results

Having trained our model, let’s evaluate it. We’ll calculate the confusion matrix and the ROC curve to understand how well our model performs.

# Step 6: Evaluation
predictions = trainer.predict(small_eval_dataset)

# Confusion matrix
cm = confusion_matrix(small_eval_dataset["label"], predictions.predictions.argmax(-1))
sns.heatmap(cm, annot=True, fmt="d")
plt.title("Confusion Matrix")
plt.show()

# ROC Curve, using the positive-class logit as the score
fpr, tpr, _ = roc_curve(small_eval_dataset["label"], predictions.predictions[:, 1])
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(1.618 * 5, 5))
plt.plot(fpr, tpr, color="darkorange", lw=2, label="ROC curve (area = %0.2f)" % roc_auc)
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()

Fig 2. Confusion Matrix

Fig 3. ROC curve

The confusion matrix gives a detailed breakdown of how our predictions measure up to the actual labels, while the ROC curve shows us the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) at various threshold settings.
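
To show how the two views relate, the sketch below computes the four confusion-matrix counts at a single threshold and turns them into one point on the ROC curve. The labels, scores, and function names are hypothetical; scikit-learn's roc_curve effectively repeats this calculation across all relevant thresholds.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, TN, FN when scores at or above threshold are called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return fpr, tpr

# Hypothetical positive-class scores for six reviews
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
for threshold in (0.2, 0.5, 0.7):
    print(threshold, roc_point(labels, scores, threshold))
```

Lowering the threshold catches more true positives but also admits more false positives, which is the trade-off the curve traces out.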

Finally, to see our model in action, let’s use it to infer the sentiment of a sample text.

# Step 7: Inference on a new sample
sample_text = "This is a fantastic movie. I really enjoyed it."
sample_inputs = tokenizer(sample_text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

# Move inputs to the same device as the model (GPU if available)
sample_inputs = sample_inputs.to(model.device)

# Make prediction
predictions = model(**sample_inputs)
predicted_class = predictions.logits.argmax(-1).item()

if predicted_class == 1:
    print("Positive sentiment")
else:
    print("Negative sentiment")
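
The argmax above yields only a class label. If a confidence score is also wanted, the logits can be passed through a softmax. The sketch below does this in plain Python with made-up logit values; with the real model, the two numbers would come from the logits of the prediction instead.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [negative, positive]
example_logits = [-1.2, 2.3]
probs = softmax(example_logits)
label = "Positive" if probs[1] > probs[0] else "Negative"
print(f"{label} sentiment (confidence {max(probs):.2f})")
```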

Closing Thoughts

By walking through an example of sentiment analysis on IMDb movie reviews, I hope you’ve gained a clear understanding of how to apply BERT to real-world NLP problems. The Python code I’ve included here can be adjusted and extended to tackle different tasks and datasets, paving the way for even more sophisticated and accurate language models.


[1] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805

[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).

[3] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., … & Rush, A. M. (2019). Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

[4] Lhoest, Q., Villanova del Moral, A., Jernite, Y., Thakur, A., von Platen, P., Patil, S., Chaumond, J., Drame, M., Plu, J., Tunstall, L., Davison, J., Šaško, M., Chhablani, G., Malik, B., Brandeis, S., Le Scao, T., Sanh, V., Xu, C., Patry, N., McMillan-Major, A., Schmid, P., Gugger, S., Delangue, C., Matussière, T., Debut, L., Bekman, S., Cistac, P., Goehringer, T., Mustar, V., Lagunas, F., Rush, A., & Wolf, T. (2021). Datasets: A Community Library for Natural Language Processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 175–184). Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. Retrieved from

Thanks for reading. If you have any feedback, please feel free to reach out by commenting on this post, messaging me on LinkedIn, or shooting me an email (smhkapadia[at]

If you enjoyed this article, visit my other articles

Domain Adaption: Fine-Tune Pre-Trained NLP Models
The Evolution of Natural Language Processing
Recommendation System in Python: LightFM
Evaluate Topic Models: Latent Dirichlet Allocation (LDA)

Practical Introduction to Transformer Models: BERT was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

