Understanding the Pipeline Function in Transformers for Sentiment Analysis
Introduction
In this article, we delve into the inner workings of the pipeline function provided by the Transformers library, focusing specifically on the sentiment analysis capabilities it offers. By dissecting the process into its three fundamental stages—tokenization, model processing, and post-processing—we will gain a clearer understanding of how raw textual inputs are transformed into meaningful sentiment labels with associated scores. If you have been curious about how these complex models manage to evaluate sentiments accurately, you are in the right place.
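For reference, here is the two-line version that this article unpacks, using the high-level pipeline function with the same example sentences as in the tokenization step below; a minimal sketch, with illustrative output:
from transformers import pipeline
# High-level API whose internals this article dissects.
classifier = pipeline("sentiment-analysis")
print(classifier(["I love programming!", "I hate bugs."]))
# e.g. [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]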
The Three Stages of the Pipeline
The sentiment analysis pipeline consists of three critical stages:
- Tokenization - Transforming raw text into numerical representations.
- Model Processing - Utilizing pretrained models to derive outputs.
- Post-Processing - Converting model outputs into interpretable labels and scores.
Let’s explore each stage in detail.
Stage 1: Tokenization
The first step in the sentiment analysis pipeline is tokenization, which involves breaking down input text into manageable chunks.
How Tokenization Works
- Splitting the Text: The text is split into smaller units called tokens. These can be complete words, subwords, or punctuation marks.
- Adding Special Tokens: Some models expect certain special tokens. For instance, a CLS (classification) token is placed at the beginning of each input, and a SEP (separator) token is added at the end.
- Mapping Tokens to IDs: Each token is then matched to a unique ID from the model's vocabulary based on its pretrained configuration.
To facilitate the tokenization process, the Transformers library provides the AutoTokenizer API. Here’s how to set it up:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
Once the tokenizer is instantiated, we can feed it our sentences to process:
input_texts = ["I love programming!", "I hate bugs."]
tokenized_inputs = tokenizer(input_texts, padding=True, truncation=True, return_tensors='tf')
Output from Tokenization
The result will be a dictionary that includes:
- input_ids: The numerical IDs of each token, with padding applied where necessary.
- attention_mask: This indicates which parts of the input are padding and should be ignored by the model during processing.
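If you want to see these tokenization steps individually, the tokenizer exposes them as separate methods; here is a minimal sketch reusing the tokenizer instantiated above (printed values are illustrative):
tokens = tokenizer.tokenize("I love programming!")  # split the text into tokens
print(tokens)  # e.g. ['i', 'love', 'programming', '!']
ids = tokenizer.convert_tokens_to_ids(tokens)  # map each token to its vocabulary ID
print(ids)
# Calling the tokenizer directly, as above, additionally adds the CLS/SEP special tokens
# and builds the attention_mask.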
Stage 2: Model Processing
The next step in the pipeline is processing the tokenized inputs through the model, which effectively maps the token IDs to sentiment outputs.
Utilizing the Model
For this, we use the TFAutoModelForSequenceClassification class, which includes the classification head necessary for our specific task. The setup looks like this:
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
outputs = model(tokenized_inputs)
Here, the outputs will typically be logits—raw scores generated by the model. Each sentence will yield a vector of scores corresponding to each potential sentiment label.
Understanding the Model Output
The initial outputs are not probabilities; they are raw scores representing the model's sentiment predictions for each sentence:
- Each sentence will typically output scores across classes (e.g., positive and negative).
- These logits need to be transformed further before they can be interpreted (see the quick inspection below).
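As a quick sanity check, you can inspect those logits directly; a short sketch continuing from the code above:
print(outputs.logits.shape)  # (2, 2): one row per sentence, one column per label
print(outputs.logits)  # raw scores; note that they do not sum to 1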
Stage 3: Post-Processing
The final stage involves converting the logits into probabilities that can be easily understood and interpreted.
Applying SoftMax
To convert the logits into probabilities, we apply the SoftMax function:
import tensorflow as tf
probabilities = tf.nn.softmax(outputs.logits, axis=-1)
Now, we have probabilities for each sentiment, which sum up to 1, indicating the model's confidence in each class.
Mapping Probabilities to Labels
To determine the actual sentiment labels, we refer to the model configuration, which contains an id2label mapping:
- Typically, the first index corresponds to the negative sentiment, while the second corresponds to the positive sentiment.
label_map = model.config.id2label  # e.g. {0: 'NEGATIVE', 1: 'POSITIVE'}
predicted_labels = [label_map[int(i)] for i in tf.argmax(probabilities, axis=1)]
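To recover the familiar pipeline-style output, each predicted label can be paired with its probability; a minimal sketch of one way to assemble it:
scores = tf.reduce_max(probabilities, axis=1)  # highest probability per sentence
results = [{"label": label, "score": float(score)} for label, score in zip(predicted_labels, scores)]
print(results)  # e.g. [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]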
Conclusion
Understanding what happens inside the pipeline function of the Transformers library is crucial for effectively implementing sentiment analysis tasks. Through tokenization, model processing, and post-processing, we can convert raw text into interpretable sentiment scores. Now that you grasp the intricacies of this process, you can confidently experiment with it in your own NLP projects. Whether you're looking to analyze tweets, customer reviews, or any sequence of text, the pipeline function provides a robust framework for extracting sentiment insights.
Video Transcript
What happens inside the pipeline function? In this video, we will look at what actually happens when we use the pipeline function of the Transformers library. More specifically, we will look at the sentiment analysis pipeline, and
how it went from the two following sentences to the positive labels with their respective scores. As we have seen in the pipeline presentation, there are three stages in the pipeline. First, we convert the raw texts to numbers the model can make sense of, using a tokenizer.
Then, those numbers go through the model, which outputs logits. Finally, the post-processing step transforms those logits into labels and scores. Let's look in detail at those three steps, and how to replicate them using the Transformers library,
beginning with the first stage, tokenization. The tokenization process has several steps. First, the text is split into small chunks called tokens. They can be words, parts of words or punctuation symbols. Then the tokenizer will add some special tokens (if the model expects them). Here the model
expects a CLS token at the beginning and a SEP token at the end of the sentence to classify. Lastly, the tokenizer matches each token to its unique ID in the vocabulary of the pretrained model. To load such a tokenizer, the Transformers library provides the AutoTokenizer API.
The most important method of this class is from_pretrained, which will download and cache the configuration and the vocabulary associated with a given checkpoint. Here, the checkpoint used by default for the sentiment analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english.
We instantiate a tokenizer associated with that checkpoint, then feed it the two sentences. Since those two sentences are not of the same size, we will need to pad the shortest one to be able to build an array. This is done by the tokenizer with the option padding=True.
With truncation=True, we ensure that any sentence longer than the maximum the model can handle is truncated. Lastly, the return_tensors option tells the tokenizer to return a TensorFlow tensor. Looking at the result, we see we have a dictionary with two keys.
The input IDs key contains the IDs of both sentences, with 0s where the padding is applied. The second key, attention mask, indicates where padding has been applied, so the model does not pay attention to it. That is all there is to the tokenization step. Now let's have a look at the second step,
the model. As for the tokenizer, there is a TFAutoModel API, with a from_pretrained method. It will download and cache the configuration of the model as well as the pretrained weights. However, the TFAutoModel API will only instantiate the body of the model,
that is, the part of the model that is left once the pretraining head is removed. It will output a high-dimensional tensor that is a representation of the sentences passed, but which is not directly useful for our classification problem.
Here the tensor has two sentences, each of sixteen tokens, and the last dimension is the hidden size of our model, 768. To get an output linked to our classification problem, we need to use the TFAutoModelForSequenceClassification class. It works exactly like the TFAutoModel class,
except that it will build a model with a classification head. There is one auto class for each common NLP task in the Transformers library. Here, after giving our model the two sentences, we get a tensor of size two by two: one result for each sentence and for each possible label. Those
outputs are not probabilities yet (we can see they don't sum to 1). This is because each model of the Transformers library returns logits. To make sense of those logits, we need to dig into the third and last step of the pipeline: post-processing. To convert logits into probabilities, we need to
apply a SoftMax layer to them. As we can see, this transforms them into positive numbers that sum up to 1. The last step is to know which of those corresponds to the positive or the negative label. This is given by the id2label field of the model config. The first probabilities
(index 0) correspond to the negative label, and the second ones (index 1) correspond to the positive label. This is how our classifier built with the pipeline function picked those labels and computed those scores. Now that you know how each step works, you can easily tweak them to your needs.
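To complement the transcript's point that TFAutoModel only instantiates the body of the model, here is a minimal sketch showing the hidden-state tensor it returns (the exact sequence length depends on the input sentences; the checkpoint and sentences below reuse the article's examples):
from transformers import AutoTokenizer, TFAutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
body = TFAutoModel.from_pretrained(checkpoint)  # model body only, no classification head
inputs = tokenizer(["I love programming!", "I hate bugs."], padding=True, truncation=True, return_tensors="tf")
hidden = body(inputs)
print(hidden.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size), e.g. (2, 16, 768) in the video's example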