Sequence Labeling — The Basis of NLP
In this blog post, we will introduce Sequence Labeling in the field of Natural Language Processing (NLP). We will start by discussing the use of this technique in real-world applications and then go deeper by covering the main tasks we can find: Named Entity Recognition, Part-of-Speech Tagging, Chunking and Semantic Role Labeling.
Sequence labeling is a fundamental technique in NLP that is used to identify and label the components of a sequence, such as words or phrases in a sentence. It is often used as a preprocessing step for other NLP tasks, such as part-of-speech tagging or named entity recognition, and is a key component of many NLP applications.
Sequence labeling has a number of practical applications. It can be used in text classification to categorize documents and in sentiment analysis to determine the sentiment of a text. In information retrieval, it helps to clarify the context and meaning of a query. Additionally, sequence labeling is employed in machine translation to identify the grammatical structure of a sentence and to facilitate the translation process.
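At its core, sequence labeling simply assigns one label to each token in a sequence. Here is a minimal sketch of that idea (the tag names and the labeling rule are purely illustrative, not a real tagger):

```python
def label_sequence(tokens, labeler):
    """Apply a per-token labeling function, producing (token, label) pairs."""
    return [(token, labeler(token)) for token in tokens]

# Toy rule: capitalized tokens are tagged 'ENTITY', everything else 'O'
def is_entity(token):
    return 'ENTITY' if token[0].isupper() else 'O'

print(label_sequence(["Alice", "visited", "Paris"], is_entity))
# Output: [('Alice', 'ENTITY'), ('visited', 'O'), ('Paris', 'ENTITY')]
```

Real systems replace the toy rule with hand-written grammars or learned models, but the input/output shape stays the same: a sequence in, a label per element out.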
With that said, we can now move on to examine how it works with the help of some code examples.
Main Tasks of Sequence Labeling
Sequence labeling is used for a wide range of NLP tasks, such as Part-of-Speech Tagging, Named Entity Recognition, Chunking and Semantic Role Labeling. Let's look at each of them.
Part-Of-Speech Tagging
Part-of-speech (POS) tagging is the process of labeling the parts of speech (such as nouns, verbs, and adjectives) in a sentence.
In information retrieval, it can help the search engine to provide more relevant and accurate results by distinguishing different parts of speech.
Here is a simplified, rule-based example:

import nltk

# Requires the 'punkt' tokenizer data: nltk.download('punkt')
def tag_sentence(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged_tokens = []
    for token in tokens:
        if token.endswith('ed'):
            tagged_tokens.append((token, 'VERB'))
        elif token.istitle():
            tagged_tokens.append((token, 'PROPER NOUN'))
        else:
            tagged_tokens.append((token, 'NOUN'))
    return tagged_tokens

sentence = "The cat chased the mouse."
tagged_tokens = tag_sentence(sentence)
print(tagged_tokens)
# Output: [('The', 'PROPER NOUN'), ('cat', 'NOUN'), ('chased', 'VERB'),
#          ('the', 'NOUN'), ('mouse', 'NOUN'), ('.', 'NOUN')]

Of course, these hand-written rules are far too crude for real text (note how the period gets tagged 'NOUN'); in practice you would use a trained tagger such as nltk.pos_tag.
Named Entity Recognition
Named entity recognition (NER) is the task of identifying and classifying named entities (such as people, organizations, and locations) in text.
Like POS tagging, it can be used to extract information from a large corpus of texts and helps identify the relevant pieces more quickly.
Here is an example using the pre-trained en_core_web_sm model from spaCy:
import spacy

nlp = spacy.load('en_core_web_sm')

def extract_entities(text):
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        entities.append((ent.label_, ent.text))
    return entities

text = "Barack Obama was born in Hawaii."
print(extract_entities(text))
# Output: [('PERSON', 'Barack Obama'), ('GPE', 'Hawaii')]
The model has been pre-trained to recognize a wide range of named entities, and it can be fine-tuned for specific tasks or languages by training on additional annotated data.
Chunking
Chunking is a sequence labeling task that involves dividing a sequence of words into non-overlapping sub-sequences, or chunks. These chunks are typically tagged with a label that indicates their type or role in the sequence.
It is often used as a preprocessing step for other natural language processing tasks, such as named entity recognition or part-of-speech tagging.
Suppose we have the following sentence: “The quick brown fox jumps over the lazy dog.”
Using chunking, we might divide the sentence into the following chunks:
- “The quick brown fox” (noun phrase)
- “jumps” (verb)
- “over the lazy dog” (prepositional phrase)
This is one way to chunk: identify the phrases in the sentence and label each with its type. The resulting sequence of chunks and tags would be:
(NP, The quick brown fox) (V, jumps) (PP, over the lazy dog)
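To make this concrete, here is a minimal chunker that groups maximal runs of determiner/adjective/noun tags into noun phrases and leaves every other token as its own chunk. The tiny tag set and the grouping rule are deliberate simplifications for the example; real chunkers (e.g. NLTK's RegexpParser) use richer grammars:

```python
def chunk_noun_phrases(tagged_tokens):
    """Group consecutive DT/JJ/NN tokens into NP chunks; other tokens
    become single-word chunks labeled with their POS tag."""
    np_tags = {'DT', 'JJ', 'NN'}
    chunks, current = [], []
    for word, tag in tagged_tokens:
        if tag in np_tags:
            current.append(word)
        else:
            if current:  # close the noun phrase in progress
                chunks.append(('NP', ' '.join(current)))
                current = []
            chunks.append((tag, word))
    if current:  # flush a trailing noun phrase
        chunks.append(('NP', ' '.join(current)))
    return chunks

tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
          ('jumps', 'VB'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN')]
print(chunk_noun_phrases(tagged))
# Output: [('NP', 'The quick brown fox'), ('VB', 'jumps'),
#          ('IN', 'over'), ('NP', 'the lazy dog')]
```

Note that this simplified rule only groups noun phrases, so “over” stays a separate chunk rather than heading a prepositional phrase.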
Semantic Role Labeling
Semantic Role Labeling (SRL) goes beyond identifying the grammatical role of words, as Part-of-Speech Tagging does, and focuses on determining their meaning and the relationships between them.
It typically involves the following steps:
- Identifying the predicate in a sentence
- Identifying the arguments of the predicate
- Labeling the arguments with their corresponding roles
Here is an example using the Stanford CoreNLP dependency parser through NLTK. Note that this extracts syntactic dependency triples, which are a common starting point for identifying predicates and their arguments rather than full semantic roles, and it requires a CoreNLP server running locally:

from nltk.parse.corenlp import CoreNLPDependencyParser

# Connect to a running Stanford CoreNLP server
# (start one from the CoreNLP distribution, listening on port 9000)
parser = CoreNLPDependencyParser(url='http://localhost:9000')

# Parse a sentence and extract the dependency triples
sentence = "The boy kicked the ball."
parse, = parser.raw_parse(sentence)

# Print the triples
for triple in parse.triples():
    print(triple)
# Output (abridged; exact relation names depend on the CoreNLP version):
# (('kicked', 'VBD'), 'nsubj', ('boy', 'NN'))
# (('kicked', 'VBD'), 'dobj', ('ball', 'NN'))
Each triple consists of a head word with its POS tag (the first element of the triple), a relation label (the second element), and a dependent word with its tag (the third element).
In this example, the predicate is “kicked”, the label “nsubj” indicates that “boy” is its subject, and the label “dobj” indicates that “ball” is its direct object.
Now that we have introduced the most common sequence labeling tasks, let’s review the main approaches used…
Approaches to Sequence Labeling Tasks
There are multiple ways to perform those tasks, and the method chosen can significantly impact the performance and outcome.
- Rule-based approaches: These rely on a set of manually defined rules, from pattern-matching rules that tag each word in a sentence to rules that identify named entities in text. They do the job for simple tasks but can be error-prone and time-consuming to maintain.
- Machine learning-based approaches: These use machine learning techniques to learn the patterns of a given task from annotated training data. They range from stochastic models, such as Hidden Markov Models and Conditional Random Fields, to deep learning architectures such as Transformers.
- Hybrid approaches: These combine the strengths of rule-based and statistical approaches, using a combination of hand-written rules and machine learning techniques to identify arguments and roles.
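To give a flavor of the stochastic family, here is a minimal Viterbi decoder for a toy Hidden Markov Model tagger. All the probabilities below are made-up values chosen for the demo, not learned from data:

```python
def viterbi(tokens, tags, start_p, trans_p, emit_p):
    """Find the most likely tag sequence for tokens under a simple HMM."""
    # V[i][tag] = best probability of any tag path ending in `tag` at step i
    V = [{t: start_p[t] * emit_p[t].get(tokens[0], 1e-6) for t in tags}]
    path = {t: [t] for t in tags}
    for i in range(1, len(tokens)):
        V.append({})
        new_path = {}
        for t in tags:
            # Pick the best previous tag leading into `t`
            prob, prev = max(
                (V[i - 1][p] * trans_p[p][t] * emit_p[t].get(tokens[i], 1e-6), p)
                for p in tags)
            V[i][t] = prob
            new_path[t] = path[prev] + [t]
        path = new_path
    best_tag = max(V[-1], key=V[-1].get)
    return path[best_tag]

# Toy model: two tags with hand-picked probabilities (assumptions for the demo)
tags = ['NOUN', 'VERB']
start_p = {'NOUN': 0.7, 'VERB': 0.3}
trans_p = {'NOUN': {'NOUN': 0.3, 'VERB': 0.7},
           'VERB': {'NOUN': 0.8, 'VERB': 0.2}}
emit_p = {'NOUN': {'dogs': 0.4, 'runs': 0.05},
          'VERB': {'dogs': 0.01, 'runs': 0.5}}

print(viterbi(['dogs', 'runs'], tags, start_p, trans_p, emit_p))
# Output: ['NOUN', 'VERB']
```

A real tagger would estimate the transition and emission probabilities from an annotated corpus; the decoding step, however, is exactly this dynamic program.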
Before concluding, I wanted to give you some interesting packages that can be helpful to start in NLP.
Python Libraries You Need to Know
Excluding general machine learning libraries such as scikit-learn, Keras, TensorFlow, and PyTorch, the following libraries focus specifically on natural language processing:
- NLTK: The Natural Language Toolkit (NLTK) is a comprehensive library for NLP that includes tools for sequence labeling, such as part-of-speech taggers and chunkers.
- spaCy: spaCy is an NLP library that provides various tools for sequence labeling, including part-of-speech taggers, named entity recognizers, and dependency parsers.
- Gensim: Gensim is an NLP library focused on word embeddings, topic modeling, and document similarity.
- Polyglot: Polyglot is a library that supports a wide range of natural language processing tasks, including part-of-speech tagging, named entity recognition, and language detection.
- TextBlob: TextBlob is a library for NLP that includes tools for part-of-speech tagging, noun phrase chunking, and sentiment analysis.
Thank you for taking the time to read the article, I hope you had a great time learning about Sequence labeling. Please feel free to share this post with your friends and if you are interested in data science or machine learning, check out my other articles here.