Linguistics & Text Processing Overview¶
As well as providing specialist tools to support linguistics teaching and learning, some of these tools have more general-purpose relevance. For example, named entity recognition can be used to identify and extract the names of people, companies and geopolitical entities, as well as dates and monetary amounts, within a text; supporting visualisation tools can then be used to highlight those entities within a displayed text.
A variety of tools exist that can be used to support linguistic analysis of texts.
In Python, these include:

- nltk, the Natural Language Toolkit (NLTK), a rich and powerful toolkit for analysing texts that includes a wide range of reference texts and corpora;
- spaCy, an “industrial strength natural language processing toolkit” that provides very easy to use, yet very fast and very powerful, language processing features.
Natural Language Toolkit (NLTK)¶
NLTK is a long-lived Python project that was arguably the dominant NLP toolkit for many years.
%%capture
try:
import nltk
except ImportError:
%pip install nltk
import nltk
sentence = """This is demonstration of how
the "NLTK" package can tokenise a sentence; simple, but effective."""
tokens = nltk.word_tokenize(sentence)
tokens
['This',
'is',
'demonstration',
'of',
'how',
'the',
'``',
'NLTK',
"''",
'package',
'can',
'tokenise',
'a',
'sentence',
';',
'simple',
',',
'but',
'effective',
'.']
The tokens can also be tagged according to the part of speech (POS) they represent:
nltk.download('averaged_perceptron_tagger')
nltk.pos_tag(tokens)
[('This', 'DT'),
('is', 'VBZ'),
('demonstration', 'NN'),
('of', 'IN'),
('how', 'WRB'),
('the', 'DT'),
('``', '``'),
('NLTK', 'NNP'),
("''", "''"),
('package', 'NN'),
('can', 'MD'),
('tokenise', 'VB'),
('a', 'DT'),
('sentence', 'NN'),
(';', ':'),
('simple', 'NN'),
(',', ','),
('but', 'CC'),
('effective', 'JJ'),
('.', '.')]
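The tagged output pairs each token with a Penn Treebank part-of-speech tag. As a minimal sketch of how such output might be summarised, the (token, tag) pairs shown above can be counted with the standard library's `collections.Counter`:

```python
from collections import Counter

# The (token, tag) pairs returned by nltk.pos_tag above
tagged = [('This', 'DT'), ('is', 'VBZ'), ('demonstration', 'NN'),
          ('of', 'IN'), ('how', 'WRB'), ('the', 'DT'), ('``', '``'),
          ('NLTK', 'NNP'), ("''", "''"), ('package', 'NN'), ('can', 'MD'),
          ('tokenise', 'VB'), ('a', 'DT'), ('sentence', 'NN'), (';', ':'),
          ('simple', 'NN'), (',', ','), ('but', 'CC'),
          ('effective', 'JJ'), ('.', '.')]

# Count how often each POS tag appears in the tagged sentence
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(3))
```

In a live notebook, the same counter could of course be built directly from a fresh call to `nltk.pos_tag(tokens)` rather than from the copied output.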
spaCy¶
With its growing ecosystem of plugins and its simplicity of use, the spaCy package provides an effective toolkit for working with natural language texts in an increasing number of languages. Another attractive feature when authoring rich texts is the good range of visualisers it provides.
%%capture
try:
import spacy
except ImportError:
%pip install spacy
text = """
The spaCy package uses a range of pretrained models to parse provided sentences.
Named entity recognition allows names such as John Smith, Managing Director of FooBar Ltd. in the UK,
who earned £1,000,000 in 2021, to be easily extracted from a text.
"""
Consider the following text:
print(text)
The spaCy package uses a range of pretrained models to parse provided sentences.
Named entity recognition allows names such as John Smith, Managing Director of FooBar Ltd. in the UK,
who earned £1,000,000 in 2021, to be easily extracted from a text.
We can parse the document and display any named entities it contains:
import spacy
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
# Parse the document
doc = nlp(text)
# Extract entities
for entity in doc.ents:
print(entity.text, entity.label_)
John Smith PERSON
FooBar Ltd. ORG
UK GPE
1,000,000 MONEY
2021 DATE
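Rather than simply printing the entities, it is often convenient to group them by label. As a minimal sketch, the (text, label) pairs printed above can be collected into a dictionary with `dict.setdefault` (in a live session you would iterate over `doc.ents` directly rather than copying the output):

```python
# Entity (text, label) pairs as printed by the spaCy loop above
entities = [("John Smith", "PERSON"), ("FooBar Ltd.", "ORG"),
            ("UK", "GPE"), ("1,000,000", "MONEY"), ("2021", "DATE")]

# Group entity texts under their entity label
by_label = {}
for ent_text, ent_label in entities:
    by_label.setdefault(ent_label, []).append(ent_text)

print(by_label)
```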
Visualisation tools exist that can highlight named entities within a text:
from spacy import displacy
# Currently there is no option to set the background color for this visualisation,
# but there is an open issue that may address this
displacy.render(doc, style="ent")
The spaCy package can also be used to diagram the connected parts of speech in a sentence.
doc = nlp("The cat sat on the mat.")
displacy.render(doc, style="dep")
A more compact view is also available:
displacy.render(doc, style="dep", options={"compact":True, "bg":"ivory"})