Classical Languages — Grammar

When discussing and analysing the grammar of a text, a single-document generative workflow, in which the analysis is produced by code inside the document itself, gives us a way of reducing errors of fact (for example, in the analysis of a specific piece of text).

For example, we can generate the inflection patterns of arbitrary regular verbs and nouns directly from a particular root word (lemma), or analyse the syllabic structure of a piece of text.

Where the means of production are shared with learners, the ability to check declensions and conjugations for arbitrary words, or to analyse a text for its syllabic structure, provides an opportunity to support curiosity-driven, self-directed learning.

To provide a few examples of what’s possible, let’s use the cltk package and explore some simple Latin texts.

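from cltk.data.fetch import FetchCorpus
corpus_downloader = FetchCorpus('lat')

# Local path to the Latin Library texts; adjust this for your own machine
path = '/Users/tonyhirst/cltk_data/lat/text/lat_text_latin_library'

# Download the CLTK Latin models corpus
corpus_downloader.import_corpus('lat_models_cltk')
# The texts read from `path` later on come from this corpus
corpus_downloader.import_corpus('lat_text_latin_library')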

Inflection Patterns

We can automatically generate the inflection (declension / conjugation) for a given word / lemma.

(Lemma - “the canonical form of an inflected word”.)

The morphological character of a word is encoded using a nine-character code string, with one character per position (- is used as the null character):

1: 	part of speech
	n	noun
	v	verb
	t	participle
	a	adjective
	d	adverb
	c	conjunction
	r	preposition
	p	pronoun
	m	numeral
	i	interjection
	e	exclamation
	u	punctuation
2: 	person
	1	first person
	2	second person
	3	third person
3: 	number
	s	singular
	p	plural
4: 	tense
	p	present
	i	imperfect
	r	perfect
	l	pluperfect
	t	future perfect
	f	future
5: 	mood
	i	indicative
	s	subjunctive
	n	infinitive
	m	imperative
	p	participle
	d	gerund
	g	gerundive
	u	supine
6: 	voice
	a	active
	p	passive
7:	gender
	m	masculine
	f	feminine
	n	neuter
8: 	case
	n	nominative
	g	genitive
	d	dative
	a	accusative
	b	ablative
	v	vocative
	l	locative
9: 	degree
	c	comparative
	s	superlative

Via: https://github.com/cltk/latin_treebank_perseus#readme

Consider amo. How does it go?

from cltk.morphology.lat import CollatinusDecliner
decliner = CollatinusDecliner()

decliner.decline("amo")[:20]
[('amo', 'v1spia---'),
 ('amas', 'v2spia---'),
 ('amat', 'v3spia---'),
 ('amamus', 'v1ppia---'),
 ('amatis', 'v2ppia---'),
 ('amant', 'v3ppia---'),
 ('amabam', 'v1siia---'),
 ('amabas', 'v2siia---'),
 ('amabat', 'v3siia---'),
 ('amabamus', 'v1piia---'),
 ('amabatis', 'v2piia---'),
 ('amabant', 'v3piia---'),
 ('amabo', 'v1sfia---'),
 ('amabis', 'v2sfia---'),
 ('amabit', 'v3sfia---'),
 ('amabimus', 'v1pfia---'),
 ('amabitis', 'v2pfia---'),
 ('amabunt', 'v3pfia---'),
 ('amavi', 'v1sria---'),
 ('amavisti', 'v2sria---')]

Or how about canis?

decliner.decline("canis")
[('canis', '--s----n-'),
 ('canis', '--s----v-'),
 ('canem', '--s----a-'),
 ('canis', '--s----g-'),
 ('cani', '--s----d-'),
 ('cane', '--s----b-'),
 ('canes', '--p----n-'),
 ('canes', '--p----v-'),
 ('canes', '--p----a-'),
 ('canum', '--p----g-'),
 ('canibus', '--p----d-'),
 ('canibus', '--p----b-')]
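
The flat list is easier to read if we lay it out as a conventional paradigm table. The following is a minimal sketch rather than anything provided by cltk: the helper names paradigm and format_paradigm are made up for illustration, and the data is simply copied from the output above so that the example runs without the decliner.

```python
# Sample (form, tag) pairs copied from the decliner output for canis above.
CANIS = [
    ('canis', '--s----n-'), ('canis', '--s----v-'), ('canem', '--s----a-'),
    ('canis', '--s----g-'), ('cani', '--s----d-'), ('cane', '--s----b-'),
    ('canes', '--p----n-'), ('canes', '--p----v-'), ('canes', '--p----a-'),
    ('canum', '--p----g-'), ('canibus', '--p----d-'), ('canibus', '--p----b-'),
]

# Position 8 of the tag (index 7) is the case; position 3 (index 2) is the number.
CASE_ORDER = [('n', 'nominative'), ('v', 'vocative'), ('a', 'accusative'),
              ('g', 'genitive'), ('d', 'dative'), ('b', 'ablative')]


def paradigm(forms):
    """Map (case_code, number_code) to the inflected form (hypothetical helper)."""
    return {(tag[7], tag[2]): form for form, tag in forms}


def format_paradigm(forms):
    """Return the paradigm as a plain-text table, one line per case."""
    table = paradigm(forms)
    lines = [f"{'case':<12}{'singular':<10}{'plural'}"]
    for code, name in CASE_ORDER:
        singular = table.get((code, 's'), '-')
        plural = table.get((code, 'p'), '-')
        lines.append(f"{name:<12}{singular:<10}{plural}")
    return "\n".join(lines)


print(format_paradigm(CANIS))
```

Keying the table on (case, number) assumes one form per slot, which holds for this noun but would need rethinking for words with alternative forms.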

We can decode the strings to more easily describe the morphological character of a word.

#Taken from https://github.com/alpheios-project/pyperseus-treebank/blob/master/pyperseus_treebank/latin.py#L44#
#Maybe use https://github.com/jazzband/inflect for natural language code2text description?
import re

# Conversion table for CONLL
# Thanks to @epageperron
#??Some divergence from README?
_CONLL_LA_CONV_DICT = { "a": "adjective", "c": "conjunction",
                        "d": "adverb", "e": "exclamation", "g": "PART",
                        "i": "interjection", "l": "DET",
                        "m": "numeral", "n": "noun","p": "pronoun",
                        "r": "preposition", "t": "VERB", "u": "punctuation",
                        "v": "verb", "x": "X" }

_NUMBER = {"s": "singular", "p": "plural"}
_TENSE = {"p": "present", "f": "future", "r": "perfect", "l": "pluperfect",
          "i": "imperfect", "t": "future perfect"}
_MOOD = {"i": "indicative", "s": "subjunctive", "m": "imperative", 'd':'gerund',
         "g": "gerundive", "p": "participle", "u": "supine", "n": "infinitive"}
_VOICE = {"a": "active", "p": "passive", "d": "Dep"}
_GENDER = {"f": "feminine", "m": "masculine", "n": "neuter", "c": "Com"}
_CASE = {"g": "genitive", "d": "dative", "a": "accusative", "v": "vocative",
         "n": "nominative", "b": "ablative", "i": "Ins", "l": "locative"}
_DEGREE = {"p": "Pos", "c": "comparative", "s": "superlative"}

_PERSON = {"1":'first person', "2":'second person', "3":'third person'}

# Raw string, so that \W is passed to the regex engine rather than treated as a string escape
NOTWORD = re.compile(r"^\W+$")

_NULL_CHAR="-"

def parse_features(features):
    """ Parse features from the POSTAG of Perseus Latin XML
    .. example :: self.parse_features("n-p---na-")
    :param features: A string containing morphological information
    :type features: str
    :return: Parsed features
    :rtype: dict
    """

    if features is None or features.lower()=='unk':
        return {}
    
    features = features.lower()
    
    feats = {}

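    # Part of speech (the decliner leaves this slot null for nouns, e.g. '--s----n-')
    if features[0] != _NULL_CHAR:
        feats['POS'] = _CONLL_LA_CONV_DICT[features[0]]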

    # Person handling : 3 possibilities
    if features[1] != _NULL_CHAR:
        feats["Person"] = _PERSON[features[1]]

    # Number handling : two possibilities
    if features[2] != _NULL_CHAR:
        feats["Number"] = _NUMBER[features[2]]

    # Tense
    if features[3] != _NULL_CHAR:
        feats["Tense"] = _TENSE[features[3]]

    # Mood
    if features[4] != _NULL_CHAR:
        feats["Mood"] = _MOOD[features[4]]

    # Voice
    if features[5] != _NULL_CHAR:
        feats["Voice"] = _VOICE[features[5]]

    # Gender
    if features[6] != _NULL_CHAR:
        feats["Gender"] = _GENDER[features[6]]

    # Case
    if features[7] != _NULL_CHAR:
        feats["Case"] = _CASE[features[7]]

    # Degree
    if features[8] != _NULL_CHAR:
        feats["Degree"] = _DEGREE[features[8]]

    return feats

For example, how should we interpret the following morphological data feature string?

#Example
parse_features('v3plia---')
{'POS': 'verb',
 'Person': 'third person',
 'Number': 'plural',
 'Tense': 'pluperfect',
 'Mood': 'indicative',
 'Voice': 'active'}
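
A dictionary like this can be turned into a short readable phrase. The sketch below is one possible approach, not part of the original recipe: describe and FEATURE_ORDER are hypothetical names, and the input is just the dictionary shown above, copied in so that the example stands alone.

```python
# Order in which the decoded features are read out (an arbitrary choice).
FEATURE_ORDER = ['Person', 'Number', 'Tense', 'Mood', 'Voice',
                 'Gender', 'Case', 'Degree', 'POS']


def describe(feats):
    """Join the decoded feature values into a short phrase (hypothetical helper)."""
    if not feats:
        return 'unknown'
    return ' '.join(feats[key] for key in FEATURE_ORDER if key in feats)


# Dictionary copied from the parse_features('v3plia---') output above.
example = {'POS': 'verb', 'Person': 'third person', 'Number': 'plural',
           'Tense': 'pluperfect', 'Mood': 'indicative', 'Voice': 'active'}

print(describe(example))
# third person plural pluperfect indicative active verb
```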

Looking up words in the decliner provides a way of getting the morphological data for a word. For example, given the lemma amo, we could look up amabitis and get back ('amabitis', 'v2pfia---'):

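# Hacky approach that assumes you already know the lemma
def lookupInflection(word, lemma):
    ''' Find the inflection of a given word, given its lemma (or a list of candidate lemmas). '''
    result = []
    if lemma is None:
        return result

    # Accept either a single lemma or a list of candidate lemmas
    lemma = [lemma] if isinstance(lemma, str) else lemma
    for l in lemma:
        try:
            words = decliner.decline(l)
            # Keep only the forms that match the word we are looking for
            result.append([(w, d) for w, d in words if w == word])
        except Exception:
            # For example, the decliner could not handle this lemma
            result.append((l, None))
    return result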

If we know the root, we can look up the inflection:

lookupInflection('amabitis', 'amo')
[[('amabitis', 'v2pfia---')]]
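
The lookup above calls the decliner again for every query. If we expect to look up many forms of the same lemma, one alternative is to build a reverse index once and query that instead. This is only a sketch under that assumption: build_form_index is a made-up name, and the pairs are copied from the canis output shown earlier rather than fetched from the decliner.

```python
from collections import defaultdict

# Sample (form, tag) pairs copied from the decliner output for canis above.
CANIS_FORMS = [
    ('canis', '--s----n-'), ('canis', '--s----v-'), ('canem', '--s----a-'),
    ('canis', '--s----g-'), ('cani', '--s----d-'), ('cane', '--s----b-'),
    ('canes', '--p----n-'), ('canes', '--p----v-'), ('canes', '--p----a-'),
    ('canum', '--p----g-'), ('canibus', '--p----d-'), ('canibus', '--p----b-'),
]


def build_form_index(lemma, forms):
    """Map each surface form to every (lemma, tag) analysis it can have."""
    index = defaultdict(list)
    for form, tag in forms:
        index[form].append((lemma, tag))
    return dict(index)


index = build_form_index('canis', CANIS_FORMS)
print(index['canis'])
# [('canis', '--s----n-'), ('canis', '--s----v-'), ('canis', '--s----g-')]
```

The index also makes ambiguity visible: canis on its own could be nominative, vocative or genitive singular, so context is needed to choose between them.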

Lemmatizing a Word

Let’s see if we can find the root of a word with a simple lemmatizer. The lemmatizer works with tokens, so we need a recipe for generating tokens out of words:

from cltk.tokenizers.lat.lat import LatinWordTokenizer

latin_word_tokenizer = LatinWordTokenizer()

latin_word_tokenizer.tokenize('amabitis')
['amabitis']

If we create a lemmatizer:

from cltk.lemmatize.lat import LatinBackoffLemmatizer

latin_lemmatizer = LatinBackoffLemmatizer()

We can then see what it makes of amabitis:

latin_lemmatizer.lemmatize(latin_word_tokenizer.tokenize('amabitis'))
[('amabitis', 'amo')]

We can also lemmatize all the words in a sentence.

As before, we need to tokenize the words we present to the lemmatizer, so let’s convert our sentence to a list of separate (word) tokens:

sentence = 'Progeniem sed enim Troiano a sanguine duci audierat'

sentence_tokens = latin_word_tokenizer.tokenize(sentence.lower())
sentence_tokens
['progeniem', 'sed', 'enim', 'troiano', 'a', 'sanguine', 'duci', 'audierat']

Then we can lemmatize those tokens:

latin_lemmatizer.lemmatize(sentence_tokens)
[('progeniem', 'progenies'),
 ('sed', 'sed'),
 ('enim', 'enim'),
 ('troiano', 'troiano'),
 ('a', 'ab'),
 ('sanguine', 'sanguis'),
 ('duci', 'duco'),
 ('audierat', 'audio')]
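
It can be worth checking the pairs that come back. Notice that troiano is returned unchanged, which may mean the lemmatizer had nothing better to offer for it, whereas sed and enim really are their own lemmas. The sketch below, using hypothetical helper names and the pairs copied from the output above, picks out the unchanged tokens for manual review and counts how often each lemma occurs.

```python
from collections import Counter

# (token, lemma) pairs copied from the lemmatizer output above.
PAIRS = [('progeniem', 'progenies'), ('sed', 'sed'), ('enim', 'enim'),
         ('troiano', 'troiano'), ('a', 'ab'), ('sanguine', 'sanguis'),
         ('duci', 'duco'), ('audierat', 'audio')]


def unchanged_tokens(pairs):
    """Return tokens whose lemma is identical to the token itself."""
    return [token for token, lemma in pairs if token == lemma]


def lemma_counts(pairs):
    """Count how many tokens map onto each lemma."""
    return Counter(lemma for _, lemma in pairs)


print(unchanged_tokens(PAIRS))
# ['sed', 'enim', 'troiano']
print(lemma_counts(PAIRS)['duco'])
# 1
```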

We can also lemmatize Roman numerals:

from cltk.lemmatize.lat import RomanNumeralLemmatizer

#Lemmatizer for identifying roman numerals in Latin text based on regex.
lemmatizer = RomanNumeralLemmatizer()

lemmatizer.lemmatize("i ii iii iv v vi vii vii ix x xx xxx xl l lx c cc".split())
[('i', 'NUM'),
 ('ii', 'NUM'),
 ('iii', 'NUM'),
 ('iv', 'NUM'),
 ('v', 'NUM'),
 ('vi', 'NUM'),
 ('vii', 'NUM'),
 ('vii', 'NUM'),
 ('ix', 'NUM'),
 ('x', 'NUM'),
 ('xx', 'NUM'),
 ('xxx', 'NUM'),
 ('xl', 'NUM'),
 ('l', 'NUM'),
 ('lx', 'NUM'),
 ('c', 'NUM'),
 ('cc', 'NUM')]
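
The lemmatizer only labels each token as NUM. If we also want the value, a small converter is easy to write. The sketch below is not part of cltk: roman_to_int is a hypothetical helper that assumes well-formed, lower or upper case numerals in standard subtractive notation and does no validation.

```python
ROMAN_VALUES = {'i': 1, 'v': 5, 'x': 10, 'l': 50, 'c': 100, 'd': 500, 'm': 1000}


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral to an integer (no validation)."""
    values = [ROMAN_VALUES[ch] for ch in numeral.lower()]
    total = 0
    # Pair each symbol with the one that follows it (0 after the last symbol).
    for current, following in zip(values, values[1:] + [0]):
        # A smaller value before a larger one is subtracted (iv = 4, xl = 40).
        total += -current if current < following else current
    return total


numerals = "i ii iii iv v vi vii ix x xx xxx xl l lx c cc".split()
print([(n, roman_to_int(n)) for n in numerals])
```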

Syllables

One way of helping students read a text is to split the syllables out.

with open(f'{path}/vergil/aen1.txt') as f:
    aeneid_1 = f.read()

# Here's a manual way of doing a concordance, though we need to clean the text for the tokeniser first
from cltk.alphabet.text_normalization import remove_non_ascii
from cltk.alphabet.text_normalization import remove_non_latin

aen1_clean = remove_non_ascii(aeneid_1)
aen1_clean = remove_non_latin(aen1_clean)
print(aen1_clean[:1000])
Vergil Aeneid I        P VERGILI MARONIS AENEIDOS LIBER PRIMVS  Arma virumque cano Troiae qui primus ab oris Italiam fato profugus Laviniaque venit litora multum ille et terris iactatus et alto vi superum saevae memorem Iunonis ob iram multa quoque et bello passus dum conderet urbem     inferretque deos Latio genus unde Latinum Albanique patres atque altae moenia Romae  Musa mihi causas memora quo numine laeso quidve dolens regina deum tot volvere casus insignem pietate virum tot adire labores     impulerit Tantaene animis caelestibus irae  Urbs antiqua fuit Tyrii tenuere coloni Karthago Italiam contra Tiberinaque longe ostia dives opum studiisque asperrima belli quam Iuno fertur terris magis omnibus unam     posthabita coluisse Samo hic illius arma hic currus fuit hoc regnum dea gentibus esse si qua fata sinant iam tum tenditque fovetque Progeniem sed enim Troiano a sanguine duci audierat Tyrias olim quae verteret arces     hinc populum late regem belloque superbum venturum excidio Li

We can then tokenise the cleaned text and build a concordance for a particular word:

from nltk.text import Text

tokens = latin_word_tokenizer.tokenize(aen1_clean)
textList = Text(tokens)
textList.concordance('Libyae')
Displaying 7 of 7 matches:
ello -que superbum venturum excidio Libyae sic volvere Parcas Id metuens veter
a litora cursu contendunt petere et Libyae vertuntur ad oras Est in secessu lo
ulos sic vertice caeli constitit et Libyae defixit lumina regnis Atque illum t
e per aera magnum remigio alarum ac Libyae citus adstitit oris Et iam iussa fa
o -que supersunt Ipse ignotus egens Libyae deserta peragro Europa atque Asia p
e pater optime Teucrum pontus habet Libyae nec spes iam restat Iuli at freta S
uidem per litora certos dimittam et Libyae lustrare extrema iubebo si quibus e
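
To see what a concordance is doing, here is a minimal keyword-in-context sketch in plain Python. It is not how nltk implements it: the concordance function below is a simplified stand-in that counts context in tokens rather than characters, and the sample tokens are taken from the cleaned text and the first match shown above.

```python
def concordance(tokens, target, width=3):
    """Return (left, keyword, right) context strings for each match of target."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            # Up to `width` tokens either side of the match
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# A short run of words from Aeneid I, split on whitespace.
sample = ("hinc populum late regem belloque superbum venturum excidio Libyae "
          "sic volvere Parcas").split()

for left, keyword, right in concordance(sample, 'Libyae'):
    print(f"{left} [{keyword}] {right}")
# superbum venturum excidio [Libyae] sic volvere Parcas
```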

The cltk package also provides a short example text we can work with:

from cltk.languages.example_texts import get_example_text

example_lat = get_example_text('lat')
example_lat
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. Horum omnium fortissimi sunt Belgae, propterea quod a cultu atque humanitate provinciae longissime absunt, minimeque ad eos mercatores saepe commeant atque ea quae ad effeminandos animos pertinent important, proximique sunt Germanis, qui trans Rhenum incolunt, quibuscum continenter bellum gerunt. Qua de causa Helvetii quoque reliquos Gallos virtute praecedunt, quod fere cotidianis proeliis cum Germanis contendunt, cum aut suis finibus eos prohibent aut ipsi in eorum finibus bellum gerunt. Eorum una, pars, quam Gallos obtinere dictum est, initium capit a flumine Rhodano, continetur Garumna flumine, Oceano, finibus Belgarum, attingit etiam ab Sequanis et Helvetiis flumen Rhenum, vergit ad septentriones. Belgae ab extremis Galliae finibus oriuntur, pertinent ad inferiorem partem fluminis Rheni, spectant in septentrionem et orientem solem. Aquitania a Garumna flumine ad Pyrenaeos montes et eam partem Oceani quae est ad Hispaniam pertinet; spectat inter occasum solis et septentriones.'

We can split the example text into sentences:

from cltk.sentence.lat import LatinPunktSentenceTokenizer

latin_splitter = LatinPunktSentenceTokenizer()

sentences = latin_splitter.tokenize(example_lat)

sentences
['Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur.',
 'Hi omnes lingua, institutis, legibus inter se differunt.',
 'Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.',
 'Horum omnium fortissimi sunt Belgae, propterea quod a cultu atque humanitate provinciae longissime absunt, minimeque ad eos mercatores saepe commeant atque ea quae ad effeminandos animos pertinent important, proximique sunt Germanis, qui trans Rhenum incolunt, quibuscum continenter bellum gerunt.',
 'Qua de causa Helvetii quoque reliquos Gallos virtute praecedunt, quod fere cotidianis proeliis cum Germanis contendunt, cum aut suis finibus eos prohibent aut ipsi in eorum finibus bellum gerunt.',
 'Eorum una, pars, quam Gallos obtinere dictum est, initium capit a flumine Rhodano, continetur Garumna flumine, Oceano, finibus Belgarum, attingit etiam ab Sequanis et Helvetiis flumen Rhenum, vergit ad septentriones.',
 'Belgae ab extremis Galliae finibus oriuntur, pertinent ad inferiorem partem fluminis Rheni, spectant in septentrionem et orientem solem.',
 'Aquitania a Garumna flumine ad Pyrenaeos montes et eam partem Oceani quae est ad Hispaniam pertinet; spectat inter occasum solis et septentriones.']

Then we can split each word of the first sentence into its syllables:

from cltk.prosody.lat.syllabifier import Syllabifier

syllabifier = Syllabifier()

clean_sentence = remove_non_ascii(remove_non_latin(sentences[0])).lower()

#Extract syllables for each word
for word in latin_word_tokenizer.tokenize(clean_sentence):
    syllables = syllabifier.syllabify(word)
    print(word, syllables)
gallia ['gal', 'li', 'a']
est ['est']
omnis ['om', 'nis']
divisa ['di', 'vi', 'sa']
in ['in']
partes ['par', 'tes']
tres ['tres']
quarum ['qua', 'rum']
unam ['u', 'nam']
incolunt ['in', 'co', 'lunt']
belgae ['bel', 'gae']
aliam ['a', 'li', 'am']
aquitani ['a', 'qui', 'ta', 'ni']
tertiam ['ter', 'ti', 'am']
qui ['qui']
ipsorum ['ip', 'so', 'rum']
lingua ['lin', 'gua']
celtae ['cel', 'tae']
nostra ['nos', 'tra']
galli ['gal', 'li']
appellantur ['ap', 'pel', 'lan', 'tur']
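
For a reader, the syllable lists are probably more useful when rejoined into running text with a visible separator. A minimal sketch, assuming the (word, syllables) pairs printed above; mark_line is a hypothetical helper, and the data is copied in so that the example runs without the syllabifier.

```python
# (word, syllables) pairs copied from the syllabifier output above.
SYLLABIFIED = [
    ('gallia', ['gal', 'li', 'a']),
    ('est', ['est']),
    ('omnis', ['om', 'nis']),
    ('divisa', ['di', 'vi', 'sa']),
    ('in', ['in']),
    ('partes', ['par', 'tes']),
    ('tres', ['tres']),
]


def mark_line(pairs, separator='-'):
    """Rejoin syllabified words into one line, marking the syllable breaks."""
    return ' '.join(separator.join(syllables) for _, syllables in pairs)


def syllable_count(pairs):
    """Total number of syllables across all the words."""
    return sum(len(syllables) for _, syllables in pairs)


print(mark_line(SYLLABIFIED))
# gal-li-a est om-nis di-vi-sa in par-tes tres
print(syllable_count(SYLLABIFIED))
# 13
```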

Pipeline processing

from cltk.languages.pipelines import LatinPipeline

pipeline = LatinPipeline()

pipeline.description
'Pipeline for the Latin language'
pipeline.language
Language(name='Latin', glottolog_id='lati1261', latitude=41.9026, longitude=12.4502, dates=[], family_id='indo1319', parent_id='impe1234', level='language', iso_639_3_code='lat', type='a')
pipeline.language.name
'Latin'
pipeline.processes
[cltk.alphabet.processes.LatinNormalizeProcess,
 cltk.dependency.processes.LatinStanzaProcess,
 cltk.embeddings.processes.LatinEmbeddingsProcess,
 cltk.stops.processes.StopsProcess,
 cltk.ner.processes.LatinNERProcess,
 cltk.lexicon.processes.LatinLexiconProcess]
# This doesn't work? Note that NLP is not imported anywhere above; it would need:
#from cltk import NLP
#cltk_nlp = NLP(language='lat')
#cltk_doc = cltk_nlp.analyze(text=example_lat)

# Not yet sure how best to use the result; some things to explore:
#cltk_doc.stanza_doc.to_dict()[0][:3]
# Also: cltk_doc.raw
#cltk_doc.normalized_text
#cltk_doc.sentences_strings
#cltk_doc.sentences_tokens
#[ p for p in zip(cltk_doc.tokens, cltk_doc.lemmata,
#                 cltk_doc.pos, cltk_doc.morphosyntactic_features)]