Classical Languages — Corpora

Some tools to support classics-related subject matter, such as retrieving classical texts and searching through them, analysing grammar, and inflecting words (declensions, conjugations).

The intention, as with the other notebooks in this collection, is to explore ways in which we might create educational resources that are “reproducible with modification” by making available the means of production of the various analyses, diagrams, etc., along with the produced resource.

A secondary benefit is that by automating the generation of particular assets or examples, it becomes easier for authors to make use of them, which may open up new teaching lines. A tertiary benefit is that learners may use the same production methods to allow them to explore the topics themselves.

cltk

cltk, the Classical Language Toolkit, is a natural language processing (NLP) package designed for use with the languages of Ancient, Classical, and Medieval Eurasia.

cltk provides access to a variety of classical texts in a variety of languages, including Latin, Greek, and Old and Middle English. As such, it provides a way for learners to access such texts themselves, provided we can find a reliable index to them or search through the metadata provided for them.

The natural language processing tools in the package make it easy to search texts, as well as analyse them in some languages.

There are also language specific tools, such as a declension generator in Latin, that might be useful for helping check declensions and conjugations, or display particular person/tense combinations for a particular word.

OpenLearn units to explore:

%%capture
try:
    import cltk
except ImportError:
    %pip install matplotlib
    %pip install cltk

Language Examples

To make it easy to demonstrate processing capabilities for a particular language, we can obtain an example piece of text for any of the supported languages simply by providing the appropriate language code.

We can look up language codes from their common name:

import cltk

cltk.languages.utils.find_iso_name('Middle English')
['enm']

We can also get a full list of languages, including a general indication of original locale of each language using metadata from the Glottolog project.

from cltk.languages.glottolog import LANGUAGES

for l in list(LANGUAGES)[:10]:
    print(f'{LANGUAGES[l].name} ({l}) [{LANGUAGES[l].latitude}, {LANGUAGES[l].longitude}]')
Aequian (xae) [0.0, 0.0]
Aghwan (xag) [40.374444, 47.126667]
Akkadian (akk) [33.1, 44.1]
Alanic (xln) [0.0, 0.0]
Ancient Greek (grc) [39.8155, 21.9129]
Ancient Hebrew (hbo) [31.7761, 35.1725]
Ancient Ligurian (xlg) [0.0, 0.0]
Ancient Macedonian (xmk) [0.0, 0.0]
Ancient North Arabian (xna) [0.0, 0.0]
Ancient Zapotec (xzp) [0.0, 0.0]
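
Several of the entries above report [0.0, 0.0], which appears to be a placeholder for a missing location rather than a real position. As a minimal sketch, assuming that interpretation holds, we might filter such entries out before listing or plotting them. The rows below are copied by hand from the output above, and `has_location` is a hypothetical helper, not part of cltk:

```python
# Sample (name, ISO code, latitude, longitude) rows copied from the output above.
sample_languages = [
    ("Aequian", "xae", 0.0, 0.0),
    ("Aghwan", "xag", 40.374444, 47.126667),
    ("Akkadian", "akk", 33.1, 44.1),
    ("Alanic", "xln", 0.0, 0.0),
    ("Ancient Greek", "grc", 39.8155, 21.9129),
    ("Ancient Hebrew", "hbo", 31.7761, 35.1725),
]


def has_location(latitude, longitude):
    """Assumption: [0.0, 0.0] means 'no location recorded'."""
    return not (latitude == 0.0 and longitude == 0.0)


# Keep only the rows that carry a usable location
located = [row for row in sample_languages if has_location(row[2], row[3])]

for name, iso_code, latitude, longitude in located:
    print(f"{name} ({iso_code}) [{latitude}, {longitude}]")
```

With cltk installed, the same test could be applied to the `latitude` and `longitude` attributes of each entry in `LANGUAGES`.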

To view the area of the world from which the languages arise, we can plot them on a map:

import folium

m = folium.Map()

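for lang in LANGUAGES:
    latitude, longitude = LANGUAGES[lang].latitude, LANGUAGES[lang].longitude
    # Skip entries at [0.0, 0.0], which appears to mark a missing location
    if latitude == 0.0 and longitude == 0.0:
        continue
    folium.Marker(location=[latitude, longitude],
                  popup=f'{LANGUAGES[lang].name} (iso-code: {lang})').add_to(m)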

m

Example Texts

To simplify the creation of worked examples, simple example texts can be retrieved for all supported languages.

from cltk.languages.example_texts import get_example_text

get_example_text('lat')
'Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. Horum omnium fortissimi sunt Belgae, propterea quod a cultu atque humanitate provinciae longissime absunt, minimeque ad eos mercatores saepe commeant atque ea quae ad effeminandos animos pertinent important, proximique sunt Germanis, qui trans Rhenum incolunt, quibuscum continenter bellum gerunt. Qua de causa Helvetii quoque reliquos Gallos virtute praecedunt, quod fere cotidianis proeliis cum Germanis contendunt, cum aut suis finibus eos prohibent aut ipsi in eorum finibus bellum gerunt. Eorum una, pars, quam Gallos obtinere dictum est, initium capit a flumine Rhodano, continetur Garumna flumine, Oceano, finibus Belgarum, attingit etiam ab Sequanis et Helvetiis flumen Rhenum, vergit ad septentriones. Belgae ab extremis Galliae finibus oriuntur, pertinent ad inferiorem partem fluminis Rheni, spectant in septentrionem et orientem solem. Aquitania a Garumna flumine ad Pyrenaeos montes et eam partem Oceani quae est ad Hispaniam pertinet; spectat inter occasum solis et septentriones.'

Or here’s another example:

get_example_text('grc')
'ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ τῶν ἐμῶν κατηγόρων, οὐκ οἶδα: ἐγὼ δ᾽ οὖν καὶ αὐτὸς ὑπ᾽ αὐτῶν ὀλίγου ἐμαυτοῦ ἐπελαθόμην, οὕτω πιθανῶς ἔλεγον. καίτοι ἀληθές γε ὡς ἔπος εἰπεῖν οὐδὲν εἰρήκασιν. μάλιστα δὲ αὐτῶν ἓν ἐθαύμασα τῶν πολλῶν ὧν ἐψεύσαντο, τοῦτο ἐν ᾧ ἔλεγον ὡς χρῆν ὑμᾶς εὐλαβεῖσθαι μὴ ὑπ᾽ ἐμοῦ ἐξαπατηθῆτε ὡς δεινοῦ ὄντος λέγειν. τὸ γὰρ μὴ αἰσχυνθῆναι ὅτι αὐτίκα ὑπ᾽ ἐμοῦ ἐξελεγχθήσονται ἔργῳ, ἐπειδὰν μηδ᾽ ὁπωστιοῦν φαίνωμαι δεινὸς λέγειν, τοῦτό μοι ἔδοξεν αὐτῶν ἀναισχυντότατον εἶναι, εἰ μὴ ἄρα δεινὸν καλοῦσιν οὗτοι λέγειν τὸν τἀληθῆ λέγοντα: εἰ μὲν γὰρ τοῦτο λέγουσιν, ὁμολογοίην ἂν ἔγωγε οὐ κατὰ τούτους εἶναι ῥήτωρ. οὗτοι μὲν οὖν, ὥσπερ ἐγὼ λέγω, ἤ τι ἢ οὐδὲν ἀληθὲς εἰρήκασιν, ὑμεῖς δέ μου ἀκούσεσθε πᾶσαν τὴν ἀλήθειαν—οὐ μέντοι μὰ Δία, ὦ ἄνδρες Ἀθηναῖοι, κεκαλλιεπημένους γε λόγους, ὥσπερ οἱ τούτων, ῥήμασί τε καὶ ὀνόμασιν οὐδὲ κεκοσμημένους, ἀλλ᾽ ἀκούσεσθε εἰκῇ λεγόμενα τοῖς ἐπιτυχοῦσιν ὀνόμασιν—πιστεύω γὰρ δίκαια εἶναι ἃ λέγω—καὶ μηδεὶς ὑμῶν προσδοκησάτω ἄλλως: οὐδὲ γὰρ ἂν δήπου πρέποι, ὦ ἄνδρες, τῇδε τῇ ἡλικίᾳ ὥσπερ μειρακίῳ πλάττοντι λόγους εἰς ὑμᾶς εἰσιέναι. καὶ μέντοι καὶ πάνυ, ὦ ἄνδρες Ἀθηναῖοι, τοῦτο ὑμῶν δέομαι καὶ παρίεμαι: ἐὰν διὰ τῶν αὐτῶν λόγων ἀκούητέ μου ἀπολογουμένου δι᾽ ὧνπερ εἴωθα λέγειν καὶ ἐν ἀγορᾷ ἐπὶ τῶν τραπεζῶν, ἵνα ὑμῶν πολλοὶ ἀκηκόασι, καὶ ἄλλοθι, μήτε θαυμάζειν μήτε θορυβεῖν τούτου ἕνεκα. ἔχει γὰρ οὑτωσί. νῦν ἐγὼ πρῶτον ἐπὶ δικαστήριον ἀναβέβηκα, ἔτη γεγονὼς ἑβδομήκοντα: ἀτεχνῶς οὖν ξένως ἔχω τῆς ἐνθάδε λέξεως. ὥσπερ οὖν ἄν, εἰ τῷ ὄντι ξένος ἐτύγχανον ὤν, συνεγιγνώσκετε δήπου ἄν μοι εἰ ἐν ἐκείνῃ τῇ φωνῇ τε καὶ τῷ τρόπῳ  ἔλεγον ἐν οἷσπερ ἐτεθράμμην, καὶ δὴ καὶ νῦν τοῦτο ὑμῶν δέομαι δίκαιον, ὥς γέ μοι δοκῶ, τὸν μὲν τρόπον τῆς λέξεως ἐᾶν—ἴσως μὲν γὰρ χείρων, ἴσως δὲ βελτίων ἂν εἴη—αὐτὸ δὲ τοῦτο σκοπεῖν καὶ τούτῳ τὸν νοῦν προσέχειν, εἰ δίκαια λέγω ἢ μή: δικαστοῦ μὲν γὰρ αὕτη ἀρετή, ῥήτορος δὲ τἀληθῆ λέγειν.'

Note

Corpora in a wide range of classical languages are available. For a list, see here.

We can obtain a list of available Ancient Greek corpora:

from cltk.data.fetch import FetchCorpus

FetchCorpus('grc').list_corpora  # Latin: lat, Greek: grc
['grc_software_tlgu',
 'grc_text_perseus',
 'phi7',
 'tlg',
 'greek_proper_names_cltk',
 'grc_models_cltk',
 'greek_treebank_perseus',
 'greek_treebank_gorman',
 'greek_lexica_perseus',
 'greek_training_set_sentence_cltk',
 'greek_word2vec_cltk',
 'greek_text_lacus_curtius',
 'grc_text_first1kgreek',
 'grc_text_tesserae']

Or Latin corpora:

corpus_downloader = FetchCorpus('lat')
corpus_downloader.list_corpora
['lat_text_perseus',
 'lat_treebank_perseus',
 'lat_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'lat_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia',
 'lat_text_tesserae',
 'cltk_lat_lewis_elementary_lexicon']
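
The corpus names mix several kinds of resource (texts, treebanks, lexica, models). As a rough sketch, assuming that text corpora can be recognised by a `_text_` fragment in their name (a naming pattern inferred from the list above, not a documented rule), we might pick out the text corpora like this. The list is copied from the output above:

```python
# Corpus names copied from the Latin output above.
latin_corpora = [
    'lat_text_perseus', 'lat_treebank_perseus', 'lat_text_latin_library',
    'phi5', 'phi7', 'latin_proper_names_cltk', 'lat_models_cltk',
    'latin_pos_lemmata_cltk', 'latin_treebank_index_thomisticus',
    'latin_lexica_perseus', 'latin_training_set_sentence_cltk',
    'latin_word2vec_cltk', 'latin_text_antique_digiliblt',
    'latin_text_corpus_grammaticorum_latinorum', 'latin_text_poeti_ditalia',
    'lat_text_tesserae', 'cltk_lat_lewis_elementary_lexicon',
]

# Assumption: names containing "_text_" identify text corpora
text_corpora = [name for name in latin_corpora if '_text_' in name]

for name in text_corpora:
    print(name)
```

Note that this simple filter misses corpora such as phi5 and phi7, whose names do not follow the pattern.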

We can download a corpus from the list of available corpora associated with the selected language:

corpus_downloader.import_corpus('lat_text_latin_library')

Note

By default, the data is downloaded to ~/cltk_data.

Once the Latin Library corpus has been downloaded, we can find the corpus files in:

~/cltk_data/lat/text/lat_text_latin_library/

import os

# Expand ~ so that the path resolves to the current user's home directory
path = os.path.expanduser('~/cltk_data/lat/text/lat_text_latin_library')
!ls $path/vergil
aen1.txt  aen2.txt  aen6.txt  ec1.txt   ec4.txt   ec8.txt   geo3.txt
aen10.txt aen3.txt  aen7.txt  ec10.txt  ec5.txt   ec9.txt   geo4.txt
aen11.txt aen4.txt  aen8.txt  ec2.txt   ec6.txt   geo1.txt
aen12.txt aen5.txt  aen9.txt  ec3.txt   ec7.txt   geo2.txt
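
The listing above is in plain alphabetical order, so aen10.txt appears before aen2.txt. If we want to work through the books in reading order, one option is a "natural" sort that compares the numeric parts of each filename as numbers. The following is a minimal sketch using a handful of the filenames shown above; `natural_key` is a hypothetical helper, not part of cltk:

```python
import re


def natural_key(filename):
    """Split a name into text and integer chunks so 'aen2' sorts before 'aen10'."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r'(\d+)', filename)]


# A few of the filenames from the listing above, in alphabetical order
sample_files = ['aen1.txt', 'aen10.txt', 'aen11.txt', 'aen12.txt', 'aen2.txt',
                'aen3.txt', 'ec1.txt', 'ec10.txt', 'ec2.txt']

ordered = sorted(sample_files, key=natural_key)
print(ordered)
```

The same key function could be applied to filenames read from the corpus directory, for example via `os.listdir`.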

We can open a sample text as we would any text file.

For example, here’s a fragment from Vergil’s Aeneid:

with open(f'{path}/vergil/aen1.txt') as f:
    aeneid_1 = f.read()

# Display a fragment of the file
print(aeneid_1[1000:1200])
Tyrias olim quae verteret arces;    20 
hinc populum late regem belloque superbum 
venturum excidio Libyae: sic volvere Parcas. 
Id metuens, veterisque memor Saturnia belli, 
prima quod ad Troiam pro 
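
The fragment shows that the file includes occasional line numbers at the end of a verse (the 20 in the first line above), along with trailing spaces. Before passing such text to any analysis tools, we may want to remove them. A minimal sketch, assuming that a run of digits at the very end of a line is always a line number in this file layout:

```python
import re

# Three lines copied from the fragment above, including trailing whitespace
fragment = (
    "Tyrias olim quae verteret arces;    20 \n"
    "hinc populum late regem belloque superbum \n"
    "venturum excidio Libyae: sic volvere Parcas. \n"
)


def strip_line_number(line):
    """Remove a trailing line number (assumed: digits at end of line) and trailing spaces."""
    return re.sub(r'\s+\d+\s*$', '', line).rstrip()


cleaned = [strip_line_number(line) for line in fragment.splitlines()]

for line in cleaned:
    print(line)
```

This would also strip a genuine number at the end of a line of prose, so it is better suited to verse files laid out like this one.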

Natural Language Processing With CLTK

Language-specific processing pipelines are available, each bundling the processing steps appropriate to a particular language:

from cltk import NLP

# Load the default Pipeline for Latin
cltk_nlp = NLP(language="lat")
‎𐤀 CLTK version '1.0.14'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinNERProcess`, `LatinLexiconProcess`.

Pipelines are also available for other languages:

# Mapping from ISO language code to the corresponding pipeline class
from cltk.nlp import iso_to_pipeline

iso_to_pipeline
{'akk': cltk.languages.pipelines.AkkadianPipeline,
 'ang': cltk.languages.pipelines.OldEnglishPipeline,
 'arb': cltk.languages.pipelines.ArabicPipeline,
 'arc': cltk.languages.pipelines.AramaicPipeline,
 'chu': cltk.languages.pipelines.OCSPipeline,
 'cop': cltk.languages.pipelines.CopticPipeline,
 'enm': cltk.languages.pipelines.MiddleEnglishPipeline,
 'frm': cltk.languages.pipelines.MiddleFrenchPipeline,
 'fro': cltk.languages.pipelines.OldFrenchPipeline,
 'gmh': cltk.languages.pipelines.MiddleHighGermanPipeline,
 'got': cltk.languages.pipelines.GothicPipeline,
 'grc': cltk.languages.pipelines.GreekPipeline,
 'hin': cltk.languages.pipelines.HindiPipeline,
 'lat': cltk.languages.pipelines.LatinPipeline,
 'lzh': cltk.languages.pipelines.ChinesePipeline,
 'non': cltk.languages.pipelines.OldNorsePipeline,
 'pan': cltk.languages.pipelines.PanjabiPipeline,
 'pli': cltk.languages.pipelines.PaliPipeline,
 'san': cltk.languages.pipelines.SanskritPipeline}
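
Not every language in the Glottolog list has a pipeline. As a small sketch, using the codes copied from the output above (so it reflects the version shown there and nothing newer), we might check a set of language codes against the supported ones. `split_by_pipeline_support` is a hypothetical helper:

```python
# ISO codes copied from the iso_to_pipeline output above
PIPELINE_CODES = {
    'akk', 'ang', 'arb', 'arc', 'chu', 'cop', 'enm', 'frm', 'fro', 'gmh',
    'got', 'grc', 'hin', 'lat', 'lzh', 'non', 'pan', 'pli', 'san',
}


def split_by_pipeline_support(iso_codes):
    """Return (supported, unsupported) lists, preserving the input order."""
    supported = [code for code in iso_codes if code in PIPELINE_CODES]
    unsupported = [code for code in iso_codes if code not in PIPELINE_CODES]
    return supported, unsupported


supported, unsupported = split_by_pipeline_support(['lat', 'grc', 'xae', 'enm', 'hbo'])
print('Pipeline available:', supported)
print('No pipeline listed:', unsupported)
```

With cltk installed, a membership test such as `'lat' in iso_to_pipeline` performs the same check against the live mapping.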

A pipeline is made up of a sequence of process steps:

cltk_nlp.pipeline.processes
[cltk.alphabet.processes.LatinNormalizeProcess,
 cltk.dependency.processes.LatinStanzaProcess,
 cltk.embeddings.processes.LatinEmbeddingsProcess,
 cltk.stops.processes.StopsProcess,
 cltk.ner.processes.LatinNERProcess,
 cltk.lexicon.processes.LatinLexiconProcess]
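
To get a feel for what chaining process steps involves, here is a toy illustration in plain Python. It is not how cltk implements its pipelines: the three functions, the dictionary used as a document, and the two-word stop list are all invented for the example. It simply shows the general idea of passing a document through an ordered list of steps, each of which adds to it:

```python
def normalize(doc):
    """Collapse runs of whitespace in the raw text."""
    doc['normalized'] = ' '.join(doc['raw'].split())
    return doc


def tokenize(doc):
    """Split the normalized text into tokens on single spaces."""
    doc['tokens'] = doc['normalized'].split(' ')
    return doc


def remove_stops(doc):
    """Drop tokens found in a tiny illustrative stop list (not cltk's)."""
    stops = {'est', 'in'}
    doc['content_tokens'] = [t for t in doc['tokens'] if t.lower() not in stops]
    return doc


# The order matters: each step relies on what earlier steps have added
toy_pipeline = [normalize, tokenize, remove_stops]


def run_pipeline(text, processes):
    """Apply each process in turn to a simple dictionary-based document."""
    doc = {'raw': text}
    for process in processes:
        doc = process(doc)
    return doc


result = run_pipeline('Gallia  est omnis divisa in partes tres', toy_pipeline)
print(result['content_tokens'])
```

The real processes listed above do considerably more work (for example, dependency parsing and named entity recognition), but they are run in order in a broadly similar spirit.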