NLP Tools Overview

1. Stanza

Model background

Stanza is a Python NLP package that applies neural networks to text analysis tasks, including tokenisation, lemmatisation, part-of-speech (POS) tagging, and named-entity recognition (NER). Pre-trained models are available for seventy languages, and users can also train their own.

Installation

Stanza can be installed from the terminal with $ pip install stanza. Once the library has been imported into a Python file, a language model can be downloaded and a neural pipeline constructed.

import stanza
stanza.download('la') # download Latin model
nlp = stanza.Pipeline('la') # initialise Latin neural pipeline

Notes

Stanza offers five pre-trained Latin language models. Each model is trained on a different UD data set:

ITTB model: Index Thomisticus Treebank, medieval Latin
LLCT model: Late Latin Charter Treebank, medieval Latin
PROIEL model: PROIEL Treebank, medieval and classical Latin
Perseus model: Latin Dependency Treebank, classical Latin
UDante model: Latin texts of Dante Alighieri, medieval Latin

Comparison of model performances is available here.

For my experiment, I used the ITTB model, as it has the highest performance rating across tasks and consists of medieval Latin, like my datasets.
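The ITTB model can also be requested by name instead of relying on the default Latin package. The sketch below assumes that Stanza's package argument accepts 'ittb' for Latin. The helper names build_ittb_pipeline and doc_to_rows are hypothetical and are not part of Stanza; doc_to_rows only relies on the sentences/words/text/lemma/upos attributes of a Stanza document.

```python
def build_ittb_pipeline():
    """Download and load the Latin ITTB package (assumed package name: 'ittb')."""
    # Imported inside the function so doc_to_rows can be used without Stanza installed.
    import stanza

    stanza.download("la", package="ittb")
    return stanza.Pipeline("la", package="ittb")


def doc_to_rows(doc):
    """Flatten a Stanza-style document into (text, lemma, upos) tuples."""
    return [
        (word.text, word.lemma, word.upos)
        for sentence in doc.sentences
        for word in sentence.words
    ]


# Usage (requires Stanza and a one-off model download):
# nlp = build_ittb_pipeline()
# for row in doc_to_rows(nlp("Gallia est omnis divisa in partes tres.")):
#     print(row)
```

Flattening the document into plain tuples makes it easier to compare Stanza's output with the other tools in this overview.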

2. The Classical Language Toolkit (CLTK)

Background

CLTK is an NLP Python library designed to support low-resource classical languages like Latin and Greek. For Latin analysis, CLTK relies on Stanza's neural pipeline and ITTB model to provide certain functionalities.

Installation

CLTK can be installed through pip in Terminal with $ pip install cltk.

Once installed, the NLP class should be imported to build a pipeline.

from cltk import NLP
cltk_nlp = NLP(language="lat") # Load default Latin pipeline

Notes

In order for CLTK to analyse Latin text, it requires access to a lexicon. The first time I ran the pipeline's analyze function, a prompt appeared in my terminal asking to download the lexicon to my computer. The lexicon (Charlton T. Lewis’s An Elementary Latin Dictionary (1890)) then appeared in the CLTK resources folder cltk_data after approximately three minutes of downloading time.

3. UDPipe

Background

UDPipe is an NLP pipeline for a variety of tasks. Supporting 50 languages, UDPipe can be accessed through several programming languages, including R, Python, and Perl, as well as through its REST API. Like Stanza, UDPipe uses an artificial neural network to process data and offers the opportunity to train new models.

Installation

UDPipe can be installed through pip in Terminal with $ pip install ufal-udpipe.

Once installed, the model can be imported and the pipeline constructed as shown. Unlike Stanza and CLTK, it is necessary to download the language model to one's computer and provide the path.

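from ufal.udpipe import Model, Pipeline # import the classes used below
model_path = "./" # provide path to language model
model = Model.load(model_path) # load the downloaded model file
# tokenise raw text, use the default tagger and parser, output CoNLL-U
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")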
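Because the pipeline above is built with "conllu" as its output format, its results come back as a CoNLL-U string rather than as Python objects. A minimal sketch of pulling the form, lemma, and UPOS columns out of such a string follows. parse_conllu is a hypothetical helper, and the sample string is hand-written for illustration rather than real UDPipe output.

```python
def parse_conllu(conllu_text):
    """Return (form, lemma, upos) tuples for the word lines of a CoNLL-U string."""
    rows = []
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:  # a CoNLL-U word line has exactly ten columns
            continue
        token_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        # Skip multiword ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not token_id.isdigit():
            continue
        rows.append((form, lemma, upos))
    return rows


# Hand-written sample in CoNLL-U layout (an assumption, not tool output).
sample = (
    "# text = Deus est\n"
    "1\tDeus\tdeus\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\test\tsum\tAUX\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
print(parse_conllu(sample))  # [('Deus', 'deus', 'NOUN'), ('est', 'sum', 'AUX')]
```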

Notes

UDPipe offers a similar variety of trained Latin models to Stanza. For this experiment, I also selected the ITTB model for consistency with Stanza and CLTK.

UDPipe is designed to process CoNLL-U (.conllu) files. Because I was working with .csv files, I had to add some checks in my code to prevent errors (see workbook). However, this was ultimately easy to account for.
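The workbook's checks are not reproduced here. As a rough illustration of the kind of guard that helps when CSV cells are passed to a text pipeline, the sketch below skips missing or blank cells before they reach the tagger. The column name text and the helper iter_texts are assumptions made for this example.

```python
import csv
import io


def iter_texts(csv_file, column="text"):
    """Yield non-empty, stripped strings from one column of an open CSV file."""
    for row in csv.DictReader(csv_file):
        value = row.get(column)
        if value is None:  # column missing, or row shorter than the header
            continue
        value = value.strip()
        if value:  # ignore cells that are empty or only whitespace
            yield value


# In-memory stand-in for a CSV file; the contents are invented for the example.
sample = io.StringIO("id,text\n1,Deus est\n2,\n3,   \n4,In principio\n")
print(list(iter_texts(sample)))  # ['Deus est', 'In principio']
```

Each yielded string can then be handed to the pipeline one at a time.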

4. TreeTagger

Background

TreeTagger is a downloadable tool for POS tagging and lemmatisation, developed at the University of Stuttgart, that bases its tagging on binary decision trees. It supports a variety of languages, can be trained for new ones, and can be called from multiple programming languages through wrappers. Its Latin language package was trained on the PROIEL, Perseus, and Index Thomisticus treebanks. RNNTagger, another popular tagger that I did not include in this experiment, was developed by the same team and extends TreeTagger's capabilities with a deep learning library.

Installation

The TreeTagger home page offers multiple download options based on computer type. I initially downloaded the ARM-64 package for my MacBook; however, running my code led me to discover that this package does not support the Apple M3 chip. Downloading the MacOSX-Intel package worked as an alternative.

To use the TreeTagger package within a Python script, I installed the Python TreeTagger wrapper with pip: $ pip install treetaggerwrapper.

I took the following steps to initialise the model:

import treetaggerwrapper
treetagger_dir = "PATH/TO/DOWNLOAD" # folder where TreeTagger was unpacked
tagger = treetaggerwrapper.TreeTagger(TAGLANG='la', TAGDIR=treetagger_dir, TAGOPT='-token -lemma -sgml -quiet')

Notes

TreeTagger uses its own complex POS tagset rather than UPOS. As a result, I used ChatGPT to write an additional function in the TreeTagger Jupyter Notebook to convert as many tags as possible into UPOS. I was not able to account for all cases, however, and I could not find a complete list of TreeTagger tags online.
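The notebook's conversion function is not reproduced here. The sketch below only illustrates the general approach: look up the part of the tag before any colon in a small table, and fall back to the UPOS tag X for anything unrecognised. The TreeTagger tag names in the table are assumptions for illustration, not a verified list of the Latin tagset, and to_upos and parse_tagged_line are hypothetical helpers.

```python
# Illustrative prefix-to-UPOS table. The TreeTagger tag names are assumptions
# made for this sketch and should be checked against real tagger output.
PREFIX_TO_UPOS = {
    "N": "NOUN",
    "V": "VERB",
    "ADJ": "ADJ",
    "ADV": "ADV",
    "PREP": "ADP",
    "PRON": "PRON",
    "CC": "CCONJ",
    "CS": "SCONJ",
}


def to_upos(treetagger_tag):
    """Map a TreeTagger tag to UPOS by its prefix; unknown tags become 'X'."""
    prefix = treetagger_tag.split(":", 1)[0]
    return PREFIX_TO_UPOS.get(prefix, "X")


def parse_tagged_line(line):
    """Split one 'token<TAB>tag<TAB>lemma' line into (token, upos, lemma)."""
    token, tag, lemma = line.split("\t")
    return token, to_upos(tag), lemma


print(parse_tagged_line("in\tPREP\tin"))  # ('in', 'ADP', 'in')
```

Falling back to X keeps unmapped tags visible in the results, so they can be counted and added to the table later.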

5. LatinCy

Background

LatinCy is a set of pipelines trained for NLP in Latin by Patrick J. Burns. Based on the NLP platform spaCy, LatinCy offers tools for tokenisation, lemmatisation, POS tagging, and other tasks such as NER. LatinCy offers three models based on size: 'la-core-web-sm', 'la-core-web-md', and 'la-core-web-lg'. I was unable to load the LatinCy model due to issues with the installation of spaCy (see Notes), so I did not end up testing it.

Installation

In order to use LatinCy, it is first necessary to install the spaCy library with $ pip install -U spacy.

Once spaCy is installed, it is necessary to download a Latin model with one of the following commands:

$ pip install "la-core-web-sm @ https://huggingface.co/latincy/la_core_web_sm/resolve/main/la_core_web_sm-any-py3-none-any.whl"

$ pip install "la-core-web-md @ https://huggingface.co/latincy/la_core_web_md/resolve/main/la_core_web_md-any-py3-none-any.whl"

$ pip install "la-core-web-lg @ https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl"

Within the Python file, the model can be accessed with the following command:

import spacy
nlp = spacy.load('la_core_web_lg') # the name uses underscores, unlike the pip package name
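The pip commands use hyphenated names (la-core-web-lg), while spacy.load takes the underscored package name (la_core_web_lg). Since I could not get the model to load, the sketch below is untested against LatinCy itself. It only shows one way to check that the model package is visible to Python before trying to load it. model_is_installed and load_latincy are hypothetical helpers.

```python
import importlib.util


def model_is_installed(package_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


def load_latincy(name="la_core_web_lg"):
    """Load a LatinCy model, failing early with a clear message if it is missing."""
    if not model_is_installed(name):
        raise RuntimeError(
            f"{name} is not installed; run the matching pip command first"
        )
    import spacy  # imported only once the model package is known to exist

    return spacy.load(name)


# Usage (requires spaCy and one of the LatinCy models):
# nlp = load_latincy("la_core_web_lg")
```

A check like this separates "the model was never installed" from problems inside spaCy itself, which can help narrow down installation issues.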
