NLP Tools Overview
Stanza is a Python NLP package that applies neural networks to text analysis tasks, including tokenisation, lemmatisation, part-of-speech (POS) tagging, and named entity recognition (NER). Pre-trained models are available for seventy languages, and users can also train their own models.
Stanza can be installed with pip from the terminal: $ pip install stanza.
Once the library has been imported into a Python file, a language model can be selected and a neural pipeline constructed.
import stanza
stanza.download('la') # download Latin model
nlp = stanza.Pipeline('la') # initialise Latin neural pipeline
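Calling the pipeline on a string returns a document whose sentences hold word objects. The following is a minimal sketch of reading tokens, lemmas, and UPOS tags from such a document. The helper names `stanza_rows` and `analyse_with_stanza` are illustrative assumptions and are not taken from the experiment's workbook.

```python
def stanza_rows(doc):
    """Collect (token, lemma, UPOS) triples from a Stanza document."""
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            rows.append((word.text, word.lemma, word.upos))
    return rows


def analyse_with_stanza(text, lang="la"):
    """Download the model if needed, run the pipeline, return the triples."""
    import stanza  # imported lazily so stanza_rows works without Stanza installed

    stanza.download(lang)
    nlp = stanza.Pipeline(lang)
    return stanza_rows(nlp(text))
```

A call such as `analyse_with_stanza("Gallia est omnis divisa in partes tres.")` would then return one triple per word.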
Stanza offers five pre-trained Latin language models. Each model is trained on a different UD data set:
- ITTB model: Index Thomisticus Treebank, medieval Latin
- LLCT model: Late Latin Charter Treebank, medieval Latin
- PROIEL model: PROIEL Treebank, medieval and classical Latin
- Perseus model: Latin Dependency Treebank, classical Latin
- UDante model: Latin texts of Dante Alighieri, medieval Latin
Comparison of model performances is available here.
For my experiment, I used the ITTB model, as it has the highest reported performance across tasks and is trained on medieval Latin, like my datasets.
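A specific model can be requested through the `package` argument of `stanza.download` and `stanza.Pipeline`. The sketch below assumes that the package identifiers are the lowercase treebank names from the list above; the helper `build_latin_pipeline` is hypothetical and only illustrates the selection step.

```python
# Assumed package identifiers: lowercase forms of the treebank names listed above.
LATIN_PACKAGES = ("ittb", "llct", "proiel", "perseus", "udante")


def build_latin_pipeline(package="ittb"):
    """Build a Latin pipeline for one specific treebank model."""
    if package not in LATIN_PACKAGES:
        raise ValueError(f"unknown Latin package: {package!r}")
    import stanza  # lazy import: only needed when a pipeline is actually built

    stanza.download("la", package=package)
    return stanza.Pipeline("la", package=package)
```

Validating the name first means a typo fails immediately rather than after a download attempt.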
CLTK is an NLP Python library designed to support low-resource classical languages like Latin and Greek. For Latin analysis, CLTK relies on Stanza's neural pipeline and ITTB model to provide certain functionalities.
CLTK can be installed through pip in Terminal with $ pip install cltk.
Once installed, the NLP class should be imported to build a pipeline.
from cltk import NLP
cltk_nlp = NLP(language="lat") # Load default Latin pipeline
To analyse Latin text, CLTK requires access to a lexicon. When I ran the analyze method, a prompt appeared in my terminal asking to download the lexicon to my computer. The lexicon (Charlton T. Lewis’s An Elementary Latin Dictionary, 1890) was saved to the CLTK resources folder cltk_data after approximately three minutes of downloading.
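As a rough illustration of the analysis step, the sketch below runs the pipeline and reads the token, lemma, and UPOS tag from each word. It assumes the CLTK 1.x interface, in which `analyze` returns a document with a `words` list; the helper names are illustrative and other CLTK versions may differ.

```python
def cltk_rows(doc):
    """Collect (token, lemma, UPOS) triples from a CLTK document."""
    return [(word.string, word.lemma, word.upos) for word in doc.words]


def analyse_with_cltk(text):
    """Build the default Latin pipeline and return the triples for `text`."""
    from cltk import NLP  # lazy import so cltk_rows works without CLTK installed

    cltk_nlp = NLP(language="lat")
    return cltk_rows(cltk_nlp.analyze(text=text))
```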
UDPipe is an NLP pipeline for a diverse variety of tasks. Supporting 50 languages, UDPipe can be accessed from several programming languages, including R, Python, and Perl, as well as through its REST API. Like Stanza, UDPipe uses an artificial neural network to process data and offers the opportunity to train new models.
UDPipe can be installed through pip in Terminal with $ pip install ufal-udpipe.
Once installed, the library can be imported and the pipeline constructed as shown. Unlike with Stanza and CLTK, the language model must first be downloaded to one's computer and its path provided.
from ufal.udpipe import Model, Pipeline
model_path = "./" # provide path to language model
model = Model.load(model_path) # returns None if the model cannot be loaded
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
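With "conllu" as the output format, the pipeline returns its analysis as a CoNLL-U string, with one tab-separated line per token. The sketch below shows one way to run the pipeline and pull the form, lemma, and UPOS columns out of that string. The helper names are assumptions for illustration and do not come from the workbook.

```python
def conllu_rows(conllu_text):
    """Extract (form, lemma, UPOS) triples from CoNLL-U text."""
    rows = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        columns = line.split("\t")
        if len(columns) < 4 or not columns[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        rows.append((columns[1], columns[2], columns[3]))
    return rows


def analyse_with_udpipe(text, model_path):
    """Run UDPipe on raw text and return the triples."""
    from ufal.udpipe import Model, Pipeline, ProcessingError  # lazy import

    model = Model.load(model_path)
    if model is None:
        raise FileNotFoundError(f"cannot load UDPipe model from {model_path!r}")
    pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
    error = ProcessingError()
    output = pipeline.process(text, error)
    if error.occurred():
        raise RuntimeError(error.message)
    return conllu_rows(output)
```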
UDPipe offers a similar variety of trained Latin models to Stanza. For this experiment, I also selected the ITTB model for consistency with Stanza and CLTK.
UDPipe is designed around the CoNLL-U format (.conllu files). Because I was working with .csv files, I needed to add some checks in my code to prevent errors (see workbook). However, this was ultimately easy to account for.
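The exact checks are in the workbook and are not reproduced here. As a hedged illustration of the kind of guard involved, the sketch below reads one column of a CSV file and skips missing or blank cells before any text reaches the pipeline. The function name and the column name passed to it are hypothetical.

```python
import csv


def texts_from_csv(handle, column):
    """Yield the non-empty text cells from one column of a CSV file."""
    for row in csv.DictReader(handle):
        cell = (row.get(column) or "").strip()
        if cell:  # skip missing or blank cells instead of passing them on
            yield cell
```

Each yielded string can then be handed to the pipeline's process call one at a time.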
TreeTagger is a downloadable tool for POS tagging and lemmatisation, developed at the University of Stuttgart and based on binary decision trees. It supports a variety of languages, can be trained for new ones, and can be used from multiple programming languages through wrappers. Its Latin language package was trained on the PROIEL, Perseus, and Index Thomisticus treebanks. RNNTagger, another popular tool that I did not include in this experiment, was developed by the same team and extends TreeTagger's capabilities with a deep learning library.
The TreeTagger home page offers multiple download options based on computer type. I initially downloaded the ARM-64 package for my MacBook; however, running my code led me to discover that this package does not support the Apple M3 chip. Downloading the MacOSX-Intel package worked as an alternative.
To use TreeTagger within a Python script, I installed the Python TreeTagger wrapper with pip: $ pip install treetaggerwrapper.
I took the following steps to initialise the model:
import treetaggerwrapper
treetagger_dir = "PATH/TO/DOWNLOAD" # directory containing the TreeTagger download
tagger = treetaggerwrapper.TreeTagger(TAGLANG='la', TAGDIR=treetagger_dir, TAGOPT='-token -lemma -sgml -quiet')
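The wrapper's `tag_text` method returns a list of strings, with token, tag, and lemma separated by tabs. The following sketch splits those strings into triples; the helper name `treetagger_rows` is an assumption made for illustration.

```python
def treetagger_rows(tagged_lines):
    """Split TreeTagger output lines into (token, tag, lemma) triples."""
    rows = []
    for line in tagged_lines:
        parts = line.split("\t")
        if len(parts) == 3:  # ignore lines that are not token/tag/lemma, e.g. SGML markup
            rows.append(tuple(parts))
    return rows
```

It would be called as `treetagger_rows(tagger.tag_text(text))`. The wrapper also provides `treetaggerwrapper.make_tags`, which performs a similar conversion into named tuples.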
TreeTagger uses its own, more complex POS tag set rather than UPOS. As a result, I used ChatGPT to write an additional function in the TreeTagger Jupyter Notebook to convert as many tags as possible into UPOS. I was not able to account for all cases, however, and I could not find a complete list of TreeTagger tags online.
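The conversion function itself lives in the notebook and is not reproduced here. As a rough sketch of the general approach, a lookup table can map the leading part of each tag to UPOS and fall back to `X` (the UPOS label for "other") when a tag is not covered. The tag prefixes in the table below are illustrative assumptions, not a verified list of the tags that the Latin TreeTagger model produces.

```python
# Illustrative mapping only: these tag prefixes are assumptions, not a verified
# list of the tags produced by the Latin TreeTagger parameter file.
PREFIX_TO_UPOS = {
    "N": "NOUN",
    "V": "VERB",
    "ADJ": "ADJ",
    "ADV": "ADV",
    "PREP": "ADP",
    "PRON": "PRON",
    "CC": "CCONJ",
    "CS": "SCONJ",
}


def to_upos(tag):
    """Map a TreeTagger tag to UPOS, falling back to 'X' for unknown tags."""
    prefix = tag.split(":", 1)[0].upper()
    return PREFIX_TO_UPOS.get(prefix, "X")
```

Counting how often the fallback is returned gives a quick measure of how many tags remain unaccounted for.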
LatinCy is a set of pipelines trained for NLP in Latin by Patrick J. Burns. Based on the NLP platform spaCy, LatinCy offers tools for tokenisation, lemmatisation, POS tagging, and other tasks such as NER. LatinCy offers three models based on size: 'la_core_web_sm', 'la_core_web_md', and 'la_core_web_lg'. I was not able to load the LatinCy model due to issues with the installation of spaCy (see Notes). Therefore, I did not end up testing the model.
To use LatinCy, it is first necessary to install the spaCy library with $ pip install -U spacy.
Once spaCy is installed, a Latin model can be installed with one of the following commands:
$ pip install "la-core-web-sm @ https://huggingface.co/latincy/la_core_web_sm/resolve/main/la_core_web_sm-any-py3-none-any.whl"
$ pip install "la-core-web-md @ https://huggingface.co/latincy/la_core_web_md/resolve/main/la_core_web_md-any-py3-none-any.whl"
$ pip install "la-core-web-lg @ https://huggingface.co/latincy/la_core_web_lg/resolve/main/la_core_web_lg-any-py3-none-any.whl"
Within the Python file, the model can be loaded with the following commands:
import spacy
nlp = spacy.load('la_core_web_lg')
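Because the model could not be loaded for this experiment, the following was not run against LatinCy and is only a sketch of how a loaded spaCy pipeline is typically used. It relies on standard spaCy token attributes (`text`, `lemma_`, `pos_`); the helper names are illustrative.

```python
def spacy_rows(doc):
    """Collect (token, lemma, UPOS) triples from a spaCy document."""
    return [(token.text, token.lemma_, token.pos_) for token in doc]


def analyse_with_latincy(text, model="la_core_web_lg"):
    """Load a LatinCy model and return the triples for `text`."""
    import spacy  # lazy import so spacy_rows works without spaCy installed

    nlp = spacy.load(model)
    return spacy_rows(nlp(text))
```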