How to import parallel corpora? #128

@vintagentleman

Hi,

I’m struggling to convert TEI-encoded parallel corpora with Pepper.

The most straightforward approach proposed by TEI seems to be constructing link groups that connect the aligned linguistic units. This is the approach used in the Opus-MontenegrinSubs corpus, where, alongside the English and Montenegrin texts themselves, a separate file contains nothing but the alignment links:

<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="alignment"
    corresp="opusmonte_en.ana.xml opusmonte_cnr.ana.xml">
  <link n="0:0" target="#Damages.S1.dam0101.SL1-en #Damages.S1.dam0101.SL1-cnr"/>
  ...
</linkGrp>
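
To make the structure concrete, here is a minimal Python sketch of how the aligned pairs encoded by such a link group could be read. It uses only the standard library's `xml.etree.ElementTree`; the `read_alignment` helper and the trimmed one-link `SAMPLE` are illustrative assumptions, not part of Pepper or of the corpus tooling.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Trimmed sample: only the first link of the alignment file is reproduced.
SAMPLE = """<linkGrp xmlns="http://www.tei-c.org/ns/1.0" type="alignment"
    corresp="opusmonte_en.ana.xml opusmonte_cnr.ana.xml">
  <link n="0:0" target="#Damages.S1.dam0101.SL1-en #Damages.S1.dam0101.SL1-cnr"/>
</linkGrp>"""


def read_alignment(xml_text):
    """Return (aligned file names, list of ID tuples) for a TEI linkGrp.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    # @corresp on the linkGrp lists the files being aligned.
    files = root.get("corresp", "").split()
    pairs = []
    for link in root.iter(f"{{{TEI_NS}}}link"):
        # @target holds whitespace-separated '#id' pointers.
        ids = tuple(t.lstrip("#") for t in link.get("target", "").split())
        pairs.append(ids)
    return files, pairs


files, pairs = read_alignment(SAMPLE)
print(files)
print(pairs)
```

Each tuple pairs the `xml:id` of an English segment with that of its Montenegrin counterpart, in the order given by the `corresp` attribute of the link group.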

Additionally, every aligned segment has a @corresp attribute pointing to the @xml:id of its translation equivalent, like this:

<ab n="10" xml:id="Damages.S1.dam0101.SL15-cnr"
    corresp="#Damages.S1.dam0101.SL15-en">
  ...
</ab>
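
The same alignment can be recovered from the segments themselves. The sketch below, again plain Python with `xml.etree.ElementTree`, collects every element that carries both `xml:id` and `corresp`. The `corresp_map` helper, the wrapping `<body>` element, and the placeholder text content are assumptions made so that the sample parses on its own.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined 'xml:' prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumed wrapper and placeholder content; only the <ab> attributes
# come from the corpus excerpt.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <ab n="10" xml:id="Damages.S1.dam0101.SL15-cnr"
      corresp="#Damages.S1.dam0101.SL15-en">placeholder</ab>
</body>"""


def corresp_map(xml_text):
    """Map each segment's xml:id to the IDs named in its @corresp.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    mapping = {}
    for elem in root.iter():
        seg_id = elem.get(XML_ID)
        corresp = elem.get("corresp")
        if seg_id and corresp:
            # @corresp may hold several whitespace-separated pointers.
            mapping[seg_id] = [t.lstrip("#") for t in corresp.split()]
    return mapping


print(corresp_map(SAMPLE))
```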

However, the TEI importer fails to process this corpus, with errors like these:

Cannot map 'salt:/0/OpusMonte.TEI/opusmonte_cnr.ana' with module 'TEIImporter', because of a mapping result was 'FAILED'.
Cannot map 'salt:/0/OpusMonte.TEI/opusmonte_en.ana' with module 'TEIImporter', because of a mapping result was 'FAILED'.
An exception was thrown by the mapper threads 'Thread[TEIImporter_mapper(salt:/OpusMonte.TEI/opusmonte_cnr.ana),5,TEIImporter_mapperGroup]'.
org.corpus_tools.pepper.modules.exceptions.PepperModuleXMLResourceException: Cannot read xml-file'file:/D:/Users/k.sipunin/Downloads/OpusMonte.TEI/opusmonte_cnr.ana.xml', because of a nested exception.
        at org.corpus_tools.pepper.common.PepperUtil.readXMLResource(PepperUtil.java:661)
        at org.corpus_tools.pepper.impl.PepperMapperImpl.readXMLResource(PepperMapperImpl.java:278)
        at org.corpus_tools.peppermodules.TEIModules.TEIMapper.mapSDocument(TEIMapper.java:58)
        at org.corpus_tools.pepper.impl.PepperMapperControllerImpl.map(PepperMapperControllerImpl.java:251)
        at org.corpus_tools.pepper.impl.PepperMapperControllerImpl.run(PepperMapperControllerImpl.java:188)
Caused by: org.corpus_tools.salt.exceptions.SaltInsertionException: Cannot insert object 'lemma=opasni' into container 'SStructureImpl(null)[lemma=opasni], salt::unit=word], ana=mte:Agpfpny]'.  Because an id already exists: lemma=opasni.

What might be the problem? And more generally, what is the proper way to encode parallel corpora so that they can be imported into ANNIS (the presence of a sample here suggests that it’s doable)?
