--------------------------
Latin Dependency Treebank
--------------------------
The Ancient Greek and Latin Dependency Treebank (AGLDT) is the
earliest treebank for Ancient Greek and Latin. The project started at
Tufts University in 2006 and is currently under development and
maintenance at Leipzig University-Tufts University.
The Ancient Greek and Latin Dependency Treebanks are built from the
work of dedicated students and researchers from across the world. Over
200 people have annotated texts; the hard work of those who have
contributed their annotations as part of the official treebanks is
credited within the data.
Cited from https://perseusdl.github.io/treebank_data/
September 13, 2017
For Latin, the following texts are included:
Author       Text
---------------------------------------
Augustus     Res Gestae
Caesar       Commentarii de Bello Gallico
Cicero       In Catilinam
Jerome       Vulgata
Vergil       Aeneid
Ovid         Metamorphoses
Petronius    Satyricon
Phaedrus     Fabulae
Propertius   Elegiae
Sallust      Bellum Catilinae
Suetonius    Life of Augustus
Tacitus      Historiae
HistCorp inclusion date
------------------------
January 30, 2017
Website
--------
https://perseusdl.github.io/treebank_data/
Licence
--------
Creative Commons Attribution-ShareAlike 3.0 United States
https://creativecommons.org/licenses/by-sa/3.0/us/
The HistCorp files
-------------------
On the HistCorp page, the Latin texts from the AGLDT corpus are
provided in a plain text format ('txt'), a tokenised format ('tok'),
and in a morphologically and syntactically annotated format ('anno').
The plain text files were created from the original AGLDT XML files
by extracting the text content and adding metadata from the XML files
in a TEI-compatible format at the top of each file. In addition, the
number of tokens for each file has been calculated from the tokenised
version of that file.
The tokenised files were created by extracting the words and sentence
boundaries from the original XML files.
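
The extraction step can be illustrated with a minimal sketch. It is not the
script used to build the HistCorp files. It assumes that each word element
carries its surface form in an attribute called form (an assumption, not
stated in this README), and the sample XML is a hypothetical miniature rather
than real AGLDT data.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a treebank file. The "form" attribute name
# is an assumption; check it against the actual XML files.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="Gallia"/>
    <word id="2" form="est"/>
    <word id="3" form="omnis"/>
    <word id="4" form="divisa"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="arma"/>
    <word id="2" form="virumque"/>
    <word id="3" form="cano"/>
  </sentence>
</treebank>"""


def tokenise(xml_text):
    """Return one list of word forms per sentence element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sentence in root.iter("sentence"):
        # Skip word elements that have no (or an empty) form attribute.
        forms = [w.get("form") for w in sentence.iter("word") if w.get("form")]
        sentences.append(forms)
    return sentences


def count_tokens(sentences):
    """Total number of tokens across all sentences."""
    return sum(len(forms) for forms in sentences)


# One sentence per line, tokens separated by spaces.
for forms in tokenise(SAMPLE):
    print(" ".join(forms))
```

Writing one sentence per line keeps the sentence boundaries recoverable from
the tokenised output, and count_tokens gives a per-file token total of the
kind reported in the plain text headers.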
The tagged and parsed files are unchanged from the ones found on the
AGLDT webpage. Information from the README file in the AGLDT package:
-----
The data have been semi-automatically annotated. The full tagset can
be consulted in TAGSET.xml. Each word is specified for a number of
attributes describing it. The @pos attribute is a 9-character long
string where each character has a particular meaning depending on its
position. In TAGSET.xml this logic is documented in all detail (the
file is derived from the one used in Arethusa, the online annotation
environment used for annotation). In TAGSET.txt there is a more easily
readable version of the tagset.
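
As an illustration of the positional logic, the sketch below splits a
9-character tag into named fields. The field order and the use of "-" for an
unspecified position are assumptions based on the usual Perseus-style
positional tagset; TAGSET.xml remains the authoritative reference, and the
example tags are made up for illustration.

```python
# Assumed field order for the nine positions; verify against TAGSET.xml.
FIELDS = (
    "pos", "person", "number", "tense", "mood",
    "voice", "gender", "case", "degree",
)


def decode_tag(tag):
    """Map each specified position of a positional tag to its field name.

    Positions holding "-" are treated as unspecified (an assumption)
    and are left out of the result.
    """
    if len(tag) != len(FIELDS):
        raise ValueError(f"expected a {len(FIELDS)}-character tag, got {tag!r}")
    return {name: char for name, char in zip(FIELDS, tag) if char != "-"}


# Illustrative tags only; the value codes are not expanded here.
print(decode_tag("v3spia---"))
print(decode_tag("n-s---fn-"))
```

The function deliberately returns the raw one-character codes. Expanding them
to full labels (for example "s" to "singular") would require the value tables
from TAGSET.xml.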
Data have been annotated using the following guidelines:
* [Guidelines for the Syntactic Annotation of Latin Treebanks
(1.3)](http://nlp.perseus.tufts.edu/syntax/treebank/ldt/1.5/docs/guidelines.pdf)
(GSALT)
In the present release the following new texts have been added:
* Res Gestae
* Historiae
Res Gestae, which was treebanked following a different annotation
scheme for syntactic labels, has been automatically converted to the
common annotation scheme of the GSALT (some unresolved cases from the
conversion may remain). The original syntactic labels (see
Harrington-tagset.pdf and Harrington-tagset-instructions.pdf) have
been preserved in the attribute @hrngtn.
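
To inspect the conversion, the converted and original labels can be listed
side by side. The sketch below is only an illustration: the attribute name
hrngtn comes from the text above, while the form and relation attribute names
and all label values in the sample are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. "form" and "relation" are assumed attribute names,
# and the label values are placeholders, not real tagset entries.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="rerum" relation="CONVERTED-1" hrngtn="ORIGINAL-1"/>
    <word id="2" form="gestarum" relation="CONVERTED-2" hrngtn="ORIGINAL-2"/>
    <word id="3" form="divi" relation="CONVERTED-3"/>
  </sentence>
</treebank>"""


def label_pairs(xml_text):
    """Return (form, converted label, original label) for every word
    that still carries an hrngtn attribute."""
    root = ET.fromstring(xml_text)
    pairs = []
    for word in root.iter("word"):
        original = word.get("hrngtn")
        if original is not None:
            pairs.append((word.get("form"), word.get("relation"), original))
    return pairs


for form, converted, original in label_pairs(SAMPLE):
    print(form, converted, original, sep="\t")
```

Words without an hrngtn attribute are skipped, so the same function can be
run on texts that were never annotated with the earlier scheme and will
simply return an empty list.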
The following texts have undergone a major revision in order to
improve their form and consistency within themselves and with the most
recently annotated texts, i.e., Fabulae, Life of Augustus, and
Historiae:
* In Catilinam
* Aeneis
* Commentarii de Bello Gallico
* Elegiae
More precisely, these texts have been modified thus:
* Addition of punctuation
* Addition of missing sentences and paragraphs
* Sentences restored in their correct order
* Enclitic particles (-que, -ve, -ne) restored in their correct position
* Univerbated coordinating elements (neque, nec)
* Part of speech chosen on the basis of Lewis-Short's A Latin
Dictionary and - if problems arise - Allen and Greenough’s A New
Latin Grammar
* Tagset correction for gerund and gerundive
* Some corrections related to the distinction adjective/pronouns and
deponent/passive
* APOS is annotated as appositive and not as apposition (i.e., the
label is on the noun considered to be the appositive)
* Some corrections related to verbal valency, the distinction between
adverbial/attributive participles, personal constructions (e.g.,
videor)
* Normalization of the use of AuxY and AuxZ
The following texts lack the preceding modifications, but punctuation
has been added:
* Bellum Catilinae
* Metamorphoses
* Satyricon
* Vulgata
The structure of the original XML files (i.e., the one according to
the XML schema which is digested in the Perseids platform, where
annotations are performed) has been changed in order to make it more
informative and easier to query. The treebank root element identifies
the version of the release (@version) and the cts for each text
(@cts). The (pseudo-TEI) header element contains information/credits
about the creation of the file. The biblStruct element contains
information about the ancient author and text, which helps
interpretation of @cts.
The original structure of sentence and word elements is preserved
with some normalization concerning non-linguistically relevant nodes:
@span has been deleted and some normalization has been applied to the
display of cts:urn values within sentence (these values are available
on a sentence level, and sometimes also on a word level).
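
A short sketch of how this structure might be queried follows. The element
and attribute names (treebank, header, biblStruct, sentence, word, @version,
@cts) are taken from the description above. The sample document, its
attribute values, the exact nesting of header and biblStruct, and the absence
of XML namespaces are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with placeholder values; real files may nest
# header and biblStruct differently or declare XML namespaces.
SAMPLE = """<treebank version="0.0" cts="urn:cts:example">
  <header>file credits go here</header>
  <biblStruct>author and title go here</biblStruct>
  <sentence id="1">
    <word id="1" form="arma"/>
    <word id="2" form="cano"/>
  </sentence>
</treebank>"""


def summarise(xml_text):
    """Collect release metadata and simple counts from one treebank file."""
    root = ET.fromstring(xml_text)
    return {
        "version": root.get("version"),
        "cts": root.get("cts"),
        # ".//" matches at any depth, so the nesting does not matter here.
        "has_header": root.find(".//header") is not None,
        "has_biblStruct": root.find(".//biblStruct") is not None,
        "sentences": sum(1 for _ in root.iter("sentence")),
        "words": sum(1 for _ in root.iter("word")),
    }


print(summarise(SAMPLE))
```

Because the names of the sentence-level attributes that hold the cts:urn
values are not given here, the sketch does not try to read them. They can be
inspected on a real file through the attrib mapping of each sentence element.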
-----
Size: 12 texts, with a total of 79,121 tokens.