--------------------------
Latin Dependency Treebank
--------------------------

The Ancient Greek and Latin Dependency Treebank (AGLDT) is the
earliest treebank for Ancient Greek and Latin. The project started at
Tufts University in 2006 and is currently under development and
maintenance at Leipzig University-Tufts University. 

The Ancient Greek and Latin Dependency Treebanks are built from the
work of dedicated students and researchers from across the world. Over
200 people have annotated texts; the hard work of those who have
contributed their annotations as part of the official treebanks are
within the data.

	Cited from https://perseusdl.github.io/treebank_data/
	September 13, 2017

For Latin, the following texts are included:

Author     Text
---------------------------------------
Augustus   Res Gestae   
Caesar     Commentarii de Bello Gallico
Cicero     In Catilinam
Jerome     Vulgata      
Vergil     Aeneid
Ovid       Metamorphoses        
Petronius  Satyricon
Phaedrus   Fabulae
Propertius Elegiae
Sallust    Bellum Catilinae     
Suetonius  Life of Augustus
Tacitus    Historiae


HistCorp inclusion date
------------------------
January 30, 2017


Website
--------
https://perseusdl.github.io/treebank_data/


Licence
--------
Creative Commons Attribution-ShareAlike 3.0 United States
https://creativecommons.org/licenses/by-sa/3.0/us/


The HistCorp files
-------------------
On the HistCorp page, the Latin texts from the AGLDT corpus are
provided in a plain text format ('txt'), a tokenised format ('tok'),
and in a morphologically and syntactically annotated format ('anno').

The plain text files were created from the original AGLDT XML files
by extracting the text content and adding metadata from the XML files
in a TEI-compatible format at the top of each file. In addition, the
number of tokens for each file has been calculated from the tokenised
version of the file.

The tokenised files were created by extracting the words and sentence
boundaries from the original XML files.
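
As a rough illustration of this extraction step, the sketch below
pulls word forms out of sentence and word elements and writes one
sentence per line. It is not the script used for HistCorp: the sample
XML, the attribute name "form", the placeholder version and cts
values, and the one-sentence-per-line output are all assumptions made
for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature treebank; the real files are larger and the
# attribute names used here ("id", "form") are assumptions.
SAMPLE = """<treebank version="0.0" cts="urn:cts:example">
  <sentence id="1">
    <word id="1" form="Gallia"/>
    <word id="2" form="est"/>
    <word id="3" form="omnis"/>
    <word id="4" form="divisa"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="Veni"/>
    <word id="2" form="."/>
  </sentence>
</treebank>"""


def sentences(xml_text):
    """Return a list of sentences, each a list of word forms."""
    root = ET.fromstring(xml_text)
    return [
        [w.get("form") for w in s.iter("word") if w.get("form")]
        for s in root.iter("sentence")
    ]


def to_tokenised(xml_text):
    """Join tokens with spaces, one sentence per line (assumed layout)."""
    return "\n".join(" ".join(tokens) for tokens in sentences(xml_text))


if __name__ == "__main__":
    print(to_tokenised(SAMPLE))
    # Token count derived from the tokenised view, as described above.
    print(sum(len(s) for s in sentences(SAMPLE)))
```

Counting tokens from the same sentence lists keeps the token count
consistent with the tokenised output.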

The tagged and parsed files are unchanged from the ones found on the
AGLDT webpage. The following information is taken from the README file
in the AGLDT package:

-----
The data have been semi-automatically annotated. The full tagset can
be consulted in TAGSET.xml. Each word is specified for a number of
attributes describing it. The @pos attribute is a 9-character long
string where each character has a particular meaning depending on its
position. In TAGSET.xml this logic is documented in all detail (the
file is derived from the one used in Arethusa, the online annotation
environment used for annotation). In TAGSET.txt there is a more easily
readable version of the tagset.

Data have been annotated using the following guidelines:
* [Guidelines for the Syntactic Annotation of Latin Treebanks
(1.3)](http://nlp.perseus.tufts.edu/syntax/treebank/ldt/1.5/docs/guidelines.pdf)
(GSALT) 

In the present release the following new texts have been added:

* Res Gestae
* Historiae

Res Gestae, which were treebanked following a different annotation
scheme (with respect to syntactic labels), have been automatically
converted to the common annotation scheme of the GSALT (some
inconsistencies introduced by the conversion may remain). The original
syntactic labels (see Harrington-tagset.pdf and
Harrington-tagset-instructions.pdf) have been preserved in the
attribute @hrngtn.

The following texts have undergone a major revision in order to
improve their form and consistency within themselves and with the most
recently annotated texts, i.e., Fabulae, Life of Augustus, and
Historiae: 

* In Catilinam
* Aeneis
* Commentarii de Bello Gallico
* Elegiae

More precisely, these texts have been modified thus:

* Addition of punctuation
* Addition of missing sentences and paragraphs
* Sentences restored in their correct order
* Enclitic particles (-que, -ve, -ne) restored in their correct position
* Univerbation of coordinating elements (neque, nec)
* Part of speech chosen on the basis of Lewis and Short's A Latin
  Dictionary and, if problems arise, Allen and Greenough’s A New
  Latin Grammar
* Tagset correction for gerund and gerundive  
* Some corrections related to the distinctions adjective/pronoun and
  deponent/passive
* APOS is annotated as appositive and not as apposition (i.e., the
  label is on the noun considered to be the appositive) 
* Some corrections related to verbal valency, the distinction between
  adverbial/attributive participles, personal constructions (e.g.,
  videor) 
* Normalization of the use of AuxY and AuxZ

The following texts lack the preceding modifications, but punctuation
has been added: 

* Bellum Catilinae
* Metamorphoses
* Satyricon
* Vulgata

The structure of the original XML files (i.e., the one according to
the XML schema which is digested in the Perseids platform, where
annotations are performed) has been changed in order to make it more
informative and easier to query. The <code>treebank</code> root
element identifies the version of the release (<code>@version</code>)
and the cts for each text (<code>@cts</code>). The (pseudo-TEI)
<code>header</code> element contains information/credits about the
creation of the file. The <code>biblStruct</code> element contains
information about the ancient author and text, which helps
interpretation of <code>@cts</code>. 

The original structure of <code>sentence</code> and <code>word</code>
elements is preserved, with some normalization of nodes that are not
linguistically relevant: <code>@span</code> has been deleted
and some normalization has been applied to the display of cts:urn
values within <code>sentence</code> (these values are available on a
sentence level, and sometimes also on a word level).
-----
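
The README above describes @pos as a 9-character string in which each
position has its own meaning. A minimal sketch of unpacking such a
string is given below. The field names and their order, and the use of
"-" for an unspecified position, are assumptions for illustration
only; TAGSET.xml and TAGSET.txt remain the authoritative description.

```python
# Assumed order of the nine positions; check TAGSET.xml before relying on it.
FIELDS = (
    "part_of_speech",
    "person",
    "number",
    "tense",
    "mood",
    "voice",
    "gender",
    "case",
    "degree",
)


def decode_pos(tag):
    """Map a 9-character positional tag to {field: character}.

    Positions holding "-" (assumed to mean "not specified") are omitted.
    The single-character values are returned as-is, not expanded.
    """
    if len(tag) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} characters, got {len(tag)}")
    return {name: ch for name, ch in zip(FIELDS, tag) if ch != "-"}


if __name__ == "__main__":
    # Hypothetical tag values, used only to show the mechanics.
    print(decode_pos("v3spia---"))
    print(decode_pos("n-s---fn-"))
```

Keeping the raw characters rather than expanding them avoids
hard-coding value tables that belong to the tagset files.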


Size: 12 texts, with a total of 79,121 tokens.