--------------------
The PROIEL Treebank
--------------------

The PROIEL Treebank is a treebank of ancient Indo-European languages, including Latin and Ancient Greek. It uses a refined version of dependency grammar and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence.

The PROIEL Treebank is one of three treebanks that use the same annotation system, follow the same principles and are available under the same licence. The PROIEL Treebank covers Ancient Greek and Latin, as well as the translations of the New Testament into Gothic, Classical Armenian and Old Church Slavonic. The TOROT Treebank covers Old Church Slavonic, Old Russian and Middle Russian, while the ISWOC Treebank includes texts in Old English, Old French, Portuguese and Spanish.

The complete collection currently has 928,185 tokens, all of which have been manually annotated with morphological and syntactic analyses. Parts of the treebank also have information-structure annotation, and the New Testament texts include text alignment.

All the PROIEL-family treebanks can be browsed and queried using INESS Search.

Cited from https://proiel.github.io/, August 24, 2020

HistCorp inclusion date
------------------------
August 24, 2020

Website
--------
https://proiel.github.io/

Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (https://creativecommons.org/licenses/by-nc-sa/4.0/)

Cite
-----
Dag T. T. Haug and Marius L. Jøhndal. 2008. 'Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.), Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008), pp. 27-34.

The HistCorp files
-------------------
On the HistCorp page, the Latin texts from the PROIEL Treebank are provided in a plain text format ('txt'), a tokenised format ('tok'), and a tab-separated CoNLL format with linguistic annotation ('ling').
The following six texts are included:

1) caes-gal = Caesar, Commentarii belli Gallici (ed. Holmes 1914)
2) cic-off = Cicero, De officiis (ed. Miller 1913)
3) cic-att = Cicero, Epistulae ad Atticum (ed. Purser 1901)
4) latin-nt = Jerome's Vulgate
5) pal-agr = Palladius, Opus agriculturae (ed. Schmitt 1898)
6) per-aeth = Peregrinatio Aetheriae (ed. Heraeus 1908)

The plain text files were created from the original PROIEL CoNLL files by extracting the words and sentences from the CoNLL structure, writing one sentence per line in the resulting plain text file, and adding the metadata from the README file in a TEI-compatible format at the top of each file. In addition, the number of tokens for each file has been calculated based on the tokenised version of the file.

Similarly, the tokenised files were created from the original PROIEL CoNLL files by extracting the words, one per line, with sentence boundaries marked by an empty line.

The linguistically annotated files are the same as the original PROIEL CoNLL files, except that metadata has been added at the top of each file.

Size: 6 texts, with a total of 219,035 tokens.
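
The derivation of the 'txt' and 'tok' files from the CoNLL files, as described above, could be sketched roughly as follows. This is not the script that was used to build the HistCorp files; it is a minimal illustration under stated assumptions: the word form is taken to be the second tab-separated column, sentences are taken to be separated by empty lines, and lines starting with '#' are skipped as if they were metadata. The function names and the sample lines are hypothetical and do not come from the PROIEL or HistCorp distributions.

```python
from typing import Iterable, List


def read_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Group CoNLL token lines into sentences, keeping only the word forms.

    Assumptions (not confirmed by the README): the form is column 2,
    an empty line ends a sentence, and '#' lines are metadata.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Empty line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # assumed metadata line
        columns = line.split("\t")
        if len(columns) < 2:
            continue  # not a token line under the assumed layout
        current.append(columns[1])
    if current:
        sentences.append(current)
    return sentences


def to_txt(sentences: List[List[str]]) -> str:
    """Plain text layout: one sentence per line."""
    return "\n".join(" ".join(s) for s in sentences) + "\n"


def to_tok(sentences: List[List[str]]) -> str:
    """Tokenised layout: one word per line, empty line between sentences."""
    return "\n\n".join("\n".join(s) for s in sentences) + "\n"


def count_tokens(sentences: List[List[str]]) -> int:
    """Token count, computed from the tokenised representation."""
    return sum(len(s) for s in sentences)


# Made-up sample in the assumed layout (id, form, lemma); not real corpus data.
SAMPLE = [
    "1\tGallia\tGallia\n",
    "2\test\tsum\n",
    "\n",
    "1\tveni\tvenio\n",
]
```

With the sample above, read_sentences(SAMPLE) yields two sentences, which to_txt joins with spaces (one sentence per line) and to_tok writes one word per line. Keeping the sentence grouping in a single function means that the token count and both output layouts are derived from the same intermediate structure.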