----- HGDS -----

The project Hungarian Generative Diachronic Syntax was hosted by the Research Institute for Linguistics of the Hungarian Academy of Sciences. It ran between April 1 2009 and August 31 2013, and was funded by the Hungarian Scientific Research Fund (OTKA No. 78074). One aim of the project was to construct an annotated corpus comprising all extant texts from the Old Hungarian period (896–1526), which could provide answers to linguistically relevant problems. The corpus includes only documents containing coherent texts in Hungarian, not including so-called sporadic records, documents containing isolated occurrences of Hungarian words or names.

Cited from http://omagyarkorpusz.nytud.hu/en-descr.html August 16, 2017

HistCorp inclusion date
------------------------
February 3, 2017

Website
--------
http://omagyarkorpusz.nytud.hu/en-codices.html

Licence
--------
Free, please cite Eszter Simon: Corpus building from Old Hungarian codices. In: Katalin É. Kiss (ed.): The Evolution of Functional Left Peripheries in Hungarian Syntax. Oxford University Press.

The HistCorp files
-------------------
On the HistCorp page, the Hungarian texts from 'HGDS' are provided in a plain text format ('txt'), and a tokenised format ('tok'). A subset of the texts are also available in a tagged and parsed tab-separated format ('anno'), and/or in their normalised spelling ('norm').

The plain text files are the same as in the original HGDS package, except that metadata has been added in a TEI-compatible format at the top of each txt file. The metadata information was mainly extracted from the metadata stated on the HGDS site (http://omagyarkorpusz.nytud.hu/en-codices.html). In addition, the number of tokens for each file has been calculated based on the tokenised version of the file.

In the tokenised files, the texts are split into one token on each line.
Tokenisation was performed using the UDPipe tokeniser (https://ufal.mff.cuni.cz/udpipe) with the Hungarian language model provided as a baseline model in the CoNLL17 Shared Task (hungarian-ud-2.0-conll17-170315.udpipe).

The parsed files are part of the original HGDS package, and are presented in a tab-separated column format, with information on sentence boundaries, word forms, lemmas, part-of-speech tags, morphological tags, and dependency information.

Parts of the corpus are manually normalised regarding spelling, and these files are provided in a separate download marked 'norm'. In the download for normalised texts, there is also a subdirectory named 'experiments', with a subset of the mappings of historical spelling to modern spelling, divided into training, development and test sets identical to the data sets used by Pettersson et al. in the paper 'A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text' (http://aclweb.org/anthology//W/W14/W14-0605.pdf).

Size: 50 texts, with a total of 2,217,071 tokens.
Genre: Codices
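
Reading the files: a minimal sketch
------------------------------------
The one-token-per-line layout of the 'tok' files and the tab-separated layout of the 'anno' files described above can be read with a few lines of Python. This is only a sketch under stated assumptions, not part of the HistCorp or HGDS distribution. The function names count_tokens and read_sentences are hypothetical. The sketch assumes that blank lines carry no token, and that sentence boundaries in the 'anno' files are marked by blank lines; this description does not state how sentence boundaries are encoded or in which order the columns appear, so check the actual files first.

```python
from typing import Iterable, List


def count_tokens(lines: Iterable[str]) -> int:
    """Count tokens in a one-token-per-line file.

    Assumption: blank or whitespace-only lines are not tokens.
    """
    return sum(1 for line in lines if line.strip())


def read_sentences(lines: Iterable[str]) -> List[List[List[str]]]:
    """Group tab-separated rows into sentences.

    Assumption: a blank line ends a sentence. Each row is returned as a
    list of column strings; the column order is not interpreted here.
    """
    sentences: List[List[List[str]]] = []
    current: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        # The file may end without a trailing blank line.
        sentences.append(current)
    return sentences
```

Both functions accept any iterable of lines, so an open file can be passed directly, for example `with open(path, encoding="utf-8") as f: n = count_tokens(f)`, where `path` points at one of the downloaded files. UTF-8 is an assumption here as well.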