--------
IcePaHC
--------

The Icelandic Parsed Historical Corpus (IcePaHC) is a project that has built a diachronic corpus with samples of written Icelandic from all periods from the 12th century to modern times. The corpus is mostly compatible with the corpora of historical English developed at UPenn. For historical texts, spelling is modernized for phonological change.

If using this corpus in your research, please cite:

Wallenberg, Joel C., Anton Karl Ingason, Einar Freyr Sigurðsson and Eiríkur Rögnvaldsson. 2011. Icelandic Parsed Historical Corpus (IcePaHC). Version 0.9. Cited from http://www.linguist.is/icelandic_treebank/Download June 28, 2017

HistCorp inclusion date
------------------------
June 27, 2017

Website
--------
http://www.linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)

Licence
--------
GNU Lesser General Public License (https://en.wikipedia.org/wiki/GNU_Lesser_General_Public_License)

The HistCorp files
-------------------
On the HistCorp page, the IcePaHC texts are provided in a plain text format ('txt'), a tokenised format ('tok'), a format with historical spelling mapped to the corresponding modern spelling ('norm'), a format with morphological annotation ('morph'), and a format with syntactic annotation ('syntax'). (The morphologically and syntactically annotated files are both found in the 'anno' package on the HistCorp page.)

The plain text files are the same as in the original IcePaHC package, except that metadata has been added in a TEI-compatible format at the top of each file. The metadata was extracted automatically from the info files accompanying each text in the original IcePaHC download. In addition, the number of tokens in each file has been calculated from the tokenised files.

In the tokenised files, the texts are split so that each line holds one token. The tokenised files were created by extracting the first column of the tagged files in the IcePaHC package with the Unix 'cut -f1' command.

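For readers without a Unix shell, the first-column extraction described above can be approximated in Python. This is only a sketch, not the script used to build the HistCorp files: the function name 'first_column' and the sample lines (including their placeholder tags) are invented for illustration, and it assumes the tagged files are tab-separated, which is the default delimiter of 'cut'.

```python
def first_column(lines):
    """Return the first tab-separated field of every line, like `cut -f1`.

    Lines without a tab are returned whole, which is also what `cut` does.
    """
    return [line.rstrip("\n").split("\t", 1)[0] for line in lines]


if __name__ == "__main__":
    # Invented sample input; TAG-A and TAG-B are placeholders, not real IcePaHC tags.
    sample = ["hestur\tTAG-A\thestur\n", "kom\tTAG-B\tkoma\n"]
    for token in first_column(sample):
        print(token)
```

To process a real file, the same function can be given an open file object, since iterating over a file yields its lines.
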
The spelling of parts of the corpus has been manually normalised, and these files are provided in a separate download marked 'norm'. In these files the historical spelling is mapped to its corresponding modern spelling, and the files are divided into training, development and test sets identical to the data sets used by Pettersson et al. in the paper 'A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text' (http://aclweb.org/anthology//W/W14/W14-0605.pdf).

The morphologically analysed files are provided in a tab-separated format, with the word form in the first column, its part-of-speech tag in the second column, and its lemma in the third column. The tagset used is described here: http://linguist.is/icelandic_treebank/Tagset.

The syntactically annotated files are provided in their original parenthesis-based format. The annotation scheme used is described further here: http://www.linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)#Annotation_guidelines

Size: 61 texts, with a total of 1,015,569 tokens.

Genres: bible text, biography, fiction, law, narrative, religious, science and sermon.
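
The three-column layout of the 'morph' files described above (word form, part-of-speech tag, lemma) can be read with a few lines of Python. The sketch below is an illustration under stated assumptions, not part of the HistCorp distribution: the names 'MorphToken' and 'read_morph' are hypothetical, the sample rows use placeholder tags rather than real IcePaHC tags, and silently skipping lines that do not have exactly three columns is a design choice made here, not documented behaviour of the files.

```python
import io
from typing import NamedTuple


class MorphToken(NamedTuple):
    form: str   # word form (column 1)
    tag: str    # part-of-speech tag (column 2)
    lemma: str  # lemma (column 3)


def read_morph(stream):
    """Parse tab-separated lines into MorphToken records.

    Lines that do not have exactly three columns (for example blank
    lines) are skipped; this is an assumption of the sketch.
    """
    tokens = []
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        tokens.append(MorphToken(*fields))
    return tokens


if __name__ == "__main__":
    # Invented sample; TAG-A and TAG-B are placeholders, not real IcePaHC tags.
    sample = io.StringIO("hestur\tTAG-A\thestur\n\nkom\tTAG-B\tkoma\n")
    for token in read_morph(sample):
        print(token.form, token.tag, token.lemma)
```

For a real file, pass a file object opened with encoding="utf-8" in place of the in-memory sample; the UTF-8 encoding is likewise an assumption to check against the downloaded files.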