-------------------------------------------
IMP reference corpus of historical Slovene
-------------------------------------------

The reference corpus of historical Slovene contains text from 1,100 pages (about 300,000 tokens) sampled from the IMP collection with hand-validated linguistic annotation. Each word token (e.g. "lubesni") in the corpora is annotated with:
- modernised form ("ljubezni")
- lemma ("ljubezen")
- MSD tag ("Ncm"); the tagset is defined in the IMP morphosyntactic specifications.

Cited from http://nl.ijs.si/imp/index-en.html October 24, 2017

HistCorp inclusion date
------------------------
September 29, 2017

Website
--------
http://nl.ijs.si/imp/index-en.html

Citation:
Tomaž Erjavec. 2015. The IMP historical Slovene language resources. Language resources and evaluation, ISSN 1574-020X, doi: 10.1007/s10579-015-9294-7.

Licence
--------
Creative Commons Attribution (CC BY 4.0)

The HistCorp files
-------------------
On the HistCorp page, the Slovene texts from the Reference Corpus of historical Slovene are provided in a plain text format ('txt'), a tokenised format ('tok'), a linguistically annotated format ('anno') and a normalised format ('norm').

For the tokenised files ('tok'), the original corpus was divided into one file for each text in the corpus, and only the first column (the word form column) was extracted to each file. Furthermore, metadata were added in a TEI-compatible format at the top of each file. The metadata were extracted partly from the metadata stated in the original corpus file, and partly from the accompanying readme files and the corpus website (http://nl.ijs.si/imp/index-en.html). In addition, the number of tokens for each file has been calculated based on the tokenised version of the file.
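
As a rough illustration of the extraction step described above, the following Python sketch pulls the first column out of column-formatted lines and counts the resulting tokens. The tab delimiter, the column order in the sample line and the function name `extract_word_forms` are assumptions made for this example; the sketch does not handle the metadata header and is not the script that was used to produce the HistCorp files.

```python
def extract_word_forms(lines, delimiter="\t"):
    """Yield the first column (the word form) of each non-empty line.

    The delimiter is an assumption; adjust it to the actual file layout.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        yield line.split(delimiter, 1)[0]


# Invented sample line: word form, modernised form, lemma, MSD tag.
sample = ["lubesni\tljubezni\tljubezen\tNcm\n", "\n"]
tokens = list(extract_word_forms(sample))
token_count = len(tokens)
print(tokens, token_count)
```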

The linguistically annotated files ('anno') are essentially unchanged from the original corpus, except that the corpus was divided into one file for each text, and that the metadata headers were replaced by metadata in the HistCorp format.

The normalised files are the same as provided by Scherrer and Erjavec at http://nl.ijs.si/imp/experiments/jnle-dataset/. These historical-to-modern mapping files were automatically extracted from the IMP historical corpora, and contain hand-validated entries encoded as tab-separated UTF-8 files with the following columns:
1) the wordform as it appears in the corpus, but lowercased
2) the normalised wordform, i.e. converted to the contemporary alphabet
3) the modernised wordform:
   a) if it is not in the contemporary lexicon, it has a * suffix;
   b) if this is an orthographic normalisation of an otherwise extinct (archaic) word, the suffix is ! (or *!)
4) frequency in the corpus

Citation for the normalised files:
Scherrer, Yves and Erjavec, Tomaž (2015): Modernising historical Slovene words. In: Natural language engineering, doi: 10.1017/S1351324915000236, url: http://dx.doi.org/10.1017/S1351324915000236.

Size: 76 texts, with a total of 358,036 tokens.
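
The four-column layout of the normalised files described above could be read along the following lines. This is a minimal sketch: the names `NormEntry`, `parse_norm_row` and `read_norm` are hypothetical, and the sample rows (including the frequencies and the placeholder word "xyz") are invented for illustration, not taken from the dataset.

```python
import csv
import io
from typing import NamedTuple


class NormEntry(NamedTuple):
    original: str         # column 1: lowercased corpus wordform
    normalised: str       # column 2: wordform in the contemporary alphabet
    modernised: str       # column 3: modernised form, marker suffix removed
    out_of_lexicon: bool  # column 3 carried a "*" marker
    archaic: bool         # column 3 carried a "!" marker
    frequency: int        # column 4: frequency in the corpus


def parse_norm_row(row):
    """Turn one tab-separated row into a NormEntry."""
    original, normalised, modern, freq = row[:4]
    stripped = modern.rstrip("*!")      # drop trailing *, ! or *!
    markers = modern[len(stripped):]    # whatever was stripped off
    return NormEntry(original, normalised, stripped,
                     "*" in markers, "!" in markers, int(freq))


def read_norm(stream):
    """Read all well-formed rows from an open text stream."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [parse_norm_row(row) for row in reader if len(row) >= 4]


# Invented sample data in the documented layout.
sample = "lubesni\tlubesni\tljubezni\t12\nxyz\txyz\txyz*!\t1\n"
entries = read_norm(io.StringIO(sample))
for entry in entries:
    print(entry)
```

For a real file, the stream would typically be opened with `open(path, encoding="utf-8", newline="")`, since the files are stated to be UTF-8.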