----------------------------
IMP Slovene Digital Library
----------------------------

The digital library contains over 650 units (books, newspapers and some manuscripts) from the end of the 16th century to 1918, with the majority from 1850 onwards. The identifiers of the documents mark their origin:

WIKI: the largest part of the library, comprising books, newspaper articles and installments, as well as some manuscripts from the Wikisource project ‘Slovene literary classics’, containing literature by Slovene authors (1776–1918);
FPG: the AHLib collection of books translated into Slovene from German (1848–1918);
NUK: older books (1750–1820) and selected issues of one newspaper (1850–1900), prepared in the scope of the EU IMPACT project by the National and University Library of Slovenia (NUK);
ZRC: small samples of three religious texts (1584, 1695, 1784), prepared by the Scientific Research Center of the Slovene Academy of Sciences and Arts.

Cited from http://nl.ijs.si/imp/index-en.html, November 7, 2017

HistCorp inclusion date
------------------------
September 29, 2017

Website
--------
http://nl.ijs.si/imp/index-en.html

Citation:
Tomaž Erjavec. 2015. The IMP historical Slovene language resources. Language resources and evaluation, ISSN 1574-020X, doi: 10.1007/s10579-015-9294-7.

Licence
--------
Creative Commons Attribution (CC BY 4.0)

The HistCorp files
-------------------
On the HistCorp page, the Slovene texts from the Digital Library are provided in a plain text format ('txt'), a tokenised format ('tok') and a linguistically annotated format ('anno'); the last of these also includes (automatic) spelling normalisation.

For the plain text files and the tokenised files, the original corpus was divided into one file for each text in the corpus, and only the first column (the word form column) was extracted to each file. Furthermore, metadata were added in a TEI-compatible format at the top of each file.
The metadata were taken partly from the metadata stated in the original corpus file, and partly from the accompanying readme files and the corpus website (http://nl.ijs.si/imp/index-en.html). In addition, the number of tokens in each file was calculated from the tokenised version of the file.

The linguistically annotated files ('anno') are essentially unchanged from the original corpus, except that the corpus was divided into one file for each text, and that the metadata headers were replaced by metadata in the HistCorp format. The distributors point out that the annotation, including spelling normalisation, was performed automatically, using noSketch Engine.

Size: 621 texts, with a total of 17,723,566 tokens.

Genres: beekeeping, cookbook, drama, non-fiction, poetry, prose and religion.
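
As an illustration of how the word form column can be extracted and the tokens counted, as described under "The HistCorp files", here is a minimal Python sketch. It is not the script used to build the HistCorp files. It assumes that the annotated corpus has one token per line with tab-separated columns, that the word form is the first column, and that structural markup lines have the form <tag ...> or </tag>. The function names, the column layout and the sample lines are invented for the example.

```python
from typing import Iterable, List


def is_structural(line: str) -> bool:
    """Assumed convention: markup lines look like <tag ...> or </tag>."""
    stripped = line.strip()
    return len(stripped) > 2 and stripped.startswith("<") and stripped.endswith(">")


def extract_word_forms(lines: Iterable[str]) -> List[str]:
    """Return the first tab-separated column of every token line."""
    forms = []
    for line in lines:
        # Skip blank lines and (assumed) structural markup lines.
        if not line.strip() or is_structural(line):
            continue
        forms.append(line.rstrip("\n").split("\t")[0])
    return forms


# Invented sample; the columns (word form, normalised form, lemma)
# are an assumption made only for this illustration.
sample = [
    '<text id="EXAMPLE">\n',
    "<s>\n",
    "vſe\tvse\tves\n",
    "je\tje\tbiti\n",
    ".\t.\t.\n",
    "</s>\n",
    "</text>\n",
]

word_forms = extract_word_forms(sample)
token_count = len(word_forms)  # one token per extracted word form
```

With real data, the same function could be applied to the lines of an opened 'anno' file, once the actual column layout has been checked against the file itself.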