-------------------------------------------
IMP reference corpus of historical Slovene 
-------------------------------------------

The reference corpus of historical Slovene contains text from 1,100
pages (about 300,000 tokens) sampled from the IMP collection with
hand-validated linguistic annotation. Each word token (e.g. "lubesni")
in the corpus is annotated with:

- modernised form ("ljubezni")
- lemma ("ljubezen")
- MSD tag ("Ncm"); the tagset is defined in the IMP morphosyntactic
  specifications.


	Cited from http://nl.ijs.si/imp/index-en.html
	October 24, 2017


HistCorp inclusion date
------------------------
September 29, 2017


Website
--------
http://nl.ijs.si/imp/index-en.html


Citation
---------
Tomaž Erjavec. 2015. The IMP historical Slovene language resources. Language Resources and Evaluation, ISSN 1574-020X, doi: 10.1007/s10579-015-9294-7.


Licence
--------
Creative Commons Attribution (CC BY 4.0)


The HistCorp files
-------------------
On the HistCorp page, the Slovene texts from the Reference Corpus of
historical Slovene are provided in a plain text format ('txt'), a
tokenised format ('tok'), a linguistically annotated format ('anno')
and a normalised format ('norm').

For the tokenised files ('tok'), the original corpus was divided into
one file for each text in the corpus, and only the first column (the
word form column) was extracted to each file. Furthermore, metadata
were added in a TEI-compatible format at the top of each file. The
metadata were partly extracted from the metadata given in the original
corpus file, and partly from the accompanying readme files and the
corpus website
(http://nl.ijs.si/imp/index-en.html). In addition, the number of
tokens for each file has been calculated based on the tokenised
version of the file.
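
The column extraction and token count described above can be sketched
as follows. This is only an illustration, not the script actually used
for HistCorp: it assumes that the annotated source has one token per
line with tab-separated columns, and the function name, the sample
rows and the order of the columns after the first are invented for the
example.

```python
def extract_word_forms(annotated_lines):
    """Return the first (word form) column of every non-empty line.

    Assumes tab-separated columns, one token per line (an assumption
    made for this sketch, not a documented property of the corpus).
    """
    forms = []
    for line in annotated_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines, e.g. sentence separators
        forms.append(line.split("\t")[0])
    return forms


# Invented sample rows: word form, modernised form, lemma, MSD tag.
sample = [
    "lubesni\tljubezni\tljubezen\tNcm\n",
    "\n",
    "ljubezni\tljubezni\tljubezen\tNcm\n",
]

word_forms = extract_word_forms(sample)
token_count = len(word_forms)  # token count based on the tokenised version
```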

The linguistically annotated files ('anno') are essentially unchanged
from the original corpus, except that the corpus was divided into one
file for each text, and that the metadata headers were replaced by
metadata in the HistCorp format.

The normalised files ('norm') are the same as those provided by
Scherrer and Erjavec at
http://nl.ijs.si/imp/experiments/jnle-dataset/. These
historical-to-modern mapping files were automatically extracted from
the IMP historical corpora, and contain hand-validated entries encoded
as tab-separated UTF-8 files with the following columns:

  1) the wordform as it appears in the corpus, but lowercased

  2) the normalised wordform, i.e. converted to the contemporary alphabet

  3) the modernised wordform:
     a) if it is not in the contemporary lexicon it has a * suffix;
     b) if this is an orthographic normalisation of an otherwise
        extinct (archaic) word, the suffix is ! (or *!)
   
  4) frequency in the corpus
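
As an illustration of the four-column layout listed above, a line of a
normalised file could be read roughly as follows. This is a minimal
sketch, not code distributed with the corpus: the names NormEntry and
parse_norm_line and the sample lines are invented for the example, and
it assumes that the * and ! markers occur only at the end of the third
column.

```python
from typing import NamedTuple


class NormEntry(NamedTuple):
    original: str         # column 1: lowercased corpus wordform
    normalised: str       # column 2: form in the contemporary alphabet
    modernised: str       # column 3: modernised form, markers removed
    not_in_lexicon: bool  # column 3 carried a '*' marker
    archaic: bool         # column 3 carried a '!' marker
    frequency: int        # column 4: frequency in the corpus


def parse_norm_line(line: str) -> NormEntry:
    """Split one tab-separated line and decode the suffix markers."""
    original, normalised, modernised, freq = line.rstrip("\n").split("\t")
    # The suffix is '*', '!' or '*!', so check for '!' first, then '*'.
    archaic = modernised.endswith("!")
    if archaic:
        modernised = modernised[:-1]
    not_in_lexicon = modernised.endswith("*")
    if not_in_lexicon:
        modernised = modernised[:-1]
    return NormEntry(original, normalised, modernised,
                     not_in_lexicon, archaic, int(freq))


# Invented sample lines, for illustration only.
plain = parse_norm_line("lubesni\tlubezni\tljubezni\t12\n")
marked = parse_norm_line("abc\tabc\tabc*!\t3\n")
```

Storing the markers as separate boolean fields keeps the modernised
form itself clean, so it can be compared directly with entries in a
lexicon.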

Citation for the normalised files:

Scherrer, Yves and Erjavec, Tomaž (2015): Modernising historical
  Slovene words. In: Natural language engineering, doi:
  10.1017/S1351324915000236, url:
  http://dx.doi.org/10.1017/S1351324915000236. 


Size: 76 texts, with a total of 358,036 tokens.