------------------------------
Middle Russian Corpus (RNC)
------------------------------

The Middle Russian Corpus included on the HistCorp platform was
retrieved from the Old Russian section of the Universal Dependencies
treebanks, which contains a subset of the Middle Russian corpus
(1300-1700), a part of the Russian National Corpus.


HistCorp inclusion date
------------------------
November 10, 2020


Website
--------
https://github.com/UniversalDependencies/UD_Old_Russian-RNC/blob/master/README.md


Licence
--------
Creative Commons BY-NC-SA 4.0
(https://creativecommons.org/licenses/by-nc-sa/4.0/)


The HistCorp files
-------------------
On the HistCorp page, the Russian texts from 'The Middle Russian
Corpus' are provided in a plain text format, a tokenised format and a
linguistically annotated CoNLL-U format.

The linguistically annotated files ('anno') contain information on
part-of-speech tags, lemma, morphology and syntax (expressed as
dependency relations), following the same CoNLL-U format as on the
Universal Dependencies site from which the files were extracted,
except that metadata has been added in a TEI-compatible format at the
top of each file. The metadata was mainly taken from the README file
of the Old Russian section of the Universal Dependencies site
(https://github.com/UniversalDependencies/UD_Old_Russian-RNC/blob/master/README.md).

The plain text files ('txt') contain one sentence on each line. The
sentences were automatically extracted from the CoNLL-U files.
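
As an illustration only, the sketch below shows one way such an
extraction could be done. It assumes that each sentence in the CoNLL-U
file carries the standard '# text = ...' comment line; the function
name and the sample data are invented for this example and do not
describe the script actually used for HistCorp.

```python
def extract_sentences(conllu_text):
    """Return the sentence strings found in '# text = ...' comment lines."""
    sentences = []
    for line in conllu_text.splitlines():
        # Assumption: every sentence block has a '# text = ...' comment.
        if line.startswith("# text ="):
            sentences.append(line.split("=", 1)[1].strip())
    return sentences


# Invented sample data, not taken from the corpus.
sample = (
    "# sent_id = 1\n"
    "# text = Alpha beta .\n"
    "1\tAlpha\talpha\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "2\tbeta\tbeta\tNOUN\t_\t_\t1\tflat\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = Gamma .\n"
    "1\tGamma\tgamma\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
)

# One sentence per line, as in the 'txt' files.
print("\n".join(extract_sentences(sample)))
```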

The tokenised files ('tok') contain one token on each line. They were
automatically created by extracting only the first and second columns
(word id and word form) from the CoNLL-U files.
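
A minimal sketch of this column extraction is given below, under
stated assumptions: token lines are recognised as lines with exactly
ten tab-separated fields, and comment lines, blank lines and any other
lines (such as the metadata at the top of a file) are skipped. The
function name and the sample data, including the header line, are
hypothetical; whether the actual 'tok' files keep blank lines between
sentences is not stated here, and this sketch drops them.

```python
def extract_tokens(conllu_text):
    """Return 'id<TAB>form' strings for every token line in the input."""
    rows = []
    for line in conllu_text.splitlines():
        # Skip blank lines and CoNLL-U comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Assumption: anything without ten fields is header or metadata.
        if len(cols) != 10:
            continue
        rows.append(cols[0] + "\t" + cols[1])
    return rows


# Invented sample data; the first line stands in for a metadata header.
sample = (
    "<hypotheticalHeader>metadata</hypotheticalHeader>\n"
    "# sent_id = 1\n"
    "# text = Alpha beta\n"
    "1\tAlpha\talpha\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "2\tbeta\tbeta\tNOUN\t_\t_\t1\tflat\t_\t_\n"
)

print("\n".join(extract_tokens(sample)))
```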


Size: 25,822 tokens.