--------
diakorp
--------

The Diakorp corpus is the diachronic section of the Czech National
Corpus.


HistCorp inclusion date
------------------------
February 2, 2017


Website
--------
https://wiki.korpus.cz/doku.php/en:cnk:diakorp


Contact person
---------------
Martin Stluka (martin.stluka@ff.cuni.cz)


Citation
---------
Kučera, K. – Stluka, M.: DIAKORP: Diachronní korpus, version 5 from 21 Feb 2011. Ústav Českého národního korpusu FF UK, Praha 2011. Available on-line: http://www.korpus.cz


Licence
--------
http://creativecommons.org/licenses/by-nc-sa/4.0/


The HistCorp files
-------------------

On the HistCorp page, the texts from the Diakorp corpus are provided
in a plain text format ('txt') and a tokenised format ('tok').

The plain text files are derived from the Diakorp text file by
segmenting the original file into one file per subtext and by
'detokenising' it, so that each file contains running sentences
rather than one token on each line.

In the tokenised files, the texts are split into one token on each
line.
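
The relation between the two formats can be illustrated with a
minimal sketch. It assumes a 'tok' file is UTF-8 text with one token
per line and that blank lines carry no token. The function names and
the punctuation rule are hypothetical and are not the procedure that
was used to build the HistCorp files.

.. code-block:: python

    import re

    def read_tokens(lines):
        """Return the non-empty tokens from an iterable of lines."""
        return [line.strip() for line in lines if line.strip()]

    def naive_detokenise(tokens):
        """Join tokens with spaces and drop the space before punctuation.

        This is a deliberately simple rule (an assumption); real
        detokenisation also has to handle quotes, brackets and so on.
        """
        text = " ".join(tokens)
        return re.sub(r"\s+([,.;:!?])", r"\1", text)

    # Made-up sample lines, not taken from the corpus.
    sample = ["Toto\n", "je\n", "věta\n", ".\n"]
    print(naive_detokenise(read_tokens(sample)))

The same ``read_tokens`` helper could be applied to an open file
handle, since iterating over a text file also yields one line at a
time.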

Metadata has also been added in a TEI-compatible format at the top of
each txt file. The metadata information was mainly extracted from the
metadata stated on the corpus website. In addition, the number of
tokens for each file has been calculated based on the tokenised
version of the file. Genres have been translated into English using
the Lexilogos Lingea dictionary
(https://slovniky.lingea.cz/Anglicko-cesky/) combined with Google
Translate (https://translate.google.se/).
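
As a rough illustration of how a token count can be obtained from the
tokenised version, the sketch below counts the non-empty lines of
each file. The ``.tok`` file extension, the directory layout and the
function names are assumptions made for the example, and the counts
it produces may differ from the published figures if the files
contain lines that were not counted as tokens.

.. code-block:: python

    from pathlib import Path

    def count_tokens(path):
        """Count non-empty lines in a one-token-per-line file."""
        with open(path, encoding="utf-8") as handle:
            return sum(1 for line in handle if line.strip())

    def count_corpus(directory, pattern="*.tok"):
        """Map each matching file name to its token count."""
        files = sorted(Path(directory).glob(pattern))
        return {path.name: count_tokens(path) for path in files}

    # Example use with a hypothetical directory name:
    # counts = count_corpus("diakorp_tok")
    # print(len(counts), "files,", sum(counts.values()), "tokens")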

Size: 116 texts, with a total of 4,148,986 tokens.

Genres: drama, informal, non-fiction, opinion, periodical, poetry,
prose, reflection, speech.