--------
diakorp
--------

The Diakorp corpus is the diachronic section of the Czech National Corpus.

HistCorp inclusion date
------------------------
February 2, 2017

Website
--------
https://wiki.korpus.cz/doku.php/en:cnk:diakorp

Contact person
---------------
Martin Stluka (martin.stluka@ff.cuni.cz)

Citation
---------
Kučera, K. – Stluka, M.: DIAKORP: Diachronní korpus, version 5 from 21 Feb 2011. Ústav Českého národního korpusu FF UK, Praha 2011. Available on-line: http://www.korpus.cz

Licence
--------
http://creativecommons.org/licenses/by-nc-sa/4.0/

The HistCorp files
-------------------
On the HistCorp page, the texts from the Diakorp corpus are provided in a plain text format ('txt') and a tokenised format ('tok').

The plain text files are derived from the Diakorp text file by segmenting the original file into one file per subtext and by 'detokenising' it, so that each file contains running sentences rather than one token per line. In the tokenised files, the texts are split into one token per line.

Metadata has also been added in a TEI-compatible format at the top of each txt file. The metadata was mainly extracted from the information given on the corpus website. In addition, the number of tokens for each file has been calculated from the tokenised version of the file. Genres have been translated into English using the Lexilogos Lingea dictionary (https://slovniky.lingea.cz/Anglicko-cesky/) combined with Google Translate (https://translate.google.se/).

Size: 116 texts, with a total of 4,148,986 tokens.

Genres: drama, informal, non-fiction, opinion, periodical, poetry, prose, reflection, speech.
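
The relationship between the 'tok' and 'txt' formats described above can be illustrated with a minimal Python sketch. It is not the procedure used to build the HistCorp files: the exact detokenisation rules are not documented here, so the punctuation handling below is an assumption, the function names `read_tokens` and `detokenise` are hypothetical, and the sample sentence is invented for illustration.

```python
import re


def read_tokens(tok_text: str) -> list[str]:
    """Return the tokens of a 'tok' file: one token per non-empty line."""
    return [line.strip() for line in tok_text.splitlines() if line.strip()]


def detokenise(tokens: list[str]) -> str:
    """Join tokens with spaces, then attach common punctuation marks
    to the preceding word (an assumed, deliberately naive rule)."""
    text = " ".join(tokens)
    return re.sub(r"\s+([,.;:!?])", r"\1", text)


# Invented sample in the one-token-per-line layout of a 'tok' file.
sample_tok = "Byl\njednou\njeden\nkrál\n.\n"

tokens = read_tokens(sample_tok)
print(len(tokens))         # token count, as computed from the tokenised version
print(detokenise(tokens))  # running text, as in a 'txt' file
```

Counting tokens from the one-token-per-line version, as in `len(tokens)`, mirrors how the per-file token counts in the metadata are said to be calculated.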