----------------------
Czech Gutenberg texts
----------------------

The Czech Gutenberg texts on the HistCorp page are a subset of the texts provided by Project Gutenberg (http://www.gutenberg.org).

HistCorp inclusion date
------------------------
May 18, 2017

Website
--------
http://www.gutenberg.org

Contact information
--------------------
http://www.gutenberg.org/wiki/Gutenberg:Contact_Information

Licence
--------
http://www.gutenberg.org/license

The HistCorp files
-------------------
The Czech Gutenberg texts on the HistCorp page are provided in a plain text format ('txt') and in a tokenised format ('tok').

The plain text files have been semi-automatically stripped of Gutenberg-specific metadata and of extratextual information such as page numbering, footnotes and the underscore signs that mark emphasis. Metadata is instead given in a TEI-compatible format at the top of each file.

The tokenised files contain one token per line. Tokenisation was performed using the UDPipe tokeniser (https://ufal.mff.cuni.cz/udpipe) with the Czech language model provided as a baseline model in the CoNLL17 Shared Task (czech-ud-2.0-conll17-170315.udpipe).

The metadata at the top of each txt file was mainly extracted from the metadata stated on the corpus website. In addition, the number of tokens for each file has been calculated based on the tokenised version of the file.

Size: 3 texts, with a total of 292,851 tokens.

Genre: books (see the metadata in each file for more detailed information on the genres included).
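
Since the tokenised files hold one token per line, a token count like the one reported above can be approximated by counting non-empty lines. The following is only a minimal sketch, not the procedure actually used for HistCorp: the function names are hypothetical, UTF-8 encoding is an assumption, and any metadata header present in a file is assumed to have been removed before counting.

```python
from pathlib import Path


def count_tokens(tokenised_text: str) -> int:
    """Count tokens in text that holds one token per line.

    Blank lines are skipped. Assumption: the text contains no
    metadata header, only token lines.
    """
    return sum(1 for line in tokenised_text.splitlines() if line.strip())


def count_tokens_in_file(path: str) -> int:
    """Count tokens in a 'tok' file (hypothetical helper; UTF-8 is assumed)."""
    return count_tokens(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Small in-memory sample standing in for the body of a 'tok' file.
    sample = "Byl\njednou\njeden\nkrál\n.\n"
    print(count_tokens(sample))
```

Counts obtained this way may differ from the figures in the metadata if a file contains header lines or other non-token lines.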