----------------------
Czech Gutenberg texts
----------------------

The Czech Gutenberg texts on the HistCorp page are a subset of the
texts provided by Project Gutenberg (http://www.gutenberg.org). 


HistCorp inclusion date
------------------------
May 18, 2017


Website
--------
http://www.gutenberg.org


Contact information
--------------------
http://www.gutenberg.org/wiki/Gutenberg:Contact_Information


Licence
--------
http://www.gutenberg.org/license


The HistCorp files
-------------------
The Czech Gutenberg texts on the HistCorp page are a subset of the
texts distributed by Project Gutenberg (http://www.gutenberg.org).
They are provided in a plain text format ('txt') and in a tokenised
format ('tok').

The plain text files have been semi-automatically stripped of
Gutenberg-specific metadata and of extratextual information such as
page numbering, footnotes, and the underscore signs that mark
emphasis. Metadata is instead given in a TEI-compatible format at the
top of each file.
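
As an illustration of one of the clean-up steps described above, the
following is a minimal sketch of how underscore emphasis markers might
be removed. It is not the procedure actually used for HistCorp, which
was semi-automatic. The function name ``strip_emphasis`` and the
regular expression are assumptions made for this example.

```python
import re

# Hypothetical pattern: an underscore, some text without underscores or
# line breaks, and a closing underscore (e.g. "_velmi_").
EMPHASIS = re.compile(r"_([^_\n]+)_")


def strip_emphasis(text):
    """Return text with paired underscore emphasis markers removed."""
    # Keep the emphasised words, drop only the surrounding underscores.
    return EMPHASIS.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_emphasis("To je _velmi_ dobré."))
```

A pattern this simple also rewrites identifiers that contain several
underscores, so its output would need manual checking in the same way
as the rest of the semi-automatic clean-up.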

In the tokenised files, each line contains exactly one token.
Tokenisation was performed using the UDPipe tokeniser
(https://ufal.mff.cuni.cz/udpipe) with the Czech language model
provided as a baseline model in the CoNLL 2017 Shared Task
(czech-ud-2.0-conll17-170315.udpipe).

The TEI-compatible metadata at the top of each txt file was mainly
extracted from the metadata stated on the corpus website. In addition,
the number of tokens in each file was calculated from the tokenised
version of the file.
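
Because the tokenised files hold one token per line, a token count can
be reproduced by counting lines. The sketch below assumes that a 'tok'
file contains only tokens, that it is encoded as UTF-8, and that any
blank lines are separators rather than tokens. The function names and
the file name in the usage example are hypothetical.

```python
def count_tokens(lines):
    """Count tokens in an iterable of lines, one token per line."""
    # Blank lines are assumed to be separators, not tokens.
    return sum(1 for line in lines if line.strip())


def count_tokens_in_file(tok_path):
    """Count the tokens in a tokenised file (UTF-8 encoding assumed)."""
    with open(tok_path, encoding="utf-8") as handle:
        return count_tokens(handle)


if __name__ == "__main__":
    import sys

    # Usage: python count_tokens.py some_text.tok
    for path in sys.argv[1:]:
        print(path, count_tokens_in_file(path))
```

If the files follow these assumptions, summing the result over all
'tok' files should give the corpus total listed under Size.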


Size: 3 texts, with a total of 292,851 tokens.

Genre: books (see the metadata in each file for more detailed
information on the genres included).