------------------------
Italian Gutenberg texts
------------------------

The Italian Gutenberg texts on the HistCorp page are a subset of the texts provided by Project Gutenberg (http://www.gutenberg.org).

HistCorp inclusion date
------------------------
January 27, 2017

Website
--------
http://www.gutenberg.org

Contact information
--------------------
http://www.gutenberg.org/wiki/Gutenberg:Contact_Information

Licence
--------
http://www.gutenberg.org/license

The HistCorp files
-------------------
The texts are provided in a plain text format ('txt') and in a tokenised format ('tok').

The plain text files have been semi-automatically stripped of Gutenberg-specific metadata and of extratextual information such as page numbering, footnotes, and underscore signs marking emphasis. Metadata is instead given in a TEI-compatible format at the top of each file. The token count given in the metadata was calculated from the tokenised version of the file.

In the tokenised files, each text is split so that every line holds exactly one token. Tokenisation was performed using the UDPipe tokeniser (https://ufal.mff.cuni.cz/udpipe) with the Italian language model provided as a baseline model in the CoNLL17 Shared Task (italian-ud-2.0-conll17-170315.udpipe).

Size: 90 texts, with a total of 9,546,840 tokens.

Genre: books (see the metadata of each file for more detailed information on the genres included).
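
Because the tokenised files hold one token per line, a token count can be approximated by counting non-empty lines. The following is a minimal sketch, not the procedure used to produce the HistCorp metadata: the function name `count_tokens` and the sample tokens are hypothetical, and it assumes that the lines passed in are token lines only (any metadata header would have to be skipped first) and that blank lines are not tokens.

```python
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> int:
    """Count tokens in one-token-per-line input, ignoring blank lines.

    Assumption: every non-blank line is exactly one token; any metadata
    header has already been removed by the caller.
    """
    return sum(1 for line in lines if line.strip())


# Hypothetical sample standing in for the lines of a 'tok' file.
sample = ["Nel\n", "mezzo\n", "del\n", "cammin\n", "\n", ".\n"]
print(count_tokens(sample))  # prints 5
```

To apply the same function to a file on disk, an open file handle can be passed directly, for example `with open(path, encoding="utf-8") as fh: count_tokens(fh)`, where `path` and the UTF-8 encoding are assumptions about the local copy of the file.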