---------------------
Deutsches TextArchiv
---------------------

The DTA core corpus contains texts from different genres and text types,
compiled with the aim of creating a balanced historical reference corpus
for German. The DTA repository of the HistCorp site contains texts from
the time period 1600--1899, downloaded from Deutsches TextArchiv on
2017-03-06 (http://www.deutschestextarchiv.de/download).

HistCorp inclusion date
------------------------
March 06, 2017

Website
--------
http://www.deutschestextarchiv.de/download

Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
(http://creativecommons.org/licenses/by-nc/3.0/)

The HistCorp files
-------------------
On the HistCorp page, the texts from Deutsches TextArchiv are provided in
a plain text format ('txt') and a tokenised format ('tok').

The plain text files were created from the original XML files provided by
Deutsches TextArchiv by extracting the text parts of the XML files and
adding metadata from the XML files, in a TEI-compatible format, at the top
of each file. In addition, the number of tokens for each file has been
calculated from the tokenised version of the file.

In the tokenised files, the texts are split so that each line holds one
token. Tokenisation was performed using the UDPipe tokeniser
(https://ufal.mff.cuni.cz/udpipe) with the German language model provided
as a baseline model in the CoNLL17 Shared Task
(german-ud-2.0-conll17-170315.udpipe).

Size: 1,350 texts, with a total of 145,911,684 tokens.
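
Because the tokenised files hold one token per line, a per-file token count
can be approximated by counting lines. The following is a minimal sketch,
not the script used to produce the figures above. It assumes that the 'tok'
files are UTF-8 encoded, that blank lines (if any) are separators rather
than tokens, and that the files contain no metadata header. The function
names count_tokens and count_tokens_in_file are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> int:
    """Count non-blank lines, treating each one as a single token.

    Assumption: blank or whitespace-only lines are separators, not tokens.
    """
    return sum(1 for line in lines if line.strip())


def count_tokens_in_file(path: str, encoding: str = "utf-8") -> int:
    """Count tokens in a one-token-per-line file.

    Assumption: the file is UTF-8 encoded and contains only tokens.
    """
    with Path(path).open(encoding=encoding) as handle:
        return count_tokens(handle)
```

If the assumptions about blank lines or headers do not hold for a given
file, the result may differ from the token count recorded in the metadata
of the corresponding plain text file.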