---------------------
Deutsches TextArchiv 
---------------------

The DTA core corpus contains texts from different genres and text
types, compiled with the aim of creating a balanced historical
reference corpus for German. The DTA repository of the HistCorp site
contains texts from the period 1600--1899, downloaded from
Deutsches TextArchiv on 2017-03-06
(http://www.deutschestextarchiv.de/download).


HistCorp inclusion date
------------------------
March 06, 2017


Website
--------
http://www.deutschestextarchiv.de/download


Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
(http://creativecommons.org/licenses/by-nc/3.0/)


The HistCorp files
-------------------
On the HistCorp page, the texts from Deutsches TextArchiv are provided
in two formats: plain text ('txt') and tokenised ('tok').

The plain text files were created from the original XML files provided
by Deutsches TextArchiv by extracting the text parts of the XML
files and adding metadata from the XML files in a TEI-compatible
format at the top of each file. In addition, the number of tokens for
each file has been calculated based on the tokenised version of the
file.
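
As a rough illustration of this kind of extraction, the following Python
sketch pulls a title and the running text out of a small TEI document. It
is not the script used to build the HistCorp files: the function name
``extract_plain_text``, the choice of the title as the metadata field, and
the whitespace normalisation are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# TEI namespace. Which elements are read below is an assumption made for
# illustration, not a description of the actual HistCorp conversion.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_plain_text(xml_string):
    """Return (title, text) for a TEI document given as a string.

    The title is None if no <title> is found in the header; the text is
    the content of the <text> element with whitespace collapsed.
    """
    root = ET.fromstring(xml_string)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default=None, namespaces=TEI_NS
    )
    body = root.find(".//tei:text", TEI_NS)
    if body is None:
        return title, ""
    # itertext() walks all nested elements, so inline markup is dropped
    # while the words themselves are kept.
    text = " ".join("".join(body.itertext()).split())
    return title, text


# Tiny made-up sample document, used only to show the function at work.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Beispiel</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Ein <hi>kurzer</hi> Satz.</p></body></text>
</TEI>"""

title, text = extract_plain_text(sample)
print(title)
print(text)
```

The sample document is invented; real DTA files carry far richer headers
and markup, so a production conversion would need to handle more cases.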

In the tokenised files, the texts are split so that each line holds
one token. Tokenisation was performed using the UDPipe tokeniser
(https://ufal.mff.cuni.cz/udpipe) with the German language model
provided as a baseline model in the CoNLL17 Shared Task
(german-ud-2.0-conll17-170315.udpipe).
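
Because the tokenised format is one token per line, such a file can be
read and counted with a few lines of Python. The sketch below is only an
illustration: the helper name ``read_tokens`` is hypothetical, and treating
blank lines as non-tokens is an assumption, since the description above
does not say whether the files contain any.

```python
import io


def read_tokens(lines):
    """Yield one token per non-empty line of a tokenised text.

    Assumption: blank lines, if present, are separators and not tokens.
    """
    for line in lines:
        token = line.strip()
        if token:
            yield token


# In-memory stand-in for a 'tok' file, invented for the example.
sample_tok = io.StringIO("Ein\nkurzer\nSatz\n.\n")

tokens = list(read_tokens(sample_tok))
print(len(tokens), tokens)
```

For a file on disk, the same function can be passed an open file object,
for example ``open("example.tok", encoding="utf-8")``, where both the file
name and the UTF-8 encoding are assumptions.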

Size: 1,350 texts, with a total of 145,911,684 tokens.