-------- GerManC --------
Following the model of the ARCHER corpus and given the aim of representativeness, the GerManC corpus consists of text samples of about 2,000 words from eight genres: drama, newspapers, sermons and personal letters (to represent orally oriented registers) and narrative prose (fiction or non-fiction), scholarly (i.e. humanities), scientific and legal texts (to represent more print-oriented registers). In order to facilitate tracing historical developments, the whole period was divided into fifty-year sections (in this case 1650-1700, 1700-1750 and 1750-1800), and an equal number of texts from each genre was selected for each of these sub-periods.

Cited from http://ota.ox.ac.uk/desc/2544, June 17, 2017

HistCorp inclusion date
------------------------
January 17, 2017

Website
--------
http://ota.ox.ac.uk/desc/2544

Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
(http://creativecommons.org/licenses/by-nc-sa/3.0/)

The HistCorp files
-------------------
On the HistCorp page, the German texts from 'GerManC' are provided in a plain text format ('txt'), a tokenised format ('tok'), and a parsed tab-separated format ('ling'). The plain text files are the same as in the original GerManC package, except that metadata has been added in a TEI-compatible format at the top of each txt file. The metadata was mainly extracted from the TEI files that are part of the GerManC package. In addition, the number of tokens in each file has been calculated from the tokenised version of the file.

The parsed files are part of the original GerManC package and are presented in a tab-separated column format, with information on sentence boundaries, word forms, normalised spelling, lemmas, part-of-speech tags, morphological tags, and dependency information. Note that the annotation was mainly performed automatically, without subsequent manual correction.
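As a minimal sketch of how the tab-separated 'ling' files can be read, the snippet below collects the word form column from each line. The only layout fact relied on here is that the word form is the second column (which the tokenised files are derived from); the exact order of the remaining columns, and the sample lines themselves, are hypothetical and should be checked against GerManC_Documentation.pdf.

```python
def read_word_forms(lines):
    """Collect the word form (second tab-separated column) from each non-empty line."""
    forms = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # blank lines may separate sentences
        fields = line.split("\t")
        if len(fields) >= 2:
            forms.append(fields[1])
    return forms

# Hypothetical two-token sample illustrating the tab-separated layout.
sample = [
    "1\tDer\tder\tder\tART\n",
    "2\tKönig\tKönig\tKönig\tNN\n",
]
print(read_word_forms(sample))  # ['Der', 'König']
```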
See further the GerManC documentation (GerManC_Documentation.pdf).

In the tokenised files, the texts are split into one token per line. These files were created by extracting the second column from the parsed files, using the 'cut -f2' command.

Size: 336 texts, with a total of 774,375 tokens.

Genres: drama, newspapers, sermons, personal letters, narrative prose (fiction or non-fiction), scholarly (i.e. humanities), scientific and legal texts.
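The derivation of the 'tok' files from the 'ling' files can be mirrored in a few lines of Python, replicating what 'cut -f2' does on each line. The input lines below are hypothetical examples, not actual corpus content.

```python
def cut_second_column(ling_lines):
    """Return the second tab-separated field of each line, like 'cut -f2'."""
    out = []
    for line in ling_lines:
        fields = line.rstrip("\n").split("\t")
        # like cut, pass a line through unchanged when it contains no delimiter
        out.append(fields[1] if len(fields) >= 2 else fields[0])
    return out

ling = ["1\tIm\tin\tAPPRART\n", "2\tJahre\tJahr\tNN\n"]
print(cut_second_column(ling))  # ['Im', 'Jahre']
```

Note that GNU `cut` prints delimiter-free lines in full by default (suppressed only with `-s`), which the fallback branch above reproduces.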