-------- GerManC --------
Following the model of the ARCHER corpus and given the aim of representativeness, the GerManC corpus consists of text samples of about 2,000 words from eight genres: drama, newspapers, sermons and personal letters (to represent orally oriented registers) and narrative prose (fiction or non-fiction), scholarly (i.e. humanities), scientific and legal texts (to represent more print-oriented registers). In order to facilitate tracing historical developments, the whole period was divided into fifty-year sections (in this case 1650-1700, 1700-1750 and 1750-1800), and an equal number of texts from each genre was selected for each of these sub-periods.

Cited from http://ota.ox.ac.uk/desc/2544, June 17, 2017

HistCorp inclusion date
------------------------
January 17, 2017

Website
--------
http://ota.ox.ac.uk/desc/2544

Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
(http://creativecommons.org/licenses/by-nc-sa/3.0/)

The HistCorp files
-------------------
On the HistCorp page, the German texts from 'GerManC' are provided in a plain text format ('txt'), a tokenised format ('tok'), and a parsed tab-separated format ('ling'). The plain text files are the same as in the original GerManC package, except that metadata has been added in a TEI-compatible format at the top of each txt file. The metadata was mainly extracted from the TEI files that are part of the GerManC package. In addition, the number of tokens in each file has been calculated from the tokenised version of the file.

The parsed files are part of the original GerManC package and are presented in a tab-separated column format, with information on sentence boundaries, word forms, normalised spelling, lemmas, part-of-speech tags, morphological tags, and dependency information. Note that the annotation was mainly performed automatically, without subsequent manual correction.
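As a minimal sketch of how the tab-separated 'ling' files can be read, the snippet below collects the word form column from each line. The only layout fact relied on here is that the word form is the second column (which the tokenised files are derived from); the exact order of the remaining columns, and the sample lines themselves, are hypothetical and should be checked against GerManC_Documentation.pdf.

```python
def read_word_forms(lines):
    """Collect the word form (second tab-separated column) from each non-empty line."""
    forms = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # blank lines may separate sentences
        fields = line.split("\t")
        if len(fields) >= 2:
            forms.append(fields[1])
    return forms

# Hypothetical two-token sample illustrating the tab-separated layout.
sample = [
    "1\tDer\tder\tder\tART\n",
    "2\tKönig\tKönig\tKönig\tNN\n",
]
print(read_word_forms(sample))  # ['Der', 'König']
```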
See further the GerManC documentation (GerManC_Documentation.pdf).

In the tokenised files, the texts are split into one token per line. These files were created by extracting the second column from the parsed files, using the 'cut -f2' command.

Size: 336 texts, with a total of 774,375 tokens.

Genres: drama, newspapers, sermons, personal letters, narrative prose (fiction or non-fiction), scholarly (i.e. humanities), scientific and legal texts.
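The derivation of the 'tok' files from the 'ling' files can be mirrored in a few lines of Python, replicating what 'cut -f2' does on each line. The input lines below are hypothetical examples, not actual corpus content.

```python
def cut_second_column(ling_lines):
    """Return the second tab-separated field of each line, like 'cut -f2'."""
    out = []
    for line in ling_lines:
        fields = line.rstrip("\n").split("\t")
        # like cut, pass a line through unchanged when it contains no delimiter
        out.append(fields[1] if len(fields) >= 2 else fields[0])
    return out

ling = ["1\tIm\tin\tAPPRART\n", "2\tJahre\tJahr\tNN\n"]
print(cut_second_column(ling))  # ['Im', 'Jahre']
```

Note that GNU `cut` prints delimiter-free lines in full by default (suppressed only with `-s`), which the fallback branch above reproduces.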