---------------------------------------
Reference Corpus of Middle High German
---------------------------------------

The Reference Corpus of Middle High German (ReM) is a corpus of
diplomatically transcribed and annotated texts from Middle High German
(1050--1350) with a size of around 2 million word forms. It originated
from the research projects "Reference Corpus of Middle High German" and
"Middle High German grammar".

Cited from https://www.linguistics.rub.de/rem/, June 27, 2017

A detailed overview of the texts contained in the corpus may be found here:
https://www.linguistics.rub.de/rem/corpus/details.html

The transcriptions of the texts comprise two separate layers. The
diplomatic layer records historical graphemes and preserves the original
word boundaries; layout information, such as page or line breaks, refers
to this layer. The second layer adapts word boundaries to the
conventions of modern German and serves as the basis for all further
linguistic annotation. The texts have been annotated with
part-of-speech tags (using the HiTS tagset), morphology, lemmas, and
other information.

Cited from http://islrn.org/resources/332-536-136-099-5/, June 27, 2017

If you use this corpus in your work, please cite it as follows:

Klein, Thomas; Wegera, Klaus-Peter; Dipper, Stefanie; Wich-Reif, Claudia
(2016). Referenzkorpus Mittelhochdeutsch (1050–1350), Version 1.0,
https://www.linguistics.ruhr-uni-bochum.de/rem/. ISLRN 332-536-136-099-5.

HistCorp version
-----------------
December 22, 2016

Website
--------
https://www.linguistics.rub.de/rem/

Licence
--------
Creative Commons Attribution-ShareAlike 4.0
(https://creativecommons.org/licenses/by-sa/4.0/)

The HistCorp files
-------------------
On the HistCorp page, the texts from the Reference Corpus of Middle High
German are provided in a plain text format ('txt'), a tokenised format
('tok'), and a linguistically annotated format ('anno').
The plain text files were created automatically by extracting the text
fields from the original XML files.

In the tokenised files, each line holds one token. These files were
created from the original XML files provided in the ReM package by
extracting each token from the XML structure into a text file, and by
adding metadata from the XML files, in a TEI-compatible format, at the
top of each file. In addition, the number of tokens in each file was
calculated from this tokenised version.

The linguistically annotated files are identical to the original XML
files in the ReM package, with information on spelling normalisation,
lemma, part of speech, and morphology.

Size: 399 texts, with a total of 2,537,168 tokens.
Genres: everyday life, law, literature, poetry, religion, and science.
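The tokenised layout described above (a TEI-compatible metadata block at the top of the file, then one token per line) can be read with a few lines of code. The sketch below is illustrative only: the header delimiter (a closing </teiHeader> tag) and the sample file contents are assumptions, not taken from an actual HistCorp 'tok' file.

```python
def read_tokens(lines):
    """Return the tokens from a tokenised file, skipping the metadata header.

    Assumes the metadata block ends with a closing </teiHeader> tag;
    every non-empty line after that is treated as one token.
    """
    tokens = []
    in_header = True
    for line in lines:
        line = line.strip()
        if in_header:
            if line == "</teiHeader>":  # assumed end-of-header marker
                in_header = False
            continue
        if line:
            tokens.append(line)
    return tokens


# Hypothetical file contents, for illustration only:
sample = [
    "<teiHeader>",
    "  <title>Ein hypothetischer Text</title>",
    "</teiHeader>",
    "Swer",
    "an",
    "rehte",
    "güete",
]
print(len(read_tokens(sample)))  # token count for this file -> 4
```

Counting the lines returned by such a reader is how a per-file token total like the ones reported below could be computed.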