---------------------------------------
Reference Corpus of Middle High German
---------------------------------------

The Reference Corpus of Middle High German (ReM) is a corpus of
diplomatically transcribed and annotated texts from Middle High German
(1050--1350) with a size of around 2 million word forms. It originated
from the research projects "Reference Corpus of Middle High German" and
"Middle High German grammar".

Cited from https://www.linguistics.rub.de/rem/, June 27, 2017

A detailed overview of the texts contained in the corpus may be found here:
https://www.linguistics.rub.de/rem/corpus/details.html

The transcriptions of the texts comprise two separate layers. The
diplomatic layer records historical graphemes and preserves the original
word boundaries; layout information, such as page or line breaks, refers
to this layer. The second layer adapts word boundaries to the
conventions of modern German and serves as the basis for all further
linguistic annotation. The texts have been annotated with
part-of-speech tags (using the HiTS tagset), morphology, lemmas, and
other information.

Cited from http://islrn.org/resources/332-536-136-099-5/, June 27, 2017

If you use this corpus in your work, please cite it as follows:

Klein, Thomas; Wegera, Klaus-Peter; Dipper, Stefanie; Wich-Reif, Claudia
(2016). Referenzkorpus Mittelhochdeutsch (1050–1350), Version 1.0,
https://www.linguistics.ruhr-uni-bochum.de/rem/. ISLRN 332-536-136-099-5.

HistCorp version
-----------------
December 22, 2016

Website
--------
https://www.linguistics.rub.de/rem/

Licence
--------
Creative Commons Attribution-ShareAlike 4.0
(https://creativecommons.org/licenses/by-sa/4.0/)

The HistCorp files
-------------------
On the HistCorp page, the texts from the Reference Corpus of Middle High
German are provided in a plain text format ('txt'), a tokenised format
('tok'), and a linguistically annotated format ('anno').
The plain text files were created automatically by extracting the text
fields from the original XML files.

In the tokenised files, each line holds one token. These files were
created from the original XML files provided in the ReM package by
extracting each token from the XML structure into a text file, and by
adding metadata from the XML files, in a TEI-compatible format, at the
top of each file. In addition, the number of tokens in each file was
calculated from this tokenised version.

The linguistically annotated files are identical to the original XML
files in the ReM package, with information on spelling normalisation,
lemma, part of speech, and morphology.

Size: 399 texts, with a total of 2,537,168 tokens.
Genres: everyday life, law, literature, poetry, religion, and science.
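The tokenised layout described above (a TEI-compatible metadata block at the top of the file, then one token per line) can be read with a few lines of code. The sketch below is illustrative only: the header delimiter (a closing </teiHeader> tag) and the sample file contents are assumptions, not taken from an actual HistCorp 'tok' file.

```python
def read_tokens(lines):
    """Return the tokens from a tokenised file, skipping the metadata header.

    Assumes the metadata block ends with a closing </teiHeader> tag;
    every non-empty line after that is treated as one token.
    """
    tokens = []
    in_header = True
    for line in lines:
        line = line.strip()
        if in_header:
            if line == "</teiHeader>":  # assumed end-of-header marker
                in_header = False
            continue
        if line:
            tokens.append(line)
    return tokens


# Hypothetical file contents, for illustration only:
sample = [
    "<teiHeader>",
    "  <title>Ein hypothetischer Text</title>",
    "</teiHeader>",
    "Swer",
    "an",
    "rehte",
    "güete",
]
print(len(read_tokens(sample)))  # token count for this file -> 4
```

Counting the lines returned by such a reader is how a per-file token total like the ones reported below could be computed.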