---------------------------------
Medieval Charter Sections Corpus
---------------------------------

This package provides an evaluation framework, training and test data
for semi-automatic recognition of sections of historical diplomatic
manuscripts. The data collection consists of 57 Latin charters of 7
different types issued by the Royal Chancellery. The documents were
created in the era of John the Blind, King of Bohemia (1310–1346) and
Count of Luxembourg. The manuscripts were digitized and transcribed,
and the typical sections of medieval charters ('corroboratio',
'datatio', 'dispositio', 'inscriptio', 'intitulatio', 'narratio', and
'publicatio') were manually tagged. The manuscripts also contain
additional metadata, such as manually marked named entities and short
Czech abstracts.

Recognition models are first trained on the manually marked sections
of the training documents; a trained model can then be used to
recognize the sections in the test data. The parsing script supports
methods based on cosine distance, TF-IDF weighting, and an adapted
Viterbi algorithm.

	Cited from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1952
	March 18, 2020
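
As an illustration of the TF-IDF weighting and cosine measure
mentioned in the description above, the following Python sketch
assigns a text fragment the label of its most similar training
fragment. It is not the parsing script distributed with the corpus:
the function names, the smoothed IDF formula, the nearest-neighbour
decision rule and the three toy training fragments are all assumptions
made for this example, and the Viterbi step is left out.

```python
import math
from collections import Counter

# Toy training fragments, invented for this sketch (not taken from the corpus).
TRAIN = [
    ("intitulatio", "johannes dei gracia boemie rex".split()),
    ("datatio", "datum prage anno domini".split()),
    ("publicatio", "notum facimus universis".split()),
]


def inverse_document_frequencies(docs):
    """Smoothed IDF per term: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorise(tokens, idf):
    """Sparse TF-IDF vector; tokens unseen in training are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors (cosine distance is 1 minus this)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(tokens, train):
    """Return the label of the training fragment most similar to `tokens`."""
    idf = inverse_document_frequencies([doc for _, doc in train])
    query = vectorise(tokens, idf)
    best = max(train, key=lambda item: cosine_similarity(query, vectorise(item[1], idf)))
    return best[0]


print(classify("datum anno domini".split(), TRAIN))
```

With the toy data above, the query shares terms only with the
'datatio' fragment, so that label is printed. On real charters the
training fragments would come from the manually tagged sections.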


HistCorp inclusion date
------------------------
March 18, 2020


Website
--------
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1952


Cite
-----
Galuščáková, Petra and Neužilová, Lucie, 2018, Medieval Charter
Sections Corpus, LINDAT/CLARIAH-CZ digital library at the Institute of
Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and
Physics, Charles University


Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International (CC BY-NC-SA 4.0)
(http://creativecommons.org/licenses/by-nc-sa/4.0/)


The HistCorp files
-------------------
On the HistCorp page, the texts from the Medieval Charter Sections
Corpus are provided in a plain text format ('txt'), a tokenised format
('tok'), and a linguistically annotated format ('anno').  

The linguistically annotated files are essentially unchanged from the
original charter corpus: they are in an XML format whose linguistic
annotation consists mainly of named entities. The only modification in
the HistCorp files is that the original training, test and heldout
sets have been split into the 57 individual charters, following the
markup in the XML files.

The plain text files are derived from the XML files by automatically
extracting the text. The parts of the charters marked as 'abstract'
have been removed from the Latin plain text files (and from the
tokenised files), as these are written in Czech. (They can instead be
accessed from the Czech version of the Medieval Charter Sections
Corpus on the HistCorp page.)
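
The extraction step can be pictured with the Python sketch below. It
is only an approximation under stated assumptions, not the script used
for HistCorp: it assumes that the Czech abstract sits in an element
literally named 'abstract', and the element names in the sample string
('charter', 'intitulatio', 'ne') are hypothetical placeholders rather
than the corpus's documented schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus markup may use different element names.
SAMPLE = (
    "<charter>"
    "<abstract>Czech summary text.</abstract>"
    "<intitulatio>Johannes dei gracia <ne>Boemie</ne> rex</intitulatio>"
    "</charter>"
)


def extract_text(xml_string, skip_tag="abstract"):
    """Collect all text except the content of `skip_tag` elements."""
    parts = []

    def walk(elem):
        if elem.tag == skip_tag:
            return  # drop the abstract and everything nested inside it
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
            # Tail text follows the child but belongs to the parent, so it is
            # kept even when the child itself was skipped.
            if child.tail:
                parts.append(child.tail)

    walk(ET.fromstring(xml_string))
    # Join the pieces as they appear, then collapse runs of whitespace.
    return " ".join("".join(parts).split())


print(extract_text(SAMPLE))
```

For the sample string this prints the Latin text without the abstract.
Because the pieces are joined without an added separator, inline
markup inside a word does not split it; the trade-off is that adjacent
elements with no whitespace between them would be fused.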

In the tokenised files, the texts are split so that each line contains
one token. Tokenisation was performed using the UDPipe tokeniser
(https://ufal.mff.cuni.cz/udpipe) with the Latin language model
provided as a baseline model in the CoNLL17 Shared Task
(latin-ud-2.0-conll17-170315.udpipe).

Metadata has been added in a TEI-compatible format at the top of each
file. Most of it was extracted from the information given on the
corpus website. In addition, the number of tokens in each file has
been calculated from the tokenised version of the file.
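
Since a tokenised file holds one token per line, a token count of this
kind amounts to counting the non-empty lines. The Python sketch below
shows the idea; it assumes that the metadata header has already been
stripped from the input, and the function name is made up for this
example.

```python
def count_tokens(lines):
    """Count non-empty lines, i.e. tokens in a one-token-per-line text."""
    return sum(1 for line in lines if line.strip())


# Invented three-token example, followed by a blank line.
sample = "Johannes\ndei\ngracia\n\n"
print(count_tokens(sample.splitlines()))
```

For the invented sample this prints 3; the trailing blank line is not
counted.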

Size: 57 texts, with a total of 8,609 tokens.

Genre: charters