---------------------------------
Medieval Charter Sections Corpus
---------------------------------

This package provides an evaluation framework, training and test data for semi-automatic recognition of sections of historical diplomatic manuscripts. The data collection consists of 57 Latin charters issued by the Royal Chancellery of 7 different types. Documents were created in the era of John the Blind, King of Bohemia (1310–1346) and Count of Luxembourg. Manuscripts were digitized, transcribed, and typical sections of medieval charters ('corroboratio', 'datatio', 'dispositio', 'inscriptio', 'intitulatio', 'narratio', and 'publicatio') were manually tagged. Manuscripts also contain additional metadata, such as manually marked named entities and short Czech abstracts. Recognition models are first trained using manually marked sections in training documents and the trained model can then be used for recognition of the sections in the test data. The parsing script supports methods based on Cosine Distance, TF-IDF weighting and adapted Viterbi algorithm.

Cited from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1952, March 18, 2020

HistCorp inclusion date
------------------------
March 18, 2020

Website
--------
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1952

Cite
-----
Galuščáková, Petra and Neužilová, Lucie, 2018, Medieval Charter Sections Corpus, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Licence
--------
Creative Commons Attribution NonCommercial ShareAlike 4.0 International (CC BY-NC-SA 4.0) (http://creativecommons.org/licenses/by-nc-sa/4.0/)

The HistCorp files
-------------------
On the HistCorp page, the texts from the Medieval Charter Sections Corpus are provided in a plain text format ('txt'), a tokenised format ('tok'), and a linguistically annotated format ('anno').

The linguistically annotated files are essentially unchanged from the original charter corpus, i.e. they are in an XML format whose linguistic annotation consists mainly of named entities. The only modification in the HistCorp files is that the original training, test and heldout sets have been split into the 57 individual charters, following the markup in the XML files.

The plain text files are derived from the XML files by automatically extracting the text. The parts of the charters marked as 'abstract' have been removed from the Latin plain text files (and from the tokenised files), as these are written in Czech. (They can instead be accessed from the Czech version of the Medieval Charter Sections on the HistCorp page.)

In the tokenised files, the texts are split into tokens, with one token on each line. Tokenisation was performed using the UDPipe tokeniser (https://ufal.mff.cuni.cz/udpipe) with the Latin language model provided as a baseline model in the CoNLL17 Shared Task (latin-ud-2.0-conll17-170315.udpipe).

Metadata has been added in a TEI-compatible format at the top of each file. The metadata information was mainly extracted from the metadata stated on the corpus website. In addition, the number of tokens for each file has been calculated based on the tokenised version of the file.

Size: 57 texts, with a total of 8,609 tokens.
Genre: charters
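
As an illustration of how the plain text and the token counts described above could be derived, the following is a minimal sketch in Python and not the script actually used for HistCorp. It assumes that the Czech abstract is an XML element literally named 'abstract' and that the metadata header has already been stripped from the tokenised file; the function names and the sample element names are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_text(elem, skip=("abstract",)):
    """Collect the text of an element, leaving out skipped subtrees.

    The tag name 'abstract' is an assumption about the corpus markup.
    """
    if elem.tag in skip:
        return []
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.extend(extract_text(child, skip))
        # Text following a child belongs to the parent, so it is kept
        # even when the child itself was skipped.
        if child.tail:
            parts.append(child.tail)
    return parts


def plain_text(xml_string):
    """Return the whitespace-normalised text of one charter."""
    root = ET.fromstring(xml_string)
    return " ".join("".join(extract_text(root)).split())


def count_tokens(lines):
    """Count tokens in a one-token-per-line file (blank lines ignored).

    Assumes the metadata header is not part of the given lines.
    """
    return sum(1 for line in lines if line.strip())


# Hypothetical sample; the real files may use different element names.
sample = (
    "<charter><abstract>Cesky souhrn</abstract>"
    "<intitulatio>Nos <name>Johannes</name> dei gracia</intitulatio> "
    "<datatio>Datum Prage</datatio></charter>"
)
text = plain_text(sample)
tokens = count_tokens(["Nos\n", "Johannes\n", "\n", "dei\n"])
```

Summing the result of count_tokens over all tokenised files would give a corpus-level total comparable to the size stated above, provided the same header handling is applied to every file.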