------------------------------------------
Middle Polish Diachronic Lemmatised Corpus
------------------------------------------

The PolDiLemma corpus is a diachronic corpus made of political, religious, scientific and historical texts from different authors of the Middle Polish period (16th-18th century). Characteristic for this period is the slow development of a supra-regional standard language, a process of standardisation on the basis of the variety of the Polish nobility, under the influence of Latin and other foreign languages as well as different social or regional varieties. All texts (free licenses) are gathered from Federacja Bibliotek Cyfrowych (Digital Library Federation).

The Middle Polish texts illustrate the history of the language and give the opportunity to explore some first-hand evidence of the development of Polish in its historical context. Studying the history of the language is a way to familiarize oneself with aspects of the history of Poland in general. It also helps to build up valuable methodological knowledge in diachronic linguistics and philology.

Cited from http://fedora.clarin-d.uni-saarland.de/poldilemma/, February 6, 2020

HistCorp inclusion date
------------------------
February 6, 2020

Website
--------
http://fedora.clarin-d.uni-saarland.de/poldilemma/

Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (https://creativecommons.org/licenses/by-nc-sa/4.0/)

Citation
---------
Burzlaff, Paul and Meyer, Roland (2014). The PolDiLemma Middle Polish Diachronic Lemmatised Corpus. http://hdl.handle.net/11858/00-246C-0000-0023-8CD2-B

The HistCorp files
-------------------
On the HistCorp page, the Polish texts from 'PolDiLemma' are provided in a plain text format ('txt'), a tokenised format ('tok') and in a lemmatised format ('anno').
The plain text files ('txt') and the lemmatised files ('anno') are the same as in the original PolDiLemma package, except that metadata has been added in a TEI-compatible format at the top of each file. The metadata was mainly extracted from the metadata stated on the PolDiLemma site (http://fedora.clarin-d.uni-saarland.de/fedora/objects/clarind-uds:poldilemma/datastreams/CMDI/content). In addition, the number of tokens and sentences for each file has been calculated based on the tokenisation.

The lemmatised files follow a tab-separated format, with one token on each line, followed by a tab and the lemma of that token.

The tokenised files ('tok') were created by extracting only the first column from the lemmatised files. In addition, the files were reformatted to follow the HistCorp format for tokenised files.

In the PolDiLemma downloads, the same files are most often available in both the plain text version and the lemmatised version. There are, however, a few exceptions. The following subdirectories are only available in the plain text version:

- 1265
- 22683
- 230806
- 232107
- 232108
- 27393
- 27478
- 2782
- 29538
- 400
- 4404
- 545
- 561

Likewise, the following subdirectories are only available in the lemmatised version:

- 1265
- 22683
- 230806
- 232107
- 232108
- 27393
- 27478
- 2782
- 29538
- 400
- 4404
- 545
- 561

The size stated below is calculated based on the tokenised files.

Size: 11,395 texts, with a total of 4,247,229 tokens.
Genres: political, religious and scientific
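
As an illustration of the tab-separated 'anno' layout described above (one token per line, a tab, then the lemma), the following Python sketch reads token/lemma pairs and keeps only the first column, which is the step used to derive the token list. It is a minimal sketch and not the script used for HistCorp. The function names, the sample lines and the sample lemmas are invented for illustration. The sketch also assumes that metadata and blank lines contain no tab character, since the exact layout of the metadata header is not specified here, and it does not reproduce the HistCorp format for tokenised files.

```python
from typing import Iterable, Iterator, List, Tuple


def read_anno(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, lemma) pairs from the lines of an 'anno' file."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Assumption: metadata and blank lines contain no tab, so skip them.
        if "\t" not in line:
            continue
        # Split on the first tab only: token on the left, lemma on the right.
        token, lemma = line.split("\t", 1)
        yield token, lemma


def tokens_only(lines: Iterable[str]) -> List[str]:
    """Keep only the first column (the tokens), dropping the lemmas."""
    return [token for token, _ in read_anno(lines)]


# Hypothetical sample lines, not taken from the corpus.
sample = [
    "<title>hypothetical header line</title>\n",
    "\n",
    "Panie\tpan\n",
    "Boże\tbóg\n",
]

pairs = list(read_anno(sample))
tokens = tokens_only(sample)
```

To apply the same functions to a downloaded file, pass an open file handle instead of the sample list, for example `open(path, encoding="utf-8")`, where the UTF-8 encoding is also an assumption.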