------------------------------------------
Middle Polish Diachronic Lemmatised Corpus
------------------------------------------

The PolDiLemma corpus is a diachronic corpus made of political, religious, scientific and historical texts from different authors of the Middle Polish period (16th-18th century).

Characteristic for this period is the slow development of a supra-regional standard language, a process of standardisation on the basis of the variety of the Polish nobility, under the influence of Latin and other foreign languages as well as different social or regional varieties.

All texts (free licenses) are gathered from Federacja Bibliotek Cyfrowych (Digital Library Federation). The Middle Polish texts illustrate the history of the language and give the opportunity to explore some first-hand evidence of the development of Polish in its historical context.

Studying the history of the language is a way to familiarize oneself with aspects of the history of Poland in general. It also helps to build up valuable methodological knowledge in diachronic linguistics and philology.

	Cited from http://fedora.clarin-d.uni-saarland.de/poldilemma/
	February 6, 2020


HistCorp inclusion date
------------------------
February 6, 2020


Website
--------
http://fedora.clarin-d.uni-saarland.de/poldilemma/


Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (https://creativecommons.org/licenses/by-nc-sa/4.0/)


Citation
---------
Burzlaff, Paul and Meyer, Roland (2014). The PolDiLemma Middle Polish Diachronic Lemmatised Corpus. http://hdl.handle.net/11858/00-246C-0000-0023-8CD2-B


The HistCorp files
-------------------

On the HistCorp page, the Polish texts from 'PolDiLemma' are provided
in a plain text format ('txt'), a tokenised format ('tok') and in a
lemmatised format ('anno').

The plain text files ('txt') and the lemmatised files ('anno') are the
same as in the original PolDiLemma package, except that metadata has
been added in a TEI-compatible format at the top of each file. The
metadata was mainly extracted from the metadata record on the
PolDiLemma site:

  http://fedora.clarin-d.uni-saarland.de/fedora/objects/clarind-uds:poldilemma/datastreams/CMDI/content

In addition, the number of tokens and sentences in each file has been
calculated from the tokenised version.
 
The lemmatised files follow a tab-separated format, with one token on
each line, followed by a tab and the lemma of that token. The
tokenised files ('tok') were created by extracting only the first
column from the lemmatised files. In addition, the files were
reformatted to follow the HistCorp format for tokenised files.
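
As a rough illustration of the layout described above, the following
Python sketch reads token/lemma pairs from lines in the tab-separated
format and keeps only the first column. It is a minimal sketch, not
the procedure actually used for HistCorp. The function names and the
sample values are hypothetical, and the sketch assumes that metadata
lines and blank lines contain no tab character.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_anno_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, lemma) pairs from lines in the tab-separated layout."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumption: metadata and blank lines contain no tab, so skip them.
        if "\t" not in line:
            continue
        # Split on the first tab only: column 1 is the token, the rest the lemma.
        token, lemma = line.split("\t", 1)
        yield token, lemma


def tokens_only(lines: Iterable[str]) -> List[str]:
    """Keep only the first column, mirroring how the 'tok' files are described."""
    return [token for token, _ in parse_anno_lines(lines)]


# Hypothetical placeholder lines; these are not taken from the corpus.
sample = ["(metadata line)", "tokenA\tlemmaA", "tokenB\tlemmaB", ""]
pairs = list(parse_anno_lines(sample))
```

To try this on a real file, the lines of an opened 'anno' file can be
passed in place of the sample list, provided the no-tab assumption
holds for its metadata header.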

In the PolDiLemma downloads, most files are available in both the
plain text version and the lemmatised version. There are, however, a
few exceptions.

The following subdirectories are only available in the plain text
version:

  - 1265 
  - 22683
  - 230806
  - 232107 
  - 232108 
  - 27393
  - 27478
  - 2782 
  - 29538 
  - 400 
  - 4404 
  - 545 
  - 561

Likewise, the following subdirectories are only available in the
lemmatised version:

  - 1265 
  - 22683
  - 230806
  - 232107 
  - 232108 
  - 27393
  - 27478
  - 2782 
  - 29538 
  - 400 
  - 4404 
  - 545 
  - 561 
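
Which subdirectories exist in only one version can be checked against
a local copy of the downloads with a simple set comparison. The
sketch below is only an illustration: the function names are
hypothetical, and it assumes that each version has been unpacked into
its own root directory containing one subdirectory per identifier.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set


def subdir_names(root: str) -> Set[str]:
    """Return the names of the immediate subdirectories of root."""
    return {p.name for p in Path(root).iterdir() if p.is_dir()}


def compare_versions(txt_dirs: Iterable[str],
                     anno_dirs: Iterable[str]) -> Dict[str, List[str]]:
    """Report names present in only one of the two collections."""
    txt_set, anno_set = set(txt_dirs), set(anno_dirs)
    return {
        "txt_only": sorted(txt_set - anno_set),
        "anno_only": sorted(anno_set - txt_set),
    }


# Hypothetical identifiers, used only to show the shape of the result.
example = compare_versions({"a", "b"}, {"b", "c"})
```

With local copies in place, compare_versions(subdir_names(txt_root),
subdir_names(anno_root)) would give the two lists, where txt_root and
anno_root stand for wherever the two versions were unpacked.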

The size stated below is calculated based on the tokenised files.

Size: 11,395 texts, with a total of 4,247,229 tokens.

Genres: political, religious and scientific