On this page, we gather a wide range of historical corpora for different languages. If you use the resources provided on this page in your research, please cite the following paper:
Eva Pettersson and Beáta Megyesi (2018)
The HistCorp Collection of Historical Corpora and Resources.
In Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018. [pdf]
In the tables below, you may download historical corpora for seventeen different languages. For more information about the corpora for a particular language, and to download them, click on the name of that language. All resources are provided on an "AS IS", "WHERE IS", and "WITH ALL FAULTS" basis, without warranty of any kind, express or implied.
The following corpus is currently available for Coptic:
Name | Time Period | Genre(s) | Download: Dipl | Download: Dipl-Clean | Download: Complete | Download: Utils | Download: All | Source | Licence | Info |
Coptic Scriptorium | — | mixed | [txt] | [txt] | [complete] | [utils] | [all] | www | www | [readme] |
The following corpora are currently available for historical Czech:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
charters | 1310–1346 | charters | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
DIAKORP | 1350–1939 | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Gutenberg | 1890–1897 | fiction | [txt] | [tok] | — | — | [all] | www | www | [readme] |
You may also download all Czech corpora files (including readme files) here: all-czech-corpora.zip
The following corpora are currently available for historical Dutch:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
BaB | 1661–1783 | letters | — | — | — | — | — | www | register | [readme] |
EDGeS | 1360–1939 | bible | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
Gutenberg | 14nn–1875 | fiction | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Compilation | 1236–1938 | chancellery, narrative | — | — | — | — | — | www | Free for research | [readme] |
You may also download all Dutch corpora files (including readme files) here: all-dutch-corpora.zip
The following corpora are currently available for historical English:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
CLMET | 1710–1920 | mixed | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
EDGeS | 1395–1890 | bible | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
lampeter | 1640–1740 | tracts | [txt] | [tok] | — | — | [all] | www | www | [readme] |
You may also download all English corpora files (including readme files) here: all-english-corpora.zip
The following corpora are currently available for historical French:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
Paris | 1296–1790 | vernacular speech, tax-rolls | [txt] | [tok] | — | — | [all] | www | www | [readme] |
SRCMF | 842–1325 | — | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
You may also download all French corpora files (including readme files) here: all-french-corpora.zip
The following corpora are currently available for historical German:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
DTA | 1600–1899 | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
EDGeS | 1460–1871 | bible | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
GeMi | 1500–1690 | medicine | [txt] | [tok] | — | — | [all] | www | www | [readme] |
GerManC | 1654–1799 | mixed | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
LitHist | 1790–1829 | literature | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
ReM | 1050–1350 | mixed | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
ReN | 1200–1650 | mixed | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
Ridges | 1482–1914 | science | [txt] | [tok] | [txt] | [conll] | [all] | www | www | [readme] |
You may also download all German corpora files (including readme files) here: all-german-corpora.zip
The following corpora are currently available for Greek:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
AGLDT | — | mixed | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
Perseus | — | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Proiel | — | mixed | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
You may also download all Greek corpora files (including readme files) here: all-greek-corpora.zip
The following corpus is currently available for historical Hungarian:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
HGDS | 1440–1539 | codices | [txt] | [tok] | [txt] | [conll] | [all] | www | free | [readme] |
The following corpus is currently available for historical Icelandic:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
IcePaHC | 1150–2008 | mixed | [txt] | [tok] | [txt] | [txt] | [all] | www | www | [readme] |
The following corpus is currently available for historical Italian:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
Gutenberg | 1300–1897 | books | [txt] | [tok] | — | — | [all] | www | www | [readme] |
The following corpora are currently available for Latin:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
AGLDT | — | mixed | [txt] | [tok] | — | [xml] | [all] | www | www | [readme] |
charters | 1310–1346 | charters | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
Corpus Corporum | 100–1200 | mixed | — | — | — | — | — | www | www | [readme] |
LLCT1 | – | charters | [txt] | [tok] | — | — | [all] | www | www | [readme] |
LLCT2 | – | charters | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
Perseus | – | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Proiel | – | mixed | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
Thomisticus | 1225–1274 | mixed | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
You may also download all Latin corpora files (including readme files) here: all-latin-corpora.zip
The following corpus is currently available for historical Polish:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
PolDiLemma | 1567–1800 | mixed | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
The following corpus is currently available for historical Portuguese:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
Tycho | 1380–1881 | mixed | — | — | — | — | — | www | www | [readme] |
The following corpus is currently available for historical Russian:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
RNC | — | — | [txt] | [tok] | — | [conll] | [all] | www | www | [readme] |
The following corpora are currently available for historical Slovene:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
besedje | 1550–1603 | lexicon | [txt] | — | — | [xml] | [all] | www | www | [readme] |
DigLib | 1584–1918 | mixed | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
RefCorpus | 1584–1899 | mixed | [txt] | [tok] | [txt] | [txt] | [all] | www | www | [readme] |
lex | 1584–1918 | lexicon | [txt] | — | — | — | — | www | www | [readme] |
You may also download all Slovene corpora files (including readme files) here: all-slovene-corpora.zip
The following corpora are currently available for historical Spanish:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
IMPACT BVC | 1481–1962 | mixed | [txt] | [tok] | [txt] | — | [all] | www | www | [readme] |
IMPACT GT | 1543–1748 | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
You may also download all Spanish corpora files (including readme files) here: all-spanish-corpora.zip
The following corpora are currently available for historical Swedish:
Name | Time Period | Genre(s) | Download: Text | Download: Token | Download: Norm | Download: Anno | Download: All | Source | Licence | Info |
Dalin | 1850–1853 | lexicon | [txt] | — | — | [txt] | [all] | www | www | [readme] |
EDGeS | 1703–1917 | bible | [txt] | [tok] | — | [txt] | [all] | www | www | [readme] |
Fornsvenska | 1350–1758 | mixed | [txt] | [tok] | — | — | [all] | www | www | [readme] |
GaW | 1527–1812 | court, church | [txt] | [tok] | [txt] | — | [all] | www | Free for research | [readme] |
Gutenberg | 1789–1902 | books | [txt] | [tok] | — | — | [all] | www | www | [readme] |
Konsistoriet | 1624–1699 | protocols | [txt] | [tok] | — | — | [all] | www | Open Access | [readme] |
Schlyter | 500–1500 | lexicon | [txt] | — | — | [txt] | [all] | www | www | [readme] |
Swedberg | 1700–1735 | lexicon | [txt] | — | — | [txt] | [all] | www | www | [readme] |
You may also download all Swedish corpora files (including readme files) here: all-swedish-corpora.zip
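Once one of the per-language archives listed above (for example all-swedish-corpora.zip) has been downloaded, it can be unpacked and inspected with a few lines of Python. The following is only a minimal sketch: the function names `unpack_corpus_archive` and `find_readmes` are hypothetical helpers, and the internal folder layout of the archives is an assumption, since this page does not describe it. The only detail taken from the page is that the archives include readme files.

```python
import zipfile
from pathlib import Path


def unpack_corpus_archive(archive_path, target_dir):
    """Extract a downloaded corpus archive and return the extracted file paths.

    Hypothetical helper: the archive layout is not documented on this page,
    so no particular folder structure is assumed.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        names = archive.namelist()
    # Directory entries end with "/"; keep only regular files.
    return sorted(target / name for name in names if not name.endswith("/"))


def find_readmes(paths):
    """Return the paths whose file name contains 'readme' (case-insensitive)."""
    return [p for p in paths if "readme" in p.name.lower()]


# Example usage, assuming the archive has already been saved locally:
# files = unpack_corpus_archive("all-swedish-corpora.zip", "swedish-corpora")
# for readme in find_readmes(files):
#     print(readme)
```

Reading the readme files first is a reasonable starting point, since each corpus comes with its own licence and source information.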
Eva Pettersson, Department of Linguistics and Philology, Uppsala University, eva.pettersson@lingfil.uu.se
Beáta Megyesi, Department of Linguistics, Stockholm University, beata.megyesi@ling.su.se