HistCORP

Historical Corpora

On this page, we gather a wide range of historical corpora for different languages. If you use the resources provided on this page in your research, please cite the following paper:

Eva Pettersson and Beáta Megyesi (2018)
The HistCorp Collection of Historical Corpora and Resources.
In Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018. [pdf]

From the tables below, you can download historical corpora for seventeen languages. For more information about the language-specific corpora, and to download them, click on the name of the language of interest. All resources are provided on an "AS IS", "WHERE IS", and "WITH ALL FAULTS" basis, without warranty of any kind, express or implied.


Download Historical Corpora

Coptic

The following corpus is currently available for Coptic:

  • Coptic Scriptorium

Name Time Period Genre(s) Download Source Licence Info
Dipl Dipl-Clean Complete Utils All
Coptic Scriptorium mixed [txt] [txt] [complete] [utils] [all] www www [readme]

Czech

The following corpora are currently available for historical Czech:

  • Medieval Charter Sections Corpus (charters)
  • The diachronic section of the Czech National Corpus (DIAKORP)
  • Selection of books from the Gutenberg Project (Gutenberg)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
charters 1310–1346 charters [txt] [tok] [xml] [all] www www [readme]
DIAKORP 1350–1939 mixed [txt] [tok] [all] www www [readme]
Gutenberg 1890–1897 fiction [txt] [tok] [all] www www [readme]

You may also download all Czech corpora files (including readme files) here: all-czech-corpora.zip

Dutch

The following corpora are currently available for historical Dutch:

  • Brieven als Buit, BaB (not available for download, but included in the language models)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • Selection of books from the Gutenberg Project (Gutenberg)
  • The Compilation Corpus Historical Dutch (Compilation)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
BaB 1661–1783 letters www register [readme]
EDGeS 1360–1939 bible [txt] [tok] [txt] [all] www www [readme]
Gutenberg 14nn–1875 fiction [txt] [tok] [all] www www [readme]
Compilation 1236–1938 chancellery, narrative www Free for research [readme]

You may also download all Dutch corpora files (including readme files) here: all-dutch-corpora.zip

English

The following corpora are currently available for historical English:

  • The Corpus of Late Modern English Texts, version 3.1 (CLMET)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • The Lampeter Corpus of Early Modern English Tracts (lampeter)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
CLMET 1710–1920 mixed [txt] [tok] [txt] [all] www www [readme]
EDGeS 1395–1890 bible [txt] [tok] [txt] [all] www www [readme]
lampeter 1640–1740 tracts [txt] [tok] [all] www www [readme]

You may also download all English corpora files (including readme files) here: all-english-corpora.zip

French

The following corpora are currently available for historical French:

  • Paris speech in the past (Paris)
  • Syntactic Reference Corpus of Medieval French (SRCMF)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
Paris 1296–1790 vernacular speech, tax-rolls [txt] [tok] [all] www www [readme]
SRCMF 842–1325 [txt] [tok] [conll] [all] www www [readme]

You may also download all French corpora files (including readme files) here: all-french-corpora.zip

German

The following corpora are currently available for historical German:

  • Deutsches TextArchiv (DTA)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • The Nottingham Corpus of Early Modern German Midwifery and Women's Medicine (GeMi)
  • GerManC
  • German Literary History (LitHist)
  • Reference Corpus of Middle High German (ReM)
  • Reference Corpus of Middle Low German/Low Rhenish (ReN)
  • Register in Diachronic German Science (Ridges)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
DTA 1600–1899 mixed [txt] [tok] [all] www www [readme]
EDGeS 1460–1871 bible [txt] [tok] [txt] [all] www www [readme]
GeMi 1500–1690 medicine [txt] [tok] [all] www www [readme]
GerManC 1654–1799 mixed [txt] [tok] [conll] [all] www www [readme]
LitHist 1790–1829 literature [txt] [tok] [conll] [all] www www [readme]
ReM 1050–1350 mixed [txt] [tok] [xml] [all] www www [readme]
ReN 1200–1650 mixed [txt] [tok] [xml] [all] www www [readme]
Ridges 1482–1914 science [txt] [tok] [txt] [conll] [all] www www [readme]

You may also download all German corpora files (including readme files) here: all-german-corpora.zip

Greek

The following corpora are currently available for Greek:

  • Ancient Greek and Latin Dependency Treebank (AGLDT)
  • Perseus Digital Library (Perseus)
  • Proiel Treebank (Proiel)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
AGLDT mixed [txt] [tok] [xml] [all] www www [readme]
Perseus mixed [txt] [tok] [all] www www [readme]
Proiel mixed [txt] [tok] [conll] [all] www www [readme]

You may also download all Greek corpora files (including readme files) here: all-greek-corpora.zip

Hungarian

The following corpus is currently available for historical Hungarian:

  • Hungarian Generative Diachronic Syntax (HGDS)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
HGDS 1440–1539 codices [txt] [tok] [txt] [conll] [all] www free [readme]

Icelandic

The following corpus is currently available for historical Icelandic:

  • Icelandic Parsed Historical Corpus (IcePaHC)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
IcePaHC 1150–2008 mixed [txt] [tok] [txt] [txt] [all] www www [readme]

Italian

The following corpus is currently available for historical Italian:

  • Selection of books from the Gutenberg Project

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
Gutenberg 1300–1897 books [txt] [tok] [all] www www [readme]

Latin

The following corpora are currently available for Latin:

  • Ancient Greek and Latin Dependency Treebank (AGLDT)
  • Medieval Charter Sections Corpus (charters)
  • Corpus Corporum (not available for download, but included in the language models)
  • Late Latin Charter Treebank 1 (LLCT1)
  • Late Latin Charter Treebank 2 (LLCT2)
  • Perseus Digital Library (Perseus)
  • Proiel Treebank (Proiel)
  • Index Thomisticus Treebank (Thomisticus)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
AGLDT mixed [txt] [tok] [xml] [all] www www [readme]
charters 1310–1346 charters [txt] [tok] [txt] [all] www www [readme]
Corpus Corporum 100–1200 mixed www www [readme]
LLCT1 charters [txt] [tok] [all] www www [readme]
LLCT2 charters [txt] [tok] [conll] [all] www www [readme]
Perseus mixed [txt] [tok] [all] www www [readme]
Proiel mixed [txt] [tok] [conll] [all] www www [readme]
Thomisticus 1225–1274 mixed [txt] [tok] [conll] [all] www www [readme]

You may also download all Latin corpora files (including readme files) here: all-latin-corpora.zip

Polish

The following corpus is currently available for historical Polish:

  • Middle Polish Diachronic Lemmatised Corpus (PolDiLemma)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
PolDiLemma 1567–1800 mixed [txt] [tok] [txt] [all] www www [readme]

Portuguese

The following corpus is currently available for historical Portuguese:

  • Tycho Brahe Parsed Corpus of Historical Portuguese (Tycho)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
Tycho 1380–1881 mixed www www [readme]

Russian

The following corpus is currently available for historical Russian:

  • Middle Russian Corpus (RNC)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
RNC [txt] [tok] [conll] [all] www www [readme]

Slovene

The following corpora are currently available for historical Slovene:

  • Words of the 16th-Century Slovenian Literary Language (besedje)
  • Digital Library (DigLib)
  • Reference corpus of historical Slovene (RefCorpus)
  • Lexicon of historical Slovene (lex)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
besedje 1550–1603 lexicon [txt] [xml] [all] www www [readme]
DigLib 1584–1918 mixed [txt] [tok] [txt] [all] www www [readme]
RefCorpus 1584–1899 mixed [txt] [tok] [txt] [txt] [all] www www [readme]
lex 1584–1918 lexicon [txt] www www [readme]

You may also download all Slovene corpora files (including readme files) here: all-slovene-corpora.zip

Spanish

The following corpora are currently available for historical Spanish:

  • IMPACT-es diachronic corpus, BVC-section (IMPACT BVC)
  • IMPACT-es diachronic corpus, GT-section (IMPACT GT)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
IMPACT BVC 1481–1962 mixed [txt] [tok] [txt] [all] www www [readme]
IMPACT GT 1543–1748 mixed [txt] [tok] [all] www www [readme]

You may also download all Spanish corpora files (including readme files) here: all-spanish-corpora.zip

Swedish

The following corpora are currently available for historical Swedish:

  • Dalin's 19th Century Swedish Dictionary (Dalin)
  • The EDGeS Diachronic Bible Corpus (EDGeS)
  • Fornsvenska Textbanken (Fornsvenska)
  • Selection of books from the Gutenberg Project (Gutenberg)
  • Texts from the Gender and Work project (GaW)
  • Protocols from the Academic Consistory of Uppsala University (Konsistoriet)
  • Schlyter's Medieval Swedish Dictionary (Schlyter)
  • Swensk Ordabok by Jesper Swedberg (Swedberg)

Name Time Period Genre(s) Download Source Licence Info
Text Token Norm Anno All
Dalin 1850–1853 lexicon [txt] [txt] [all] www www [readme]
EDGeS 1703–1917 bible [txt] [tok] [txt] [all] www www [readme]
Fornsvenska 1350–1758 mixed [txt] [tok] [all] www www [readme]
GaW 1527–1812 court, church [txt] [tok] [txt] [all] www Free for research [readme]
Gutenberg 1789–1902 books [txt] [tok] [all] www www [readme]
Konsistoriet 1624–1699 protocols [txt] [tok] [all] www Open Access [readme]
Schlyter 500–1500 lexicon [txt] [txt] [all] www www [readme]
Swedberg 1700–1735 lexicon [txt] [txt] [all] www www [readme]

You may also download all Swedish corpora files (including readme files) here: all-swedish-corpora.zip
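
Once a per-language archive such as all-czech-corpora.zip has been downloaded, it can help to see which file formats it contains before unpacking it. The following Python sketch groups the files in a zip archive by file extension. It is only an illustration: the function name group_members_by_suffix is hypothetical, and the internal layout of the archives (folder names, file extensions such as .txt, .tok, .xml or .conll) is an assumption that should be checked against the readme files.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_members_by_suffix(archive_path):
    """Return {suffix: [member names]} for the files inside a zip archive.

    Hypothetical helper: the archive layout is not documented on this page,
    so nothing is assumed beyond the file being a regular zip archive.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip folder entries, keep only files
            # Files without an extension (e.g. "README") go under "(none)".
            suffix = PurePosixPath(info.filename).suffix.lower() or "(none)"
            groups[suffix].append(info.filename)
    return {suffix: sorted(names) for suffix, names in groups.items()}
```

Calling group_members_by_suffix("all-czech-corpora.zip") on a downloaded archive returns a dictionary keyed by extension, which makes it easy to pick out, for example, only the tokenised files.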




For questions or comments, or if there are corpora that you would like to add to this page, don't hesitate to contact us:

Eva Pettersson, Department of Linguistics and Philology, Uppsala University, eva.pettersson@lingfil.uu.se
Beáta Megyesi, Department of Linguistics, Stockholm University, beata.megyesi@ling.su.se