-------------------------------------------------
The Corpus of Late Modern English Texts (v. 3.1)
-------------------------------------------------

The Corpus of Late Modern English Texts, version 3.1 (CLMET3.1) has been created by Hendrik De Smet, Susanne Flach, Hans-Jürgen Diller and Jukka Tyrkkö, as an offshoot of a bigger project developing a database of text descriptors (Diller, De Smet & Tyrkkö 2011). CLMET3.1 is a principled collection of public domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text. It incorporates CLMET, CLMETEV, and CLMET3.0, and has been compiled following roughly the same principles, that is:

- The corpus covers the period 1710–1920, divided into three 70-year sub-periods.
- The texts making up the corpus have all been written by British and Irish authors who are native speakers of English.
- The corpus never contains more than three texts by the same author.
- The texts within each sub-period have been written by authors born within a correspondingly restricted sub-period.

However, compared to the earlier versions, it comes with a number of important improvements (in addition to being substantially bigger):

- CLMET3.1 comes with an explicit genre classification.
- It is approximately genre-balanced.
- It is part-of-speech tagged.
- The corpus files have standardized text headers containing descriptive meta-data.
- For each text, explicit information is provided on text provenance.
- The corpus architecture allows subsequent expansions.
- The corpus is CWB compatible.

The following table summarises the corpus make-up:

Sub-period    #authors    #texts    #words
------------------------------------------------
1710-1780     51          88        10,480,431
1780-1850     70          99        11,285,587
1850-1920     91          146       12,620,207
TOTAL         212         333       34,386,225

The corpus covers five major genres: narrative fiction, narrative non-fiction, drama, letters and treatise, in addition to a number of unclassified texts.
The genre-division per sub-period is as follows:

Genre                    1710-1780    1780-1850    1850-1920
------------------------------------------------------------------
Narrative fiction        4,642,670    4,830,718    6,311,301
Narrative non-fiction    1,863,855    1,940,245      958,410
Drama                      407,885      347,493      607,401
Letters                  1,016,745      714,343      479,724
Treatise                 1,114,521    1,692,992    1,782,124
Other                    1,434,755    1,759,796    2,481,247

Cited from http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html, February 10, 2020

HistCorp inclusion date
------------------------
February 19, 2020

Website
--------
http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html

Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (https://creativecommons.org/licenses/by-nc-sa/4.0/)

Citation
---------
Diller, H., De Smet, H., Tyrkkö, J. (2011). A European database of descriptors of English electronic texts. The European English Messenger 19, 21-35.

The HistCorp files
-------------------
On the HistCorp page, the texts from the Corpus of Late Modern English Texts are provided in a plain text format ('txt'), a tokenised format ('tok'), and a linguistically annotated format ('anno').

The plain text files were created from the original CLMET files by removing the XML tags and converting the metadata at the top of each file to the format used for all HistCorp files. In addition, the number of tokens for each file has been calculated based on the tokenised version of the file. Furthermore, following the HistCorp format, spaces separating punctuation from the words have been removed.

In the tokenised files, the texts are split into one token per line, with blank lines separating sentences. Since word and sentence boundaries were already marked in the original CLMET plain text files, no additional tokenisation was needed for this step.

CLMET also provides linguistic annotation in one single file, containing all the texts in the corpus.
This file is provided under the 'anno' tab on the HistCorp page, and is given in a tab-separated format, with one token on each line followed by its part-of-speech tag, its lemma and its class. Here, 'class' is based on the 11 so-called ‘Oxford simplified wordclass tags’ (Burnard 2007), which subsume groups of POS tags under their more general word classes: for example, all verb tags (VVI, VBN, VDD, VM0, VHI etc.) are assigned the simplified class tag VERB, and all noun tags the class tag SUBST. See further the manual for CLMET 3.1, downloadable from the CLMET website (http://fedora.clarin-d.uni-saarland.de/clmet/clmet.html).

Size: 333 texts, with a total of 40,838,175 tokens.
Genres: Narrative fiction, narrative non-fiction, drama, letters and treatise.
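As an illustration of the two line-based formats described above, the following Python sketch reads a tokenised file (one token per line, blank lines between sentences) and an annotated file (token, part-of-speech tag, lemma and class, separated by tabs). It is a minimal sketch under assumptions, not part of the HistCorp or CLMET distribution: the function and class names are hypothetical, the sample rows are made up for demonstration, and skipping lines that do not have exactly four fields is an assumption about how any non-token lines should be handled.

```python
from collections import Counter
from typing import Iterable, Iterator, List, NamedTuple


class AnnotatedToken(NamedTuple):
    """One row of the annotated file; field order follows the description above."""
    token: str
    pos: str
    lemma: str
    word_class: str


def read_tokenised(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield sentences as lists of tokens; a blank line ends a sentence."""
    sentence: List[str] = []
    for line in lines:
        token = line.strip()
        if token:
            sentence.append(token)
        elif sentence:
            yield sentence
            sentence = []
    if sentence:  # last sentence may not be followed by a blank line
        yield sentence


def read_annotated(lines: Iterable[str]) -> Iterator[AnnotatedToken]:
    """Yield one AnnotatedToken per line with exactly four tab-separated fields."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # assumption: anything else (e.g. blank lines) is not a token row
        yield AnnotatedToken(*fields)


def class_counts(tokens: Iterable[AnnotatedToken]) -> Counter:
    """Count tokens per simplified word class (e.g. VERB, SUBST)."""
    return Counter(tok.word_class for tok in tokens)


if __name__ == "__main__":
    # Made-up sample rows, for demonstration only.
    sample = ["been\tVBN\tbe\tVERB\n", "house\tNN1\thouse\tSUBST\n"]
    for word_class, count in class_counts(read_annotated(sample)).items():
        print(word_class, count)
```

For real files, the same functions can be given an open file handle, for example `with open(path, encoding="utf-8") as handle: ...`, where `path` points to a downloaded 'tok' or 'anno' file; the UTF-8 encoding is likewise an assumption.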