---------------------- IMPACT-es GT-section ----------------------

The IMPACT-es diachronic corpus of historical Spanish compiles over one hundred books, containing approximately 8 million words, in addition to a complementary lexicon which links more than 10 thousand lemmas with attestations of the different variants found in the documents. This textual corpus and the accompanying lexicon have been released under an open licence (Creative Commons by-nc-sa) in order to permit their intensive exploitation in linguistic research. Approximately 7% of the words in the corpus (a selection aimed at enhancing the coverage of the most frequent word forms) have been annotated with their lemma, part of speech, and modern equivalent.

Cited from https://www.digitisation.eu/tools-resources/language-resources/impact-es/ August 29, 2017

HistCorp inclusion date
------------------------
January 30, 2017

Website
--------
https://www.digitisation.eu/tools-resources/language-resources/impact-es/

Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License (https://creativecommons.org/licenses/by-nc-sa/3.0/) and GNU General Public License GPL3 (https://www.gnu.org/licenses/gpl-3.0.en.html)

Citation
---------
Sánchez-Martínez, F., Martínez-Sempere, I., Ivars-Ribes, X., Carrasco, R.C.: An open diachronic corpus of historical Spanish. Published in Language Resources and Evaluation. Available at: http://link.springer.com/article/10.1007%2Fs10579-013-9239-y

The HistCorp files
-------------------
On the HistCorp page, the Spanish texts from the GT-section of the IMPACT corpus are provided in a plain text format ('txt') and in a tokenised format ('tok'). The plain text files were created from the original IMPACT XML files by extracting the text parts and adding metadata from the XML files in a TEI-compatible format at the top of each file. In addition, the number of tokens in each file has been calculated based on the tokenised version of the file.
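The extraction of text parts from the XML files can be sketched as follows. This is a minimal illustration only: the actual IMPACT XML schema and the hypothetical sample element names (`doc`, `page`, `line`) are assumptions, and a real conversion would select specific text elements and prepend the TEI-compatible metadata header described above.

```python
import xml.etree.ElementTree as ET

def extract_text(xml_string):
    """Gather all text content of an XML document in document order.

    A stand-in for the IMPACT XML-to-plain-text step: it simply
    concatenates every non-empty text node.
    """
    root = ET.fromstring(xml_string)
    return " ".join(t.strip() for t in root.itertext() if t.strip())

# Hypothetical sample; the real IMPACT element names may differ.
sample = "<doc><page><line>En un lugar</line><line>de la Mancha</line></page></doc>"
print(extract_text(sample))  # En un lugar de la Mancha
```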
In the tokenised files, each token appears on its own line. Tokenisation was performed with the UDPipe tokeniser (https://ufal.mff.cuni.cz/udpipe), using the Spanish language model provided as a baseline model in the CoNLL17 Shared Task (spanish-ud-2.0-conll17-170315.udpipe).

Size: 21 texts, with a total of 6,309,761 tokens.
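The one-token-per-line format makes token counting trivial, which is how per-file token counts like the one above can be derived. The sketch below assumes only that format; running the UDPipe tokeniser itself would additionally require the model file named above.

```python
def count_tokens(tok_text):
    """Count tokens in a tokenised ('tok') file: one token per line.

    Blank lines (e.g. possible sentence boundaries) are not counted.
    """
    return sum(1 for line in tok_text.splitlines() if line.strip())

# A fragment in the one-token-per-line format (illustrative content):
tok = "En\nun\nlugar\nde\nla\nMancha\n,\n"
print(count_tokens(tok))  # 7
```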