----------------------
IMPACT-es BVC-section
----------------------

The IMPACT-es diachronic corpus of historical Spanish compiles over
one hundred books, containing approximately 8 million words, in
addition to a complementary lexicon which links more than 10 thousand
lemmas with attestations of the different variants found in the
documents. This textual corpus and the accompanying lexicon have been
released under an open license (Creative Commons by-nc-sa) in order to
permit their intensive exploitation in linguistic research.

Approximately 7% of the words in the corpus (a selection aimed at
enhancing the coverage of the most frequent word forms) have been
annotated with their lemma, part of speech, and modern equivalent.

	Cited from https://www.digitisation.eu/tools-resources/language-resources/impact-es/
	August 29, 2017


HistCorp inclusion date
------------------------
January 30, 2017


Website
--------
https://www.digitisation.eu/tools-resources/language-resources/impact-es/


Licence
--------
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
License (https://creativecommons.org/licenses/by-nc-sa/3.0/) and GNU
General Public License GPL3 (https://www.gnu.org/licenses/gpl-3.0.en.html)


Citation
---------
Sánchez-Martínez, F., Martínez-Sempere, I., Ivars-Ribes, X., Carrasco,
R.C.: An open diachronic corpus of historical Spanish. Published in
Language Resources and Evaluation. Available at:
http://link.springer.com/article/10.1007%2Fs10579-013-9239-y


The HistCorp files
-------------------
On the HistCorp page, the Spanish texts from the BVC-section of the
'IMPACT' corpus are provided in a plain text format ('txt') and in a
tokenised format ('tok'). In addition, a manually normalised subset of
the word forms is provided in a word-aligned format ('norm') that maps
each historical spelling to a modern spelling. The normalised data
also include information on lemma and part-of-speech.

The plain text files were created by extracting the tokens from the
original BVC XML files, and adding metadata from the original XML
files in a TEI-compatible format at the top of each file. In addition,
the number of tokens for each file has been calculated based on the
tokenised version of the file.

In the tokenised files, each text is split into one token per line;
the word forms were extracted from the original IMPACT XML files.
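
As a minimal sketch of how a one-token-per-line file could be read,
the Python snippet below counts the tokens in such a file. It assumes
that blank lines, if any occur, are not tokens; the function name
count_tokens and the sample tokens are made up for illustration and
are not taken from the corpus files.

```python
import io


def count_tokens(lines):
    """Count tokens, treating every non-blank line as one token."""
    return sum(1 for line in lines if line.strip())


# Stand-in for an opened 'tok' file; the tokens here are invented.
sample = io.StringIO("En\nvn\nlugar\nde\nla\nMancha\n")
print(count_tokens(sample))
```

For a real file, the StringIO object would be replaced by an open file
handle, for example open(path, encoding="utf-8"), where the encoding
is an assumption to be checked against the downloaded files.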

Some of the word forms in the corpus have been annotated with their
lemma, part of speech, and modern spelling equivalent. According to
the corpus creators, these normalised word forms make up approximately
7% of the words, selected so as to enhance the coverage of the most
frequent word forms. On the HistCorp page, the normalised word forms
are stored in a single file ('norm') in a tab-separated format
containing the original spelling in the first column, the modern
spelling in the second column, the lemma in the third column, and the
part-of-speech in the fourth column.
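
The four-column layout described above could be read along the
following lines. This is only a sketch: the names NormEntry and
read_norm are hypothetical, the sample rows and their part-of-speech
tags are invented placeholders, and skipping lines that do not have
exactly four fields is an assumption about how to treat irregular
lines, not documented behaviour of the 'norm' file.

```python
from typing import Iterable, Iterator, NamedTuple


class NormEntry(NamedTuple):
    original: str  # column 1: historical spelling
    modern: str    # column 2: modern spelling
    lemma: str     # column 3: lemma
    pos: str       # column 4: part-of-speech


def read_norm(lines: Iterable[str]) -> Iterator[NormEntry]:
    """Yield one NormEntry per well-formed tab-separated line."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # assumption: ignore lines without four columns
        yield NormEntry(*fields)


# Invented example rows; the tags are placeholders, not the corpus tagset.
sample = ["dixo\tdijo\tdecir\tVERB\n", "vn\tun\tuno\tDET\n"]
for entry in read_norm(sample):
    print(entry.original, "->", entry.modern, f"({entry.lemma}, {entry.pos})")
```

Because read_norm accepts any iterable of lines, it can also be passed
an open file handle for the 'norm' file directly.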

Size: 86 texts, with a total of 2,379,039 tokens.