---------------------------
The Gender and Work Corpus
---------------------------

The Gender and Work corpus on the HistCorp page contains court records and church documents from the Early Modern Swedish period (approx. 1550--1800), kindly provided by the historians in the Gender and Work project at the Department of History, Uppsala University (http://gaw.hist.uu.se/).

HistCorp inclusion date
------------------------
March 11, 2017

Website
--------
http://gaw.hist.uu.se/

Contact information
--------------------
http://gaw.hist.uu.se/kontakta-oss/

Licence
--------
Free for research.

The HistCorp files
-------------------
The Gender and Work texts on the HistCorp page are provided in a plain text format ('txt'), a tokenised format ('tok'), and, for a manually normalised subset of the corpus, a word-aligned format mapping historical spelling to modern spelling ('norm').

For the plain text files, any pdf, rtf, or doc files were semi-automatically converted to plain text. Metadata is given in a TEI-compatible format at the top of each file. The token count recorded in the metadata was calculated from the tokenised version of the file.

In the tokenised files, the texts are split so that each line holds one token. Tokenisation was performed using the UDPipe tokeniser (https://ufal.mff.cuni.cz/udpipe) with the Swedish language model provided as a baseline model in the CoNLL17 Shared Task (swedish-ud-2.0-conll17-170315.udpipe).

Parts of the corpus have been manually normalised with respect to spelling, and are provided in a separate download marked 'norm'. The download for normalised texts also contains a subdirectory named 'experiments', with the same mappings of historical spelling to modern spelling, but divided into training, development and test sets identical to the data sets used by Pettersson et al. in the paper 'A Multilingual Evaluation of Three Spelling Normalisation Methods for Historical Text' (http://aclweb.org/anthology//W/W14/W14-0605.pdf).

Size: 23 texts, with a total of 1,102,272 tokens.
Genres: court records and church documents.
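
Reading the files: a sketch
----------------------------
The 'tok' and 'norm' formats described above can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions, not a tool shipped with the corpus. The one-token-per-line layout of the 'tok' files is taken from the description above. The layout of the 'norm' files is NOT specified here, so the sketch assumes one "historical<TAB>modern" pair per line; check this against the actual files and adjust the separator if needed. The function names and the sample words are made up for illustration, and any metadata header at the top of a file would have to be skipped separately.

```python
from typing import Iterable, List, Tuple


def read_tokens(lines: Iterable[str]) -> List[str]:
    """Collect tokens from a 'tok' file, which holds one token per line.

    Blank lines are skipped; everything else is treated as a token.
    """
    tokens = []
    for line in lines:
        token = line.strip()
        if token:
            tokens.append(token)
    return tokens


def read_pairs(lines: Iterable[str], sep: str = "\t") -> List[Tuple[str, str]]:
    """Collect (historical, modern) spelling pairs from a 'norm' file.

    ASSUMPTION: each line has the form historical<sep>modern. The real
    separator is not documented in this README, so `sep` is a parameter.
    Lines that do not split into exactly two fields are skipped.
    """
    pairs = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) != 2:
            continue
        historical, modern = fields[0].strip(), fields[1].strip()
        if historical and modern:
            pairs.append((historical, modern))
    return pairs


if __name__ == "__main__":
    # Made-up sample lines standing in for real corpus files.
    sample_tok = ["Anno\n", "1650\n", "\n", "then\n"]
    sample_norm = ["oc\toch\n", "hafwer\thar\n"]

    print(len(read_tokens(sample_tok)))  # 3
    print(read_pairs(sample_norm))
```

To use the sketch on a real file, pass an open file object instead of the sample lists, for example read_tokens(open(path, encoding="utf-8")), where path points to one of the downloaded files. The UTF-8 encoding is also an assumption.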