Sara Stymne
I am Docent in Computational Linguistics, working as a senior lecturer (universitetslektor) in the Computational Linguistics and Language Technology group, Department of Linguistics and Philology, Uppsala University since 2017. I have been working in this group since 2012 as a post-doc (2012-2015), as a researcher (2015-2017), and as an assistant professor (2017-2023).
Since 2025, I have been the director of Språkbanken CLARIN, a division of the national research infrastructure Språkbanken, and the Swedish national coordinator of CLARIN ERIC. CLARIN is a digital infrastructure offering data, tools, and services to support research based on language resources.
My main research interests are cross-lingual NLP and digital philology. I am interested in how computational linguistics can be used to solve research questions in other fields, researched through collaborations with scholars from Scandinavian languages, political science, and other fields. My work on cross-lingual NLP has mostly focused on dependency parsing, where I have also worked on cross-genre methods. My earlier work was focused on machine translation, mainly on discourse-aware translation, compound processing, and error analysis.
I was previously a researcher at the Department of Computer and Information Science at Linköping University until 2012. I received a PhD in Computational Linguistics from Linköping University in 2012, with the thesis Text Harmonization Strategies for Phrase-Based Statistical Machine Translation. I received a Licentiate degree in Computational Linguistics in 2009, and a Master's degree in Cognitive Science in 2006, both from Linköping University.
I spent the spring of 2009 and the autumn of 2010 at Xerox Research Centre Europe in Grenoble, France.
Projects
Current
- PI for Language change and non-fictional texts – a large-scale investigation of Late Modern Swedish (1800–1950). Project funded by VR, 2025-2028.
- Enabling climate-resilient development: How disasters can act as a pathway to a safer and more sustainable world. PI: Daniel Nohrstedt, Department of Government, Uppsala University. Project grant from Marianne and Marcus Wallenberg Foundation, 2023-2027.
Current infrastructure projects
- Språkbanken. Funded by VR, 2025-2028. I am the director of the division Språkbanken CLARIN.
Previous
- Fictional prose and language change. The role of colloquialization in the history of Swedish 1830–1930. PI: David Håkansson. Project funded by VR, 2021-2023. My main role was to develop language technology tools for the analysis of dialogue, narrative and stylistic features of literature.
- Domain-sensitive cross-lingual dependency parsing. This project had funding from eSSENCE at Uppsala University for a postdoc for two years, 2020-2022.
- Datalab for results in the public sector. Project funded by Vinnova. The Uppsala University focus was on a subproject with the goal of identifying causality in government reports, in collaboration with The Swedish National Financial Management Authority and RISE.
- Från närläsning till fjärrläsning: digital humaniora och nya former för textanalys. (From close to distant reading: digital humanities and new forms for textual analysis). PI: Johan Svedjedal. Collaboration project 2017-2019, funded by Circus at Uppsala University.
- Efficient Algorithms for Natural Language Processing Beyond Sentence Boundaries. Postdoc project funded by eSSENCE - The e-Science Collaboration, 2012-2015.
Software and resources
UD-MULTIGENRE. A reorganisation of a subset of Universal Dependencies treebanks, with instance-level genre annotations. Main developer: Vera Danilova.
I'm the Swedish language leader for the PARSEME multiword expression data. The latest release: PARSEME 2.0.
I have also developed several smaller datasets in collaboration with colleagues, hosted on the Uppsala University GitHub page. For details, see each dataset:
- UU_SwedishNovels_1800-1940. A corpus of Swedish literary novels and collections of short stories from 1800–1940.
- LitDialogSilver. The data contains automatically annotated speech segments and speech tags from 88 Swedish novels and collections of short stories where quotation marks are used for speech marking.
- SOU Corpus. Cleaned and further processed versions of Swedish Government Official Reports - Statens offentliga utredningar (SOU).
- Swedish Causality Datasets. Three data sets of Swedish text annotated for the presence of causality. The sets are annotated for two different tasks: causality recognition and causality ranking.
uuPronPred is a BiLSTM-based system for cross-lingual pronoun prediction.
uuparser is a dependency parser based on BiLSTM feature extractors (main developer: Miryam de Lhoneux).
Docent is a document-level machine translation decoder (main developer: Christian Hardmeier).
Blast is a tool for error analysis of machine translation output.
Annotated compounds in German and Swedish. Small sets of running text from Europarl annotated with compounds in two ways.