Cross-Lingual Dependency Parsing

This text is based on an earlier assignment whose purpose was to try out cross-lingual parsing with a neural parser, uuparser, developed by the Uppsala parsing team with Miryam de Lhoneux as the main developer. uuparser is similar to the Kiperwasser and Goldberg parser you read about in literature seminar 2. The variant you will use is the transition-based parser. We recommend that you run uuparser on our Linux system, where everything is installed and working.

To train the parser, you should use the Universal Dependencies (UD) treebanks. Version 2.9 is available on our Linux system and can be used directly.

Data

Use the Universal Dependencies (UD) data, version 2.9, in this assignment. The data is available on the Linux system at: /corpora/ud/ud-treebanks-v2.9 Note that since we run the parser with limited resources, you cannot use the full treebanks, but need to limit yourselves to a low-resource scenario by only using a subset of sentences from the training data. Using at most 1000 training sentences per language is usually fine; with a much higher number of sentences, run times become very long.

You need to prepare your own directory for the languages that you are interested in, and copy the relevant parts of the data there. Note that you should keep the naming convention, i.e. the folder for each language should have the same name as in the original structure (e.g. UD_Swedish-LinES), and the training and development files should also keep their original names (e.g. sv_lines-ud-train.conllu and sv_lines-ud-dev.conllu), but their sizes should be modified. You do not need to copy the test files, or any additional files, since you will not use them. Below, we will refer to the short treebank name used in the file names, "sv_lines" in the example above, as the ISO id of the treebank.

If you are using a language for which you want to keep 100 sentences for training, you need a training set of 100 sentences and a development set of at least 100 sentences. If your language has both these sets, copy the first 100 sentences of the train set and keep the full development set (unless it is extremely large, in which case you might want to limit it). If your language does not have a development set, copy 100 sentences from the train set to your train set, and another 100 sentences to your development set. Avoid using treebanks with fewer than 200 sentences.

To select the first N sentences from a CoNLL-U file, you can use the following script:
/local/kurs/parsing/assign4/select-n-conllu-sentences.perl N input-file output-file
where input-file is the original CoNLL-U file you are reading from, output-file is the file you write to, and N is the number of sentences you want to copy.
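
For example, a minimal sketch of preparing a reduced Swedish-LinES treebank with 100 training sentences, assuming a hypothetical target directory mydata/ (the source paths follow the UD 2.9 location given above):

# create a folder with the same name as in the original UD structure
mkdir -p mydata/UD_Swedish-LinES
# keep only the first 100 training sentences
/local/kurs/parsing/assign4/select-n-conllu-sentences.perl 100 /corpora/ud/ud-treebanks-v2.9/UD_Swedish-LinES/sv_lines-ud-train.conllu mydata/UD_Swedish-LinES/sv_lines-ud-train.conllu
# keep the full development set
cp /corpora/ud/ud-treebanks-v2.9/UD_Swedish-LinES/sv_lines-ud-dev.conllu mydata/UD_Swedish-LinES/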

Parser

The parser, uuparser, is available on the Linux computer system. You can run the parser using the command uuparser. Treebanks should be given using the ISO id, i.e. the short name for each treebank that is used in the names of the data files (for instance sv_lines or en_ewt).

One of the specification files currently used by the parser corresponds to UD version 2.2. If the parser cannot find your treebank, that is probably because you are using a treebank that is newer than 2.2. In that case, you need to add the following flag to all commands:
--json-isos /local/kurs/parsing/ud2.9_iso.json

To train uuparser for a single language, use the command:
uuparser --outdir [results directory] --datadir [your directory of UD files with the structure UD_**/iso_id-ud-train/dev.conllu] --include [treebank to train on denoted by its ISO id] --disable-rlmost
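
For example, a sketch of training on the reduced Swedish-LinES data prepared above (mydata and exp1-tgt are hypothetical directory names; --json-isos is included since UD 2.9 is newer than 2.2):

uuparser --outdir exp1-tgt --datadir mydata --include sv_lines --disable-rlmost --json-isos /local/kurs/parsing/ud2.9_iso.json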

To test uuparser, where LANG is the treebank you want to test on (and which the parser was also trained on):
uuparser --predict --outdir [results directory] --modeldir [model directory in the form model_dir/LANG-iso-id] --datadir [directory of UD files with the structure UD_**/iso_id-ud-train/dev.conllu] --multiling --include [LANG-iso-id]
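
Continuing the hypothetical example above, parsing the sv_lines development set with the model trained in exp1-tgt could look like:

uuparser --predict --outdir exp1-tgt-eval --modeldir exp1-tgt/sv_lines --datadir mydata --multiling --include sv_lines --json-isos /local/kurs/parsing/ud2.9_iso.json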

To test uuparser on a different language (NEW) than the language the model was trained on (TRN):
uuparser --predict --outdir [results directory] --modeldir [model directory in the form model_dir/TRN-iso-id] --datadir [directory of UD files with the structure UD_**/iso_id-ud-train/dev.conllu] --multiling --include [NEW-iso-id:TRN-iso-id]

The parser automatically chooses the model with the best development score. The model directory should be the one specified when training the parser, followed by the treebank name. Check that this directory contains the model file "barchybrid.model". Note that the include flag needs to specify the ISO id of the treebank you want to test on and the ISO id of the treebank that you trained the model on, with a colon in between.
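
As a hypothetical sketch, if a model was trained on en_ewt with output directory exp2a, evaluating it on the sv_lines development set (zero-shot) could look like:

uuparser --predict --outdir exp2a-eval --modeldir exp2a/en_ewt --datadir mydata --multiling --include sv_lines:en_ewt --json-isos /local/kurs/parsing/ud2.9_iso.json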

To train uuparser for multiple languages, use the command:
uuparser --outdir [results directory] --datadir [your directory of UD files with the structure UD_**/iso_id-ud-train/dev.conllu] --disable-rlmost --include ["treebanks to train on denoted by their ISO ids"] --multiling

Note that you need quotes around the treebanks when you include more than one treebank (for instance "sv_lines sv_talbanken" to train on two Swedish treebanks).
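
For instance, a sketch (again with hypothetical directory names) of training a joint model on the Swedish-LinES and English-EWT subsets:

uuparser --outdir exp3a --datadir mydata --disable-rlmost --include "sv_lines en_ewt" --multiling --json-isos /local/kurs/parsing/ud2.9_iso.json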

Note: use different output directories for the different experiments, so that the parser does not overwrite previous output that you may still need. Also note that these experiments take some time to run. Each experiment should run for the default 30 epochs. Each epoch probably takes less than a minute for 100 sentences, and somewhere around 3-6 minutes for 500-600 sentences. Thus, when planning your time, take into account that you will have to wait a while for your results!

In all experiments, use the default settings for uuparser, except for the flag --disable-rlmost, which disables what is called the extended feature set, i.e. information about the children of items on the stack.

Example small setup

In this section, I will give an example of a simple experiment for trying out uuparser. We will use three example treebanks:

  • A target language treebank (TGT)
    This is the language that you are attempting to parse. It is meant to be treated as a low resource language, but it is OK to use a high resource language and simulate such a scenario by limiting the training data. It is helpful, but not required, to choose a target language that you know.
  • A transfer language treebank that you believe will be good (GTRF)
    This should be a language with more resources than your target language, which you think might help in parsing the target language, for instance because it is (closely) related, shares some important linguistic features, is a contact language, or for some other reason.
  • A transfer language treebank that you do not believe will be good (or at least not as good as GTRF) (BTRF)
    This should be a language with more resources than your target language, which you think might not help in parsing the target language, for instance because it is not related, has different important linguistic features, perhaps uses a different script, or for some other reason(s).
For trying out uuparser, a good size for TGT is 100 sentences, and a good size for GTRF and BTRF is 500 sentences. If so, make sure that your target language has at least 200 sentences in total in its training and development data, and that your transfer languages each have at least 600 sentences of training data.

For cases where a language has more than one treebank, you can pick any that is large enough. If possible you can try to match genres, but that is not required, so focus more on language choice than treebank choice.

You could then run the following five experiments; example commands are sketched after the list. See above for information on commands and how to handle data.

  1. Train a monolingual parsing model on 100 sentences from TGT, and record the scores on the TGT development set that are reported during training.
  2. Zero-shot transfer (not recommended in practice, since it gives poor results)
    a. Train a monolingual model on 500 sentences from GTRF and evaluate it on the TGT development set, using the model from the best iteration.
    b. Train a monolingual model on 500 sentences from BTRF and evaluate it on the TGT development set, using the model from the best iteration.
  3. Few-shot transfer
    a. Train a multilingual model on 100 sentences from TGT and 500 sentences from GTRF, and record the scores on the TGT development set that are reported during training.
    b. Train a multilingual model on 100 sentences from TGT and 500 sentences from BTRF, and record the scores on the TGT development set that are reported during training.
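
As a rough sketch of what the corresponding commands could look like (directory names such as mydata, exp1, exp2a are hypothetical; TGT, GTRF and BTRF stand for the ISO ids of your chosen treebanks; add --json-isos as described above if needed):

# 1: monolingual baseline on the target language
uuparser --outdir exp1 --datadir mydata --include TGT --disable-rlmost
# 2a: train on the good transfer language only, then parse the TGT development set
uuparser --outdir exp2a --datadir mydata --include GTRF --disable-rlmost
uuparser --predict --outdir exp2a-eval --modeldir exp2a/GTRF --datadir mydata --multiling --include TGT:GTRF
# 2b: the same as 2a, but with BTRF instead of GTRF
# 3a: few-shot, train jointly on the target and the good transfer language
uuparser --outdir exp3a --datadir mydata --include "TGT GTRF" --disable-rlmost --multiling
# 3b: the same as 3a, but with BTRF instead of GTRF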

Note that in all cases you are using a limited amount of data, in order to keep the run time of your experiments reasonable. In a real setting it is quite likely that you would use more data for the transfer languages. Also note that in this assignment, your two transfer languages do not actually need to have more data than the target language, but this can be simulated by not using all available data.

Evaluation

Here are evaluations and results you could consider for this small sample experiment:
  1. Present the UAS and LAS scores for the TGT development set for the best iteration for each of your five systems.
  2. Draw learning curves showing how the scores on the TGT development set develop over the epochs for systems 1, 3a, and 3b. Preferably, draw all three curves in the same plot.
  3. Do a small qualitative evaluation where you compare the errors for a few sentences that have different parses across (a subset of) systems.
For all three parts, you should discuss your findings and, if possible, try to explain them.

For the qualitative evaluation you may use MaltEval. Note, however, that you will need to convert the files to CoNLL-X format for it to work, which can be done with this script: /corpora/ud/ud-tools-v2.9/conllu_to_conllx.pl
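
A minimal sketch of such a conversion, assuming the script reads CoNLL-U from standard input and writes CoNLL-X to standard output (check the script itself if unsure):

perl /corpora/ud/ud-tools-v2.9/conllu_to_conllx.pl < sv_lines-ud-dev.conllu > sv_lines-ud-dev.conll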

Designing a project

For your project, you need to come up with a plan for some aspect of cross-lingual parsing that you want to explore. Some suggestions of directions are listed below. You will be expected to motivate and describe a set of experiments, perform those experiments, and evaluate and analyse the results. The number of experiments to run may differ depending on which aspect(s) you want to explore, but may be in the range 5-10.

  • Investigate the choice of transfer language for a given target language, by exploring the effect of using different transfer languages. You should make your choice of transfer languages in some principled way, for instance based on how closely related the languages are, comparing relatedness with neighboring languages, etc. Discuss the effects of this choice.
  • Vary the size of the training data for the target and (1 or a few) transfer languages, and discuss the effects.
  • If any of your proposed transfer languages have more than one treebank, explore the effect of using data from different treebanks, and possibly also of using different parts of the treebanks.
  • Experiment with using more than one transfer language at a time, by training models with three or more languages in them (multi-source models). Think carefully about the size of the data here.

It is also possible to explore some other issue. Note that for this project it is fine to evaluate on the dev sets from UD (which is what is described here).

Report

Report by uploading a pdf report describing your experiments and discussing your findings, as well as giving a basic high-level description of uuparser and multilingual parsing.

History

This text was first written by Sara Stymne, 2020, and updated in 2023.