Syntactic Analysis (5LN455): Assignment 4

Note: bachelor students only!

In this assignment, you will learn how to use a state-of-the-art system for dependency parsing (MaltParser) and evaluate its performance.

The assignment is structured into smaller tasks; detailed instructions on each of these tasks are given below. These instructions also specify how to report your work on that task in the lab report.

Setup

To get started, go to the MaltParser website and download the latest release of the parser, either in tar.gz or in zip format. Then, follow the installation instructions. Once you have tested your installation and know that everything works, read the Start Using MaltParser section of the User Guide. In that section you will learn how to train a parsing model on a data set (the training data), and use that model to parse unseen data (the testing data).

Please note that the testing data serves two purposes at the same time: it contains the sentences (tagged with part-of-speech information) that the parser should assign dependency analyses to, and it also contains the gold-standard analyses that you can use to evaluate the performance of the parser. These gold-standard analyses are not visible to the parser during parsing. (If they were, the parser could just assign the gold-standard analysis to each sentence and would receive a perfect score.)

Task 1: Train a Baseline Model

Your first task is to train a useful parsing model on realistic data. The workflow for this is exactly the same as the one that you used in the setup phase. The only things that change are

  • the training data: sv-ud-train.conllu
  • the testing data: sv-ud-dev.conllu
  • since this data is in CoNLL-U format, add the flag: -if conllu

These files are from the Universal Dependencies version of the Swedish treebank, which is based on Talbanken. You can copy them from: /local/kurs/parsing/assign4/. Make sure you use the flag -if conllu for the rest of the lab.
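
If you are unsure about the exact invocation, the two steps will look roughly like this (the jar file name depends on the release you downloaded, and the configuration name baseline is just an example; see the Start Using MaltParser section for the authoritative syntax):

% java -jar maltparser-X.Y.Z.jar -c baseline -i sv-ud-train.conllu -if conllu -m learn
% java -jar maltparser-X.Y.Z.jar -c baseline -i sv-ud-dev.conllu -if conllu -o baseline-out.conllu -m parse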

Note that training a model with this data will take quite a bit longer than training the dummy model from the setup phase.

Reporting: Report the time it took to train and parse with the model, as well as the hardware configuration of the computer that you used for this experiment (processor type, amount of memory).

Task 2: Evaluate the Baseline Model

Now that you have trained a parsing model and used it to parse the testing data, your next task is to evaluate the performance of your system. For this you will use two measures: labelled attachment score (LAS) and unlabelled attachment score (UAS). In both cases you compare the parser's output to the gold-standard analyses in the testing data.

You can read more about LAS and UAS in Section 6.1 of the KMN book. You should implement the word-based version of these measures: UAS is the proportion of words that are assigned the correct head, and LAS is the proportion of words that are assigned both the correct head and the correct dependency label.

While there are several tools available for computing attachment scores, you are asked to implement your own evaluator. You can use any programming language you want; the only requirement is that the evaluator should be callable from the command line. It should accept exactly two arguments: the file with the gold-standard data and the file with the system output. For example, you should be able to do something like this (note that the numbers in the example are made up!):

% python depEval.py sv-ud-dev.conllu my-out-file.conllu
Total number of edges: 1000
Number of correct edges (LAS): 639
Number of correct edges (UAS): 722
LAS: 0.64
UAS: 0.72

In order to write the evaluator, you need to know about the format of the data files, CoNLL-U. Please see this page for detailed information. Remember to ignore any comment lines, which start with the hash symbol (#). In the Swedish data there are no instances of multiword tokens or empty nodes.
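
To make the expected behaviour concrete, here is one possible sketch of such an evaluator in Python (the file name depEval.py and all other names are only examples; you are free to organise your own program differently, as long as it computes the word-based scores):

#!/usr/bin/env python3
"""Sketch of a word-based LAS/UAS evaluator: depEval.py GOLD SYSTEM"""
import sys

def read_tokens(path):
    """Return a (HEAD, DEPREL) pair for every word token in a CoNLL-U file."""
    tokens = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multiword tokens and empty nodes
            tokens.append((cols[6], cols[7]))  # HEAD and DEPREL columns
    return tokens

def main():
    gold = read_tokens(sys.argv[1])
    system = read_tokens(sys.argv[2])
    assert len(gold) == len(system), "the two files contain different numbers of tokens"
    total = len(gold)
    uas = sum(1 for (gh, _), (sh, _) in zip(gold, system) if gh == sh)
    las = sum(1 for g, s in zip(gold, system) if g == s)
    print("Total number of edges:", total)
    print("Number of correct edges (LAS):", las)
    print("Number of correct edges (UAS):", uas)
    print("LAS: {:.2f}".format(las / total))
    print("UAS: {:.2f}".format(uas / total))

if __name__ == "__main__":
    main()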

Reporting: Include the code for your evaluator in the lab report (or hand it in as a separate file if it is longer than one page). Also give the LAS and UAS scores for your baseline system from Task 1.

Task 3: Selecting a Good Parsing Algorithm

MaltParser supports several parsing algorithms; these are described in the Parsing Algorithm section of the User Guide. Your next task is to select the best algorithm for the data at hand, where the 'best' algorithm is the one that gives the highest LAS or UAS. The two metrics may rank the algorithms slightly differently, so you will have to decide (and state) which of them you base your choice on. For each algorithm, you need to train a separate parsing model, use it to parse the testing data, and evaluate the parser's performance as in Task 2.

You can restrict your search to the algorithms described in the Nivre and Stack families. In each family you should try all variants (given by the flag -a). For the projective algorithms (in both families) try at least one type of pseudo-projective parsing (use the flag -pp). There should thus be at least 8 different combinations of algorithms and arguments. You may also try other options if you wish.
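
As an illustration, training one such combination might look roughly as follows (nivreeager and head are only one possible pair of values here; take the full lists of values for -a and -pp from the user guide):

% java -jar maltparser-X.Y.Z.jar -c nivreeager_pp -i sv-ud-train.conllu -if conllu -a nivreeager -pp head -m learn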

Reporting: Report the LAS and UAS scores for all algorithms and combinations of arguments that you tried, making sure that it is clear which configuration each result belongs to. Write down which algorithm you picked in the end.

Task 4: Feature Engineering

MaltParser processes a sentence from left to right, and at each point takes one of a small set of possible transitions. (Details will be presented in the lectures.) In order to predict the next action, MaltParser uses feature models. Up to now, you have been using the baseline feature model for whatever algorithm you were experimenting with. However, one can often do much better than that.

Your next task is to improve the feature model by exploiting the fact that the training and testing data contain morphological features such as case, tense, and definiteness. These are specified in the FEATS column (column 6) of the CoNLL-U format. Here is an example:

9       lön     lön     NOUN    NN|UTR|SIN|IND|NOM      Case=Nom|Definite=Ind|Gender=Com|Number=Sing    5       appos   _       SpaceAfter=No

This line specifies that the word lön (wage) has common gender (Gender=Com), singular number (Number=Sing), indefinite form (Definite=Ind), and nominative case (Case=Nom). This is useful information during parsing. See also the file PATH_TO_MALTPARSER/appdata/dataformat/conllu.xml for the MaltParser specification of the CoNLL-U format.

Read the Feature Model section of the user guide to find out how to extract the value of the FEATS column and split it into a set of atomic features using the delimiter | (pipe). Then create a copy of the CoNLL-U feature model file used by the algorithm that you selected in Task 3, and make the necessary modifications to the copy. Finally, train a new parsing model with the extended feature model, use it to parse the testing data, and evaluate its performance. Note that you have to pass the copied and modified file to the parser when you train it, as specified in the manual.

Note: If you are using an algorithm from the Nivre family, then you should extract the features for Input[0] and Stack[0]. If you are using an algorithm from the Stack family, then you should use Stack[0] and Stack[1].
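
As a rough illustration, the lines you add might look something like the following (this assumes a Nivre-family algorithm and an XML-style feature specification file; take the exact function names and surrounding markup from the Feature Model section of the user guide and from the baseline file you copied):

<feature>Split(InputColumn(FEATS, Stack[0]), |)</feature>
<feature>Split(InputColumn(FEATS, Input[0]), |)</feature>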

Reporting: Write down the lines that you added to the baseline feature model, and how this affected the LAS and UAS of the parser relative to the score that you got for Task 3. (Note that you need to retrain the parser with the new feature model in order to see changes.)

Task 5: Error analysis

In addition to standard evaluation with metrics like LAS, it can be useful to do a more detailed error analysis of the system output. In this task you will explore two ways of doing error analysis: automatically and manually. You will apply both methods to the output of your chosen system from Task 3 and of your system from Task 4.

Your first subtask is to write a program that performs a simple form of error analysis, to find out which categories are most often confused. For each erroneous link, your program should store the predicted label and the gold-standard label, for instance "nsubj-obj". You only have to care about the label, not about the predicted head. Your program should keep a count for each type of confusion and output the fifteen most frequent confusions together with how many times they occur. Again, you may use any programming language, and your program should be callable from the command line, taking the files with the gold standard and the system output as arguments.
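
One possible outline of such a program in Python is sketched below; it counts the tokens whose predicted label differs from the gold label and prints the fifteen most frequent predicted-gold pairs (again, all names are only examples):

#!/usr/bin/env python3
"""Sketch of a label confusion counter: labelConfusions.py GOLD SYSTEM"""
import sys
from collections import Counter

def read_deprels(path):
    """Return the DEPREL column for every word token in a CoNLL-U file."""
    deprels = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multiword tokens and empty nodes
            deprels.append(cols[7])
    return deprels

def main():
    gold = read_deprels(sys.argv[1])
    system = read_deprels(sys.argv[2])
    assert len(gold) == len(system), "the two files contain different numbers of tokens"
    confusions = Counter("{}-{}".format(s, g) for g, s in zip(gold, system) if s != g)
    for pair, count in confusions.most_common(15):
        print(pair, count)

if __name__ == "__main__":
    main()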

Run your program twice, once for your selected system from task 3, and once for your system from task 4, and compare the results. Briefly discuss the differences or lack of differences. To aid your discussion, the list of relations available here may be useful.

For your second subtask, find a sentence that gets a different tree from your selected system from Task 3 and your system from Task 4. Write down the line number where the sentence you picked starts. Draw the trees for the sentence, as parsed by the two systems and as given in the gold standard (if the sentence is very long, you may choose to draw only a relevant part of the tree). Discuss the difference(s), both between the two systems and in comparison with the gold standard, and why you think they occur. (If you do not speak Swedish, try your best at this task.)

Reporting: Hand in your program and include the result of the automatic analysis, the trees and a discussion of the questions in your report. You may draw the trees by hand, in which case you can leave the trees in Sara's postbox (number 108).

VG task

In order to earn a VG you have to solve a complementary task, in which you show the ability to analyse your work, synthesise new ideas, and/or critically assess your implementation. You can choose from the following tasks:
  • Explore the effect of the training and test data. In the folder /local/kurs/parsing/assign4/extra/ there is training and dev data from an additional treebank, sv_lines. This treebank has been developed separately from the other Swedish treebank you worked on. There are differences both in the domains of the data and in some of the annotation choices made. Investigate the effect of training on either of the treebanks and/or on their concatenation. The evaluation, analysis and discussion should be detailed; only giving LAS and UAS scores will not be enough.
  • Extend the feature engineering experiments in task 4. Some possible extensions:
    • Apply them to more than one parsing model
    • Investigate the use of lemmas and xpostags, besides morphological features
    • Explore how many words on the stack and buffer to use
    Make sure that you motivate and analyse your experiments in an interesting way.
  • Investigate the effect of using predicted POS tags instead of gold POS tags. More info here.

Reporting: Describe your experiments, the results, and the conclusions you can draw from them. Note that it is not enough just to do a VG task in order to receive a VG on the assignment. You also have to show that you can analyse the results well by providing insightful discussions of this and the other subtasks.

Submission and Grading

Submit by uploading a zip or gz archive to Studentportalen containing your report and your code. Make sure that your name appears in the report and code files. You can write your report in Swedish or English.

The assignment will be graded mainly based on your written lab report, but the code will also be taken into account.

The submission deadline for this assignment is January 12, 2018.

Problems

Please do not hesitate to contact me as soon as possible either personally or via email in case you encounter any problems.

Enjoy!

History

This assignment was developed for 5LN455 by Marco Kuhlmann, 2011-2012. Modified by Sara Stymne, 2015, 2017.