Syntactic Analysis (5LN455): Assignment 4

In this assignment, you will learn how to use a state-of-the-art system for dependency parsing (MaltParser) and evaluate its performance.

The assignment is structured into smaller tasks; detailed instructions on each of these tasks are given below. These instructions also specify how to report your work on that task in the lab report.

Setup

To get started, go to the MaltParser website and download the latest release of the parser, either in tar.gz or in zip format. Then, follow the installation instructions. Once you have tested your installation and know that everything works, read the Start Using MaltParser section of the User Guide. In that section you will learn how to train a parsing model on a data set (the training data), and use that model to parse unseen data (the testing data).
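For orientation, the basic train-and-parse cycle described in that section looks roughly as follows; the jar file name depends on the release you downloaded, and the data file names below are just placeholders:

% java -jar maltparser-1.8.jar -c example -i train_data.conll -m learn
% java -jar maltparser-1.8.jar -c example -i test_data.conll -o out.conll -m parse

Here -c names the parsing model (the configuration), -i and -o give the input and output files, and -m selects between learning and parsing mode.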

Please note that the testing data serves two purposes at the same time: it contains the sentences (tagged with part-of-speech information) that the parser should assign dependency analyses to, and it also contains gold-standard analyses that you can use to evaluate the performance of the parser. These gold-standard analyses are not visible to the parser during parsing. (If they were, the parser could just assign the gold-standard analysis to each sentence and would receive a perfect score.)

Task 1: Train a Baseline Model

Your first task is to train a useful parsing model on realistic data. The workflow for this is exactly the same as the one that you used in the setup phase; the only things that change are the data files: instead of the small example data, you will now use the provided training and testing data.

Note that training a model with this data will take quite a bit longer than training the dummy model from the setup phase.
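One convenient way to obtain the timings asked for below is to prefix your commands with the Unix time command, for example (the jar and file names are placeholders):

% time java -jar maltparser-1.8.jar -c baseline -i swedish_dep_train.conll -m learn

Time the parsing command in the same way.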

Reporting: Report the time it took to train and parse with the model, as well as the hardware configuration of the computer that you used for this experiment (processor type, amount of memory).

Task 2: Evaluate the Baseline Model

Now that you have trained a parsing model and used it to parse the testing data, your next task is to evaluate the performance of your system. For this you will use two measures: labelled attachment score (LAS) and labelled exact match (LEM). In both cases you compare the parser’s output to the gold-standard analyses in the testing data.

You can read more about LAS and LEM in section 6.1 of the KMN book. You should implement the word-based version of LAS.

While there are several tools available for computing LAS and LEM, you are asked to implement your own evaluator. You can use any programming language you want; the only requirement is that the evaluator should be callable from the command line. It should accept exactly two arguments: the file with the gold-standard data, and the file with the system output. For example, you should be able to do something like this:

% java DepEval swedish_dep_dev.conll out.conll
Total number of edges: 9339
Number of correct edges: 6678
LAS: 0.715066
Total number of sentences: XXX
Number of correct sentences: YYY
LEM: 0.ZZZ

In order to write the evaluator, you need to know about the format of the data files; please see this page for detailed information.
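To make the expected behaviour concrete, here is a minimal sketch of such an evaluator in Python. It assumes the CoNLL format described on that page (tab-separated columns with HEAD in column 7 and DEPREL in column 8, and blank lines between sentences); the file name depeval.py and the exact output wording are of course up to you.

import sys

def read_sentences(path):
    """Yield each sentence as a list of (head, deprel) pairs."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                            # blank line = sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
            else:
                cols = line.split("\t")
                sentence.append((cols[6], cols[7]))  # HEAD, DEPREL
    if sentence:
        yield sentence

def main():
    gold_file, system_file = sys.argv[1], sys.argv[2]
    total_edges = correct_edges = 0
    total_sents = correct_sents = 0
    for gold, system in zip(read_sentences(gold_file), read_sentences(system_file)):
        matches = [g == s for g, s in zip(gold, system)]
        total_edges += len(gold)
        correct_edges += sum(matches)
        total_sents += 1
        if all(matches):
            correct_sents += 1
    print("Total number of edges:", total_edges)
    print("Number of correct edges:", correct_edges)
    print("LAS: %f" % (correct_edges / float(total_edges)))
    print("Total number of sentences:", total_sents)
    print("Number of correct sentences:", correct_sents)
    print("LEM: %f" % (correct_sents / float(total_sents)))

if __name__ == "__main__":
    main()

Such a script would then be run as, for example:

% python depeval.py swedish_dep_dev.conll out.conll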

Reporting: Include the code for your evaluator in the lab report (or send it by email if the code is longer than a page). Also give the LAS and LEM scores for your baseline system from Task 1.

Task 3: Selecting a Good Parsing Algorithm

MaltParser supports several parsing algorithms; these are described in the Parsing Algorithm section of the User Guide. Your next task is to select the best algorithm for the data at hand, where the ‘best’ algorithm is the one that gives the highest score on LAS or LEM. (The two metrics may rank the systems slightly differently, so there is more than one way of deciding which system is ‘best’.) To make this choice, you need to train a separate parsing model for each algorithm, use it to parse the testing data, and evaluate the performance of the parser as in Task 2. You can restrict your search to the algorithms in the Nivre and Stack families. For the Nivre family, try at least some combinations of the additional arguments that can be used. For the projective algorithms (in both families), try at least one type of pseudo-projective parsing (use the argument -pp). You may also try other options if you wish.
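For example, switching the parsing algorithm and turning on pseudo-projective parsing is done with options roughly like the ones below (the available algorithm names and -pp values are listed in the User Guide; the jar and file names are placeholders):

% java -jar maltparser-1.8.jar -c nivre_eager -a nivreeager -i swedish_dep_train.conll -m learn
% java -jar maltparser-1.8.jar -c stack_proj_pp -a stackproj -pp head -i swedish_dep_train.conll -m learn

Each configuration is then used to parse the testing data and evaluated with your evaluator from Task 2.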

Reporting: Report the LAS and LEM scores for all algorithms that you tried, and write down which algorithm you picked in the end.

Task 4: Feature Engineering

MaltParser processes a sentence from left to right, and at each point takes one of a small set of possible transitions. (Details will be presented in the lectures.) In order to predict the next action, MaltParser uses feature models. Up to now, you have been using the baseline feature model for whatever algorithm you were experimenting with. However, one can often do much better than that.

Your next task is to improve the feature model by exploiting the fact that the training and testing data contain morphological features such as case, tense, and definiteness. These are specified in the FEATS column (column 6) of the CoNLL format. Here is an example:

7 hemmet _ NOUN NN NEU|SIN|DEF|NOM 6 PA _ _

This line specifies that the word hemmet has the grammatical gender neuter (NEU), the grammatical number singular (SIN), is marked as definite (DEF), and has the grammatical case nominative (NOM). This is useful information during parsing.

Read the Feature Model section of the user guide to find out how to extract the value of the FEATS column and split it into a set of atomic features using the delimiter | (pipe). Then, create a copy of the file that holds the feature model used by the algorithm that you selected in Task 3 and make the necessary modifications. Finally, train a new parsing model with the extended feature model, use it to parse the testing data, and evaluate its performance.

Note: If you are using an algorithm from the Nivre family, then you should extract the features for Input[0] and Stack[0]. If you are using an algorithm from the Stack family, then you should use Stack[0] and Stack[1].
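As an illustration, if you are working with an algorithm from the Nivre family, the lines you add to your copy of the feature specification might look roughly like the following (check the Feature Model section for the exact syntax expected by your MaltParser version):

<feature>Split(InputColumn(FEATS, Input[0]), |)</feature>
<feature>Split(InputColumn(FEATS, Stack[0]), |)</feature>

When training, point MaltParser to your modified specification file using the option for user-defined feature models (-F in recent releases).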

Reporting: Write down the lines that you added to the baseline feature model, and how this affected the LAS and LEM of the parser relative to the score that you got for Task 3. (Note that you need to retrain the parser with the new feature model in order to see changes.)

Task 5: Error Analysis

In addition to standard evaluation using metrics like LAS, it can be useful to do some more detailed error analysis of the system output. In this task you will explore two ways of doing error analysis: automatically and manually. You will apply both methods to the output of your chosen system from Task 3 and of your system from Task 4.

Your first subtask is to write a program that finds out which categories are most often confused. For each erroneous link, your program should store the predicted label and the gold-standard label, for instance "SS-SP". You only have to care about the label, not about the predicted head. Your program should count each type of confusion and output the fifteen most frequent confusions together with how many times they occur. Again, you may use any programming language; your program should be callable from the command line, taking the gold-standard file and the system output file as arguments.
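As a starting point, here is a minimal sketch of such a program in Python. It reads the DEPREL column (column 8) of both files, reuses the same file layout as the evaluator from Task 2, and takes a confusion to be any token whose predicted label differs from its gold-standard label; the file name confusions.py is just an example.

import sys
from collections import Counter

def labels(path):
    """Yield the DEPREL label (column 8) of every token line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line.split("\t")[7]

def main():
    gold_file, system_file = sys.argv[1], sys.argv[2]
    confusions = Counter()
    for gold_label, system_label in zip(labels(gold_file), labels(system_file)):
        if system_label != gold_label:
            # store as predicted-gold, e.g. "SS-SP"
            confusions[system_label + "-" + gold_label] += 1
    for confusion, count in confusions.most_common(15):
        print(confusion, count)

if __name__ == "__main__":
    main()

% python confusions.py swedish_dep_dev.conll out.conll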

Run your program on the output of your best system from Task 3 and of your system from Task 4, and compare the results. Briefly discuss the differences, or the lack of differences. To aid your discussion, you can use the list of labels.

For your second subtask, find a sentence that is assigned different trees by your best system from Task 3 and your system from Task 4. Write down the line number of the sentence you pick. Draw the trees for the sentence as parsed by the two systems (if the sentence is very long, you may choose to draw only a relevant part of each tree). Discuss the difference(s) and why you think they arise.

Reporting: Hand in your program, and include the results of the automatic analysis, the trees, and your discussion in the lab report. You may draw the trees by hand, in which case you can leave them in Sara's postbox (number 108).

Task 6: Gold-Standard Tags Versus Predicted Tags

You only need to work on this task in case you want to get the grade Pass With Distinction (VG).

In the training and testing data that you have been using up to now, the part-of-speech (POS) tags are gold-standard tags, in the sense that they were assigned manually. In this task you will be exploring what happens in the more realistic scenario where the tags are assigned automatically.

Your specific task is to produce alternative versions of the training and testing data where the gold-standard POS tags have been replaced with automatically-assigned tags. To obtain these, you can use Hunpos, a state-of-the-art part-of-speech tagger. Proceed as follows:

  • Download and install Hunpos on your computer.
  • Read the User Manual to learn how to use Hunpos.
  • Use the training data for the parser to produce training data for the tagger.
  • Train the tagger on this training data.
  • Use the trained tagger to tag the sentences in the parser data (training and testing).
  • Produce new parser data by replacing the gold-standard tags in the original data with the automatic tags.
  • Re-train and re-evaluate the parser using the new data.

To carry out these steps, you may be tempted to write some code that modifies CoNLL files. However, all of the necessary manipulations can also be done using standard Unix commands such as cut and paste.
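As a rough sketch of what this could look like (the file names are placeholders, and the exact invocation of the tagger is described in the Hunpos User Manual), one possible pipeline for the tagging steps is:

% cut -f2,5 swedish_dep_train.conll > tag_train.txt
% hunpos-train swedish.model < tag_train.txt
% cut -f2 swedish_dep_dev.conll | hunpos-tag swedish.model > dev_tagged.txt

The first command extracts the word form (column 2) and the POS tag (column 5) from the parser's training data, which should give the word-and-tag format that the tagger expects; the blank lines between sentences are preserved. Splicing the new tags back into the CoNLL files can then be done with cut and paste in a similar way, but note that paste leaves stray tab characters on the blank separator lines, which you will need to remove.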

Reporting: Report the LAS of the parser trained and tested on data with automatically assigned tags. If you are feeling extra ambitious, you can experiment with mixed scenarios where you train the parser on gold-standard tags but test on automatically assigned tags. Describe the conclusions that you draw from your results.

Submission and Grading

Submit your lab report and the code for your scripts by email to Sara. Make sure that your name appears in the report and in the code files.

The assignment will be graded mainly based on your written lab report.

The submission deadline for this assignment is January 16, 2015.

Problems

If you encounter any problems, please do not hesitate to contact me as soon as possible, either in person or via email.

Enjoy!

History

This assignment was developed for 5LN455 by Marco Kuhlmann, 2011-2012.