Assignment: Dependency Parsing

Note: master students only!

In this assignment, you are going to implement some key components of a transition-based dependency parser and analyze its behavior. Starter code can be found in /local/kurs/parsing/dep_starter_code.tar.gz.

Transition System

A transition system for dependency parsing consists of a set of configurations (or parser states) and a set of transitions (or parse actions) for moving between configurations. You are going to implement the arc-eager transition system (see section 3.4.1 in Kübler et al.), where configurations consist of a stack, a buffer and an arc set, all represented as lists. Tokens are represented by integers corresponding to sentence positions (with 0 for the special root node), and dependency arcs are represented as triples (h, d, l), where h is the head token, d the dependent token, and l the dependency label.

Given the sentence "the cat sits on the mat today", the following is a possible configuration:

stack = [0, 2]
buffer = [3, 4, 5, 6, 7]
arcs = [(2, 1, "det")]
This configuration has the root and "cat" on the stack, "sits", "on", "the", "mat" and "today" in the buffer, The arc set contains a single dependency arc connecting the head word "cat" to the dependent word "the" with the dependency relation "det". Make sure you understand how configurations are represented before you move on.

Let us call the first word in the stack top and the first word in the buffer next. The transitions of the arc-eager system can be described as follows:

SH = move next from the buffer to the stack
RE = remove top from the stack
(RA, label) = add (top, next, label) to the arc set; move next from the buffer to the stack
(LA, label) = add (next, top, label) to the arc set; remove top from the stack

Starter code: The file transition.py contains a dummy implementation of a transition-based parser that parses the sentence "the cat sits on the mat today" by initializing the parser configuration to:

stack = [0]
buffer = [1, 2, 3, 4, 5, 6, 7]
arcs = []

performing the following sequence of transitions:

SH
(LA, "det")
SH
(LA, "nsubj")
SH
SH
SH
(LA, "det")
(LA, "case")
(RA, "nmod")
RE
(RA, "nmod")

and printing the resulting dependency tree. You run the parser as follows:

python transition.py
The method transition(trans, stack, buffer, arcs) is only a stub, which only implements the SH transition, and your task is to implement the correct behavior for all four transitions. You can test your implementation by verifying that the parser produces the correct dependency tree for the sentence.

Oracle

Training a transition-based dependency parser means training a model for predicting the next transition given the current configuration. To do this, we first need to construct an oracle, a function that correctly predicts the next transition when given access to the correct dependency tree. You can find a definition of a static oracle for the arc-eager system in Goldberg and Nivre, CoLING 2012, Algorithm 1.

Starter code: The file oracle.py implements an oracle parser that reads a set of syntactically annotated sentences from a file and tries to derive the correct trees by following the predictions of an oracle() method. This method is a stub that always predicts SH, and your task is to implement a better oracle. In addition, you have to replace the transition() method by your own implementation from above. You can test your implementation on the sentence "she gave the cat some food" by running:
python3 oracle.py < example.tab
or
python3 oracle.py tab < example.tab
if you want to see the result in the same tab-based format as the input. The result in the tab-ased format should be the same as in the input file example.tab

Testing on a real treebank

When you are satisfied that your oracle works as it should, it is time to test it on some real treebank data. For this purpose, we will use the development set of the English UD treebank in a simplified format. The first sentence looks like this:

From	ADP	3	case
the	DET	3	det
AP	PROPN	4	nmod
comes	VERB	0	root
this	DET	6	det
story	NOUN	4	nsubj
:	PUNCT	4	punct

The four columns contain tokens, part-of-speech tags, syntactic heads and dependency labels, respectively. To run your oracle parser on the whole development set and make it print the derived trees in the same format, you can run:

python3 oracle.py tab < en-ud-dev.tab > en-ud-dev.out

Note the extra argument tab to change the output format from dependency tuples to the column-based format.

Ideally, the trees output by your oracle parser should be identical to the original trees from the treebank, because we want to use the oracle to guide a machine learning algorithm. However, if you compare the two sets of trees, you are likely to find some discrepancies. Your final task is to figure out what causes these discrepancies and discuss how this problem could be handled. No implementation is needed for this part.

For Distinction (VG)

For distinction, you have to do one of the following:

  • Implement a different transition system, for example, arc-standard.
  • Implement a projectivization algorithm for dependency trees.
  • Implement a dynamic oracle for the arc-eager transition system.

Report

Report by uploading a zip or gz document to studentportalen containing the code file oracle.py and a report discussing the issues found when applying your code on a real treebank, and any additional files that might be needed to report on a VG task.

Deadline

2018-01-12

Problems

Please do not hesitate to contact Sara as soon as possible either personally or via email in case you encounter any problems.

Enjoy!

History

This assignment was originally developed Joakim Nivre in 2016
Modified by Sara Stymne, 2017