Assignment 1: PCFG Parsing

In this assignment, you are going to implement a PCFG parser using the CKY algorithm and evaluate it using treebank data. Starter code is available: pcfg_starter_code.tar.gz.

Data

The data to be used for training and evaluation comes from the Penn Treebank and consists of a training set of about 40,000 sentences (sections 02-21) and a development set of 100 sentences (from section 00). The files are:
  • Training data
    • train.dat - original trees
  • Development data, used to evaluate your parser
    • dev.dat - original trees
    • dev.raw - only the sentences, used as input to the parser

The data can be found here: master_data.tar.gz.

1: Chomsky Normal Form

The CKY algorithm only accepts grammars in Chomsky Normal Form (CNF). Before we start extracting grammars from treebanks, we therefore need methods for converting trees into CNF. (Normally, one would also want to convert parser output in CNF back to the original tree format for evaluation purposes, but we will omit this step here and evaluate on the CNF trees instead.)

The trees in the treebank have been preprocessed to simplify the annotation (removing empty nodes and function tags) and each tree is represented as a list:

[ROOT, Child_1, ..., Child_n]
where ROOT is the root symbol and each Child_i is itself a tree represented as a list. Here is a simple example, with indentation to show the tree structure:

["S",
   ["NP",
      ["NNP", "Pierre"],
      ["NNP", "Vinken"]],
   ["VP",
      ["MD", "will"],
      ["ADVP",
         ["RB", "soon"]],
      ["VP",
         ["VB", "join"],
         ["NP",
            ["DT", "the"],
            ["NN", "board"]]]],
   [".", "."]]

Conversion: To convert a tree to CNF, you must (a) eliminate n-ary branching subtrees (for n > 2) by inserting additional nodes, and (b) eliminate unary branching by merging nodes. For our example tree, this gives the following transform, where the new nodes are those whose labels contain | or +:

["S",
   ["NP",
      ["NNP", "Pierre"],
      ["NNP", "Vinken"]],
   ["S|NP",
      ["VP",
         ["MD", "will"],
         ["VP|MD",
            ["ADVP+RB", "soon"],
            ["VP",
               ["VB", "join"],
               ["NP",
                  ["DT", "the"],
                  ["NN", "board"]]]]],
      [".", "."]]]

The notation A|B is used for a new node in a subtree with root A and left sibling B. (How much of the previously generated structure is encoded in the new label is often referred to as the markovization order. I suggest you keep it simple and only record the immediately preceding sibling, which corresponds to a first-order markovization, but you are free to experiment with this if you wish.) The notation A+B is used for a node that results from merging an A node and a B node in a unary chain. (Although it is possible to apply a markovization scheme here as well, I recommend that you encode the complete path.)
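
To make the two transformations concrete, here is a minimal sketch on the list-of-lists tree format shown above. The function name cnf_sketch and its structure are illustrative only; the starter code expects you to fill in to_cnf(), which may be organized quite differently.

def cnf_sketch(tree):
    # Preterminal leaf: [POS, word] is already in CNF.
    if len(tree) == 2 and isinstance(tree[1], str):
        return list(tree)

    label = tree[0]
    children = [cnf_sketch(child) for child in tree[1:]]

    # (b) Unary branching: merge the node with its single child,
    #     recording the complete path with '+'.
    if len(children) == 1:
        only = children[0]
        return [label + "+" + only[0]] + only[1:]

    # (a) n-ary branching (n > 2): repeatedly group the two rightmost
    #     children under a new A|B node, where A is the original parent
    #     and B is the label of the left sibling (first-order markovization).
    while len(children) > 2:
        new_label = label + "|" + children[-3][0]
        children[-2:] = [[new_label, children[-2], children[-1]]]
    return [label] + children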

Starter code: The file cnf.py contains a stub for the method to_cnf() together with additional code that can be used to read and write trees and check that the output is indeed in CNF. Your task is to complete the implementation of to_cnf(). (Note that this part of the assignment is slightly changed from previous years. You are expected to return the converted tree from to_cnf(), not to change it in-place.)

Evaluation: To test your implementation, run the main program in cnf.py as follows:

python3 cnf.py < infile > outfile

The program checks that to_cnf() converts all trees in infile to CNF and preserves all the words of the sentence. Converted trees are printed to outfile.

2: Grammar Extraction

Now that you can convert arbitrary treebank trees to CNF, the next step is to extract a PCFG in CNF. This is done by reading a treebank in CNF, one tree at a time, and counting the frequency of nonterminals, words, unary and binary rules in that tree. For this part of the assignment you will not write any code, only use existing code.

You run the program pcfg.py as follows:

python3 pcfg.py treebankfile grammarfile

The program extracts a PCFG from the trees in treebankfile (which must be in CNF) and stores the grammar in grammarfile in the following format:

  • ["Q1", sym, word, p] = unary rule sym --> word with probability p
  • ["Q2", sym, y1, y2, p] = binary rule sym --> y1 y2 with probability p
  • ["WORDS", wordlist] = list of all "well known words"

The treebank file should contain one tree per line, as in train.dat.
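
For concreteness, a grammar file might contain entries like the following, assuming one JSON-style list per line as for the trees (the rules and probabilities are made up for illustration and are not taken from the actual data):

["Q1", "DT", "the", 0.41]
["Q2", "S", "NP", "S|NP", 0.63]
["WORDS", ["the", "board", "join"]]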

3: CKY Parsing

Now that you have a PCFG in CNF, the next step is to implement a parser that can use the grammar to analyze new sentences using the CKY algorithm.

Starter code: The file parser.py implements a parser that can load a PCFG in the format created above and parse a set of raw (but tokenized) sentences, one per line, into CNF trees. Your task is to implement the method CKY for finding the most probable parse and the method backtrace for extracting the corresponding tree using backpointers. Here are a few things to note:

  • In this implementation of CKY, the sentence has to be represented as a list of pairs (norm, word), where word is the word occurring in the input sentence and norm is either the same word, if it is a known word according to the grammar, or the string _RARE_. Thus, norm should be used for grammar lookup but word should be used in the output tree.
  • Binary rules that expand a nonterminal sym are stored in binary_rules[sym].
  • Rule probabilities are stored in q1[sym, word] (for unary rules) and q2[sym, y1, y2] (for binary rules).
  • All nonterminal symbols are stored in N. (A simplified sketch of how these structures fit together is given after this list.)
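
As a rough orientation, the following is a heavily simplified sketch of the CKY recurrence and the backtrace over these structures. It assumes that binary_rules[sym] is a collection of (y1, y2) pairs and that q1 and q2 are dictionaries keyed as described above; verify this against the starter code, whose actual method signatures will differ. The table names pi and bp are hypothetical.

def cky_sketch(sentence, N, q1, q2, binary_rules):
    n = len(sentence)
    pi = {}   # pi[i, j, sym] = probability of the best parse of words i..j rooted in sym
    bp = {}   # bp[i, j, sym] = (split point k, y1, y2) that achieves pi[i, j, sym]

    # Base case: spans of length 1 are covered by unary (preterminal) rules,
    # looked up with the normalized token.
    for i, (norm, word) in enumerate(sentence):
        for sym in N:
            if (sym, norm) in q1:
                pi[i, i, sym] = q1[sym, norm]

    # Recursive case: longer spans are built from two adjacent shorter spans.
    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            for sym in N:
                best = 0.0
                for y1, y2 in binary_rules.get(sym, []):
                    for k in range(i, j):
                        p = (q2[sym, y1, y2]
                             * pi.get((i, k, y1), 0.0)
                             * pi.get((k + 1, j, y2), 0.0))
                        if p > best:
                            best = p
                            bp[i, j, sym] = (k, y1, y2)
                if best > 0.0:
                    pi[i, j, sym] = best
    return pi, bp

def backtrace_sketch(sentence, bp, i, j, sym):
    # Rebuild the tree for the best parse of span i..j rooted in sym.
    if i == j:
        norm, word = sentence[i]
        return [sym, word]          # output the original word, not _RARE_
    k, y1, y2 = bp[i, j, sym]
    return [sym,
            backtrace_sketch(sentence, bp, i, k, y1),
            backtrace_sketch(sentence, bp, k + 1, j, y2)]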

Evaluation: To test your implementation, run the main program of parser.py as follows:

python3 parser.py grammarfile < infile > outfile

The grammar file should be the grammar extracted from your converted training data in the previous parts. The infile should be the raw sentences in the dev set.

The program parses the sentences in infile and prints the parse trees to outfile. The infile should contain one sentence per line with space-separated tokens, as in the file dev.raw. Also make sure that the file tokenizer.py is available in the same directory. The parser is likely to be very slow, so do not be surprised if it takes half an hour to parse the dev set. It is advisable to use only a few short sentences during development, and possibly a smaller grammar, to speed up testing. In the end, though, you should parse the full dev set.

4: Treebank Evaluation

After parsing the sentences in the dev set, the final step is to evaluate the accuracy of the parser. No implementation is needed for this. Just run the evaluator as follows:

python3 eval.py goldfile outfile

The goldfile should contain the dev sentences with binarized trees (produced in part 1), and outfile is the output file from your parser. The parser is not likely to be very accurate, so do not be surprised if your results are quite far from the state of the art. You should expect the F1-score of the parser to be somewhere around 0.65-0.75. If it is much lower, there is most likely an error somewhere.
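
If you want to sanity check your numbers, the standard labeled bracketing F1 is the harmonic mean of bracket precision and recall. The exact scoring details of eval.py may differ, but the general definition looks like this (brackets here are (label, start, end) triples; the function name is illustrative):

from collections import Counter

def f1_sketch(gold_brackets, predicted_brackets):
    # Count brackets that occur in both trees (duplicates handled via counts).
    overlap = Counter(gold_brackets) & Counter(predicted_brackets)
    correct = sum(overlap.values())
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_brackets)
    recall = correct / len(gold_brackets)
    return 2 * precision * recall / (precision + recall)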

For Distinction (VG)

For distinction, you have to improve the implementation in one of the following ways and write a report about your additional modifications and results:

  • Investigate how vertical and horizontal markovization affect parsing and discuss the implications of your results.
  • Improve parsing accuracy through a better treatment of rare words. Discuss and motivate your chosen treatment.
  • The parser is very slow. Investigate ways to improve its speed. Motivate and evaluate your different ideas for speed improvements. To pass this task you should either improve the speed substantially and have reasonable explanations of why, or attempt to improve the speed without succeeding but provide an insightful discussion of the reasons for this and of why you tried the methods you did. The parser could trivially be made faster by using threading and parsing sentences simultaneously in different threads; implementing only threading does not qualify for a VG on this task.
  • Extend the parser to also handle unit productions (unary grammar rules). This is not part of CNF, but this functionality is often added to CKY (one common approach is sketched after this list). In order to demonstrate that it works, you also need to create a (very small) PCFG treebank that uses unit productions (it may be based on the toy grammars and treebanks from the lecture exercises). If your implementation cannot handle chains of unit productions, discuss how this could be added to it.
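
For the unit-production option, one common approach is to apply unary rules on top of each filled CKY cell, iterating until no entry improves, which also covers chains of unit productions. The sketch below uses hypothetical names (unary_rules, q_u, pi, bp) that are not part of the starter code:

def apply_unaries(i, j, pi, bp, unary_rules, q_u):
    # unary_rules: iterable of (sym, child) pairs for rules sym -> child,
    # with probability q_u[sym, child]; pi and bp are the CKY tables.
    changed = True
    while changed:                      # repeat to cover chains A -> B -> C
        changed = False
        for sym, child in unary_rules:
            if (i, j, child) in pi:
                p = q_u[sym, child] * pi[i, j, child]
                if p > pi.get((i, j, sym), 0.0):
                    pi[i, j, sym] = p
                    bp[i, j, sym] = ("unary", child)
                    changed = True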

It is also possible to earn a VG by exploring some other issue. If you want to do so, you must contact Paloma beforehand to get your specific extension approved. If you pursue an individual idea without approval, no VG grade will be granted.

Note that doing a VG task is no guarantee of earning a VG. The full assignment must also be completed with an overall high quality.

Report

Report by uploading a zip or gz archive to studentportalen containing a file with your name, the F1-score and parsing time, the code (parser.py and cnf.py), and any files needed for reporting the VG assignment if you choose to do one.

Note that you should write the code yourself.

Deadline

2024-02-09

Problems

Please do not hesitate to contact Sara in case you encounter any problems.

Enjoy!

History

This assignment was originally developed by Joakim Nivre in 2016.
Modified by Sara Stymne, 2017, 2022.