Assignment 1: PCFG Parsing
In this assignment, you are going to implement a PCFG parser using the CKY
algorithm and evaluate it using treebank data. Starter code can be found in /local/kurs/parsing/pcfg_starter_code.tar.gz.
Data
The data to be used for training and evaluation comes from the Penn Treebank and consists of a training set of about 40,000 sentences (sections 02-21) and a development set of 100 sentences (from section 00). The files are:
- Training data
  - train.dat - original trees
- Development data, used to evaluate your parser
  - dev.dat - original trees
  - dev.raw - only the sentences, used as input to the parser
The data can be found in /local/kurs/parsing/master_data.tar.gz.
1: Chomsky Normal Form
The CKY algorithm only accepts grammars in Chomsky Normal Form (CNF). Before we start extracting grammars from treebanks, we therefore need methods for converting trees into CNF. (Normally, one would also want to convert parser output in CNF back to the original tree format for evaluation purposes, but we will omit this step here and evaluate on the CNF trees instead.)
The trees in the treebank have been preprocessed to simplify the annotation (removing empty nodes and function tags) and each tree is represented as a list:
[ROOT, Child1, ..., Childn]
where ROOT is the root symbol and each Childi is itself a tree represented as a list. Here is a simple example, with indentation to show the tree structure:
["S",
  ["NP",
    ["NNP", "Pierre"],
    ["NNP", "Vinken"]],
  ["VP",
    ["MD", "will"],
    ["ADVP", ["RB", "soon"]],
    ["VP",
      ["VB", "join"],
      ["NP",
        ["DT", "the"],
        ["NN", "board"]]]],
  [".", "."]]
Conversion: To convert a tree to CNF, you must (a) eliminate n-ary branching subtrees (for n > 2) by inserting additional nodes, and (b) eliminate unary branching by merging nodes. For our example tree, this gives the following transform, where the new nodes are S|NP, VP|MD, and ADVP+RB:
["S", ["NP", ["NNP", "Pierre"], ["NNP", "Vinken"]], ["S|NP", ["VP", ["MD", "will"], ["VP|MD", ["ADVP+RB", "soon"], ["VP", ["VB", "join"], ["NP", ["DT", "the"], ["NN", "board"]]]]], [".", "."]]]
The notation A|B is used for a new node in a subtree with root A and left sibling B. (How much of the previously generated subtree is encoded in the new label is often referred to as the markovization order. I suggest you keep it simple and only record the immediately preceding sibling, which corresponds to first-order markovization, but you are free to experiment with this if you wish.)
The notation A+B is used for a node that results from merging an A node and a B node in a unary chain. (Although it is possible to apply a markovization scheme here as well, I recommend that you encode the complete path.)
Starter code: The file cnf.py contains a stub for the method to_cnf() together with additional code that can be used to read and write trees and check that the output is indeed in CNF. Your task is to complete the implementation of to_cnf(). (Note that this part of the assignment is slightly changed from previous years. You are expected to return the converted tree from to_cnf(), not to change it in place.)
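If you are unsure how to get started, the sketch below shows one possible way to structure the conversion, following the conventions described above (A|B nodes with first-order markovization and A+B nodes for collapsed unary chains). It is only an illustration under those assumptions, not the required solution, and the helper binarize is our own; your implementation will have to fit the actual stub in cnf.py:

def to_cnf(tree):
    """Return a CNF version of tree; the input tree is left unchanged."""
    label, children = tree[0], tree[1:]

    # Preterminal [tag, word]: already in CNF.
    if len(children) == 1 and isinstance(children[0], str):
        return [label, children[0]]

    # Unary branching: merge parent and child into an A+B node.
    if len(children) == 1:
        child = to_cnf(children[0])
        return [label + "+" + child[0]] + child[1:]

    # Two or more children: convert them first, then binarize.
    return binarize(label, label, [to_cnf(child) for child in children])

def binarize(label, parent, children):
    """Binarize a list of already converted children under label."""
    if len(children) == 2:
        return [label, children[0], children[1]]
    first, rest = children[0], children[1:]
    # New A|B node: the original parent label plus the label of the
    # immediately preceding sibling (first-order markovization).
    return [label, first, binarize(parent + "|" + first[0], parent, rest)]

Applied to the example sentence above, this sketch reproduces the converted tree shown earlier.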
Evaluation: To test your implementation, run the main program in cnf.py as follows:
python3 cnf.py < infile > outfile
The program checks that to_cnf() converts all trees in infile to CNF and preserves all the words of the sentence. Converted trees are printed to outfile.
2: Grammar Extraction
Now that you can convert arbitrary treebank trees to CNF, the next step is to extract a PCFG in CNF. This is done by reading a treebank in CNF, one tree at a time, and counting the frequency of nonterminals, words, and unary and binary rules in that tree. For this part of the assignment you will not write any code, only use existing code. You run the program pcfg.py as follows:
python3 pcfg.py treebankfile grammarfile
The program extracts a PCFG from the trees in treebankfile (which must be in CNF) and stores the grammar in grammarfile in the following format:
["Q1", sym, word, p] = unary rule sym --> word with probability p
["Q2", sym, y1, y2, p] = binary rule sym --> y1 y2 with probability p
["WORDS", wordlist] = list of all "well known words"
The treebank file should contain one tree per line, as in train.dat.
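The grammar file thus contains one JSON-style list per line. Purely as an illustration of the format (the parser in the next part already has its own loader), reading it back could look roughly like the sketch below, assuming each line parses with json.loads:

import json

def load_grammar(path):
    """Read a grammar in the Q1/Q2/WORDS format described above."""
    q1, q2, binary_rules, words = {}, {}, {}, set()
    with open(path) as grammar_file:
        for line in grammar_file:
            entry = json.loads(line)
            if entry[0] == "Q1":          # ["Q1", sym, word, p]
                _, sym, word, p = entry
                q1[sym, word] = p
            elif entry[0] == "Q2":        # ["Q2", sym, y1, y2, p]
                _, sym, y1, y2, p = entry
                q2[sym, y1, y2] = p
                binary_rules.setdefault(sym, []).append((y1, y2))
            elif entry[0] == "WORDS":     # ["WORDS", wordlist]
                words = set(entry[1])
    return q1, q2, binary_rules, words

The names q1, q2 and binary_rules are chosen here to match the data structures described in part 3 below.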
3: CKY Parsing
Now that you have a PCFG in CNF, the next step is to implement a parser that can use the grammar to analyze new sentences using the CKY algorithm.
Starter code: The file parser.py implements a parser that can load a PCFG in the format created above and parse a set of raw (but tokenized) sentences, one per line, into CNF trees. Your task is to implement the method CKY for finding the most probable parse and the method backtrace for extracting the corresponding tree using backpointers (a rough sketch of the recurrence is given after the list below). Here are a couple of things to note:
- In this implementation of CKY, the sentence has to be represented as a list of pairs (norm, word), where word is the word occurring in the input sentence and norm is either the same word, if it is a known word according to the grammar, or the string _RARE_. Thus, norm should be used for grammar lookup but word should be used in the output tree.
- Binary rules that expand a nonterminal sym are stored in binary_rules[sym].
- Rule probabilities are stored in q1[sym, word] (for unary rules) and q2[sym, y1, y2] (for binary rules).
- All non-terminal symbols are stored in N.
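The sketch below outlines the standard probabilistic CKY recurrence and backtrace in terms of the structures above. It is only a rough illustration, not the exact method signatures expected by parser.py, and details such as chart indexing and where the _RARE_ normalization happens may differ in the starter code:

def cky(sentence, N, q1, q2, binary_rules):
    """Probabilistic CKY over a sentence given as (norm, word) pairs.

    Returns the chart pi and backpointers bp, both indexed by
    (i, j, sym) with inclusive word positions i and j.
    """
    n = len(sentence)
    pi, bp = {}, {}

    # Base case: spans of length 1, scored with the unary (lexical) rules.
    for i, (norm, word) in enumerate(sentence):
        for sym in N:
            if (sym, norm) in q1:
                pi[i, i, sym] = q1[sym, norm]
                bp[i, i, sym] = word      # keep the original word for output

    # Recursive case: longer spans built from two adjacent smaller spans.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for sym in N:
                best = 0.0
                for y1, y2 in binary_rules.get(sym, []):
                    for k in range(i, j):             # split point
                        left = pi.get((i, k, y1), 0.0)
                        right = pi.get((k + 1, j, y2), 0.0)
                        p = q2[sym, y1, y2] * left * right
                        if p > best:
                            best = p
                            pi[i, j, sym] = p
                            bp[i, j, sym] = (y1, y2, k)
    return pi, bp

def backtrace(bp, i, j, sym):
    """Rebuild the most probable tree for the span (i, j) rooted in sym."""
    if i == j:
        return [sym, bp[i, j, sym]]       # preterminal: [tag, word]
    y1, y2, k = bp[i, j, sym]
    return [sym, backtrace(bp, i, k, y1), backtrace(bp, k + 1, j, y2)]

The most probable parse of the whole sentence is then the tree rooted in the symbol that maximizes pi[0, n - 1, sym] (typically the grammar's start symbol).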
Evaluation: To test your implementation, run the main program of parser.py as follows:
python3 parser.py grammarfile < infile > outfile
The grammar file should be the grammar extracted from your converted training data in the previous parts. The infile should be the raw sentences in the dev set. The program parses the sentences in infile and prints the parse trees to outfile. The infile should contain one sentence per line with space-separated tokens, as in the file dev.raw. Also make sure that the file tokenizer.py is available in the same directory.
The parser is likely to be very slow, so do not be surprised if it takes half an hour to parse the dev set. It is advisable to use only a few short sentences during development, and possibly a smaller grammar, to speed up testing. In the end you should parse the full dev set, though.
4: Treebank Evaluation
After parsing the sentences in the dev set, the final step is to evaluate the accuracy of the parser. No implementation is needed for this. Just run the evaluator as follows:
python3 eval.py goldfile outfile
The goldfile should contain the dev sentences with binarized trees (produced in part 1), and the outfile is the output file from your parser. The parser is not likely to be very accurate, so do not be surprised if your results are quite far from the state of the art. You should expect the F1-score of the parser to be somewhere around 0.65-0.75. If it is much lower, there is most likely an error somewhere.
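For reference, parser F1-scores of this kind are normally labeled bracket F1, the standard treebank metric: precision and recall over labeled constituent spans, compared between gold and parsed trees. eval.py does the actual computation for you; the sketch below only illustrates the idea for a single sentence (corpus-level scores normally aggregate the counts over all sentences rather than averaging per-sentence F1), so it is not meant to match eval.py exactly:

def spans(tree, start=0):
    """Return the labeled spans (label, start, end) of a tree and its length."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return [], 1                      # preterminals are not scored here
    result, length = [], 0
    for child in children:
        child_spans, child_length = spans(child, start + length)
        result.extend(child_spans)
        length += child_length
    result.append((label, start, start + length))
    return result, length

def f1(gold_tree, parsed_tree):
    """Labeled bracket F1 between two trees over the same sentence."""
    gold = set(spans(gold_tree)[0])
    parsed = set(spans(parsed_tree)[0])
    correct = len(gold & parsed)
    if correct == 0:
        return 0.0
    precision = correct / len(parsed)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)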
For Distinction (VG)
For distinction, you have to improve the implementation in one of the following ways and write a report about your additional modifications and results:
- Investigate how vertical and horizontal markovization affects parsing and discuss the implications of your results.
- Improve parsing accuracy through a better treatment of rare words. Discuss and motivate your chosen treatment.
- The parser is very slow. Investigate ways to improve its speed. Motivate and evaluate your different ideas for speed improvements. To pass this assignment you should either improve the speed substantially and have reasonable explanations of why, or attempt to improve the speed without succeeding, but provide an insightful discussion about the reasons for this and about why you tried the methods you did. The parser could trivially be made faster by using threading and parsing sentences simultaneously in different threads. Implementing only threading does not qualify for a VG on this task.
- Extend the parser to also handle unit productions (unary grammar rules). This is not part of CNF, but this functionality is often added to CKY. In order to demonstrate that it works, you also need to create a (very small) PCFG treebank that uses unit productions (it may be based on the toy grammars and treebanks from the lecture exercises). If your implementation cannot handle chains of unit productions, discuss how this could be added.
It is also possible to earn a VG by exploring some other issue. If you want to do so, it is obligatory to contact Paloma beforehand, in order to get your specific extension approved. If you pursue an individual idea without approval, no VG grade will be granted.
Note that doing a VG task is no guarantee of earning a VG. The full assignment must also be completed with an overall high quality.
Report
Report by uploading a zip or gz archive to Studentportalen containing a file with your name, the F1-score and parsing time, the code (parser.py and cnf.py), and any files needed for reporting the VG assignment if you choose to do one.
Note that you should write the code yourself.
Deadline
2023-02-13
Problems
Please do not hesitate to contact Sara in case you encounter any problems.
Enjoy!
History
This assignment was originally developed by Joakim Nivre in 2016 and modified by Sara Stymne in 2017 and 2022.