Syntactic Analysis (5LN455): Assignment 2
Note: bachelor students only!
In this assignment, you will build a PCFG-parser based on the CKY algorithm (J&M Section 13.4.1, Section 14.2). This corresponds to Exercise 14.1 in J&M.
Code and Data
The code and the data for this assignment can be downloaded from here. The code consists of 4 python files. You only have to modify the file parser.py
. There are also 5 data and grammar files. You can download all files zipped from assign2.zip
Getting Started
The CKY algorithm only accepts grammars in Chomsky Normal Form (CNF). Before we start extracting grammars from treebanks, we therefore need methods for converting trees into CNF. (Normally, one would also want to convert parser output in CNF back to the original tree format for evaluation purposes, but we will omit this step here and evaluate on the CNF trees instead.)
Treebank data: The data to be used for training and evaluation comes from the Penn Treebank and consists of a training set of about 40,000 sentences(section 02-21) and a development set of 100 sentences (from section 00). The trees have been preprocessed to simplify the annotation (removing empty nodes and function tags) and each tree is represented as a list:
[ROOT, Child1, ..., Childn]where
ROOT
is the root symbol and each Childi
is itself a tree represented as a list. The trees have already been converted to CNF for you. Here is an example with indentation
to show the tree structure::
["S", ["NP", ["NNP", "Pierre"], ["NNP", "Vinken"]], ["S|NP", ["VP", ["MD", "will"], ["VP|MD", ["ADVP+RB", "soon"]] ["VP", ["VB", "join"], ["NP", ["DT", "the"], ["NN", "board"]]]]], [".", "."]]]
The notation A|B
is used for a new node in a subtree
with root A
and left sibling B
.
The notation A+B
is used for a node that results from
merging an A
node and a B
node in a unary
chain.
Based on the training data in CNF, a PCFG has been extracted for you. This was done by reading the treebank one tree at a time, and counting the frequency of nonterminals, words, unary and binary rules in that tree.
The format of the grammar is:
["Q1", sym, word, p] = unary rule sym --> word with probability p
["Q2", sym, y1, y2, p] = binary rule sym --> y1 y2 with probability p
["WORDS", wordlist] = list of all "well known words"
Note that words that occur 5 times or more are stored as "well known words" in
the grammar. Words with lower frequency are mapped to the symbol
_RARE_
as a simple form of smoothing.
- Training data (not needed to solve the assignment)
train.dat
- original treestrain.dat.cnf
- trees converted to CNF
- Development data, used to evaluate your parser
dev.dat
- original treesdev.dat.cnf
- trees converted to CNF, used for evaluationdev.raw
- only the sentences, used as input to the parser
-
grammar.pcfg
- PCFG trained on the training data in CNF
CKY Parsing Assignment
Your assignment is to implement a parser that can use the grammar to analyze new sentences using the CKY algorithm.
Starter code: The file parser.py
implements a parser that can
load a PCFG in the format described above and parse a set of raw (but tokenized) sentences, one per line, into CNF trees. Your task is to implement the method CKY()
for finding the most probable parse and the method backtrace
for extracting the corresponding tree using backpointers. Make sure that the trees that you create have the same format as in the provided data files. Here are a couple of things to note:
- In this implementation of CKY, the sentence is represented as a list of pairs
(norm, word)
, whereword
is the word occurring in the input sentence andnorm
is either the same word, if it is a known word according to the grammar, or the string_RARE_
. Thus,norm
should be used for grammar lookup butword
should be used in the output tree. - You will need to use the following variables from the
PCFG
class:- Binary rules that expand a nonterminal
X
are stored inbinary_rules[X]
. - Rule probabilities are stored in
q1[sym, word]
(for unary rules) andq2[sym, y1, y2]
(for binary rules). - All non-terminal symbols are stored in
N
- Binary rules that expand a nonterminal
Evaluation: To test your implementation, run the main program of parser.py
as follows:
python3 parser.py grammar.pcfg < dev.raw > YOUR_OUTFILE
The program parses the sentences in infile
and prints the parse trees
to YOUR_OUTFILE
. The infile should contain one sentence per line with space-separated tokens, as in the file dev.raw
. Also make sure that the file tokenizer.py
is available in the same directory.
The parser is likely to be very slow, so do not be surprised if it takes half an hour to parse the dev set. In order to test your parser in shorter time it is highly recommended that you create a small test file with a few simple sentences, and possibly also a small pcfg file. In order to do this, look at the format of the existing files. (Note that you cannot train a new pcfg on some other training data, since that code is removed, because it is part of the master student assignment.)
Treebank Evaluation
After parsing the sentences in the dev set, the final step is to evaluate the accuracy of the parser. Just run the evaluator as follows:
python3 eval.py dev.dat.cnf YOUR_OUTFILE
You should expect the F1-score of the parser to be somewhere around .65-.75. If it is much lower there is most probably an error somewhere.
Submission and Grading
In order to pass this assignment (Godkänt), you will need to submit an implementation of a working parser, and report its performance in terms of its overall F1-Score and runtime in seconds, as printed by the software provided.
In order to pass this assignment with distinction (Väl godkänt), I expect you to show ability to analyse your work, synthesize new ideas, and/or critically assess your implementation. Here are some suggestions for extensions to the assignment that will let you demonstrate these abilities:
Extend the parser to also handle unit productions (unary grammar rules). This is not part of CNF, but this functionality is often added to CKY. In order to demonstrate that it works, you also need to create a (very) small pcfg and treebank that uses unit productions. In the pcfg
Q3
can be used for unary grammar rules, e.g.["Q3", "NP", "PRON", 0.0082]
You should also discuss the pros and cons of allowing unit productions in CKY as opposed to converting them to CNF. If your implementation cannot handle chains of unit productions, discuss how this can be added to it.Implement the conversion to CNF and the missing code in the pcfg extraction (or contact Sara to receive the code for PCFG extraction, and focus more on the conversion, markovization and discussion). Compare a few (two is enough) different schemes for markovization. See the master assignment for information on markovization and how to integrate this with the existing code.
Suggest a way in which your working parser can be improved with respect to accuracy. Implement this suggestion, or at least come up with a concrete plan for doing so.
The parser is very slow. Investigate ways to improve its speed. Motivate and evaluate your different ideas for speed improvements. To pass this assignment you should either improve the speed substantially and have reasonable explanations of why, or attempt to improve the speed without succeeding, but provide an insightful discussion about the reasons for this.
Compare your parser with an existing system from the parsing literature in the form of a one-page essay.
In case you choose to go for any of these extensions, or if you have an idea of your own, please let me know beforehand.
Report
Report by uploading a zip or gz document to studentportalen containing a file with your name, the f1-score and time, the code (parser.py
), and possibly any files needed for reporting the VG-assignment if you choose to do one.
Deadline
2016-12-06
Problems
Please do not hesitate to contact me as soon as possible either personally or via email in case you encounter any problems.
Enjoy!
History
This assignment was originally developed Joakim Nivre in 2016Modified by Sara Stymne, 2016