Syntactic Analysis (5LN455): Assignment 2

Note: bachelor students only!

In this assignment, you will build a PCFG-parser based on the CKY algorithm (J&M Section 13.4.1, Section 14.2). This corresponds to Exercise 14.1 in J&M.

Code and Data

The code can be found in /local/kurs/parsing/pcfg_starter_code.tar.gz. The code consists of 4 python files. You only have to modify the file parser.py. There are also 5 data and grammar files that can be found in /local/kurs/parsing/bachelor_data.tar.gz.

Getting Started

The CKY algorithm only accepts grammars in Chomsky Normal Form (CNF). Before we start extracting grammars from treebanks, we therefore need methods for converting trees into CNF. (Normally, one would also want to convert parser output in CNF back to the original tree format for evaluation purposes, but we will omit this step here and evaluate on the CNF trees instead.) In the bachelor assignment the conversion and grammar extraction is already performed for you.

Treebank data: The data to be used for training and evaluation comes from the Penn Treebank and consists of a training set of about 40,000 sentences(section 02-21) and a development set of 100 sentences (from section 00). The trees have been preprocessed to simplify the annotation (removing empty nodes and function tags) and each tree is represented as a list:

[ROOT, Child₁, ..., Child_n]

where ROOT is the root symbol and each Child_i is itself a tree represented as a list. The trees have already been converted to CNF for you. Here is an example with indentation to show the tree structure::

["S",
   ["NP",
      ["NNP", "Pierre"],
      ["NNP", "Vinken"]],
   ["S|NP",
      ["VP",
         ["MD", "will"],
         ["VP|MD",
            ["ADVP+RB", "soon"]]
            ["VP",
               ["VB", "join"],
               ["NP",
                  ["DT", "the"],
                  ["NN", "board"]]]]],
      [".", "."]]]

The notation A|B is used for a new node in a subtree with root A and left sibling B. The notation A+B is used for a node that results from merging an A node and a B node in a unary chain.

Based on the training data in CNF, a PCFG has been extracted for you. This was done by reading the treebank one tree at a time, and counting the frequency of nonterminals, words, unary and binary rules in that tree.

The format of the grammar is:

["Q1", sym, word, p] = unary rule sym --> word with probability p
["Q2", sym, y1, y2, p] = binary rule sym --> y1 y2 with probability p
["WORDS", wordlist] = list of all "well known words"

Note that words that occur 5 times or more are stored as "well known words" in the grammar. Words with lower frequency are mapped to the symbol _RARE_ as a simple form of smoothing.

The data and pcfg files you need for the assignment are:

Training data (not needed to solve the assignment)
- train.dat - original trees
- train.dat.cnf - trees converted to CNF
Development data, used to evaluate your parser
- dev.dat - original trees
- dev.dat.cnf - trees converted to CNF, used for evaluation
- dev.raw - only the sentences, used as input to the parser
grammar.pcfg - PCFG trained on the training data in CNF

CKY Parsing Assignment

Your assignment is to implement a parser that can use the grammar to analyze new sentences using the CKY algorithm.

Starter code: The file parser.py implements a parser that can load a PCFG in the format described above and parse a set of raw (but tokenized) sentences, one per line, into CNF trees. Your task is to implement the method CKY for finding the most probable parse and the method backtrace for extracting the corresponding tree using backpointers. Make sure that the trees that you create have the same format as in the provided data files. Here are a couple of things to note:

In this implementation of CKY, the sentence is represented as a list of pairs (norm, word), where word is the word occurring in the input sentence and norm is either the same word, if it is a known word according to the grammar, or the string _RARE_. Thus, norm should be used for grammar lookup but word should be used in the output tree.
You will need to use the following variables from the PCFG class:
- Binary rules that expand a nonterminal X are stored in binary_rules[X].
- Rule probabilities are stored in q1[sym, word] (for unary rules) and q2[sym, y1, y2] (for binary rules).
- All non-terminal symbols are stored in N

Evaluation: To test your implementation, run the main program of parser.py as follows:

python3 parser.py grammar.pcfg < INFILE > YOUR_OUTFILE

The program parses the sentences in INFILE and prints the parse trees to YOUR_OUTFILE. The infile should contain one sentence per line with space-separated tokens, as in the file dev.raw. Also make sure that the file tokenizer.py is available in the same directory.

The parser is likely to be very slow, so do not be surprised if it takes half an hour to parse the dev set. In order to test your parser in shorter time it is highly recommended that you create a small test file with a few simple sentences, and possibly also a small pcfg file. In order to do this, look at the format of the existing files.

Treebank Evaluation

After parsing the sentences in the dev set, the final step is to evaluate the accuracy of the parser. Just run the evaluator as follows:

python3 eval.py dev.dat.cnf YOUR_OUTFILE

You should expect the F1-score of the parser to be somewhere around .65-.75. If it is much lower there is most probably an error somewhere.

Submission and Grading

In order to pass this assignment (Godkänt), you will need to submit an implementation of a working parser, and report its performance in terms of its overall F1-Score and runtime in seconds, as printed by the software provided.

In order to pass this assignment with distinction (Väl godkänt), I expect you to show ability to analyse your work, synthesize new ideas, and/or critically assess your implementation. Below are some suggestions for extensions to the assignment that will let you demonstrate these abilities:

Extend the parser to also handle unit productions (unary grammar rules). This is not part of CNF, but this functionality is often added to CKY. In order to demonstrate that it works, you also need to create a (very small) pcfg treebank that uses unit productions. In the pcfg Q3 can be used for unary grammar rules, e.g. ["Q3", "NP", "PRON", 0.0082] You should also discuss the pros and cons of allowing unit productions in CKY as opposed to converting them to CNF. Discuss your implementation and results, and if your implementation cannot handle chains of unit productions, discuss how this can be added to it.
Implement the conversion to CNF. Note that you also have to extract a new grammar on your converted treebank. Compare a few (two is enough) different schemes for markovization. See the master assignment for information on markovization and how to integrate this with the existing code and extract a new grammar. Discuss your findings in a short report.
Suggest a way in which your working parser can be improved with respect to accuracy. Implement this suggestion, and discuss the result, or come up with a concrete plan for doing so and write a report describing and discussing your proposal.
The parser is very slow. Investigate ways to improve its speed. Motivate and evaluate your different ideas for speed improvements. To pass this assignment you should either improve the speed substantially and have reasonable explanations of why, or attempt to improve the speed without succeeding, but provide an insightful discussion about the reasons for this.
Compare your parser with an existing system from the parsing literature in the form of a one-page essay.

It is also possible to earn a VG by exploring some other issue. If you want to do so, it is obligatory to contact Sara beforehand, in order to get the extension approved. If you pursue an individual idea without approval, no VG grade will be granted.

Also note that doing a VG task is no guarantee for earning a VG. The full assignment must also be performed with an overall high quality.

Report

Report by uploading a zip or gz document to studentportalen containing a file with your name, the f1-score and time, the code (parser.py), and possibly any files needed for reporting the VG-assignment if you choose to do one.

Deadline

2017-12-04

Problems

Please do not hesitate to contact me as soon as possible either personally or via email in case you encounter any problems.

Enjoy!

History

This assignment was originally developed Joakim Nivre in 2016
Modified by Sara Stymne, 2016/2017