Syntactic Analysis (5LN455): Assignment 2
In this assignment, you will build a PCFG-parser based on the CKY algorithm (J&M Section 13.4.1, Section 14.2), or actually the extended version of the algorithm that can handle unit productions. This corresponds to Exercises 14.1 and 14.4 in J&M.
Code and Data
The code and the data for this assignment can be downloaded from here. The code consists of a couple of Java files (up.tar.gz
); the documentation for these files is here. The data consists of two text files called grammar.txt
and validate.mrg
, containing a grammar and validation data, respectively. The file up.jar
contains (the compiled version of) a reference implementation of the parser provided by me.
Getting Started
As a first step, make sure that you can download and access all files, that you can run the reference parser, and that you can compile the code.
Running the Reference Parser
To run the reference parser, execute the following command, assuming that the file up.jar
and the data files grammar.txt
and validate.mrg
are in the current directory:
% java -Xmx1G -jar up.jar parse --grammar grammar.txt --infile validate.mrg
This will read a grammar from grammar.txt
, construct a parser from it, and use that parser to process the sentences in the file validate.mrg
. You should see some output, reporting the number of sentences parsed, the intermediate performance (expressed as an F1 score, see J&M Section 14.7), and the average number of seconds per parsed sentence.
Read a grammar with 2402 categories, 8142 words, and 14046 rules.
#sentences: 10, F1: 0.6651884700665188, seconds/sentence (avg.): 0
#sentences: 20, F1: 0.63215859030837, seconds/sentence (avg.): 0
#sentences: 30, F1: 0.6287128712871286, seconds/sentence (avg.): 0
#sentences: 40, F1: 0.6434210526315789, seconds/sentence (avg.): 0
#sentences: 50, F1: 0.6221532091097308, seconds/sentence (avg.): 0
#sentences: 60, F1: 0.6092544987146531, seconds/sentence (avg.): 0
#sentences: 70, F1: 0.614619211634137, seconds/sentence (avg.): 0
FINAL RESULTS:
#sentences: 78, F1: 0.6173272535952298, seconds/sentence (avg.): 0
Total time: 18 seconds.
Compiling
You will be working on the file MyParser.java
. To compile this file (and the other source files in up.tar.gz
) you can either use the Ant system (Read here for info on how to run it on our system. Click here for a tutorial), a Java IDE of your choice (such as Eclipse or NetBeans), or the command line. If you want to use Ant, you should run the ant
command from within the directory that contains the build.xml
file. This will create a JAR-file in dist/up.jar
, which you can use in the same way as the JAR-file that you can download from this web page.
To test whether you have everything set up, you can edit the code of the method getBestParse
in MyParser.java
to emit a Hello World message, compile the code, and run the parser. This should print out the Hello World message 78 times, and should report a score of approximately 5%. (The way the system is set up when you download it, it assigns some dummy parse tree to the input sentence. This is why you get 5% without even writing a single line of code.)
Project Description
The basic work-flow behind the call above is this: The file validate.mrg
actually contains phrase structure trees in the Penn Treebank format (see J&M Section 12.4.1); these trees are read one by one. Once a tree has been read, it is binarized, and its yield (a sequence of word tokens) is fed to the parser object implemented by the class MyParser
. The parse tree returned by MyParser
is then debinarized and scored relative to the original tree.
The file MyParser.java
just contains placeholder code; your task is to provide an actual implementation. To do so, you need to implement the method
Tree getBestParse(List<String> tokens)
This method takes as its argument a list of word tokens and returns a best (most probable) parse tree whose terminal yield is that sequence, according to the grammar that was passed to MyParser
in the constructor.
In your implementation, you will have to compute two tables: one with the maximum scores for each span (J&M Exercises 14.1 and 14.4), and one with appropriate backpointers (cf. J&M p. 474). Data structures for these two tables are provided by the classes ScoreChart
and BackpointerChart
, respectively.
Your parser need to be able to handle unary rules. However, it does not need to handle chains of unary rules.
Your parser should achieve an F1-Measure of around 60%, and should be able to parse the provided data in no more than 10 seconds per sentence.
Practical Hints
You should tell Java to allocate as much memory as possible on the machine that you are working on. You can do this by passing the -Xmx
flag. For example, the following will ask Java to allocate 1 GB of memory:
java -Xmx1G -jar up.jar ...
In your implementation of MyParser
, you will need to use functionality from the class Grammar
, which represents PCFGs.
In order to debug your parser, you may want to write small grammars and input data yourself. You can open grammar.txt
and validate.mrg
to see how the format looks like. A grammar is specified as a list of rules. Each rule can either be a preterminal rule (A => word
), a unary rule (A -> B
), or a binary rule (A -> B C
). Note that preterminal rules are written down with the double arrow =>
(equals, greater than), while internal rules are written down with the single arrow ->
(dash, greater than). Each rule is followed by its probability. A parse tree is specified in the Penn Treebank format, as shown in J&M Figure 12.9.
The only metod where you will have to edit code is getBestParse
in MyParser
. It is, however important to familiarise yourself with the rest of the code, since you will need to use variables of other classes, and use methods in other classes. At least the following classes are highly relevant:
ScoreChart
BackpointerChart
Grammar
Rule
PreterminalRule
BinaryRule
UnaryRule
Backpointer
Submission and Grading
In order to pass this assignment (Godkänt), you will need to submit an implementation of a working parser, and report its performance in terms of its accuracy (F-measure) and runtime (sentences/second), as printed by the software provided.
In order to pass this assignment with distinction (Väl godkänt), I expect you to show ability to analyse your work, synthesize new ideas, and/or critically assess your implementation. Here are some suggestions for extensions to the assignment that will let you demonstrate these abilities:
What would you need to do in order for your parser to return trees with long chains of unary rules?
Discuss how a part-of-speech tagger could be integrated into your parser.
Suggest a way in which your working parser can be improved. Implement this suggestion, or at least come up with a concrete plan for doing so.
Compare your parser with an existing system from the parsing literature in the form of a one-page essay.
In case you choose to go for any of these extensions, or if you have an idea of your own, please let me know beforehand.
Deadline
2014-12-10
Problems
Please do not hesitate to contact me as soon as possible either personally or via email in case you encounter any problems.
Enjoy!