Word-based SMT, Part 1

Aim

The goal of this lab is to experiment with word-based statistical machine translation, on an easy text type. The aim is to gain a basic understanding of the role of the translation and language models.

In this lab you will manually change the probabilities of the translation and language models. The goal is that you get a feeling for how the probabilities affect the system. The setup is, of course, artificial: normally you would not manipulate probabilities in this way, but estimate them from data. For the language model part you will try doing just that in part 2 of the lab.

Note that there will be only one report for both parts of this lab. The instructions are given in the assignments for part 2.

Preparation

Copy the initial models and the test sets from the course area:

mkdir lab2
cd lab2
cp /local/kurs/mt/lab2/data/* .

Task

In this lab we will use a simple word-based decoder to translate sentences from the blocks world. The decoder does not allow any reordering. It uses a translation model that consists of two parts, word translation probabilities and fertility probabilities, and a language model. In this lab you will adjust the probabilities by hand, in order to get a feeling for how they affect translation, instead of training them on a corpus, as is normally done.
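As a simplified illustration of how such a decoder can combine the three models, the sketch below scores one monotone translation hypothesis where every word has fertility 1. This is an assumption about the general technique, not the actual decoder's code, and all names and probabilities are illustrative.

```python
# Simplified sketch: the score of a hypothesis is the product of
# word-translation, fertility, and bigram language-model probabilities.
# All model entries below are made up for illustration.

def score_hypothesis(source, target, tmw, tmf, lm_bigram):
    """source/target: word lists aligned 1-1 (no reordering)."""
    p = 1.0
    for s, t in zip(source, target):
        p *= tmw.get((t, s), 0.0)      # word translation probability
        p *= tmf.get((s, 1), 0.0)      # probability of fertility 1
    for prev, cur in zip(["<s>"] + target, target):
        p *= lm_bigram.get((prev, cur), 0.0)  # bigram LM probability
    return p

tmw = {("take", "ta"): 1.0, ("an", "en"): 0.5, ("arrow", "pil"): 1.0}
tmf = {("ta", 1): 1.0, ("en", 1): 1.0, ("pil", 1): 1.0}
lm  = {("<s>", "take"): 0.5, ("take", "an"): 0.5, ("an", "arrow"): 1.0}

print(score_hypothesis(["ta", "en", "pil"], ["take", "an", "arrow"],
                       tmw, tmf, lm))  # → 0.125
```

The decoder enumerates many such hypotheses and returns the highest-scoring ones as its n-best list.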

Models

The word translation model contains the probability of a source word translating into a target word, or to the special NULL word, i.e. that it is not translated. The format is that each line contains a target word, a source word and a probability, separated by white space, as:


block	blocket	1	   
take	ta	1
the	den	0.4
the	det	0.1
the	NULL	0.5
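If you want to inspect or check the model files programmatically, a minimal reader for this format could look as follows. This is an illustrative helper, not part of the lab tools.

```python
# Read the word-translation format shown above: one entry per line,
# whitespace separated, as "target source probability".

def read_tmw(lines):
    tmw = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            target, source, prob = parts
            tmw[(target, source)] = float(prob)
    return tmw

tmw = read_tmw(["block\tblocket\t1", "the\tden\t0.4", "the\tNULL\t0.5"])
print(tmw[("the", "den")])  # → 0.4
```

To read one of the provided files you could call, e.g., `read_tmw(open("tmw.sweeng"))`.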

The fertility model contains probabilities for how many words each word can translate into. For most words in the blocks world, there will be probability 1 that they translate into 1 word. The format is again one entry per line, whitespace separated, containing a source word, a fertility (0-2 in the current implementation, which is enough for the blocks world) and a probability, as:


block	1	1	   
take	1	1
the	0	0.5
the	1	0.5

The language model contains probabilities for n-grams in the target language. It contains at minimum 1-grams, but can also contain higher-order n-grams. In this lab you do not need to use more than 3-grams. The LM format is the ARPA format, but extra headings and backoff probabilities are ignored. This means that each n-gram section starts with a line with the tag \N-gram (where N is the n-gram order), followed by lines with a probability and N words, separated by white space, such as:


\1-gram
0.3 block
0.3 take
0.4 the

\2-gram
1 take the
1 the block
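A minimal reader for this simplified ARPA-style format could be sketched as below. This is an illustrative helper, not part of the lab tools; like the decoder, it skips extra headings and drops any trailing backoff weights.

```python
# Read the simplified ARPA-style LM format: "\N-gram" lines open a
# section of order N; each following line is "probability w1 ... wN".

def read_lm(lines):
    lm = {}
    order = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            if line.endswith("-gram"):
                order = int(line[1:line.index("-")])
            continue  # other headings (e.g. \data\) are skipped
        if order == 0:
            continue  # lines before the first n-gram section
        parts = line.split()
        prob = float(parts[0])
        ngram = tuple(parts[1 : 1 + order])
        lm[ngram] = prob  # anything after the N words is ignored
    return lm

lm = read_lm(["\\1-gram", "0.3 block", "0.4 the", "\\2-gram", "1 take the"])
print(lm[("take", "the")])  # → 1.0
```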

The given fertility models are called tmf.[swe/eng], the word translation models are called tmw.[sweeng/engswe], and the language models are called lm.[swe/eng]. The given models contain the words and word pairs that you need for the lab, initialized with uniform probabilities.

Decoder

You will use a very simple word-based decoder that does not allow any reordering. The decoder can translate either single sentences, or multiple sentences from a file, where it treats each line as a sentence. It outputs an n-best list of the best translation options, with their respective total probabilities. The decoder is called on the command line:

/local/kurs/mt/lab2/simple_decoder/translate -lm LM-file -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER
To run a system with the given models from Swedish to English you run:

/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2

The above shows the obligatory arguments, the three model files. In the lab you will also need to set the maximum n-gram order with the "-o" flag, first to 2 and later to larger values. If called like this, the decoder prompts the user to enter a sentence to be translated. In addition there are a number of other arguments, such as "-n", "-in" and "-b", which are described where they are needed below.

Domain

In this assignment we will work on the blocks world domain, which contains commands and descriptions of taking and putting different types of blocks on different positions. The model files provided contain all words in the vocabulary with equal probabilities. The sentences you should work on translating are shown below. They are also in the files test_meningar.[swe/eng] that you copied from the course area. For the lab we consider the given translations the only possible ones, and ignore any potential ambiguities. You can thus consider full sentences correct or incorrect, and do not have to use metrics such as BLEU.
ta en pil
take an arrow
ställ en kon på mitt block
put a cone on my block
hon tar blocket
she takes the block
ställ en röd pil på en blå cirkel
put a red arrow on a blue circle
jag tar ett blått block
i take a blue block
ställ pilen på cirkeln
put the arrow on the circle
han ställer det röda blocket på den blåa cirkeln
he puts the red block on the blue circle
han ställer en pil på sin cirkel
he puts an arrow on his circle
hon ställer sitt block på min blåa cirkel
she puts her block on my blue circle
jag ställer konen på cirkeln på hennes blåa cirkel
i put the cone on the circle on her blue circle
Feel free to add some more sentences if you want to illustrate some issue that is not covered by these sentences!

The given LM and TM files contain the full vocabulary, and all needed fertilities and word translations. In part 2 we will work on language modeling, and a small corpus will be provided. If you want, you can have a look at it now, to get a better feel for the domain. In that case you need to copy those files; see the instructions in part 2.

We will work on translation between Swedish and English. For non-Swedish speakers, there is a brief description of the relevant language phenomena. You should translate in both directions, but non-Swedish speakers can focus most of their discussion on translation into English.

In all assignments you should try to get one set of weights that gives the globally best results for all 10 test sentences. This might mean that a change makes the result worse for one single sentence, but better for two other ones. Your task is to find a good compromise of weights that work reasonably well across all 10 sentences.

Evaluation

The evaluation will be done by seeing where in the n-best list the correct translation occurs, if it occurs at all. You should also do some qualitative analysis where you discuss what goes wrong in the translation hypotheses, and why.

If you aspire to a VG grade, the main evaluation should be done by a script that you have to write, see below. Writing this program does not guarantee a VG on the lab; you are also required to write a high-quality discussion. For a G grade, you can find the rank of the correct translation manually. In both cases you should also discuss some specific examples of problems with the translations, and possible reasons for the correct translation not being chosen.

Evaluation script for VG

In order to achieve the grade VG you need to write an evaluation script that compares the system output with the reference translations. You can write the program in the programming language of your choice. It should take the system output, the reference translations, and the n-best list size as input. Below is an example, given that you wrote your code in Java.

//Given this decoder command:
/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2 -n 20 -in test_meningar.swe > output.eng

//A Java-program could be run as:
java EvalMT output.eng test_meningar.eng 20

The output of the program should show the rank of each sentence in the n-best list, or "none" if the correct translation is not on the list. It should also show the average rank of the correct translation in the n-best list. If a translation is not on the n-best list, assume that its rank is n+1 for this calculation. Note that you only have to compare full sentences; there is no need to work on the word level. Example output is shown below. You do not have to follow the format exactly, but all information should be clearly shown. It is fine if you want to print in a format that could easily be pasted directly into your report, such as a LaTeX table.


Sentence Rank
1         6
2        10
3        none
4         6     
5         4
6        12
7         8
8         3
9        20
10        1

Average rank: 9.1
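As an alternative to the Java example above, the core of such an evaluation program could be sketched in Python as follows. The assumption that the decoder's output file simply contains n candidate lines per sentence is mine and may need adjusting to the actual output format.

```python
# Compare n-best output against references: report the rank of each
# reference translation, and the average rank, counting a missing
# translation as rank n+1.

def evaluate(nbest_lines, references, n):
    ranks = []
    for i, ref in enumerate(references):
        candidates = nbest_lines[i * n : (i + 1) * n]
        rank = candidates.index(ref) + 1 if ref in candidates else None
        ranks.append(rank)
    avg = sum(r if r is not None else n + 1 for r in ranks) / len(ranks)
    return ranks, avg

# Tiny worked example with n = 2 and two sentences:
nbest = ["take a arrow", "take an arrow",   # candidates for sentence 1
         "put a cone", "put the cone"]      # candidates for sentence 2
refs = ["take an arrow", "put the arrow"]
ranks, avg = evaluate(nbest, refs, 2)
print(ranks, avg)  # → [2, None] 2.5
```

A full script would additionally read the two files and n from the command line, and print "none" for missing translations in the table format shown above.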

Assignment

1 - Run the system with uniform probabilities

In the given model files all probabilities are equal. This likely gives bad translations. Run the sample sentences through the translation system, and study the translation suggestions. Feel free to add some more sentences if you want to explore something you find interesting. How often is the correct translation at the top, and how often is it missing from the n-best list? If translations are missing from the n-best list, you can try increasing its size, using the "-n" flag. What is the average rank of the correct translation? Is there a difference in output quality between the two translation directions? If so, what do you think is the cause? Discuss some of the problems you see. What is causing them?

2 - Manipulate the translation models

In this task you should adjust the probabilities for fertilities and word translations, to achieve better translations. You should not set any probabilities to 0, and they should all be real probabilities, i.e. 0 < p ≤ 1. In some cases you could improve some translations at the cost of others. There are problems you cannot solve by only manipulating the translation model.

How good are your translations after manipulating the translation model, compared to using the uniform model? Discuss what you can and cannot solve, and how you did it.

Master students only: your probability models should contain proper probability distributions, i.e. the probabilities should sum to 1 for each word: the fertilities for each word should sum to 1, and the word translation probabilities p(s|t) should sum to 1 for each t. The given models are nearly correct in this respect, except that 0.33*3 = 0.99, not 1.
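A quick way to check this requirement is to sum the probabilities per word and flag any distribution that does not sum to 1. The sketch below does this for the fertility format shown earlier; it is an illustrative helper, not part of the lab tools.

```python
# Sum fertility probabilities per source word and flag any word whose
# distribution does not sum to 1 (within a small tolerance).
from collections import defaultdict

def fertility_sums(lines):
    sums = defaultdict(float)
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            word, _fertility, prob = parts
            sums[word] += float(prob)
    return sums

sums = fertility_sums(["the 0 0.5", "the 1 0.5", "en 1 0.99"])
for word, total in sums.items():
    if abs(total - 1.0) > 1e-6:
        print(f"{word}: sums to {total}")  # flags "en"
```

To check one of the provided files, pass e.g. `open("tmf.swe")` instead of the list; the same idea applies to checking that p(s|t) sums to 1 per target word in the tmw files.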

3 - Manipulate the language model

Go on to manipulate the language model as well, in order to further improve the translations. In the given files there are 1-grams and 2-grams with equal probabilities. Try to adjust just these first, and see if you can solve some problems. Again, you should not set any probabilities to 0, and they should all be real probabilities, i.e. 0 < p ≤ 1. You might also want to add n-grams that are missing, or remove ungrammatical n-grams. You only have to adjust the probabilities for the word sequences in the test set, not for all possible sentences. You might also want to add some 3-grams; if you do, remember to change the decoder flag for the order to "-o 3". Which problems require 3-grams? Are there still problems that cannot be solved with 3-grams? If so, discuss why. You might have to adjust some TM probabilities when you start changing the LM probabilities; in that case, describe which and why.

In addition to changing the probabilities for n-grams in the file, you can also change the backoff weight for unknown n-grams. A very simple backoff strategy is used in the decoder. If, for example, a 3-gram is missing, it backs off to the 2-gram, but with a penalty that can be set on the command line. This penalty is simply multiplied by the 2-gram probability. If the 2-gram is missing too, it backs off to the 1-gram, and multiplies by the penalty yet another time. The backoff penalty is set on the command line with the flag "-b WEIGHT", and the default value is 0.05. Write your final value of the back-off penalty in your report. Could you change the results only by changing the back-off penalty?
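The backoff behaviour described above can be sketched as follows. This is an assumption about the decoder's internals based on the description, not its actual code.

```python
# Back off from a missing n-gram to the (n-1)-gram, multiplying by the
# penalty once per backoff step.

def lm_prob(ngram, lm, penalty=0.05):
    """ngram: tuple of words; lm: dict mapping word tuples to probs."""
    if ngram in lm:
        return lm[ngram]
    if len(ngram) == 1:
        return 0.0  # unknown word; the real decoder may handle this differently
    return penalty * lm_prob(ngram[1:], lm, penalty)

lm = {("the",): 0.4, ("take", "the"): 1.0}
print(lm_prob(("you", "take", "the"), lm))  # → 0.05 (one backoff step)
```

A 3-gram missing from the model thus gets 0.05 times its 2-gram probability, and if the 2-gram is also missing, 0.05 * 0.05 times the 1-gram probability.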

How good are your final translations? Compare them to the result from the previous steps. Can you get all correct translations, and can you get them in high or top positions? Were there cases where you could get some correct translations at the cost of getting other translations wrong? How much are the translations improved compared to the given uniform weights? Discuss the difference between the two translation directions, and possible reasons for it!

Lab report

You should not write a separate report for this part of the lab, but combine it with part 2. What you need to hand in from part 1 are discussions of the questions in the assignments, the final model files with the new weights from tasks 2 and 3, and the code for your evaluation program, if you chose to write it.