Word-based SMT

Task descriptions

  1. Part 1: with oral examination in class
  2. Part 2: with a written lab report

Aim

The goal of this lab is to experiment with word-based statistical machine translation on an easy text type. The aim is to gain a basic understanding of the role of the translation and language models. The lab contains two parts, each with some subtasks.

In part 1 of the lab you will manually change the probabilities of the translation and language models. The goal is for you to get a feeling for how the probabilities affect the system. The setup is, of course, artificial. Normally you would not manipulate probabilities in this way, but estimate them from data. In particular, you should never change probabilities based on a very small test set, as in this lab. Here we do it just so that you can get a feeling for how the translation and language models work. For the language model you will also train a model on data in part 2 of the lab, which is what you would normally do for the translation model as well.

Preparation

Copy the initial models and the test and training data from the course area:

mkdir lab2
cd lab2
cp /local/kurs/mt/lab2/data/* .

Task

In this lab we will use a simple word-based decoder to translate sentences from the blocks world between Swedish and English. The decoder does not allow any reordering. It uses a translation model that consists of two parts, word translation probabilities and fertility probabilities, and a language model. In this lab you will adjust probabilities by hand, in order to get a feeling for how they affect translation, instead of training them on a corpus, as is normally done.

Models

The word translation model contains the probability of a source word translating into a target word, or to the special NULL word, i.e. that it is not translated. The format is that each line contains a target word, a source word and a probability, separated by white space, as:


block	blocket	1	   
take	ta	1
the	den	0.4
the	det	0.1
the	NULL	0.5
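
When adjusting probabilities by hand, it can help to see all entries for one word together. The following is a minimal sketch, not part of the lab tools: translations_of is a hypothetical helper name, and it assumes only the three-column, white-space separated format shown above.

```shell
# translations_of WORD FILE   (hypothetical helper, not part of the decoder)
# Print every entry whose first column is WORD, highest probability first.
translations_of() {
  awk -v w="$1" '$1 == w' "$2" | sort -k3,3nr
}
```

Called with the word "the" and a file containing the sample lines above, it would list the NULL entry first, then den, then det.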

The fertility model contains probabilities for how many words each word can translate into. For most words in the blocks world, there will be probability 1 that they translate into 1 word. The format is again one entry per line, white space separated, containing a source word, a fertility (0-2 in the current implementation, which is enough for the blocks world) and a probability, as:


block	1	1	   
take	1	1
the	0	0.5
the	1	0.5
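
In both samples above, the probabilities that share a first-column word sum to 1. Assuming you want to keep it that way after editing by hand (the lab text does not say whether the decoder requires it), a small check could look like the sketch below. check_sums is a hypothetical helper name, and the tolerance of 0.001 is an arbitrary choice.

```shell
# check_sums FILE   (hypothetical helper, not part of the decoder)
# Sum the last column per first-column word and report every word whose
# total is not approximately 1. Prints nothing when all sums are fine.
check_sums() {
  awk 'NF >= 3 { sum[$1] += $NF }
       END {
         for (w in sum)
           if (sum[w] < 0.999 || sum[w] > 1.001)
             printf "%s\t%.3f\n", w, sum[w]
       }' "$1"
}
```

Since it only looks at the first and last column, the same function can be used on both the word translation files and the fertility files.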

The language model contains probabilities for n-grams in the target language. It contains at least 1-grams, but can also contain higher-order n-grams. In the first part you do not need to use more than 3-grams. The LM format is the ARPA format, but extra headings and backoff probabilities are ignored. That means that for each n-gram order, there is a section that starts with a line with the tag \n-gram, followed by lines with a probability and n words, separated by white space, such as:


\1-gram
0.3 block
0.3 take
0.4 the

\2-gram
1 take the
1 the block
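
To get an overview of a language model file while editing it, you can count the entries in each section. This is a minimal sketch that assumes only the section layout described above. count_ngrams is a hypothetical helper name, and lines before the first \n-gram tag are skipped.

```shell
# count_ngrams FILE   (hypothetical helper, not part of the decoder)
# Print each section tag (without the backslash) and its number of entries.
count_ngrams() {
  awk '/^\\[0-9]+-gram/ { order = substr($1, 2); next }   # section header
       order != "" && NF >= 2 { n[order]++ }              # probability + word(s)
       END { for (o in n) print o, n[o] }' "$1" | sort
}
```

For the sample above, this would report 3 entries under 1-gram and 2 entries under 2-gram.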

The given fertility models are called tmf.[swe/eng], the word translation models are called tmw.[sweeng/engswe], and the language models are called lm.[swe/eng]. The given models contain the words and word pairs that you need for the lab, initialized with uniform probabilities.

Decoder

You will use a very simple word-based decoder that does not allow any reordering. The decoder can translate either single sentences or multiple sentences from a file, where it treats each line as a sentence. It normally outputs an n-best list of the best translation options, with their respective total probabilities. The decoder is called on the command line:

/local/kurs/mt/lab2/simple_decoder/translate -lm LM-file -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER

To run a system with the given models from Swedish to English you run:

/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2

The above shows the obligatory arguments, which are the three model files. In the lab you will also need to set the maximum n-gram order, first to 2, later to larger values. If called like this, the decoder prompts the user to enter a sentence to be translated. You can also give the decoder a file with sentences or use it for evaluation, see below. In addition, there are a number of other arguments that can be used.

Domain

In this assignment we will work on the blocks world domain, which contains commands and descriptions of taking and putting different types of blocks on different positions. The model files provided contain all words in the vocabulary with equal probabilities. The sentences you should work on translating are shown below. They are also in the files test_meningar.[swe/eng] that you copied from the course area. For the lab we consider the given translations the only possible ones, and ignore any potential ambiguities. You can thus consider full sentences correct or incorrect, and do not have to use metrics such as BLEU.

ta en pil
take an arrow
ställ en kon på mitt block
put a cone on my block
hon tar blocket
she takes the block
ställ en röd pil på en blå cirkel
put a red arrow on a blue circle
jag tar ett blått block
i take a blue block
ställ pilen på cirkeln
put the arrow on the circle
han ställer det röda blocket på den blåa cirkeln
he puts the red block on the blue circle
han ställer en pil på sin cirkel
he puts an arrow on his circle
hon ställer sitt block på min blåa cirkel
she puts her block on my blue circle
jag ställer konen på cirkeln på hennes blåa cirkel
i put the cone on the circle on her blue circle

Feel free to add some more sentences if you want to illustrate some issue that is not covered by these sentences!

The given LM and TM files contain the full vocabulary, and all needed fertilities and word translations. In part 2 you will work on language modeling, and a small corpus is provided for that. If you want to, you can have a look at it now to get a better feel for the domain. These files are named corpus.*.*

We will work on translation between Swedish and English. For non-Swedish speakers, there is a brief description of the relevant language phenomena. You should translate in both directions, but non-Swedish speakers can focus most of their discussion on translation into English.

In all assignments you should try to get one set of weights that gives the globally best results for all 10 test sentences. This might mean that a change makes the result worse for one single sentence, but better for two other ones. Your task is to find a good compromise of weights that work reasonably well across all 10 sentences.

Evaluation

The evaluation will be done by seeing where in the n-best list the correct translation occurs, if it occurs at all. You should also do some qualitative analysis where you discuss what goes wrong in the translation hypotheses, and why. You should discuss some specific examples of problems with the translations, and possible reasons for the decoder not choosing the correct translation.

The decoder contains a function which gives you the rank of the correct hypothesis in the n-best list, and the average rank for all sentences. If a sentence does not have a translation, the rank will be approximated as 500, which means the average rank is not very trustworthy in that case. But the decoder should be able to translate all provided sentences. For evaluation, the n-best flag of the decoder is ignored even if given, and all translation hypotheses are explored. To run this function, use "-eval referenceFile" as an argument to the decoder, for instance:


# for translation from Swedish to English
/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2  -in test_meningar.swe -eval test_meningar.eng

# for translation from English to Swedish
/local/kurs/mt/lab2/simple_decoder/translate -lm lm.swe -tmw tmw.engswe -tmf tmf.eng -o 2  -in test_meningar.eng -eval test_meningar.swe
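
Since you will rerun the evaluation many times, in both directions and with different n-gram orders, it may be convenient to wrap the two commands above in a function. The sketch below uses only the arguments shown above. The function name run_eval and the DECODER variable are hypothetical conveniences, not part of the lab setup.

```shell
# Path to the decoder; can be overridden by setting DECODER beforehand.
DECODER=${DECODER:-/local/kurs/mt/lab2/simple_decoder/translate}

# run_eval SRC TGT ORDER   (hypothetical helper)
# SRC and TGT are "swe" or "eng"; ORDER is the maximum n-gram order.
run_eval() {
  src=$1 tgt=$2 order=$3
  "$DECODER" -lm "lm.$tgt" -tmw "tmw.$src$tgt" -tmf "tmf.$src" -o "$order" \
    -in "test_meningar.$src" -eval "test_meningar.$tgt"
}
```

With this, "run_eval swe eng 2" corresponds to the first command above and "run_eval eng swe 2" to the second.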

Lab report

Hand in your report for part 2 as a pdf through the student portal. Deadline for handing in the report: April 28, 2017.

If you failed to attend the session for part 1, you also have to hand in a report for that part. If both persons in the pair missed part 1, hand in the part 1 report together with the results from part 2. If one person missed the part 1 session, that person should hand in an individual report for part 1. In that case there should be another joint report from that pair for part 2.