Note that there was a bug in the decoder that affected the total probability calculations. If you already started the lab, you can keep using the old decoder, so you do not have to redo any work. But if you're starting it now, please use the new decoder.
The goal of this lab is to experiment with word-based statistical machine translation, on an easy text type. The aim is to gain a basic understanding of the role of the translation and language models.
Note that there will be only one report for lab sessions 2-3. The instructions are given in the assignments of lab 3.
Start by copying the lab files into a working directory:
mkdir lab2
cd lab2
cp /local/kurs/mt/lab2/data/* .
In this lab we will use a simple word-based decoder to translate sentences from the blocks world. The decoder does not allow any reordering. It uses a translation model that consists of two parts, word translation probabilities and fertility probabilities, and a language model. Instead of training the models on a corpus, as is normally done, you will adjust the probabilities by hand, in order to get a feeling for how they affect the translations.
The word translation model contains the probability of a source word translating into a target word, or into the special NULL word, i.e. of it not being translated. The format is one entry per line, each containing a target word, a source word and a probability, separated by white space, as:
block blocket 1
take ta 1
the den 0.4
the det 0.1
the NULL 0.5
The fertility model contains probabilities for how many words each word can translate into. For most words in the blocks world, there will be probability 1 that they translate into 1 word. The format is again one entry per line, whitespace separated, containing a source word, a fertility (0-2 in the current implementation, which is enough for the blocks world) and a probability, as:
block 1 1
take 1 1
the 0 0.5
the 1 0.5
The language model contains probabilities for n-grams in the target language. It contains at minimum 1-grams, but can also contain higher-order n-grams. In this lab you do not need to use more than 3-grams. The LM format is the ARPA format, but extra headings and backoff probabilities are ignored. That means that each n-gram section starts with a line with the tag \n-gram, followed by lines with a probability and n words, separated by white space, such as:
\1-gram
0.3 block
0.3 take
0.4 the
\2-gram
1 take the
1 the block
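With the toy entries above and a maximum n-gram order of 2, a bigram model would, for example, score take the block as p(take) * p(the | take) * p(block | the) = 0.3 * 1 * 1 = 0.3 (this ignores any sentence-boundary handling, which the simplified format above does not show).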
The given fertility models are called tmf.[swe/eng], the word translation models are called tmw.[sweeng/engswe], and the language models are called lm.[swe/eng]. The given models contain the words and word pairs that you need for the lab, initialized with uniform probabilities.
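To see how the three models interact, here is a minimal sketch in Python of how a word-based decoder without reordering might score a single translation candidate. This is an illustrative assumption about the scoring, not the course decoder's actual code; the toy probabilities are made up, and for simplicity every word is assumed to have fertility 1 and to be aligned one-to-one.

# A minimal sketch of word-based scoring, NOT the course decoder.
# All probabilities below are made-up toy values.

# p(source | target), as in the tmw files (target source prob)
tmw = {("take", "ta"): 1.0, ("an", "en"): 1.0, ("arrow", "pil"): 1.0}
# p(fertility | source), as in the tmf files (source fertility prob)
tmf = {("ta", 1): 1.0, ("en", 1): 1.0, ("pil", 1): 1.0}
# n-gram probabilities, as in the lm files
lm = {("take",): 0.3, ("an",): 0.3, ("arrow",): 0.4,
      ("take", "an"): 1.0, ("an", "arrow"): 1.0}

def score(source, target, order=2):
    p = 1.0
    # translation model: word translation and fertility probabilities,
    # assuming a one-to-one alignment (fertility 1 everywhere)
    for s, t in zip(source, target):
        p *= tmw.get((t, s), 0.0) * tmf.get((s, 1), 0.0)
    # language model: use the longest matching n-gram, up to the -o order
    for i in range(len(target)):
        for n in range(min(order, i + 1), 0, -1):
            ngram = tuple(target[i - n + 1:i + 1])
            if ngram in lm:
                p *= lm[ngram]
                break
        else:
            return 0.0  # word unknown to the language model
    return p

print(score("ta en pil".split(), "take an arrow".split()))  # 0.3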
The decoder is run as:
/local/kurs/mt/lab2/simple_decoder_fixed/translate -lm LM-file -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER
To run a system with the given models from Swedish to English you run:
/local/kurs/mt/lab2/simple_decoder_fixed/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2
If you started the lab with the buggy decoder, you can keep using that version so that you do not have to redo any work:
/local/kurs/mt/lab2/simple_decoder/translate -lm LM-file -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER
To run a system with the given models from Swedish to English you run:
/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2
In this assignment we will work on the blocks world domain, which contains commands and descriptions of taking and putting different types of blocks in different positions. The model files provided contain all words in the vocabulary, with equal probabilities. The sentences you should work on translating are shown below. They are also in the files test_meningar.[swe/eng] that you copied from the course area. For the lab we consider the given translations the only possible ones, and ignore any potential ambiguities. You can thus consider full sentences correct or incorrect, and do not have to use metrics such as BLEU.
ta en pil | take an arrow
ställ en kon på mitt block | put a cone on my block
hon tar blocket | she takes the block
ställ en röd pil på en blå cirkel | put a red arrow on a blue circle
jag tar ett blått block | i take a blue block
ställ pilen på cirkeln | put the arrow on the circle
han ställer det röda blocket på den blåa cirkeln | he puts the red block on the blue circle
han ställer en pil på sin cirkel | he puts an arrow on his circle
hon ställer sitt block på min blåa cirkel | she puts her block on my blue circle
jag ställer konen på cirkeln på hennes blåa cirkel | I put the cone on the circle on her blue circle
The given LM and TM files contain the full vocabulary, and all needed fertilities and word translations. In lab 3 we will work on language modeling, and a small corpus will be provided. You can have a look at it now, to get a better feel for the domain. LINK
We will work on translation between Swedish and English. For non-Swedish speakers, there is a brief description of the relevant language phenomena. You should translate in both directions, but non-Swedish speakers can focus most of their discussion on translation into English.
In the given model files all probabilities are equal. This likely gives bad translations. Run the sample sentences through the translation system, and study the translation suggestions. Feel free to add some more sentences if you want to explore something you find interesting. How often is the correct translation at the top, and how often is it missing from the n-best list? What is the average rank of the correct translation? Is there a difference between the difficulties of the two translation directions? If so, what do you think is the cause? Discuss some of the problems you see. What is causing the problems?
Go on and adjust the probabilities for fertilities and word translations, to achieve better translations. In some cases you may only be able to improve some translations at the cost of others, and there are problems you cannot solve by manipulating the translation model alone. Discuss what you can and cannot solve, and how you do it.
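As a purely mechanical illustration (not a suggested solution), an edit to the word translation entries for the from the format example above might move probability mass away from the NULL entry while keeping the sum at 1:

the den 0.4
the det 0.4
the NULL 0.2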
Master students only: your probability models should contain proper probability distributions, i.e. the probabilities should sum to 1 for each word: the fertilities for each source word should sum to 1, and the word translation probabilities p(s|t) should sum to 1 for each target word t. The given models are nearly correct in this respect, except that 0.33*3=0.99, and not 1.
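If you want to check your edited files automatically, here is a small Python sketch (the file names and tolerance are just examples; it only assumes the three-column format described above):

from collections import defaultdict

def check_sums(path, tolerance=1e-6):
    # Model files have three whitespace-separated columns; probabilities
    # are grouped by the word in the first column (the target word in
    # tmw files, the source word in tmf files).
    sums = defaultdict(float)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3:
                sums[fields[0]] += float(fields[2])
    for word, total in sorted(sums.items()):
        if abs(total - 1.0) > tolerance:
            print(f"{path}: entries for {word!r} sum to {total:g}")

for path in ["tmw.sweeng", "tmw.engswe", "tmf.swe", "tmf.eng"]:
    check_sums(path)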
How good are your final translations? Can you get all translations correct, and can you get them in the top position? Were there cases where you could get some translations correct only at the cost of getting others wrong? How much do the translations improve compared to the given uniform weights? Is there a difference between the two translation directions? In that case, discuss why!