The goal of this lab is to experiment with word-based statistical machine translation on an easy text type. The aim is to gain a basic understanding of the role of the translation and language models. The lab consists of two parts, each with a few subtasks.
In part 1 of the lab you will manually change the probabilities of the translation and language models. The goal is for you to get a feeling for how the probabilities affect the system. The setup is, of course, artificial. Normally you would not manipulate probabilities in this way, but estimate them from data. In particular, you should never change probabilities based on a very small test set, as in this lab. Here we do it only so that you can get a feeling for how the translation and language models work. For the language model you will also train a model on data in part 2 of the lab, which is how you would normally proceed for the translation model as well.
Start by creating a working directory and copying the lab files:
mkdir lab2
cd lab2
cp /local/kurs/mt/lab2/data/* .
In this lab we will use a simple word-based decoder to translate sentences from the blocks world between Swedish and English. The decoder does not allow any reordering. It uses a translation model that consists of two parts, word translation probabilities and fertility probabilities, and a language model. In this lab you will adjust the probabilities by hand, in order to get a feeling for how they affect translation, instead of training them on a corpus, as is normally done.
The word translation model contains the probability of a source word translating into a target word, or into the special NULL word, i.e. the probability that it is not translated. The format is one entry per line, containing a target word, a source word and a probability, separated by white space, as:
block blocket 1
take ta 1
the den 0.4
the det 0.1
the NULL 0.5
The fertility model contains probabilities for how many words each word can translate into. For most words in the blocks world, there will be a probability of 1 that they translate into one word. The format is again one entry per line, white space separated, containing a source word, a fertility (0-2 in the current implementation, which is enough for the blocks world) and a probability, as:
block 1 1
take 1 1
the 0 0.5
the 1 0.5
The language model contains probabilities for n-grams in the target language. It contains at least 1-grams, but can also contain higher-order n-grams. In the first part you do not need to use more than 3-grams. The LM format is the ARPA format, but extra headers and backoff probabilities are ignored. This means that each n-gram section starts with a line with the tag \n-gram, followed by lines with a probability and n words, separated by white space, such as:
\1-gram
0.3 block
0.3 take
0.4 the
\2-gram
1 take the
1 the block
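The same pattern repeats for higher orders. For instance, a hypothetical 3-gram section (not part of the toy example above) would look like:
\3-gram
1 take the block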
The given fertility models are called tmf.[swe/eng], the word translation models are called tmw.[sweeng/engswe], and the language models are called lm.[swe/eng]. The given models contain the words and word pairs that you need for the lab, initialized with uniform probabilities.
/local/kurs/mt/lab2/simple_decoder/translate -lm LM-file -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER
To run a system with the given models from Swedish to English you run:
/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2
In this assignment we will work on the blocks world domain, which contains commands and descriptions of taking and putting different types of blocks on different positions. The model files provided contain all words in the vocabulary with equal probabilities. The sentences you should work on translating are shown below. They are also in the files test_meningar.[swe/eng] that you copied from the course area. For the lab we consider the given translations the only possible ones, and ignore any potential ambiguities. You can thus count full sentences as correct or incorrect, and do not have to use metrics such as BLEU.
ta en pil | take an arrow
ställ en kon på mitt block | put a cone on my block
hon tar blocket | she takes the block
ställ en röd pil på en blå cirkel | put a red arrow on a blue circle
jag tar ett blått block | i take a blue block
ställ pilen på cirkeln | put the arrow on the circle
han ställer det röda blocket på den blåa cirkeln | he puts the red block on the blue circle
han ställer en pil på sin cirkel | he puts an arrow on his circle
hon ställer sitt block på min blåa cirkel | she puts her block on my blue circle
jag ställer konen på cirkeln på hennes blåa cirkel | i put the cone on the circle on her blue circle
The given LM and TM files contain the full vocabulary, and all needed fertilities and word translations. In part 2 you will work on language modeling, and a small corpus is provided for that. If you want, you can have a look at it now, to get a better feel for the domain. These files are named corpus.*.*
We will work on translation between Swedish and English. For non-Swedish speakers, there is a brief description of the relevant language phenomena. You should translate in both directions, but non-Swedish speakers can focus most of their discussion on translation into English.
In all assignments you should try to get one set of weights that gives the globally best results for all 10 test sentences. This might mean that a change makes the result worse for one single sentence, but better for two other ones. Your task is to find a good compromise of weights that works reasonably well across all 10 sentences.
The decoder contains a function which gives you the rank of the correct hypothesis in the n-best list, and the average rank over all sentences. If a sentence does not have a translation its rank is approximated to 500, which means the average rank is not very trustworthy in that case; however, the decoder should be able to translate all provided sentences. For evaluation the nbest-flag of the decoder is ignored even if given, and all translation hypotheses are explored. To run this function you use "-eval referenceFile" as an argument to the decoder, for instance:
/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2 -in test_meningar.swe -eval test_meningar.eng
In addition there is a blind, or secret, set of test sentences that you cannot see. You run the decoder on these sentences by giving the argument "-evalBlind engswe" or "-evalBlind sweeng". The decoder will then translate and evaluate the secret test set. Since this test set is secret you obviously cannot analyze what goes wrong with it, but can only use the automatic evaluation. Do not translate this test set often; only do it once each time you are asked to do so in the instructions. The point is not to see how good you can get on this test set, but to give an idea of the results that can be achieved on a test set that you do not optimize the decoder for. The command is, for instance:
/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2 -evalBlind sweeng
In the given model files all probabilities are equal. This likely gives bad translations. Run the test sentences through the translation system, study the translation suggestions, and use the automatic evaluation to explore the overall results and find out the average rank. Feel free to add some more sentences if you want to explore something you find interesting. How often is the correct translation at the top, and how often is it missing from the n-best list? If translations are missing from the n-best list, you can also try to increase its size, using the "-n" flag. Is there a difference in output quality between the two translation directions? If so, what do you think is the cause? Discuss some of the problems you see. What is causing them?
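For example, assuming that the "-n" flag takes the desired n-best list size as its argument, such a run could look like:
/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2 -in test_meningar.swe -n 50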
Also run the decoder on the blind test set once and save the results. In later assignments, compare your scores to this "uniform baseline".
In this task you should adjust the probabilities for fertilities and word translations, to achieve better translations. You should not set any probabilities to 0, and they should all be real probabilities, i.e. 0 < p ≤ 1. In some cases you could improve some translations at the cost of others. There are problems you cannot solve by only manipulating the translation model.
How good are your translations after manipulating the translation model, compared to using the uniform model? Discuss what you can and cannot solve, and how you did it. In your report, also describe what types of changes you made, and why. For instance, one change may be to give the translation of "en" and "ett" into "a" a higher probability than into "an", with the motivation that "a" is much more common in English than "an".
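As a purely hypothetical sketch of what such an edit could look like in tmw.sweeng (which entries actually exist in the given file, and which values work well, is for you to find out), the uniform lines involving "en" and "ett" might be changed to something like:
a en 0.7
an en 0.3
a ett 0.7
an ett 0.3
Note that these example values are not necessarily normalized; master students will need to adjust them so that each distribution still sums to 1 (see below).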
Try to make a reasonable number of changes. You do not have to spend a very long time on getting the best results possible, but try to make some principled changes based both on the test sentences and on your knowledge of English/Swedish.
Master students only: your probability models should contain proper probability distributions, i.e. the probabilities should sum to 1 for each word: the fertility probabilities for each word should sum to 1, and the word translation probabilities p(s|t) should sum to 1 for each target word t. The given models are correct in this respect, except for rounding (which is OK), e.g. 0.33*3 = 0.99 rather than 1.
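If you want to double-check this, one quick way (a sketch, assuming the file formats described above and that awk is available) is to sum the probabilities per conditioning word:
# should print a value close to 1 for each target word t in p(s|t)
awk '{sum[$1] += $3} END {for (t in sum) print t, sum[t]}' tmw.sweeng
# should print a value close to 1 for each source word in the fertility model
awk '{sum[$1] += $3} END {for (w in sum) print w, sum[w]}' tmf.swe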
When you are finished with this task, run the evaluation on the blind test set once, and compare the results to the uniform baseline. Was the change in the results as you expected?
Try to make a reasonable number of changes. You do not have to spend a very long time on getting the best results possible, especially not by adding a lot of 3-grams. Try to make some principled changes based both on the test sentences and on your knowledge of English/Swedish. On this task it is very easy to overfit to the given test set, for instance by adding all 3-grams in the test set and giving them a high probability. This is not the point of the task; instead, try to think about what makes the uniform model bad, and improve on it by making principled choices.
In addition to changing the probabilities for n-grams in the file, you may also change the backoff weight for unknown n-grams. A very simple backoff strategy is used in the decoder. If, for example, a 3-gram is missing, it backs off to the 2-gram, but with a penalty that can be set on the command line. This penalty is simply multiplied by the 2-gram probability. If the 2-gram is missing too, it backs off to the 1-gram, and multiplies by the penalty yet another time. The backoff penalty is set on the command line with the flag "-b WEIGHT", and the default value is 0.01. Write your final value of the backoff penalty in your report.
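As a small worked example with the toy language model shown earlier and the default penalty of 0.01 (assuming the decoder backs off to the n-gram formed by the last words, as is standard): if the decoder needs the 3-gram "take the block" and no 3-grams are listed, it backs off to the 2-gram "the block", which has probability 1, giving 0.01 * 1 = 0.01. If "the block" had been missing as well, it would back off to the 1-gram "block" and apply the penalty twice: 0.01 * 0.01 * 0.3 = 0.00003.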
How good are your final translations? Compare them to the results from the previous steps. Can you get all correct translations in relatively high positions? Was there some sentence that was harder than the others, and why? Were there cases where you could get improvements for some sentences at the cost of making translations of other sentences worse? How much are the translations improved compared to the given uniform weights? Also discuss the difference between the two translation directions, and possible reasons for it!
When you are finished with this task, run the evaluation on the blind test set once, and compare the results to those from the previous task. Was the change in the results as you expected? Do you think your LM is overfitted to the known test sentences?
/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text CORPUS_FILE -lm LM_FILE -order ORDER
where CORPUS_FILE is your training corpus, LM_FILE is the resulting language model that you will use as input to the decoder, and ORDER is the maximum LM order you want. The flag -wbdiscount means that the program uses Witten-Bell smoothing. It is suitable for small corpora, but you do not have to think further about smoothing in this lab.
In the resulting LM files the probabilities are given as logprobs. Our decoder uses standard probabilities, but can convert from logprobs if the flag -lp is used. It is therefore important that you add this flag in this part of the lab. The commands you run will thus need to use the following flags:
/local/kurs/mt/lab2/simple_decoder/translate -lm LM_FILE -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp
There are two corpora for training a language model: a parallel corpus, called corpus.parallel.*, and a monolingual corpus, called corpus.mono.*, which is larger. Both corpora contain the same type of blocks world sentences, but in the corpus labeled parallel, the English and Swedish sentences correspond to each other line by line (which would have been necessary if we had trained a translation model, but is not necessary for language model training). Have a brief look at the corpus files to familiarize yourself with them. For training your language model, concatenate the two corpora into one large corpus for each language, as in the sketch below.
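A possible sequence of commands for English (a sketch: the names corpus.all.eng and lm.trained.eng are just suggestions, and the corpus files are assumed to be called corpus.parallel.eng and corpus.mono.eng) could be:
cat corpus.parallel.eng corpus.mono.eng > corpus.all.eng
/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text corpus.all.eng -lm lm.trained.eng -order 3
/local/kurs/mt/lab2/simple_decoder/translate -lm lm.trained.eng -tmw tmw.sweeng -tmf tmf.swe -o 3 -lp -in test_meningar.swe -eval test_meningar.eng
Do the same for Swedish with the corresponding files.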
Use your final TM probabilities from part 1. Run the system with different n-gram orders, starting with 1 and increasing the order until you think you can get no further improvements. Use the best n-gram order in all the following experiments. Discuss what types of problems can be solved with each increase in LM order. Are there some issues that cannot be solved even with high-order n-grams? Do you think they could be solved with a better training corpus? Were the trained 2-gram and 3-gram models better than the weights you set manually in part 1?
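One way to run this systematically (again a sketch, assuming the concatenated corpus from above; the LM file names are just suggestions) is a small shell loop that retrains the LM and re-evaluates for each order:
for o in 1 2 3 4; do
  # train an English LM of order $o
  /local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text corpus.all.eng -lm lm.order$o.eng -order $o
  # translate and evaluate Swedish->English with that LM
  /local/kurs/mt/lab2/simple_decoder/translate -lm lm.order$o.eng -tmw tmw.sweeng -tmf tmf.swe -o $o -lp -in test_meningar.swe -eval test_meningar.eng
done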
When you are finished with this task, run the evaluation on the blind test set once, using the best LM order, and compare the results to those from the previous task. Was the change in the results as you expected? Did the trained LM generalize better to unseen data than your own LM?
Compare the translations with and without the sentence boundaries, both for the known and the blind test sentences. Does it make a difference? If so, what do the sentence boundaries influence, and why?
Kim ställer ett brandgult block på det gröna fältet | Kim puts an orange block on the green field
hon ställer 2 blåa block på ett fält | she puts 2 blue blocks on a field
Send your report and files via e-mail to Sara Stymne (firstname dot lastname at lingfil dot uu dot se). Deadline for handing in the report: April 26, 2016.