Word-based SMT

Aim

The goal of this lab is to experiment with word-based statistical machine translation, on an easy text type. The aim is to gain a basic understanding of the role of the translation and language models. The lab consists of two parts, each with several subtasks.

In part 1 of the lab you will manually change the probabilities of the translation and language models. The goal is for you to get a feeling for how the probabilities affect the system. The setup is, of course, artificial. Normally you would not manipulate probabilities in this way, but estimate them from data, and you should especially never adjust probabilities based on a very small test set, as in this lab. Here we do it only so that you can get a feeling for how the translation and language models work. In part 2 of the lab you will also train a language model from data, which is how you would normally obtain the translation model as well.

Preparation

Copy the initial models and the test and training data from the course area:

mkdir lab2
cd lab2
cp /local/kurs/mt/lab2/data/* .

Task

In this lab we will use a simple word-based decoder to translate sentences from the blocks world between Swedish and English. The decoder does not allow any reordering. It uses a translation model that consists of two parts, word translation probabilities and fertility probabilities, and a language model. In this lab you will adjust the probabilities by hand, in order to get a feeling for how they affect translation, instead of training them on a corpus, as is normally done.

Models

The word translation model contains the probability of a source word translating into a target word, or into the special NULL word, i.e. that it is not translated. The format is one entry per line, containing a target word, a source word and a probability, separated by whitespace, as:


block	blocket	1	   
take	ta	1
the	den	0.4
the	det	0.1
the	NULL	0.5
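
Note that in this example the entries for "the" form a proper probability distribution over its possible counterparts: 0.4 + 0.1 + 0.5 = 1.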

The fertility model contains probabilities for how many words each word can translate into. For most words in the blocks world, the probability will be 1 that they translate into 1 word. The format is again one entry per line, whitespace-separated, containing a source word, a fertility (0-2 in the current implementation, which is enough for the blocks world) and a probability, as:


block	1	1	   
take	1	1
the	0	0.5
the	1	0.5
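
Here the entries for "the" mean that it is dropped (fertility 0) with probability 0.5 and translated into exactly one word with probability 0.5; as in the word translation model, the probabilities for each word sum to 1.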

The language model contains probabilities for n-grams in the target language. It contains at minimum 1-grams, but can also contain higher-order n-grams. In the first part you do not need to use more than 3-grams. The LM format is the ARPA format, but extra headings and backoff probabilities are ignored. This means that for each n-gram order there is a section starting with a line with the tag \n-gram, followed by lines with a probability and n words, separated by whitespace, such as:


\1-gram
0.3 block
0.3 take
0.4 the

\2-gram
1 take the
1 the block

The given fertility models are called tmf.[swe/eng], the word translation models are called tmw.[sweeng/engswe], and the language models are called lm.[swe/eng]. The given models contain all the words and word pairs that you need for the lab, initialized with uniform probabilities.

Decoder

You will use a very simple word-based decoder that does not allow any reordering. The decoder can translate either single sentences, or multiple sentences from a file, where each line is treated as a sentence. It normally outputs an n-best list of the best translation options, with their respective total probabilities. The decoder is called on the command line as:

/local/kurs/mt/lab2/simple_decoder/translate -lm LM-file -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER
To translate from Swedish to English with the given models, run:

/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2

The above shows the obligatory arguments: the three model files. In the lab you will also need to set the maximum n-gram order, first to 2, later to larger values. If called like this, the decoder prompts the user to enter a sentence to be translated. You can also give the decoder a file with sentences, or use it for evaluation, see below. In addition there are a number of other arguments; the ones you need are introduced in the tasks below.

Domain

In this assignment we will work on the blocks world domain, which contains commands and descriptions of taking and putting different types of blocks in different positions. The model files provided contain all words in the vocabulary with equal probabilities. The sentences you should work on translating are shown below. They are also in the files test_meningar.[swe/eng] that you copied from the course area. For the lab we consider the given translations the only possible ones, and ignore any potential ambiguities. You can thus consider full sentences correct or incorrect, and do not have to use metrics such as BLEU.
ta en pil
take an arrow
ställ en kon på mitt block
put a cone on my block
hon tar blocket
she takes the block
ställ en röd pil på en blå cirkel
put a red arrow on a blue circle
jag tar ett blått block
i take a blue block
ställ pilen på cirkeln
put the arrow on the circle
han ställer det röda blocket på den blåa cirkeln
he puts the red block on the blue circle
han ställer en pil på sin cirkel
he puts an arrow on his circle
hon ställer sitt block på min blåa cirkel
she puts her block on my blue circle
jag ställer konen på cirkeln på hennes blåa cirkel
i put the cone on the circle on her blue circle
Feel free to add some more sentences if you want to illustrate some issue that is not covered by these sentences!

The given LM and TM files contain the full vocabulary, and all needed fertilities and word translations. In part 2 you will work on language modeling, and a small corpus is provided for that. If you want to, you can have a look at it now, to get a better feel for the domain. These files are named corpus.*.*.

We will work on translation between Swedish and English. For non-Swedish speakers, there is a brief description of the relevant language phenomena. You should translate in both directions, but non-Swedish speakers can focus most of their discussion on translation into English.

In all assignments you should try to get one set of weights that gives the globally best results for all 10 test sentences. This might mean that a change makes the result worse for one single sentence, but better for two others. Your task is to find a good compromise of weights that works reasonably well across all 10 sentences.

Evaluation

The evaluation will be done by seeing where in the n-best list the correct translation occurs, if it occurs at all. You should also do some qualitative analysis where you discuss what goes wrong in the translation hypotheses, and why. You should discuss some specific examples of problems with the translations, and possible reasons for the decoder not choosing the correct translation.

The decoder contains a function which gives you the rank of the correct hypothesis in the n-best list, and the average rank for all sentences. If a sentence does not have a translation, the rank is approximated as 500, which means the average rank is not very trustworthy in that case. The decoder should, however, be able to translate all provided sentences. For evaluation the n-best flag of the decoder is ignored even if given, and all translation hypotheses are explored. To run this function you use "-eval referenceFile" as an argument to the decoder, for instance:


/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2  -in test_meningar.swe -eval test_meningar.eng

In addition there is a blind, or secret, set of test sentences that you cannot see. You run the decoder on these sentences by giving the argument "-evalBlind engswe" or "-evalBlind sweeng". The decoder will then translate and evaluate the secret test set. Since this test set is secret you obviously cannot analyze what goes wrong with it, but only use the automatic evaluation. Do not translate this test set often; only do it once each time you are asked to do so in the instructions. The point is not to see how good you can get on this test set, but to give an idea of the result that can be achieved on a test set that you do not optimize the decoder for. The command is, for instance:


/local/kurs/mt/lab2/simple_decoder/translate -lm lm.eng -tmw tmw.sweeng -tmf tmf.swe -o 2  -evalBlind sweeng

Part 1, change probabilities manually

1 - Run the system with uniform probabilities

In the given model files all probabilities are equal. This likely gives bad translations. Run the sample sentences through the translation system, study the translation suggestions, and use the automatic evaluation to explore the overall results and find out the average rank. Feel free to add some more sentences if you want to explore something you find interesting. How often is the correct translation at the top, and how often is it missing from the n-best list? If translations are missing from the n-best list, you can try to increase its size using the "-n" flag. Is there a difference between the output quality in the two translation directions? If so, what do you think is the cause? Discuss some of the problems you see. What is causing them?

Also run the decoder on the blind test set once and save the results. In later assignments, compare your scores to this "uniform baseline".

2 - Manipulate the translation models

In this task you should adjust the probabilities for fertilities and word translations, to achieve better translations. You should not set any probabilities to 0, and they should all be real probabilities, i.e. 0 < p ≤ 1. In some cases you could improve some translations at the cost of others. There are problems you cannot solve by only manipulating the translation model.

How good are your translations after manipulating the translation model, compared to using the uniform model? Discuss what you can and cannot solve, and how you did it. In your report, also describe what types of changes you made, and why. For instance, one change may be to give a higher probability to translating "en" and "ett" into "a" than into "an", with the motivation that "a" is much more common in English than "an".

Try to make a reasonable number of changes. You do not have to spend a very long time on getting the best results possible, but try to make some principled changes based both on the test sentences, and on your knowledge of English/Swedish.

Master students only: your probability models should contain proper probability distributions, i.e. the probabilities should sum to 1 for each word: the fertilities for each word should sum to 1, and the word translation probabilities p(s|t) should sum to 1 for each t. The given models are correct in this respect, except for rounding (which is OK), e.g. 0.33*3 = 0.99, not 1.
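
If you want a quick sanity check of your edited files, a one-liner like the following (just a suggestion, not part of the lab tools) sums the probabilities per word in a model file; the same command works for the fertility files:

awk '{sum[$1] += $3} END {for (w in sum) print w, sum[w]}' tmw.sweeng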

When you are finished with this task, run the evaluation on the blind test set once, and compare the results to the uniform baseline. Was the change in the result as you expected?

3 - Manipulate the language model

Go on to manipulate the language model as well, in order to further improve the translations. In the given files there are 1-grams and 2-grams with equal probabilities. Try to adjust just these first, and see if you can solve some problems. Again, you should not set any probabilities to 0, and they should all be real probabilities, i.e. 0 < p ≤ 1. You might also want to add n-grams that are missing, or remove ungrammatical n-grams. You might also want to add some 3-grams; if you do, remember to change the decoder flag for the order to "-o 3". Which problems require 3-grams? Are there still problems that cannot be solved with 3-grams? Describe the types of changes you make, and the reasoning behind them. If you wish, you can adjust some TM probabilities when you start changing the LM probabilities; in that case, describe which and why.

Try to make a reasonable number of changes. You do not have to spend a very long time on getting the best results possible, especially not by adding a lot of 3-grams. Try to make some principled changes based both on the test sentences, and on your knowledge of English/Swedish. In this task it is very easy to overfit to the given test set, for instance by adding all 3-grams in the test set and giving them a high probability. That is not the point of this task; instead, try to think about what makes the uniform model bad, and improve on it by making principled choices.

In addition to changing the probabilities for n-grams in the file, you may also change the backoff weight for unknown n-grams. A very simple backoff strategy is used in the decoder. If, for example, a 3-gram is missing, it backs off to the 2-gram, but with a penalty that can be set on the command line. This penalty is simply multiplied by the 2-gram probability. If the 2-gram is missing too, it backs off to the 1-gram, and multiplies by the penalty yet another time. The backoff penalty is set on the command line with the flag "-b WEIGHT", and the default value is 0.01. Write your final value of the backoff penalty in your report.
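
As a made-up numerical example: with the default penalty 0.01, if the 3-gram "on my blue" is not in the LM but the 2-gram "my blue" has probability 0.2, the 3-gram is scored as 0.01 * 0.2 = 0.002; if "my blue" is missing as well, the decoder backs off once more and uses 0.01 * 0.01 * P(blue).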

How good are your final translations? Compare them to the result from the previous steps. Can you get all correct translations in relatively high positions? Was there some sentence that was harder than the others, and why? Were there cases where you could get improvements for some sentences at the cost of making translations for other sentences worse? How much are the translations improved compared to the given uniform weights? Also discuss the difference between the two translation directions, and possible reasons for it!

When you are finished with this task, run the evaluation on the blind test set once, and compare the results to those from the previous task. Was the change in the result as you expected? Do you think your LM is overfitted to the known test sentences?

Part 2, train a language model and explore

Tools

In this part you will use a language modeling toolkit instead of manually setting the LM weights. We will use the SRILM toolkit. You can run it with the following command:

/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text CORPUS_FILE -lm LM_FILE  -order ORDER
where CORPUS_FILE is your training corpus, LM_FILE is the resulting language model that you will use as input to the decoder, and ORDER is the maximum LM order you want. -wbdiscount means that the program uses Witten-Bell smoothing, which is suitable for small corpora, but you do not have to think about smoothing in this lab.

In the resulting LM files the probabilities are given as logprobs. Our decoder uses standard probabilities, but can convert from logprobs if the flag -lp is used. It is thus important that you add this flag in this part. The commands you run in this part will thus need to use the following flags:


/local/kurs/mt/lab2/simple_decoder/translate -lm LM_FILE -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp 
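
(SRILM writes base-10 log probabilities, so a logprob of, for example, -0.3 in the LM file corresponds to a probability of 10^-0.3 ≈ 0.5; this is the conversion the -lp flag tells the decoder to perform.)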

There are two sets of corpora for training a language model: a parallel corpus, called corpus.parallel.*, and a monolingual corpus, called corpus.mono.*, which is larger. Both corpora contain the same type of blocks world sentences, but in the corpora labeled parallel, the English and Swedish sentences correspond to each other, line by line (which would have been necessary if we had trained a translation model, but is not necessary for language model training). Have a brief look at the corpus files to familiarize yourself with them. For training your language model, concatenate the two corpora into one large corpus for each language, for instance as shown below.
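
A minimal example, assuming the language suffixes are .eng and .swe as for the other files, and with corpus.all.* as arbitrary output names:

cat corpus.parallel.eng corpus.mono.eng > corpus.all.eng
cat corpus.parallel.swe corpus.mono.swe > corpus.all.swe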

1 - Train LMs

In the first part you will train LMs with SRILM and run the translation system with LMs of different orders. Train your LMs on the full concatenated corpora. You can train the LM with the highest order you expect to use, and run the translation system with the switch "-o max-ngram-order" to only use n-grams up to that order, i.e. if you train a 5-gram model, you can run it with only 3-grams if you use "-o 3", and so on.
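
For example, a 5-gram English LM could be trained like this (reusing the corpus.all.eng file name from the concatenation example above, and with lm.trained.eng as an arbitrary output name):

/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text corpus.all.eng -lm lm.trained.eng -order 5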

Use your final TM probabilities from part 1. Run the system with different order n-grams, starting with 1, and increasing the order until you think that you can get no further improvements. Use the best n-gram order in all the following experiments. Discuss what types of problems can be solved with each increase in LM order. Are there some issues that cannot be solved even with high-order n-grams? Do you think that they could be solved with a better training corpus? Were trained 2-gram and 3-gram models better than the weights you set manually in part 1?

When you are finished with this task, run the evaluation on the blind test set once, using the best LM order, and compare the results to those from the previous task. Was the change in the result as you expected? Did the trained LM generalize better to unseen data than your own LM?

2 - Explore the influence of training data size

Split the training data for the language model into two halves, and retrain the LM, for instance as shown below. Run and evaluate the system using each of the two smaller LMs. How does this influence the result? Compare the two smaller models both to each other, and to the large models. Also run the different systems on the blind test set once and discuss those results.
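
One way to split the English corpus into two halves (the file names are again just suggestions):

head -n $(( $(wc -l < corpus.all.eng) / 2 )) corpus.all.eng > corpus.half1.eng
tail -n +$(( $(wc -l < corpus.all.eng) / 2 + 1 )) corpus.all.eng > corpus.half2.eng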

3 - Sentence boundaries

In the decoder there is a switch for using markers at the start and end of sentences. These markers can help in language modeling by giving information about which words tend to occur at the start and end of sentences. When the switch "-s" is used with the decoder, the sentence boundary markers are activated, and without it they are deactivated. Sentence markers are added automatically when you create a language model with SRILM. You can see them as "<s>" and "</s>" in the trained LM file.
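
For example, with boundary markers the trained LM can contain bigrams such as "<s> take" or "circle </s>", which reward hypotheses that begin and end with words that actually tend to begin and end sentences in the training data.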

Compare the translations with and without the sentence boundaries both for the known and blind test sentences. Does it make a difference? If so, what do they influence and why?

4 - Unknown words

In this part we will see what happens with words that are unknown, that is, that are not in the TM. We will focus on two sentence pairs:
Kim ställer ett brandgult block på det gröna fältet
Kim puts an orange block on the green field
hon ställer 2 blåa block på ett fält
she puts 2 blue blocks on a field
Unknown words are transferred as-is by the decoder, that is, they are copied unchanged into the output. This is a standard procedure in MT systems. Do you think it is a good idea? Discuss what happens with the words surrounding the unknown words. Why are those particular translations chosen for them, and were they good choices?

5 - Overall impression of word-based SMT

After doing this lab, what are your overall impressions of word-based SMT? Keep in mind that the system used was limited in that it did not allow reordering, and you did not train the TM. Also, the sentence types in the block world are really simple. But still, what do you think are the advantages and disadvantages? What were the differences between the two translation directions?

Lab report

Your lab report should cover both part 1 and part 2. For part 1, the report should contain discussions of the questions in the assignments. Also hand in the final model files with new weights from tasks 2 and 3; do not send any other files. For part 2, the report should contain discussions of the questions in the assignments. Make sure that you do not only include the automatic evaluation from the system, but also do some linguistic analysis of errors and improvements. Also state in your report whether you are taking the course at candidate or master level.

Send your report and files via e-mail to Sara Stymne (firstname dot lastname at lingfil dot uu dot se). Deadline for handing in the report: April 26, 2016.