Word-based SMT, Part 2

Aim

In part 2 of the lab you will train language models on data instead of manually changing probabilities.

Preparation

Copy some block world corpora from the course area:

mkdir lab3
cd lab3
cp /local/kurs/mt/lab3/data/* .

Also copy your final translation models (tm*.*) from part 1.

Assignment

In this lab you will try to use a language modelling toolkit instead of manually setting the LM weights. We will use the SRILM toolkit. You can run it with the following command:

/local/kurs/mt/srilm/bin/i686/ngram-count -wbdiscount -text CORPUS -lm LM -order ORDER

where CORPUS is your training corpus, LM is the resulting language model that you will use as input to the decoder, and ORDER is the maximum LM order you want. The flag -wbdiscount tells the program to use Witten-Bell smoothing, which is suitable for small corpora, but you do not have to think about smoothing in this lab.
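For the curious, the idea behind Witten-Bell smoothing can be sketched in a few lines of Python. This is a simplified bigram illustration under stated assumptions, not SRILM's implementation: the function name and the toy corpus are made up for the example, and only the probability of seen bigrams and the mass reserved for unseen continuations are computed.

```python
from collections import Counter, defaultdict

def witten_bell_bigrams(sentences):
    """Return (seen_prob, reserved_mass) for whitespace-tokenized sentences."""
    bigram_counts = Counter()
    history_counts = Counter()
    followers = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for history, word in zip(tokens, tokens[1:]):
            bigram_counts[(history, word)] += 1
            history_counts[history] += 1
            followers[history].add(word)

    seen_prob = {}
    reserved_mass = {}
    for (history, word), count in bigram_counts.items():
        c = history_counts[history]   # how often the history was followed by a word
        t = len(followers[history])   # how many distinct words followed it
        seen_prob[(history, word)] = count / (c + t)
        reserved_mass[history] = t / (c + t)  # left over for unseen words
    return seen_prob, reserved_mass

# Toy corpus, invented for illustration.
corpus = ["she puts a block", "she puts a circle", "he takes a block"]
seen_prob, reserved_mass = witten_bell_bigrams(corpus)
print(seen_prob[("a", "block")], reserved_mass["a"])
```

A history that is followed by many different words reserves more probability mass for unseen words than a history that is always followed by the same word.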

In the resulting LM files the probabilities are given as logprobs. Our decoder uses standard probabilities, but can convert from logprobs if the flag -lp is used, so it is important that you add this flag in this lab. It is also reasonable to set the backoff weight with "-b 0.01", but you may experiment with this setting if you want to. If you do, write it down in your report. The command you run in this lab will thus need to use the following flags:


/local/kurs/mt/lab2/simple_decoder/translate -lm LM-file -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp -b 0.01

It might also be convenient to run all sentences through at the same time with the flag "-in test_meningar.[eng/swe]", if you didn't do that already.
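The conversion from logprobs to standard probabilities can be illustrated as follows. This is a minimal sketch, assuming base-10 logarithms as used in the ARPA format that SRILM writes. The function name and the example entries are invented, and the decoder's own conversion code is not shown here.

```python
def logprob_to_prob(logprob, base=10.0):
    """Convert a log probability to a standard probability."""
    return base ** logprob

# Hypothetical (log10 probability, n-gram) entries, not taken from a real LM file.
entries = [(-1.0, "the block"), (-0.30103, "a block"), (-2.0, "the circle")]
for logprob, ngram in entries:
    print(f"{ngram}\t{logprob_to_prob(logprob):.4f}")
```

A logprob of 0 corresponds to probability 1, and more negative values correspond to smaller probabilities.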

There are two sets of corpora: a parallel corpus, called corpus.parallel.*, and a monolingual corpus, called corpus.mono.*, which is slightly larger. Both corpora contain the same type of block world sentences, but in the corpora labelled parallel, the English and Swedish sentences correspond to each other, line by line (which would have been necessary if we had trained a translation model, but is not necessary for language model training). Have a brief look at the corpus files to familiarize yourself with them.

1 - Train LMs

In the first part you will train LMs with SRILM and run the translation system with LMs of different orders. Start by training LMs on the monolingual corpora, which are bigger than the parallel data. You can evaluate the translation on the sentences from lab2 and possibly also on a subset of the parallel corpus. You should run the translation in both directions. You can train the LM with the highest order you expect to use, and run the translation system with the switch "-o MAX-NGRAM-ORDER" to use only n-grams up to that order.
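One way to organize these runs is to train a single LM and then generate one decoder command per order. The sketch below only builds and prints the command lines and does not execute anything. The tool paths and flags are the ones given above, while the file names (corpus.mono.eng, lm.eng, tmw.sweeng, tmf.sweeng) and the maximum order 3 are assumptions that you need to replace with your own.

```python
import shlex

# Paths taken from the lab instructions.
NGRAM_COUNT = "/local/kurs/mt/srilm/bin/i686/ngram-count"
TRANSLATE = "/local/kurs/mt/lab2/simple_decoder/translate"

def train_command(corpus, lm_file, order):
    """Build the SRILM call that trains one LM of the given maximum order."""
    return [NGRAM_COUNT, "-wbdiscount", "-text", corpus,
            "-lm", lm_file, "-order", str(order)]

def translate_command(lm_file, tm_words, tm_fertility, order, test_file):
    """Build the decoder call that uses n-grams up to the given order."""
    return [TRANSLATE, "-lm", lm_file, "-tmw", tm_words, "-tmf", tm_fertility,
            "-o", str(order), "-lp", "-b", "0.01", "-in", test_file]

MAX_ORDER = 3  # assumed highest order; adjust to your experiment
# Hypothetical file names for the Swedish-to-English direction.
train = train_command("corpus.mono.eng", "lm.eng", MAX_ORDER)
runs = [translate_command("lm.eng", "tmw.sweeng", "tmf.sweeng", order,
                          "test_meningar.swe")
        for order in range(1, MAX_ORDER + 1)]

print(shlex.join(train))
for run in runs:
    print(shlex.join(run))
```

For the other translation direction you would swap in the Swedish LM, the corresponding translation model files, and the English test sentences.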

Start by keeping the TM probabilities fixed at your final values from part 1. Run the system with n-grams of different orders, starting with 1, and increasing the order until you think that you can get no further improvements. Discuss what types of problems can be solved for each increase in LM order. Are there some issues that cannot be solved even with high order n-grams? Do you think that they could be solved with a better training corpus? Were the trained 2-gram and 3-gram models better than the weights you set manually in part 1? How would you expect the manually set and the trained weights to perform if you tested on blocks world sentences other than the 10 sentences in the test corpus?

Did your TM probabilities work well with the trained LM? If not, can you improve the results further by changing the TM probabilities?

Concatenate the monolingual data with the parallel data and retrain the LM on this larger corpus, using the best order from before. Does this make a difference? Do you think that more data is always better?
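The concatenation step could look like the sketch below. It works on raw bytes so that it does not depend on the encoding of the corpus files, and it adds a newline if a file does not end with one, since otherwise the last sentence of one file would be glued to the first sentence of the next. The function name and the file names are assumptions, and the demo uses two tiny stand-in files in a temporary directory rather than the real corpora.

```python
import tempfile
from pathlib import Path

def concatenate(sources, target):
    """Write the source files to target one after another, as raw bytes."""
    with open(target, "wb") as out:
        for source in sources:
            data = Path(source).read_bytes()
            out.write(data)
            # Guard against a missing final newline between two files.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")

# Demo on two tiny stand-in files (invented content).
with tempfile.TemporaryDirectory() as tmp:
    mono = Path(tmp, "corpus.mono.eng")
    parallel = Path(tmp, "corpus.parallel.eng")
    combined = Path(tmp, "corpus.all.eng")
    mono.write_bytes(b"she puts a block on the field\n")
    parallel.write_bytes(b"he takes the red circle")  # no final newline
    concatenate([mono, parallel], combined)
    combined_lines = combined.read_bytes().decode("utf-8").splitlines()
    print(combined_lines)
```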

2 - Sentence boundaries

In the decoder there is a switch for using markers at the start and end of sentences. These markers can help in language modelling by giving information on which words tend to occur at the start and end of sentences. When the switch "-s" is used with the decoder, the sentence boundary markers are activated, and without it they are deactivated. Sentence markers are added automatically when you create a language model with SRILM. You can see them as "<s>" and "</s>" in the trained LM file.
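The effect of the markers on the n-grams that a language model sees can be sketched as follows. This is an illustration only: the function name and the example sentence are invented, and it does not reproduce the internals of SRILM or of the decoder.

```python
def ngrams_with_boundaries(sentence, order):
    """Return all n-grams of the given order after adding <s> and </s>."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return [tuple(tokens[i:i + order])
            for i in range(len(tokens) - order + 1)]

bigrams = ngrams_with_boundaries("she puts a block", 2)
for bigram in bigrams:
    print(" ".join(bigram))
```

With the markers, the bigrams "<s> she" and "block </s>" are part of the data, so the model has counts for which words start and end sentences.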

Compare the translations with and without the sentence boundaries. Use the best n-gram order from your previous experiment. Does it make a difference? If so, what do they influence and why?

3 - Unknown words

In this part we will see what happens with words that are unknown, that is, words that are not in the TM. We will focus on two sentence pairs:

Kim ställer ett brandgult block på det gröna fältet
Kim puts an orange block on the green field

hon ställer 2 blåa block på ett fält
she puts 2 blue blocks on a field

Discuss what happens when you translate these sentences. What happens with the unknown words? Which forms of the words surrounding the unknown words are chosen? Can you think of a reason for this?

4 - Overall impression of word-based SMT

After doing this lab, what are your overall impressions of word-based SMT? Keep in mind that the system used was limited in that it did not allow reordering, and you did not train the TM. Also, the sentence types in the block world are really simple. But still, what do you think are the advantages and disadvantages? What were the differences between the two translation directions?

Lab report

Your lab report should cover both part 1 and part 2. For part 1 the report should contain discussions of the questions in the assignments. Also hand in the final model files with new weights from tasks 2 and 3, and the code for your evaluation program, if you chose to write one. For part 2 the report should contain discussions of the questions in the assignments. Also state in your report whether you are taking the course at candidate (bachelor's) or master's level.

For a G grade you should have performed all tasks in the lab, except writing an evaluation script, and discussed them in a good way. For a VG grade, in addition, you need to have written and used an evaluation program according to the instructions in part 1. Your discussion also needs to be of high quality.

Send your report and files via e-mail to Sara Stymne (firstname dot lastname at lingfil dot uu dot se). Deadline for handing in the report: April 30, 2014.