Lab 3 - Language Modelling and SMT

Preparation

Copy some block world corpora from the course area:

mkdir lab3
cd lab3
cp /local/kurs/mt/lab3/data/* .
Also copy your final translation models (tm*.*) from lab 2.

Assignment

In this lab you will use a language modelling toolkit instead of manually setting the LM weights. We will use the SRILM toolkit. You can run it with the following command:

/local/kurs/mt/srilm/bin/i686/ngram-count -wbdiscount -text CORPUS -lm LM -order ORDER
where CORPUS is your training corpus, LM is the resulting language model that you will use as input to the decoder, and ORDER is the maximum LM order you want. The flag -wbdiscount makes the program use Witten-Bell smoothing, which is suitable for small corpora; you do not have to think further about smoothing in this lab.
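
For instance, to train a trigram LM on the English monolingual corpus (the output name mono.eng.lm is just a suggestion):

/local/kurs/mt/srilm/bin/i686/ngram-count -wbdiscount -text corpus.mono.eng -lm mono.eng.lm -order 3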

In the resulting LM files the probabilities are given as logprobs. Our decoder uses standard probabilities, but can convert from logprobs if the flag -lp is used, so it is important that you add this flag in this lab. It is also reasonable to change the backoff weight with "-b 0.01":


/local/kurs/mt/lab2/simple_decoder_fixed/translate -lm LM-file -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp -b 0.01

Note that you should run an updated version of the decoder, in the folder simple_decoder_fixed, which fixes a small bug in the original decoder.

It might also be convenient to translate all test sentences at once with the flag "-in test_meningar.[eng/swe]", if you did not do that already. The test_meningar.[eng/swe] files contain the test sentences from lab 2.
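
A complete invocation for translating the Swedish test sentences into English might thus look as follows, where mono.eng.lm is the LM trained above and tm.word and tm.fert stand in for your own lab 2 TM file names:

/local/kurs/mt/lab2/simple_decoder_fixed/translate -lm mono.eng.lm -tmw tm.word -tmf tm.fert -o 3 -lp -b 0.01 -in test_meningar.swe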

There are two sets of corpora: a parallel corpus, called corpus.parallel.*, and a monolingual corpus, called corpus.mono.*, which is slightly larger.

1 - Train LMs

In the first part you will train LMs with SRILM and run the translation system with LMs of different orders. Start by training LMs on the monolingual corpora, which are bigger than the parallel data. You can evaluate the translation on the sentences from lab 2, and possibly also on a subset of the parallel corpus. You should run the translation in both directions. You can train the LM with the highest order you expect to use, and run the translation system with the switch "-o MAX-NGRAM-ORDER" to only use n-grams up to that order. Start by keeping the TM probabilities constant at your final values from lab 2. Run the system with different order n-grams, starting with 1 and increasing until you get no further improvements. Discuss what types of problems can be solved with each increase in LM order. Are there some issues that cannot be solved even with high-order n-grams? Do you think they could be solved with a better training corpus? Were trained 2-gram and 3-gram models better than the weights you set manually in lab 2?
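
As a sketch for the Swedish-to-English direction, assuming 5 is the highest order you will try, that tm.word and tm.fert stand in for your lab 2 TM files, and that the decoder writes its translations to standard output:

# train one LM with the highest order, then decode with increasing -o values
/local/kurs/mt/srilm/bin/i686/ngram-count -wbdiscount -text corpus.mono.eng -lm mono.eng.5.lm -order 5
for o in 1 2 3 4 5; do
  /local/kurs/mt/lab2/simple_decoder_fixed/translate -lm mono.eng.5.lm -tmw tm.word -tmf tm.fert -o $o -lp -b 0.01 -in test_meningar.swe > out.order$o.eng
done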

Did your TM probabilities work well with the trained LM? If not, can you improve things further by changing the TM probabilities?

Concatenate the monolingual data with the parallel data and retrain the LM with the best order from before. Evaluate on the lab 2 sentences. Does this make a difference? Is more data always better?
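
For example, for the English side (corpus.all.eng and all.eng.lm are suggested names, and order 3 is just an example; use your best order from before):

cat corpus.mono.eng corpus.parallel.eng > corpus.all.eng
/local/kurs/mt/srilm/bin/i686/ngram-count -wbdiscount -text corpus.all.eng -lm all.eng.lm -order 3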

2 - Sentence boundaries

In the decoder there is a switch for using markers at the start and end of sentences. These markers can help in language modelling by giving information on which words tend to occur at the start and end of sentences. When the switch "-s" is used with the decoder, the sentence boundary markers are activated; without it they are deactivated. SRILM adds sentence boundary markers to the training data automatically; you can see them as "<s>" and "</s>" in the trained LM file.
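
To compare, you can run the same command twice, once with and once without -s, for example (file names as in the examples above):

/local/kurs/mt/lab2/simple_decoder_fixed/translate -lm mono.eng.lm -tmw tm.word -tmf tm.fert -o 3 -lp -b 0.01 -s -in test_meningar.swe
/local/kurs/mt/lab2/simple_decoder_fixed/translate -lm mono.eng.lm -tmw tm.word -tmf tm.fert -o 3 -lp -b 0.01 -in test_meningar.swe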

Compare the translations with and without the sentence boundary markers. Do they make a difference? If so, what do they influence?

3 - Unknown words

In this part we will see what happens with unknown words, that is, words that are not in the TM. We will focus on two sentences:
Kim ställer ett brandgult block på det gröna fältet
Kim puts an orange block on the green field

hon ställer 2 blåa block på ett fält
she puts 2 blue blocks on a field
Discuss what happens when you translate these sentences. Which forms of the words surrounding the unknown words are chosen?
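
One way to test this is to put a sentence in a small file and translate it with -in, for example (unknown.swe is just a suggested file name, and the other file names are as in the examples above):

echo "Kim ställer ett brandgult block på det gröna fältet" > unknown.swe
/local/kurs/mt/lab2/simple_decoder_fixed/translate -lm mono.eng.lm -tmw tm.word -tmf tm.fert -o 3 -lp -b 0.01 -in unknown.swe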

4 - Overall impression of word-based SMT

After doing this lab, what are your overall impressions of word-based SMT? Keep in mind that the system used was limited in that it did not allow reordering, and you did not train the TM. But still, what do you think are the advantages and disadvantages? Was there any difference between the two translation directions?

Lab report

Your lab report should cover both lab 2 and lab 3. For lab 2 the report should contain discussions of the questions in the assignments. Also hand in the final model files with new weights from parts 1 and 2. For lab 3 the report should contain discussions of the questions in the assignments.

Send your report via e-mail to Sara Stymne (firstname dot lastname at lingfil dot uu dot se). Deadline for handing in the report: 25 April, 2013.