Word-based SMT -- part 2
This document describes the tasks you need to perform for part 2 of lab 2. The task here is to train a language model on data and explore some aspects of the word-based SMT decoder. For this part of the lab you should write a lab report.
In this part of the lab you should work on both translation directions, but you may choose to focus parts of your analysis on one language direction; see the specific instructions below. If you do not know Swedish, a grammar sketch of Swedish is available. For the TM you can either use a TM with uniform probabilities, as in the files given, or you can use your modified TM from part 1.
For training LMs, we will use the SRILM toolkit. You can run it with the following command:
/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text CORPUS_FILE -lm LM_FILE -order ORDER
where CORPUS_FILE is your training corpus, LM_FILE is the resulting language model that you will use as input to the decoder, and ORDER is the maximum LM order you want. The -wbdiscount flag makes the program use Witten-Bell smoothing. It is suitable for small corpora, but you do not have to think about smoothing in this lab. (Note that Witten-Bell smoothing is not a suitable method for your course projects, where you will use larger training data.)
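For example, to train a trigram model on English you could run something like the following (the corpus and LM file names here are just placeholders for whatever you call your concatenated training file, see below):

/local/kurs/mt/srilm/bin/i686-m64/ngram-count -wbdiscount -text corpus.all.en -lm en.3gram.lm -order 3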
In the resulting LM files the probabilities are given as logprobs. Our decoder uses standard probabilities, but can convert from logprobs if the flag -lp is used. It is therefore important that you add this flag in this lab. The command you run in this lab will thus need to use the following flags:
/local/kurs/mt/lab2/simple_decoder/translate -lm LM_FILE -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp
If you fail to use this flag, your results will be strange, and higher LM orders will most likely make the results worse. If that happens, add this flag!
There are two sets of corpora for training a language model: a parallel corpus, called corpus.parallel.*, and a larger monolingual corpus, called corpus.mono.*. Both corpora contain the same type of block-world sentences, but in the corpus labeled parallel, the English and Swedish sentences correspond to each other line by line (which would have been necessary if we had trained a translation model, but is not necessary for language model training). Have a brief look at the corpus files to familiarize yourself with them. For training your language model, concatenate the two corpora into one large corpus for each language.
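Assuming, for example, that the corpus files end in .en and .sv (check the actual file names in the lab directory), the concatenation can be done with cat:

cat corpus.parallel.en corpus.mono.en > corpus.all.en
cat corpus.parallel.sv corpus.mono.sv > corpus.all.sv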
1 - Train LMs
In the first part you will train LMs with SRILM and run the translation system with different order LMs. Train your LMs on the full concatenated corpora.
You can train the LM with the highest order you expect to use, and run the translation system with the switch "-o MAX-NGRAM-ORDER" to only use n-grams up to that order. For example, if you train a 5-gram model, you can run it with only 3-grams by using "-o 3", and so on.
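For example, if you have trained a 5-gram English LM called en.5gram.lm (a placeholder name), the following two calls would use it as a trigram model and as a full 5-gram model, respectively:

/local/kurs/mt/lab2/simple_decoder/translate -lm en.5gram.lm -tmw WORD-translation-model -tmf FERTILITY-model -o 3 -lp
/local/kurs/mt/lab2/simple_decoder/translate -lm en.5gram.lm -tmw WORD-translation-model -tmf FERTILITY-model -o 5 -lp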
Run the system with different order n-grams, starting with 1, and increasing the order until you think that you can get no further improvements. Run your experiments both for the known test set and the blind test set.
In your report:
- Show the average rank for different LM orders for the two translation directions and the two test sets in a table
- For each language pair, decide which LM order you think is best to use
- Analyze the results in at least one language direction, and discuss what types of issues are solved by each increase in n-gram order.
- For (some of) the remaining issues in translation, discuss if you think they are an effect of the translation approach (word-based SMT) or due to a non-representative training corpus.
- State which TMs you used for each translation direction (your own, or the given uniform models)
For all remaining experiments: use the best n-gram order you have chosen for each language pair.
2 - Explore the influence of training data size
In general, more training data tends to give better results for SMT. Investigate this by using only part of the training data for training the LM. Create three new, smaller files that contain 50%, 25%, and 12.5% of the number of sentences in the full training data, respectively. Train LMs on these smaller files, and see the effect on the translation results. (Tip: use the Unix command wc to count the number of sentences, and head and/or tail to pick a certain number of sentences for the smaller files; see the sketch below.)
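A possible way to create the smaller files is sketched below, with assumed file names; adjust the line counts to the actual size of your corpus:

wc -l corpus.all.en                            # count the number of sentences (lines)
head -n 500 corpus.all.en > corpus.50pct.en    # 50%, if the full file has 1000 sentences
head -n 250 corpus.all.en > corpus.25pct.en    # 25%
head -n 125 corpus.all.en > corpus.12.5pct.en  # 12.5%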
In your report:
- Show the average rank for different LM training data sizes for the two translation directions and the two test sets in a table.
- Discuss the effect of LM training data size on the translation results.
- For at least one language direction, analyze the errors made by the different systems. Are there specific issues getting worse with less data, or does the system deteriorate in general?
3 - Sentence boundaries
In the decoder there is a switch for using markers at the start and end of sentences. These markers can help in language modeling by giving information about which words tend to occur at the start and end of sentences. When the switch "-s" is used with the decoder, the sentence boundary markers are activated, and without it they are deactivated. Sentence markers are added automatically when you create a language model with SRILM. You can see them as "<s>" and "</s>" in the trained LM file.
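For example, with the placeholder names from the command above, you would compare two runs that are identical except for the -s switch:

/local/kurs/mt/lab2/simple_decoder/translate -lm LM_FILE -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp -s
/local/kurs/mt/lab2/simple_decoder/translate -lm LM_FILE -tmw WORD-translation-model -tmf FERTILITY-model -o MAX-NGRAM-ORDER -lp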
In your report:
- Show the average rank with and without sentence boundaries for the two translation directions and the two test sets in a table.
- For at least one language direction, analyze the errors made by the different systems, focusing specifically on the beginnings of the sentences, and discuss the effect of using sentence boundary markers.
4 - Unknown words
In this part we will see what happens with words that are unknown, that is, that are not in the TM. We will focus on two sentence pairs:
Kim ställer ett brandgult block på det gröna fältet | Kim puts an orange block on the green field
hon ställer 2 blåa block på ett fält | she puts 2 blue blocks on a field
The unknown words are transferred as is by the decoder. This is a standard procedure in MT systems. Thus, focus on the translations chosen for the words surrounding the unknown words.
Run the decoder on these two sentences and study the results. For this task it is not meaningful to look at the average rank; instead, focus on the actual translations, especially the highest-ranking option.
In your report:
- Show the highest ranking translation of these two sentences, for each translation direction.
- Analyze and discuss the translations chosen for the words surrounding the unknown words. Focus particularly on what happens in the language model for these words.
5 - Overall impression of word-based SMT
After doing this lab, you should have gained some impression of word-based SMT. Keep in mind that the system you used was limited, in that it did not allow reordering and you did not train the TM. Also, the sentence types in the block world are quite simple.
In your report:
- Discuss the strengths and weaknesses of a word-based SMT system.
- Discuss the differences in performance between the two translation directions.
Lab report
Your lab report should cover part 2. For each of the 5 subtasks, the heading "In your report" lists what should be included in the report.
Hand in your report as a pdf via the student portal. Deadline for handing in the report: April 28, 2017.