Projects in Machine Translation - Master Program
New: the reports, both the group report and the individual reports, are to be handed in to Sara and to your supervisor by email.
Aim
The goals of the master projects are
- to study background literature and prepare a presentation for the final seminars in the MT course;
- to carry out a practical assignment related to the topic selected for the seminar and to prepare a final report describing the results.
Deadlines and schedule
- April 15: Hand in your topic preferences
- May 23 and 25: Seminar presentations (detailed schedule below)
- June 3: Hand in final group report
- June 3: Hand in individual reflection reports
The schedule for the seminar presentations is:
- Monday, May 23:
  - 10.15-10.45: Domains and evaluation; Filip, Rebeca, Erik
  - 10.45-11.15: LMs and Domains; Manon, Chiao-Ting, Ammar
  - 11.15-11.20: Break
  - 11.20-11.50: Word alignment and PBSMT; Hoa, Laura, Vanessa
  - 11.50-12.00: Time for discussion
- Wednesday, May 25:
  - 10.15-10.45: Re-ordering and SMT; Caroline, Magdalena
  - 10.45-11.15: Tree-based SMT; Areti, Tobias, Allison
  - 11.15-11.20: Break
  - 11.20-11.50: Compounds in SMT; Linjing, Nasrin, Carina
  - 11.50-12.00: Time for discussion
Organisation
You will work in groups of 3-4 students. The groups will be created by Sara, based on your topic preferences. The list of topic suggestions can be found at the bottom of this page. Hand in your preferences by email to Sara by April 15 at the latest. List at least three different topics that you would consider working on, ranked from 1 and up, for instance:
- Parameter tuning
- Factored translation
- Reordering
- Compounds in SMT
The groups are:
Topic | Students | Supervisor |
---|---|---|
Language Modeling and Domains | Manon, Chiao-Ting, Ammar | Christian |
Word alignment and Phrase-Based SMT | Hoa, Laura, Vanessa | Sara |
Re-ordering and SMT | Caroline, Petros, Magdalena | Sara |
Tree-based SMT | Areti, Tobias, Allison | Fabienne |
Domains and evaluation | Filip, Rebeca, Erik | Christian |
Compounds in SMT | Linjing, Nasrin, Carina | Fabienne |
Seminar Presentations
The goal of the seminars is to give all students an overview of the topics selected by the master students for their projects. Please try to give a comprehensible introduction to the topic you have selected. Motivate the ideas and concepts and try to be as pedagogical as possible. Allow discussions and questions. The overall time for your presentation is 30 minutes, including all discussions and questions. This means that you should prepare a presentation of about 20 minutes, and at most 25 minutes.
It is up to the students in each group to decide how to organise the presentation. It is not necessary that all students give the presentation. However, all students in the group should know the contents of the presentation and be prepared to answer questions. It is compulsory for all students in the group to attend the seminar with their presentation, and it is highly recommended to attend both seminars. The presentation should be given in English.
The seminars will be held on May 23 and 25. The detailed schedule is given under Deadlines and schedule above.
Project work
For each topic you should carry out a practical project in which you apply some of the concepts related to your topic. This includes setting up and running MT systems, normally with Moses, and evaluating and comparing systems that differ in some aspect related to your topic. Each project will be assigned a supervisor, with whom you should discuss how to set up your project. The project work should be presented in a group report written in English.
It is possible to divide the work in the group so that different people perform different experiments. It is important that each person in the group sets up and runs at least one MT system. You are jointly responsible for writing the report, however, and each person should understand and be familiar with all parts of it. It is obligatory to have at least one meeting with your supervisor, where you discuss how you divide the work and show that everyone can set up and run MT systems.
In the final report, you are expected to
- describe the background in terms of the concepts, approaches and techniques within your selected topic, including references to journal and/or conference articles
- describe your project and motivate your experimental setup
- summarize, evaluate and analyze your results
- describe possible shortcomings and ideas for improvement
Individual reflection report
In addition to the group report, each student should also hand in a short individual reflection report. The report should be about 1-1.5 A4 pages, and can be written in English or Swedish. The report should consist of two parts:
- A description of your role in the project group and what you personally did in the project, including which MT systems you trained.
- A brief discussion of how your project work relates to a recent conference article of your choice related to your topic. Do not pick the same article as the other group members.
Resources
General resources for the projects are listed and linked here. The basic resource is translated subtitles as collected in the OPUS corpus collection. For some of the projects you might need more diverse data; discuss this with your supervisor if that is the case. A selection of small and medium-sized data sets is available on our server (stp) in
/local/kurs/mt/projects/data
Each parallel data set includes
- training data (xx-yy.train.xx, xx-yy.train.yy)
- development data (xx-yy.dev.xx, xx-yy.dev.yy)
- test data (xx-yy.test.xx, xx-yy.test.yy)
Currently, there are data sets available for
- English - Swedish (en-sv)
- English - French (en-fr)
- English - Spanish (en-es)
- German - English (de-en)
- German - Swedish (de-sv)
- French - Swedish (fr-sv)
All parallel data sets are sentence aligned (corresponding lines are aligned with each other), tokenized and "true-cased" (look at the Moses homepage to understand what that means). True-casing is not perfect in the case of movie subtitles, as there are often dashes or other marker characters at the beginning of a sentence. You may recase the data using the Moses tools if you like.
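Because corresponding lines are aligned, a quick sanity check before training is to confirm that both sides of a data set have the same number of lines. The following is a minimal Python sketch, not one of the course tools. The function names `count_lines` and `check_parallel` are hypothetical, and the sketch assumes that the files follow the naming pattern described above (for example en-sv.train.en and en-sv.train.sv).

```python
from pathlib import Path


def count_lines(path):
    """Count the lines in a file without loading it all into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_parallel(data_dir, pair, split):
    """Return (source_lines, target_lines, same_length) for one split.

    pair is a language pair such as "en-sv" and split is "train", "dev"
    or "test". File names are assumed to be <pair>.<split>.<language>.
    """
    src_lang, trg_lang = pair.split("-")
    base = Path(data_dir)
    n_src = count_lines(base / f"{pair}.{split}.{src_lang}")
    n_trg = count_lines(base / f"{pair}.{split}.{trg_lang}")
    return n_src, n_trg, n_src == n_trg
```

For example, `check_parallel("/local/kurs/mt/projects/data", "en-sv", "train")` would compare the two training files for English - Swedish. If the last element of the result is False, the files are no longer line-aligned and should not be used for training as they are.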
There are also monolingual data sets for all languages above. They have the basename mono and an extension corresponding to the language ID (de, en, es, fr, sv).
Other tools and resources that you might need in some projects:
- hunpos - POS tagger; pre-trained POS tagging models
- TreeTagger - another POS tagger with many pre-trained models; also includes lemmatization!
- MaltParser - a data-driven dependency parser generator; pre-trained POS tagging models for various languages (NOTE - you will need version 1.4.1 to use those models!)
- Simple tools to convert POS-tagged data to CoNLL format (for parsing) and MaltParser output to XML trees (for tree-based SMT training) are available at /local/kurs/mt/projects/tools/:
  - tagged2conll.pl - convert TAB separated POS-tagging output to CoNLL format for parsing
  - malt2tree.pl - convert TAB separated parser output (CoNLL format) to XML format to be used with Moses syntax-based models. NOTE - This only works for projective trees!
- anymalign - an alternative word aligner
- The Berkeley Word Aligner
- More links to tools
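To illustrate the kind of conversion that tagged2conll.pl performs, here is a minimal Python sketch. It is not the course script, and the script's exact behaviour may differ. The sketch assumes one token and its tag per line separated by a TAB, a blank line between sentences, and the ten-column CoNLL-X layout in which unknown fields are written as underscores. The function name `tagged_to_conll` is hypothetical.

```python
def tagged_to_conll(lines):
    """Convert 'token<TAB>tag' lines to a list of CoNLL-X rows.

    Sentences are assumed to be separated by blank lines. Columns that a
    POS tagger cannot fill (lemma, features, head, relation) are set to '_'.
    """
    output = []
    token_id = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence and restart numbering.
            if token_id > 0:
                output.append("")
            token_id = 0
            continue
        form, tag = line.split("\t")[:2]
        token_id += 1
        # ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL
        row = [str(token_id), form, "_", tag, tag, "_", "_", "_", "_", "_"]
        output.append("\t".join(row))
    return output
```

The sketch copies the tag into both the coarse and the fine POS column, which is a simplification; check which columns your parser model actually expects before relying on it.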
Projects
Here is a list of project ideas and a short description of their main goals. Note that the exact descriptions are suggestions that should be discussed with your supervisor. For some projects the subjects will also be discussed briefly during other lectures. If this is the case, discuss the contents of your seminar with the respective teacher.
Note that there will be fewer groups than there are topic suggestions, depending on your wishes and the course coverage.
- Parameter Tuning
  - Seminar: Explain the basic concepts of parameter tuning in SMT and introduce algorithms implemented in Moses
  - Project: compare tuning algorithms and parameters
    - MERT, PRO, Batch MIRA
    - investigate tuning stability
  - tools: tuning scripts and software in Moses
  - data: translated movie subtitles
- Factored SMT models
  - Seminar: Explain the basic concepts of factored SMT
  - Project: train and compare various factored SMT models
    - include factors such as POS tags, lemmas, syntactic function
    - compare various combinations of translation and generation steps
  - tools: Moses, tagger and lemmatizer (hunpos, TreeTagger) and parsers with existing models
  - data: translated movie subtitles
- Language Modeling and Domains
  - Seminar: Explain the basic concepts of n-gram LMs
  - Project: Explore language models and their parameters
    - investigate the effect of data size on translation quality
    - compare the use of in-domain versus out-of-domain data (perplexity and translation quality)
    - combinations of in-domain and out-of-domain LMs
  - tools: KenLM, SRILM, Moses
  - data: translated movie subtitles and data from other domains
- Word alignment and Phrase-Based SMT
  - Seminar: Explain word alignment algorithms and phrase extraction strategies
  - Project: Explore the impact of word alignment on SMT quality
    - different settings for GIZA++
    - different symmetrization heuristics
    - difference between alignment of wordforms, lemmas, (POS tags?)
    - other alignment tools: anymalign, (Berkeley aligner?)
  - tools: GIZA++, Moses, anymalign, TreeTagger, ...
  - data: translated movie subtitles
- Re-ordering and SMT
  - Seminar: Explain different re-ordering strategies
  - Project: Apply and compare different re-ordering approaches
    - lexicalized re-ordering models
    - re-ordering constraints (see Moses: hybrid translation)
    - pre-ordering (before training/decoding)
  - tools: Moses, external or own tools
  - data: translated movie subtitles
- Tree-based SMT
  - Seminar: Explain the basic concepts of tree-based SMT
  - Project: train and compare various tree-based SMT models
    - hierarchical phrase-based SMT (no linguistic syntax)
    - linguistic syntax in source and/or target language
  - tools: POS tagger (e.g. hunpos) and parsers with existing models
  - data: translated movie subtitles
- Domains and evaluation
  - Seminar: Explain the impact of domain on MT and domain adaptation strategies
  - Project: Explore the influence of different domains in training and test data, and evaluate using several different methods
    - Vary the domain in training, dev and test data
    - Train on mixed data or data from a single domain
    - Possibly: explore methods for domain adaptation
    - Evaluate using different automatic metrics
    - Evaluate using some manual or semi-automatic method
  - tools: Moses, evaluation metrics
  - data: translated movie subtitles and data from other domains
- Compounds in SMT
  - Seminar: Explain how compound words can be treated in MT
  - Project: Explore how to handle compound words for MT from and possibly to a compounding language
    - Compound splitting
    - Train MT systems with split compounds
    - Explore merging strategies for translating into compounding languages?
  - tools: Moses, external or own tools
  - data: translated movie subtitles
- Lattices and confusion networks
  - Seminar: Explain how lattices and confusion networks are used in MT and give some examples of when they have been used
  - Project: Identify areas where lattices and/or confusion networks can be useful and apply them to an MT system
    - Identify what can be represented by lattices and/or confusion networks
    - Run MT systems with lattices and/or confusion networks
  - tools: Moses, external or own tools
  - data: translated movie subtitles
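
As a small illustration for the word alignment topic in the list above, the two simplest symmetrization heuristics, intersection and union, can be sketched in Python as follows. This is only a sketch under stated assumptions: alignments are given in the "i-j" text format (source index, hyphen, target index), and both directional alignments are already expressed as source-target pairs. The function names are hypothetical, and more elaborate heuristics such as grow-diag-final, which start from the intersection and add links from the union, are not shown.

```python
def parse_alignment(line):
    """Parse a line such as '0-0 1-2 2-1' into a set of (src, trg) pairs."""
    pairs = set()
    for item in line.split():
        src, trg = item.split("-")
        pairs.add((int(src), int(trg)))
    return pairs


def symmetrize(src2trg, trg2src):
    """Return the (intersection, union) of two directional alignments.

    The intersection keeps only links found in both directions, which
    typically favours precision; the union keeps links found in either
    direction, which typically favours recall.
    """
    return src2trg & trg2src, src2trg | trg2src


def format_alignment(pairs):
    """Write a set of pairs back in sorted 'i-j' format."""
    return " ".join(f"{s}-{t}" for s, t in sorted(pairs))
```

Comparing the phrase tables and translation quality obtained from the two extremes is one simple way to see how much the choice of heuristic matters before trying the ones built into Moses.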