Projects in Machine Translation -
Master Program

New: the reports, both the group report and the individual reports, are to be handed in to Sara and to your supervisor by email.

Aim

The goals of the master projects are
  1. to study background literature and prepare a presentation for the final seminars in the MT course;
  2. to carry out a practical assignment related to the topic selected for the seminar and to prepare a final report describing the results.

Deadlines and schedule

  • April 15: Hand in your topic preferences
  • May 23 and 25: Seminar presentations (see the detailed schedule below)
  • June 3: Hand in final group report
  • June 3: Hand in individual reflection reports

The schedule for the seminar presentations is:

  • Monday, May 23:
    • 10.15-10.45: Domains and evaluation; Filip, Rebeca, Erik
    • 10.45-11.15: LMs and Domains; Manon, Chiao-Ting, Ammar
    • 11.15-11.20: Break
    • 11.20-11.50: Word alignment and PBSMT; Hoa, Laura, Vanessa
    • 11.50-12.00: Time for discussion
  • Wednesday, May 25:
    • 10.15-10.45: Re-ordering and SMT; Caroline, Magdalena
    • 10.45-11.15: Tree-based SMT; Areti, Tobias, Allison
    • 11.15-11.20: Break
    • 11.20-11.50: Compounds in SMT; Linjing, Nasrin, Carina
    • 11.50-12.00: Time for discussion

Organisation

You will work in groups of 3-4 students. The groups will be created by Sara, based on your wishes for which topics you prefer to work on. The list of topic suggestions can be found at the bottom of this page. You can hand in your preferences by email to Sara, by April 15 at the latest. Give at least three different topics you could consider working on, ranked from 1 and up, for instance:
  1. Parameter tuning
  2. Factored translation
  3. Reordering
  4. Compounds in SMT
I will try my best to accommodate everyone's wishes, but I cannot guarantee that you will get your preferred topics. If you fail to hand in a wish by April 15, I will assign you to a topic arbitrarily.

The groups are:

Topic                               | Students                    | Supervisor
Language Modeling and Domains       | Manon, Chiao-Ting, Ammar    | Christian
Word alignment and Phrase-Based SMT | Hoa, Laura, Vanessa         | Sara
Re-ordering and SMT                 | Caroline, Petros, Magdalena | Sara
Tree-based SMT                      | Areti, Tobias, Allison      | Fabienne
Domains and evaluation              | Filip, Rebeca, Erik         | Christian
Compounds in SMT                    | Linjing, Nasrin, Carina     | Fabienne

Seminar Presentations

The goal of the seminars is to give all students an overview of the topics selected by the master students for their projects. Please try to give a comprehensible introduction to the topic you have selected. Motivate the ideas and concepts and try to be as pedagogical as possible. Allow for discussion and questions. The overall time for your presentation is 30 minutes, including all discussion and questions. This means that you should prepare a presentation of about 20 minutes, and at most 25 minutes.

It is up to the students in each group to decide how to organise the presentation. It is not necessary that all students give the presentation. All students in the group should know the contents of the presentation, however, and be prepared to answer questions. It is compulsory for all students in the group to attend the seminar with their presentation, and it is highly recommended to attend both seminars. The presentation should be given in English.

The seminars will be held on May 23 and 25; the detailed schedule is given under Deadlines and schedule above.

Project work

For each topic you should carry out a practical project, where you apply some of the concepts related to your topic in practice. This includes setting up and running MT systems, normally with Moses, and evaluating and comparing systems that differ in some aspect related to your topic. Each project will be assigned a supervisor, with whom you should discuss how to set up your project. The project work should be presented in a group report, written in English.

It is possible to divide the work in the group so that different people perform different experiments, but it is important that each person in the group sets up and runs at least one MT system. You are jointly responsible for writing the report, however, and each person should understand and be familiar with all parts of it. It is obligatory to have at least one meeting with your supervisor where you discuss how you divide the work and show that everyone can set up and run MT systems.

In the final report, you are expected to

  • describe the background in terms of the concepts, approaches and techniques within your selected topic, including references to journal and/or conference articles
  • describe your project and motivate your experimental setup
  • summarize, evaluate and analyze your results
  • describe possible shortcomings and ideas for improvement

The deadline for handing in the reports is June 3! Hand in the report to Sara and your supervisor by email.

Individual reflection report

In addition to the group report, each student should also hand in a short individual reflection report. The report should be about 1-1.5 A4 pages and can be written in English or Swedish. It should consist of two parts:

  • A description of your role in the project group and what you personally did in the project, including which MT systems you trained.
  • Pick a recent conference article related to your topic, and briefly discuss how your project work relates to that work. Do not pick the same article as the other group members.
Hand in the report to Sara and your supervisor by email.

Resources

General resources for the projects will be listed and linked here. The basic resource is translated subtitles as collected in the OPUS corpus collection. For some of the projects you might need more diverse data; discuss this with your supervisor if that is the case. A selection of small and medium-sized data sets is available on our server (stp) in

/local/kurs/mt/projects/data

Each parallel data set includes

  • training data (xx-yy.train.xx, xx-yy.train.yy)
  • development data (xx-yy.dev.xx, xx-yy.dev.yy)
  • test data (xx-yy.test.xx, xx-yy.test.yy)
Here, xx is the language ID of the source language and yy the language ID of the target language (you may, of course, also use the data in the other translation direction).
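
For illustration, here is a minimal Python sketch that reads one of the parallel training sets and checks that the two sides are line-aligned. The en-sv pair is only an example, and the exact file layout under the data directory is an assumption; adjust the paths to what you find on the server.

  # Minimal sketch: read a parallel training set and check sentence alignment.
  # The language pair and directory layout below are assumptions; adjust them.
  from pathlib import Path

  data_dir = Path("/local/kurs/mt/projects/data")
  src_lang, tgt_lang = "en", "sv"
  prefix = f"{src_lang}-{tgt_lang}.train"

  src_lines = (data_dir / f"{prefix}.{src_lang}").read_text(encoding="utf-8").splitlines()
  tgt_lines = (data_dir / f"{prefix}.{tgt_lang}").read_text(encoding="utf-8").splitlines()

  # Sentence-aligned data must have the same number of lines on both sides.
  assert len(src_lines) == len(tgt_lines), "source and target are not line-aligned"

  # Print the first few sentence pairs to get a feel for the data.
  for src, tgt in zip(src_lines[:3], tgt_lines[:3]):
      print(f"{src_lang}: {src}")
      print(f"{tgt_lang}: {tgt}")
      print()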

Currently, there are data sets available for

  • English - Swedish (en-sv)
  • English - French (en-fr)
  • English - Spanish (en-es)
  • German - English (de-en)
  • German - Swedish (de-sv)
  • French - Swedish (fr-sv)
I can produce data sets for other language pairs if you like. Just ask me.

All parallel data sets are sentence aligned (corresponding lines are aligned with each other), tokenized and "true-cased" (look at the Moses homepage to understand what that means). True-casing is not perfect in the case of movie subtitles, as there are often dashes or other marker characters at the beginning of a sentence. You may recase the data using the Moses tools if you like.

There are also monolingual data sets for all languages above. They have the basename mono and an extension corresponding to the language ID (de, en, es, fr, sv).

Other tools and resources that you might need in some projects:

  • hunpos - POS tagger; pre-trained POS tagging models
  • TreeTagger - another POS tagger with many pre-trained models; it also includes lemmatization!
  • MaltParser - a data-driven dependency parser generator; pre-trained parsing models for various languages are available (NOTE - you will need version 1.4.1 to use those models!)
  • Simple tools to convert POS-tagged data to CoNLL format (for parsing) and MaltParser output to XML trees (for tree-based SMT training) are available at /local/kurs/mt/projects/tools/ (a sketch of the first conversion follows after this list):
    • tagged2conll.pl - convert TAB separated POS-tagging output to CoNLL format for parsing
    • malt2tree.pl - convert TAB separated parser output (CoNLL format) to XML format to be used with Moses syntax-based models. NOTE - This only works for projective trees!
  • anymalign - an alternative word aligner
  • The Berkeley Word Aligner
  • More links to tools
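
To give an idea of what the first conversion involves, here is a hypothetical Python sketch (not the actual tagged2conll.pl) that turns TAB-separated tagger output into 10-column CoNLL rows for parsing. The assumed input format (one word<TAB>tag token per line, with a blank line between sentences) and the exact column layout are assumptions; check the real script and your parser's documentation.

  # Hypothetical sketch of a tagged-text-to-CoNLL conversion (not the actual
  # tagged2conll.pl). Assumes one "word<TAB>tag" token per line and a blank
  # line between sentences; unknown CoNLL-X fields are filled with "_".
  import sys

  def emit(sentence):
      for i, (word, tag) in enumerate(sentence, start=1):
          # Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
          print("\t".join([str(i), word, "_", tag, tag, "_", "_", "_", "_", "_"]))
      print()  # a blank line ends the sentence

  sentence = []
  for line in sys.stdin:
      line = line.rstrip("\n")
      if not line:
          if sentence:
              emit(sentence)
              sentence = []
          continue
      word, tag = line.split("\t")[:2]
      sentence.append((word, tag))
  if sentence:
      emit(sentence)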

Projects

Here is a list of project ideas and a short description of their main goals. Note that the exact descriptions are suggestions that should be discussed with your supervisor. For some projects, the subject will also be discussed briefly during other lectures; if this is the case, discuss the contents of your seminar with the respective teacher.

Note that there will be fewer groups than there are topic suggestions, depending on your wishes and the course coverage.

  • Parameter Tuning
    • Seminar: Explain the basic concepts of parameter tuning in SMT and introduce algorithms implemented in Moses
    • Project: compare tuning algorithms and parameters
      - MERT, PRO, Batch MIRA
      - investigate tuning stability (see the sketch after this list)
      - tools: tuning scripts and software in Moses
      - data: translated movie subtitles
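
    One simple way to quantify tuning stability is to repeat tuning several times (for example with different random seeds), translate the test set with each resulting weight vector, and look at the spread of the scores. A minimal Python sketch, with placeholder BLEU values:

      # Hedged sketch: tuning stability as the spread of test-set BLEU over
      # repeated tuning runs. The scores below are placeholders; replace them
      # with the results of your own runs.
      from statistics import mean, stdev

      bleu_per_run = [21.3, 20.8, 21.7, 20.9, 21.5]  # one score per tuning run

      print(f"mean BLEU: {mean(bleu_per_run):.2f}")
      print(f"std dev:   {stdev(bleu_per_run):.2f}")
      print(f"range:     {max(bleu_per_run) - min(bleu_per_run):.2f}")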

  • Factored SMT models
    • Seminar: Explain the basic concepts of factored SMT
    • Project: train and compare various factored SMT models
      - include factors such as POS tags, lemmas, syntactic function
      - compare various combinations of translation and generation steps
      - tools: Moses, tagger and lemmatizer (hunpos, TreeTagger) and parsers with existing models
      - data: translated movie subtitles

  • Language Modeling and Domains
    • Seminar: Explain the basic concepts of n-gram LMs
    • Project: Explore language models and their parameters
      - investigate the effect of data size on translation quality
      - compare the use of in-domain versus out-of-domain data (perplexity and translation quality; see the perplexity sketch after this list)
      - combinations of in-domain and out-of-domain LMs
      - tools: KenLM, SRILM, Moses
      - data: translated movie subtitles and data from other domains
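
    As a starting point for the perplexity comparison, here is a hedged Python sketch using the kenlm Python bindings; the model and data file names are placeholders for models and data you prepare yourself.

      # Minimal sketch: compare in-domain vs. out-of-domain LM perplexity on a
      # held-out set. Assumes the kenlm Python bindings are installed; the
      # model and data file names are placeholders.
      import kenlm

      in_domain = kenlm.Model("subtitles.5gram.arpa")   # placeholder file name
      out_domain = kenlm.Model("news.5gram.arpa")       # placeholder file name

      def corpus_perplexity(model, sentences):
          # kenlm scores are log10 probabilities; include </s> in the word count.
          log10_prob, n_words = 0.0, 0
          for sent in sentences:
              log10_prob += model.score(sent, bos=True, eos=True)
              n_words += len(sent.split()) + 1
          return 10 ** (-log10_prob / n_words)

      with open("en-sv.dev.en", encoding="utf-8") as f:  # tokenized dev data
          dev = [line.strip() for line in f if line.strip()]

      print("in-domain perplexity:    ", corpus_perplexity(in_domain, dev))
      print("out-of-domain perplexity:", corpus_perplexity(out_domain, dev))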

  • Word alignment and Phrase-Based SMT
    • Seminar: Explain word alignment algorithms and phrase extraction strategies
    • Project: Explore the impact of word alignment on SMT quality
      - different settings for GIZA++
      - different symmetrization heuristics (see the sketch after this list)
      - difference between alignment of wordforms, lemmas, (POS tags?)
      - other alignment tools: anymalign, (Berkeley aligner?)
      - tools: GIZA++, Moses, anymalign, TreeTagger, ...
      - data: translated movie subtitles
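
    To illustrate what symmetrization does, here is a small Python sketch that computes the intersection and union of the two directional alignments, represented simply as sets of link pairs; heuristics such as grow-diag-final-and start from the intersection and add selected links from the union. The representation is simplified compared to the actual GIZA++ output files.

      # Illustration of alignment symmetrization with alignments as sets of
      # (source_position, target_position) links. Intersection gives high
      # precision, union gives high recall; grow-diag-final-and lies in between.
      def symmetrize(src2tgt, tgt2src):
          # tgt2src links are (tgt, src); flip them so both sets use (src, tgt).
          flipped = {(s, t) for (t, s) in tgt2src}
          return src2tgt & flipped, src2tgt | flipped

      # Toy example with 0-based word positions.
      src2tgt = {(0, 0), (1, 1), (2, 3)}
      tgt2src = {(0, 0), (1, 1), (2, 2)}
      intersection, union = symmetrize(src2tgt, tgt2src)
      print("intersection:", sorted(intersection))
      print("union:       ", sorted(union))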

  • Re-ordering and SMT
    • Seminar: Explain different re-ordering strategies
    • Project: Apply and compare different re-ordering approaches
      - lexicalized re-ordering models
      - re-ordering constraints (see Moses: hybrid translation)
      - pre-ordering (before training/decoding)
      - tools: Moses, external or own tools
      - data: translated movie subtitles

  • Tree-based SMT
    • Seminar: Explain the basic concepts of tree-based SMT
    • Project: train and compare various tree-based SMT models
      - hierarchical phrase-based SMT (no linguistic syntax)
      - linguistic syntax in source and/or target language
      - tools: POS tagger (e.g. hunpos) and parsers with existing models
      - data: translated movie subtitles

  • Domains and evaluation
    • Seminar: Explain the impact of domain on MT and domain adaptation strategies
    • Project: Explore the influence of different domains in training and test data, and evaluate with several different methods
      - Vary the domain in training, dev and test data
      - Train on mixed data or data from a single domain
      - Possibly: explore methods for domain adaptation
      - Evaluate using different automatic metrics (see the BLEU sketch after this list)
      - Evaluate using some manual or semi-automatic method
      - tools: Moses, evaluation metrics
      - data: translated movie subtitles and data from other domains
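
    As one example of automatic evaluation, here is a hedged Python sketch that scores two system outputs against the same reference with corpus-level BLEU using NLTK; the file names are placeholders, and you may of course use other implementations instead (for example Moses' multi-bleu.perl).

      # Minimal sketch: compare two systems with corpus-level BLEU (NLTK).
      # Assumes tokenized, one-sentence-per-line files; names are placeholders.
      from nltk.translate.bleu_score import corpus_bleu

      def load_tokens(path):
          with open(path, encoding="utf-8") as f:
              return [line.split() for line in f]

      reference = load_tokens("en-sv.test.sv")     # reference translations
      references = [[ref] for ref in reference]    # one reference per sentence

      for name, output_file in [("baseline", "baseline.out"), ("adapted", "adapted.out")]:
          hypotheses = load_tokens(output_file)
          # NLTK BLEU is on a 0-1 scale (multiply by 100 for the usual notation).
          print(name, "BLEU:", corpus_bleu(references, hypotheses))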

  • Compounds in SMT
    • Seminar: Explain how compound words can be treated in MT
    • Project: Explore how to handle compound words for MT from and possibly to a compounding language
      - Compound splitting (a simple frequency-based sketch follows after this list)
      - Train MT systems with split compounds
      - Explore merging strategies for translating into compounding languages?
      - tools: Moses, external or own tools
      - data: translated movie subtitles
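
    As an illustration of the splitting step, here is a simplified Python sketch of frequency-based compound splitting in the spirit of Koehn and Knight (2003): a word is split in two if the geometric mean of the parts' corpus frequencies exceeds the frequency of the whole word. Real splitters also handle filler letters, several split points and part-of-speech restrictions; the toy frequencies below are placeholders.

      # Simplified frequency-based compound splitting (Koehn & Knight style):
      # split if the geometric mean of the parts' frequencies beats the
      # frequency of the unsplit word. Toy frequencies only; count words in
      # your monolingual training data instead.
      from collections import Counter
      from math import sqrt

      def best_split(word, freq, min_len=3):
          best_score, best_parts = freq.get(word, 0), [word]
          for i in range(min_len, len(word) - min_len + 1):
              left, right = word[:i], word[i:]
              score = sqrt(freq.get(left, 0) * freq.get(right, 0))
              if score > best_score:
                  best_score, best_parts = score, [left, right]
          return best_parts

      freq = Counter({"bil": 500, "försäkring": 300, "bilförsäkring": 2})
      print(best_split("bilförsäkring", freq))   # -> ['bil', 'försäkring']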

  • Lattices and confusion networks
    • Seminar: Explain how lattices and confusion networks are used in MT and give some examples of when they have been used
    • Project: Identify areas where lattices and/or confusion networks can be useful and apply them to an MT system
      - Figure out what can be represented by lattices and/or confusion networks (a toy PLF example follows after this list)
      - Run MT systems with lattices and/or confusion networks
      - tools: Moses, external or own tools
      - data: translated movie subtitles
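
    For the lattice experiments it is useful to know that Moses reads word lattices in the so-called Python Lattice Format (PLF). As a hedged illustration only (check the Moses documentation for the exact conventions), a toy lattice with one ambiguous token could be written like this:

      # Hedged sketch of a toy word lattice in Python Lattice Format (PLF).
      # Each node is a tuple of outgoing arcs, and each arc is assumed to be
      # (word, score, number_of_nodes_to_skip); verify against the Moses docs.
      lattice = (
          (("the", 1.0, 1),),                    # unambiguous token
          (("bus", 0.7, 1), ("buss", 0.3, 1)),   # two competing readings
          (("stops", 1.0, 1),),
      )

      # Moses expects one lattice per input line, written as a Python literal.
      print(repr(lattice))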