AMTA 2004 Workshop

New Media and Considerations of Corpus Structure: Challenges in
Machine Translation Evaluation


Automated translation evaluation metrics such as BLEU (Papineni et al.
2002) compare machine translation output to a set of reference, or
"ideal", translations. Based on a weighted geometric mean of n-gram
precisions (phrase matches of length one to four in the standard
configuration), BLEU rewards longer matching n-grams and penalizes
output that is shorter than the reference. Automated evaluations have
no ability to rate the severity of omissions, which can vary greatly
depending on the word class of the omitted element. A single omitted
negative particle, for example, can reverse the meaning of a sentence,
but BLEU will score it similarly to a sentence of the same length
missing a single determiner.
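
The point about omissions can be illustrated with a minimal sketch of
a sentence-level, single-reference BLEU-style score. This is a
simplification for illustration only, not the official BLEU
implementation (which is computed over a whole test corpus and
supports multiple references); the function name simple_bleu and the
example sentences are assumptions introduced here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: each candidate n-gram counts at most as often
        # as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: applies only when the candidate is shorter.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Uniform weights over n = 1..max_n (geometric mean of precisions).
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical example sentences, made up for this illustration.
reference = "the committee did not approve the budget for the next year"
missing_negation = "the committee did approve the budget for the next year"
missing_determiner = "the committee did not approve budget for the next year"

score_neg = simple_bleu(missing_negation, reference)
score_det = simple_bleu(missing_determiner, reference)
print(score_neg, score_det)
```

Under these assumptions, dropping "not" and dropping "the" leave the
same number of matching n-grams at each length, so the two candidates
receive identical scores even though only one of them reverses the
meaning.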

Furthermore, corpora, whether they are used for automated or
human-focused evaluation methods, are rarely suited to training and
testing MT systems' abilities equally in all domains and
applications. Compiling training and test corpora that are
"representative" in terms of grammatical features and semantic
word classes is thus an issue worth considering. This is especially
true if we are serious about evaluating MT systems adapted to new
genres, domains, and applications. Evidence showing that grammatical
feature densities differ significantly between domains (Barrett and
Greenberg 2004) suggests that evaluating translations based on lexical
correspondences alone may not be sufficient. 
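
One way to make the notion of feature density concrete is to count
selected word classes per thousand tokens in each corpus. The sketch
below is an assumption-laden illustration, not the method of Barrett
and Greenberg (2004): the word lists, the function name
feature_density and the two sample strings are all hypothetical, and a
real study would use a tagger or parser rather than word lists.

```python
import re

# Hypothetical, deliberately tiny word lists standing in for real
# grammatical feature detectors.
FEATURES = {
    "negation": {"not", "no", "never"},
    "relative_pronoun": {"which", "who", "whom", "whose"},
    "first_second_person": {"i", "we", "you", "me", "us"},
}


def feature_density(text, per=1000):
    """Return occurrences of each feature per `per` tokens of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in FEATURES}
    return {
        name: per * sum(t in words for t in tokens) / len(tokens)
        for name, words in FEATURES.items()
    }


# Made-up samples standing in for an IM corpus and a news corpus.
chat = "no i can not come can you call me"
news = "the minister who resigned said the plan which failed was costly"

chat_density = feature_density(chat)
news_density = feature_density(news)
print(chat_density)
print(news_density)
```

Comparing such densities across corpora gives a rough indication of
whether a test set exercises the grammatical features that matter for
a given domain or medium.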

Two new developments support this suggestion. Recent commercial
translation products featuring real-time speech translation,
translated instant messaging and translated email raise the issue of
training and testing not only on different domain corpora, but on
different media and genre corpora as well. The shorter utterances
characteristic of IM and spoken dialogues, the informal exchanges of
emails and the structure of dialogue turns and repairs are all new
issues when discussed in the context of MT evaluation. They bear as
heavily on the design of training and test corpora that yield
meaningful evaluation results as they do on evaluation itself.
Moreover, the addition of editorials and speeches to the types of
texts used in government-sponsored evaluations indicates that research
and development are on a similar trajectory and ready to respond to
the need for system adaptability, starting with genre. Rhetorical
structuring, first and second person forms, and comparative and
superlative modification are found with high frequency in these texts.
These are also features of the input that pose novel questions when
viewed in the context of adapting systems and conducting meaningful
evaluations of translation systems suited to new environments.

While ISLE features and evaluation metrics (ISLE 2000) have been
researched, developed and improved over the past four years, the
striking structural differences between certain document types and
the latest translation "media" raise the need to further adapt those
features and metrics to better suit the evaluation of certain types of
corpora.

This workshop will focus on issues including, but not limited to, the
following:


  * evaluation of MT performance in a live ongoing dialogue
    environment (including issues of repairs/repetitions)
  * the effect of sentence length on automated evaluation algorithms
  * evaluation of translation engines on the translation of selected
    grammatical features or structures
  * correspondence of measurements of performance on selected
    grammatical constructs with the suitability of output for a given
    task or tasks
  * correspondence of ISLE features with rating MT output on the basis
    of grammatical features
  * evaluation of the performance of translation engines on features
    characteristic of various genres of text, to include the
    translation of selected dialogue or rhetorical features
  * correspondence of ISLE features with rating MT output on the basis
    of dialogue or rhetorical features
  * correlations between certain corpora types (genre, domain) and
    certain grammatical features (e.g. spoken corpora tend to have
    shorter sentences with fewer embeddings and fewer relative
    pronouns)
  * the attendant effect of the above on output quality, and how this
    can be measured (e.g. the features of spoken corpora cited above
    yield both positive and negative effects on output quality)

This will be a one-day workshop.

Submission Format
Full papers of up to 8 pages must be submitted electronically to
barrett@semanticdatasystems.com or mike.dillinger@pobox.com. Papers
are preferred in .pdf, .ps, .rtf, or .txt format.


Important Dates:
Paper submission deadline: 16 July 2004
Notification: 15 August 2004
Note: the workshop will be held on 2 October 2004

Also see workshop website for updates:

Contacts
Leslie Barrett (Transclick, Inc., New York, NY)
lbarrett29@hotmail.com

Organizers
Leslie Barrett (Transclick, Inc., New York, NY)
Rod Holland (MITRE)
Mike Dillinger
Michelle Vanni (Army Research Laboratory)

Program Committee:
Mike Dillinger
Leslie Barrett (Transclick, Inc.)
Michelle Vanni (Army Research Laboratory)
Keith J. Miller (MITRE)
Florence Reeder (MITRE)
Eduard Hovy (USC/ISI)
Andrei Popescu-Belis (ISSCO/University of Geneva)