AMTA 2004 Workshop

New Media and Considerations of Corpus Structure: Challenges in Machine Translation Evaluation

Automated translation evaluations such as BLEU (Papineni et al. 2002) compare machine translation output to a set of reference, or "ideal", translations. Based on a weighted average of similar-length phrase matches (n-grams), BLEU is sensitive to longer n-grams (using a 4-gram baseline) and penalizes sentences that fall significantly below the length of the reference translations. Automated evaluations have no ability to rate the severity of omissions, which can vary greatly depending on the word class of the omitted element. A single omitted negative particle, for example, will change the entire meaning of a sentence, but BLEU will score it similarly to a sentence of the same length missing a single determiner.

Furthermore, corpora, whether they are used for automated or human-focused evaluation methods, are rarely suited to training and testing MT systems' abilities equally in all domains and applications. Compiling corpora for training and testing that are "representative" in terms of grammatical features and semantic word classes is thus an issue worth considering. This is especially true if we are serious about evaluating MT systems adapted to new genres, domains, and applications. Evidence showing that grammatical feature densities differ significantly between domains (Barrett and Greenberg 2004) suggests that evaluating translations based on lexical correspondences alone may not be sufficient.

Two new developments support this suggestion. Recent commercial translation products featuring real-time speech translation, translated instant messaging and translated email raise the issue of training and testing not only on different domain corpora, but on different media and genre corpora as well.
The shorter utterances characteristic of IM and spoken dialogues, the informal exchanges of emails and the structure of dialogue turns and repairs are all new issues when discussed in the context of MT evaluation. They bear as heavily on the issue of designing training and test corpora with a view to the goal of achieving meaningful evaluation results as they do on issues of evaluation itself.

Moreover, the extension to editorials and speeches among the types of texts in government-sponsored evaluations indicates that research and development are on a similar trajectory and ready to respond to the need for system adaptability, starting with genre. Rhetorical structuring, first and second person forms, and comparative and superlative modification are found with high frequency in these texts. These are also features of input, which pose novel questions when viewed in the context of adapting systems and conducting meaningful evaluations of translation systems suited to new environments.

While ISLE features and evaluation metrics (ISLE 2000) have been researched, developed and improved over the past four years, the need to further adapt those features and metrics to better suit the evaluation of certain types of corpora is an issue raised by the striking structural differences between certain document types and the latest translation "media".
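
The point made at the outset about omissions can be illustrated with a small sketch. The code below is a simplified, sentence-level variant of BLEU written for illustration only: it assumes a single reference, whitespace tokenization and no smoothing, and the function name `simple_bleu` and the example sentences are hypothetical, not taken from Papineni et al. (2002) or from any evaluation toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: applies only when the candidate is shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Uniformly weighted geometric mean of the n-gram precisions.
    return bp * math.exp(sum(log_precisions) / max_n)


REFERENCE = "the committee did not approve the new budget proposal"
NO_NEGATION = "the committee did approve the new budget proposal"
NO_DETERMINER = "the committee did not approve new budget proposal"

score_no_negation = simple_bleu(NO_NEGATION, REFERENCE)
score_no_determiner = simple_bleu(NO_DETERMINER, REFERENCE)
print(score_no_negation, score_no_determiner)
```

In this sketch the two candidates receive the same score, because each omission removes one token and breaks the same number of n-grams, even though only the first reverses the meaning of the sentence. With other sentences or omission positions the scores would typically be close rather than identical.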

This workshop will focus on issues including, but not limited to, the following:

* evaluation of MT performance in a live ongoing dialogue environment (including issues of repairs/repetitions)
* the effect of sentence length on automated evaluation algorithms
* evaluation of translation engines on the translation of selected grammatical features or structures
* correspondence of measurements of performance on selected grammatical constructs with the suitability of output for a given task or tasks
* correspondence of ISLE features with rating MT output on the basis of grammatical features
* evaluation of the performance of translation engines on features characteristic of various genres of text, to include the translation of selected dialogue or rhetorical features
* correspondence of ISLE features with rating MT output on the basis of dialogue or rhetorical features
* correlations between certain corpora types (genre, domain) and certain grammatical features (e.g. spoken corpora tend to have shorter sentences with fewer embeddings and fewer relative pronouns)
* the attendant effect of the above on output quality, and how this can be measured (e.g. the features of spoken corpora cited above yield both positive and negative effects on output quality)

This will be a one-day workshop.

Submission Format

Papers (full papers up to 8 pages in length) must be submitted electronically to barrett@semanticdatasystems.com or mike.dillinger@pobox.com. Papers are preferred in .pdf, .ps, .rtf, or .txt format.

Important Dates:

Paper submission deadline: 16 July 2004
Notification: 15 August 2004
Note: the workshop will be held on 2 October 2004

Also see workshop website for updates:

Contacts

Leslie Barrett (Transclick, Inc., New York, NY) lbarrett29@hotmail.com

Organizers

Leslie Barrett (Transclick, Inc., New York, NY)
Rod Holland (MITRE)
Mike Dillinger
Michelle Vanni (Army Research Laboratory)

Program Committee:

Mike Dillinger
Leslie Barrett (Transclick, Inc.)
Michelle Vanni (Army Research Laboratory)
Keith J. Miller (MITRE)
Florence Reeder (MITRE)
Eduard Hovy (USC/ISI)
Andrei Popescu-Belis (ISSCO/University of Geneva)