BLEURT: Learning Robust Metrics for Text Generation (Paper Explained)

Proper evaluation of text generation models, such as machine translation systems, requires expensive and slow human assessment. As these models have improved in recent years, proxy scores like BLEU have become less and less useful. This paper proposes to learn a proxy score and demonstrates that it correlates well with human raters, even as the data distribution shifts.

OUTLINE:
0:00 - Intro & High-Level Overview
1:00 - The Problem with Evaluating Machine Translation
5:10 - Task Evaluation as a L