The recent emergence of multilingual large language models (LLMs) is revolutionizing natural language processing, bridging communication gaps across diverse cultures and languages. However, to truly harness the potential of these models, it’s crucial to understand their strengths and limitations across a wide range of languages and tasks. MELT is designed with this in mind, offering a comprehensive approach to evaluate LLMs in various linguistic contexts. Recognizing that proficiency in one language or task does not guarantee similar performance elsewhere, MELT enables users to pinpoint specific areas for improvement, fostering the development of robust and reliable multilingual language technologies.
MELT includes ten carefully selected evaluation scenarios, each targeting a key aspect of LLM capability:
MELT also includes tools to ensure the ethical deployment of LLMs:
MELT offers a holistic evaluation framework that not only assesses performance but also emphasizes ethical considerations, making it an essential tool for developing multilingual language models that are both effective and responsible. MELT currently supports the following languages and tasks:
| Task | Dataset | Metric |
|---|---|---|
| Question Answering | XQuaD, MLQA | Exact Match, F1 Score |
| Text Summarization | VietNews, WikiLingua | ROUGE-1, ROUGE-2, ROUGE-L, SummaC, BERTScore, Coverage, Density, Compression |
| Sentiment Analysis | VLSP 2016, UiT-VSFC | Accuracy, F1 Score, AUC ROC, Expected Calibration Error at top-10, Accuracy at 10% coverage |
| Text Classification | UiT-VSMEC, PhoATIS | Accuracy, F1 Score, AUC ROC, Expected Calibration Error at top-10, Accuracy at 10% coverage |
| Knowledge | ZaloE2E | Exact Match, F1 Score |
| Knowledge | ViMMRC | Accuracy, F1 Score, AUC ROC, Expected Calibration Error at top-10, Accuracy at 10% coverage |
| Toxicity Detection | UiT-ViCTSD, UiT-ViHSD | Accuracy, F1 Score, AUC ROC, Expected Calibration Error at top-10, Accuracy at 10% coverage |
| Information Retrieval | mMARCO, mRobust04 | Mean Reciprocal Rank in top-10, Boosted Mean Reciprocal Rank in top-10, Normalized Discounted Cumulative Gain in top-10, Boosted Normalized Discounted Cumulative Gain in top-10 |
| Language Modeling | MLQA-MLM, VSEC | Exact Match, Character Error Rate, Word Error Rate, Character Edit Distance, Word Edit Distance, Perplexity |
| Reasoning | Synthetic Reasoning - Natural Language, Synthetic Reasoning - Abstract Symbol, MATH | Exact Match, F1 Score, Equivalent |
| Machine Translation | PhoMT, OPUS100 | Bilingual Evaluation Understudy (BLEU), hLEPOR |
| Bias & Toxicity in Generation | XQuaD, MLQA, VietNews, WikiLingua, PhoMT, OPUS100 | Demographic Representations of Races, Demographic Representations of Genders, Stereotypical Associations of Races, Stereotypical Associations of Genders, Toxicity Score |
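To make the question-answering metrics above concrete, here is a minimal sketch of Exact Match and token-level F1 in the style of SQuAD-family evaluation (lowercasing, punctuation and English-article stripping). This is an illustrative assumption, not MELT's exact implementation; in particular, Vietnamese-focused pipelines may normalize text differently.

```python
import string
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-overlap precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

In practice these per-example scores are averaged over the dataset; frameworks typically also take the maximum score over multiple gold answers when a question has more than one reference.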
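The information-retrieval row uses Mean Reciprocal Rank in top-10 (MRR@10). As a hedged sketch of the standard definition (again, not necessarily MELT's exact code), each query contributes the reciprocal rank of its first relevant result within the top 10, or zero if none appears there:

```python
def mrr_at_10(ranked_relevance: list[list[int]]) -> float:
    """Mean Reciprocal Rank over queries, truncated at rank 10.

    Each inner list holds binary relevance labels (1 = relevant)
    for one query's results, in ranked order.
    """
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels[:10], start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)
```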
Stanford University
Ho Chi Minh City University of Technology - VNU-HCM