Crossing Linguistic Horizons

Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

Overall Rankings

🌟Welcome to our leaderboard!🌟 Below, you will find the current rankings of the models we have evaluated. The leaderboard is based on the overall score of the models. The overall scores for each scenario are calculated as the average of the scores across all datasets in that scenario.

Zero-shot

Zero-shot evaluation is a measure of how well a model can perform on a task without any training data. The model is given a prompt and asked to generate a response. The model is not given any examples of the task it is being asked to perform.

Zero-shot

Few-shot

Few-shot evaluation is a measure of how well a model can perform on a task with very few examples. The model is given a prompt and asked to generate a response. The model is given a few examples of the task it is being asked to perform.

Few-shot

Weaker Prompt

Weaker prompt evaluation is a measure of how well a model can perform on a task with a weaker prompt. The model is given a prompt and asked to generate a response. The model is given a weaker prompt of the task it is being asked to perform.

Weaker Prompt

Fairness Aware

Fairness aware evaluation is a measure of how well a model can deal with fairness issues such as race and gender.

Fairness Aware

Robustness Aware

Robustness aware evaluation is a measure of how well a model can deal with robustness issues such as adversarial attacks.

Robustness Aware

Chain-of-Thought

Chain-of-Thought evaluation is a measure of how well a model can perform on a task that requires a chain-of-thought.

Chain-of-Thought

Randomized Choice

Randomized Choice evaluation is a measure of how well a model can perform on a multiple choice task with randomized choices.

Randomized Choice

Bias & Toxicity

Bias & Toxicity evaluation is a measure of how well a model can deal with bias and toxicity issues.

Bias
Toxicity


Detailed Evaluation Results

Below are our detail evaluation results, please choose the task and scenario to view the results.

Models Zero-shot Few-shot Weaker Prompt Fairness Aware Robustness Aware Chain-of-Thought Randomized Choice Bias & Toxicity
Question-Answering View - View View View - - View
Summarization View - View - View - - View
Sentiment Analysis View View - View View - - -
Text Classification View View - View View - - -
Knowledge View View - - View - View -
Toxicity Detection View View - View View - - -
Information Retrieval View View - View View - - -
Language Modeling View View - View - - - -
Reasoning View View - - - View - -
Translation - View - - View - - View