🌟Welcome to our leaderboard!🌟 Below, you will find the current rankings of the models we have evaluated. Models are ranked by their overall score. For each scenario, a model's overall score is the average of its scores across all datasets in that scenario.
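As a rough illustration of the averaging described above, the sketch below computes one scenario's overall score as the unweighted mean of per-dataset scores. The function name `scenario_score`, the dataset names, and the numbers are hypothetical placeholders, not actual leaderboard code or results, and the sketch assumes every dataset is weighted equally.

```python
from statistics import mean


def scenario_score(dataset_scores: dict[str, float]) -> float:
    """Return the unweighted mean of per-dataset scores for one scenario."""
    if not dataset_scores:
        raise ValueError("a scenario needs at least one dataset score")
    return mean(list(dataset_scores.values()))


# Hypothetical numbers for illustration only, not leaderboard results.
zero_shot_scores = {"dataset_a": 0.5, "dataset_b": 0.75, "dataset_c": 0.25}
overall = scenario_score(zero_shot_scores)
print(f"overall zero-shot score: {overall:.3f}")
```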
Zero-shot evaluation is a measure of how well a model can perform on a task without seeing any examples of that task. The model is given only a prompt and asked to generate a response.
Few-shot evaluation is a measure of how well a model can perform on a task with very few examples. The model is given a prompt that includes a few examples of the task and is asked to generate a response.
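The difference between the zero-shot and few-shot settings comes down to whether the prompt carries worked examples. The sketch below shows one way to build both kinds of prompt. The `Input:`/`Output:` template, the function name `build_prompt`, and the sample texts are assumptions made for illustration, not the prompts used for this leaderboard.

```python
def build_prompt(instruction: str, query: str, examples=()) -> str:
    """Assemble a prompt; an empty `examples` sequence gives a zero-shot prompt."""
    parts = [instruction]
    # Each example is a (input, output) pair shown to the model before the query.
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The actual query is left open for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment of the input as positive or negative."
query = "The plot was dull and predictable."

# Zero-shot: instruction and query only.
zero_shot_prompt = build_prompt(instruction, query)

# Few-shot: the same query preceded by two hypothetical labelled examples.
few_shot_prompt = build_prompt(
    instruction,
    query,
    examples=[
        ("I loved every minute of it.", "positive"),
        ("A complete waste of time.", "negative"),
    ],
)
```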
Weaker prompt evaluation is a measure of how well a model can perform on a task with a weaker prompt. The model is given a weaker prompt for the task it is being asked to perform and is asked to generate a response.
Fairness aware evaluation is a measure of how well a model can deal with fairness issues such as race and gender.
Robustness aware evaluation is a measure of how well a model can deal with robustness issues such as adversarial attacks.
Chain-of-Thought evaluation is a measure of how well a model can perform on a task that requires chain-of-thought (step-by-step) reasoning.
Randomized Choice evaluation is a measure of how well a model can perform on a multiple-choice task with randomized choices.
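One plausible reading of "randomized choices" is that the order of the answer options is shuffled while the correct answer is tracked, so that a model cannot score well simply by favoring a fixed position. The sketch below follows that assumption. The function name `shuffle_choices` and the sample question are hypothetical and do not describe the leaderboard's actual procedure.

```python
import random


def shuffle_choices(choices, answer_index, seed):
    """Shuffle answer options and return them with the new index of the correct answer."""
    rng = random.Random(seed)  # seeded so the shuffle is reproducible
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # Position in the shuffled list where the original correct option ended up.
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


# Hypothetical question for illustration only.
choices = ["Paris", "Berlin", "Madrid", "Rome"]
shuffled, new_index = shuffle_choices(choices, answer_index=0, seed=13)
labels = "ABCD"
for label, option in zip(labels, shuffled):
    print(f"{label}. {option}")
print("correct label:", labels[new_index])
```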
Bias & Toxicity evaluation is a measure of how well a model can deal with bias and toxicity issues.
Below are our detailed evaluation results. Please choose a task and scenario to view the results.
Task | Zero-shot | Few-shot | Weaker Prompt | Fairness Aware | Robustness Aware | Chain-of-Thought | Randomized Choice | Bias & Toxicity |
---|---|---|---|---|---|---|---|---|
Question-Answering | View | - | View | View | View | - | - | View |
Summarization | View | - | View | - | View | - | - | View |
Sentiment Analysis | View | View | - | View | View | - | - | - |
Text Classification | View | View | - | View | View | - | - | - |
Knowledge | View | View | - | - | View | - | View | - |
Toxicity Detection | View | View | - | View | View | - | - | - |
Information Retrieval | View | View | - | View | View | - | - | - |
Language Modeling | View | View | - | View | - | - | - | - |
Reasoning | View | View | - | - | - | View | - | - |
Translation | - | View | - | - | View | - | - | View |