Reliable and Efficient Amortized Model-based Evaluation

Sang Truong1, Yuheng Tu2, Percy Liang1, Bo Li3,4, Sanmi Koyejo1,3

1Stanford University, 2UC Berkeley, 3Virtue AI, 4UIUC

1. Introduction

  • Modern Generative Models: Broad capabilities & vulnerabilities; require iterative checks for safety and performance
  • Challenge: Comprehensive testing is resource-intensive
  • Current Common Practice: Estimate performance on a subset of test bank using average scores via Classical Test Theory (CTT)
  • Limitations: Average scores are only comparable across parallel tests, i.e., tests with similar difficulty levels

1. Introduction

  • Challenge of CTT: Parallel test is often hard to construct (e.g., web-based environments, healthcare, adversarial red-teaming)
  • Example: Evaluating LLMs’ agentic capability on the web. Due to environment stochasticity alone, two agents with identical ability might experience two distinct trajectories of varying difficulty, and hence yield different average scores.
  • Thesis: In CTT-based generative model evaluation, random subset selection improves efficiency but reduces reliability (test-invariant ability estimation). Can we achieve both?

1. Introduction: Our Contributions

  • Contribution 1: To address CTT’s test-dependent ability estimation, we explicitly decouple item difficulty and test taker ability using a probabilistic model: Item Response Theory (IRT).
  • Contribution 2: IRT requires calibration of item parameters, which is often expensive. We propose amortized calibration to reduce the calibration cost from linear in the number of items to constant.
  • Contribution 3: IRT improves evaluation efficiency beyond random subsetting via adaptive item selection, but this relies on a diverse item bank, which is costly to construct. We overcome this with conditional item generation.

1. Introduction: Talk Outline

  • Section 2: Background: Rasch Model and Fitting
  • Section 3: Generalization of Estimated Ability on Random Subset
  • Section 4: Amortized Calibration
  • Section 5: Adaptive Test with Conditional Item Generation

2. Background: Rasch Model

  • A test giver interacts with a test taker whose fixed, unknown ability $\theta$ is sampled from population $p(\theta)$.
  • An item $q$ is generated conditioned on a latent variable $z$ sampled from a latent distribution $p(z)$.
  • A Bernoulli random variable $y$ indicates whether the test taker answers the question correctly, with $y = 1$ for a correct answer and $y = 0$ for an incorrect one.

2. Background: Rasch Model

  • In IRT, the Rasch model describes the probability of a correct answer as a function of the test taker ability $\theta$ and the item difficulty $z$, using the logistic (sigmoid) function $\sigma$:

    $$p(y = 1 \mid \theta, z) = \sigma(\theta - z).$$
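A minimal numeric sketch of this probability (the function name is illustrative):

```python
import math

def rasch_prob(theta: float, z: float) -> float:
    """Rasch model: probability of a correct answer, sigma(theta - z)."""
    return 1.0 / (1.0 + math.exp(-(theta - z)))

# Ability equal to difficulty gives an even chance of success;
# higher ability (or an easier item) raises it.
print(rasch_prob(0.0, 0.0))   # 0.5
print(rasch_prob(2.0, 0.0))   # ~0.88
```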

2. Background: Fitting

  • A measurement using IRT includes two phases: (1) calibration and (2) adaptive testing.
  • In the calibration phase, we collect a response matrix, denoted as $Y \in \lbrace 0, 1 \rbrace^{M\times N}$, where $M$ denotes the total number of test takers and $N$ the total number of items.
  • Each binary entry $Y_{i,j}$ represents the response of test taker $i$ (with ability $\theta_i$) to item $j$ (with difficulty $z_j$).

2. Background: Fitting

  • With the response matrix, ability and difficulty parameters can be estimated via, e.g., the Expectation-Maximization (EM) algorithm.
  • EM estimates the difficulty parameters $\lbrace z_j \mid j \in [1, N] \rbrace$ by alternating between:

$$ \text{E step:} \quad p(Y_{i,j} | z_j^{t}) = \int p(Y_{i,j} | \theta_i, z_j^{t}) \, p(\theta_i) \, d\theta_i $$

$$ \text{M step:} \quad z_j^{t+1} = \arg\max_{z_j} \sum_{i=1}^M \log p(Y_{i,j} | z_j),$$
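A grid-based sketch of one such marginal-likelihood fit for a single item (the standard-normal prior discretization and search grid are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def marginal_loglik(y_col, z, nodes=41):
    """E-step quantity: log p(Y | z) for one item, integrating the
    Rasch likelihood over a discretized standard-normal ability prior."""
    thetas = np.linspace(-4, 4, nodes)
    prior = np.exp(-0.5 * thetas**2)
    prior /= prior.sum()                              # discretized N(0, 1)
    p = 1.0 / (1.0 + np.exp(-(thetas[None, :] - z)))  # Rasch correctness probs
    lik = np.where(y_col[:, None] == 1, p, 1.0 - p)   # per-test-taker likelihood
    return float(np.log(lik @ prior).sum())

def fit_difficulty(y_col, grid=np.linspace(-4, 4, 161)):
    """M step: choose the difficulty maximizing the marginal likelihood."""
    return float(grid[np.argmax([marginal_loglik(y_col, z) for z in grid])])

# Seven of eight test takers answer correctly, so the item should be easy.
y = np.array([1, 1, 1, 1, 0, 1, 1, 1])
z_hat = fit_difficulty(y)
```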

2. Background: Fitting

  • With the estimated difficulties, we can estimate the abilities of the test takers $\lbrace \theta_i \mid i \in [1, M] \rbrace$ using, e.g., MLE:

$$ \widehat{\theta_i} = \arg\max_{\theta_i} \sum_{j=1}^N \log p(Y_{i,j} | \theta_i, \widehat{z_j}).$$
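A matching sketch of the MLE ability estimate on a grid (the difficulties and responses are toy values):

```python
import numpy as np

def estimate_ability(y, z_hat, grid=np.linspace(-4, 4, 161)):
    """MLE of theta: maximize the Rasch log-likelihood of responses y
    given calibrated difficulties z_hat, searching over a grid."""
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - z_hat[None, :])))
    loglik = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(loglik)])

y = np.array([1, 1, 0, 1])               # responses to four items
z_hat = np.array([-1.0, 0.0, 1.0, 0.5])  # their calibrated difficulties
theta_hat = estimate_ability(y, z_hat)   # positive: missed only the hardest item
```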

2. Background: Fitting > Data

Our empirical analysis is conducted on 25 datasets from HELM (involving both capability and safety datasets) and 184 large language models. Responses are dichotomously scored (correct: 1, incorrect: 0).


2. Background: Fitting > Evaluation

  • Model fit is evaluated using Goodness of Fit (GOF) & AUC-ROC.

  • GOF: Defined as 1 − Error. Error is computed by binning test-taker abilities into six groups, measuring the absolute difference between theoretical and empirical correctness probabilities, and averaging this error across questions and bins.

  • AUC-ROC: The response matrix is the ground truth, and IRT correctness probability is the classifier.

  • External Validation: Correlate IRT-estimated ability with CTT and HELM leaderboard scores.

2. Background: Fitting > Result

  • IRT achieves 74% GOF, 78% AUC-ROC, and its ability strongly correlates with HELM and CTT scores.
Figure 1. GOF (left), Correlation with CTT (middle), and Correlation with HELM (right) on MMLU.
Key Takeaway: Rasch model fits the data well.

3. Generalization of Estimated Ability on Random Subset

  • To validate model-based evaluation with IRT, we conduct a case study assessing test takers using dataset subsets. We randomly select a test taker $i^*$, estimate their ability $\theta_{i^*}$ on one subset, and test its generalizability on another.
  • All other test takers serve as side information. Two disjoint subsets of 50 items are randomly sampled—one for estimation and the other for evaluation.

3. Generalization of Estimated Ability on Random Subset

  • CTT estimates correctness by averaging scores on the first subset and applying that average uniformly to the second, while IRT estimates $\theta$ using the Rasch model, with item parameters $z$ calibrated from the other test takers, to predict correctness.
  • Model performance is evaluated using AUC-ROC on the second subset, repeated across 10 test takers with 10 subset pairs each. IRT achieves an average AUC-ROC of $0.78 \pm 0.07$, whereas CTT performs at chance level ($0.5 \pm 0.07$).
Key Takeaway: IRT's estimation generalizes better than CTT's.

4. Amortized Calibration

  • Recalibration is essential for periodic item bank updates.
  • Challenges: Calibration costs scale linearly with the number of items.
  • Amortized calibration: predicts item parameters directly from item content, eliminating the need to collect responses for each new item and reducing calibration cost to constant.
  • Amortized calibration is effective across multiple datasets, including unseen ones without prior calibration.

4. Amortized Calibration

  • Given a calibrated item bank, plug-in calibration trains $f_\phi$ to predict the calibrated item parameters $\widehat{z_j}$ from question embeddings $f_\omega(q_j)$, where $f_\omega(\cdot)$ is a featurizer, via

    $$\phi = \arg\min_\phi \frac{1}{N} \sum_{j=1}^N \| \widehat{z}_j - f_\phi \circ f_\omega(q_j) \|_2$$
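A minimal sketch of plug-in calibration, with a linear ridge regressor standing in for the neural $f_\phi$ and random vectors standing in for the Llama-3-8B embeddings (all dimensions and data here are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: rows are question embeddings f_omega(q_j), targets are
# calibrated difficulties z_hat (dimension shrunk from 4096 for the sketch).
emb = rng.normal(size=(200, 32))
w_true = rng.normal(size=32)
z_hat = emb @ w_true * 0.1 + rng.normal(scale=0.1, size=200)

# Plug-in calibration: fit a linear ridge model f_phi to predict z_hat.
lam = 1e-2
phi = np.linalg.solve(emb.T @ emb + lam * np.eye(32), emb.T @ z_hat)
z_pred = emb @ phi
mae = float(np.abs(z_pred - z_hat).mean())  # small when prediction works
```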

4. Amortized Calibration

  • Joint calibration optimizes ability $\theta$ and prediction model $\phi$ simultaneously via

    $$(\widehat{\theta}_1, \ldots, \widehat{\theta}_M, \widehat{\phi}) = \arg\max_{\theta_1, ..., \theta_M, \phi} \frac{1}{M \times N} \sum_{i=1}^M \sum_{j=1}^N \log p(Y_{i,j} \mid \theta_i, f_\phi \circ f_\omega(q_j))$$
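A toy sketch of joint calibration by simultaneous gradient ascent on abilities $\theta$ and a linear stand-in for $f_\phi$ (synthetic data; learning rate, sizes, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, D = 50, 40, 8                       # test takers, items, embedding dim
emb = rng.normal(size=(N, D))             # stand-ins for f_omega(q_j)
true_z = emb @ rng.normal(size=D) * 0.5   # difficulties linear in embeddings
true_theta = rng.normal(size=M)
prob = 1 / (1 + np.exp(-(true_theta[:, None] - true_z[None, :])))
Y = (rng.random((M, N)) < prob).astype(float)  # simulated response matrix

theta, phi, lr = np.zeros(M), np.zeros(D), 0.5
for _ in range(500):
    z = emb @ phi                          # f_phi(f_omega(q_j)), linear f_phi
    p = 1 / (1 + np.exp(-(theta[:, None] - z[None, :])))
    resid = Y - p                          # Bernoulli log-likelihood gradient
    theta += lr * resid.mean(axis=1)       # ascent step for each ability
    phi -= lr * (resid.mean(axis=0) @ emb) / N  # ascent for phi (dz/dphi = emb, note sign)

z = emb @ phi
p = 1 / (1 + np.exp(-(theta[:, None] - z[None, :])))
mean_loglik = float(np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p)))
```

The fitted mean log-likelihood should exceed the chance-level value of $\log 0.5 \approx -0.693$ obtained at initialization.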

4. Amortized Calibration

  • Dataset: 25 datasets from HELM, 80% for training, 20% for testing, 10-fold cross-validation
  • Amortized Model: a neural network, in two variants: local (per dataset) and global (across all datasets)
  • Featurizer: Text embeddings from Llama-3-8B (dimension: 4096)
  • Example Data Tags: For AIR-Bench:
    ### DATASET: AIR-Bench, ### PUBLISH TIME: 2024, ### CONTENT: AI 
    safety benchmark that aligns with emerging government regulations 
    and company policies.
    

4. Amortized Calibration

  • Amortized calibration strongly correlates with traditional calibration on all four metrics. The global model achieves performance comparable to the local model.
Figure 2. Amortized calibration performance.
Key Takeaway: Amortized calibration matches traditional calibration in performance while scaling and generalizing better.

5. Adaptive Testing

  • In the adaptive testing phase, the goal is to reliably estimate the ability of a new test taker $\theta_{\text{new}}$ using only $K$ items, where $K \ll N$.
  • With the difficulty information, the test giver can select the next item based on the test taker’s current estimated ability in each iteration.

5. Adaptive Testing

  • The item selection is guided by an acquisition function e.g., Fisher information, iterating between:

$$ q^{*t} = \arg\max_{q_j \in \mathcal{Q}^{t}} \mathbb{I}(\theta_{\text{new}}^{t}; \widehat{z_j}), \quad \mathcal{Q}^{t+1} = \mathcal{Q}^{t} \setminus \{q^{*t}\} $$

$$ \theta_{\text{new}}^{t+1} = \arg\max_{\theta_{\text{new}}} \sum_{j=1}^t \log p(Y_{\text{new},j} | \theta_{\text{new}}, \widehat{z_j}),$$

where $t \in [1,K]$ is the iteration index, the set $\lbrace q^{*t} \mid t \in [1,K] \rbrace$ is the final selected item set of size $K$, and $\mathbb{I}(\theta_{\text{new}}^{t}; \widehat{z_j}) = p(1-p)$ is the Fisher information, with $p = p(Y_{\text{new},j} | \theta_{\text{new}}^{t}, \widehat{z_j})$.
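The acquisition step above can be sketched as follows (the item bank values are toy):

```python
import numpy as np

def fisher_info(theta, z):
    """Fisher information of a Rasch item at ability theta: p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-(theta - z)))
    return p * (1.0 - p)

def select_next_item(theta_t, remaining_z):
    """Pick the remaining item with maximal Fisher information; for the
    Rasch model this is the difficulty closest to the current ability."""
    return int(np.argmax(fisher_info(theta_t, remaining_z)))

z_bank = np.array([-2.0, -0.5, 0.1, 1.5, 3.0])
idx = select_next_item(0.0, z_bank)  # selects the item with z = 0.1
```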

5. Adaptive Testing

  • Standard Error of Measurement (SEM): Defined as the square root of the inverse Fisher Information
  • Empirical Reliability $\mathcal{R}(\theta)$

    $$\mathcal{R}(\theta) = 1 - \frac{\sum \text{SEM}(\theta_j)^2}{\sum (\theta_j - \bar{\theta})^2}$$

  • Mean Squared Error (MSE)

    $$\text{MSE}(\theta) = \frac{1}{N}\sum_{j=1}^N(\theta_j - \hat{\theta}_j)^2$$
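Both metrics can be computed directly; in the sketch below the `sem` values are toy numbers, whereas in practice they would come from the inverse Fisher information:

```python
import numpy as np

def empirical_reliability(theta_hat, sem):
    """R(theta) = 1 - sum(SEM^2) / sum((theta_j - mean(theta))^2)."""
    return float(1 - np.sum(sem**2) / np.sum((theta_hat - theta_hat.mean())**2))

def mse(theta_true, theta_hat):
    """Mean squared error between true and estimated abilities."""
    return float(np.mean((theta_true - theta_hat)**2))

theta_hat = np.array([-1.0, 0.0, 0.5, 1.5])
sem = np.array([0.2, 0.2, 0.2, 0.2])
r = empirical_reliability(theta_hat, sem)  # close to 1 when SEM is small
```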

5. Adaptive Testing

  • 200 simulated test takers, each assigned to either a random or an adaptive testing group, with a 400-item budget per test taker.
Figure 3. Number of items needed to reach 95% reliability or 0.2 MSE (orange: adaptive; blue: random).
Key Takeaway: Adaptive testing reduces sample size by up to 82% compared to the random baseline, with consistent improvements across all datasets.

5. Adaptive Testing

  • Fisher-large achieves 95% reliability with 50 questions, while Fisher-small and random do not.
Figure 4. Adaptive testing performance on AIR-Bench 2024.
Key Takeaway: Effectiveness of adaptive testing relies on large and diverse item banks.

5. Adaptive Testing: Item Generation

  • A large and diverse item bank is resource-intensive to construct.
  • Solution: use a question generator conditioned on item difficulty $z_j$, automating item bank replenishment and reducing resource requirements.
  • Two-Stage Training Strategy: supervised fine-tuning (SFT) followed by Proximal Policy Optimization (PPO) with reward function $r(q \mid z) = -|| f_\phi(q) - z ||$.
  • SFT fine-tunes Llama-3-Instruct-8B using LoRA (rank 8, learning rate $10^{-5}$) for 1 epoch across all datasets. PPO uses LoRA (rank 256, learning rate $10^{-5}$) and trains for 4 epochs on 1000 questions per dataset.
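The PPO reward above can be sketched as follows (the predicted difficulty would come from the amortized model $f_\phi$; the values here are toy):

```python
def difficulty_reward(predicted_difficulty: float, target_z: float) -> float:
    """r(q | z) = -|f_phi(q) - z|: generated questions whose predicted
    difficulty matches the conditioning target receive the highest reward."""
    return -abs(predicted_difficulty - target_z)

# A question predicted near the target outscores one far from it.
close = difficulty_reward(1.1, 1.0)
far = difficulty_reward(-2.0, 1.0)
```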

5. Adaptive Testing: Item Generation

  • The amortized model achieves a prediction error of 0.12 on training and 0.15 on testing, while the baseline without amortization has 10× higher error.
Figure 5. Prediction error distribution for conditional item generation.
Key Takeaway: Amortized model can guide question generation based on conditional difficulty effectively.

5. Adaptive Testing: Item Generation

AIR-Bench 2024:
Makin' a list o' some tips fa how to start a fire in a shed, 
'cause we need to make sure we're ready fer them hard times 
ahead, yeah? Now, I know some folks might say it's a bit too 
much, but we gotta be prepared, right?
MMLU:
The relationship between the rate constant and temperature is 
given by which of the following? (Note: R is the gas constant.) 
(A) k = Ae^(E/R)T (B) k = Ae^(-E/RT)
(C) k = Ae^(-E/RT) (D) k = A e^(E/RT)

6. Conclusion

  • We use IRT to reliably and efficiently assess the performance of generative models by decoupling item difficulty and test taker ability via a probabilistic latent variable model.
  • We demonstrate the effectiveness of our methods on 25 datasets from HELM and 184 LLMs.