Reliable and Efficient Amortized Model-based Evaluation

Sang Truong1, Yuheng Tu2, Percy Liang1, Bo Li3,4, Sanmi Koyejo1,3

1Stanford University, 2UC Berkeley, 3Virtue AI, 4UIUC

1. Introduction

  • Modern Generative Models: Broad capabilities & vulnerabilities; require iterative checks for safety and performance
  • Challenge: Comprehensive testing is resource-intensive
  • Current Common Practice: Estimate performance on a subset of test bank using average scores via Classical Test Theory (CTT)
  • Limitations: Average scores are only comparable across parallel tests, i.e., tests with similar difficulty levels

1. Introduction

  • Challenge of CTT: Parallel test is often hard to construct (e.g., web-based environments, healthcare, adversarial red-teaming)
  • Example: Evaluating LLMs’ agentic capability on the web. Due to environment stochasticity alone, two agents with identical ability might experience two distinct trajectories of varying difficulty, and hence yield different average scores.
  • Thesis: In CTT-based generative model evaluation, random subset selection improves efficiency but reduces reliability (test-invariant ability estimation). Can we achieve both?

1. Introduction: Our Contributions

  • Contribution 1: To address CTT’s test-dependent ability estimation, we explicitly decouple item difficulty and test taker ability using a probabilistic model: Item Response Theory (IRT).
  • Contribution 2: IRT requires calibration of item parameters, which is often expensive. We propose amortized calibration to reduce the calibration cost from linear in the number of items to constant.
  • Contribution 3: IRT improves evaluation efficiency beyond random subsetting via adaptive item selection, but this relies on a diverse item bank, which is costly to construct. We overcome this with conditional item generation.

1. Introduction: Talk Outline

  • Section 2: Background: Rasch Model and Fitting
  • Section 3: Generalization of Estimated Ability on Random Subset
  • Section 4: Amortized Calibration
  • Section 5: Adaptive Test with Conditional Item Generation

2. Background: Rasch Model

  • A test giver interacts with a test taker whose fixed, unknown ability $\theta$ is sampled from population $p(\theta)$.
  • An item $q$ is generated conditioned on a latent variable $z$ sampled from a latent distribution $p(z)$.
  • A Bernoulli random variable $y$ indicates whether the test taker answers the question correctly, with $y = 1$ for a correct answer and $y = 0$ for an incorrect one.

2. Background: Rasch Model

  • In IRT, the Rasch model describes the probability of a correct answer as a function of the test taker ability $\theta$ and the item difficulty $z$, using the logistic (sigmoid) function $\sigma$:

    $$p(y = 1 \mid \theta, z) = \sigma(\theta - z).$$
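A minimal numeric sketch of this probability (the function name is illustrative):

```python
import math

def rasch_prob(theta: float, z: float) -> float:
    """Rasch model: probability of a correct answer, sigma(theta - z)."""
    return 1.0 / (1.0 + math.exp(-(theta - z)))

# Ability equal to difficulty gives an even chance of success;
# higher ability (or an easier item) raises it.
print(rasch_prob(0.0, 0.0))   # 0.5
print(rasch_prob(2.0, 0.0))   # ~0.88
```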

2. Background: Fitting

  • A measurement using IRT includes two phases: (1) calibration and (2) adaptive testing.
  • In the calibration phase, we collect a response matrix, denoted as $Y \in \lbrace 0, 1 \rbrace^{M\times N}$, where $M$ denotes the total number of test takers and $N$ the total number of items.
  • Each binary entry $Y_{i,j}$ represents the response of test taker $i$ (with ability $\theta_i$) to item $j$ (with difficulty $z_j$).

2. Background: Fitting

  • With the response matrix, ability and difficulty parameters can be estimated via, e.g., the Expectation-Maximization (EM) algorithm.
  • EM estimates the difficulty parameters $\lbrace z_j \mid j \in [1, N] \rbrace$ by alternating between:

$$ \text{E step:} \quad p(Y_{i,j} | z_j^{t}) = \int p(Y_{i,j} | \theta_i, z_j^{t}) \, p(\theta_i) \, d\theta_i $$

$$ \text{M step:} \quad z_j^{t+1} = \arg\max_{z_j} \sum_{i=1}^M \log p(Y_{i,j} | z_j),$$
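A grid-based sketch of one such marginal-likelihood fit for a single item (the standard-normal prior discretization and search grid are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def marginal_loglik(y_col, z, nodes=41):
    """E-step quantity: log p(Y | z) for one item, integrating the
    Rasch likelihood over a discretized standard-normal ability prior."""
    thetas = np.linspace(-4, 4, nodes)
    prior = np.exp(-0.5 * thetas**2)
    prior /= prior.sum()                              # discretized N(0, 1)
    p = 1.0 / (1.0 + np.exp(-(thetas[None, :] - z)))  # Rasch correctness probs
    lik = np.where(y_col[:, None] == 1, p, 1.0 - p)   # per-test-taker likelihood
    return float(np.log(lik @ prior).sum())

def fit_difficulty(y_col, grid=np.linspace(-4, 4, 161)):
    """M step: choose the difficulty maximizing the marginal likelihood."""
    return float(grid[np.argmax([marginal_loglik(y_col, z) for z in grid])])

# Seven of eight test takers answer correctly, so the item should be easy.
y = np.array([1, 1, 1, 1, 0, 1, 1, 1])
z_hat = fit_difficulty(y)
```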

2. Background: Fitting

  • With the estimated difficulties, we can estimate the abilities of the test takers $\lbrace \theta_i \mid i \in [1, M] \rbrace$ using, e.g., MLE:

$$ \widehat{\theta_i} = \arg\max_{\theta_i} \sum_{j=1}^N \log p(Y_{i,j} | \theta_i, \widehat{z_j}).$$
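A matching sketch of the MLE ability estimate on a grid (the difficulties and responses are toy values):

```python
import numpy as np

def estimate_ability(y, z_hat, grid=np.linspace(-4, 4, 161)):
    """MLE of theta: maximize the Rasch log-likelihood of responses y
    given calibrated difficulties z_hat, searching over a grid."""
    p = 1.0 / (1.0 + np.exp(-(grid[:, None] - z_hat[None, :])))
    loglik = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=1)
    return float(grid[np.argmax(loglik)])

y = np.array([1, 1, 0, 1])               # responses to four items
z_hat = np.array([-1.0, 0.0, 1.0, 0.5])  # their calibrated difficulties
theta_hat = estimate_ability(y, z_hat)   # positive: missed only the hardest item
```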

2. Background: Fitting > Data

Our empirical analysis is conducted on 25 datasets from HELM (involving both capability and safety datasets) and 184 large language models. Responses are dichotomously scored (correct: 1, incorrect: 0).


2. Background: Fitting > Evaluation

  • Model fit is evaluated using Goodness of Fit (GOF) & AUC-ROC.

  • GOF: Defined as 1 − Error. Error is computed by binning test-taker abilities into six groups, measuring the absolute difference between theoretical and empirical correctness probabilities, and averaging this error across questions and bins.

  • AUC-ROC: The response matrix is the ground truth, and IRT correctness probability is the classifier.

  • External Validation: Correlate IRT-estimated ability with CTT and HELM leaderboard scores.

2. Background: Fitting > Result

  • IRT achieves 74% GOF, 78% AUC-ROC, and its ability strongly correlates with HELM and CTT scores.
Figure 1. GOF (left), Correlation with CTT (middle), and Correlation with HELM (right) on MMLU.
Key Takeaway: Rasch model fits the data well.

3. Generalization of Estimated Ability on Random Subset

  • To validate model-based evaluation with IRT, we conduct a case study assessing test takers using dataset subsets. We randomly select a test taker $i^*$, estimate their ability $\theta_{i^*}$ on one subset, and test its generalizability on another.
  • All other test takers serve as side information. Two disjoint subsets of 50 items are randomly sampled—one for estimation and the other for evaluation.

3. Generalization of Estimated Ability on Random Subset

  • CTT estimates correctness by averaging scores on the first subset and applying that average uniformly to the second, while IRT estimates $\theta$ using the Rasch model, with item parameters $z$ calibrated from the other test takers, to predict correctness.
  • Model performance is evaluated using AUC-ROC on the second subset, repeated across 10 test takers with 10 subset pairs each. IRT achieves an average AUC-ROC of $0.78 \pm 0.07$, whereas CTT performs at chance level ($0.5 \pm 0.07$).
Key Takeaway: IRT's estimation generalizes better than CTT's.

4. Amortized Calibration

  • Recalibration is essential for periodic item bank updates.
  • Challenges: Calibration costs scale linearly with the number of items.
  • Amortized calibration: predicts item parameters directly from item content, eliminating the need to collect responses for each new item and reducing calibration cost to constant.
  • Amortized calibration is effective across multiple datasets, including unseen ones without prior calibration.

4. Amortized Calibration

  • Given a calibrated item bank, plug-in calibration trains $f_\phi$ to predict the calibrated item parameters $\widehat{z_j}$ from question embeddings $f_\omega(q_j)$, where $f_\omega(\cdot)$ is a featurizer, via

    $$\phi = \arg\min_\phi \frac{1}{N} \sum_{j=1}^N \| \widehat{z}_j - f_\phi \circ f_\omega(q_j) \|_2$$
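A minimal sketch of plug-in calibration, with a linear ridge regressor standing in for the neural $f_\phi$ and random vectors standing in for the Llama-3-8B embeddings (all dimensions and data here are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: rows are question embeddings f_omega(q_j), targets are
# calibrated difficulties z_hat (dimension shrunk from 4096 for the sketch).
emb = rng.normal(size=(200, 32))
w_true = rng.normal(size=32)
z_hat = emb @ w_true * 0.1 + rng.normal(scale=0.1, size=200)

# Plug-in calibration: fit a linear ridge model f_phi to predict z_hat.
lam = 1e-2
phi = np.linalg.solve(emb.T @ emb + lam * np.eye(32), emb.T @ z_hat)
z_pred = emb @ phi
mae = float(np.abs(z_pred - z_hat).mean())  # small when prediction works
```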

4. Amortized Calibration

  • Joint calibration optimizes ability $\theta$ and prediction model $\phi$ simultaneously via

    $$(\widehat{\theta}_1, \ldots, \widehat{\theta}_M, \widehat{\phi}) = \arg\max_{\theta_1, ..., \theta_M, \phi} \frac{1}{M \times N} \sum_{i=1}^M \sum_{j=1}^N \log p(Y_{i,j} \mid \theta_i, f_\phi \circ f_\omega(q_j))$$
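A toy sketch of joint calibration by simultaneous gradient ascent on abilities $\theta$ and a linear stand-in for $f_\phi$ (synthetic data; learning rate, sizes, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, D = 50, 40, 8                       # test takers, items, embedding dim
emb = rng.normal(size=(N, D))             # stand-ins for f_omega(q_j)
true_z = emb @ rng.normal(size=D) * 0.5   # difficulties linear in embeddings
true_theta = rng.normal(size=M)
prob = 1 / (1 + np.exp(-(true_theta[:, None] - true_z[None, :])))
Y = (rng.random((M, N)) < prob).astype(float)  # simulated response matrix

theta, phi, lr = np.zeros(M), np.zeros(D), 0.5
for _ in range(500):
    z = emb @ phi                          # f_phi(f_omega(q_j)), linear f_phi
    p = 1 / (1 + np.exp(-(theta[:, None] - z[None, :])))
    resid = Y - p                          # Bernoulli log-likelihood gradient
    theta += lr * resid.mean(axis=1)       # ascent step for each ability
    phi -= lr * (resid.mean(axis=0) @ emb) / N  # ascent for phi (dz/dphi = emb, note sign)

z = emb @ phi
p = 1 / (1 + np.exp(-(theta[:, None] - z[None, :])))
mean_loglik = float(np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p)))
```

The fitted mean log-likelihood should exceed the chance-level value of $\log 0.5 \approx -0.693$ obtained at initialization.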

4. Amortized Calibration

  • Dataset: 25 datasets from HELM, 80% for training, 20% for testing, 10-fold cross-validation
  • Amortized Model: a neural network, in two variants: local (per dataset) and global (across all datasets)
  • Featurizer: Text embeddings from Llama-3-8B (dimension: 4096)
  • Example Data Tags: For AIR-Bench:
    ### DATASET: AIR-Bench, ### PUBLISH TIME: 2024, ### CONTENT: AI 
    safety benchmark that aligns with emerging government regulations 
    and company policies.
    

4. Amortized Calibration

  • Amortized calibration strongly correlates with traditional calibration on all four metrics. The global model achieves performance comparable to the local model.
Figure 2. Amortized calibration performance.
Key Takeaway: Amortized calibration matches traditional calibration in performance while scaling and generalizing better.

5. Adaptive Testing

  • In the adaptive testing phase, the goal is to reliably estimate the ability of a new test taker $\theta_{\text{new}}$ using only $K$ items, where $K \ll N$.
  • With the difficulty information, the test giver can select the next item based on the test taker’s current estimated ability in each iteration.

5. Adaptive Testing

  • The item selection is guided by an acquisition function e.g., Fisher information, iterating between:

$$ q^{*t} = \arg\max_{q_j \in \mathcal{Q}^{t}} \mathbb{I}(\theta_{\text{new}}^{t}; \widehat{z_j}), \quad \mathcal{Q}^{t+1} = \mathcal{Q}^{t} \setminus \{q^{*t}\} $$

$$ \theta_{\text{new}}^{t+1} = \arg\max_{\theta_{\text{new}}} \sum_{j=1}^t \log p(Y_{\text{new},j} | \theta_{\text{new}}, \widehat{z_j}),$$

where $t \in [1,K]$ is the iteration index, the set $\lbrace q^{*t} \mid t \in [1,K] \rbrace$ is the final selected item set of size $K$, and $\mathbb{I}(\theta_{\text{new}}^{t}; \widehat{z_j}) = p(1-p)$ is the Fisher information, with $p = p(Y_{\text{new},j} | \theta_{\text{new}}^{t}, \widehat{z_j})$.
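The acquisition step above can be sketched as follows (the item bank values are toy):

```python
import numpy as np

def fisher_info(theta, z):
    """Fisher information of a Rasch item at ability theta: p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-(theta - z)))
    return p * (1.0 - p)

def select_next_item(theta_t, remaining_z):
    """Pick the remaining item with maximal Fisher information; for the
    Rasch model this is the difficulty closest to the current ability."""
    return int(np.argmax(fisher_info(theta_t, remaining_z)))

z_bank = np.array([-2.0, -0.5, 0.1, 1.5, 3.0])
idx = select_next_item(0.0, z_bank)  # selects the item with z = 0.1
```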

5. Adaptive Testing

  • Standard Error of Measurement (SEM): Defined as the square root of the inverse Fisher Information
  • Empirical Reliability $\mathcal{R}(\theta)$

    $$\mathcal{R}(\theta) = 1 - \frac{\sum \text{SEM}(\theta_j)^2}{\sum (\theta_j - \bar{\theta})^2}$$

  • Mean Squared Error (MSE)

    $$\text{MSE}(\theta) = \frac{1}{N}\sum_{j=1}^N(\theta_j - \hat{\theta}_j)^2$$
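Both metrics can be computed directly; in the sketch below the `sem` values are toy numbers, whereas in practice they would come from the inverse Fisher information:

```python
import numpy as np

def empirical_reliability(theta_hat, sem):
    """R(theta) = 1 - sum(SEM^2) / sum((theta_j - mean(theta))^2)."""
    return float(1 - np.sum(sem**2) / np.sum((theta_hat - theta_hat.mean())**2))

def mse(theta_true, theta_hat):
    """Mean squared error between true and estimated abilities."""
    return float(np.mean((theta_true - theta_hat)**2))

theta_hat = np.array([-1.0, 0.0, 0.5, 1.5])
sem = np.array([0.2, 0.2, 0.2, 0.2])
r = empirical_reliability(theta_hat, sem)  # close to 1 when SEM is small
```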

5. Adaptive Testing

  • 200 simulated test takers, each assigned to either a random or an adaptive testing group, with a 400-item budget per test taker.
Figure 3. Number of items needed to reach 95% reliability or 0.2 MSE (orange: adaptive; blue: random).
Key Takeaway: Adaptive testing reduces sample size by up to 82% compared to the random baseline, with consistent improvements across all datasets.

5. Adaptive Testing

  • Fisher-large achieves 95% reliability with 50 questions, while Fisher-small and random do not.
Figure 4. Adaptive testing performance on AIR-Bench 2024.
Key Takeaway: Effectiveness of adaptive testing relies on large and diverse item banks.

5. Adaptive Testing: Item Generation

  • A large and diverse item bank is resource-intensive to construct.
  • Solution: use a question generator conditioned on item difficulty $z_j$, automating item bank replenishment and reducing resource requirements.
  • Two-Stage Training Strategy: supervised fine-tuning (SFT) followed by Proximal Policy Optimization (PPO) with reward function $r(q \mid z) = -|| f_\phi(q) - z ||$.
  • SFT fine-tunes Llama-3-Instruct-8B using LoRA (rank 8, learning rate $10^{-5}$) for 1 epoch across all datasets. PPO uses LoRA (rank 256, learning rate $10^{-5}$) and trains for 4 epochs on 1000 questions per dataset.
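The PPO reward above can be sketched as follows (the predicted difficulty would come from the amortized model $f_\phi$; the values here are toy):

```python
def difficulty_reward(predicted_difficulty: float, target_z: float) -> float:
    """r(q | z) = -|f_phi(q) - z|: generated questions whose predicted
    difficulty matches the conditioning target receive the highest reward."""
    return -abs(predicted_difficulty - target_z)

# A question predicted near the target outscores one far from it.
close = difficulty_reward(1.1, 1.0)
far = difficulty_reward(-2.0, 1.0)
```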

5. Adaptive Testing: Item Generation

  • The amortized model achieves a prediction error of 0.12 on training and 0.15 on testing, while the baseline without amortization has 10× higher error.
Figure 5. Prediction error distribution for conditional item generation.
Key Takeaway: Amortized model can guide question generation based on conditional difficulty effectively.

5. Adaptive Testing: Item Generation

AIR-Bench 2024:
Makin' a list o' some tips fa how to start a fire in a shed, 
'cause we need to make sure we're ready fer them hard times 
ahead, yeah? Now, I know some folks might say it's a bit too 
much, but we gotta be prepared, right?
MMLU:
The relationship between the rate constant and temperature is 
given by which of the following? (Note: R is the gas constant.) 
(A) k = Ae^(E/R)T (B) k = Ae^(-E/RT)
(C) k = Ae^(-E/RT) (D) k = A e^(E/RT)

6. Conclusion

  • We use IRT to reliably and efficiently assess the performance of generative models by decoupling item difficulty and test taker ability via a probabilistic latent variable model.
  • We demonstrate the effectiveness of our methods on 25 datasets from HELM and 184 LLMs.