Research News & Events

GETTING STARTED WITH LLM EVALUATION: A PRIMER FOR PSYCHOMETRICIANS

Artificial Intelligence in Measurement and Education Conference (AIME-Con)

Pittsburgh, PA

October 2025

WORKSHOP

Research Papers & Reports

Publications that synthesize our team’s long-standing and emerging research in assessment, psychometric methods, and AI evaluation.

Validity arguments for constructed response scoring using generative AI applications.

Casabianca, J. M., McCaffrey, D. F., Johnson, M. S., Alper, N., & Zubenko, V. (2025).

Measuring the accuracy of true score predictions for AI scoring evaluation.

McCaffrey, D. F., Casabianca, J. M., & Johnson, M. S. (2025).

The rise of artificial intelligence in educational measurement: Opportunities and ethical challenges.

Bulut, O., Beiting-Parrish, M., Casabianca, J. M., Slater, S. C., Jiao, H., Song, D., Ormerod, C. M., Fabiyi, D. G., Ivan, R., Walsh, C., Rios, O., Wilson, J., N., S., Wongvorachan, T., Liu, J. X., Tan, B., & Morilova, P. (2024).

Empirical Bayes estimation for evaluating subgroup biases in artificial intelligence scoring.

Kwon, S., McCaffrey, D. F., Jewsbury, P., & Casabianca, J. M. (2025).

Best practices for constructed-response scoring.

McCaffrey, D. F., Casabianca, J. M., Ricker-Pedley, K., Lawless, R., & Wendler, C. (2022).

Exploration of the proportional reduction in mean squared error for evaluating automated scores.

Casabianca, J. M., McCaffrey, D. F., Johnson, M., Ricker-Pedley, K., Rotou, O., & Martineau, J. (2023).

Psychometrics is all you need.

Casabianca, J. M. (2025).

Issues with quadratic weighted kappa in evaluation of CR and artificial intelligence scoring.

Lewis, J., & Casabianca, J. M. (n.d.).

Personalizing large-scale assessment in practice.

Buzick, H., Casabianca, J. M., & Gholson, M. (2023).

Statistical equivalence testing approaches for Mantel–Haenszel DIF analysis.

Casabianca, J. M., & Lewis, C. (2018).

Using linkage sets to improve connectedness in rater response model estimation.

Casabianca, J. M., Donoghue, J., Shin, H. J., Choi, I., & Chao, S. F. (2023).

Rater effects modeling.

Casabianca, J. M. (2022).

Accounting for rater effects with the hierarchical rater model framework when scoring simple structured constructed response tests.

Nieto, R., & Casabianca, J. M. (2019).

Detecting rater effects under rating designs with varying levels of missingness.

Stafford, R. E., Wolfe, E. W., Casabianca, J. M., & Song, T. (2018).

A hierarchical rater model for longitudinal data.

Casabianca, J. M., Junker, B. W., Nieto, R., & Bond, M. A. (2017).

Impact of measurement error on rating accuracy via the hierarchical rater model.

Casabianca, J. M., & Wolfe, R. (2017).

Recent & Upcoming Events

Events showcasing our team’s recent contributions and scheduled engagements across conferences, workshops, and invited sessions.

Title	Event Type	Meeting	Location	Date
Can AI generated rationale provide evidence that AI scores are valid?	Paper	Artificial Intelligence in Measurement and Education (AIME) Conference	Pittsburgh, PA, USA	28/10/2025
Getting Started with LLM Evaluation: A Primer for Psychometricians	Workshop	Artificial Intelligence in Measurement and Education (AIME) Conference	Pittsburgh, PA, USA	27/10/2025
Evaluating Rationales: A Comparative Study of LLMs and Human Raters in Assessing Language Learners’ Essays	Paper	National Council on Measurement in Education	Denver, CO, USA	26/04/2025
Validity Evidence for Use and Interpretation of Scores from Generative AI	Paper	National Council on Measurement in Education	Denver, CO, USA	25/04/2025
The Where, What, and How of the Job Market for Measurement Professionals	Panel Discussion	National Council on Measurement in Education	Denver, CO, USA	24/04/2025
Best Practices for AI Scoring	Workshop	National Council on Measurement in Education	Denver, CO, USA	23/04/2025
Best Practices for AI Scoring of Constructed Responses	Workshop	International Association for Educational Assessment	Philadelphia, PA, USA	22/09/2024
