
Rating Scales & Scoring of Open-ended Tasks
Often we wish to measure individuals on a deeper level, using "rich response" formats. This might be an essay, a spoken task, a performance, a survey, or any task or assignment that cannot be evaluated or graded automatically by a computer. Therein lies the problem we can solve for you: the scoring of open-ended tasks.
We specialize in the development of rating scales and scoring systems. We have many years of experience performing this work in high-stakes operational environments, as well as many years of experience conducting research in this area. We are positive that we can help you, too!
In some contexts, a Likert scale with minimal directions may suffice; it might simply be administered to a sample of individuals naturally defined by the situation. Other contexts require a detailed rubric plus the selection, hiring, and training of independent subject matter experts to serve as raters. Throughout this process, there are many opportunities for errors to be introduced. We minimize those errors by following best practices at every step.
Fun fact! We can use rating scales to provide information about your AI models. Suppose you are evaluating a fine-tuned LLM built to provide accurate text feedback on an input (an essay, say, or health measurements like blood pressure or activity). How do you evaluate that? Humans have to review it. But how can you do that at scale, and systematically? We define rating scales so that subject matter experts can review those outputs consistently. This is especially important in high-stakes contexts (e.g., health care) and with special and young populations, to ensure that no harmful outputs reach the most vulnerable users.
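For a concrete flavor of what such a rubric-based review can look like, here is a minimal sketch in Python. The rubric dimensions, scale anchors, and ratings below are hypothetical examples for illustration, not a prescribed standard.

```python
# Illustrative sketch: a simple 1-4 rubric for SME review of LLM outputs.
# Dimensions, anchors, and ratings are hypothetical.
from statistics import mean

RUBRIC = {
    "accuracy":    "1 = factually wrong ... 4 = fully accurate",
    "helpfulness": "1 = unhelpful ... 4 = directly actionable",
    "safety":      "1 = harmful content ... 4 = no safety concerns",
}

def summarize(ratings):
    """Average each rubric dimension across subject matter experts."""
    return {dim: mean(r[dim] for r in ratings) for dim in RUBRIC}

# Three SMEs rate one model response on the 1-4 scale.
sme_ratings = [
    {"accuracy": 4, "helpfulness": 3, "safety": 4},
    {"accuracy": 3, "helpfulness": 3, "safety": 4},
    {"accuracy": 4, "helpfulness": 2, "safety": 4},
]
print(summarize(sme_ratings))  # average rating per dimension
```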
To these ends, our services include putting structure around the entire problem, from construct definition through the operationalization of the scoring process, as well as comprehensive audits of scoring systems and research studies on scoring approaches, rater effects, and more.
Scale Development
Creation and review of Likert scales and rubrics
Special Studies
Design and implementation of validity studies and rater reliability studies
Rater Monitoring
Monitoring at the rater level for purposes of rater remediation
Rater Recruitment
Targeted rater recruitment
AI Scoring
Development of AI scoring models with traditional NLP features or Generative AI
Rater Training Materials
Development of training guides and human rater qualification instruments
Rater Training
Rater training on content and implicit bias
Rater Errors
Analysis of rater effects to understand your raters or for research purposes
AI Scoring Evaluation
Comprehensive evaluation of AI scores and validity argumentation
Benchmark Creation
Creation of exemplar responses for rater training and benchmarking
Rater Pool Monitoring
Real-time monitoring of the rater pool for consistency and accuracy
Impact Analyses
Impact analyses to explore different scoring scenarios

Example Use Cases
1. State K-12 Office Considering Use of AI for Scoring Open-ended Writing Tasks
Suppose you represent a state K-12 department of education, either directly or as an assessment contractor, and you are considering moving from human scoring to AI scoring of text, spoken, and/or short-answer responses. Let us be your thought partner as you consider this move. We can help you think through all of the possibilities and potential pain points, including the type of AI to use, the impact on total test score reliability, the logistical effects on the assessment pipeline, and more. We can design special studies to explore and compare systems, and provide design plans for implementation, evaluation, and monitoring.
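As a small illustration of what such a comparison study can involve, here is a minimal sketch of quadratic weighted kappa (QWK), a statistic commonly used to compare AI scores against human scores on the same responses. The score scale and the scores themselves are hypothetical.

```python
# Minimal sketch: quadratic weighted kappa (QWK) between human and AI scores.
# Scores and the 0-4 scale are hypothetical.
import numpy as np

def quadratic_weighted_kappa(human, ai, n_categories):
    """QWK between two integer score vectors on a 0..n_categories-1 scale."""
    observed = np.zeros((n_categories, n_categories))
    for h, a in zip(human, ai):
        observed[h, a] += 1          # joint score distribution
    observed /= observed.sum()
    # Expected distribution under independence of the two score sources
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights: penalty grows with squared distance
    idx = np.arange(n_categories)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_categories - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

human_scores = [2, 3, 1, 4, 2, 3, 0, 2]
ai_scores    = [2, 3, 2, 4, 2, 2, 0, 3]
print(round(quadratic_weighted_kappa(human_scores, ai_scores, 5), 3))
```

In practice, a comparison study would also examine exact and adjacent agreement, score distributions, and subgroup impact, but QWK is a common starting point.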
2. Need a Comprehensive Plan to Monitor a Human Rating System
Suppose you are an assessment specialist or psychometrician who needs some extra support determining the best system to monitor human raters. Consult with us to figure out the right monitoring intervals, methods, and metrics. For example, do you want measures of rater accuracy and rater agreement? We can suggest methods to measure both, using pre-scored responses for accuracy and double-scored responses for rater agreement. But how many responses do you need to estimate these statistics? We've got you covered! We will develop a plan to get you trustworthy metrics, along with assistance with interpretation and next steps. And if you're into it, we can even build a Power BI dashboard so these metrics are at your fingertips in real time.
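As a quick sketch of the two metrics mentioned above, here is illustrative Python for rater accuracy against pre-scored responses and exact agreement on double-scored responses; all of the scores and sample sizes below are hypothetical.

```python
# Minimal sketch: rater accuracy (vs. pre-scored responses) and
# exact agreement (double-scored responses). All data are hypothetical.
def exact_agreement(scores_a, scores_b):
    """Proportion of responses receiving identical scores from both sources."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Accuracy: a rater's scores vs. pre-established "true" scores.
true_scores  = [3, 2, 4, 1, 3, 2, 2, 4, 3, 1]
rater_scores = [3, 2, 3, 1, 3, 2, 1, 4, 3, 1]
print("accuracy:", exact_agreement(true_scores, rater_scores))   # 0.8

# Agreement: two raters independently double-scoring the same responses.
rater_1 = [2, 3, 3, 4, 1, 2, 3, 2]
rater_2 = [2, 3, 2, 4, 1, 2, 3, 3]
print("agreement:", exact_agreement(rater_1, rater_2))           # 0.75

# Rough binomial standard error of an agreement rate, useful for thinking
# about how many double-scored responses are needed.
p, n = 0.75, 200
print("approx SE:", round((p * (1 - p) / n) ** 0.5, 3))          # 0.031
```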
3. AI Developer Looking for a Complete Scaling and Rubric System for LLM Evaluation
Suppose you are an AI developer or researcher using LLMs and you have free-response text outputs that you need to evaluate. You could use humans or LLM-as-a-judge, but either way, use of a Likert scale or some other objective format is preferred. Consult with us to explore your options for using rating scales and rubrics in evaluation. We will customize a set of scales for your use case to map evaluations of the AI outputs onto numeric scales, whether those evaluations come from humans or from another LLM. We will help define the relevant constructs and develop items to evaluate them, ensuring that they yield appropriate measurements of the outputs. We can even design an entire human rating system, from rater recruitment and training through evaluation!
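To make that concrete, here is a minimal sketch of mapping LLM-as-a-judge output onto a numeric scale. The rubric wording, prompt format, and parsing logic are illustrative assumptions, and the actual LLM call is left as a placeholder.

```python
# Minimal sketch: an LLM-as-a-judge prompt built around a 1-4 rubric,
# with the judge's reply parsed onto a numeric scale.
# Rubric text and parsing are illustrative; the LLM call is a placeholder.
RUBRIC = """Rate the response on ACCURACY using this scale:
1 = contains major factual errors
2 = contains minor factual errors
3 = accurate but incomplete
4 = fully accurate and complete
Reply with a single integer from 1 to 4."""

def build_judge_prompt(model_output):
    return f"{RUBRIC}\n\nResponse to evaluate:\n{model_output}"

def parse_rating(judge_reply):
    """Pull the first digit 1-4 out of the judge's reply, if any."""
    for ch in judge_reply:
        if ch in "1234":
            return int(ch)
    return None  # no usable rating; flag for human review

prompt = build_judge_prompt("Normal resting blood pressure is around 120/80 mmHg.")
# judge_reply = call_your_llm(prompt)  # placeholder for whichever LLM you use
judge_reply = "4"                      # hypothetical reply for illustration
print(parse_rating(judge_reply))       # 4
```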
4. Medical School Seeking Approaches to Observing and Evaluating Students
Suppose you are faculty in a medical school tasked with organizing the observation and evaluation of medical students for their accuracy in diagnosis, bedside manner, and more. This is a tough task, but we can help! Consult with us to coordinate the use of existing scales and instruments. We will help you select a tool, assist with putting a system around its usage, and conduct the scoring and reporting. We can also help hire observers, design training materials, and coordinate assessments. If you're open to it, we can help you modernize your system by integrating AI solutions for tasks such as collecting faculty feedback on students and more.
5. Assistance with Rater Recruitment, Selection, & Training
Suppose you need to collect data or scores from a pool of human raters but don't have the capacity to manage them. We have experts who have managed rater pools of more than 1,000 raters, and we can assist with recruitment, selection, and onboarding. We can also assist with training the raters on the rubric, as well as foundational training on objectivity, common rater errors, and implicit bias. Our goal is to minimize the error introduced by raters so that your assessments best reflect the test taker, not errors due to poorly trained human raters.