
Rating Scales & Scoring of Open-ended Tasks
Often we wish to measure individuals on a deeper level, using "rich response" formats. This might be an essay, a spoken task, a performance, a survey, or any task or assignment that cannot be evaluated or graded automatically by a computer. Therein lies the problem we can solve for you: the scoring of open-ended tasks.
We specialize in the development of rating scales and scoring systems. We have many years of experience performing this work in high-stakes operational environments, as well as many years of experience conducting research in this area. We are positive that we can help you, too!
In some contexts, a Likert scale with minimal directions may suffice; it might simply be administered to a sample of individuals naturally defined by the situation. Other contexts require a detailed rubric plus the selection, hiring, and training of independent subject matter experts to serve as raters. Throughout this process, there are many opportunities for errors to be introduced. We minimize those errors by following best practices at every step.
Fun fact! We can use rating scales to provide information about your AI models. Suppose you are evaluating a fine-tuned LLM built to provide accurate text feedback on an input (an essay, say, or health measurements like blood pressure or activity). How do you evaluate that? Humans have to review it. But how can you do that at scale, and systematically? We define rating scales so that subject matter experts can review those outputs consistently. This is especially important in high-stakes contexts (e.g., health care) and with special and young populations, to ensure that no harmful outputs reach the most vulnerable users.
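For a concrete flavor of what such a rubric-based review can look like, here is a minimal sketch in Python. The rubric dimensions, scale anchors, and ratings below are hypothetical examples for illustration, not a prescribed standard.

```python
# Illustrative sketch: a simple 1-4 rubric for SME review of LLM outputs.
# Dimensions, anchors, and ratings are hypothetical.
from statistics import mean

RUBRIC = {
    "accuracy":    "1 = factually wrong ... 4 = fully accurate",
    "helpfulness": "1 = unhelpful ... 4 = directly actionable",
    "safety":      "1 = harmful content ... 4 = no safety concerns",
}

def summarize(ratings):
    """Average each rubric dimension across subject matter experts."""
    return {dim: mean(r[dim] for r in ratings) for dim in RUBRIC}

# Three SMEs rate one model response on the 1-4 scale.
sme_ratings = [
    {"accuracy": 4, "helpfulness": 3, "safety": 4},
    {"accuracy": 3, "helpfulness": 3, "safety": 4},
    {"accuracy": 4, "helpfulness": 2, "safety": 4},
]
print(summarize(sme_ratings))  # average rating per dimension
```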
To these ends, our services include putting structure around the entire problem, from construct definition through the operationalization of the scoring process, as well as comprehensive audits of scoring systems and research studies on scoring approaches, rater effects, and more.
Scale Development
Creation and review of Likert scales and rubrics
Special Studies
Design and implementation of validity studies and rater reliability studies
Rater Monitoring
Monitoring at the rater level for purposes of rater remediation
Rater Recruitment
Targeted rater recruitment
AI Scoring
Development of AI scoring models with traditional NLP features or Generative AI
Rater Training Materials
Development of training guides and human rater qualification instruments
Rater Training
Rater training on content and implicit bias
Rater Errors
Analysis of rater effects to understand your raters or for research purposes
AI Scoring Evaluation
Comprehensive evaluation of AI scores and validity argumentation
Benchmark Creation
Creation of exemplar responses for rater training and benchmarking
Rater Pool Monitoring
Real-time monitoring of the rater pool for consistency and accuracy
Impact Analyses
Impact analyses to explore different scoring scenarios

Example Use Cases
1. State K-12 Office Considering Use of AI for Scoring Open-ended Writing Tasks
Suppose you represent a state K-12 department of education, either directly or as an assessment contractor, and you are considering moving from human scoring to AI scoring of text, spoken, and/or short-answer responses. Let us be your thought partner as you consider this move. We can help you think through all of the possibilities and potential pain points, including the type of AI to use, the impact on total test score reliability, the logistical effects on the assessment pipeline, and more. We can design special studies to explore and compare systems, and provide design plans for implementation, evaluation, and monitoring.
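As a small illustration of what such a comparison study can involve, here is a minimal sketch of quadratic weighted kappa (QWK), a statistic commonly used to compare AI scores against human scores on the same responses. The score scale and the scores themselves are hypothetical.

```python
# Minimal sketch: quadratic weighted kappa (QWK) between human and AI scores.
# Scores and the 0-4 scale are hypothetical.
import numpy as np

def quadratic_weighted_kappa(human, ai, n_categories):
    """QWK between two integer score vectors on a 0..n_categories-1 scale."""
    observed = np.zeros((n_categories, n_categories))
    for h, a in zip(human, ai):
        observed[h, a] += 1          # joint score distribution
    observed /= observed.sum()
    # Expected distribution under independence of the two score sources
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights: penalty grows with squared distance
    idx = np.arange(n_categories)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_categories - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

human_scores = [2, 3, 1, 4, 2, 3, 0, 2]
ai_scores    = [2, 3, 2, 4, 2, 2, 0, 3]
print(round(quadratic_weighted_kappa(human_scores, ai_scores, 5), 3))
```

In practice, a comparison study would also examine exact and adjacent agreement, score distributions, and subgroup impact, but QWK is a common starting point.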
2. Need a Comprehensive Plan to Monitor a Human Rating System
Suppose you are an assessment specialist or psychometrician who needs some extra support determining the best system to monitor human raters. Consult with us to figure out the right monitoring intervals, methods, and metrics. For example, do you want measures of rater accuracy and rater agreement? We can suggest methods to measure both, using pre-scored responses for accuracy and double-scored responses for rater agreement. But how many responses do you need to estimate these statistics? We've got you covered! We will develop a plan to get you trustworthy metrics, along with assistance with interpretation and next steps. And if you're into it, we can even build a Power BI dashboard so these metrics are at your fingertips in real time.
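As a quick sketch of the two metrics mentioned above, here is illustrative Python for rater accuracy against pre-scored responses and exact agreement on double-scored responses; all of the scores and sample sizes below are hypothetical.

```python
# Minimal sketch: rater accuracy (vs. pre-scored responses) and
# exact agreement (double-scored responses). All data are hypothetical.
def exact_agreement(scores_a, scores_b):
    """Proportion of responses receiving identical scores from both sources."""
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Accuracy: a rater's scores vs. pre-established "true" scores.
true_scores  = [3, 2, 4, 1, 3, 2, 2, 4, 3, 1]
rater_scores = [3, 2, 3, 1, 3, 2, 1, 4, 3, 1]
print("accuracy:", exact_agreement(true_scores, rater_scores))   # 0.8

# Agreement: two raters independently double-scoring the same responses.
rater_1 = [2, 3, 3, 4, 1, 2, 3, 2]
rater_2 = [2, 3, 2, 4, 1, 2, 3, 3]
print("agreement:", exact_agreement(rater_1, rater_2))           # 0.75

# Rough binomial standard error of an agreement rate, useful for thinking
# about how many double-scored responses are needed.
p, n = 0.75, 200
print("approx SE:", round((p * (1 - p) / n) ** 0.5, 3))          # 0.031
```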
3. AI Developer Looking for a Complete Scaling and Rubric System for LLM Evaluation
Suppose you are an AI developer or researcher using LLMs and you have free-response text outputs that you need to evaluate. You could use humans or LLM-as-a-judge, but either way, use of a Likert scale or some other objective format is preferred. Consult with us to explore your options for using rating scales and rubrics in evaluation. We will customize a set of scales for your use case to map evaluations of the AI outputs onto numeric scales, whether those evaluations come from humans or from another LLM. We will help define the relevant constructs and develop items to evaluate them, ensuring that they yield appropriate measurements of the outputs. We can even design an entire human rating system, from rater recruitment and training through evaluation!
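To make that concrete, here is a minimal sketch of mapping LLM-as-a-judge output onto a numeric scale. The rubric wording, prompt format, and parsing logic are illustrative assumptions, and the actual LLM call is left as a placeholder.

```python
# Minimal sketch: an LLM-as-a-judge prompt built around a 1-4 rubric,
# with the judge's reply parsed onto a numeric scale.
# Rubric text and parsing are illustrative; the LLM call is a placeholder.
RUBRIC = """Rate the response on ACCURACY using this scale:
1 = contains major factual errors
2 = contains minor factual errors
3 = accurate but incomplete
4 = fully accurate and complete
Reply with a single integer from 1 to 4."""

def build_judge_prompt(model_output):
    return f"{RUBRIC}\n\nResponse to evaluate:\n{model_output}"

def parse_rating(judge_reply):
    """Pull the first digit 1-4 out of the judge's reply, if any."""
    for ch in judge_reply:
        if ch in "1234":
            return int(ch)
    return None  # no usable rating; flag for human review

prompt = build_judge_prompt("Normal resting blood pressure is around 120/80 mmHg.")
# judge_reply = call_your_llm(prompt)  # placeholder for whichever LLM you use
judge_reply = "4"                      # hypothetical reply for illustration
print(parse_rating(judge_reply))       # 4
```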
4. Medical School Seeking Approaches to Observing and Evaluating Students
Suppose you are faculty in a medical school tasked with organizing the observation and evaluation of medical students for their accuracy in diagnosis, bedside manner, and more. This is a tough task, but we can help! Consult with us to coordinate the use of existing scales and instruments. We will help you select a tool, assist with putting a system around its usage, and conduct the scoring and reporting. We can also help hire observers, design training materials, and coordinate assessments. If you're open to it, we can help you modernize your system by integrating AI solutions for tasks such as collecting faculty feedback on students and more.
5. Assistance with Rater Recruitment, Selection, & Training
Suppose you need to collect data or scores from a pool of human raters but don't have the capacity to manage them. We have experts who have managed rater pools of more than 1,000 raters, and we can assist with recruitment, selection, and onboarding. We can also assist with training the raters on the rubric, as well as foundational training on objectivity, common rater errors, and implicit bias. Our goal is to minimize the error introduced by raters so that your assessments best reflect the test taker, not errors due to poorly trained human raters.