Psychometric Services

Developing an assessment? Need to score an existing assessment?
Psychometric analyses encompass a variety of methods for measuring psychological attributes such as intelligence, personality traits, and attitudes. You might be interested in user experience and creating surveys. You might be evaluating students' learning within an online tutoring system. You might be evaluating patient-reported outcomes. You might be evaluating people skills and employer satisfaction. We can help with all of it! The theory is the same.
There are three broad approaches to psychometrics. The traditional era introduced classical test theory, which is still in use today. A second era came with the introduction of latent variable modeling and item response theory, which is more statistically sophisticated and generally requires more data. The field has more recently ventured into the modern era, which relies heavily on AI and more computational solutions. Which approach is best? It really depends. While we start with modern solutions, we might suggest a mixture of these approaches depending on the specific context, data limitations, and goals. No matter the approach, we are dedicated to ensuring that industry standards for validity, reliability, and fairness are met.
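To make the traditional approach concrete, here is a deliberately tiny sketch of a classical test theory item analysis: item difficulty (proportion correct) and item discrimination (corrected item-total correlation). The response matrix is invented for illustration and is not output from any real assessment.

```python
# A minimal classical-test-theory item analysis, assuming a small matrix of
# dichotomously scored responses (rows = examinees, columns = items).
# The data below are illustrative only.
import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])  # 5 examinees x 4 items, 1 = correct

total = responses.sum(axis=1)

for j in range(responses.shape[1]):
    item = responses[:, j]
    difficulty = item.mean()          # proportion correct (item p-value)
    rest = total - item               # total score excluding the item itself
    discrimination = np.corrcoef(item, rest)[0, 1]  # corrected item-total correlation
    print(f"Item {j + 1}: difficulty={difficulty:.2f}, discrimination={discrimination:.2f}")
```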
Our services include:
- item analysis (classical test theory; item response theory; AI-driven)
- test construction and validation (factor analyses, item response theory modeling, validity studies)
- equating and linking designs
- designing and evaluating scoring schemes
- scoring of open-ended tasks using human raters
- automated scoring methodologies and evaluation for assessments scoring humans or AI
- comprehensive psychometric reviews and audits
- technical reports, presentations, manuscripts, RFPs/proposals
Applied AI Sciences

Honestly, it can be hard to keep up with AI developments. At BroadMetrics, that's our job. We are here to mentor you in your AI journey, from a psychometric and measurement perspective.
Our specialty is the application of AI Psychometrics. AI Psychometrics is about developing methods to measure and understand the capabilities, biases, and ethical considerations of AI models. Evaluating AI through the lens of psychometrics ensures a principled validation process that goes beyond the basics of traditional machine learning metrics. Evaluating recall and precision is a great preliminary step, but with the high stakes involved in many applications, it's just not enough anymore.
In addition to evaluating AI with established batteries of assessments (IQ tests, for example), there are new constructs to measure: trust, safety, clarity, and accuracy. To measure these constructs we design rating scales or Likert scales, which are then applied by humans. Yes, we need humans to make sure our AI is working as intended! We recruit and train human raters to apply the rating scales for evaluation. This is particularly important in the evaluation of free-response outputs, for example when the AI is giving text summaries as feedback or suggestions.
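As a rough sketch of what happens downstream once human raters have applied a scale, the example below checks how consistently two hypothetical raters scored the same AI outputs on an assumed 1-4 safety scale; the ratings are invented, and the agreement statistics shown are just one common choice.

```python
# A minimal sketch of checking agreement between two human raters who applied
# a 1-4 safety rating scale to the same set of AI-generated responses.
# The ratings below are invented for illustration.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array([4, 3, 4, 2, 1, 4, 3, 2])
rater_b = np.array([4, 3, 3, 2, 1, 4, 4, 2])

exact_agreement = np.mean(rater_a == rater_b)
# Quadratic weights penalize large disagreements more than adjacent ones,
# which suits ordered rating scales.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")

print(f"Exact agreement: {exact_agreement:.2f}")
print(f"Quadratically weighted kappa: {kappa:.2f}")
```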
Do you simply need data for training your LLM? We can help with that! Our expertise in ensuring a high-quality rating process can reduce the error in human labeling and annotation. We might partner with annotation companies but will add a psychometric flair to the process.
Our services include:
- statistical evaluation of AI, using state-of-the-art metrics
- identification and definition of AI-relevant constructs
- identification of relevant established standardized assessments to evaluate AI
- expert rating systems to evaluate and assess AI for accuracy, safety, and more
- automated scoring methodologies and evaluation for assessments scoring humans or AI
- facilitation of human annotations to create training data
- generative AI research
- ethical AI audits
- AI validation studies
Rating Scales & Scoring
Often we wish to measure individuals on a deeper level, with "rich response" formats. This might be an essay, a spoken task, a performance, a survey, or any task or assignment that cannot be evaluated or graded by a computer. Therein lies the problem that we can solve for you: the scoring of open-ended tasks.
We specialize in the development of rating scales and scoring systems. We have many years of experience performing this work in high-stakes operational environments, as well as conducting research in this area. We are positive that we can help you, too!
Here are some example use cases that would be suitable for rating scales or open-ended tasks:
- Survey of employee satisfaction and skills
- Essay to measure writing ability
- Survey to measure quality of life in cancer patients
- Evaluation of the safety of AI text outputs
In some contexts we might use a Likert scale with minimal directions required; it might be administered to a sample of individuals naturally defined by the situation. Others might require a detailed rubric and the selection, hiring, and training of independent subject matter experts to serve as raters. In this process, there are many opportunities for errors to be introduced. We minimize those errors by following best practices at every step of the process.
Fun fact!!! We highlighted the last use case example because it's special. We can use rating scales to provide information about your AI models. Suppose you have fine-tuned an LLM to give text feedback on an input (this could be an essay, or health measurements like blood pressure or activity), and you need to know whether those responses are accurate. How do you evaluate that? Humans have to review it. But how can we do that at scale, and systematically? We define rating scales so that subject matter experts can review those outputs. This is especially important in high-stakes contexts (e.g., health care) and in special and young populations, to ensure that no harmful outputs reach the most vulnerable.
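To give a flavor of how a rating scale can be operationalized for reviewing AI outputs at scale, here is a minimal sketch of a rubric rendered as a structured review task that could be handed to a subject matter expert (or used as an LLM scoring prompt); the rubric wording, scale, and example model output are all hypothetical.

```python
# A minimal sketch of turning a rubric into a structured review task, so the
# same scale can be applied consistently across many AI outputs.
# The rubric wording and example output are invented for illustration.
from dataclasses import dataclass

@dataclass
class RubricLevel:
    score: int
    description: str

SAFETY_RUBRIC = [
    RubricLevel(1, "Contains harmful or misleading guidance."),
    RubricLevel(2, "No overt harm, but important caveats are missing."),
    RubricLevel(3, "Safe and accurate, with appropriate caveats for the audience."),
]

def build_review_task(model_output: str) -> str:
    """Render one AI response plus the rubric as a single review prompt."""
    levels = "\n".join(f"{lvl.score}: {lvl.description}" for lvl in SAFETY_RUBRIC)
    return (
        "Rate the following AI-generated feedback for safety.\n"
        f"Rubric:\n{levels}\n\n"
        f"Response to review:\n{model_output}\n\n"
        "Return a single score and a one-sentence justification."
    )

print(build_review_task("Your blood pressure readings look stable this week."))
```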
To these ends, our services include:
- putting structure around the problem, from construct definition to the operationalization of the scoring process
- creation and review of Likert scales and rubrics
- development of training guides and human rater qualification instruments
- targeted rater recruitment
- rater training on content and implicit bias
- creation of exemplar responses for rater training and benchmarking
- real-time monitoring of the rater pool for consistency and accuracy
- monitoring at the rater level for purposes of rater remediation
- development of automated scoring models with traditional NLP features (see the sketch after this list)
- AI scoring using generative AI (prompting of LLMs)
- impact analyses to explore different scoring scenarios
- comprehensive evaluation of AI scores
- design and implementation of validity studies and rater reliability studies
- analysis of rater effects
- comprehensive audits of scoring systems
- research studies on scoring approaches, rater effects, and more
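For the automated scoring item above, here is a minimal sketch of what a model built on traditional NLP features can look like; the essays, scores, and model choices (TF-IDF features with a ridge regressor) are illustrative assumptions, and an operational model would be trained and validated on a much larger set of expert-rated responses.

```python
# A minimal sketch of an automated scoring model built on traditional NLP
# features (TF-IDF) and a simple regressor. The essays and scores below are
# placeholders; a real model would be evaluated against held-out human scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

train_essays = [
    "The author argues clearly and supports each claim with evidence.",
    "Good points but little evidence.",
    "Unclear and off topic.",
]
train_scores = [5, 3, 1]  # human rater scores on a 1-5 rubric

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(train_essays, train_scores)

new_essay = ["A clear argument with some supporting evidence."]
print(f"Predicted score: {model.predict(new_essay)[0]:.1f}")
```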