
AI Evaluation & Applied AI Sciences
Our specialty is AI Psychometrics: developing methods to measure and understand the capabilities, biases, and ethical considerations of AI models. Evaluating AI through the lens of psychometrics ensures a principled validation process that goes beyond traditional machine learning metrics. Measuring recall and precision is a great preliminary step, but with the high stakes involved in many applications, it's just not enough anymore.
In addition to evaluating AI with established batteries of assessments (IQ tests, for example), there are new constructs to measure: trust, safety, clarity, accuracy. To measure these constructs, we design rating scales (Likert scales), which are then applied by humans. Yes, we need humans to make sure our AI is working as intended! We recruit and train human raters to apply the rating scales for evaluation. This is particularly important when evaluating free-response outputs, for example when the AI gives text summaries as feedback or suggestions.
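For a flavor of what this looks like in practice, below is a minimal sketch of one quality check we rely on: quantifying agreement between two trained raters. The ratings and the 5-point "clarity" scale are invented for illustration, and the sketch assumes scikit-learn is available.

```python
# Minimal sketch: quantify agreement between two trained human raters
# applying a hypothetical 1-5 Likert scale for "clarity" to the same set
# of AI-generated responses. Ratings below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
rater_b = [5, 4, 3, 2, 3, 4, 2, 4, 3, 2]

# Quadratic weighting penalizes large disagreements (1 vs. 5) more than
# near-misses (4 vs. 5), which suits ordinal Likert data.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")  # values near 1.0 = strong agreement
```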
Do you simply need data for training your LLM? We can help with that! Our expertise in ensuring a high-quality rating process can reduce the error in human labeling and annotation. We may partner with annotation companies, but we will add psychometric rigor to the process.
Service Packages

The Psychometrics Checkup
One-time Audit
We'll check whether you're on target and using psychometric methods appropriately. We will review your entire system, including overall alignment, rating scales, rubrics, benchmark tests, and more.
- Deliverable: A narrative summary with actionable next steps and recommendations to improve your approach.
- Flat-fee pricing.

Principled Evaluation Framework
System Design/Overhaul
We will design your system with validity, reliability, and fairness in mind and ensure it's aligned with your product goals. By the end of system implementation, you'll know your AI outputs are trustworthy and defensible.
- Deliverable: A comprehensive plan for integrating principled design and psychometric methods into your evaluation plan.
- Contact us to discuss pricing.

Validity Lab Partnership
Ongoing Support
Do you need a thought partner? Need ongoing help with your company-wide initiatives? We can be available as an extension of your team, whether through impromptu sanity-check meetings or regular systems monitoring. Let's discuss what you need.
- Deliverable: Ongoing support and evaluation as a trusted technical advisor.
- Monthly retainer.
What do you need to improve your pipeline?
Human Rater Systems
Design and management of expert rating systems to evaluate AI
AI Validation Audits & Studies
Use of validity frameworks from the field of psychometrics and assessment to comprehensively evaluate your AI
Generative AI Research
Research comparing LLMs, prompting techniques, and more
LLM Evaluation
Statistical evaluation of AI, using state-of-the-art metrics
LLM-as-a-Judge
Use of an LLM with rating scales and rubrics to evaluate LLM outputs (see the sketch after this list)
AI Construct Definitions
Identification and definition of AI-relevant constructs
Human Labeling Data
Collection of human annotations to create training data for LLMs
AI Evaluation in Assessments
Comprehensive evaluation of AI usage in assessment (of humans) along with validity argumentation
AI Assessments
Identification of relevant established standardized assessments to evaluate AI
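To illustrate the LLM-as-a-Judge service above, here is a minimal sketch of a rubric-anchored judge. `call_llm` is a hypothetical placeholder for whichever model API you use, and the rubric wording is invented for illustration.

```python
# Minimal sketch of LLM-as-a-Judge: an LLM applies a rubric-anchored
# rating scale to another model's output. `call_llm` is a hypothetical
# placeholder for your model API; the rubric and scale are illustrative.
JUDGE_PROMPT = """You are an expert rater. Score the RESPONSE on accuracy
using this 1-5 rubric:
1 = contains major factual errors
3 = mostly accurate with minor issues
5 = fully accurate and well supported
Reply with the integer score only.

QUESTION: {question}
RESPONSE: {response}"""

def judge_accuracy(question: str, response: str, call_llm) -> int:
    """Ask a judge model for a 1-5 accuracy score and parse it."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"Judge returned an out-of-range score: {raw!r}")
    return score
```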


Case Studies
1
Establish System to Collect Expert Human Annotations
Suppose you want to fine-tune an LLM to perform a specific task, but you don't have any data to do it. Don't just collect data; optimize your fine-tuning with high-quality human annotation data. Work with us to develop a principled system to recruit subject matter experts, evaluate their knowledge to qualify them, develop training materials and benchmarks, and monitor their annotations to ensure they are useful for training. We will conduct the development work and manage the implementation if needed.
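As one illustrative piece of such a system, here is a minimal sketch of the qualification step: screening candidates against gold-standard benchmark items. The data format and the 0.85 threshold are assumptions for illustration.

```python
# Illustrative sketch: qualify candidate annotators by their agreement
# with gold-standard benchmark items before admitting them to production
# work. The data format and 0.85 threshold are assumed for illustration.

def qualifies(candidate_labels: dict, gold_labels: dict,
              threshold: float = 0.85) -> bool:
    """Return True if the candidate matches the gold labels often enough."""
    scored = [item for item in gold_labels if item in candidate_labels]
    if not scored:
        return False
    agreement = sum(
        candidate_labels[item] == gold_labels[item] for item in scored
    ) / len(scored)
    return agreement >= threshold

# Example: a candidate who matches 9 of 10 gold items qualifies at 0.85.
gold = {f"item_{i}": "correct" for i in range(10)}
candidate = {**gold, "item_3": "incorrect"}
print(qualifies(candidate, gold))  # True (agreement = 0.90)
```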
2
Develop Likert Scales to Measure Relevant AI Constructs for Evaluation
Suppose you are an AI developer evaluating generative AI responses in a healthcare context. The AI generates open-ended suggestions to improve health outcomes. To ensure that the suggestions are valid, safe, unbiased, and accurate, you need human feedback, but you don't have a way to structure it. Consult with us to design a series of rating scales your human experts can use to map their evaluations of the AI outputs to numeric values. We will help define the relevant constructs and develop items to evaluate them, ensuring that they yield appropriate measurements of the outputs.
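To make this concrete, here is a minimal sketch of how such scales might be represented and scored. The constructs, item wording, and 1-5 scale are invented for illustration, not drawn from a real engagement.

```python
# Minimal sketch: represent Likert items per construct and aggregate a
# rater's 1-5 item responses into per-construct scores. The constructs
# and item wording below are invented for illustration.
from statistics import mean

SCALES = {
    "safety": [
        "The suggestion avoids recommending anything harmful.",
        "The suggestion defers to a clinician when appropriate.",
    ],
    "accuracy": [
        "The suggestion is consistent with established health guidance.",
    ],
}

def score_constructs(ratings: dict) -> dict:
    """Average one rater's item ratings into a score per construct."""
    scores = {}
    for construct, items in SCALES.items():
        values = ratings[construct]
        if len(values) != len(items):
            raise ValueError(f"Expected one rating per item for {construct}")
        scores[construct] = mean(values)
    return scores

# One rater's item-level ratings for a single AI-generated suggestion.
print(score_constructs({"safety": [5, 4], "accuracy": [4]}))
# {'safety': 4.5, 'accuracy': 4}
```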
3
Proposal to Integrate AI into Certification Testing Program
Suppose you are the leader of a small certification company who wants to make the leap to incorporating AI into your assessment pipeline. Consult with us to conduct a thorough review of your testing program to determine where you can integrate new AI applications or refine existing ones.
We will review all of your systems, from item development to score reporting, to derive a plan for your future.
4
Presentation to School District on AI Literacy and AI in the Classroom
Suppose you are a visionary faculty member in your school district and want to educate faculty and staff on the basics of AI, how AI might be used in the classroom, how students might use AI, AI literacy, and pain points to watch out for. We will deliver a workshop that prepares your staff for the future of their classrooms and serve as an ongoing consultant to the district as we all navigate the changing landscape of education together.