
AI Evaluation & Applied AI Sciences
Our specialty is AI Psychometrics: developing methods to measure and understand the capabilities, biases, and ethical considerations of AI models. Evaluating AI through the lens of psychometrics ensures a principled validation process that goes beyond traditional machine learning metrics. Measuring recall and precision is a great preliminary step, but with the high stakes involved in many applications, it's just not enough anymore.
In addition to evaluating AI with established batteries of assessments (IQ tests, for example), there are new constructs to measure: trust, safety, clarity, accuracy. To measure these constructs, we design rating scales (Likert scales), which are then applied by humans. Yes, we need humans to make sure our AI is working as intended! We recruit and train human raters to apply the rating scales for evaluation. This is particularly important when evaluating free-response outputs, for example when the AI gives text summaries as feedback or suggestions.
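For a flavor of what this looks like in practice, below is a minimal sketch of one quality check we rely on: quantifying agreement between two trained raters. The ratings and the 5-point "clarity" scale are invented for illustration, and the sketch assumes scikit-learn is available.

```python
# Minimal sketch: quantify agreement between two trained human raters
# applying a hypothetical 1-5 Likert scale for "clarity" to the same set
# of AI-generated responses. Ratings below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
rater_b = [5, 4, 3, 2, 3, 4, 2, 4, 3, 2]

# Quadratic weighting penalizes large disagreements (1 vs. 5) more than
# near-misses (4 vs. 5), which suits ordinal Likert data.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Weighted kappa: {kappa:.2f}")  # values near 1.0 = strong agreement
```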
Do you simply need data for training your LLM? We can help with that! Our expertise in ensuring a high-quality rating process can reduce the error in human labeling and annotation. We may partner with annotation companies, but we will add psychometric rigor to the process.
Service Packages

The Psychometrics Checkup
One-time Audit
We'll check whether you're on target and using psychometric methods appropriately. We will review your entire system, including overall alignment, rating scales, rubrics, benchmark tests, and more.
- Deliverable: A narrative summary with actionable next steps and recommendations to improve your approach.
- Flat-fee pricing.

Principled Evaluation Framework
System Design/Overhaul
We will design your system with validity, reliability, and fairness in mind and ensure it's aligned with your product goals. By the end of system implementation, you'll know your AI outputs are trustworthy and defensible.
- Deliverable: A comprehensive plan for integrating principled design and psychometric methods into your evaluation plan.
- Contact us to discuss pricing.

Validity Lab Partnership
Ongoing Support
Do you need a thought partner? Need ongoing help with your company-wide initiatives? We can be available as an extension of your team, whether through impromptu sanity-check meetings or regular systems monitoring. Let's discuss what you need.
- Deliverable: Ongoing support and evaluation as a trusted technical advisor.
- Monthly retainer.
What do you need to improve your pipeline?
Human Rater Systems
Design and management of expert rating systems to evaluate AI
AI Validation Audits & Studies
Use of validity frameworks from the field of psychometrics and assessment to comprehensively evaluate your AI
Generative AI Research
Research comparing LLMs, prompting techniques, and more
LLM Evaluation
Statistical evaluation of AI, using state-of-the-art metrics
LLM-as-a-Judge
Use of an LLM with rating scales and rubrics to evaluate LLM outputs (see the sketch after this list)
AI Construct Definitions
Identification and definition of AI-relevant constructs
Human Labeling Data
Collection of human annotations to create training data for LLMs
AI Evaluation in Assessments
Comprehensive evaluation of AI usage in assessment (of humans) along with validity argumentation
AI Assessments
Identification of relevant established standardized assessments to evaluate AI
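To illustrate the LLM-as-a-Judge service above, here is a minimal sketch of a rubric-anchored judge. `call_llm` is a hypothetical placeholder for whichever model API you use, and the rubric wording is invented for illustration.

```python
# Minimal sketch of LLM-as-a-Judge: an LLM applies a rubric-anchored
# rating scale to another model's output. `call_llm` is a hypothetical
# placeholder for your model API; the rubric and scale are illustrative.
JUDGE_PROMPT = """You are an expert rater. Score the RESPONSE on accuracy
using this 1-5 rubric:
1 = contains major factual errors
3 = mostly accurate with minor issues
5 = fully accurate and well supported
Reply with the integer score only.

QUESTION: {question}
RESPONSE: {response}"""

def judge_accuracy(question: str, response: str, call_llm) -> int:
    """Ask a judge model for a 1-5 accuracy score and parse it."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    score = int(raw.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"Judge returned an out-of-range score: {raw!r}")
    return score
```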


Case Studies
1
Establish System to Collect Expert Human Annotations
Suppose you want to fine-tune an LLM to perform a specific task, but you don't have any data to do it. Don't just collect data; optimize your fine-tuning with high-quality human annotation data. Work with us to develop a principled system to recruit subject matter experts, evaluate their knowledge to qualify them, develop training materials and benchmarks, and monitor their annotations to ensure they are useful for training. We will conduct the development work and manage the implementation if needed.
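As one illustrative piece of such a system, here is a minimal sketch of the qualification step: screening candidates against gold-standard benchmark items. The data format and the 0.85 threshold are assumptions for illustration.

```python
# Illustrative sketch: qualify candidate annotators by their agreement
# with gold-standard benchmark items before admitting them to production
# work. The data format and 0.85 threshold are assumed for illustration.

def qualifies(candidate_labels: dict, gold_labels: dict,
              threshold: float = 0.85) -> bool:
    """Return True if the candidate matches the gold labels often enough."""
    scored = [item for item in gold_labels if item in candidate_labels]
    if not scored:
        return False
    agreement = sum(
        candidate_labels[item] == gold_labels[item] for item in scored
    ) / len(scored)
    return agreement >= threshold

# Example: a candidate who matches 9 of 10 gold items qualifies at 0.85.
gold = {f"item_{i}": "correct" for i in range(10)}
candidate = {**gold, "item_3": "incorrect"}
print(qualifies(candidate, gold))  # True (agreement = 0.90)
```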
2
Develop Likert Scales to Measure Relevant AI Constructs for Evaluation
Suppose you are an AI developer evaluating generative AI responses in a healthcare context. The AI generates open-ended suggestions to improve health outcomes. To ensure that the suggestions are valid, safe, unbiased, and accurate, you need human feedback, but you don't have a way to structure it. Consult with us to design a series of rating scales your human experts can use to map their evaluations of the AI outputs to numeric values. We will help define the relevant constructs and develop items to evaluate them, ensuring that they yield appropriate measurements of the outputs.
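To make this concrete, here is a minimal sketch of how such scales might be represented and scored. The constructs, item wording, and 1-5 scale are invented for illustration, not drawn from a real engagement.

```python
# Minimal sketch: represent Likert items per construct and aggregate a
# rater's 1-5 item responses into per-construct scores. The constructs
# and item wording below are invented for illustration.
from statistics import mean

SCALES = {
    "safety": [
        "The suggestion avoids recommending anything harmful.",
        "The suggestion defers to a clinician when appropriate.",
    ],
    "accuracy": [
        "The suggestion is consistent with established health guidance.",
    ],
}

def score_constructs(ratings: dict) -> dict:
    """Average one rater's item ratings into a score per construct."""
    scores = {}
    for construct, items in SCALES.items():
        values = ratings[construct]
        if len(values) != len(items):
            raise ValueError(f"Expected one rating per item for {construct}")
        scores[construct] = mean(values)
    return scores

# One rater's item-level ratings for a single AI-generated suggestion.
print(score_constructs({"safety": [5, 4], "accuracy": [4]}))
# {'safety': 4.5, 'accuracy': 4}
```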
3
Proposal to Integrate AI into Certification Testing Program
Suppose you are the leader of a small certification company who wants to make the leap to incorporating AI into your assessment pipeline. Consult with us to conduct a thorough review of your testing program to determine where you can integrate new AI applications or refine existing ones.
We will review all of your systems, from item development to score reporting, to derive a plan for your future.
4
Presentation to School District on AI Literacy and AI in the Classroom
Suppose you are a visionary faculty member in your school district and want to educate faculty and staff on the basics of AI, how AI might be used in the classroom, how students might use AI, AI literacy, and pain points to watch out for. We will deliver a workshop that prepares your staff for the future of their classrooms and serve as an ongoing consultant to the district as we all navigate the changing landscape of education together.