Achieving 95%+ Factual Accuracy With Human QA Over 5000+ Prompts

Human-labeled evaluations helped close the factuality and response quality gap between the client’s model and frontier AI models, improving alignment, language fluency, and source utilization across 150+ prompt categories.

  • 5000+ human-labeled prompts: spanning web contexts with derivability, accuracy, and critique metadata.
  • ~5% improvement in positive response quality: the highest in blind review across clarity, structure, and instruction alignment.
  • 95%+ factuality rate: the top model trained on Turing data showed stronger grounding, better RAG usage, and source alignment.

Industry: AI Research
Company type: Enterprise
Country: United States
Capabilities used: Turing AGI Advancement

The Challenge

A frontier AI lab needed to improve and evaluate the factuality and response grounding of a next-gen LLM across a wide knowledge surface. Traditional hallucination tagging wasn’t enough, and the team required:

  • Prompt-level claim scoring across diverse user intent types
  • Evaluation of web and social media sources, including volatile real-time news
  • Grounded feedback loops to refine instruction-following and content accuracy
  • A dataset to help reduce “lazy responses” and language errors
  • A rubric that would work across factual, derived, and unverifiable responses, not just binary yes/no grading

Turing was asked to build a scalable, human-in-the-loop eval pipeline that could surface both systemic weaknesses and fine-grained factual mismatches across categories like coding, politics, science, technology, entertainment, and economics.

The Approach

Dataset & taxonomy

We built a multi-dimensional dataset covering:

  • 150+ categories and subcategories, including Politics, Global Events, Religion, Sports, History, Law, Tech, and more
  • Balanced prompt types: declarative claims, open questions, and opinion-contextualized queries
  • 5000+ prompts, each carrying the following fields (sketched in code after this list):
    a. Model response
    b. Supporting web sources
    c. Factuality judgments (per claim)
    d. Source helpfulness, relevance, completeness
    e. Ideal response rewrites for RLHF fine-tuning
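
To make the record structure concrete, here is a minimal sketch of how one per-prompt annotation record could be represented. The class and field names (PromptRecord, SourceEvaluation, ClaimJudgment, and the 1–5 scales) are our own illustration under assumptions, not the client's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceEvaluation:
    """Reviewer judgment of one supporting web source (illustrative fields)."""
    url: str
    helpfulness: int      # e.g., 1-5 scale
    relevance: int        # e.g., 1-5 scale
    completeness: int     # e.g., 1-5 scale

@dataclass
class ClaimJudgment:
    """Per-claim factuality label extracted from the model response."""
    claim_text: str
    verdict: str          # e.g., "factual", "derived", "unverifiable", "false"
    notes: str = ""

@dataclass
class PromptRecord:
    """One human-labeled task: prompt, response, sources, labels, rewrite."""
    category: str                      # one of the 150+ categories/subcategories
    prompt: str
    model_response: str
    sources: List[SourceEvaluation] = field(default_factory=list)
    claim_judgments: List[ClaimJudgment] = field(default_factory=list)
    ideal_rewrite: str = ""            # used as an RLHF fine-tuning target
```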

Evaluation & QA pipeline

We designed and executed a rigorous rubric covering five dimensions: overall response quality, factuality, derivability, RAG source utilization, and general model behavior.

To standardize outputs, we implemented the following (the error schema is sketched in code after this list):

  • A five-tier factuality error schema (e.g., outdated, chronological, data misinterpretation, unverifiable, fabricated)
  • Instruction-level improvement suggestions + rewrites used as ideal fine-tuning targets
  • Blind A/B testing across three models to detect systemic failure patterns
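
As an illustration of how such a five-tier error schema and per-claim label might be encoded, here is a minimal sketch. The enum values mirror the examples above, but the exact taxonomy names and the label_claim helper are hypothetical, not the production schema.

```python
from enum import Enum
from typing import Optional

class FactualityError(Enum):
    """Five-tier error schema mirroring the examples above (names illustrative)."""
    OUTDATED = "outdated"                    # once true, superseded by newer facts
    CHRONOLOGICAL = "chronological"          # events mis-ordered or mis-dated
    DATA_MISINTERPRETATION = "data_misinterpretation"  # source data read incorrectly
    UNVERIFIABLE = "unverifiable"            # no provided source supports the claim
    FABRICATED = "fabricated"                # claim or citation invented outright

def label_claim(claim: str, error: Optional[FactualityError], suggestion: str = "") -> dict:
    """Bundle a claim with its error tag and an instruction-level improvement note."""
    return {
        "claim": claim,
        "error": error.value if error else None,  # None means the claim passed review
        "suggested_rewrite": suggestion,          # feeds the ideal-response rewrites
    }
```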

Qualitative analysis

We also performed a qualitative analysis of model behavior across categories like Business, Politics, and Science. We flagged:

  • Unverifiable extrapolations (e.g., conclusions from vague social posts)
  • Over-indexing on noisy source snippets
  • Incomplete use of helpful data
  • Hallucinated citations and ungrounded knowledge injected as if sourced
  • Language verbosity and structural bloat in “lazy” outputs

Key Results

We evaluated three instruction-following models across 500+ prompts using blind human review:

  • Model A (fine-tuned with Turing data)
  • Model B (an earlier baseline)
  • Model C (a publicly available SOTA model)

Evaluators had no visibility into which model they were assessing, enabling objective, model-agnostic scoring across all five rubric dimensions.
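
To illustrate the blinding step, here is a minimal sketch of how responses might be anonymized and shuffled before human review. The data shapes and the helper name are hypothetical, shown only to convey the idea.

```python
import random

def blind_review_batches(prompts, models, seed=0):
    """Pair each prompt with anonymized model responses in random order.

    `prompts` is a list of dicts with a "text" key and per-model responses
    under "responses" (keyed by model name). Returns review items that hide
    which model produced which response, plus an answer key kept separately
    so scores can be un-blinded after review. Illustrative structure only.
    """
    rng = random.Random(seed)
    review_items, answer_key = [], []
    for prompt in prompts:
        order = list(models)
        rng.shuffle(order)  # randomize presentation order per prompt
        review_items.append({
            "prompt": prompt["text"],
            # Evaluators only ever see "Response 1", "Response 2", ...
            "responses": {f"Response {i + 1}": prompt["responses"][m]
                          for i, m in enumerate(order)},
        })
        answer_key.append({f"Response {i + 1}": m for i, m in enumerate(order)})
    return review_items, answer_key
```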

Overall Response Quality

Model A delivered the highest percentage of net positive responses, indicating improved clarity, completeness, and structure in output.

Factuality & Derivability

While Model C led in factuality, Model A showed comparable performance and reduced hallucination risk.

RAG Source Utilization

Model A demonstrated disciplined use of RAG sources, minimizing overuse or irrelevant citations while achieving high grounding coverage.
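
For intuition, here is a minimal sketch of how source utilization and grounding coverage could be scored per response. The metric definitions below are assumptions for illustration, not the formulas behind the reported figures.

```python
def rag_usage_metrics(provided_source_ids, cited_source_ids, grounded_claims, total_claims):
    """Illustrative RAG usage metrics (definitions assumed).

    - utilization: share of provided sources the response actually cites.
    - spurious_citations: count of citations to sources never provided in context.
    - grounding_coverage: share of claims reviewers marked as supported by
      at least one provided source.
    """
    provided, cited = set(provided_source_ids), set(cited_source_ids)
    utilization = len(provided & cited) / len(provided) if provided else 0.0
    spurious = len(cited - provided)
    coverage = grounded_claims / total_claims if total_claims else 0.0
    return {"utilization": utilization,
            "spurious_citations": spurious,
            "grounding_coverage": coverage}
```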

General Model Behavior (Instance Counts)

Model A had the most consistent behavioral output, with the lowest incidence of grammar issues, hallucinations, and over-verbose responses.

Category-Wise Performance (Net Positive Response Quality)

Model A led in categories like Business, History, Entertainment, and Literature. Model C performed strongest in STEM-heavy categories like Science and Technology, while Model B showed strengths in Art and Politics.

Here are the key takeaways:

  • 81.6% net positive response quality, outperforming other frontier models in blind evals
  • 95.2% factuality rate across human-reviewed samples, matching or exceeding SOTA models
  • 59.6% optimal RAG source utilization, with minimal overuse or distraction from context
  • Category-level wins in Business, History, Entertainment, and Literature, where the model achieved both high response quality and factual precision
  • Highest expected model behavior and lowest rate of grammar or formatting issues among peers

The Outcome

This benchmark study demonstrated that AI models, when fine-tuned with human-labeled QA data, can outperform frontier models on several grounded response metrics:

  • Train: The lab can now improve models on web-based response generation
  • Evaluate: Researchers can assess derivability, not just factuality, using blind human rubrics
  • Diagnose: Behavioral metrics like "lazy response" or "missing key info" offer fine-tuning levers
  • Extend: Frameworks can be adapted to code, logic, and visual reasoning benchmarks

How confident are you in your model’s claims?

Get a human-labeled sample to evaluate factuality, source grounding, and behavioral consistency.

Request Sample


FAQ

What’s in the eval sample?

A full task set: user prompt, model response, web sources, source evaluations, factuality labels, and an ideal response rewrite.

Is this compatible with our RAG system?

Yes. The evaluation is specifically designed for RAG-based model outputs and includes source sufficiency metrics.

Can we compare multiple models?

Yes. Our framework supports blind review across multiple checkpoints or system variants.

What types of prompts are included?

Statements, opinion-form questions, current event queries, social inference tasks, and fact-checkable claims.

What’s the QA process?

All prompts pass a multi-layer review, with both quantitative tagging and qualitative critique.

What’s the NDA process?

We use a standard mutual NDA; Turing returns a countersignature within one business day.

How fast can I get a sample?

Within 3 business days of NDA execution.

Want to evaluate your model at scale?

Get a full evaluation sample with claim-level scoring, source metrics, and rewrite-ready supervision.

Request Sample