Driving Frontier-Level Reasoning in Apriel-1.5 with 390K+ High-Signal Prompts

ServiceNow partnered with Turing to build a high-quality SFT dataset spanning reasoning, code, Glide scripting, function calling, and complex instruction following. This dataset powered the training of Apriel-1.5, a 15 billion parameter model that achieved frontier-level benchmark scores before reinforcement, matching much larger models while remaining deployable on a single GPU.

  • 390,000+ curated SFT tasks across code, reasoning, CIF, function calling, and Glide
  • 100% taxonomy coverage across aligned capabilities
  • Zero synthetic data: fully human-authored, reviewed, and verified

Method: SFT data creation
Domain: Coding
Dataset scale: 390,000+ tasks
Capability: Data Packs

The Challenge

ServiceNow aimed to build a small but extremely capable reasoning model that:

  • Performed across multiple domains including code, complex instruction following (CIF), complex reasoning (CR), function calling, agentic workflows, and business logic
  • Understood and generated ServiceNow-specific Glide scripts, configuration logic, and platform workflows
  • Demonstrated frontier-level reasoning while maintaining efficiency for enterprise deployment
  • Offered general-purpose instruction-following strength while excelling at task execution needed for enterprise automation
  • Performed well in benchmarks covering multi-hop reasoning, tool use, and structured outputs

Achieving this capability within a 15-billion-parameter constraint required domain-balanced, high-signal SFT data: not just volume, but high-quality coverage across diverse taxonomies.

The Approach

Turing built a 390,000+ sample SFT dataset precisely structured to uplift all major capability axes that ServiceNow targeted.

Dataset

i. Code & Glide domain (ServiceNow platform)

The team created domain-specific data to train the model to:

  • Improve text-to-code capabilities such as text-to-SQL and text-to-Python 
  • Cover general and target capabilities across languages such as Cypher, Python, JavaScript, SQL, and mixed-language scenarios in highly complex multi-turn conversations
  • Write and debug Glide scripts (see the sketch after this list)
  • Combine code with function calling capabilities
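
To ground the Glide work, here is a minimal sketch of the kind of server-side script the dataset trains the model to write and debug. GlideRecord and the gs (GlideSystem) helpers are real Now Platform APIs; the escalation rule, field values, and threshold below are hypothetical examples, not items from the delivered dataset.

```javascript
// Illustrative Glide (server-side JavaScript) task: escalate open P1/P2
// incidents that have not been updated in three days. GlideRecord and gs
// are real Now Platform APIs; the rule and values are made up.
(function escalateStaleIncidents() {
    var gr = new GlideRecord('incident');               // query the incident table
    gr.addQuery('active', true);                        // open incidents only
    gr.addQuery('priority', '<=', 2);                   // P1 and P2
    gr.addQuery('sys_updated_on', '<', gs.daysAgo(3));  // stale for 3+ days
    gr.query();

    while (gr.next()) {
        gr.setValue('escalation', 2);                   // mark as escalated
        gr.work_notes = 'Auto-escalated: no update in 3 days.';
        gr.update();                                    // persist the change
    }
    gs.info('Escalated ' + gr.getRowCount() + ' stale incidents.');
})();
```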

ii. Function calling & tool-use

Turing developed thousands of structured examples, illustrated by the sketch after this list, that taught the model to:

  • Obey strict schemas
  • Use tools appropriately in multi-turn conversational workflows
  • Recover gracefully from malformed user inputs
  • Handle complex user personas and variations
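
As a concrete, hypothetical illustration of schema obedience, a sample can pair a JSON-Schema-style tool definition with the exact call the model is expected to emit. The tool name, parameters, and user turn below are invented for illustration; they are not ServiceNow's actual schemas.

```javascript
// Hypothetical function-calling sample: a JSON-Schema-style tool definition
// plus the exact call the model is trained to emit. Names and fields are
// invented for illustration.
const tool = {
  name: 'create_incident',
  description: 'Open a new incident on the Now Platform.',
  parameters: {
    type: 'object',
    properties: {
      short_description: { type: 'string' },
      priority: { type: 'integer', minimum: 1, maximum: 5 },
      assignment_group: { type: 'string' }
    },
    required: ['short_description', 'priority']
  }
};

// For the user turn "My VPN keeps dropping, this is urgent", the target
// output is a call whose arguments validate against the schema above,
// with no extra keys and no free-text wrapper.
const expectedCall = {
  name: 'create_incident',
  arguments: { short_description: 'VPN connection repeatedly dropping', priority: 1 }
};
```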

Impact:

Improved scores on benchmarks measuring structured reasoning and task execution, such as IFBench and Tau.

iii. Complex instruction following (CIF)

Tasks spanned:

  • Multi-constraint instructions
  • Long prompts with conflicting details
  • Precise output formatting
  • Realistic workflow instructions

The dataset followed a strict distribution of 25% single-turn and 75% multi-turn data. Model improvements were tested on a small sample to validate uplift statistics.
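
For illustration only (the delivered samples are not public), a multi-constraint CIF record might pair an instruction carrying several competing requirements with a reference response that satisfies all of them; everything below is hypothetical.

```javascript
// Hypothetical multi-constraint CIF sample; constraints and wording are illustrative.
const cifSample = {
  instruction:
    'Summarize the attached change request in exactly 3 bullet points, ' +
    'each under 15 words, in plain English, without naming any individuals, ' +
    'and end with the line APPROVED or REJECTED in uppercase.',
  constraints: [
    'exactly 3 bullets',
    'max 15 words per bullet',
    'no personal names',
    'final line is APPROVED or REJECTED, uppercase only'
  ],
  turns: 'multi',   // 75% of the dataset; the other 25% is single-turn
  reference: [
    '• Database patch scheduled for Saturday maintenance window.',
    '• Rollback plan tested and documented by the platform team.',
    '• No customer-facing downtime expected during deployment.',
    'APPROVED'
  ].join('\n')
};
```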

Impact:

Significant uplift in deterministic, schema-controlled outputs, critical for enterprise usage.

iv. Complex reasoning & agentic tasks 

We constructed reasoning tasks (an illustrative sample appears after the list) requiring:

  • Multi-hop chains
  • Long-form justification
  • Planning and execution
  • Conditional reasoning
  • Thought decomposition
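
A hedged sketch of what one such agentic sample could look like, with the task decomposed into an explicit plan before the final answer; the servers, task, and reasoning steps are invented for illustration.

```javascript
// Hypothetical agentic-reasoning sample: a multi-hop task with an explicit plan.
const reasoningSample = {
  task: 'Three servers exceeded 90% disk usage this week. Decide which to ' +
        'remediate first, given that srv-02 hosts the billing database and ' +
        'srv-01 and srv-03 are load-balanced web nodes.',
  plan: [
    'Identify which server failure has the highest business impact.',                 // hop 1
    'srv-02 is a single point of failure for billing, so its impact is highest.',     // hop 2
    'Web nodes are redundant; either can absorb traffic while the other is fixed.',   // hop 3
    'Conclusion: remediate srv-02 first, then srv-01 and srv-03 in sequence.'
  ],
  answer: 'srv-02 first; the load-balanced web nodes can wait.'
};
```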

Impact:

Material improvements in long-horizon reasoning contributed to the Artificial Analysis score of 52 and the Tau score of 68.

Talent and QA

We used a multi-layer QA pipeline to ensure world-class quality:

  • 300+ vetted annotators with software engineering, enterprise IT, and workflow automation expertise
  • Dedicated Glide/ServiceNow specialists for domain precision
  • Multi-layer QA:
    L1: Human review for code correctness, reasoning flow, and domain grounding
    L2: LLM-assisted variance and coverage review to ensure breadth and robustness
    L3: Evenly sampled calibration from domain experts
  • Average quality score: 4.5/5 across 390,000+ prompts

Key Results

  • 390,000+ high-quality SFT samples across code, CIF, CR, function calling, and ServiceNow’s Glide domain
  • 100% coverage across all aligned taxonomies
  • Raised mid-training performance for Apriel-1.5-15B-Thinker across benchmarks
  • All samples were curated and verified with zero synthetic noise

The Outcome

Turing’s 390K-sample SFT dataset helped ServiceNow train Apriel-1.5-15B-Thinker, a 15B model that matches frontier model capabilities at a fraction of the size.

Highlights from ServiceNow’s benchmark release:

  • On par with DeepSeek-R1-0528 at 1/40th the size
  • Runs efficiently on a single GPU, enabling enterprise-scale deployment
  • Artificial Analysis score of 52 (surpassing Gemini-2.5-Flash and GLM-4.5)
  • IFBench: 62
  • Tau Reasoning: 68
  • Significant reasoning gains achieved before any RL stage (mid-training only)

Turing contributed a high-quality SFT corpus spanning reasoning, instruction following, code generation, function calling, and ServiceNow’s Glide domain, lifting the model’s mid-training capabilities.

Need instruction-tuned data across enterprise IT domains?

Request a sample with multi-turn SFT samples for code generation, reasoning, and tool use, grounded in real automation tasks.

Request Sample

FAQ

What capabilities does the dataset cover?

The dataset includes reasoning, code generation, complex instruction following, function calling, agentic task planning, and Glide (ServiceNow’s proprietary scripting language).

Was Glide domain data included?

Yes. Turing provided extensive domain-specific SFT focused on writing and debugging Glide scripts, platform workflows, and Now Platform automation logic.

How was data quality ensured?

All 390,000+ samples were curated and reviewed through a three-layer QA system, including LLM-assisted coverage checks and calibration from domain experts.

Can this data improve performance before RLHF?

Yes. The dataset contributed to benchmark wins during mid-training, showing significant uplift before reinforcement stages.

How was the dataset structured?

It was taxonomy-aligned, with structured distributions across multi-turn vs. single-turn, code vs. non-code, and agentic vs. declarative workflows, designed to match ServiceNow’s internal modeling goals.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Need high-quality SFT data to match frontier models at compact scale?

Request curated, benchmark-aligned datasets across reasoning, code, and enterprise domains.

Request Sample