Driving Frontier-Level Reasoning in Apriel-1.5 with 390K+ High-Signal Prompts

ServiceNow partnered with Turing to build a high-quality SFT dataset spanning reasoning, code, Glide scripting, function calling, and complex instruction following. This dataset powered the training of Apriel-1.5, a 15 billion parameter model that achieved frontier-level benchmark scores before reinforcement, matching much larger models while remaining deployable on a single GPU.

  • 390,000+ curated SFT tasks across code, reasoning, CIF, function calling, and Glide
  • 100% taxonomy coverage across aligned capabilities
  • Zero synthetic data: fully human-authored, reviewed, and verified

Method: SFT data creation
Domain: Coding
Dataset scale: 390,000+ tasks
Capability: Data Packs

The Challenge

ServiceNow aimed to build a small but extremely capable reasoning model that:

  • Performed across multiple domains including code, complex instruction following (CIF), complex reasoning (CR), function calling, agentic workflows, and business logic
  • Understood and generated ServiceNow-specific Glide scripts, configuration logic, and platform workflows
  • Demonstrated frontier-level reasoning while maintaining efficiency for enterprise deployment
  • Offered general-purpose instruction-following strength while excelling at task execution needed for enterprise automation
  • Performed well in benchmarks covering multi-hop reasoning, tool use, and structured outputs

Achieving this capability within a 15-billion-parameter constraint required domain-balanced, high-signal SFT data: not just volume, but high-quality coverage across diverse taxonomies.

The Approach

Turing built a 390,000+ sample SFT dataset precisely structured to uplift all major capability axes that ServiceNow targeted.

Dataset

i. Code & Glide domain (ServiceNow platform)

The team created domain-specific data to train the model to:

  • Improve text-to-code capabilities such as text-to-SQL and text-to-Python 
  • Cover general and target capabilities across languages such as Cypher, Python, JavaScript, SQL, and mixed-language scenarios in highly complex multi-turn conversations
  • Write and debug Glide scripts (see the sketch after this list)
  • Combine code with function calling capabilities
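
To ground the Glide work, here is a minimal sketch of the kind of server-side script the dataset trains the model to write and debug. GlideRecord and the gs (GlideSystem) helpers are real Now Platform APIs; the escalation rule, field values, and threshold below are hypothetical examples, not items from the delivered dataset.

```javascript
// Illustrative Glide (server-side JavaScript) task: escalate open P1/P2
// incidents that have not been updated in three days. GlideRecord and gs
// are real Now Platform APIs; the rule and values are made up.
(function escalateStaleIncidents() {
    var gr = new GlideRecord('incident');               // query the incident table
    gr.addQuery('active', true);                        // open incidents only
    gr.addQuery('priority', '<=', 2);                   // P1 and P2
    gr.addQuery('sys_updated_on', '<', gs.daysAgo(3));  // stale for 3+ days
    gr.query();

    while (gr.next()) {
        gr.setValue('escalation', 2);                   // mark as escalated
        gr.work_notes = 'Auto-escalated: no update in 3 days.';
        gr.update();                                    // persist the change
    }
    gs.info('Escalated ' + gr.getRowCount() + ' stale incidents.');
})();
```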

ii. Function calling & tool-use

Turing developed thousands of structured examples, illustrated by the sketch after this list, that taught the model to:

  • Obey strict schemas
  • Use tools appropriately in multi-turn conversational workflows
  • Recover gracefully from malformed user inputs
  • Handle complex user personas and variations
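
As a concrete, hypothetical illustration of schema obedience, a sample can pair a JSON-Schema-style tool definition with the exact call the model is expected to emit. The tool name, parameters, and user turn below are invented for illustration; they are not ServiceNow's actual schemas.

```javascript
// Hypothetical function-calling sample: a JSON-Schema-style tool definition
// plus the exact call the model is trained to emit. Names and fields are
// invented for illustration.
const tool = {
  name: 'create_incident',
  description: 'Open a new incident on the Now Platform.',
  parameters: {
    type: 'object',
    properties: {
      short_description: { type: 'string' },
      priority: { type: 'integer', minimum: 1, maximum: 5 },
      assignment_group: { type: 'string' }
    },
    required: ['short_description', 'priority']
  }
};

// For the user turn "My VPN keeps dropping, this is urgent", the target
// output is a call whose arguments validate against the schema above,
// with no extra keys and no free-text wrapper.
const expectedCall = {
  name: 'create_incident',
  arguments: { short_description: 'VPN connection repeatedly dropping', priority: 1 }
};
```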

Impact:

Improved scores on benchmarks measuring structured reasoning and task execution, such as IFBench and Tau.

iii. Complex instruction following (CIF)

Tasks spanned:

  • Multi-constraint instructions
  • Long prompts with conflicting details
  • Precise output formatting
  • Realistic workflow instructions

The dataset followed a strict distribution of 25% single-turn and 75% multi-turn data. Model improvements were tested on a small sample to validate uplift statistics.
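
For illustration only (the delivered samples are not public), a multi-constraint CIF record might pair an instruction carrying several competing requirements with a reference response that satisfies all of them; everything below is hypothetical.

```javascript
// Hypothetical multi-constraint CIF sample; constraints and wording are illustrative.
const cifSample = {
  instruction:
    'Summarize the attached change request in exactly 3 bullet points, ' +
    'each under 15 words, in plain English, without naming any individuals, ' +
    'and end with the line APPROVED or REJECTED in uppercase.',
  constraints: [
    'exactly 3 bullets',
    'max 15 words per bullet',
    'no personal names',
    'final line is APPROVED or REJECTED, uppercase only'
  ],
  turns: 'multi',   // 75% of the dataset; the other 25% is single-turn
  reference: [
    '• Database patch scheduled for Saturday maintenance window.',
    '• Rollback plan tested and documented by the platform team.',
    '• No customer-facing downtime expected during deployment.',
    'APPROVED'
  ].join('\n')
};
```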

Impact:

Significant uplift in deterministic, schema-controlled outputs, critical for enterprise usage.

iv. Complex reasoning & agentic tasks 

We constructed reasoning tasks (an illustrative sample appears after the list) requiring:

  • Multi-hop chains
  • Long-form justification
  • Planning and execution
  • Conditional reasoning
  • Thought decomposition
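
A hedged sketch of what one such agentic sample could look like, with the task decomposed into an explicit plan before the final answer; the servers, task, and reasoning steps are invented for illustration.

```javascript
// Hypothetical agentic-reasoning sample: a multi-hop task with an explicit plan.
const reasoningSample = {
  task: 'Three servers exceeded 90% disk usage this week. Decide which to ' +
        'remediate first, given that srv-02 hosts the billing database and ' +
        'srv-01 and srv-03 are load-balanced web nodes.',
  plan: [
    'Identify which server failure has the highest business impact.',                 // hop 1
    'srv-02 is a single point of failure for billing, so its impact is highest.',     // hop 2
    'Web nodes are redundant; either can absorb traffic while the other is fixed.',   // hop 3
    'Conclusion: remediate srv-02 first, then srv-01 and srv-03 in sequence.'
  ],
  answer: 'srv-02 first; the load-balanced web nodes can wait.'
};
```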

Impact:

Material improvements in long-horizon reasoning contributed to the Artificial Analysis score of 52 and the Tau score of 68.

Talent and QA

We used a multi-layer QA pipeline to ensure world-class quality:

  • 300+ vetted annotators with software engineering, enterprise IT, and workflow automation expertise
  • Dedicated Glide/ServiceNow specialists for domain precision
  • Multi-layer QA:
    L1: Human review for code correctness, reasoning flow, and domain grounding
    L2: LLM-assisted variance and coverage review to ensure breadth and robustness
    L3: Evenly sampled calibration from domain experts
  • Average quality score: 4.5/5 across 390,000+ prompts

Key Results

  • 390,000+ high-quality SFT samples across code, CIF, CR, function calling, and ServiceNow’s Glide domain
  • 100% coverage across all aligned taxonomies
  • Raised mid-training performance for Apriel-1.5-15B-Thinker across benchmarks
  • All samples were curated and verified with zero synthetic noise

The Outcome

Turing’s 390K-sample SFT dataset helped ServiceNow train Apriel-1.5-15B-Thinker, a 15B model that matches frontier model capabilities at a fraction of the size.

Highlights from ServiceNow’s benchmark release:

  • On par with DeepSeek-R1-0528 at 1/40th the size
  • Runs efficiently on a single GPU, enabling enterprise-scale deployment
  • Artificial Analysis score of 52 (surpassing Gemini-2.5-Flash and GLM-4.5)
  • IFBench: 62
  • Tau Reasoning: 68
  • Significant reasoning gains achieved before any RL stage (mid-training only)

Turing contributed a high-quality SFT corpus spanning reasoning, instruction following, code generation, function calling, and ServiceNow’s Glide domain, lifting the model’s mid-training capabilities.

Need instruction-tuned data across enterprise IT domains?

Request a sample with multi-turn SFT samples for code generation, reasoning, and tool use, grounded in real automation tasks.

Request Sample

FAQ

What capabilities does the dataset cover?

The dataset includes reasoning, code generation, complex instruction following, function calling, agentic task planning, and Glide (ServiceNow’s proprietary scripting language).

Was Glide domain data included?

Yes. Turing provided extensive domain-specific SFT focused on writing and debugging Glide scripts, platform workflows, and Now Platform automation logic.

How was data quality ensured?

All 390,000+ samples were curated and reviewed through a three-layer QA system, including LLM-assisted coverage checks and calibration from domain experts.

Can this data improve performance before RLHF?

Yes. The dataset contributed to benchmark wins during mid-training, showing significant uplift before reinforcement stages.

How was the dataset structured?

It was taxonomy-aligned, with structured distributions across multi-turn vs. single-turn, code vs. non-code, and agentic vs. declarative workflows, designed to match ServiceNow’s internal modeling goals.

What’s the NDA process?

A standard mutual NDA. Turing provides the countersigned agreement within one business day.

How fast can I get a sample?

Within three business days after NDA execution.

Need high-quality SFT data to match frontier models at compact scale?

Request curated, benchmark-aligned datasets across reasoning, code, and enterprise domains.

Request Sample