Scorpio Research

This project investigates the effectiveness of constraint-based AI tutoring systems. Our ablation study shows how a layered architecture of inference-time rules can turn a general-purpose LLM into a specialized Socratic tutor.

Key Findings

Constraint Effectiveness

High

Modular constraint stack achieves 100% domain adherence and raises LaTeX density to 0.92 (from 0.22 with no constraints)

Eliminates off-topic and poorly formatted responses

Direct Answer Prevention

High

Direct Answer Rate (DAR) reduced from 100% (NONE) to 0% (FULL)

Forces productive struggle and guided reasoning

Socratic Engagement

High

FULL stack averages 1.16 questions per response (vs. 0.32 for DOMAIN only and 1.08 for the unconstrained baseline)

Inquiry-based interaction rises sharply relative to domain restriction alone

Pedagogical Quality

Medium

Quality scores stay within a narrow band across all tiers (3.86 to 4.02 out of 5; 3.92 FULL, 3.96 NONE)

Adding constraints does not degrade measured teaching quality

System Performance

Constraint Level	Description	Domain Adherence	Direct Answer Rate (DAR)	LaTeX Density	Avg Questions per Response	Quality (out of 5)
NONE	Baseline Gemini 2.5 Flash, no constraints	0.0%	100%	0.22	1.08	3.96
DOMAIN	Physics domain restriction only	100.0%	100%	0.35	0.32	3.98
PEDAGOGY	Domain + response classification	100.0%	0.0%	0.28	0.84	3.86
NOTATION	Domain + pedagogy + LaTeX/unit enforcement	100.0%	0.0%	0.88	1.04	4.02
FULL	Complete Socratic tutoring stack	100.0%	0.0%	0.92	1.16	3.92
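
The table describes each tier as a superset of the one before it, so the stack can be pictured as an ordered list of rule layers that is cut off at the chosen level. The following is a minimal sketch of that idea, not the study's actual implementation: the tier names come from the table, while the rule texts, the `LAYERS` list, and the `build_system_prompt` function are hypothetical placeholders introduced for illustration.

```python
# Ordered rule layers. Tier names are from the study; the rule texts are
# hypothetical stand-ins, since the real prompts are not given in this summary.
LAYERS = [
    ("DOMAIN", "Only discuss physics. Decline questions on other topics."),
    ("PEDAGOGY", "Classify the request first; never state the final answer directly."),
    ("NOTATION", "Write all mathematics in LaTeX and include units."),
    ("FULL", "Guide the student with at least one probing question."),
]


def build_system_prompt(level: str) -> str:
    """Return the cumulative rule text for a constraint level.

    NONE yields an empty prompt; every other level includes its own
    layer plus all layers that precede it in LAYERS.
    """
    if level == "NONE":
        return ""
    names = [name for name, _ in LAYERS]
    if level not in names:
        raise ValueError(f"unknown constraint level: {level!r}")
    cutoff = names.index(level) + 1
    return "\n".join(rule for _, rule in LAYERS[:cutoff])


if __name__ == "__main__":
    for level in ["NONE", "DOMAIN", "PEDAGOGY", "NOTATION", "FULL"]:
        n_rules = len(build_system_prompt(level).splitlines())
        print(f"{level}: {n_rules} rule layer(s)")
```

Keeping the layers in one ordered list makes each ablation a single cutoff index, which is one simple way to guarantee that adjacent tiers differ by exactly one layer.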

Performance by Difficulty

Difficulty	Quality	Rule Adherence %	Avg Length (Chars)
Basic	3.72	77.8%	399
Intermediate	4.05	100.0%	641
Advanced	4.13	100.0%	1578
College	3.75	100.0%	3322

Methodology

Category	Count/Details	Breakdown
Question Types	28 total	Conceptual, Procedural, Adversarial
Difficulty Levels	4 levels	Basic (8), Intermediate (10), Advanced (6), College (4)
Constraint Levels	5 configurations	NONE, DOMAIN, PEDAGOGY, NOTATION, FULL
Metrics Collected	5 metrics	Direct Answer Rate, LaTeX Density, Question Density, Domain Adherence, Pedagogical Quality
Sample Size	140 responses	28 questions × 5 constraint levels
AI Model	Gemini 2.5 Flash	Lightweight model, inference-time constraints
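
To make the design and two of the metrics concrete, here is a rough sketch of how per-response records could be aggregated by constraint level. It rests on assumptions that this summary does not confirm: question density is approximated by counting question marks, the direct-answer flag is assumed to be supplied by an external judge, and the record fields (`level`, `direct_answer`, `text`) and function names are hypothetical. Only the difficulty counts and the level names are taken from the methodology table.

```python
from collections import defaultdict
from statistics import mean

# Counts and level names taken from the methodology table.
DIFFICULTY_COUNTS = {"Basic": 8, "Intermediate": 10, "Advanced": 6, "College": 4}
LEVELS = ["NONE", "DOMAIN", "PEDAGOGY", "NOTATION", "FULL"]


def expected_sample_size() -> int:
    """Every question is answered once under every constraint level."""
    return sum(DIFFICULTY_COUNTS.values()) * len(LEVELS)


def count_questions(text: str) -> int:
    """Assumed proxy for question density: number of '?' characters."""
    return text.count("?")


def summarize(records):
    """Aggregate per-response records into per-level metrics.

    Each record is a dict with hypothetical keys:
      level (str), direct_answer (bool), text (str).
    """
    by_level = defaultdict(list)
    for record in records:
        by_level[record["level"]].append(record)

    summary = {}
    for level, rows in by_level.items():
        summary[level] = {
            "n": len(rows),
            # Direct Answer Rate: fraction of responses flagged as direct answers.
            "dar": mean(1.0 if r["direct_answer"] else 0.0 for r in rows),
            "avg_questions": mean(count_questions(r["text"]) for r in rows),
        }
    return summary


if __name__ == "__main__":
    print("expected responses:", expected_sample_size())
    demo = [
        {"level": "FULL", "direct_answer": False,
         "text": "Which forces act on the block? Which one is larger?"},
        {"level": "NONE", "direct_answer": True, "text": "The answer is 5 N."},
    ]
    print(summarize(demo))
```

A question-mark count is a crude proxy (it also counts rhetorical questions and misses questions phrased as statements), so a real evaluation would likely need a more careful classifier.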