Scorpio Research
This study investigates the effectiveness of constraint-based AI tutoring systems. Our ablation study shows how a layered architecture of inference-time rules can turn a general-purpose LLM into a specialized Socratic tutor.
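To make the idea of a layered, inference-time constraint stack concrete, here is a minimal sketch in which each tier adds one rule on top of the tiers below it. The tier names are taken from the study; the rule texts, the `LAYER_RULES` mapping, and the `build_system_prompt` helper are hypothetical placeholders and not the project's actual prompts or code.

```python
# Tier names as used in the study, ordered from least to most constrained.
TIERS = ["NONE", "DOMAIN", "PEDAGOGY", "NOTATION", "FULL"]

# Placeholder rule text for each layer (assumed wording, for illustration only).
LAYER_RULES = {
    "DOMAIN": "Only answer physics questions; politely decline anything else.",
    "PEDAGOGY": "Do not state the final answer; guide the student step by step.",
    "NOTATION": "Write all math in LaTeX and always include units.",
    "FULL": "Ask at least one guiding question in each response.",
}


def build_system_prompt(level: str) -> str:
    """Join the rules of every layer up to and including `level`."""
    if level not in TIERS:
        raise ValueError(f"unknown constraint level: {level}")
    # NONE contributes no rules, so the slice starts at index 1.
    active = TIERS[1 : TIERS.index(level) + 1]
    return "\n".join(LAYER_RULES[name] for name in active)


if __name__ == "__main__":
    for tier in TIERS:
        print(f"--- {tier} ---")
        print(build_system_prompt(tier) or "(no constraints)")
```

Because the layers are cumulative, an ablation only needs to change the `level` argument; the model and the question set stay fixed.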
Key Findings
Constraint Effectiveness
High: The modular constraint stack enforces 100% domain adherence and raises LaTeX density to 0.92 (from 0.22 with no constraints)
Eliminates off-topic and poorly formatted responses
Direct Answer Prevention
High: Direct Answer Rate (DAR) drops from 100% (NONE) to 0% (FULL); the drop occurs once the PEDAGOGY layer is added
Forces productive struggle and guided reasoning
Socratic Engagement
High: The FULL stack averages 1.16 questions per response (vs. 0.32 for DOMAIN only and 1.08 for the unconstrained baseline)
Marked increase in inquiry-based interaction over the DOMAIN-only tier
Pedagogical Quality
Medium: Quality scores remain high and consistent (3.92/5 FULL, 3.96/5 NONE)
Reliable teaching effectiveness across all tiers
System Performance
| Constraint Level | Description | Domain Adherence | DAR | LaTeX Density | Avg Questions per Response | Quality (/5) |
|---|---|---|---|---|---|---|
| NONE | Baseline Gemini 2.5 Flash, no constraints | 0.0% | 100% | 0.22 | 1.08 | 3.96 |
| DOMAIN | Physics domain restriction only | 100.0% | 100% | 0.35 | 0.32 | 3.98 |
| PEDAGOGY | Domain + response classification | 100.0% | 0.0% | 0.28 | 0.84 | 3.86 |
| NOTATION | Domain + pedagogy + LaTeX/unit enforcement | 100.0% | 0.0% | 0.88 | 1.04 | 4.02 |
| FULL | Complete Socratic tutoring stack | 100.0% | 0.0% | 0.92 | 1.16 | 3.92 |
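The page does not spell out how the density metrics in the table are computed. As one plausible reading, the sketch below treats question density as the mean number of question marks per response and LaTeX density as the share of responses that contain at least one inline LaTeX span. Both definitions, the `LATEX_SPAN` pattern, and the function names are assumptions made for illustration, not the study's scoring code.

```python
import re

# Assumed pattern: an inline LaTeX span written as $...$ or \( ... \).
LATEX_SPAN = re.compile(r"\$[^$]+\$|\\\(.+?\\\)")


def question_density(responses):
    """Mean number of '?' characters per response (assumed definition)."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(r.count("?") for r in responses) / len(responses)


def latex_density(responses):
    """Share of responses containing at least one LaTeX span (assumed definition)."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if LATEX_SPAN.search(r)) / len(responses)


if __name__ == "__main__":
    sample = [
        "Which forces act on the block? Can you relate them using $F = ma$?",
        "Think about how energy is conserved here.",
    ]
    print(question_density(sample))
    print(latex_density(sample))
```

Counting question marks is a crude proxy: it cannot tell a guiding question from a rhetorical one, so a rubric or classifier would be needed for anything beyond a first pass.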
Performance by Difficulty
| Difficulty | Quality | Rule Adherence % | Avg Length (Chars) |
|---|---|---|---|
| Basic | 3.72 | 77.8% | 399 |
| Intermediate | 4.05 | 100.0% | 641 |
| Advanced | 4.13 | 100.0% | 1578 |
| College | 3.75 | 100.0% | 3322 |
Methodology
| Category | Count/Details | Breakdown |
|---|---|---|
| Question Types | 28 total | Conceptual, Procedural, Adversarial |
| Difficulty Levels | 4 levels | Basic (8), Intermediate (10), Advanced (6), College (4) |
| Constraint Levels | 5 configurations | NONE, DOMAIN, PEDAGOGY, NOTATION, FULL |
| Metrics Collected | 5 metrics | Direct Answer Rate, LaTeX Density, Question Density, Domain Adherence, Pedagogical Quality |
| Sample Size | 140 responses | 28 questions × 5 constraint levels |
| AI Model | Gemini 2.5 Flash | Lightweight model, inference-time constraints |
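The sample size follows from crossing every question with every constraint level. The sketch below rebuilds that grid from the counts in the methodology table; the question IDs such as `Basic-1` are hypothetical stand-ins, since the actual question set is not listed here.

```python
from itertools import product

CONSTRAINT_LEVELS = ["NONE", "DOMAIN", "PEDAGOGY", "NOTATION", "FULL"]

# Question counts per difficulty level, as reported in the methodology table.
QUESTIONS_PER_DIFFICULTY = {
    "Basic": 8,
    "Intermediate": 10,
    "Advanced": 6,
    "College": 4,
}

# Hypothetical question IDs; the real questions are not reproduced here.
question_ids = [
    f"{difficulty}-{i}"
    for difficulty, count in QUESTIONS_PER_DIFFICULTY.items()
    for i in range(1, count + 1)
]

# One run per (question, constraint level) pair.
runs = list(product(question_ids, CONSTRAINT_LEVELS))

if __name__ == "__main__":
    print(len(question_ids), len(runs))  # 28 140
```

Each question is answered once under every configuration, so differences between tiers can be compared on identical prompts.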