AI Slop Isn't Vague — It's Measurable: A Production Framework Built on Academic Research
A Northeastern/Meta AI paper gives us the first rigorous taxonomy of AI 'slop.' Here's how to turn that taxonomy into a working quality gate — with code, real tools, and the pipeline architecture we run at SMF Works.
Morgan Lockridge
Social Media Manager

Everyone in content marketing knows AI "slop" when they see it. The hollow opening. The em-dash avalanche. The paragraph that says nothing in forty words. The conclusion that summarizes what was never argued. We point at it, we feel it, and then we ship it anyway because there's no systematic way to catch it.
A paper from Northeastern University and Meta AI — "Measuring AI 'Slop' in Text" by Shaib, Chakrabarty, Garcia-Olano, and Wallace — changes that. For the first time, slop isn't a vibe. It's a construct with a taxonomy, an annotation methodology, and empirical validation.
This post turns that academic framework into a production quality gate. I'll walk through the taxonomy, map each dimension to detection methods that actually work, and show the pipeline architecture we use at SMF Works to keep slop out of published content.
The Slop Taxonomy: 10 Codes, 3 Themes
The researchers interviewed 19 experts across NLP, professional writing, and philosophy, then validated the resulting taxonomy through span-level annotation of 150 news articles and 100 QA passages. The result is 10 codes organized into 3 themes:
Theme 1: Information Utility
| Code | What It Catches |
|---|---|
| Density (IU1) | Low information density — words that fill space without adding meaning. "In today's rapidly evolving landscape, it's worth noting that..." |
| Relevance (IU2) | Off-topic or tangential content that doesn't serve the piece's purpose |
Density is the most frequent complaint. It's the hallmark of AI output that was optimized for length, not for meaning. Relevance failures are subtler — they're often technically accurate but disconnected from what the reader actually needs.
Theme 2: Information Quality
| Code | What It Catches |
|---|---|
| Factuality (IQ1) | Fabricated claims, unsupported assertions |
| Bias (IQ2) | Systematic one-sidedness, cherry-picked evidence |
Factuality is the danger zone. It's where AI content goes from "annoying" to "harmful." The paper found that annotators were most consistent in flagging factual errors — but also most forgiving of bias, which makes bias the silent failure mode.
Theme 3: Style Quality
| Code | What It Catches |
|---|---|
| Repetition (SQ1) | Repeated phrases, recurring sentence structures |
| Templatedness (SQ2) | Rigid formulaic structure — "first, second, finally, in conclusion" |
| Coherence (SQ3) | Logical flow failures, missing transitions |
| Fluency (SQ4) | Unnatural phrasing, mechanical word choices |
| Verbosity (SQ5) | Excessive wordiness beyond what the idea requires |
| Word Complexity (SQ6) | Unnecessarily elevated vocabulary ("utilize" instead of "use") |
| Tone (SQ7) | Flat, inappropriately formal, or mismatched voice |
Repetition and templatedness are the most agreed-upon slop indicators across all 19 experts. The paper calls this the "metronome detector" — the sense that a human writer varies their technique, while AI applies it uniformly. Any single technique works in isolation. The tell is how evenly it gets distributed.
Why Standard Metrics Fail
The most actionable finding in the paper: BLEU, ROUGE, and other standard text metrics cannot detect slop. They fail specifically on the dimensions that matter most — relevance, coherence, and fluency. A text can score perfectly on ROUGE and still be 80% slop by the taxonomy.
The researchers also tested whether reasoning LLMs (GPT-4 class) could reliably identify slop. They couldn't. The inter-annotator agreement between LLM judges and human annotators was low enough that the authors conclude you can't just throw a bigger model at the detection problem.
This has a clear architectural implication: you need hybrid detection — deterministic heuristics for what can be measured mechanically, LLM-based evaluation for what requires context, and human review for what requires judgment. No single layer catches everything.
Mapping the Taxonomy to Detection Methods
Here's how each dimension maps to a practical detection approach:
| Slop Dimension | Detection Method | Tool/Implementation | Latency |
|---|---|---|---|
| Density | Word-to-idea ratio, sentence compression test | Deterministic (ContentForge) | <50ms |
| Relevance | Semantic similarity to brief/intent | LLM-based (Ramsay grounding) | 2-5s |
| Factuality | Claim extraction + source verification | LLM + retrieval (Ramsay) | 3-8s |
| Bias | Perspective coverage analysis | LLM-based | 2-4s |
| Repetition | N-gram overlap, PoS tag sequence analysis | Deterministic (regex + heuristics) | <10ms |
| Templatedness | Structural pattern detection (transition words, list markers) | Deterministic (ContentForge) | <50ms |
| Coherence | Argument flow analysis, logical connective density | LLM-based (Ramsay rubric) | 2-4s |
| Fluency | Perplexity variance, sentence-level naturalness | Statistical + LLM hybrid | 500ms-2s |
| Verbosity | Length constraints, compression ratio | Deterministic (ContentForge) | <10ms |
| Word Complexity | Readability scores, vocabulary tier analysis | Deterministic (Flesch-Kincaid) | <10ms |
| Tone | Voice anchor comparison | LLM-based (Ramsay rubric) | 2-4s |
The key insight: deterministic checks run first because they're free and instant. They eliminate 60-70% of slop before you spend a single API call on LLM evaluation. The LLM layer catches what heuristics can't — relevance, coherence, tone. Human review catches what both layers miss — strategic misalignment, brand voice drift, audience mismatch.
The Production Pipeline
Here's the three-gate architecture we run at SMF Works, designed directly from this taxonomy:
Generated Content
│
▼
┌─────────────────────────────────┐
│ GATE 1: Deterministic Heuristics│
│ Density, Verbosity, Repetition, │
│ Templatedness, Word Complexity │
│ Latency: <50ms Cost: $0 │
│ Reject rate: ~40% │
└──────────────┬──────────────────┘
│ pass
▼
┌─────────────────────────────────┐
│ GATE 2: LLM-Based Evaluation │
│ Relevance, Factuality, │
│ Coherence, Tone, Bias │
│ Latency: 5-15s Cost: ~$0.005 │
│ Reject rate: ~25% │
└──────────────┬──────────────────┘
│ pass
▼
┌─────────────────────────────────┐
│ GATE 3: Human Review │
│ Strategic fit, brand voice, │
│ audience match │
│ Latency: minutes Cost: time │
│ Reject rate: ~10% │
└──────────────┬──────────────────┘
│ pass
▼
Publish
Gate 1: Deterministic Heuristics with Ramsay's Regex Precheck
Ramsay implements a regex precheck that runs in under 1ms with zero API cost. It catches banned phrases, structural tells (multiple em dashes in one paragraph), and formulaic patterns. This is the "metronome detector" operationalized:
# Custom Ramsay rubric for social media content
name: smf-social-post
description: SMF Works social media post quality gate
dimensions:
- name: voice_authenticity
description: "Does it sound like our brand? 1=generic AI, 5=distinct voice"
min_pass: 4
hard_floor: true
- name: information_density
description: "Is every sentence earning its place? 1=fluff, 5=tight"
min_pass: 3
hard_floor: false
- name: hook_strength
description: "Does the opening stop the scroll? 1=boring, 5=compelling"
min_pass: 4
hard_floor: true
banned_phrases:
- "game-changing"
- "in today's landscape"
- "it's worth noting"
- "let's dive in"
- "the reality is"
- "revolutionize the way"
kill_list:
- "Multiple em dashes in one paragraph"
- "Opening with a rhetorical question"
- "Ending with a generic call to action"
pass_rule: all_hard_floors
The hard_floor: true flag means failing that dimension fails the entire post — no averaging, no forgiveness. This matches the paper's finding that binary judgments (slop / not slop) are subjective at the boundary but clear at the extremes. Hard floors enforce the extremes.
Gate 2: LLM-Based Evaluation with Ramsay's Three-Stage Pipeline
Content that passes the regex precheck enters Ramsay's full pipeline:
- Generate — Initial text from task + sources + voice profile
- Ground — Extract every factual claim, check against sources. One failing claim fails the entire text. This is strict by design: the paper showed that factuality errors are the most consistently flagged slop dimension by human annotators.
- Score — LLM evaluation on rubric dimensions. The code enforces pass/fail deterministically — the LLM never decides the gate.
from ramsay import generate, evaluate
# Full pipeline: generate + ground + score
result = generate(
task="Write a LinkedIn post about our new AI content pipeline",
sources="Internal metrics: 50 posts generated, 2.1hr total, 3 quality gate rejections",
rubric="smf-social-post",
voice="Direct, technical, confident. Short sentences. Data over adjectives.",
)
# Or evaluate existing content without regeneration
eval_result = evaluate(
text="Your existing draft...",
rubric="smf-social-post",
source="Reference materials for fact-checking...",
)
print(f"Passed: {result.passed}")
print(f"Scores: {result.scores}")
The grounding stage is what separates this from simple rubric scoring. It catches fabricated claims — the most dangerous slop dimension — before they reach publication. At SMF Works, we've seen AI confidently cite metrics that don't exist and reference product features that haven't shipped. Grounding prevents that.
Gate 1 + 2 Complement: Deterministic Platform Scoring
While Ramsay handles text quality, ContentForge provides deterministic platform-specific scoring — a tweet scores differently than a LinkedIn post, because the slop threshold varies by context:
import requests
BASE = "https://contentforge-api-lpp9.onrender.com"
# Score the same content across platforms
r = requests.post(f"{BASE}/v1/score_multi", json={
"text": "Our AI pipeline just shipped 50 posts in 2 hours. Zero slop. Here's how...",
"platforms": ["twitter", "linkedin", "threads"]
})
# Auto-improve: score → rewrite → re-score loop
r = requests.post(f"{BASE}/v1/auto_improve", json={
"text": "We built something cool with AI",
"platform": "twitter",
"max_iterations": 5
})
# Returns best version + full iteration history
# Batch quality gate for content calendars
r = requests.post(f"{BASE}/v1/quality_gate", json={
"posts": [
{"text": "Post 1 draft...", "platform": "twitter"},
{"text": "Post 2 draft...", "platform": "linkedin"},
]
})
ContentForge has 0% variance on the same input — deterministic heuristics always produce the same score. This makes it reliable as a first-pass filter. Ramsay's LLM evaluation has the typical ~15% variance of any LLM judge, but its code-enforced pass/fail gates prevent the LLM from making the final call.
The Feature ↔ Reward Symmetry Principle
If you're fine-tuning your own models for content generation (and the Social Media AI Engineering ETL pipeline makes this accessible), there's a deeper architectural principle at work: the features you extract in your quality pipeline should be the same features you optimize during training.
This pipeline by Jacob Warren demonstrates the pattern explicitly. Every feature extracted in the ETL stage has a corresponding GRPO reward function:
| ETL Feature Extraction | GRPO Reward Function |
|---|---|
| Bullet style analysis | bullet_style_reward_func |
| Tone analysis | tone_alignment_reward_func |
| Emoji usage patterns | emoji_usage_reward + emoji_variety_reward |
| Post length constraints | precise_post_length_reward |
| Sentence structure | sentence_structure_reward_func |
This symmetry means you're not just detecting slop — you're training the model to avoid producing it in the first place. The quality gate and the training pipeline are two sides of the same specification.
Domain-Specific Thresholds
The paper's most underappreciated finding: what counts as slop varies by domain. The slop threshold for a news article is different from a QA answer is different from a social media post.
This means you can't take a general-purpose slop detector and apply it to social media. You need:
- Domain-specific banned phrases — Social media tolerates informal language that would be slop in a research paper
- Platform-specific thresholds — LinkedIn rewards longer, structured posts; Twitter penalizes them
- Brand-specific voice anchors — Your brand voice is slop if it sounds like everyone else's, even if the grammar is perfect
At SMF Works, we calibrate our quality gates against our best-performing posts. The thresholds aren't abstract — they're derived from engagement data. A post that scored 72 on our rubric and got 3x average engagement tells us more about where the threshold should be than any academic paper.
The Full Stack in Practice
Here's the complete detection stack, running from cheapest/fastest to most expensive/slowest:
Input text
│
├─ Regex precheck (Ramsay) ─────── banned phrases, structural tells ─── <1ms, $0
│
├─ Deterministic scoring (ContentForge) ── density, verbosity, templatedness ── <50ms, $0
│
├─ Statistical analysis ────────── readability, compression ratio ─── <10ms, $0
│
├─ LLM rubric scoring (Ramsay) ── relevance, coherence, tone ─── 2-5s, ~$0.003
│
├─ Factuality grounding (Ramsay) ─ claim extraction + verification ─── 3-8s, ~$0.002
│
└─ Human review ────────────────── strategic fit, brand voice ─── minutes, time
Total cost per content piece: roughly $0.005 in API calls and however long human review takes for the ~25% that reaches Gate 3. The deterministic layers eliminate 60-70% of slop for free. The LLM layer catches most of the rest. Human review becomes the exception, not the rule.
What We Learned Running This
Three months into this pipeline at SMF Works, the data is clear:
- Gate 1 (deterministic) rejects ~40% of generated content before any API call. The most common failure: banned phrases and templated structure. AI models default to "game-changing" and "let's dive in" with alarming consistency.
- Gate 2 (LLM-based) rejects ~25% of what passes Gate 1. The most common failure: relevance drift — the post is well-written but doesn't serve the brief.
- Gate 3 (human) rejects ~10% of what passes Gate 2. The most common failure: tone mismatch — technically correct but emotionally wrong for the audience.
The compound rejection rate means about 58% of AI-generated first drafts never reach publication. That's not a failure of the generation model. That's the slop signal the paper measured. Without the gates, that 58% becomes your brand.
Building Your Own
- Start with the taxonomy. Read the paper. Map your worst AI output against the 10 codes. You'll see patterns immediately.
- Build banned phrase lists. Collect every AI tell your team spots. Add them to a Ramsay-style regex precheck. This alone catches 30-40% of slop.
- Add deterministic scoring. Use ContentForge or build your own heuristics for density, verbosity, and templatedness. Zero cost, zero variance.
- Layer LLM evaluation. Use Ramsay or build rubric-based scoring with hard floors. Don't let the LLM make the pass/fail call — code enforces that.
- Calibrate against your data. Your slop thresholds are different from mine. Your best-performing content defines your standard.
The slop problem is solvable. Not by building a better generation model, but by building a better detection pipeline. The taxonomy gives us the vocabulary. The tools give us the implementation. The pipeline gives us the architecture.
Stop shipping slop. Start measuring it.
References:
- Shaib, C., Chakrabarty, T., Garcia-Olano, D., & Wallace, B. C. (2025). Measuring AI "Slop" in Text. arXiv:2509.19163. Submitted to ICLR 2026.
- Ramsay — Quality-controlled text generation with rubric gates. github.com/davegoldblatt/ramsay
- ContentForge — Deterministic pre-publish quality gates. github.com/CaptainFredric/ContentForge
- Social Media AI Engineering ETL — Manifest-driven fine-tuning pipeline. github.com/jacobwarren/social-media-ai-engineering-etl