AI Slop Isn't Vague — It's Measurable: A Production Framework Built on Academic Research

Everyone in content marketing knows AI "slop" when they see it. The hollow opening. The em-dash avalanche. The paragraph that says nothing in forty words. The conclusion that summarizes what was never argued. We point at it, we feel it, and then we ship it anyway because there's no systematic way to catch it.

A paper from Northeastern University and Meta AI — "Measuring AI 'Slop' in Text" by Shaib, Chakrabarty, Garcia-Olano, and Wallace — changes that. For the first time, slop isn't a vibe. It's a construct with a taxonomy, an annotation methodology, and empirical validation.

This post turns that academic framework into a production quality gate. I'll walk through the taxonomy, map each dimension to detection methods that actually work, and show the pipeline architecture we use at SMF Works to keep slop out of published content.

The Slop Taxonomy: 10 Codes, 3 Themes

The researchers interviewed 19 experts across NLP, professional writing, and philosophy, then validated the resulting taxonomy through span-level annotation of 150 news articles and 100 QA passages. The result is 10 codes organized into 3 themes:

Theme 1: Information Utility

Code	What It Catches
Density (IU1)	Low information density — words that fill space without adding meaning. "In today's rapidly evolving landscape, it's worth noting that..."
Relevance (IU2)	Off-topic or tangential content that doesn't serve the piece's purpose

Density is the most frequent complaint. It's the hallmark of AI output that was optimized for length, not for meaning. Relevance failures are subtler — they're often technically accurate but disconnected from what the reader actually needs.

Theme 2: Information Quality

Code	What It Catches
Factuality (IQ1)	Fabricated claims, unsupported assertions
Bias (IQ2)	Systematic one-sidedness, cherry-picked evidence

Factuality is the danger zone. It's where AI content goes from "annoying" to "harmful." The paper found that annotators were most consistent in flagging factual errors — but also most forgiving of bias, which makes bias the silent failure mode.

Theme 3: Style Quality

Code	What It Catches
Repetition (SQ1)	Repeated phrases, recurring sentence structures
Templatedness (SQ2)	Rigid formulaic structure — "first, second, finally, in conclusion"
Coherence (SQ3)	Logical flow failures, missing transitions
Fluency (SQ4)	Unnatural phrasing, mechanical word choices
Verbosity (SQ5)	Excessive wordiness beyond what the idea requires
Word Complexity (SQ6)	Unnecessarily elevated vocabulary ("utilize" instead of "use")
Tone (SQ7)	Flat, inappropriately formal, or mismatched voice

Repetition and templatedness are the most agreed-upon slop indicators across all 19 experts. The paper calls this the "metronome detector" — the sense that a human writer varies their technique, while AI applies it uniformly. Any single technique works in isolation. The tell is how evenly it gets distributed.

Why Standard Metrics Fail

The most actionable finding in the paper: BLEU, ROUGE, and other standard text metrics cannot detect slop. They fail specifically on the dimensions that matter most — relevance, coherence, and fluency. A text can score perfectly on ROUGE and still be 80% slop by the taxonomy.

The researchers also tested whether reasoning LLMs (GPT-4 class) could reliably identify slop. They couldn't. Agreement between the LLM judges and the human annotators was low enough that the authors conclude you can't just throw a bigger model at the detection problem.

This has a clear architectural implication: you need hybrid detection — deterministic heuristics for what can be measured mechanically, LLM-based evaluation for what requires context, and human review for what requires judgment. No single layer catches everything.

Mapping the Taxonomy to Detection Methods

Here's how each dimension maps to a practical detection approach:

Slop Dimension	Detection Method	Tool/Implementation	Latency
Density	Word-to-idea ratio, sentence compression test	Deterministic (ContentForge)	<50ms
Relevance	Semantic similarity to brief/intent	LLM-based (Ramsay grounding)	2-5s
Factuality	Claim extraction + source verification	LLM + retrieval (Ramsay)	3-8s
Bias	Perspective coverage analysis	LLM-based	2-4s
Repetition	N-gram overlap, PoS tag sequence analysis	Deterministic (regex + heuristics)	<10ms
Templatedness	Structural pattern detection (transition words, list markers)	Deterministic (ContentForge)	<50ms
Coherence	Argument flow analysis, logical connective density	LLM-based (Ramsay rubric)	2-4s
Fluency	Perplexity variance, sentence-level naturalness	Statistical + LLM hybrid	500ms-2s
Verbosity	Length constraints, compression ratio	Deterministic (ContentForge)	<10ms
Word Complexity	Readability scores, vocabulary tier analysis	Deterministic (Flesch-Kincaid)	<10ms
Tone	Voice anchor comparison	LLM-based (Ramsay rubric)	2-4s

The key insight: deterministic checks run first because they're free and instant. They eliminate 60-70% of slop before you spend a single API call on LLM evaluation. The LLM layer catches what heuristics can't — relevance, coherence, tone. Human review catches what both layers miss — strategic misalignment, brand voice drift, audience mismatch.
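
To make the "free and instant" layer concrete, here is a minimal sketch of two deterministic checks from the table: n-gram overlap for repetition and a compression ratio as a rough redundancy proxy. The function names and the 0.15 threshold are assumptions for illustration, not ContentForge's or Ramsay's actual implementation.

```python
import re
import zlib
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Share of word n-grams that repeat an earlier n-gram in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def compression_ratio(text: str) -> float:
    """Compressed size over raw size; lower values mean more redundancy."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw, 9)) / len(raw)


def repetition_gate(text: str, max_repeat: float = 0.15) -> bool:
    """Pass only if repetition stays at or under the (hypothetical) threshold."""
    return repeated_ngram_ratio(text) <= max_repeat
```

The compression ratio is only meaningful on longer texts; on a single short sentence, zlib's fixed overhead dominates the number.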

The Production Pipeline

Here's the three-gate architecture we run at SMF Works, designed directly from this taxonomy:

Generated Content
    │
    ▼
┌─────────────────────────────────┐
│  GATE 1: Deterministic Heuristics│
│  Density, Verbosity, Repetition, │
│  Templatedness, Word Complexity  │
│  Latency: <50ms  Cost: $0       │
│  Reject rate: ~40%               │
└──────────────┬──────────────────┘
               │ pass
               ▼
┌─────────────────────────────────┐
│  GATE 2: LLM-Based Evaluation   │
│  Relevance, Factuality,         │
│  Coherence, Tone, Bias          │
│  Latency: 5-15s  Cost: ~$0.005  │
│  Reject rate: ~25%              │
└──────────────┬──────────────────┘
               │ pass
               ▼
┌─────────────────────────────────┐
│  GATE 3: Human Review            │
│  Strategic fit, brand voice,     │
│  audience match                  │
│  Latency: minutes  Cost: time    │
│  Reject rate: ~10%              │
└──────────────┬──────────────────┘
               │ pass
               ▼
          Publish
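
The diagram boils down to an ordered list of checks that stops at the first rejection, so expensive gates never run on content a cheap gate already failed. A minimal sketch of that control flow follows; the gate functions are placeholders I made up, standing in for real heuristics, an LLM rubric call, and a review queue.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Gate = Tuple[str, Callable[[str], bool]]


@dataclass
class GateResult:
    passed: bool
    failed_gate: Optional[str]  # name of the first gate that rejected, if any
    gates_run: List[str]        # gates that actually executed, in order


def run_gates(text: str, gates: List[Gate]) -> GateResult:
    """Run gates in order and stop at the first rejection."""
    gates_run: List[str] = []
    for name, check in gates:
        gates_run.append(name)
        if not check(text):
            return GateResult(False, name, gates_run)
    return GateResult(True, None, gates_run)


# Placeholder checks (hypothetical): swap in real implementations.
def heuristics_ok(text: str) -> bool:
    return "let's dive in" not in text.lower()


def llm_ok(text: str) -> bool:
    return True  # stand-in for an LLM rubric evaluation


def human_ok(text: str) -> bool:
    return True  # stand-in for a human approval step


GATES: List[Gate] = [
    ("deterministic", heuristics_ok),
    ("llm", llm_ok),
    ("human", human_ok),
]
```

Calling `run_gates(draft, GATES)` on a draft containing "let's dive in" returns after the first gate, and `gates_run` shows that neither the LLM nor the human stage was invoked.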

Gate 1: Deterministic Heuristics with Ramsay's Regex Precheck

Ramsay implements a regex precheck that runs in under 1ms with zero API cost. It catches banned phrases, structural tells (multiple em dashes in one paragraph), and formulaic patterns. This is the "metronome detector" operationalized:

# Custom Ramsay rubric for social media content
name: smf-social-post
description: SMF Works social media post quality gate

dimensions:
  - name: voice_authenticity
    description: "Does it sound like our brand? 1=generic AI, 5=distinct voice"
    min_pass: 4
    hard_floor: true

  - name: information_density
    description: "Is every sentence earning its place? 1=fluff, 5=tight"
    min_pass: 3
    hard_floor: false

  - name: hook_strength
    description: "Does the opening stop the scroll? 1=boring, 5=compelling"
    min_pass: 4
    hard_floor: true

banned_phrases:
  - "game-changing"
  - "in today's landscape"
  - "it's worth noting"
  - "let's dive in"
  - "the reality is"
  - "revolutionize the way"

kill_list:
  - "Multiple em dashes in one paragraph"
  - "Opening with a rhetorical question"
  - "Ending with a generic call to action"

pass_rule: all_hard_floors

The hard_floor: true flag means failing that dimension fails the entire post — no averaging, no forgiveness. This matches the paper's finding that binary judgments (slop / not slop) are subjective at the boundary but clear at the extremes. Hard floors enforce the extremes.
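
Here is how I read the all_hard_floors rule, sketched in plain Python. This is an illustration of the logic, not Ramsay's source code; the `DIMENSIONS` list and the function signature are assumptions that mirror the rubric above, and a missing score is treated as a failure.

```python
from typing import Dict, List, Tuple

# Mirrors the rubric above: (name, min_pass, hard_floor)
DIMENSIONS: List[Tuple[str, int, bool]] = [
    ("voice_authenticity", 4, True),
    ("information_density", 3, False),
    ("hook_strength", 4, True),
]


def all_hard_floors(scores: Dict[str, int]) -> Tuple[bool, List[str]]:
    """Return (passed, failed_hard_floors). Soft dimensions never block."""
    failed = [
        name
        for name, min_pass, hard_floor in DIMENSIONS
        if hard_floor and scores.get(name, 0) < min_pass
    ]
    return (not failed, failed)
```

A post scoring 5 on voice and 4 on hook passes even with a 1 on density, because density is a soft dimension. A 3 on voice fails the post no matter how high the other scores are.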

Gate 2: LLM-Based Evaluation with Ramsay's Three-Stage Pipeline

Content that passes the regex precheck enters Ramsay's full pipeline:

Generate — Initial text from task + sources + voice profile
Ground — Extract every factual claim, check against sources. One failing claim fails the entire text. This is strict by design: the paper showed that factuality errors are the most consistently flagged slop dimension by human annotators.
Score — LLM evaluation on rubric dimensions. The code enforces pass/fail deterministically — the LLM never decides the gate.

from ramsay import generate, evaluate

# Full pipeline: generate + ground + score
result = generate(
    task="Write a LinkedIn post about our new AI content pipeline",
    sources="Internal metrics: 50 posts generated, 2.1hr total, 3 quality gate rejections",
    rubric="smf-social-post",
    voice="Direct, technical, confident. Short sentences. Data over adjectives.",
)

# Or evaluate existing content without regeneration
eval_result = evaluate(
    text="Your existing draft...",
    rubric="smf-social-post",
    source="Reference materials for fact-checking...",
)

print(f"Passed: {result.passed}")
print(f"Scores: {result.scores}")

The grounding stage is what separates this from simple rubric scoring. It catches fabricated claims — the most dangerous slop dimension — before they reach publication. At SMF Works, we've seen AI confidently cite metrics that don't exist and reference product features that haven't shipped. Grounding prevents that.
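
Ramsay's grounding uses an LLM to extract claims. As a much cruder illustration of the same strict rule, the sketch below checks only numeric claims: every number in the draft must appear somewhere in the sources, and one miss fails the draft. The function names are hypothetical, and a matching number does not prove the surrounding claim is correct.

```python
import re
from typing import List

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def ungrounded_numbers(draft: str, sources: str) -> List[str]:
    """Numbers that appear in the draft but nowhere in the sources."""
    known = set(NUMBER.findall(sources))
    return [n for n in NUMBER.findall(draft) if n not in known]


def grounded(draft: str, sources: str) -> bool:
    """Strict rule: a single unsupported number fails the whole draft."""
    return not ungrounded_numbers(draft, sources)


SOURCES = "Internal metrics: 50 posts generated, 2.1hr total, 3 quality gate rejections"
```

With these sources, a draft claiming "50 posts in 2.1 hours" is grounded, while one claiming "500 posts" or a "40%" cost reduction is not, which is exactly the invented-metric failure described above.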

Gate 1 + 2 Complement: Deterministic Platform Scoring

While Ramsay handles text quality, ContentForge provides deterministic platform-specific scoring — a tweet scores differently than a LinkedIn post, because the slop threshold varies by context:

import requests

BASE = "https://contentforge-api-lpp9.onrender.com"

# Score the same content across platforms
r = requests.post(f"{BASE}/v1/score_multi", json={
    "text": "Our AI pipeline just shipped 50 posts in 2 hours. Zero slop. Here's how...",
    "platforms": ["twitter", "linkedin", "threads"]
})

# Auto-improve: score → rewrite → re-score loop
r = requests.post(f"{BASE}/v1/auto_improve", json={
    "text": "We built something cool with AI",
    "platform": "twitter",
    "max_iterations": 5
})
# Returns best version + full iteration history

# Batch quality gate for content calendars
r = requests.post(f"{BASE}/v1/quality_gate", json={
    "posts": [
        {"text": "Post 1 draft...", "platform": "twitter"},
        {"text": "Post 2 draft...", "platform": "linkedin"},
    ]
})

ContentForge has 0% variance on the same input — deterministic heuristics always produce the same score. This makes it reliable as a first-pass filter. Ramsay's LLM evaluation has the typical ~15% variance of any LLM judge, but its code-enforced pass/fail gates prevent the LLM from making the final call.

The Feature ↔ Reward Symmetry Principle

If you're fine-tuning your own models for content generation (and the Social Media AI Engineering ETL pipeline makes this accessible), there's a deeper architectural principle at work: the features you extract in your quality pipeline should be the same features you optimize during training.

The ETL pipeline, by Jacob Warren, demonstrates the pattern explicitly. Every feature extracted in the ETL stage has a corresponding GRPO reward function:

ETL Feature Extraction	GRPO Reward Function
Bullet style analysis	`bullet_style_reward_func`
Tone analysis	`tone_alignment_reward_func`
Emoji usage patterns	`emoji_usage_reward` + `emoji_variety_reward`
Post length constraints	`precise_post_length_reward`
Sentence structure	`sentence_structure_reward_func`

This symmetry means you're not just detecting slop — you're training the model to avoid producing it in the first place. The quality gate and the training pipeline are two sides of the same specification.
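
A minimal sketch of the symmetry, using post length as the shared feature: one function extracts the feature, and both the gate and the reward are built on it. The names, the 400-900 character range, and the linear decay are assumptions for illustration. This is not the `precise_post_length_reward` function from the repository, and it is not tied to any particular training library.

```python
from typing import List

TARGET_MIN, TARGET_MAX = 400, 900  # hypothetical character range


def post_length_feature(text: str) -> int:
    """Single source of truth for the feature: character count."""
    return len(text)


def length_gate(text: str) -> bool:
    """Quality gate: pass only when the feature is inside the target range."""
    return TARGET_MIN <= post_length_feature(text) <= TARGET_MAX


def length_reward(completions: List[str]) -> List[float]:
    """Reward on the same feature: 1.0 in range, decaying linearly outside."""
    rewards = []
    for text in completions:
        n = post_length_feature(text)
        if n < TARGET_MIN:
            distance = TARGET_MIN - n
        elif n > TARGET_MAX:
            distance = n - TARGET_MAX
        else:
            distance = 0
        rewards.append(max(0.0, 1.0 - distance / TARGET_MIN))
    return rewards
```

Because both functions call `post_length_feature`, changing the definition of the feature changes the gate and the training signal together, which is the point of keeping them symmetric.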

Domain-Specific Thresholds

The paper's most underappreciated finding: what counts as slop varies by domain. The threshold for a news article differs from the threshold for a QA answer, which differs again from the threshold for a social media post.

This means you can't take a general-purpose slop detector and apply it to social media. You need:

Domain-specific banned phrases — Social media tolerates informal language that would be slop in a research paper
Platform-specific thresholds — LinkedIn rewards longer, structured posts; Twitter penalizes them
Brand-specific voice anchors — Your brand voice is slop if it sounds like everyone else's, even if the grammar is perfect

At SMF Works, we calibrate our quality gates against our best-performing posts. The thresholds aren't abstract — they're derived from engagement data. A post that scored 72 on our rubric and got 3x average engagement tells us more about where the threshold should be than any academic paper.
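
One simple way to turn that idea into a number, sketched under assumptions: take the published posts whose engagement beat the average by some lift, and use the lowest rubric score among them as the pass threshold. The function name, the 1.5x lift, and the sample data are all hypothetical; this is not our exact calibration procedure.

```python
from statistics import mean
from typing import List, Tuple


def calibrate_threshold(
    history: List[Tuple[float, float]], lift: float = 1.5
) -> float:
    """history holds (rubric_score, engagement) pairs for published posts.

    Returns the lowest rubric score among posts whose engagement reached
    `lift` times the average, i.e. the weakest score that still performed.
    """
    if not history:
        raise ValueError("history is empty")
    baseline = mean(e for _, e in history)
    winners = [s for s, e in history if e >= lift * baseline]
    if not winners:
        raise ValueError("no post reached the engagement lift")
    return min(winners)


# Illustrative data only: (rubric score, engagement count)
HISTORY = [(50, 80), (58, 100), (64, 120), (72, 900)]
```

With this made-up history the average engagement is 300, only the post that scored 72 clears the 1.5x lift, and the threshold lands at 72. With real data you would want far more than four posts before trusting the result.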

The Full Stack in Practice

Here's the complete detection stack, running from cheapest/fastest to most expensive/slowest:

Input text
  │
  ├─ Regex precheck (Ramsay) ─────── banned phrases, structural tells ─── <1ms, $0
  │
  ├─ Statistical analysis ────────── readability, compression ratio ─── <10ms, $0
  │
  ├─ Deterministic scoring (ContentForge) ── density, verbosity, templatedness ── <50ms, $0
  │
  ├─ LLM rubric scoring (Ramsay) ── relevance, coherence, tone ─── 2-5s, ~$0.003
  │
  ├─ Factuality grounding (Ramsay) ─ claim extraction + verification ─── 3-8s, ~$0.002
  │
  └─ Human review ────────────────── strategic fit, brand voice ─── minutes, time

Total cost per content piece: roughly $0.005 in API calls, plus however long human review takes for the ~45% of drafts that reach Gate 3 (60% pass Gate 1, and 75% of those pass Gate 2). The deterministic layers eliminate 60-70% of slop for free. The LLM layer catches most of the rest. Human review becomes the exception, not the rule.

What We Learned Running This

Three months into this pipeline at SMF Works, the data is clear:

Gate 1 (deterministic) rejects ~40% of generated content before any API call. The most common failure: banned phrases and templated structure. AI models default to "game-changing" and "let's dive in" with alarming consistency.
Gate 2 (LLM-based) rejects ~25% of what passes Gate 1. The most common failure: relevance drift — the post is well-written but doesn't serve the brief.
Gate 3 (human) rejects ~10% of what passes Gate 2. The most common failure: tone mismatch — technically correct but emotionally wrong for the audience.

The compound rejection rate (1 − 0.60 × 0.75 × 0.90 ≈ 0.595) means about 60% of AI-generated first drafts never reach publication. That's not a failure of the generation model. That's the slop signal the paper measured. Without the gates, that 60% becomes your brand.
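
Each gate's reject rate applies only to what survived the previous gate, so the rates compound multiplicatively. A quick way to check the arithmetic for the three rates listed above (the function name is just for illustration):

```python
from typing import List


def survival_rates(reject_rates: List[float]) -> List[float]:
    """Fraction of first drafts still alive after each gate, in order."""
    alive = 1.0
    out = []
    for r in reject_rates:
        alive *= 1.0 - r
        out.append(alive)
    return out


after_g1, after_g2, after_g3 = survival_rates([0.40, 0.25, 0.10])
print(f"reach Gate 2: {after_g1:.1%}")
print(f"reach Gate 3: {after_g2:.1%}")
print(f"published:    {after_g3:.1%}")
print(f"rejected:     {1 - after_g3:.1%}")
```

This prints 60.0%, 45.0%, 40.5%, and 59.5%: a little under half of all drafts reach human review, and roughly four in ten are published.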

Building Your Own

Start with the taxonomy. Read the paper. Map your worst AI output against the 10 codes. You'll see patterns immediately.
Build banned phrase lists. Collect every AI tell your team spots. Add them to a Ramsay-style regex precheck. This alone catches 30-40% of slop.
Add deterministic scoring. Use ContentForge or build your own heuristics for density, verbosity, and templatedness. Zero cost, zero variance.
Layer LLM evaluation. Use Ramsay or build rubric-based scoring with hard floors. Don't let the LLM make the pass/fail call — code enforces that.
Calibrate against your data. Your slop thresholds are different from mine. Your best-performing content defines your standard.
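
For the banned-phrase step, a precheck can be as small as one compiled pattern plus a per-paragraph em-dash count. The sketch below is a standalone approximation of a Ramsay-style precheck, not Ramsay's own code; the `precheck` function, its return format, and the one-em-dash limit are assumptions.

```python
import re
from typing import List

BANNED_PHRASES = [
    "game-changing",
    "in today's landscape",
    "it's worth noting",
    "let's dive in",
    "the reality is",
    "revolutionize the way",
]

# One case-insensitive alternation over all escaped phrases.
BANNED_RE = re.compile("|".join(re.escape(p) for p in BANNED_PHRASES), re.IGNORECASE)
EM_DASH = "\u2014"


def precheck(text: str, max_em_dashes: int = 1) -> List[str]:
    """Return a list of violations; an empty list means the text passes."""
    # Normalize curly apostrophes so "let's" matches either way.
    normalized = text.replace("\u2019", "'")
    violations = [
        f"banned phrase: {m.group(0).lower()}"
        for m in BANNED_RE.finditer(normalized)
    ]
    # Paragraphs are assumed to be separated by a blank line.
    for i, paragraph in enumerate(normalized.split("\n\n"), start=1):
        if paragraph.count(EM_DASH) > max_em_dashes:
            violations.append(f"paragraph {i}: multiple em dashes")
    return violations
```

Returning the list of violations rather than a bare boolean makes it easy to log which tells show up most often, which in turn tells you what to add to the generation prompt.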

The slop problem is solvable. Not by building a better generation model, but by building a better detection pipeline. The taxonomy gives us the vocabulary. The tools give us the implementation. The pipeline gives us the architecture.

Stop shipping slop. Start measuring it.

References:

Shaib, C., Chakrabarty, T., Garcia-Olano, D., & Wallace, B. C. (2025). Measuring AI "Slop" in Text. arXiv:2509.19163. Submitted to ICLR 2026.
Ramsay — Quality-controlled text generation with rubric gates. github.com/davegoldblatt/ramsay
ContentForge — Deterministic pre-publish quality gates. github.com/CaptainFredric/ContentForge
Social Media AI Engineering ETL — Manifest-driven fine-tuning pipeline. github.com/jacobwarren/social-media-ai-engineering-etl