
Inside the AI Editorial Stack: How We Produce Beyond The Prompt

We built an AI-augmented editorial operation to produce this publication. Here is exactly what that looks like — the tools, the workflow, the agents, and the humans still in the loop.

Beyond The Prompt
April 26, 2026 · 7 min

Every week, Beyond The Prompt publishes articles about AI systems, agent architectures, and the mechanics of building with language models. What we have not written about yet is the operation behind the publication itself — how we actually produce these pieces.

We use AI to write this. Significantly, specifically, and by design. This is not a quiet experiment we are running on the side. It is the whole point. We built an AI-augmented editorial pipeline and then used that pipeline to ship a real publication. This article is a transparent account of how that works, what it looks like in practice, and where the seams still show.

Why Build This

The honest version: we wanted to write about AI systems for practitioners, and we thought the most credible way to do that was to actually operate one.

There is a gap in most AI coverage between the demo and the production system. Articles describe what a model can do in ideal conditions. They do not describe the scaffolding required to make it do that thing reliably, at scale, on unpredictable inputs. We wanted to write from inside that gap, and building the publication itself gave us a reason to live there.

The less idealistic version: the content cadence required for a weekly publication is genuinely hard to maintain with a small team. AI assistance was not an optional nice-to-have. It was the only realistic path to shipping consistently.

The Pipeline Architecture

The editorial pipeline runs from brief to published MDX file through a sequence of coordinated agent steps. At its core, the workflow looks like this:

ContentBrief → Writer → Editor → [SEO, FactCheck, Design] → Orchestrator → QAOutput

Each role maps to a distinct agent with defined responsibilities, not a single model handling everything in one pass. This matters for the same reasons prompt chaining matters in any production AI system: each step has a single responsibility, produces structured output, and can fail independently without corrupting the whole run.
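As a rough illustration, the handoffs can be thought of as typed payloads passed from step to step. This is a minimal sketch, not our exact schema; the field names simply mirror what the brief and drafts carry in the description below.

```python
from dataclasses import dataclass, field

@dataclass
class ContentBrief:
    scope: str          # what the article covers
    audience: str       # target reader and their knowledge level
    angle: str          # the argument the piece should make
    target_length: int  # words

@dataclass
class EditedDraft:
    body_mdx: str
    flagged_claims: list[str] = field(default_factory=list)

# Each step is a function with one job and a typed output, e.g.:
#   write(brief: ContentBrief) -> str
#   edit(draft_mdx: str) -> EditedDraft
# A failure in `edit` leaves the writer's output intact and retryable.
```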

Content Brief defines the article's scope, target audience, angle, and length. The brief is the one place where editorial judgment is most concentrated. What we decide here shapes everything downstream, and it is currently the step most dependent on human input.

Writer produces the first draft. It works from the brief and draws on context about the publication's voice and existing articles. The writer is not generating from scratch with no constraints — it has style guidance, examples, and explicit directions about what to avoid (hype, hedging, vague AI boosterism).

Editor revises the draft with specific instructions: tighten passive constructions, flag claims that need sourcing, ensure the piece has a clear argument rather than a list of observations. This is the step that most often surfaces the structural problems — an article that has good sentences but no spine.

Specialist agents (SEO, Fact Checker, Design) run in parallel after the editorial pass. SEO validates keyword fit and meta content. Fact Checker flags assertions that look like they need verification. Design produces image specifications and visual notes. These run concurrently because they are independent — none of them need each other's output to do their jobs.
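As a rough illustration of that fan-out, here is a minimal asyncio sketch. The function names and return shapes are placeholders, not our actual implementation; the point is only that the three passes share an input and nothing else.

```python
import asyncio

async def run_seo(draft: str) -> dict:
    # placeholder: would ask the SEO agent for keyword and meta findings
    return {"keywords_ok": True}

async def run_factcheck(draft: str) -> dict:
    # placeholder: would return claims flagged for human verification
    return {"flagged_claims": []}

async def run_design(draft: str) -> dict:
    # placeholder: would return image specs and visual notes
    return {"image_specs": []}

async def specialist_pass(draft: str) -> dict:
    # the three passes are independent, so they run concurrently
    seo, facts, design = await asyncio.gather(
        run_seo(draft), run_factcheck(draft), run_design(draft)
    )
    return {"seo": seo, "factcheck": facts, "design": design}

# reports = asyncio.run(specialist_pass(draft_mdx))
```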

Orchestrator collects the specialist outputs, resolves conflicts (SEO wants a phrase the editor removed, for example), and produces a final reconciled draft.

QA Output does a final check: does the frontmatter validate, is the MDX well-formed, are there any obvious formatting errors before the file is committed to the repo?
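A minimal sketch of the kind of check that step runs, assuming YAML frontmatter delimited by `---`; the required field names here are illustrative, not our actual schema.

```python
from pathlib import Path
import yaml  # PyYAML

REQUIRED_FIELDS = {"title", "description", "date"}  # illustrative set

def validate_article(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the file is safe to commit."""
    errors = []
    text = path.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return ["missing frontmatter block"]
    try:
        frontmatter, _, body = text[3:].partition("\n---\n")
        meta = yaml.safe_load(frontmatter) or {}
    except yaml.YAMLError as exc:
        return [f"frontmatter does not parse: {exc}"]
    if not isinstance(meta, dict):
        return ["frontmatter is not a key-value mapping"]
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        errors.append(f"missing frontmatter fields: {sorted(missing)}")
    if not body.strip():
        errors.append("article body is empty")
    return errors
```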

The Technology

The agent system runs on CrewAI, a Python framework for multi-agent coordination. Each agent has a defined role, a backstory that establishes its perspective, explicit goals, and access to a set of tools. Agents communicate through a structured handoff mechanism — one agent's final output becomes another's input, with the orchestrator managing state across the pipeline.
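Here is a minimal sketch of how two of these agents and their handoff are wired in CrewAI. The role text, goals, and task descriptions are illustrative rather than our exact configuration, and model selection (Claude, per the next paragraph) is configured separately and omitted here.

```python
from crewai import Agent, Task, Crew, Process

writer = Agent(
    role="Staff Writer",
    goal="Draft articles that match the publication's voice and the brief's angle",
    backstory="You write for practitioners and avoid hype and vague boosterism.",
)

editor = Agent(
    role="Editor",
    goal="Tighten prose, flag unsourced claims, ensure the piece makes one clear argument",
    backstory="You care about structure: an article needs a spine, not just good sentences.",
)

draft_task = Task(
    description="Write a first draft from the attached content brief.",
    expected_output="A complete MDX draft with frontmatter.",
    agent=writer,
)

edit_task = Task(
    description="Revise the draft: tighten passive constructions, flag claims needing sources.",
    expected_output="A revised draft plus a list of flagged claims.",
    agent=editor,
    context=[draft_task],  # the editor works from the writer's output
)

crew = Crew(agents=[writer, editor], tasks=[draft_task, edit_task], process=Process.sequential)
result = crew.kickoff()
```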

The underlying model across most agents is Claude. We chose it primarily for the quality of its instruction-following at longer outputs and the consistency of voice it produces. A model that writes one excellent article and one incoherent one is not useful for a pipeline. Reliability matters more than occasional brilliance.

Content is stored as MDX — Markdown with embedded JSX components. We chose this format because it works cleanly with the publication's Next.js frontend and keeps content and presentation concerns mostly separate. An article is a file. It lives in a directory. It has frontmatter that the site knows how to render. The simplicity is deliberate.

The frontend is Next.js 15 with App Router. It reads the MDX files, renders articles, and handles routing. There is no CMS, no database for content, no admin panel. Articles are files in a repository. This is not sophisticated, and it is exactly right — complexity you do not need is complexity that breaks.

What Works Well

Speed from brief to draft. A complete first draft from a well-formed brief takes minutes. This is genuinely transformative for a small editorial operation. The bottleneck is no longer "can we produce the content" — it is "can we produce the right brief and then do the editorial work to make the draft publishable."

Structural consistency. The pipeline produces articles that reliably have the right sections, the right length range, frontmatter that validates, and MDX that renders. These are the kinds of mechanical concerns that consume surprising amounts of time in traditional editorial operations. Automating them frees up attention for things that actually require judgment.

Parallelism at the specialist layer. Running SEO, fact-checking, and design in parallel means those steps add almost no latency to the total pipeline time. In a sequential workflow, they would be the long tail. Here they are effectively free in terms of turnaround.

Iterability. Because every article starts as a brief and every step is reproducible, revising an article is fast. Change the brief, re-run the pipeline, get a different draft. This lowers the cost of trying angles that do not work out.

What Is Still Rough

Voice consistency across articles. The writer agent has style guidance, but the voice still varies more than we would like. An article produced by one configuration of the writer agent sounds slightly different from one produced after a prompt adjustment. Managing that consistency across a growing catalog is an open problem. Humans notice it even when they cannot articulate what shifted.

The brief remains a skilled task. We described this above, but it is worth underscoring. The quality of the output is gated almost entirely by the quality of the brief. A vague brief produces a generic article. A brief that specifies the argument, the target reader's knowledge level, the specific claims to make and the ones to avoid — that produces something worth publishing. Writing good briefs is not easier than writing articles; it is different, and currently it requires a person who understands both the subject matter and the publication's editorial standards.

Fact verification at scale. The fact checker agent flags suspicious claims, but it does not verify them. Verification requires retrieval — querying sources, cross-referencing claims — and while that is technically possible to add to the pipeline, we have not built it robustly. Right now, a human reads every article before publication with specific attention to factual claims. That step is not optional.

Hallucination in technical detail. On articles about specific technologies, APIs, or code patterns, the writer will sometimes produce confident-sounding detail that is wrong. Not catastrophically wrong — usually plausible-sounding-but-off in a way that practitioners will catch. The editorial pass is supposed to surface these, but it requires a technically literate human reviewer who knows enough to question the specific claims.

MDX edge cases. The QA agent catches most formatting errors, but complex MDX — components with multiple props, nested structures, unusual code block annotations — still occasionally produces files that need manual repair. This is partly a model limitation and partly a testing limitation. We test the common cases well. The uncommon cases surface in production.

What Humans Must Still Own

We want to be precise about this, because the alternatives are either to overclaim (we have automated editorial work) or to underclaim (humans are doing all the real work and AI is just autocomplete).

Judgment about what to publish. The pipeline can produce an article. It cannot decide whether that article is worth the reader's time. Editorial curation — choosing angles that matter, killing pieces that are technically correct but dull, commissioning coverage of things that are not yet in the conversation — is not a task the pipeline can do.

Voice at the sentence level. The model produces competent prose. It does not produce the kind of writing that makes a reader want to keep reading purely for the pleasure of the sentences. That quality, when it appears, comes from human editing. We do not always achieve it, but it is the target.

Credibility signals. A publication earns trust through the specificity and accuracy of its claims over time. That trust is built by humans who understand what they are writing about well enough to know when something is not quite right. The fact checker agent helps, but credibility is not a pipeline output. It is a relationship.

Escalation decisions. When the pipeline produces something that seems off — an article that is technically correct but says something inadvisable, or a piece that is not factually wrong but takes a position we are not sure we want to take — those are not decisions you can automate. Someone has to make the call.

What This Operation Enables

With this stack, a very small team can operate an editorial publication at a pace and coverage breadth that would otherwise require a much larger staff.

The more interesting benefit is what it does to the cost of iteration. Because drafts are cheap, we try more angles. Because revision is fast, we publish less that is mediocre. Because the mechanical work is handled, attention goes to the editorial tasks that actually matter.

The publication exists as an ongoing test of what AI-augmented editorial work can produce at its current level of development. Some of what we publish is better than it would have been without the pipeline. Some of it still carries the signature limitations of the technology. We track the difference and try to improve both the process and our understanding of where it makes sense to apply it.

This is not the final form of AI-augmented publishing. It is an early one. The honest position is that we are operating at the frontier of what is practical right now, with full awareness that the frontier is moving. What requires human involvement today may not in two years. What the pipeline handles clumsily now it will likely handle well. The constraints we are working within are real but temporary.

What will not change: the editorial judgment required to make content worth reading. The pipeline handles production. The question of what to produce, and whether it is worth a reader's time, stays with the humans.

That part, we are not in a hurry to automate.