The LLM-augmented research workflow we use internally

Most consultancies hide their internal AI use behind a curtain of human authorship. The framing is roughly: "the deliverable was written by our analysts; we may have used AI to help, but the work is human." We think the curtain is doing more harm than good — to clients, to the discourse, and to our own work. So this note describes what we actually use, where, and what discipline we apply.

What the AI agents do inside our research pipeline

A typical Zhianrui research workstream takes a brief from a client and produces a structured dossier — claims tagged to source rows, A–E tier classification, confidence rubric, open-questions register. The pipeline that produces this dossier has agentic steps in the following places:

Source identification. Given the brief and a scoping outline, an agent enumerates candidate source domains and document types. A human reviewer prunes and adds. The agent does not have authority to commit a source to the dossier; it produces a candidate list.
Tier classification. Given a candidate source, an agent proposes the A–E tier with reasoning. A human reviewer accepts, rejects, or amends. Disagreements between the agent and the reviewer are logged; periodic reviews of the disagreement log are how we recalibrate the agent.
Contradiction detection. Given a draft dossier and the source set, an agent flags claims that disagree with one another, claims that contradict the source they cite, and claims whose source is weaker than the confidence rating implies. The agent does not resolve contradictions; it surfaces them for the open-questions register.
Confidence scoring. Given a claim and its sources, the agent proposes a confidence rating. As above, a reviewer accepts or amends; disagreements feed back into calibration.

The pattern is consistent across all four steps. The agents propose; humans dispose. The agents do the throughput-bound work — enumerating, classifying, cross-checking — at speeds the humans cannot match. The humans do the judgement-bound work — accepting, rejecting, contextualising. Neither does the other's job.
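As an illustration only, the propose-and-dispose pattern can be sketched in a few lines of Python. This is a minimal sketch, not our production code: the class names (Proposal, Review, Pipeline), the field names, and the example source ids are all hypothetical, and a real review queue would carry far more context. What it shows is the one structural property that matters: the only path into the dossier runs through a human review, and any difference between the proposal and the decision is logged.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Proposal:
    """An agent's candidate output; never written to the dossier directly."""
    step: str        # hypothetical step name, e.g. "tier_classification"
    subject: str     # the source or claim the proposal is about
    value: str       # proposed tier, confidence rating, etc.
    reasoning: str   # the agent's stated reasoning


@dataclass
class Review:
    """A human decision on one proposal."""
    proposal: Proposal
    accepted_value: Optional[str]  # None means the reviewer rejected it outright
    reason: str = ""


@dataclass
class Pipeline:
    dossier: list = field(default_factory=list)
    disagreement_log: list = field(default_factory=list)

    def dispose(self, review: Review) -> None:
        """Apply a human decision; log it whenever it differs from the agent."""
        if review.accepted_value != review.proposal.value:
            self.disagreement_log.append(review)
        if review.accepted_value is not None:
            self.dossier.append((review.proposal.subject, review.accepted_value))


# Usage: one proposal accepted as-is, one amended by the reviewer.
pipeline = Pipeline()
p1 = Proposal("tier_classification", "source-001", "B", "secondary analysis")
p2 = Proposal("tier_classification", "source-002", "A", "primary filing")
pipeline.dispose(Review(p1, accepted_value="B"))
pipeline.dispose(Review(p2, accepted_value="C", reason="filing is a draft"))
```

In this sketch the agent has no method that writes to the dossier at all; the amended tier for source-002 lands in both the dossier and the disagreement log.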

Why we say this out loud

Three reasons.

First, our clients are entitled to know how their deliverables are produced. A dossier with agentic steps in the workflow is not lower-quality than a dossier produced entirely by humans — but it is a different production process, with different failure modes. Pretending it is the same is a category error that will eventually show up as an unpleasant surprise.

Second, we sell research about AI systems engineering. A consultancy that builds AI systems for clients but pretends not to use AI in its own work is making an architectural claim through the back door — that AI systems are too unreliable for serious research. We do not believe that. Our systems are reliable enough for our own research, with the discipline described above; they are also reliable enough for our clients' production deployments, with the same discipline applied. The internal practice and the external deliverable are consistent.

Third, the discipline transfers. The agents in our internal workflow are governed by the same provenance-first framework we apply to human researchers and to client agent systems. The framework does not care whether the entity producing an evidence row is human or model — it cares whether the row resolves to a verifiable source. That symmetry is the point. A consultancy whose internal AI use is governed by different rules than the systems it ships for clients is signalling that the rules are not really structural; they are marketing.

What the discipline actually looks like

The agents in our pipeline operate under three rules:

No agent commits anything to the deliverable directly. Every agentic output is a candidate for a human review queue. The final dossier contains only items that survived review.
Disagreements are logged. Every time a reviewer overrides an agent proposal, the override is recorded with a brief reason. Periodically (currently weekly) we sample the override log to recalibrate the agents and update their prompting.
Agent prompts are versioned alongside research. A claim produced under agent prompt v3.2 carries that version in its source row. If we later discover prompt v3.2 had a systematic bias, we can find every claim it influenced.
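The second and third rules can be sketched the same way. The following is a hedged illustration under assumed names: SourceRow, Override, claims_influenced_by, and sample_overrides are hypothetical, as are the claim ids and the version string "v3.3". It shows why carrying the prompt version in each source row makes a later audit a simple filter, and how a periodic sample might be drawn from the override log.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRow:
    claim_id: str
    source_ref: str
    prompt_version: str  # version of the agent prompt that produced the row


@dataclass(frozen=True)
class Override:
    claim_id: str
    agent_value: str
    reviewer_value: str
    reason: str


def claims_influenced_by(rows, prompt_version):
    """Return the ids of all claims whose rows carry the given prompt version."""
    return sorted({r.claim_id for r in rows if r.prompt_version == prompt_version})


def sample_overrides(log, k, seed=None):
    """Draw up to k overrides, without replacement, for the periodic review."""
    rng = random.Random(seed)
    return rng.sample(log, min(k, len(log)))


# Usage: find every claim touched by a prompt version later found to be biased.
rows = [
    SourceRow("c1", "doc-a", "v3.2"),
    SourceRow("c2", "doc-b", "v3.3"),
    SourceRow("c3", "doc-c", "v3.2"),
]
affected = claims_influenced_by(rows, "v3.2")

log = [
    Override("c1", "A", "B", "source is a press summary"),
    Override("c3", "high", "medium", "single uncorroborated source"),
]
weekly_sample = sample_overrides(log, k=5, seed=0)
```

Because the version travels with the row rather than living in a separate run log, the audit needs nothing beyond the dossier's own data.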

This is not how most internal AI workflows operate. Most have an agent step somewhere and a human at the end, with no logging in between. The deliverable looks the same; the auditability is not the same.

Why we will keep doing this in public

There is a comfortable position available to consultancies right now: do not use AI internally, and tell clients that your hand-crafted deliverables are superior to what AI-augmented competitors produce. There is a less comfortable position: use AI internally with discipline, say so, and let the clients judge whether the discipline produces better work than its absence.

We are taking the less comfortable position. Our deliverables are visibly produced by humans and agents in collaboration, with the agents disciplined by the same provenance framework as the humans. The result is what we are willing to be measured on. A consultancy unwilling to say what is in its workflow is a consultancy whose workflow is doing something it cannot defend.
