Agent mesh architecture: the A2A event bus we ship to BFSI
Why we moved away from a star topology, what AgentCop catches in production, and the four invariants we enforce on every agent-to-agent message — even when both ends are first-party.
The first version of Blackbead was a star. One orchestrator in the middle, every agent talking to the centre. It was easy to reason about, easy to instrument, and entirely wrong for the workloads our customers were starting to run.
This post walks through why we moved to a mesh with an A2A (agent-to-agent) event bus, what shipped in PR #11 as the first version of the bus, and the four invariants we now enforce on every message that traverses it.
Why the star broke
Three pressures stacked up:
- Latency. Every hop went through the centre. For a fraud-scoring flow with five agents in the chain, that's ten round trips when it should be five.
- Blast radius. The orchestrator was a single point of compromise. A prompt injection that escalated through the orchestrator's allowlist could touch any agent in the fleet.
- Cross-customer noise. The orchestrator was the only place that saw everything — which meant it was also the only place that logged everything. Per-tenant queries got slower and noisier as the fleet grew.
The mesh fixes the first two and lets us partition the third by tenant.
The shape we ship
  ┌────────┐         ┌────────┐
  │ Agent  │ ◀─────▶ │ Agent  │
  │   A    │         │   B    │
  └───┬────┘         └───┬────┘
      │                  │
      ▼                  ▼
┌───────────────────────────────┐
│  A2A Event Bus (pub/sub)      │
│  ↳ tenant-partitioned topics  │
│  ↳ signed envelopes           │
│  ↳ AgentCop tap on every msg  │
└───────────────────────────────┘
      ▲                  ▲
      │                  │
  ┌───┴────┐         ┌───┴────┐
  │ Agent  │ ◀─────▶ │ Agent  │
  │   C    │         │   D    │
  └────────┘         └────────┘
Agents publish events. Other agents subscribe to topics they care about. The bus partitions topics by tenant, signs every envelope, and runs AgentCop as a tap that sees every message without being on the critical path.
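To make the tenant partitioning concrete, here is a minimal in-process sketch. It is not the shipped bus: the `Bus` class, its method names, and the idea of keying subscriptions on a `(tenant, topic)` pair are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], None]


class Bus:
    """Toy pub/sub bus whose topics are partitioned by tenant (hypothetical)."""

    def __init__(self) -> None:
        # Subscriptions are keyed on (tenant, topic), so one tenant's
        # subscribers can never receive another tenant's events.
        self._subs: Dict[Tuple[str, str], List[Handler]] = defaultdict(list)

    def subscribe(self, tenant: str, topic: str, handler: Handler) -> None:
        self._subs[(tenant, topic)].append(handler)

    def publish(self, tenant: str, topic: str, message: dict) -> int:
        """Deliver to matching subscribers; return how many received it."""
        handlers = self._subs.get((tenant, topic), [])
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = Bus()
received: List[dict] = []
bus.subscribe("tenant-a", "fraud.scores", received.append)
bus.publish("tenant-a", "fraud.scores", {"score": 0.1})  # delivered
bus.publish("tenant-b", "fraud.scores", {"score": 0.9})  # no subscribers
```

Because the tenant is part of the key rather than a field inside the message, a per-tenant query in this sketch only ever touches that tenant's partition.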
Invariant 1: every envelope is signed
The signing key is workload-identity-bound (see the non-human identity post). The receiver verifies the signature before deserialising. If the signature fails, the message is dropped and a security event is emitted on a separate, out-of-band channel. We've never seen a legitimate message fail this check; we've seen attempted spoofs in red-team exercises and they get caught at the receiver, not at the bus.
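The verify-then-deserialise order can be sketched as follows. This is an illustration under stated assumptions: HMAC-SHA256 with a shared secret stands in for whatever workload-identity-bound scheme the bus actually uses, and `receive` and `on_security_event` are hypothetical names.

```python
import hashlib
import hmac
import json
from typing import Callable, Optional


def sign(key: bytes, payload: bytes) -> bytes:
    """Assumed signing scheme for this sketch: HMAC-SHA256 over the raw bytes."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def receive(
    key: bytes,
    payload: bytes,
    signature: bytes,
    on_security_event: Callable[[dict], None],
) -> Optional[dict]:
    """Verify the signature first; only parse the payload if it checks out."""
    expected = sign(key, payload)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading bytes of a forged signature happened to match.
    if not hmac.compare_digest(expected, signature):
        # Drop the message and report on a separate channel (a callback here).
        on_security_event({"reason": "bad_signature", "payload_len": len(payload)})
        return None
    return json.loads(payload)
```

The point of the ordering is that the parser never sees untrusted bytes: a spoofed envelope is rejected before `json.loads` runs.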
Invariant 2: provenance chains, not just signatures
The envelope carries a chain of every agent that has touched the message. Each hop appends its signed claim. The receiver can answer "who started this?" without trusting any intermediate to truthfully say so. This is the single most useful field during incident response — when a tainted output appears at agent D, the chain tells you exactly which path got it there.
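One way to build such a chain is to have each hop sign the payload digest together with the previous hop's signature, so that hops cannot be removed or reordered without detection. The sketch below assumes per-agent HMAC keys and a payload that does not change between hops; `Hop`, `append_hop`, and `origin` are hypothetical names, not the bus's API.

```python
import hashlib
import hmac
from typing import Dict, List, NamedTuple, Optional


class Hop(NamedTuple):
    agent: str
    sig: bytes


def _claim_sig(key: bytes, agent: str, payload_digest: bytes, prev_sig: bytes) -> bytes:
    # Each claim covers the agent name, the payload, and the previous claim.
    msg = hashlib.sha256(agent.encode()).digest() + payload_digest + prev_sig
    return hmac.new(key, msg, hashlib.sha256).digest()


def append_hop(chain: List[Hop], agent: str, key: bytes, payload: bytes) -> List[Hop]:
    """Return a new chain with this agent's signed claim appended."""
    digest = hashlib.sha256(payload).digest()
    prev_sig = chain[-1].sig if chain else b""
    return chain + [Hop(agent, _claim_sig(key, agent, digest, prev_sig))]


def origin(chain: List[Hop], keys: Dict[str, bytes], payload: bytes) -> Optional[str]:
    """Return the first agent in the chain, or None if any claim fails to verify."""
    digest = hashlib.sha256(payload).digest()
    prev_sig = b""
    for hop in chain:
        key = keys.get(hop.agent)
        if key is None:
            return None
        expected = _claim_sig(key, hop.agent, digest, prev_sig)
        if not hmac.compare_digest(expected, hop.sig):
            return None
        prev_sig = hop.sig
    return chain[0].agent if chain else None
```

In this sketch the receiver answers "who started this?" by verifying every claim itself, so an intermediate agent that strips or rewrites an earlier hop breaks the chain instead of quietly changing the answer.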
Invariant 3: schema before semantics
Every topic has a schema. Messages that don't validate against the schema never enter the bus. This sounds obvious; it stopped being obvious the moment we let agents start generating messages from natural-language prompts. The temptation is to allow free-form payloads "for flexibility." We hold the line — schemas, every time — because the cost of letting a malformed payload propagate is much higher than the cost of failing fast at the publisher.
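Failing fast at the publisher might look like the sketch below. The schema registry, the `fraud.scores` topic, and its field names are invented for illustration; a real deployment would more likely use a proper schema language than a dict of Python types.

```python
from typing import Dict

# Hypothetical registry: topic name -> required fields and their types.
SCHEMAS: Dict[str, Dict[str, type]] = {
    "fraud.scores": {"transaction_id": str, "score": float, "explanation": str},
}


class SchemaError(ValueError):
    """Raised at the publisher, before the message can enter the bus."""


def validate(topic: str, message: dict) -> None:
    schema = SCHEMAS.get(topic)
    if schema is None:
        raise SchemaError(f"no schema registered for topic {topic!r}")
    missing = schema.keys() - message.keys()
    extra = message.keys() - schema.keys()
    if missing or extra:
        raise SchemaError(f"missing={sorted(missing)} unexpected={sorted(extra)}")
    for field, expected in schema.items():
        if not isinstance(message[field], expected):
            raise SchemaError(f"{field!r} must be {expected.__name__}")
```

Rejecting unexpected fields as well as missing ones is a deliberate choice in this sketch: it is what stops a free-form payload generated from a prompt from riding along inside an otherwise valid message.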
Invariant 4: AgentCop sees everything, but isn't load-bearing
AgentCop runs as a tap. It receives a copy of every message, scores it for PII, SPI, credentials, and OWASP LLM Top 10 patterns, and writes its findings to the audit log. It can raise events on the security topic that other agents subscribe to, but it cannot block delivery on the main path. The reason is simple: if AgentCop fails, the bus has to keep working. We make the security observability layer redundant to the business layer, never the other way around.
Security observability must be redundant, not load-bearing. If your security layer can take down your business layer, attackers will figure out how to take down your security layer first.
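A tap that cannot block delivery can be sketched with a bounded queue and a non-blocking put. Everything here is an assumption for illustration, including the `TappedBus` name and the policy of dropping (and counting) tap copies when the queue is full; the source does not say how the real bus handles a stalled tap.

```python
import queue
from typing import Callable, List


class TappedBus:
    """Toy bus: subscribers always get the message, the tap gets a best-effort copy."""

    def __init__(self, tap_capacity: int = 1000) -> None:
        self.tap_queue: "queue.Queue[dict]" = queue.Queue(maxsize=tap_capacity)
        self.dropped_tap_copies = 0
        self._handlers: List[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self._handlers.append(handler)

    def publish(self, message: dict) -> None:
        try:
            # put_nowait never waits: if the tap consumer has stalled,
            # the copy is dropped rather than holding up delivery.
            self.tap_queue.put_nowait(dict(message))
        except queue.Full:
            self.dropped_tap_copies += 1  # make the gap visible
        for handler in self._handlers:
            handler(message)
```

A separate worker would drain `tap_queue` and do the scoring. If that worker dies, `dropped_tap_copies` climbs but subscribers keep receiving messages, which is the property the invariant asks for.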
What this caught last month
One real example. A customer's fraud-scoring agent started occasionally producing scores that included internal model debug strings in the explainability field. The strings were benign — internal logging that leaked through a code path nobody had touched in months — but they were PII-adjacent (they contained masked customer IDs in a recoverable format).
AgentCop scored them. No individual score crossed the alert threshold, but the per-agent baseline shifted. The shift was visible on the dashboard within three days. Engineering tracked it down to a logging change in the model server, patched it, and moved on.
In the star version of the architecture, that signal would have lived in the orchestrator's logs, mixed in with every other tenant's traffic, and almost certainly gone unnoticed.
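The distinction between a per-message alert threshold and a per-agent baseline shift can be illustrated with a simple z-test on the mean of a recent window. This is not AgentCop's scoring logic, which the post does not describe; the function name, the threshold of 3, and all the numbers are made up for the example.

```python
from statistics import mean, pstdev
from typing import Sequence


def baseline_shifted(
    baseline: Sequence[float], recent: Sequence[float], z_threshold: float = 3.0
) -> bool:
    """True if the recent window's mean sits well above the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return mean(recent) > mu
    # Standard error of the mean for a window of len(recent) samples.
    standard_error = sigma / len(recent) ** 0.5
    return (mean(recent) - mu) / standard_error > z_threshold


ALERT_THRESHOLD = 0.8  # illustrative per-message alert level
baseline_scores = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.12, 0.10]
recent_scores = [0.20, 0.22, 0.21, 0.19]

no_single_alert = all(score < ALERT_THRESHOLD for score in recent_scores)
shifted = baseline_shifted(baseline_scores, recent_scores)
```

With these illustrative values no single message would alert, yet the window mean has roughly doubled relative to the baseline, so the shift is flagged. That only works when the baseline is kept per agent and per tenant instead of being averaged across the fleet.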
What's still hard
Two things are still hard, in case anyone is shipping this and wants the honest version:
- Backpressure. When one agent spikes, the bus has to push back on its publishers without starving its subscribers. We use per-tenant token buckets; they work but require tuning.
- Replay. Replaying an incident from the audit log into a staging mesh is harder than it looks because the signatures don't validate cross-environment. We sign replays with a separate key and tag them; it's clunky and we'll have a better answer soon.
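For the backpressure point, a per-tenant token bucket can be sketched as below. The class names, the rate and capacity parameters, and the injectable clock are assumptions for illustration; the tuning the post mentions amounts to picking `rate` and `capacity` per tenant.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(
        self, rate: float, capacity: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so a spike in one tenant cannot starve another."""

    def __init__(
        self, rate: float, capacity: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._rate, self._capacity, self._clock = rate, capacity, clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant: str) -> bool:
        bucket = self._buckets.get(tenant)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[tenant] = bucket
        return bucket.try_acquire()
```

A publisher whose `allow` call returns `False` would back off and retry. Because each tenant draws from its own bucket, a burst from one tenant exhausts only that tenant's tokens.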
What to take away
If you're running more than a handful of agents and they need to coordinate, a mesh with a typed, signed event bus is the architecture that holds up. The four invariants above are the cheapest ones to get right at the start and the most expensive to retrofit later. Start with signing and provenance — they pay for themselves the first time you have to write an incident report.