Real-time fraud scoring inside a SWIFT MT103 hop

What it takes to put a model on the payment rail without missing the SLA. The signals we run, the latency budget we keep, and the explainability the bank's risk team actually reads.

Blackbead.ai

AI security research

Payment fraud detection is one of the oldest problems in the field. It's also the one where the gap between research and production is the most punishing: the model has to fire inside a budget you do not control, on data you cannot fully clean, with an explanation a risk officer will read in three seconds.

Last quarter we shipped a real-time fraud agent for SWIFT MT103, ISO 20022 (pacs.008), UPI, and Stripe-style REST payloads. The agent is in PR #8; this post documents the design constraints and how we held them.

The latency budget

SWIFT MT103 hops don't have a hard SLA on the wire, but every operations team has an unwritten one: their gateway watches transit time and pages on tail-latency outliers. To stay invisible to that ops team, we needed p99 inference under 80ms. The budget we actually held was 52ms p99: 35ms of work plus 17ms of headroom.

Where it goes:

Parse + canonicalise: 6ms
Sanctions screen (cached): 4ms
Velocity lookup: 9ms
Geo risk + IP reputation: 7ms
Structuring + first-beneficiary heuristics: 3ms
Score fusion + tiering: 2ms
Explainability assembly: 4ms
Headroom: 17ms
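
A quick check of that arithmetic, as a minimal sketch. The dictionary keys are shorthand for the stages listed above, not identifiers from the agent.

```python
# Stage budgets in milliseconds, copied from the breakdown above.
# Key names are illustrative shorthand, not names from the codebase.
STAGE_BUDGET_MS = {
    "parse_canonicalise": 6,
    "sanctions_screen": 4,
    "velocity_lookup": 9,
    "geo_ip_reputation": 7,
    "structuring_first_beneficiary": 3,
    "score_fusion_tiering": 2,
    "explainability_assembly": 4,
}
HEADROOM_MS = 17
P99_TARGET_MS = 80  # the ceiling we needed to stay under

work_ms = sum(STAGE_BUDGET_MS.values())       # time spent doing actual work
budget_ms = work_ms + HEADROOM_MS             # the budget we held
slack_vs_target_ms = P99_TARGET_MS - budget_ms

print(f"work={work_ms}ms budget={budget_ms}ms slack={slack_vs_target_ms}ms")
```

The stages sum to 35ms, so the 52ms budget leaves 28ms of slack against the 80ms ceiling.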

The signals are cheap because we resolved the expensive ones at index time. The sanctions list is in memory with a Bloom-filter prefilter. Velocity counters are pre-aggregated in Redis with 30s freshness. Geo risk is a 200KB lookup table that fits in CPU cache.
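
To illustrate the prefilter idea, here is a minimal sketch of a Bloom filter in front of an in-memory list. The class names, filter size, hash count, and the `normalise` helper are all assumptions made for this example. It covers exact matches on a normalised name only; the set lookup stands in for the more expensive fuzzy matcher, which would need variant keys indexed as well.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions by double hashing over one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def normalise(name: str) -> str:
    # Hypothetical canonical form: lowercase, collapsed whitespace.
    return " ".join(name.lower().split())


class SanctionsScreen:
    def __init__(self, names) -> None:
        self.exact = {normalise(n) for n in names}
        self.prefilter = BloomFilter()
        for n in self.exact:
            self.prefilter.add(n)

    def is_listed(self, name: str) -> bool:
        key = normalise(name)
        # Fast path: a Bloom miss is definitive, so most payments stop here.
        if not self.prefilter.might_contain(key):
            return False
        # Slow path: confirm against the real list to rule out a false positive.
        return key in self.exact
```

The point of the design is that the common case (a name that is not listed) exits after a handful of bit tests, and only Bloom hits pay for the full match.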

The signals

Eight signals fire on every payment:

Sanctions screen. Beneficiary name, country, and account fuzzy-matched against OFAC, UN, EU, and the bank's internal list.
Velocity. Payment counts from the same payer or to the same beneficiary over three lookback windows: per minute, per hour, and per day.
Structuring. Pattern detection for amounts just under reporting thresholds (the classic 9,900 / 49,900 patterns).
First-beneficiary. Is this the first time the payer has paid this beneficiary? First-time payments to high-risk geographies score higher.
Geo risk. Country-level risk score, plus mismatch between payer country and IP country.
Channel. API vs. branch vs. mobile vs. ATM. Same payment from different channels gets different baselines.
IBAN / account validity. Format checksum + bank-routing sanity.
Round-amount + currency anomaly. Round amounts in unusual currencies (e.g. an exact 10,000 KRW SWIFT payment from a UK SME) score higher.
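
Two of the cheaper signals fit in a few lines each. The sketch below shows the standard IBAN mod-97 checksum and a single-payment version of the just-under-threshold check. The thresholds (10,000 and 50,000) and the 2% band are assumptions for illustration; real thresholds come from policy, and real structuring detection also looks across multiple payments.

```python
from decimal import Decimal


def iban_checksum_ok(iban: str) -> bool:
    """Standard IBAN check: move the first four characters to the end,
    map letters to 10..35, and require the result mod 97 to equal 1."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


# Assumed values for this sketch only.
REPORTING_THRESHOLDS = (Decimal("10000"), Decimal("50000"))
NEAR_BAND = Decimal("0.02")  # flag amounts within 2% below a threshold


def just_under_threshold(amount: Decimal) -> bool:
    """True if the amount sits in the band just below any threshold."""
    return any(
        t * (1 - NEAR_BAND) <= amount < t for t in REPORTING_THRESHOLDS
    )
```

Amounts are handled as `Decimal` rather than `float` so that a value like 9,999.99 compares exactly against the threshold.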

Each signal returns a contribution between -1 and +1, with a confidence. The fusion stage is deliberately simple: linear combination with policy-defined weights. Simpler than the gradient-boosted alternative we tried, and the auditor can read it without a tutorial.
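
A minimal sketch of that fusion step follows. The weights and signal names are hypothetical, and two details are assumptions because the description above does not pin them down: each contribution is scaled by its confidence, and the sum is clamped to [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalResult:
    name: str
    contribution: float  # in [-1, +1]
    confidence: float    # in [0, 1]


# Hypothetical policy weights; in practice these come from the bank's policy.
POLICY_WEIGHTS = {
    "velocity": 0.5,
    "geo_risk": 0.25,
    "channel": 0.25,
}


def fuse(results, weights) -> float:
    """Linear combination of signal contributions.

    Assumptions in this sketch: confidence scales each term, signals with
    no configured weight are ignored, and the result is clamped to [0, 1].
    """
    raw = sum(
        weights.get(r.name, 0.0) * r.contribution * r.confidence
        for r in results
    )
    return max(0.0, min(1.0, raw))


score = fuse(
    [SignalResult("velocity", 1.0, 1.0), SignalResult("geo_risk", 0.5, 1.0)],
    POLICY_WEIGHTS,
)
print(score)
```

Every term in the sum is a weight from the policy times a number the signal reported, which is what lets an auditor trace a score back to its inputs by hand.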

The explainability the risk team actually reads

Initial version produced this:

{
  "score": 0.81,
  "tier": "high",
  "model_confidence": 0.94,
  "features": {
    "f_velocity_24h": 0.42,
    "f_geo_mismatch": 0.31,
    "f_first_beneficiary": 0.18,
    ...
  }
}

The risk team didn't read it. They glanced at the score, glanced at the tier, and decided. The features object was opaque. We rewrote it to the version they do read:

{
  "score": 0.81,
  "tier": "high",
  "why": [
    "First payment to this beneficiary in 11 months.",
    "Beneficiary country (KP) is on the bank's high-risk list.",
    "Payer's IP country (RU) does not match payer's registered country (GB).",
    "Amount (9,900 USD) is just under the structuring threshold."
  ],
  "policy_version": "fraud-policy-2026-04"
}

Same data, English. The risk team reads it in three seconds. The auditor reads it in five. The features object still ships, in a separate field, for the data-science team.
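
One way to get from feature contributions to sentences is a template per feature, filled from the payment context. This is a sketch under assumptions: the template table, the `build_why` function, the context keys, and the cutoff values are all invented for illustration, and the velocity sentence is a made-up template rather than one from our output.

```python
# Hypothetical templates keyed by feature name. Placeholders are filled
# from a per-payment context dict.
WHY_TEMPLATES = {
    "f_first_beneficiary": (
        "First payment to this beneficiary in {months_since_last} months."
    ),
    "f_geo_mismatch": (
        "Payer's IP country ({ip_country}) does not match "
        "payer's registered country ({registered_country})."
    ),
    "f_velocity_24h": "{count_24h} payments from this payer in the last 24 hours.",
}


def build_why(features, context, min_contribution=0.1, limit=4):
    """Render the strongest contributing features as plain sentences.

    Features below min_contribution, or without a template, are skipped.
    """
    ranked = sorted(features.items(), key=lambda kv: kv[1], reverse=True)
    reasons = []
    for name, value in ranked:
        if value < min_contribution or name not in WHY_TEMPLATES:
            continue
        reasons.append(WHY_TEMPLATES[name].format(**context))
        if len(reasons) == limit:
            break
    return reasons


why = build_why(
    {"f_velocity_24h": 0.42, "f_geo_mismatch": 0.31, "f_first_beneficiary": 0.18},
    {
        "count_24h": 14,
        "ip_country": "RU",
        "registered_country": "GB",
        "months_since_last": 11,
    },
)
for line in why:
    print(line)
```

Sorting by contribution puts the strongest reason first, which matters when the reader gives the list three seconds.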

Explainability lesson

If the risk team has to learn your feature names to read your output, you've shipped a model. If they can read it without learning anything, you've shipped a product.

What we didn't ship

Three things, on purpose:

An end-to-end deep model. Too slow, too opaque. Boring signals + linear fusion was 4ms faster than the deep alternative and beat it on explainability by an order of magnitude.
Auto-blocking. The agent scores; it never blocks unilaterally. The bank's payment system applies the action based on tier + policy. We never want to be the system that stops a legitimate corporate payment at 3am because nobody on our side is awake.
Cross-tenant model sharing. Each bank gets its own policy, its own thresholds, and its own training data. The signals are shared; the calibration is not.
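
The last two points can be sketched together: per-tenant thresholds decide the tier, and the response carries no action field. The tenant names, the threshold values, the second policy version string, and the "medium" and "low" tier labels are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    policy_version: str
    medium_at: float  # score at or above this is at least "medium"
    high_at: float    # score at or above this is "high"


# Hypothetical tenants and thresholds. Calibration is per bank.
POLICIES = {
    "bank-a": TenantPolicy("fraud-policy-2026-04", medium_at=0.40, high_at=0.70),
    "bank-b": TenantPolicy("bank-b-policy-example", medium_at=0.50, high_at=0.80),
}


def score_response(tenant: str, score: float) -> dict:
    """Build the agent's response: score and tier only, never an action.

    The bank's payment system maps the tier to hold, review, or release.
    """
    policy = POLICIES[tenant]
    if score >= policy.high_at:
        tier = "high"
    elif score >= policy.medium_at:
        tier = "medium"
    else:
        tier = "low"
    return {"score": score, "tier": tier, "policy_version": policy.policy_version}


print(score_response("bank-a", 0.75))
print(score_response("bank-b", 0.75))
```

The same score lands in different tiers for the two tenants, which is the intended effect of keeping calibration separate.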

Where to start if you're building this

Inventory the signals you already have access to. Most banks have more than they realise; the data is scattered.
Build the boring fusion first. You can always add a model later. You will never get back the explainability you lose by starting with one.
Hold a hard latency budget. Measure it from day one. Tail latency is the silent killer.
Write the explainability for the human who reads it, not the data scientist who built it.
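
For the latency point, a minimal sketch of measuring p99 against a budget using only the standard library. The function names are ours for this example, and the nearest-rank percentile is one reasonable definition among several.

```python
import math
import time

BUDGET_P99_MS = 52  # the budget from this post


def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def timed_ms(fn, *args):
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


def within_budget(latencies_ms) -> bool:
    return percentile(latencies_ms, 99) <= BUDGET_P99_MS


# Usage: time a stand-in scoring function and check the tail.
latencies = [timed_ms(sum, range(1000))[1] for _ in range(200)]
print(f"p50={percentile(latencies, 50):.3f}ms p99={percentile(latencies, 99):.3f}ms")
```

Averages hide the outliers that page an ops team, so the check is on p99 rather than the mean. With 100 samples a single slow call does not move p99, but two do.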

The agent is live at /api/payment-fraud/score. The sample endpoint at /api/payment-fraud/sample runs a worked example so you can see the response shape in one curl.
