Reverse-engineering a pig-butchering syndicate from a dead domain

What this is

A technical case study of building a passive-OSINT toolkit to investigate a transnational pig-butchering syndicate impersonating a major Indian broker. The seed domains were already dead by the time the investigation started — so the entire analysis worked from third-party historical data, browser-captured network requests, and seized banking records.

I'm a backend engineer (Python, Django, Celery, AWS). What I learned building this toolkit about how these operations actually work is more interesting than the scam itself, and the engineering decisions in here generalise beyond any one case.

If you've never investigated a fraud case, this will read as a long-form tutorial. If you have, skip Sections 1–3 — the meat is the per-module walkthrough and the engineering decisions at the end.

Section 1 — What is pig butchering, and how big is the problem

If you only know one thing about pig butchering, know this: it is not a scam in the old sense. It is an industrial operation, run out of fortified compounds in Southeast Asia, staffed largely by trafficked workers, built on top of reusable software boilerplates that are bought and sold like SaaS products. Calling it a "scam" undersells the scale.

The term

Sha zhu pan (杀猪盘) translates as "pig butchering plate." First documented in China around 2016, the metaphor is intentionally brutal: scammers spend weeks or months "fattening" a victim with manufactured affection and small successful trades, then "slaughter" them by convincing them to pour their savings into a fraudulent investment platform from which no money will ever come back.

The model migrated from China to Cambodia, Myanmar, and Laos around 2019–2020. According to a UN report covered by The Week, it is now run out of compounds the size of small towns — some over 500 acres, with armed guards, that operate as forced-labour camps. The workers staffing the chat windows that talk to victims are themselves trafficked, often graduates lured to Bangkok with fake job offers and then driven across the border under threats of torture, organ harvesting, and sexual slavery.

A UNODC-cited estimate puts the trafficked workforce at a minimum of 300,000 people from 66 countries, with about 75% in the Mekong region.

The financial scale

The numbers, even taken conservatively, are staggering:

Source	Figure
FBI IC3 2024 Annual Report	$5.8 billion lost to crypto investment fraud / pig butchering in 2024 (US complaints alone), across 41,557 reports
FBI IC3 2024 — crypto-related total	$9.3 billion, 149,686 complaints (+66% YoY)
UNODC, 2020–2024 cumulative	$64–75 billion generated globally by these operations
FBI Operation Level Up	8,103 victims notified; 77% didn't realise they were being scammed; ~$511M in estimated victim savings

India is a major target

The model has followed the diaspora and the smartphone economy into India:

Source	Figure
Ministry of Home Affairs (via Scroll.in)	₹22,845 crore (~$2.7B USD) lost to cyber fraud in India in 2024 — a 206% year-over-year increase
MHA, total cybercrime incidents 2024	22.68 lakh cases (~2.27 million)
I4C Suspect Registry	11 lakh suspect identifiers from banks, 24 lakh Layer-1 mule accounts shared with participating banks; ₹4,631 crore saved
Repatriated Indians from SEA scam compounds	2,907

If you live in India and you have a phone, you are inside the operational radius of this fraud category. The look-alike-broker variant is one of its more sophisticated forms; in my case the impersonated brand was Upstox, a major Indian retail broker.

The transnational layered model

This is the picture you need to hold in your head before the technical part makes sense:

Fig 1 — The transnational layered model: code, operators, victims, and money are in different places by design.

Code is built in China, sold as a boilerplate. Operators are typically not in China — they're in SEA compounds running the chats, or in West Africa running variant operations, or anywhere with a banking system to harvest mules from. Money is laundered through layered bank accounts before crypto cash-out.

What makes investigation hard is that each layer has plausible deniability about the others. The code substrate doesn't know who the operators are. The mule accounts don't know what the upstream is. The CDN provider sees only that someone paid for a Pro plan. We're looking for the fingerprints that nonetheless leak across layers.

Section 2 — The technical problem

Most "incident response" tutorials online start with "first, run nmap." That assumes the target is alive and you can probe it. Forensic OSINT — the kind of investigation whose outputs need to land alongside a criminal complaint — starts from the opposite assumption: do not touch the target.

For our case, the seed domains (upstox-api.com and upstox-capital.com) were already dead by the time I started — the syndicate had pulled them down after harvesting funds. Even if they hadn't been, probing them risked tipping off the operator, who would then rotate keys and destroy mule-account paperwork on the way out.

So the question is:

How do you investigate a scam operation when your inputs are (a) the corpse of a domain, (b) some browser-captured network requests from the victim session, and (c) a CSV of mule-account holds from the cybercell?

Before any code, I wrote down the operating constraints. Each one is the consequence of either an evidentiary requirement (chain of custody, anonymisation) or an operational one (don't tip off the target):

Constraint	Reason
No live probing of seed domains	Don't tip off the operator
Third-party / historical data only	Certificate Transparency logs, Wayback Machine, passive DNS, WHOIS history, victim browser captures
Identifiable User-Agent on every outbound request	Auditable; no impersonation
Append-only chain of custody	Every external call logged with timestamp, URL, SHA-256 of response, status
Anonymise victim identifiers by default	Real bank account numbers → `A1`, `A2`, …; mapping in a private file
Suspect identifiers retained verbatim	Domains, IPs, brand names of the syndicate stay in full — that's the point
API keys via `.env`, never hardcoded	Standard, but worth saying

These ripple into every module's design. When you write them as constraints up front, you stop reaching for tools that violate them.
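As an illustration of the chain-of-custody constraint, here is a minimal sketch of an append-only log writer. The function names and the JSON-lines layout are my assumptions for the example, not the toolkit's actual format; the four recorded fields are the ones listed in the table.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def custody_record(url: str, status: int, body: bytes) -> dict:
    """Build one chain-of-custody entry for an external call."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "url": url,
        "status": status,
        "sha256": hashlib.sha256(body).hexdigest(),
    }


def append_custody(log_path: Path, record: dict) -> None:
    """Append the record as one JSON line; mode 'a' never rewrites earlier entries."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Hashing the raw response bytes, before any parsing, means the logged digest can later be checked against a saved copy of the response.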

Section 3 — The case in 90 seconds

The seed domains were upstox-api.com and upstox-capital.com. Both impersonated the legitimate Indian broker Upstox. Both were dead by investigation time. Inputs I had:

WHOIS history JSON from WhoisXMLAPI (Dynadot registrar, several recorded changes).
Two HAR files — full browser captures from the victim session, with request/response headers, the IP that served the page, JS asset paths, and the proprietary API error envelope.
A handful of crt.sh PDFs saved while the domains were still live.
An I4C JMIS transactions CSV made available for the investigation — the full mule-account flow, several thousand rows.

What I ended up with after the toolkit had run end-to-end:

A 41-entity infrastructure cluster — 5 apex domains, 30 subdomains, 6 IPs — all controlled by the same operator. Identified four sibling brand fronts (attiora-02, corecap-investments, fidelityaccets, keyinvestservices) and two adjacent apex domains (psdconcepts.com, rtbd.net).
A layered money-flow map across ~1,400 bank accounts and 10 laundering layers, with the top aggregator accounts ranked for subpoena priority.
A smurfing signature — ₹972 exact-amount commission transactions making up 8.84% of all transactions, consistent with a payment-gateway flat-percentage cut taken by the syndicate.
A two-layer attribution model — Chinese scam-platform boilerplates (the code substrate) running on a West-African operator's hosting (Nigerian hosting providers in the early window), routed through US datacenters, with Indian mule banking accounts at the bottom.

If you want to stop reading and skim, that's the whole post. Below is how I got there.

Section 4 — The toolkit, module by module

The whole thing is written as cooperating Python modules — each an async + click CLI, each writing structured JSON to a shared case_file.json, each respecting the constraints above. Modules talk to each other through the case file, not direct imports. If a module breaks, the others keep working from cached state.

Quick concept dump for newer engineers (skim past if known):

A HAR file is a JSON dump of everything a browser session did: every request, response, header, body, and server IP. Chrome and Firefox both export it.
A CDN (Cloudflare, Fastly, Akamai) sits between users and a website's real server, hiding the real server's IP from public view. The real server is called the "origin."
Certificate Transparency (CT) logs are public append-only logs that publicly trusted CAs submit to when they issue a certificate. crt.sh is a search frontend over them. If a syndicate registers a new subdomain and gets a TLS cert, the cert typically shows up in a CT log within minutes to hours.
Passive DNS services (urlscan.io, AlienVault OTX, HackerTarget, DNSDumpster, dnshistory.org) keep a historical record of DNS resolutions seen from many vantage points — so you can ask "what IPs did this domain ever resolve to" without asking the domain itself.
mmh3 is a fast non-cryptographic hash function (MurmurHash3). Shodan computes its "favicon hash" as the mmh3 hash of the base64-encoded favicon bytes, so two sites serving the same favicon share a hash, which gives you a pivot.

Module 0 — Fingerprint extraction from a dead body

The first thing I needed was a set of reusable fingerprints that could be pivoted on — things that would let me say "give me everything on the internet that looks like this". Even though the seed domains were down, I had:

The WHOIS history — registrar (Dynadot LLC, IANA ID 472), seven change events. Useful because if any other domain in the syndicate ecosystem moved through Dynadot, that's a registrar pivot.
The HAR files — including a UniApp build hash (WAkTUj0f) embedded in the JS asset paths (/static/js/index-WAkTUj0f.js), the originating server IPs the browser captured, and the proprietary API response envelope {"code":10000,"msg":...}.
Saved crt.sh PDFs — kept as exhibits, not parsed; their SHA-256 hashes go into chain of custody.

Then I added live (but third-party) sources to thicken the fingerprint set:

crt.sh wildcard query   → SANs of every cert ever issued for *.upstox-api.com
Wayback Machine CDX     → archived HTML + favicon → mmh3-hashed
WhoisXMLAPI             → registrar timeline
AlienVault OTX, DNSDumpster, Shodan InternetDB → IP context

Each extractor implements a tiny Extractor interface, runs async, and pushes findings through a FingerprintStore that dedupes and tags each entry with {source, extracted_at, seed_domain, evidence}. The module dumps a fingerprints.json validated against a JSON Schema, so downstream modules can consume it without surprises.
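To make the dedupe-and-tag behaviour concrete, here is a rough sketch of what a store like that could look like. The class body, the `(kind, value)` dedupe key, and the `to_json` shape are assumptions for illustration; only the four tag fields come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FingerprintStore:
    """Dedupes fingerprints on (kind, value) and keeps every provenance tag."""
    _items: dict = field(default_factory=dict)

    def add(self, kind: str, value: str, *, source: str,
            seed_domain: str, evidence: str) -> bool:
        """Record one sighting; return True only if the fingerprint is new."""
        key = (kind, value)
        tag = {
            "source": source,
            "extracted_at": datetime.now(timezone.utc).isoformat(),
            "seed_domain": seed_domain,
            "evidence": evidence,
        }
        is_new = key not in self._items
        self._items.setdefault(key, []).append(tag)
        return is_new

    def to_json(self) -> list:
        """Flatten to a list that could be dumped as fingerprints.json."""
        return [
            {"kind": kind, "value": value, "sightings": tags}
            for (kind, value), tags in sorted(self._items.items())
        ]
```

Keeping every sighting rather than only the first one is a deliberate choice in this sketch: a fingerprint seen by two independent sources is stronger evidence than one seen once.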

Engineering decision worth flagging: crt.sh is a flaky service — returns HTTP 502 about half the time. My first instinct was to retry inline with exponential backoff, which mostly worked but made my module's runtime depend on whether crt.sh happened to be up. I ended up building a small cache + a long-running "warmer" daemon (more on this later). When upstream is flaky, decouple your modules from it — cache-first reads, write-behind warmer.

Module 1 — Origin resolution

Even when a domain is live, its public IP is often a CDN edge, not the real server. Finding the real server (the "origin") matters for attribution.

If the syndicate is sloppy and not behind a CDN, there's a beautiful trick: enumerate every IP the domain has ever resolved to, then probe each one with a spoofed Host: header. If the server still serves the syndicate's site when you ask it for that hostname directly (bypassing DNS), you've found the origin.

The module does this in two phases:

Phase 1 — IP discovery. Pull candidate IPs from six sources in parallel: urlscan.io, HackerTarget reverse-IP, DNSDumpster, AlienVault OTX, crt.sh SAN expansion, Shodan InternetDB. Dedupe by IP with provenance preserved.

Phase 2 — host-header probe. For each non-CDN candidate IP (we tag and skip Cloudflare's 104.16-28/172.64-67 and Fastly's 151.101.x.x ranges — they'll never serve the origin via spoof), issue one GET to https://{ip}/api/sms/config?status=1 with Host: {target_domain} and TLS-hostname-verification disabled. Score the response:

Verdict	Trigger
`CONFIRMED_HIDDEN_ORIGIN`	Returns the proprietary `{"code":10000}` envelope
`CONFIRMED_FRONTEND_ORIGIN`	Returns matching UniApp build hash
`CDN_EDGE`	CDN headers in response
`PARKED`	Parking-page content
`DEAD`	Connection refused / timeout
`RESPONDING_NOT_TARGET`	200 OK but unrelated content
`UNKNOWN`	Anything else
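The verdict table can be read as a small decision function. The sketch below is one possible ordering of the checks, not the module's actual code: the CDN header names, the parking-page markers, and the precedence between rules are all assumptions.

```python
import json

# Hypothetical marker lists; tune them to the CDNs and parking pages you actually see.
CDN_HEADERS = {"cf-ray", "x-served-by", "x-amz-cf-id"}
PARKING_MARKERS = ("domain is for sale", "parked free", "buy this domain")


def classify_probe(status, headers, body, build_hash):
    """Map one probe result onto a verdict from the table.

    status is None when the connection was refused or timed out.
    headers is a dict of response headers, body is the decoded response text.
    """
    if status is None:
        return "DEAD"
    if {name.lower() for name in headers} & CDN_HEADERS:
        return "CDN_EDGE"
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None
    if isinstance(payload, dict) and payload.get("code") == 10000:
        return "CONFIRMED_HIDDEN_ORIGIN"
    if build_hash in body:
        return "CONFIRMED_FRONTEND_ORIGIN"
    lowered = body.lower()
    if any(marker in lowered for marker in PARKING_MARKERS):
        return "PARKED"
    if status == 200:
        return "RESPONDING_NOT_TARGET"
    return "UNKNOWN"
```

For example, `classify_probe(200, {}, '{"code":10000,"msg":"ok"}', "WAkTUj0f")` returns `CONFIRMED_HIDDEN_ORIGIN`, while a timeout (`status=None`) returns `DEAD`.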

For this case the seed domains were behind Cloudflare nameservers and the discovery surfaced no hidden origin via this technique. But — and this is the moment the case opened up — Phase 1's reverse-IP step on the IPs that did historically host the sister domain (upstox-capital.com) surfaced four sibling brand subdomains all co-hosted on a single ColoCrossing IP. Same operator. Four "investment platform" front names I hadn't known existed.

The single most important line of code in this module is the rate limiter — one request per IP per second, with a per-host lock. Hammering a low-end colo host even with passive-looking traffic can get your access pattern flagged in the host's logs. In passive OSINT, politeness is the difference between getting useful data and getting your IP nullrouted.
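A per-host limiter of that kind can be quite small. This is a sketch under my own assumptions (the class name, the `wait` method, and the use of `asyncio.Lock` are illustrative), showing one request per host per interval while leaving different hosts independent.

```python
import asyncio
import time
from collections import defaultdict


class PerHostLimiter:
    """Allow at most one request per host every `interval` seconds."""

    def __init__(self, interval: float = 1.0):
        self.interval = interval
        self._locks = defaultdict(asyncio.Lock)  # one lock per host
        self._last = {}                          # host -> monotonic time of last request

    async def wait(self, host: str) -> None:
        """Block until it is polite to send the next request to `host`."""
        async with self._locks[host]:
            last = self._last.get(host)
            if last is not None:
                remaining = self.interval - (time.monotonic() - last)
                if remaining > 0:
                    await asyncio.sleep(remaining)
            self._last[host] = time.monotonic()
```

A caller would `await limiter.wait(ip)` immediately before each probe. Because the lock is held across the sleep, concurrent tasks targeting the same IP queue up behind each other instead of all waking at once.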

Modules 2 & 3 — Funnel mapper and smurf detector (the money side)

These two operate on the I4C JMIS transactions CSV available for the case. The CSV is sensitive — real bank account numbers, real names. The first thing both modules do on load is anonymise: every real account number is mapped to A1..An, the mapping written to a private file (account_map.json), and the rest of the pipeline never sees the real numbers. Shareable outputs are anonymous by construction, not by promise.
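Here is a minimal sketch of anonymise-on-load. The column names `from_account` and `to_account` are hypothetical (the real JMIS export will have its own headers), and labels are assigned in first-seen order, which is my assumption.

```python
import csv
import json


def anonymise_rows(rows, account_fields=("from_account", "to_account")):
    """Replace real account numbers with A1..An labels in first-seen order."""
    mapping = {}
    clean = []
    for row in rows:
        row = dict(row)  # copy, so the caller's rows are left untouched
        for name in account_fields:
            real = row[name]
            if real not in mapping:
                mapping[real] = f"A{len(mapping) + 1}"
            row[name] = mapping[real]
        clean.append(row)
    return clean, mapping


def load_anonymised(csv_path, map_path="account_map.json"):
    """Read the CSV, write the private mapping, return only anonymised rows."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    clean, mapping = anonymise_rows(rows)
    with open(map_path, "w", encoding="utf-8") as fh:
        json.dump(mapping, fh, indent=2)
    return clean
```

Only `load_anonymised` ever touches the real numbers; everything downstream receives the `clean` rows.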

What money laundering actually looks like

The three-stage model, taught in every AML training:

Placement — dirty money enters the banking system (the L1 mule account that receives the victim's transfer).
Layering — funds are split, recombined, and bounced between accounts to obscure the trail.
Integration — the now-"clean" money is reintroduced to the legitimate economy (real estate, cash withdrawal, crypto cash-out).

The pig-butchering syndicate's layering is unusually deep. The case CSV had transactions at 10 distinct laundering layers.

The funnel mapper turns the CSV into a networkx directed graph: VICTIM → L1_super → L2_super → ... → L10_aggregator. Per-account, per-layer, and per-bank aggregations come out as CSVs; a Gephi-friendly GraphML export lets the cybercell visually inspect the laundering tree.

For this case, the picture looked like:

Laundering funnel: VICTIM payment flows into L1 (1 account, ₹100), then explodes outward to L3 (499 accounts, ₹7.45L) and L4 (384 accounts, ₹7.73L), the two highlighted aggregator layers, then progressively narrows through L5-L11.

The bulk of value (~66% by amount, ~880 accounts) sits at layers 3–4 — the corporate-shell aggregator hops. These are the cells where many smaller mule flows converge before fanning back out, and they are the subpoena priority for the cybercell because subpoenaing them surfaces the real beneficial owners.

The smurf detector runs five separate analyses on the same data:

Commission signature — exact-value transactions (and ±₹3 tolerance). The case yielded 126 transactions of exactly ₹972 and 138 within tolerance, accounting for 8.84% of all transactions. That's the syndicate's flat-percentage payment-gateway commission cut, visible as a fingerprint. Banks should treat ₹972 outflows from suspect mule accounts as a smurfing flag — they are not random.
Round-amount clustering — count multiples of 100/500/1k/5k/10k. 37% of nonzero amounts were round-100-or-more — a strong human-typing signal.
Sub-threshold structuring — count transactions just below RBI/FIU regulatory windows (₹49k just below the ₹50k STR/CTR window, ₹199k below ₹2L, ₹9.99L below ₹10L AML). Classic structuring/smurfing pattern.
Amount-family clustering — bin into ₹25 ranges, find clusters with ≥10 transactions. Top family was ₹500–525 with 179 transactions.
Velocity analysis — would have measured freeze-time deltas L1→L3/L4, except this is where I made my favorite engineering decision in the whole project (see Section 6).
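The first four analyses are simple counting passes. The sketch below assumes amounts are whole rupees held as integers, that the tolerance count includes the exact hits, and that "just below" a threshold means within a caller-chosen window; the function names are illustrative.

```python
from collections import Counter


def commission_signature(amounts, target=972, tolerance=3):
    """Count exact hits on `target` and hits within +/- `tolerance` (exact hits included)."""
    exact = sum(1 for a in amounts if a == target)
    near = sum(1 for a in amounts if abs(a - target) <= tolerance)
    return exact, near


def round_amount_share(amounts, base=100):
    """Share of nonzero amounts that are whole multiples of `base`."""
    nonzero = [a for a in amounts if a != 0]
    if not nonzero:
        return 0.0
    return sum(1 for a in nonzero if a % base == 0) / len(nonzero)


def sub_threshold(amounts, threshold, window):
    """Count amounts in [threshold - window, threshold), i.e. just under the line."""
    return sum(1 for a in amounts if threshold - window <= a < threshold)


def amount_families(amounts, width=25, min_count=10):
    """Bin amounts into `width`-rupee ranges and keep bins with at least `min_count` hits."""
    bins = Counter((int(a) // width) * width for a in amounts if a > 0)
    return {(low, low + width): n for low, n in bins.items() if n >= min_count}
```

With `width=25`, an amount of ₹510 falls in the `(500, 525)` bin, which matches the way the top family is reported above.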

UTR (Unique Transaction Reference) numbers are hashed with sha256[:12] whenever they're surfaced in outputs — exporting raw UTRs in a shareable artifact creates correlation risk between the case and any other system that has them.
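In code, the `sha256[:12]` rule amounts to a one-liner. The helper name and the whitespace stripping are assumptions of this sketch.

```python
import hashlib


def hash_utr(utr: str) -> str:
    """Return the first 12 hex characters of SHA-256 over the UTR string."""
    return hashlib.sha256(utr.strip().encode("utf-8")).hexdigest()[:12]
```

The same UTR always maps to the same 12-character label, so rows can still be joined within the case outputs without the raw reference appearing anywhere.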

Module 4 — Rotation watcher

The syndicate will stand up new domains. We want to know when.

The rotation-watcher polls Certificate Transparency (crt.sh) and passive DNS sources for keyword + wildcard queries against a brand-name watchlist (upstox-*, attiora, noc41, psdconcepts, …), diffs against persisted state, and emits a sightings log of new certificates or DNS records that match.

First run is a bootstrap — record what already exists as a baseline, don't flag anything. Subsequent runs only emit genuinely new entries. A common bug in change-detection systems is the false-positive flood on first run; explicit bootstrap state avoids it.
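The bootstrap rule can be sketched as follows. The state-file layout (a JSON list of seen identifiers) and the function name are assumptions; the point is only that a missing state file means "record, don't alert".

```python
import json
from pathlib import Path


def diff_sightings(state_path: Path, current: set) -> list:
    """Return entries not seen before; the first run only records a baseline."""
    if not state_path.exists():
        state_path.write_text(json.dumps(sorted(current)))
        return []  # bootstrap: nothing is flagged
    known = set(json.loads(state_path.read_text()))
    new = sorted(current - known)
    # Persist the union so an entry that disappears and returns is not re-flagged.
    state_path.write_text(json.dumps(sorted(known | current)))
    return new
```

Whether a reappearing entry should be re-flagged is a policy choice; this sketch treats anything ever seen as known.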

The watchlist isn't just the original seeds — it includes the sibling brand names discovered by Module 1, plus the cPanel reseller infrastructure tags (noc41, reseller12) that the syndicate uses for hosting. If they spin up a new front under either reseller, we should see it before any new victim does.

Module 5 — The correlator (where the project pays off)

This is where structured data turns into a picture. The correlator walks case_file.json and constructs a bipartite graph:

Entity nodes: domains, subdomains, IPs.
Attribute nodes: favicon hashes, build hashes, cert patterns, registrars, ASNs, co-host IPs, shared nameservers.
Edges: entity ↔ attribute, with the originating evidence string.

Then it projects the bipartite graph onto an entity-entity graph — for each attribute node, fully connect its entity neighbours — and runs connected-component analysis. Each connected component is a syndicate cluster.

Concept for SDE-1s: a bipartite graph has two distinct kinds of nodes and edges only go between them. Think "users ↔ products purchased". To find which users bought similar products, you project onto user-user edges: connect two users if they share at least one product. Same idea here — connect two domains if they share at least one attribute (an IP, a build hash, a cert pattern).
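For readers who want to see the projection and component step without a graph library, here is a standard-library sketch using union-find. This is a swapped-in technique for illustration, not necessarily how the correlator is implemented; it produces the same components as fully connecting each attribute's entity neighbours, without materialising those edges.

```python
from collections import defaultdict


def cluster_entities(edges):
    """Group entities that share at least one attribute.

    `edges` is an iterable of (entity, attribute) pairs from the bipartite graph.
    Linking every entity on an attribute to that attribute's first entity yields
    the same components as fully connecting them in the projection.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}
    for entity, attribute in edges:
        find(entity)  # register the entity even if it never merges
        if attribute in first_seen:
            union(entity, first_seen[attribute])
        else:
            first_seen[attribute] = entity

    groups = defaultdict(set)
    for entity in parent:
        groups[find(entity)].add(entity)
    return sorted(groups.values(), key=len, reverse=True)
```

Two domains that share only an IP, and a third that shares only a build hash with the second, all land in one cluster; a domain with no shared attributes stays in its own component.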

What this returned on the case:

Fig 3 — The 41-entity cluster after bipartite projection. Apex domains (red), shared IPs (green), sibling brand fronts (yellow).

The largest cluster: 41 entities — 5 apex domains, 30 subdomains, 6 server IPs — glued by 230 shared co_host_ip edges, plus shared-nameserver and ASN edges. The single largest co-host pivot was 148.163.124.2 with 14 sibling virtual hosts on one server, all part of the noc41 cPanel reseller cluster.

Crucially, the smaller satellite components (C-002, C-003) did not merge into the main cluster. They're almost certainly false positives — Cloudflare CDN nodes, a Bitwarden favicon source. The graph is producing signal, not blob. If everything ended up in one cluster, the clustering wouldn't be telling us anything.

Both graphs (bipartite + projection) export as GraphML so they can be opened in Gephi for visual review. Every edge carries an evidence string traceable back to a chain-of-custody entry.

Extra module — DNS history extractor

About two-thirds of the way through the project I noticed dnshistory.org had deep historical NS records that no API I was using provided. The catch: the site is behind Cloudflare's "Just a moment…" JS challenge — which blocks any non-browser fetch.

My options:

1. Automate a real browser via Playwright, but this violates the no-browser-impersonation rule.
2. Skip the source.
3. Use the human in the loop: ask my browser, which I am sitting at, to extract structured data for me.

I went with (3). I wrote a prompt for Claude's browser extension that walked the pages I needed and emitted a JSON document. A separate Python module then parses that JSON offline, merges new IPs into the correlator, and identifies cross-domain pivots.
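The offline pivot step might look roughly like this. The JSON shape (a domain mapped to a list of records with an `ip` key) is an assumption about what the extension emits, and `find_ip_pivots` is a hypothetical name.

```python
from collections import defaultdict


def find_ip_pivots(history):
    """Return {ip: sorted domains} for every IP that served two or more domains.

    `history` maps a domain to a list of records, each a dict with at least an
    "ip" key; that shape is an assumption about the extracted JSON.
    """
    by_ip = defaultdict(set)
    for domain, records in history.items():
        for record in records:
            by_ip[record["ip"]].add(domain)
    return {ip: sorted(domains) for ip, domains in by_ip.items() if len(domains) > 1}
```

Any IP returned here is a candidate `co_host_ip` attribute for the correlator, subject to checking that the hosting windows are close enough in time to be meaningful.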

This is the design decision I'm proudest of, because the human is the chain-of-custody anchor. I pulled the records personally; the extension just structured them. We solved the technical problem without crossing the operational rule.

This is also where the case really came together. The dnshistory.org chronology for upstox-capital.com told a clean story:

Hosting timeline for upstox-capital.com from January 2023 to January 2024: starts on Nigerian hosting providers (whogohost / hostnownow), moves to a US cPanel reseller (noc41) for most of the year across three IPs, and finally lands on NameSilo (dnsowl) in late December 2023.

For the first two months of its life, upstox-capital.com was hosted on Nigerian hosting providers — whogohost.com (Lagos, Nigeria) and hostnownow.com. Then it moved to a US cPanel reseller (noc41.com). Then to NameSilo.

And one of those IPs — 91.195.240.12 — also hosted psdconcepts.com for a window in 2024. That overlap is what links psdconcepts.com into the syndicate cluster.

Section 5 — Putting attribution together (the layered model)

Pig-butchering operations don't have a single nationality. Saying "this scam is Nigerian" or "this scam is Chinese" loses information. What you can actually attribute is each layer of the operation:

Layer	What we found	Evidence
Code substrate	Chinese scam-platform boilerplates: FUNNELL, xiaonuo, Snowy / SnowyAdmin. Frontend built with UniApp (Chinese cross-platform framework).	`{"code":10000,"msg":...}` API envelope captured in HAR. `WAkTUj0f` UniApp build hash in asset paths.
Operator layer	Nigerian. The domain registered in early 2023 used `whogohost.com` and `hostnownow.com` — both Nigeria-based hosts.	dnshistory.org NS chronology.
Hosting layer	US cPanel reseller infrastructure (`noc41.com`), ColoCrossing colo (`198.12.125.130`), Cloudflare DNS for the more recently-set-up domain (`upstox-api.com`).	Reverse-IP enumeration; co-host IP analysis.
Money-mule layer	Indian retail banking. SBI, HDFC, PNB, Bank of Baroda, India Post Payments dominate top-bank exposure. ~1,400 accounts across 10 laundering layers.	I4C JMIS transactions CSV.

This layered picture is more useful than "it's a Chinese scam" or "it's a Nigerian scam" because it tells the cybercell which lever to pull at which layer:

The code substrate is detectable cross-case via the {"code":10000} and WAkTUj0f fingerprints. Recommend FIU-IND and CERT-In add these to their response-body scanners — any honeypot or live-fraud-detection system that scans server response bodies should flag these signatures.
The operator layer needs MLAT cooperation with Nigerian authorities for the early-window hosting records (whogohost.com, hostnownow.com).
The hosting layer needs US-based subpoenas: ColoCrossing for IP-level tenant records, Cloudflare for the proxied origin of upstox-api.com, Dynadot for the domain registrant.
The mule layer is Indian banking — the cybercell is already on it, with anonymised aggregator account labels mapped to real account numbers in a private file.

The point: pig-butchering attribution is not a single-country story. It is a layered international supply chain. Investigators who insist on naming one country are erasing the others, and that erasure is what lets the operators rotate.

Section 6 — Engineering decisions worth flagging

These are the kinds of decisions that I think generalise beyond this case. Most of them came out of mistakes, not foresight.

1. Cache-first, not retry-inline, for flaky upstreams

crt.sh was failing about half my queries on any given day. My first instinct was to retry inline with exponential backoff — and that mostly worked, but it meant my fingerprint module's runtime was at the mercy of whether crt.sh was up.

The eventual solution: a small CrtShCache keyed by query hash, plus a long-running crtsh_warmer.py script that lives in a tmux pane, walks the query list every 6 hours, and retries with patient backoff (30s → 2m → 5m → 15m → 60m) until each query lands. Modules read cache-first and only fall back to live fetch on a stale cache.

This decoupled module runtimes from upstream availability. When an upstream is flaky, push the retry burden to a background process and read from cache. The modules become deterministic; the warmer absorbs the unpredictability.
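A compact sketch of the pattern, under stated assumptions: the cache is one JSON file per query, the staleness window of 6 hours is my guess (the post only says the warmer walks the list every 6 hours), and `warm` gives up after one pass through the schedule and leaves the query for the next walk. `fetch` is any callable that raises `OSError` on failure.

```python
import hashlib
import json
import time
from pathlib import Path

BACKOFF_SECONDS = [30, 120, 300, 900, 3600]  # 30s, 2m, 5m, 15m, 60m


class CrtShCache:
    """File-per-query cache keyed by a hash of the query string."""

    def __init__(self, root: Path, max_age: float = 6 * 3600):
        self.root = root
        self.max_age = max_age
        root.mkdir(parents=True, exist_ok=True)

    def _path(self, query: str) -> Path:
        return self.root / (hashlib.sha256(query.encode()).hexdigest()[:16] + ".json")

    def get(self, query: str):
        """Return the cached payload, or None when missing or stale."""
        path = self._path(query)
        if not path.exists():
            return None
        entry = json.loads(path.read_text())
        if time.time() - entry["fetched_at"] > self.max_age:
            return None
        return entry["payload"]

    def put(self, query: str, payload) -> None:
        entry = {"fetched_at": time.time(), "payload": payload}
        self._path(query).write_text(json.dumps(entry))


def warm(cache, query, fetch, sleep=time.sleep):
    """Try `fetch(query)` once, then retry on the patient schedule; cache on success."""
    for delay in [0] + BACKOFF_SECONDS:
        if delay:
            sleep(delay)
        try:
            cache.put(query, fetch(query))
            return True
        except OSError:
            continue
    return False
```

Modules call `cache.get(query)` and only consider a live fetch when it returns `None`; the warmer process is the only thing that ever sleeps on crt.sh.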

2. Anonymisation as a default, not a filter

It is much harder to anonymise an output after generating it than to anonymise the input on load. The financial modules rewrite real account numbers to A1..An labels at CSV-load time, write the mapping to a private file, and the rest of the pipeline never sees the real numbers. Outputs are anonymous by construction. When you anonymise on output, every single new export needs to remember the rule. When you anonymise on input, the rule is enforced once, in code review, and everything downstream is safe.

3. Hard guards against accidental live hits

The HTTP client has a hostname guard that raises an exception if a request would target a seed-domain hostname. Costs nothing to add and saves you from a single bad refactor undoing the entire operational discipline.
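A guard like this can be a few lines in front of every outbound request. The exception name and function name below are hypothetical, and matching subdomains as well as the apex is my assumption about what "seed-domain hostname" should cover.

```python
from urllib.parse import urlsplit

SEED_DOMAINS = {"upstox-api.com", "upstox-capital.com"}


class LiveHitBlocked(RuntimeError):
    """Raised when a request would touch a seed domain directly."""


def assert_not_seed(url: str) -> None:
    """Raise if the URL's hostname is a seed domain or any subdomain of one."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    for seed in SEED_DOMAINS:
        if host == seed or host.endswith("." + seed):
            raise LiveHitBlocked(f"refusing live request to {host}")
```

Note that the check is on the hostname only: a third-party URL that merely mentions a seed domain in its query string, such as a crt.sh search, passes through.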

4. The dnshistory.org workaround

Already described in Section 4. The principle: when an automated solution would violate your operational rules, look for the human-in-the-loop solution. A browser extension structuring HTML I pulled by hand is more honest than a headless Chrome impersonating a user.

5. Honest negative findings

The freeze timestamps in the CSV ran in investigation order, not transaction order. Deltas between L1 and L3 were negative. I had two choices: invent a transformation, or admit it. I admitted it, in writing, in the module's output:

"Velocity NOT measurable from this CSV — freeze timestamps are inverted relative to transaction order. Recommend requesting per-UTR transaction timestamps from the originating banks."

A finding that says "I cannot measure this with the data I was given" is a better finding than a fudged number that looks like a measurement.
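One way to encode that discipline is to make the analysis return an explicit "not measurable" result instead of a number. This sketch assumes ISO-8601 timestamp strings and a hypothetical `measure_velocity` helper; any negative delta disqualifies the whole measurement.

```python
import statistics
from datetime import datetime


def measure_velocity(pairs):
    """pairs: list of (l1_freeze, downstream_freeze) ISO-8601 timestamp strings."""
    deltas = [
        (datetime.fromisoformat(later) - datetime.fromisoformat(first)).total_seconds()
        for first, later in pairs
    ]
    if not deltas:
        return {"measurable": False, "reason": "no timestamp pairs supplied"}
    negative = sum(1 for d in deltas if d < 0)
    if negative:
        return {
            "measurable": False,
            "reason": f"{negative} of {len(deltas)} deltas are negative; "
                      "timestamps do not follow transaction order",
        }
    return {"measurable": True, "median_seconds": statistics.median(deltas)}
```

The caller then has to handle the `measurable: False` branch explicitly, which keeps a broken input from silently turning into a statistic in the report.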

Same principle when I re-probed two newly-discovered candidate IPs (104.194.9.178, 91.195.240.12) from the dnshistory analysis — both timed out. That's a real finding: the syndicate is mobile, and any infrastructure-based attribution has a half-life. Logged it.

6. Per-module structured output, single `case_file.json` aggregation

Each module writes its own module_<ts>.json and upserts a slot under case_file.json::modules.<module_name>. The aggregator is forward-compatible: new modules just need to add their slot. The downstream correlator and report-assembler both read from case_file.json and don't care which modules ran in which order.
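The upsert itself can be sketched like this. The slot layout (`updated_at` plus `result`) and the write-then-rename step are my assumptions, not the toolkit's schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def upsert_module_slot(case_path: Path, module_name: str, result: dict) -> dict:
    """Write `result` under modules.<module_name>, leaving other slots untouched."""
    case = json.loads(case_path.read_text()) if case_path.exists() else {}
    case.setdefault("modules", {})[module_name] = {
        "updated_at": datetime.now(timezone.utc).isoformat(),
        "result": result,
    }
    # Write to a sibling temp file, then rename over the original so a reader
    # is unlikely to ever see a half-written case file.
    tmp = case_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(case, indent=2, sort_keys=True))
    tmp.replace(case_path)
    return case
```

This sketch does not lock the file, so two modules writing at the same instant could still lose an update; running modules sequentially or adding a file lock would close that gap.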

7. Reports built for two audiences

The same data, two audiences:

Police report: Part A is plain-language narrative for the investigating officer ("what happened, who's behind it, where the money went, recommended subpoenas"); Part B is the full technical appendix for the cyber-forensics analyst.
Public write-up (this post): redacts personal financial details, focuses on methodology and findings.

The voice matters. A police report that reads like a technical paper will not be read by the IO. A blog post that reads like a police report will not be read by anyone.

Section 7 — Limitations and what I'd build next

The toolkit has real limits. Worth naming them.

For the more operationally-hardened of the two seed domains (upstox-api.com), passive OSINT could not surface the origin. It went Cloudflare-only from day one (Aug 2025), and dnshistory.org's A-page was empty. The origin will only come out via Cloudflare subpoena. The infrastructure attribution we have is all from upstox-capital.com's leakier early history — that's what makes it the prosecution-grade evidence even though it wasn't the domain at the centre of the active complaint.
The financial CSV's velocity analysis is dead until we get per-UTR transaction timestamps from the originating banks.
crt.sh is a single point of failure for cert-side discovery. Adding Censys as a parallel source with a paid PAT would close this gap; the free-tier PAT I tried didn't authenticate against the v2 API.

If I had another two weekends:

A SecurityTrails or mnemonic Passive-DNS extractor to close the dnshistory.org gap without the browser-extension intermediate step. The local-parse pattern still applies — just from a real API.
A JARM / TLS-fingerprint pivot module so we can pivot on TLS-stack fingerprints alongside favicon/build hashes. This would catch operators that change brand frequently but not their TLS stack.
A cross-case correlator that ingests multiple case_file.json blobs from different investigators and looks for shared entities or attributes. Pig-butchering syndicates are not one syndicate; they are a constellation. Federation across investigators matters more than any single investigator going deeper.
A real DOCX exporter with embedded GraphML thumbnails. Indian police prefer Word documents over Markdown.

Section 8 — Why this matters, and what comes next

Pig butchering is one of the most industrialised fraud operations running in the world right now. The compounds in Southeast Asia operate like factories. The code substrate (FUNNELL / xiaonuo / Snowy / UniApp) is bought and sold like SaaS. The mule networks span continents. Investigators who go deeper on a single case keep finding new attack surface — but what would actually move the needle is investigators who can compare notes across cases.

In practice, that's a question of data format and clustering schemes. If case_file.json blobs from different investigators were interoperable, a federated correlator could surface cross-syndicate clusters automatically: operators rotating brand names but reusing the same code substrate, the same hosting reseller, the same favicon. Every individual case ends with "subpoena this hoster", but the cross-case picture answers the harder question: how many distinct syndicates are there, really?

The interesting engineering discipline here is not the scam — the scam is depressingly mundane. It is reproducible, evidentiary investigation when you cannot just nmap your way out of a problem. Passive OSINT, chain-of-custody-first, anonymisation-by-default, cache-first for flaky upstreams, honest negative findings, the human-in-the-loop for sources you cannot ethically automate. None of those are hard individually. Holding all of them at once, across modules, with an unambiguous output that can survive cross-examination — that is a real engineering problem, and it is solvable.

References (in order they appear above)

The full toolkit (Python modules, JSON schemas, chain-of-custody logs) is private to the active case but available on request for verification. The methodology described here is generalisable, not case-specific.
— Sahil Rajpal
