How Anti-Bot Systems Actually Work
A technical breakdown of how Cloudflare, DataDome, PerimeterX, and Akamai detect scrapers — IP reputation, TLS fingerprinting, behavioral analysis, and what actually works against each layer.
Most major websites run anti-bot software. Cloudflare, DataDome, Akamai Bot Manager, and PerimeterX (now HUMAN Security) have made scraping considerably harder over the last five years. But "anti-bot" is not one thing. It's a stack of independent detection layers, each targeting a different signal. Understanding the layers individually tells you what you're up against, and where residential proxies help versus where they don't.
Layer 1: IP reputation
The first check happens before your request body is parsed. Every inbound IP is scored against databases maintained by the anti-bot vendor, enriched with third-party feeds such as IPQualityScore and MaxMind and with the vendor's own historical data. The check asks one question: does this IP look like a person?
The most decisive signal is the ASN, the Autonomous System Number that identifies which network the IP belongs to. AWS has documented IP ranges announced from well-known ASNs. So do GCP, Azure, DigitalOcean, Hetzner, OVH, and every other major hosting provider. These ranges are published, exhaustively catalogued, and immediately suspicious. When a request from an AWS address (say, a 54.x.x.x address in us-east-1) hits Cloudflare, Cloudflare knows within milliseconds that it's a datacenter IP.
A residential IP from Comcast in Dallas, or BT in London, or Chunghwa Telecom in Taipei, shows up in an ISP-assigned residential range. The ASN resolves to an internet service provider, not a hosting company. The Whois record looks like every other home user on that network. IP reputation checks pass.
This is the specific problem residential proxies solve. They don't solve everything — but they decisively solve this layer. Datacenter IPs cannot credibly impersonate residential connections because the ASN data is public and accurate.
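The ASN lookup described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the table, the network names, and the risk values are all hypothetical, and the prefixes and AS numbers are taken from ranges reserved for documentation so they describe no real network.

```python
import ipaddress

# Hypothetical ASN table. The prefixes are RFC 5737 documentation ranges and
# the AS numbers come from the block reserved for documentation (RFC 5398).
ASN_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), 64496, "ExampleCloud", "hosting"),
    (ipaddress.ip_network("198.51.100.0/24"), 64497, "ExampleISP", "isp"),
]


def classify_ip(ip: str) -> str:
    """Return 'hosting', 'isp', or 'unknown' for an address."""
    addr = ipaddress.ip_address(ip)
    for network, _asn, _name, kind in ASN_TABLE:
        if addr in network:
            return kind
    return "unknown"


def ip_risk(ip: str) -> float:
    """Map the network type to an illustrative risk score between 0 and 1."""
    return {"hosting": 0.9, "isp": 0.1, "unknown": 0.5}[classify_ip(ip)]


print(classify_ip("192.0.2.10"), ip_risk("192.0.2.10"))
print(classify_ip("198.51.100.7"), ip_risk("198.51.100.7"))
```

A production system would use a full routing or GeoIP database rather than a two-row list, but the decision has the same shape: the network type is known before anything else about the request is examined.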
Layer 2: TLS fingerprinting
When an HTTP client connects to a server over TLS, the ClientHello message carries a characteristic signature based on the client's supported cipher suites, TLS extensions, and elliptic curves, in the order they're listed. This gets hashed into what's called a JA3 fingerprint (or the newer JA4+, which is harder to spoof and sorts the extension list before hashing).
Different clients have distinct fingerprints. Python requests has one. curl has another. Node.js undici has another. Chrome 124 on macOS has another, although recent Chrome releases randomize extension order, so a single Chrome build can produce many JA3 hashes and is identified more reliably by JA4. Anti-bot vendors maintain their own catalogues of these fingerprints.
The attack this detects: sending a User-Agent header of Chrome/124.0.0.0 from a Python requests session. The User-Agent says Chrome; the JA3 says Python. Cloudflare sees the discrepancy instantly. Setting a realistic User-Agent header has been largely ineffective against modern anti-bot systems for years.
There are two fixes: use a real browser via Playwright/Puppeteer (which sends Chrome's actual TLS fingerprint), or use a library like curl-impersonate, which mimics Chrome's TLS stack specifically.
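To make the fingerprint concrete, here is a minimal sketch of the JA3 computation as publicly documented: five ClientHello fields are written as decimal values, joined with dashes and commas, and hashed with MD5. The two ClientHello tuples are made up for illustration, and parsing a real handshake off the wire is out of scope.

```python
import hashlib

# GREASE values (RFC 8701) are random placeholders that some clients insert;
# JA3 skips them so they do not change the fingerprint.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma-separated JA3 string; list order is preserved."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """JA3 is the MD5 hex digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Two made-up ClientHellos offering the same cipher suites in a different order.
hello_a = (771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
hello_b = (771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])

print(ja3_string(*hello_a))
print(ja3_hash(*hello_a) == ja3_hash(*hello_b))  # False: order matters
```

Because order is part of the input, two clients that support identical cipher suites still hash differently if they list them differently. A detector only has to compare the observed hash with the hashes it has catalogued for the browser named in the User-Agent.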
Layer 3: HTTP/2 fingerprinting
Less discussed but increasingly enforced. HTTP/2 clients have distinct characteristics based on their SETTINGS frames, window sizes, HEADERS frame ordering, and PRIORITY frame values. Chrome's HTTP/2 fingerprint differs from curl's, which differs from Python httpx's.
Akamai Bot Manager is particularly aggressive about this. Akamai researchers have published work on passive HTTP/2 client fingerprinting, and the resulting signature is often called the "Akamai fingerprint" or AKA-fingerprint. A residential IP sending Chrome-like HTTP headers but curl-like HTTP/2 frames is still detectable.
Again, the practical fix is using a real browser — Chromium via Playwright will send real Chrome HTTP/2 frames. Headless Chrome is slightly different from headed Chrome and anti-bot vendors have started tracking that distinction too.
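As a rough sketch, the HTTP/2 signals above can be flattened into a single comparable string. The layout used here (SETTINGS, WINDOW_UPDATE, PRIORITY, pseudo-header order, separated by pipes) follows the format commonly attributed to Akamai's research, but treat that as an assumption. The `H2Observation` class is hypothetical, and the sample values are illustrative rather than measurements of any real client.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class H2Observation:
    """What a server sees at the start of one HTTP/2 connection (hypothetical)."""
    settings: List[Tuple[int, int]]  # (identifier, value) in the order sent
    window_update: int = 0           # connection-level increment, 0 if absent
    priorities: List[Tuple[int, int, int, int]] = field(default_factory=list)
    pseudo_headers: List[str] = field(default_factory=list)


def h2_fingerprint(obs: H2Observation) -> str:
    settings = ";".join(f"{ident}:{value}" for ident, value in obs.settings)
    window = str(obs.window_update) if obs.window_update else "00"
    # Each PRIORITY frame is stream:exclusive:depends_on:weight; "0" if none sent.
    priority = ",".join(":".join(str(x) for x in p) for p in obs.priorities) or "0"
    # ":method" becomes "m", ":authority" becomes "a", and so on.
    pseudo = ",".join(name[1] for name in obs.pseudo_headers)
    return "|".join([settings, window, priority, pseudo])


browser_like = H2Observation(
    settings=[(1, 65536), (2, 0), (4, 6291456), (6, 262144)],
    window_update=15663105,
    pseudo_headers=[":method", ":authority", ":scheme", ":path"],
)
script_like = H2Observation(
    settings=[(3, 100), (4, 65535)],
    pseudo_headers=[":method", ":scheme", ":authority", ":path"],
)

print(h2_fingerprint(browser_like))
print(h2_fingerprint(script_like))
```

None of these values appear in the request headers, so changing headers alone leaves the string unchanged.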
Layer 4: Browser fingerprinting
For sites that serve JavaScript challenges — Cloudflare's managed challenge page, DataDome's interstitials — the browser itself gets fingerprinted. This runs client-side and collects:
- Canvas fingerprint — the GPU renders a hidden canvas element; output varies by GPU driver and OS font rendering stack
- WebGL renderer — WEBGL_debug_renderer_info exposes GPU model and driver version
- AudioContext fingerprint — tiny numerical differences in audio processing between hardware configurations
- Font enumeration — measuring which fonts are installed via measureText()
- Navigator properties — navigator.plugins, navigator.hardwareConcurrency, screen resolution, color depth
- Timezone and locale consistency
Headless Chrome used to be trivially detectable via navigator.webdriver being true, and masking that flag has been routine in Playwright setups for years. But there are subtler signals: headless Chrome reports no realistic screen resolution unless one is set explicitly, exposes an empty plugins array, and renders some fonts differently from headed Chrome. Tools like rebrowser-patches and Camoufox address these discrepancies.
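Once the client-side script has reported its measurements, the server-side check is mostly a set of consistency rules. The sketch below assumes the collected values arrive as a plain dictionary. The key names and the rules themselves are hypothetical and far simpler than what a commercial product would run.

```python
def fingerprint_flags(fp: dict) -> list:
    """Return human-readable reasons a reported fingerprint looks automated."""
    flags = []
    if fp.get("webdriver"):
        flags.append("navigator.webdriver is true")
    if not fp.get("plugins"):
        flags.append("empty plugins array")
    width, height = fp.get("screen", (0, 0))
    if width == 0 or height == 0:
        flags.append("no screen resolution")
    if fp.get("hardwareConcurrency", 0) < 1:
        flags.append("implausible hardwareConcurrency")
    # Software rasterizers are typical of headless containers without a GPU.
    renderer = fp.get("webgl_renderer", "").lower()
    if "swiftshader" in renderer or "llvmpipe" in renderer:
        flags.append("software WebGL renderer")
    return flags


headless = {
    "webdriver": False,
    "plugins": [],
    "screen": (0, 0),
    "hardwareConcurrency": 4,
    "webgl_renderer": "Google SwiftShader",
}
desktop = {
    "webdriver": False,
    "plugins": ["PDF Viewer"],
    "screen": (1920, 1080),
    "hardwareConcurrency": 8,
    "webgl_renderer": "ANGLE (Example GPU)",
}

print(fingerprint_flags(headless))
print(fingerprint_flags(desktop))
```

The headless profile is flagged three times even though navigator.webdriver is false.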
Layer 5: Behavioral analysis
DataDome and HUMAN specialize here. Their JavaScript SDK instruments the page extensively, collecting signals about how a user actually interacts:
- Mouse movement trajectories — human movement curves, hesitations, overshoots; bot movement is too linear or too random
- Scroll patterns — humans scroll unevenly, pause to read, change direction
- Keystroke dynamics — time between keystrokes, dwell time per key
- Click coordinates — humans don't click the exact center of a button every time
- Time-on-page patterns — pages visited too quickly suggest no rendering or reading is happening
A Playwright session with a residential IP can still fail a DataDome check because the automation moves the mouse in straight lines between elements and clicks at suspiciously regular, round-number millisecond intervals. You can patch this with mouse movement emulation libraries, but it's a continuous cat-and-mouse game.
Behavioral analysis is the hardest layer to bypass purely at the network level. Residential proxies don't help here — the problem is in the browser automation, not the IP.
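One simple way to see why straight-line movement stands out is to measure how direct a path is. The sketch below is an illustration only: the straightness ratio is a stand-in for whatever features a real behavioral model uses, and the Bezier curve is one common way emulation libraries bend a path, not a guaranteed bypass.

```python
import math
import random


def straightness(points):
    """Straight-line distance divided by distance travelled; 1.0 is a perfect line."""
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return math.dist(points[0], points[-1]) / travelled if travelled else 1.0


def linear_path(start, end, steps=20):
    """What naive automation does: evenly spaced points on a straight line."""
    return [
        (start[0] + (end[0] - start[0]) * i / steps,
         start[1] + (end[1] - start[1]) * i / steps)
        for i in range(steps + 1)
    ]


def curved_path(start, end, steps=20, bend=0.25, rng=None):
    """Quadratic Bezier whose control point sits off to one side of the line."""
    rng = rng or random.Random()
    dx, dy = end[0] - start[0], end[1] - start[1]
    side = rng.choice((-1, 1)) * bend
    control = ((start[0] + end[0]) / 2 - dy * side,
               (start[1] + end[1]) / 2 + dx * side)
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * start[0] + 2 * (1 - t) * t * control[0] + t ** 2 * end[0]
        y = (1 - t) ** 2 * start[1] + 2 * (1 - t) * t * control[1] + t ** 2 * end[1]
        points.append((x, y))
    return points


start, end = (100, 100), (600, 400)
print(round(straightness(linear_path(start, end)), 3))
print(round(straightness(curved_path(start, end, rng=random.Random(1))), 3))
```

A curved path alone is not enough. Real detectors also look at speed, hesitation, and overshoot, which is why the text calls this a continuous cat-and-mouse.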
Layer 6: Rate limiting and session signals
Beyond behavioral signals, anti-bot systems look at session-level patterns:
- Request rate — how many requests per IP per time window, both absolute and relative to baseline
- Cookie staleness — real browsers accumulate cookies over time; a fresh session with no cookie history looks suspicious
- Referrer consistency — legitimate users come from Google, type directly, click links; scrapers often hit internal pages directly
- Accept-Language and header order — HTTP header order and values must match what the declared browser actually sends
- IP-to-session ratio — one IP being reused across many sessions with different fingerprints is a signal
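The request-rate signal in the list above is the easiest one to sketch. Here is a minimal sliding-window counter keyed by IP. The class name, limit, and window size are hypothetical, and timestamps are passed in explicitly so the behaviour is deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Count requests per IP inside a moving time window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: float) -> bool:
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        recent.append(now)  # rejected requests still count toward the rate
        return len(recent) <= self.limit


limiter = SlidingWindowCounter(limit=3, window_seconds=10.0)
for t in (0, 1, 2, 3, 12):
    print(t, limiter.allow("198.51.100.7", now=t))
```

The fourth request inside the window is rejected, and the client is allowed again once the earlier timestamps expire. Spreading requests across many IPs defeats this particular counter, which is where the IP-to-session ratio signal comes in.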
What this means in practice
Anti-bot detection is layered. Each vendor emphasizes different layers:
| Vendor | Primary layers |
|---|---|
| Cloudflare Bot Management | IP reputation, TLS fingerprint, JS challenge, ML scoring |
| DataDome | Behavioral analysis, device fingerprint, ML per-request scoring |
| Akamai Bot Manager | HTTP/2 fingerprint, IP reputation, JS challenge |
| HUMAN (PerimeterX) | Behavioral analysis, device fingerprint, network graph analysis |
| Kasada | JS obfuscation + runtime analysis, TLS fingerprint |
For most targets (standard e-commerce, social platforms, news sites), the IP layer is the first and most significant barrier. Residential IPs get past Cloudflare's IP reputation check, and most sites stop checking there. As a rough estimate, a residential proxy plus a correctly configured browser User-Agent gets you to about 80% of the public web without additional work.
For targets running DataDome or HUMAN with behavioral analysis enabled, you need more: a real headless browser with anti-detection patches, realistic mouse movement, and session management that accumulates cookies over time. The IP layer still matters — behavioral analysis results feed into a risk score that's also weighted by IP reputation — but IP alone isn't sufficient.
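The idea that layers feed one combined risk score can be sketched as a weighted average. Everything here is an assumption made for illustration: the layer names, weights, and thresholds are hypothetical, and real vendors use machine-learned models rather than fixed weights.

```python
# Hypothetical weights per layer; they sum to 1.0.
WEIGHTS = {"ip": 0.4, "tls": 0.2, "browser": 0.15, "behavior": 0.25}


def risk_score(signals: dict) -> float:
    """Weighted average of per-layer scores in [0, 1]; missing layers count as 0.5."""
    return sum(weight * signals.get(layer, 0.5) for layer, weight in WEIGHTS.items())


def decision(score: float, challenge_at: float = 0.3, block_at: float = 0.7) -> str:
    if score >= block_at:
        return "block"
    if score >= challenge_at:
        return "challenge"
    return "allow"


clean_session = {"ip": 0.1, "tls": 0.1, "browser": 0.1, "behavior": 0.1}
robotic_mouse = {"ip": 0.1, "tls": 0.1, "browser": 0.2, "behavior": 0.9}
datacenter_script = {"ip": 0.9, "tls": 0.9}  # no JavaScript signals collected

for name, signals in [("clean", clean_session),
                      ("robotic", robotic_mouse),
                      ("datacenter", datacenter_script)]:
    print(name, decision(risk_score(signals)))
```

In this toy model the second session has a clean residential IP and is still challenged because of its behavioral score, which matches the point above: the IP is necessary but not sufficient.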
The practical takeaway: start with residential IPs. Most of the time, that's the only layer that's actually blocking you. For the subset of targets that use heavy behavioral analysis, the IP is necessary but not sufficient, and you'll need to invest in proper browser automation too.
Further reading
- ja3.zone — check your TLS fingerprint
- curl-impersonate — curl fork that mimics Chrome/Firefox TLS stacks
- rebrowser-patches — Playwright patches for headless detection
