How We Built Vibe Eval — From One Monolith to 200 Security Agents Running in Parallel

TL;DR

  • Vibe Eval started as a 300-line Python script scanning one Replit app for obvious holes.
  • Version 1 was a monolith: hardcoded secrets, synchronous Playwright blocking the event loop, checks running one-by-one.
  • We hit the wall at ~1,000 scans/day when response times ballooned to 45+ seconds.
  • The rebuild involved parallel check execution, async Playwright, Redis caching, and eventually 200+ security checks running as independent functions.
  • Today Vibe Eval handles 100K+ scans/day, catches issues in milliseconds, and costs 90% less per scan than the v1 monolith.

The First Version (Weekend Hack)

It was late 2024. I’d just shipped three apps with Cursor and Lovable. All three had auth bypasses I didn’t catch until staging. The pattern was obvious: AI tools vibe fast but skip guard rails.

I opened a new Python file and wrote a basic scanner:

  1. Launch Playwright
  2. Load the app URL
  3. Check for obvious stuff: missing CSP headers, exposed .env routes, weak session cookies
  4. Email me a report

No database. No queue. Just a FastAPI endpoint that took a URL and returned JSON.
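In spirit, the whole thing fit in one file. Here's a minimal sketch of that first version, assuming FastAPI and the sync Playwright API; the endpoint shape and the exact checks are illustrative, not the original source:

from fastapi import FastAPI
from playwright.sync_api import sync_playwright
import requests

app = FastAPI()

@app.get("/scan")
def scan(url: str):
    findings = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        response = page.goto(url)

        # Check 1: missing Content-Security-Policy header
        if "content-security-policy" not in response.headers:
            findings.append({"check": "missing_csp", "severity": "high"})

        # Check 2: session cookies without Secure/HttpOnly flags
        for cookie in page.context.cookies():
            if not cookie.get("secure") or not cookie.get("httpOnly"):
                findings.append({"check": "weak_session_cookie",
                                 "cookie": cookie["name"], "severity": "medium"})
        browser.close()

    # Check 3: exposed .env route
    r = requests.get(url.rstrip("/") + "/.env", timeout=5)
    if r.status_code == 200 and "=" in r.text:
        findings.append({"check": "exposed_env", "severity": "critical"})

    return {"url": url, "findings": findings}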

It worked. The first scan caught four issues across my three apps. I fixed them in 20 minutes.

Then I shared it on Twitter. Within 48 hours, 50 people wanted to use it.

The Scaling Problem (Monolith Meltdown)

By January 2025, Vibe Eval was scanning ~500 apps/day. The original script had grown to 200+ security checks. Response times averaged 30 seconds. Then 45. Then timeouts.

The bottlenecks:

  • Synchronous Playwright — every scan blocked the async event loop while the browser loaded
  • Sequential checks — 200 checks ran one-by-one, each waiting for the previous to finish
  • No caching — rescanning the same app meant re-running all 200 checks from scratch
  • Hardcoded secrets — Supabase keys and Sentry tokens lived in the code
  • Wide-open CORS — allow_origins=["*"] meant anyone could hit the API

I had three options:

  1. Quick fixes (parallel execution, caching, async Playwright)
  2. Service-oriented architecture (split into microservices)
  3. Full serverless rewrite (event-driven, Lambda functions for every check)

I started with Option 1. The goal: 3-5x faster scans in 4-6 weeks without breaking existing users.

Phase 1: Performance Surgery (Weeks 1-6)

Week 1: Security Hardening

First, I had to stop leaking secrets and exposing the API to the world.

  • Migrated all hardcoded secrets to environment variables using python-dotenv
  • Tightened CORS to a whitelist of approved domains
  • Added Pydantic validators to reject malicious URLs and disposable emails
  • Implemented rate limiting: 10 req/min for unauthenticated users, 100 for paid accounts

Result: No more exposed keys. API abuse dropped 80%.
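A minimal sketch of the CORS and input-validation pieces from that list, assuming FastAPI's built-in CORSMiddleware and Pydantic v2; the approved domain and the disposable-domain list are illustrative:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, HttpUrl, field_validator

app = FastAPI()

# CORS whitelist instead of allow_origins=["*"]
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.vibeeval.dev"],  # hypothetical approved domain
    allow_methods=["GET", "POST"],
    allow_headers=["Authorization", "Content-Type"],
)

DISPOSABLE_DOMAINS = {"mailinator.com", "tempmail.dev"}  # illustrative list

class ScanRequest(BaseModel):
    url: HttpUrl  # rejects non-URL input outright
    email: str

    @field_validator("email")
    @classmethod
    def reject_disposable(cls, v: str) -> str:
        if v.split("@")[-1].lower() in DISPOSABLE_DOMAINS:
            raise ValueError("disposable email addresses are not allowed")
        return v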

Week 2-3: Parallel Execution

The big win. I refactored the check runner to use ThreadPoolExecutor with 10 workers. Instead of running checks sequentially, Vibe Eval now processed them in parallel.

# Before: 200 checks × 200ms each, run sequentially ≈ 40 seconds
results = []
for check in checks:
    results.append(check.run(page_data))

# After: 200 checks across 10 workers ≈ 4 seconds
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(check.run, page_data) for check in checks]
    results = [f.result() for f in futures]

Result: Average scan time dropped from 45s to 8s, a 5.6x improvement.

Week 4: Async Playwright Migration

The synchronous Playwright calls were still blocking the event loop. I converted every check to use the async API:

# Before: blocks the event loop while the browser works
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()

# After: control returns to the event loop while waiting
from playwright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch()

This let FastAPI handle multiple scans concurrently without blocking.

Result: Could handle 50 concurrent scans vs. 5 before.
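To make the concurrency gain concrete, here's a hedged sketch of running several scans at once with asyncio.gather (simplified: one browser per scan, no error handling):

import asyncio
from playwright.async_api import async_playwright

async def scan(url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return {"url": url, "title": title}

async def main() -> None:
    urls = ["https://example.com", "https://example.org"]
    # Scans run concurrently; the event loop stays free while pages load
    results = await asyncio.gather(*(scan(u) for u in urls))
    print(results)

asyncio.run(main())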

Week 5-6: Redis Caching

I added a Redis layer to cache scan results for 1 hour. Cache key: hash(url + check_version). If the same app got rescanned within an hour, Vibe Eval returned cached results instantly.
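A minimal sketch of the caching layer, assuming redis-py; run_all_checks stands in for the real scan pipeline, and the version string is illustrative:

import hashlib
import json
import redis

cache = redis.Redis()

CHECK_VERSION = "2025.01"  # bump whenever check logic changes to invalidate old entries

def cache_key(url: str) -> str:
    # Key derivation as described: hash(url + check_version)
    return "scan:" + hashlib.sha256(f"{url}|{CHECK_VERSION}".encode()).hexdigest()

def get_or_scan(url: str) -> dict:
    key = cache_key(url)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: instant return
    results = run_all_checks(url)  # hypothetical: the full check pipeline
    cache.setex(key, 3600, json.dumps(results))  # 1-hour TTL
    return results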

Result: 40% of scans were cache hits. Latency for cached scans: 120ms.

The Numbers After Phase 1

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Avg scan time | 45s | 8s | 5.6x faster |
| Concurrent scans | 5 | 50 | 10x more |
| Cost per scan | $0.005 | $0.002 | 60% cheaper |
| Cache hit rate | 0% | 40% | New capability |
| Daily capacity | 1,000 scans | 10,000 scans | 10x scale |

That bought me time. But I knew the monolith wouldn’t scale to 100K scans/day.

Phase 2: Service Decomposition (Weeks 7-18)

I started breaking the monolith into services using the strangler fig pattern — build new services alongside the old system, gradually route traffic to them, then decommission the monolith.

Service Breakdown

  1. API Gateway — routing, auth, rate limiting
  2. Scanner Service — Playwright orchestration, page capture
  3. Check Engine Service — runs the 200+ checks in parallel
  4. Report Service — generates HTML/PDF reports
  5. Notification Service — emails, webhooks, Slack alerts
  6. Dashboard Service — project CRUD, user preferences
  7. Background Worker — periodic rescans, cleanup tasks

Message Queue (The Glue)

I added RabbitMQ to decouple services. Instead of synchronous calls, services published events (a publishing sketch follows the list):

  • scan.started → Scanner Service picks it up
  • scan.completed → Check Engine subscribes
  • findings.ready → Report Service generates the report
  • report.ready → Notification Service sends emails
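A sketch of what publishing looks like, assuming pika and a durable topic exchange; the exchange name and payload fields are illustrative:

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="vibe-eval", exchange_type="topic", durable=True)

def publish(routing_key: str, payload: dict) -> None:
    channel.basic_publish(
        exchange="vibe-eval",
        routing_key=routing_key,  # e.g. "scan.started", "scan.completed"
        body=json.dumps(payload).encode(),
        properties=pika.BasicProperties(delivery_mode=2),  # persist across restarts
    )

publish("scan.started", {"scan_id": "…", "url": "https://example.com"})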

Benefits:

  • Fault isolation — if the Report Service crashes, scans still complete
  • Independent scaling — can run 20 Check Engine workers and 2 Scanner workers
  • Retry logic — failed checks get retried automatically
  • Observability — can see exactly where scans get stuck

Database Schema Changes

I split the monolithic reports table into proper entities:

-- Track scan lifecycle
CREATE TABLE scans (
  id UUID PRIMARY KEY,
  url TEXT NOT NULL,
  status TEXT NOT NULL, -- pending, scanning, completed, failed
  started_at TIMESTAMP,
  completed_at TIMESTAMP
);

-- Decouple findings from reports
CREATE TABLE findings (
  id UUID PRIMARY KEY,
  scan_id UUID REFERENCES scans(id),
  check_name TEXT NOT NULL,
  severity INT NOT NULL,
  info TEXT
);

This made it possible to query “all critical findings across all scans for this project” without scanning the entire reports table.
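A hedged sketch of that query, assuming psycopg 3; the connection string and the severity threshold for "critical" are assumptions, and the per-project filter would join a projects table not shown in this excerpt:

import psycopg

with psycopg.connect("postgresql://localhost/vibeeval") as conn:  # hypothetical DSN
    rows = conn.execute(
        """
        SELECT f.check_name, f.info, s.url, s.completed_at
        FROM findings f
        JOIN scans s ON s.id = f.scan_id
        WHERE f.severity >= 9  -- assumed threshold for "critical"
        ORDER BY s.completed_at DESC
        """
    ).fetchall()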

The Numbers After Phase 2

| Metric | Phase 1 | Phase 2 | Improvement |
| --- | --- | --- | --- |
| Daily capacity | 10,000 | 100,000 | 10x scale |
| P95 latency | 12s | 3s | 4x faster |
| Scanner crashes affect entire system | Yes | No | Fault isolation |
| Can scale services independently | No | Yes | New capability |

Phase 3: Serverless Checks (Weeks 19-30)

The Check Engine was still the bottleneck. Even with 20 workers, each worker had to churn through its whole batch of checks, so throughput scaled only linearly with worker count.

The insight: What if every check was an independent function?

I converted the 200 checks into standalone AWS Lambda functions. When a scan completes, the Check Orchestrator publishes 200 events in parallel — one per check. Each check function processes its event and publishes findings back.

# Orchestrator publishes one event per enabled check
for check in enabled_checks:
    event_bus.publish('check.execute', {
        'check_name': check.name,
        'scan_id': scan_id,
        'page_data_s3_url': s3_url  # checks fetch the captured page from S3
    })

# 200 Lambda functions execute in parallel;
# each one publishes its findings back to the bus
event_bus.publish('check.completed', {
    'scan_id': scan_id,
    'finding': {...}
})

Result: Checks that took 8 seconds now finish in 400ms. 20x improvement.
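The event_bus object in the snippet above is an abstraction. One plausible backing, assuming Amazon EventBridge via boto3; the bus name and source are illustrative:

import json
import boto3

events = boto3.client("events")

def publish(detail_type: str, detail: dict) -> None:
    events.put_events(
        Entries=[{
            "Source": "vibe-eval.orchestrator",
            "DetailType": detail_type,  # e.g. "check.execute"
            "Detail": json.dumps(detail),
            "EventBusName": "vibe-eval",  # hypothetical custom bus
        }]
    )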

Cost Optimization

Lambda pricing is pay-per-use: you're billed per request plus per GB-second of compute, with no idle servers. At this scale, that's way cheaper than running servers.

Per-scan cost breakdown:

  • Browser Scanner: 500ms, 1GB RAM = $0.000008
  • 200 Check Functions: 100ms each, 512MB RAM = $0.0001
  • Result Aggregator: 200ms = $0.000002
  • Report Generator: 300ms = $0.000003
  • S3 storage: 100KB report = $0.0000023

Total: $0.00012 per scan (vs. $0.005 for the monolith)

At 100K scans/day, that’s $12/day vs. $500/day. 97% cost reduction.
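If you want to sanity-check those numbers, here's a back-of-envelope calculator assuming on-demand us-east-1 pricing (roughly $0.0000166667 per GB-second plus $0.20 per million requests); memory sizes follow the breakdown above, and real bills vary with region, architecture, and discounts:

PRICE_PER_GB_SECOND = 0.0000166667    # assumed on-demand compute rate
PRICE_PER_REQUEST = 0.20 / 1_000_000  # assumed per-request rate

def lambda_cost(duration_s: float, memory_gb: float, invocations: int = 1) -> float:
    """Cost of `invocations` runs at the given duration and memory."""
    compute = duration_s * memory_gb * PRICE_PER_GB_SECOND
    return invocations * (compute + PRICE_PER_REQUEST)

per_scan = (
    lambda_cost(0.5, 1.0)         # browser scanner
    + lambda_cost(0.1, 0.5, 200)  # 200 check functions
    + lambda_cost(0.2, 0.5)       # result aggregator
    + lambda_cost(0.3, 0.5)       # report generator
)
# Order of magnitude matches the breakdown above; exact figures
# depend on memory sizing, region, and pricing tier.
print(f"~${per_scan:.6f} per scan, ~${per_scan * 100_000:.0f}/day at 100K scans")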

The Current Architecture (Today)

Here’s what Vibe Eval looks like now:

API Gateway
     ↓
RabbitMQ Event Bus
     ↓
┌───────────────────────────────────────┐
│ Scanner Service (Playwright)          │
│ → captures page, network, screenshots │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ Check Orchestrator                    │
│ → publishes 200 parallel events       │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ 200 Lambda Functions (Checks)         │
│ → SSL, CSP, XSS, Auth, CORS, etc.     │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ Result Aggregator                     │
│ → collects findings, calculates score │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ Report Service (HTML/PDF)             │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ Notification Service (Email/Webhook)  │
└───────────────────────────────────────┘

Data Layer:

  • Event Store (DynamoDB) — immutable log of all events, enables replay
  • Read Models (DynamoDB) — optimized views for dashboard queries
  • Cache (Redis) — scan results, sessions, rate limits
  • Object Storage (S3) — reports, screenshots, page snapshots

What I Learned

1. Start with the simplest thing that works

The weekend hack was a 300-line script. It worked. Don’t over-engineer early.

2. Measure before you optimize

I didn’t know parallel execution would give 5x gains until I profiled the monolith and saw sequential checks eating 90% of runtime.

3. Incremental migration beats big rewrites

The strangler fig pattern let me ship new features every two weeks while rebuilding the backend. Users never noticed.

4. Serverless is a cheat code for spiky workloads

Vibe Eval gets 100K scans/day but 80% happen between 9am-5pm PT. Lambda auto-scales to zero at night. Servers would cost 10x more.

5. Events > Synchronous calls

Message queues decouple services and make debugging way easier. I can replay failed scans from the event log without re-running Playwright.

The Roadmap Ahead

Next 3 months:

  • Multi-region deployment — run scanners in 5 AWS regions, route scans to nearest region
  • Check Marketplace — let third-party devs build and sell custom checks
  • Real-time subscriptions — WebSocket updates so dashboards show scan progress live
  • AI scenario generation — use LLMs to generate edge-case test flows

Next 12 months:

  • Global event sourcing — DynamoDB Global Tables for cross-region replay
  • Edge caching — CloudFront for report delivery, <100ms latency worldwide
  • Enterprise SSO — SAML/OAuth for large teams
  • Compliance packs — pre-built check bundles for SOC2, GDPR, HIPAA

Try It Yourself

If you’re shipping apps with Cursor, Lovable, Replit, or Bolt — connect your staging URL to Vibe Eval and run the “Quick Scan” preset. You’ll get a report in under 10 seconds showing auth bypasses, exposed secrets, and prompt injection risks.

Most founders fix the critical issues in under 5 minutes. The time you save not dealing with a production breach pays for a year of scans.

Start vibing, ship knowing agents already poked every sharp edge.
