How We Built Vibe Eval — From One Monolith to 200 Security Agents Running in Parallel

TL;DR

  • Vibe Eval started as a 300-line Python script scanning one Replit app for obvious holes.
  • Version 1 was a monolith: hardcoded secrets, synchronous Playwright blocking the event loop, checks running one-by-one.
  • We hit the wall at ~1,000 scans/day when response times ballooned to 45+ seconds.
  • The rebuild involved parallel check execution, async Playwright, Redis caching, and eventually 200+ security checks running as independent functions.
  • Today Vibe Eval handles 100K+ scans/day, catches issues in milliseconds, and costs ~97% less per scan than the v1 monolith.

The First Version (Weekend Hack)

It was late 2024. I’d just shipped three apps with Cursor and Lovable. All three had auth bypasses I didn’t catch until staging. The pattern was obvious: AI tools vibe fast but skip guard rails.

I opened a new Python file and wrote a basic scanner:

  1. Launch Playwright
  2. Load the app URL
  3. Check for obvious stuff: missing CSP headers, exposed .env routes, weak session cookies
  4. Email me a report

No database. No queue. Just a FastAPI endpoint that took a URL and returned JSON.
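For flavor, here's a minimal sketch of the shape v1 had — a hypothetical reconstruction, not the actual code; the endpoint and check logic are illustrative:

import os
from fastapi import FastAPI
from playwright.sync_api import sync_playwright

app = FastAPI()

@app.get("/scan")
def scan(url: str):
    # hypothetical reconstruction of the weekend-hack scanner
    findings = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        response = page.goto(url)
        headers = response.headers if response else {}
        # check for a missing Content-Security-Policy header
        if "content-security-policy" not in headers:
            findings.append("missing CSP header")
        # probe for an exposed .env route
        env_probe = page.request.get(url.rstrip("/") + "/.env")
        if env_probe.ok and "=" in env_probe.text():
            findings.append("exposed .env route")
        browser.close()
    return {"url": url, "findings": findings}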

It worked. The first scan caught four issues across my three apps. I fixed them in 20 minutes.

Then I shared it on Twitter. Within 48 hours, 50 people wanted to use it.

The Scaling Problem (Monolith Meltdown)

By January 2025, Vibe Eval was scanning ~500 apps/day. The original script had grown to 200+ security checks. Response times averaged 30 seconds. Then 45. Then timeouts.

The bottlenecks:

  • Synchronous Playwright — every scan blocked the async event loop while the browser loaded
  • Sequential checks — 200 checks ran one-by-one, each waiting for the previous to finish
  • No caching — rescanning the same app meant re-running all 200 checks from scratch
  • Hardcoded secrets — Supabase keys and Sentry tokens lived in the code
  • Wide-open CORS — allow_origins=["*"] meant anyone could hit the API

I had three options:

  1. Quick fixes (parallel execution, caching, async Playwright)
  2. Service-oriented architecture (split into microservices)
  3. Full serverless rewrite (event-driven, Lambda functions for every check)

I started with Option 1. The goal: 3-5x faster scans in 4-6 weeks without breaking existing users.

Phase 1: Performance Surgery (Weeks 1-6)

Week 1: Security Hardening

First, I had to stop leaking secrets and exposing the API to the world.

  • Migrated all hardcoded secrets to environment variables using python-dotenv
  • Tightened CORS to a whitelist of approved domains
  • Added Pydantic validators to reject malicious URLs and disposable emails
  • Implemented rate limiting: 10 req/min for unauthenticated users, 100 for paid accounts

Result: No more exposed keys. API abuse dropped 80%.
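A hedged sketch of what that hardening might look like — slowapi is my assumed choice for the rate limiter (the post only says "rate limiting"), and the whitelisted domain and limits are illustrative:

import os
from dotenv import load_dotenv
from fastapi import FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from pydantic import HttpUrl
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

load_dotenv()
SUPABASE_KEY = os.environ["SUPABASE_KEY"]  # secrets now come from the environment

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

# CORS whitelist instead of allow_origins=["*"]
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://vibeeval.example"],  # hypothetical approved domain
    allow_methods=["GET", "POST"],
)

@app.get("/scan")
@limiter.limit("10/minute")  # paid accounts get 100/minute
def scan(request: Request, url: HttpUrl):  # Pydantic HttpUrl rejects malformed URLs
    ...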

Week 2-3: Parallel Execution

The big win. I refactored the check runner to use ThreadPoolExecutor with 10 workers. Instead of running checks sequentially, Vibe Eval now processed them in parallel.

# Before: 200 checks × 200ms = 40 seconds
for check in checks:
    result = check.run(page_data)

# After: 200 checks / 10 workers = 4 seconds
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(check.run, page_data) for check in checks]
    results = [f.result() for f in futures]

Result: Average scan time dropped from 45s to 8s. 5x improvement.

Week 4: Async Playwright Migration

The synchronous Playwright calls were still blocking the event loop. I converted every check to use the async API:

# Before
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()

# After
from playwright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch()

This let FastAPI handle multiple scans concurrently without blocking.

Result: Could handle 50 concurrent scans vs. 5 before.

Week 5-6: Redis Caching

I added a Redis layer to cache scan results for 1 hour. Cache key: hash(url + check_version). If the same app got rescanned within an hour, Vibe Eval returned cached results instantly.

Result: 40% of scans were cache hits. Latency for cached scans: 120ms.
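A minimal sketch of that layer with redis-py — the key scheme follows the hash(url + check_version) formula above, and run_all_checks is a hypothetical stand-in for the real scan path:

import hashlib
import json
import redis

r = redis.Redis()

def cache_key(url: str, check_version: str) -> str:
    return "scan:" + hashlib.sha256(f"{url}:{check_version}".encode()).hexdigest()

def scan_with_cache(url: str, check_version: str) -> dict:
    key = cache_key(url, check_version)
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: the ~120ms path
    results = run_all_checks(url)  # hypothetical full-scan entry point
    r.setex(key, 3600, json.dumps(results))  # 1-hour TTL
    return results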

The Numbers After Phase 1

Metric           | Before      | After        | Improvement
-----------------|-------------|--------------|----------------
Avg scan time    | 45s         | 8s           | 5.6x faster
Concurrent scans | 5           | 50           | 10x more
Cost per scan    | $0.005      | $0.002       | 60% cheaper
Cache hit rate   | 0%          | 40%          | New capability
Daily capacity   | 1,000 scans | 10,000 scans | 10x scale

That bought me time. But I knew the monolith wouldn’t scale to 100K scans/day.

Phase 2: Service Decomposition (Weeks 7-18)

I started breaking the monolith into services using the strangler fig pattern — build new services alongside the old system, gradually route traffic to them, then decommission the monolith.
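In practice that gradual routing can live in the gateway. A hedged sketch of the idea — the service URLs and the 10% split are illustrative, not the actual rollout numbers:

import random
import httpx
from fastapi import FastAPI, Request

app = FastAPI()
NEW_BACKEND_SHARE = 0.10  # ratchet this up as the new service proves itself

@app.post("/scan")
async def route_scan(request: Request):
    backend = (
        "http://scanner-service:8001"  # new service
        if random.random() < NEW_BACKEND_SHARE
        else "http://monolith:8000"    # legacy monolith
    )
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{backend}/scan", content=await request.body())
    return resp.json()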

Service Breakdown

  1. API Gateway — routing, auth, rate limiting
  2. Scanner Service — Playwright orchestration, page capture
  3. Check Engine Service — runs the 200+ checks in parallel
  4. Report Service — generates HTML/PDF reports
  5. Notification Service — emails, webhooks, Slack alerts
  6. Dashboard Service — project CRUD, user preferences
  7. Background Worker — periodic rescans, cleanup tasks

Message Queue (The Glue)

I added RabbitMQ to decouple services. Instead of synchronous calls, services published events:

  • scan.started → Scanner Service picks it up
  • scan.completed → Check Engine subscribes
  • findings.ready → Report Service generates the report
  • report.ready → Notification Service sends emails
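The publishing side stays small. A sketch using pika — the exchange name and payload are illustrative:

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="vibe-eval", exchange_type="topic")

def publish(event_type: str, payload: dict) -> None:
    channel.basic_publish(
        exchange="vibe-eval",
        routing_key=event_type,  # e.g. "scan.completed"
        body=json.dumps(payload).encode(),
        properties=pika.BasicProperties(delivery_mode=2),  # survive broker restarts
    )

publish("scan.started", {"scan_id": "1234", "url": "https://example.com"})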

Benefits:

  • Fault isolation — if the Report Service crashes, scans still complete
  • Independent scaling — can run 20 Check Engine workers and 2 Scanner workers
  • Retry logic — failed checks get retried automatically
  • Observability — can see exactly where scans get stuck

Database Schema Changes

I split the monolithic reports table into proper entities:

-- Track scan lifecycle
CREATE TABLE scans (
  id UUID PRIMARY KEY,
  url TEXT NOT NULL,
  status TEXT NOT NULL, -- pending, scanning, completed, failed
  started_at TIMESTAMP,
  completed_at TIMESTAMP
);

-- Decouple findings from reports
CREATE TABLE findings (
  id UUID PRIMARY KEY,
  scan_id UUID REFERENCES scans(id),
  check_name TEXT NOT NULL,
  severity INT NOT NULL,
  info TEXT
);

This made it possible to query “all critical findings across all scans for this project” without scanning the entire reports table.

The Numbers After Phase 2

Metric                                | Phase 1 | Phase 2 | Improvement
--------------------------------------|---------|---------|----------------
Daily capacity                        | 10,000  | 100,000 | 10x scale
P95 latency                           | 12s     | 3s      | 4x faster
Scanner crashes affect entire system  | Yes     | No      | Fault isolation
Can scale services independently      | No      | Yes     | New capability

Phase 3: Serverless Checks (Weeks 19-30)

The Check Engine was still the bottleneck. Even with 20 workers, each worker ran its share of the 200 checks in-process, so throughput only scaled linearly with the number of workers.

The insight: What if every check was an independent function?

I converted the 200 checks into standalone AWS Lambda functions. When a scan completes, the Check Orchestrator publishes 200 events in parallel — one per check. Each check function processes its event and publishes findings back.

# Orchestrator publishes 200 events, one per enabled check
for check in enabled_checks:
    event_bus.publish('check.execute', {
        'check_name': check.name,
        'scan_id': scan_id,
        'page_data_s3_url': s3_url
    })

# Each of the 200 Lambda check functions executes in parallel
# and publishes its findings back
def handler(event, context):
    finding = run_check(event['check_name'], event['page_data_s3_url'])
    event_bus.publish('check.completed', {
        'scan_id': event['scan_id'],
        'finding': finding
    })

Result: Checks that took 8 seconds now finish in 400ms. 20x improvement.

Cost Optimization

Lambda pricing is pay-per-use — you're billed per request and per GB-second of compute. For a spiky workload like ours, that's far cheaper than running always-on servers.

Per-scan cost breakdown:

  • Browser Scanner: 500ms, 1GB RAM = $0.000008
  • 200 Check Functions: 100ms each, 512MB RAM = $0.0001
  • Result Aggregator: 200ms = $0.000002
  • Report Generator: 300ms = $0.000003
  • S3 storage: 100KB report = $0.0000023

Total: $0.00012 per scan (vs. $0.005 for the monolith)

At 100K scans/day, that’s $12/day vs. $500/day. 97% cost reduction.
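The arithmetic is easy to sanity-check by summing the components (figures copied from the breakdown above):

component_cost = {
    "browser_scanner":   0.000008,
    "check_functions":   0.0001,     # all 200 invocations combined
    "result_aggregator": 0.000002,
    "report_generator":  0.000003,
    "s3_storage":        0.0000023,
}
per_scan = sum(component_cost.values())  # ≈ $0.000115, rounds to $0.00012
print(per_scan * 100_000)                # ≈ $11.5/day at 100K scans, vs. $500 for the monolith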

The Current Architecture (Today)

Here’s what Vibe Eval looks like now:

API Gateway
                    ↓
RabbitMQ Event Bus
                    ↓
┌───────────────────────────────────────┐
│ Scanner Service (Playwright)          │
│ → captures page, network, screenshots │
└───────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────┐
│ Check Orchestrator                    │
│ → publishes 200 parallel events       │
└───────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────┐
│ 200 Lambda Functions (Checks)         │
│ → SSL, CSP, XSS, Auth, CORS, etc.     │
└───────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────┐
│ Result Aggregator                     │
│ → collects findings, calculates score │
└───────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────┐
│ Report Service (HTML/PDF)             │
└───────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────┐
│ Notification Service (Email/Webhook)  │
└───────────────────────────────────────┘

Data Layer:

  • Event Store (DynamoDB) — immutable log of all events, enables replay
  • Read Models (DynamoDB) — optimized views for dashboard queries
  • Cache (Redis) — scan results, sessions, rate limits
  • Object Storage (S3) — reports, screenshots, page snapshots

What I Learned

1. Start with the simplest thing that works

The weekend hack was a 300-line script. It worked. Don’t over-engineer early.

2. Measure before you optimize

I didn’t know parallel execution would give 5x gains until I profiled the monolith and saw sequential checks eating 90% of runtime.
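If you want to reproduce that kind of measurement, the stdlib profiler is enough — run_scan here is a hypothetical entry point:

import cProfile
import pstats

with cProfile.Profile() as profiler:
    run_scan("https://example.com")  # hypothetical scan entry point
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # top 10 offenders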

3. Incremental migration beats big rewrites

The strangler fig pattern let me ship new features every two weeks while rebuilding the backend. Users never noticed.

4. Serverless is a cheat code for spiky workloads

Vibe Eval gets 100K scans/day but 80% happen between 9am-5pm PT. Lambda auto-scales to zero at night. Servers would cost 10x more.

5. Events > Synchronous calls

Message queues decouple services and make debugging way easier. I can replay failed scans from the event log without re-running Playwright.
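A sketch of what that replay can look like, assuming the DynamoDB event store is keyed by scan_id (table, attribute, and helper names are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

events_table = boto3.resource("dynamodb").Table("vibe-eval-events")

def replay_scan(scan_id: str) -> None:
    events = events_table.query(
        KeyConditionExpression=Key("scan_id").eq(scan_id)
    )["Items"]
    for event in events:
        if event["type"] != "scan.started":  # skip the Playwright stage
            publish(event["type"], event["payload"])  # re-drive downstream services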

The Roadmap Ahead

Next 3 months:

  • Multi-region deployment — run scanners in 5 AWS regions, route scans to nearest region
  • Check Marketplace — let third-party devs build and sell custom checks
  • Real-time subscriptions — WebSocket updates so dashboards show scan progress live
  • AI scenario generation — use LLMs to generate edge-case test flows

Next 12 months:

  • Global event sourcing — DynamoDB Global Tables for cross-region replay
  • Edge caching — CloudFront for report delivery, <100ms latency worldwide
  • Enterprise SSO — SAML/OAuth for large teams
  • Compliance packs — pre-built check bundles for SOC2, GDPR, HIPAA

Try It Yourself

If you’re shipping apps with Cursor, Lovable, Replit, or Bolt — connect your staging URL to Vibe Eval and run the “Quick Scan” preset. You’ll get a report in under 10 seconds showing auth bypasses, exposed secrets, and prompt injection risks.

Most founders fix the critical issues in under 5 minutes. The time you save not dealing with a production breach pays for a year of scans.

Start vibing, ship knowing agents already poked every sharp edge.
