How We Built Vibe Eval — From One Monolith to 200 Security Agents Running in Parallel

TL;DR

  • Vibe Eval started as a 300-line Python script scanning one Replit app for obvious holes.
  • Version 1 was a monolith: hardcoded secrets, synchronous Playwright blocking the event loop, checks running one-by-one.
  • We hit the wall at ~1,000 scans/day when response times ballooned to 45+ seconds.
  • The rebuild involved parallel check execution, async Playwright, Redis caching, and eventually 200+ security checks running as independent functions.
  • Today Vibe Eval handles 100K+ scans/day, catches issues in milliseconds, and costs 90% less per scan than the v1 monolith.

The First Version (Weekend Hack)

It was late 2024. I’d just shipped three apps with Cursor and Lovable. All three had auth bypasses I didn’t catch until staging. The pattern was obvious: AI tools vibe fast but skip guard rails.

I opened a new Python file and wrote a basic scanner:

  1. Launch Playwright
  2. Load the app URL
  3. Check for obvious stuff: missing CSP headers, exposed .env routes, weak session cookies
  4. Email me a report

No database. No queue. Just a FastAPI endpoint that took a URL and returned JSON.
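In spirit, the whole thing fit in one file. Here's a minimal sketch of that first version, assuming FastAPI and the sync Playwright API; the endpoint shape and the exact checks are illustrative, not the original source:

from fastapi import FastAPI
from playwright.sync_api import sync_playwright
import requests

app = FastAPI()

@app.get("/scan")
def scan(url: str):
    findings = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        response = page.goto(url)

        # Check 1: missing Content-Security-Policy header
        if "content-security-policy" not in response.headers:
            findings.append({"check": "missing_csp", "severity": "high"})

        # Check 2: session cookies without Secure/HttpOnly flags
        for cookie in page.context.cookies():
            if not cookie.get("secure") or not cookie.get("httpOnly"):
                findings.append({"check": "weak_session_cookie",
                                 "cookie": cookie["name"], "severity": "medium"})
        browser.close()

    # Check 3: exposed .env route
    r = requests.get(url.rstrip("/") + "/.env", timeout=5)
    if r.status_code == 200 and "=" in r.text:
        findings.append({"check": "exposed_env", "severity": "critical"})

    return {"url": url, "findings": findings}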

It worked. The first scan caught four issues across my three apps. I fixed them in 20 minutes.

Then I shared it on Twitter. Within 48 hours, 50 people wanted to use it.

The Scaling Problem (Monolith Meltdown)

By January 2025, Vibe Eval was scanning ~500 apps/day. The original script had grown to 200+ security checks. Response times averaged 30 seconds. Then 45. Then timeouts.

The bottlenecks:

  • Synchronous Playwright — every scan blocked the async event loop while the browser loaded
  • Sequential checks — 200 checks ran one-by-one, each waiting for the previous to finish
  • No caching — rescanning the same app meant re-running all 200 checks from scratch
  • Hardcoded secrets — Supabase keys and Sentry tokens lived in the code
  • Wide-open CORS — allow_origins=["*"] meant anyone could hit the API

I had three options:

  1. Quick fixes (parallel execution, caching, async Playwright)
  2. Service-oriented architecture (split into microservices)
  3. Full serverless rewrite (event-driven, Lambda functions for every check)

I started with Option 1. The goal: 3-5x faster scans in 4-6 weeks without breaking existing users.

Phase 1: Performance Surgery (Weeks 1-6)

Week 1: Security Hardening

First, I had to stop leaking secrets and exposing the API to the world.

  • Migrated all hardcoded secrets to environment variables using python-dotenv
  • Tightened CORS to a whitelist of approved domains
  • Added Pydantic validators to reject malicious URLs and disposable emails
  • Implemented rate limiting: 10 req/min for unauthenticated users, 100 for paid accounts

Result: No more exposed keys. API abuse dropped 80%.
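A minimal sketch of the CORS and input-validation pieces from that list, assuming FastAPI's built-in CORSMiddleware and Pydantic v2; the approved domain and the disposable-domain list are illustrative:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, HttpUrl, field_validator

app = FastAPI()

# CORS whitelist instead of allow_origins=["*"]
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.vibeeval.dev"],  # hypothetical approved domain
    allow_methods=["GET", "POST"],
    allow_headers=["Authorization", "Content-Type"],
)

DISPOSABLE_DOMAINS = {"mailinator.com", "tempmail.dev"}  # illustrative list

class ScanRequest(BaseModel):
    url: HttpUrl  # rejects non-URL input outright
    email: str

    @field_validator("email")
    @classmethod
    def reject_disposable(cls, v: str) -> str:
        if v.split("@")[-1].lower() in DISPOSABLE_DOMAINS:
            raise ValueError("disposable email addresses are not allowed")
        return v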

Week 2-3: Parallel Execution

The big win. I refactored the check runner to use ThreadPoolExecutor with 10 workers. Instead of running checks sequentially, Vibe Eval now processed them in parallel.

# Before: 200 checks × 200ms each, run sequentially ≈ 40 seconds
results = []
for check in checks:
    results.append(check.run(page_data))

# After: 200 checks across 10 workers ≈ 4 seconds
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(check.run, page_data) for check in checks]
    results = [f.result() for f in futures]

Result: Average scan time dropped from 45s to 8s, a 5.6x improvement.

Week 4: Async Playwright Migration

The synchronous Playwright calls were still blocking the event loop. I converted every check to use the async API:

# Before: blocks the event loop while the browser works
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()

# After: control returns to the event loop while waiting
from playwright.async_api import async_playwright

async with async_playwright() as p:
    browser = await p.chromium.launch()

This let FastAPI handle multiple scans concurrently without blocking.

Result: Could handle 50 concurrent scans vs. 5 before.
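To make the concurrency gain concrete, here's a hedged sketch of running several scans at once with asyncio.gather (simplified: one browser per scan, no error handling):

import asyncio
from playwright.async_api import async_playwright

async def scan(url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return {"url": url, "title": title}

async def main() -> None:
    urls = ["https://example.com", "https://example.org"]
    # Scans run concurrently; the event loop stays free while pages load
    results = await asyncio.gather(*(scan(u) for u in urls))
    print(results)

asyncio.run(main())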

Week 5-6: Redis Caching

I added a Redis layer to cache scan results for 1 hour. Cache key: hash(url + check_version). If the same app got rescanned within an hour, Vibe Eval returned cached results instantly.
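A minimal sketch of the caching layer, assuming redis-py; run_all_checks stands in for the real scan pipeline, and the version string is illustrative:

import hashlib
import json
import redis

cache = redis.Redis()

CHECK_VERSION = "2025.01"  # bump whenever check logic changes to invalidate old entries

def cache_key(url: str) -> str:
    # Key derivation as described: hash(url + check_version)
    return "scan:" + hashlib.sha256(f"{url}|{CHECK_VERSION}".encode()).hexdigest()

def get_or_scan(url: str) -> dict:
    key = cache_key(url)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: instant return
    results = run_all_checks(url)  # hypothetical: the full check pipeline
    cache.setex(key, 3600, json.dumps(results))  # 1-hour TTL
    return results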

Result: 40% of scans were cache hits. Latency for cached scans: 120ms.

The Numbers After Phase 1

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Avg scan time | 45s | 8s | 5.6x faster |
| Concurrent scans | 5 | 50 | 10x more |
| Cost per scan | $0.005 | $0.002 | 60% cheaper |
| Cache hit rate | 0% | 40% | New capability |
| Daily capacity | 1,000 scans | 10,000 scans | 10x scale |

That bought me time. But I knew the monolith wouldn’t scale to 100K scans/day.

Phase 2: Service Decomposition (Weeks 7-18)

I started breaking the monolith into services using the strangler fig pattern — build new services alongside the old system, gradually route traffic to them, then decommission the monolith.

Service Breakdown

  1. API Gateway — routing, auth, rate limiting
  2. Scanner Service — Playwright orchestration, page capture
  3. Check Engine Service — runs the 200+ checks in parallel
  4. Report Service — generates HTML/PDF reports
  5. Notification Service — emails, webhooks, Slack alerts
  6. Dashboard Service — project CRUD, user preferences
  7. Background Worker — periodic rescans, cleanup tasks

Message Queue (The Glue)

I added RabbitMQ to decouple services. Instead of synchronous calls, services published events (a publishing sketch follows the list):

  • scan.started → Scanner Service picks it up
  • scan.completed → Check Engine subscribes
  • findings.ready → Report Service generates the report
  • report.ready → Notification Service sends emails
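A sketch of what publishing looks like, assuming pika and a durable topic exchange; the exchange name and payload fields are illustrative:

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="vibe-eval", exchange_type="topic", durable=True)

def publish(routing_key: str, payload: dict) -> None:
    channel.basic_publish(
        exchange="vibe-eval",
        routing_key=routing_key,  # e.g. "scan.started", "scan.completed"
        body=json.dumps(payload).encode(),
        properties=pika.BasicProperties(delivery_mode=2),  # persist across restarts
    )

publish("scan.started", {"scan_id": "…", "url": "https://example.com"})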

Benefits:

  • Fault isolation — if the Report Service crashes, scans still complete
  • Independent scaling — can run 20 Check Engine workers and 2 Scanner workers
  • Retry logic — failed checks get retried automatically
  • Observability — can see exactly where scans get stuck

Database Schema Changes

I split the monolithic reports table into proper entities:

-- Track scan lifecycle
CREATE TABLE scans (
  id UUID PRIMARY KEY,
  url TEXT NOT NULL,
  status TEXT NOT NULL, -- pending, scanning, completed, failed
  started_at TIMESTAMP,
  completed_at TIMESTAMP
);

-- Decouple findings from reports
CREATE TABLE findings (
  id UUID PRIMARY KEY,
  scan_id UUID REFERENCES scans(id),
  check_name TEXT NOT NULL,
  severity INT NOT NULL,
  info TEXT
);

This made it possible to query “all critical findings across all scans for this project” without scanning the entire reports table.
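A hedged sketch of that query, assuming psycopg 3; the connection string and the severity threshold for "critical" are assumptions, and the per-project filter would join a projects table not shown in this excerpt:

import psycopg

with psycopg.connect("postgresql://localhost/vibeeval") as conn:  # hypothetical DSN
    rows = conn.execute(
        """
        SELECT f.check_name, f.info, s.url, s.completed_at
        FROM findings f
        JOIN scans s ON s.id = f.scan_id
        WHERE f.severity >= 9  -- assumed threshold for "critical"
        ORDER BY s.completed_at DESC
        """
    ).fetchall()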

The Numbers After Phase 2

| Metric | Phase 1 | Phase 2 | Improvement |
| --- | --- | --- | --- |
| Daily capacity | 10,000 | 100,000 | 10x scale |
| P95 latency | 12s | 3s | 4x faster |
| Scanner crashes affect entire system | Yes | No | Fault isolation |
| Can scale services independently | No | Yes | New capability |

Phase 3: Serverless Checks (Weeks 19-30)

The Check Engine was still the bottleneck. Even with 20 workers, each worker had to churn through its whole batch of checks, so throughput scaled only linearly with worker count.

The insight: What if every check was an independent function?

I converted the 200 checks into standalone AWS Lambda functions. When a scan completes, the Check Orchestrator publishes 200 events in parallel — one per check. Each check function processes its event and publishes findings back.

# Orchestrator publishes one event per enabled check
for check in enabled_checks:
    event_bus.publish('check.execute', {
        'check_name': check.name,
        'scan_id': scan_id,
        'page_data_s3_url': s3_url  # checks fetch the captured page from S3
    })

# 200 Lambda functions execute in parallel;
# each one publishes its findings back to the bus
event_bus.publish('check.completed', {
    'scan_id': scan_id,
    'finding': {...}
})

Result: Checks that took 8 seconds now finish in 400ms. 20x improvement.
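The event_bus object in the snippet above is an abstraction. One plausible backing, assuming Amazon EventBridge via boto3; the bus name and source are illustrative:

import json
import boto3

events = boto3.client("events")

def publish(detail_type: str, detail: dict) -> None:
    events.put_events(
        Entries=[{
            "Source": "vibe-eval.orchestrator",
            "DetailType": detail_type,  # e.g. "check.execute"
            "Detail": json.dumps(detail),
            "EventBusName": "vibe-eval",  # hypothetical custom bus
        }]
    )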

Cost Optimization

Lambda pricing is pay-per-use: you're billed per request plus per GB-second of compute, with no idle servers. At this scale, that's way cheaper than running servers.

Per-scan cost breakdown:

  • Browser Scanner: 500ms, 1GB RAM = $0.000008
  • 200 Check Functions: 100ms each, 512MB RAM = $0.0001
  • Result Aggregator: 200ms = $0.000002
  • Report Generator: 300ms = $0.000003
  • S3 storage: 100KB report = $0.0000023

Total: $0.00012 per scan (vs. $0.005 for the monolith)

At 100K scans/day, that’s $12/day vs. $500/day. 97% cost reduction.
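If you want to sanity-check those numbers, here's a back-of-envelope calculator assuming on-demand us-east-1 pricing (roughly $0.0000166667 per GB-second plus $0.20 per million requests); memory sizes follow the breakdown above, and real bills vary with region, architecture, and discounts:

PRICE_PER_GB_SECOND = 0.0000166667    # assumed on-demand compute rate
PRICE_PER_REQUEST = 0.20 / 1_000_000  # assumed per-request rate

def lambda_cost(duration_s: float, memory_gb: float, invocations: int = 1) -> float:
    """Cost of `invocations` runs at the given duration and memory."""
    compute = duration_s * memory_gb * PRICE_PER_GB_SECOND
    return invocations * (compute + PRICE_PER_REQUEST)

per_scan = (
    lambda_cost(0.5, 1.0)         # browser scanner
    + lambda_cost(0.1, 0.5, 200)  # 200 check functions
    + lambda_cost(0.2, 0.5)       # result aggregator
    + lambda_cost(0.3, 0.5)       # report generator
)
# Order of magnitude matches the breakdown above; exact figures
# depend on memory sizing, region, and pricing tier.
print(f"~${per_scan:.6f} per scan, ~${per_scan * 100_000:.0f}/day at 100K scans")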

The Current Architecture (Today)

Here’s what Vibe Eval looks like now:

API Gateway
     ↓
RabbitMQ Event Bus
     ↓
┌───────────────────────────────────────┐
│ Scanner Service (Playwright)          │
│ → captures page, network, screenshots │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ Check Orchestrator                    │
│ → publishes 200 parallel events       │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ 200 Lambda Functions (Checks)         │
│ → SSL, CSP, XSS, Auth, CORS, etc.     │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ Result Aggregator                     │
│ → collects findings, calculates score │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ Report Service (HTML/PDF)             │
└───────────────────────────────────────┘
     ↓
┌───────────────────────────────────────┐
│ Notification Service (Email/Webhook)  │
└───────────────────────────────────────┘

Data Layer:

  • Event Store (DynamoDB) — immutable log of all events, enables replay
  • Read Models (DynamoDB) — optimized views for dashboard queries
  • Cache (Redis) — scan results, sessions, rate limits
  • Object Storage (S3) — reports, screenshots, page snapshots

What I Learned

1. Start with the simplest thing that works

The weekend hack was a 300-line script. It worked. Don’t over-engineer early.

2. Measure before you optimize

I didn’t know parallel execution would give 5x gains until I profiled the monolith and saw sequential checks eating 90% of runtime.

3. Incremental migration beats big rewrites

The strangler fig pattern let me ship new features every two weeks while rebuilding the backend. Users never noticed.

4. Serverless is a cheat code for spiky workloads

Vibe Eval gets 100K scans/day but 80% happen between 9am-5pm PT. Lambda auto-scales to zero at night. Servers would cost 10x more.

5. Events > Synchronous calls

Message queues decouple services and make debugging way easier. I can replay failed scans from the event log without re-running Playwright.

The Roadmap Ahead

Next 3 months:

  • Multi-region deployment — run scanners in 5 AWS regions, route scans to nearest region
  • Check Marketplace — let third-party devs build and sell custom checks
  • Real-time subscriptions — WebSocket updates so dashboards show scan progress live
  • AI scenario generation — use LLMs to generate edge-case test flows

Next 12 months:

  • Global event sourcing — DynamoDB Global Tables for cross-region replay
  • Edge caching — CloudFront for report delivery, <100ms latency worldwide
  • Enterprise SSO — SAML/OAuth for large teams
  • Compliance packs — pre-built check bundles for SOC2, GDPR, HIPAA

Try It Yourself

If you’re shipping apps with Cursor, Lovable, Replit, or Bolt — connect your staging URL to Vibe Eval and run the “Quick Scan” preset. You’ll get a report in under 10 seconds showing auth bypasses, exposed secrets, and prompt injection risks.

Most founders fix the critical issues in under 5 minutes. The time you save not dealing with a production breach pays for a year of scans.

Start vibing, ship knowing agents already poked every sharp edge.
