TL;DR
- Vibe Eval started as a 300-line Python script scanning my own vibe-coded apps for obvious holes.
- Version 1 was a monolith: hardcoded secrets, synchronous Playwright blocking the event loop, checks running one-by-one.
- We hit the wall at ~1,000 scans/day when response times ballooned to 45+ seconds.
- The rebuild involved parallel check execution, async Playwright, Redis caching, and eventually 200+ security checks running as independent functions.
- Today Vibe Eval handles 100K+ scans/day, returns cached results in milliseconds, and costs 97% less per scan than the v1 monolith.
The First Version (Weekend Hack)
It was late 2024. I’d just shipped three apps with Cursor and Lovable. All three had auth bypasses I didn’t catch until staging. The pattern was obvious: AI tools vibe fast but skip guard rails.
I opened a new Python file and wrote a basic scanner:
- Launch Playwright
- Load the app URL
- Check for obvious stuff: missing CSP headers, exposed `.env` routes, weak session cookies
- Email me a report
No database. No queue. Just a FastAPI endpoint that took a URL and returned JSON.
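Each check was just a small function over the captured response. A minimal sketch of one, assuming a hypothetical `check_headers` helper (the actual v1 code isn't shown here):

```python
# Security headers the scanner expects to see (illustrative subset).
REQUIRED_HEADERS = {
    "content-security-policy",
    "x-frame-options",
    "strict-transport-security",
}

def check_headers(response_headers: dict) -> list[str]:
    """Return one finding for each expected security header that is missing."""
    present = {k.lower() for k in response_headers}
    return [f"missing header: {h}" for h in sorted(REQUIRED_HEADERS - present)]
```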
It worked. The first scan caught four issues across my three apps. I fixed them in 20 minutes.
Then I shared it on Twitter. Within 48 hours, 50 people wanted to use it.
The Scaling Problem (Monolith Meltdown)
By January 2025, Vibe Eval was scanning ~500 apps/day. The original script had grown to 200+ security checks. Response times averaged 30 seconds. Then 45. Then timeouts.
The bottlenecks:
- Synchronous Playwright — every scan blocked the async event loop while the browser loaded
- Sequential checks — 200 checks ran one-by-one, each waiting for the previous to finish
- No caching — rescanning the same app meant re-running all 200 checks from scratch
- Hardcoded secrets — Supabase keys and Sentry tokens lived in the code
- Wide-open CORS — `allow_origins=["*"]` meant anyone could hit the API
I had three options:
- Quick fixes (parallel execution, caching, async Playwright)
- Service-oriented architecture (split into microservices)
- Full serverless rewrite (event-driven, Lambda functions for every check)
I started with Option 1. The goal: 3-5x faster scans in 4-6 weeks without breaking existing users.
Phase 1: Performance Surgery (Weeks 1-6)
Week 1: Security Hardening
First, I had to stop leaking secrets and exposing the API to the world.
- Migrated all hardcoded secrets to environment variables using `python-dotenv`
- Tightened CORS to a whitelist of approved domains
- Added Pydantic validators to reject malicious URLs and disposable emails
- Implemented rate limiting: 10 req/min for unauthenticated users, 100 for paid accounts
Result: No more exposed keys. API abuse dropped 80%.
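The rate limiter can be sketched as a sliding-window counter per client key. This is a simplified in-memory stand-in, not the actual middleware:

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per `window` seconds for each key.
    Hypothetical sketch; production code would back this with Redis."""

    def __init__(self, limit: int, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

One instance with `limit=10` covers unauthenticated users; paid accounts get a second instance with `limit=100`.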
Week 2-3: Parallel Execution
The big win. I refactored the check runner to use ThreadPoolExecutor with 10 workers. Instead of running checks sequentially, Vibe Eval now processed them in parallel.
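The refactored runner, roughly — assuming each check is a callable that takes the captured page data and returns a list of findings:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_checks(page_data: dict, checks: list) -> list:
    """Run all checks in parallel threads and collect their findings.
    Sketch only; the real check signature may differ."""
    findings = []
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = [pool.submit(check, page_data) for check in checks]
        for fut in as_completed(futures):
            findings.extend(fut.result())
    return findings
```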
Result: Average scan time dropped from 45s to 8s. 5x improvement.
Week 4: Async Playwright Migration
The synchronous Playwright calls were still blocking the event loop. I converted every check to use the async API:
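The shape of the conversion, with a stubbed async page capture standing in for `playwright.async_api` (no real browser here):

```python
import asyncio

async def capture_page(url: str) -> dict:
    # Stand-in for the real capture, which would await
    # playwright.async_api calls (launch, new_page, goto).
    await asyncio.sleep(0.01)  # simulated I/O instead of a browser
    return {"url": url, "status": 200}

async def scan(url: str) -> dict:
    # The await yields control back to the event loop, so one slow
    # page load no longer blocks every other request.
    return await capture_page(url)

async def scan_many(urls: list) -> list:
    # Concurrent scans: the capability the sync API couldn't offer.
    return await asyncio.gather(*(scan(u) for u in urls))
```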
This let FastAPI handle multiple scans concurrently without blocking.
Result: Could handle 50 concurrent scans vs. 5 before.
Week 5-6: Redis Caching
I added a Redis layer to cache scan results for 1 hour. Cache key: hash(url + check_version). If the same app got rescanned within an hour, Vibe Eval returned cached results instantly.
Result: 40% of scans were cache hits. Latency for cached scans: 120ms.
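A sketch of the caching layer, with a plain dict standing in for Redis (the real code would use `SETEX` with a 3600-second TTL; `CHECK_VERSION` is an illustrative value):

```python
import hashlib
import json

CHECK_VERSION = "v42"  # bumped whenever check logic changes (hypothetical)

def cache_key(url: str, check_version: str = CHECK_VERSION) -> str:
    raw = f"{url}|{check_version}".encode()
    return "scan:" + hashlib.sha256(raw).hexdigest()

def get_or_scan(url: str, cache: dict, run_scan) -> dict:
    """Return cached results if present; otherwise scan and cache."""
    key = cache_key(url)
    if key in cache:
        return json.loads(cache[key])  # cache hit: no browser, no checks
    result = run_scan(url)
    cache[key] = json.dumps(result)    # real code: SETEX key 3600 <json>
    return result
```

Because the check version is part of the key, shipping new checks automatically invalidates stale results.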
The Numbers After Phase 1
| Metric | Before | After | Improvement |
|---|---|---|---|
| Avg scan time | 45s | 8s | 5.6x faster |
| Concurrent scans | 5 | 50 | 10x more |
| Cost per scan | $0.005 | $0.002 | 60% cheaper |
| Cache hit rate | 0% | 40% | New capability |
| Daily capacity | 1,000 scans | 10,000 scans | 10x scale |
That bought me time. But I knew the monolith wouldn’t scale to 100K scans/day.
Phase 2: Service Decomposition (Weeks 7-18)
I started breaking the monolith into services using the strangler fig pattern — build new services alongside the old system, gradually route traffic to them, then decommission the monolith.
Service Breakdown
- API Gateway — routing, auth, rate limiting
- Scanner Service — Playwright orchestration, page capture
- Check Engine Service — runs the 200+ checks in parallel
- Report Service — generates HTML/PDF reports
- Notification Service — emails, webhooks, Slack alerts
- Dashboard Service — project CRUD, user preferences
- Background Worker — periodic rescans, cleanup tasks
Message Queue (The Glue)
I added RabbitMQ to decouple services. Instead of synchronous calls, services published events:
- `scan.started` → Scanner Service picks it up
- `scan.completed` → Check Engine subscribes
- `findings.ready` → Report Service generates the report
- `report.ready` → Notification Service sends emails
Benefits:
- Fault isolation — if the Report Service crashes, scans still complete
- Independent scaling — can run 20 Check Engine workers and 2 Scanner workers
- Retry logic — failed checks get retried automatically
- Observability — can see exactly where scans get stuck
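The topology can be illustrated with a tiny in-process bus; RabbitMQ adds durability, acks, and retries, but the routing idea is the same:

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process stand-in for the RabbitMQ topic routing above."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, payload: dict):
        # A real broker would deliver asynchronously and retry on failure.
        for handler in self.subscribers[topic]:
            handler(payload)
```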
Database Schema Changes
I split the monolithic reports table into proper entities:
This made it possible to query “all critical findings across all scans for this project” without scanning the entire reports table.
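A hypothetical reconstruction of the split schema, runnable against SQLite; the actual table and column names may differ:

```python
import sqlite3

# Illustrative DDL for the projects / scans / findings split.
DDL = """
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE scans    (id INTEGER PRIMARY KEY,
                       project_id INTEGER REFERENCES projects(id),
                       started_at TEXT);
CREATE TABLE findings (id INTEGER PRIMARY KEY,
                       scan_id INTEGER REFERENCES scans(id),
                       severity TEXT, title TEXT);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
conn.execute("INSERT INTO projects VALUES (1, 'demo')")
conn.execute("INSERT INTO scans VALUES (1, 1, '2025-01-01')")
conn.execute("INSERT INTO findings VALUES (1, 1, 'critical', 'auth bypass')")

# The query the old monolithic reports table made painful:
rows = conn.execute("""
    SELECT f.title FROM findings f
    JOIN scans s ON f.scan_id = s.id
    WHERE s.project_id = ? AND f.severity = 'critical'
""", (1,)).fetchall()
```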
The Numbers After Phase 2
| Metric | Phase 1 | Phase 2 | Improvement |
|---|---|---|---|
| Daily capacity | 10,000 | 100,000 | 10x scale |
| P95 latency | 12s | 3s | 4x faster |
| Scanner crashes affect entire system | Yes | No | Fault isolation |
| Can scale services independently | No | Yes | New capability |
Phase 3: Serverless Checks (Weeks 19-30)
The Check Engine was still the bottleneck. Even with 20 workers, each scan's 200 checks all ran on a single worker, so throughput only scaled linearly with worker count.
The insight: What if every check was an independent function?
I converted the 200 checks into standalone AWS Lambda functions. When a scan completes, the Check Orchestrator publishes 200 events in parallel — one per check. Each check function processes its event and publishes findings back.
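A sketch of one check as a Lambda-style handler, plus the orchestrator's fan-out — run in-process here for illustration (in production each handler is a separate Lambda invoked in parallel via events):

```python
def csp_check_handler(event, context=None):
    """One check as an independent function (hypothetical shape).
    `event` carries the captured page; findings are returned/published."""
    headers = {k.lower() for k in event["page"]["headers"]}
    findings = []
    if "content-security-policy" not in headers:
        findings.append({"check": "csp", "severity": "high",
                         "title": "Missing CSP header"})
    return {"scan_id": event["scan_id"], "findings": findings}

def fan_out(scan_id, page, handlers):
    """Stand-in for the orchestrator publishing one event per check.
    Real code publishes 200 events and lets Lambda run them concurrently."""
    event = {"scan_id": scan_id, "page": page}
    return [h(event) for h in handlers]
```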
Result: Checks that took 8 seconds now finish in 400ms. 20x improvement.
Cost Optimization
Lambda pricing is pay-per-invocation. At scale, this is way cheaper than running servers.
Per-scan cost breakdown:
- Browser Scanner: 500ms, 1GB RAM = $0.000008
- 200 Check Functions: 100ms each, 512MB RAM = $0.0001
- Result Aggregator: 200ms = $0.000002
- Report Generator: 300ms = $0.000003
- S3 storage: 100KB report = $0.0000023
Total: $0.00012 per scan (vs. $0.005 for the monolith)
At 100K scans/day, that’s $12/day vs. $500/day. 97% cost reduction.
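The arithmetic behind those numbers, reproduced from the breakdown above:

```python
# Per-invocation costs from the breakdown (USD).
costs = {
    "browser_scanner": 0.000008,
    "check_functions": 0.0001,    # 200 functions combined
    "aggregator":      0.000002,
    "report":          0.000003,
    "s3_storage":      0.0000023,
}
per_scan = sum(costs.values())        # ≈ $0.000115, rounds to ~$0.00012
daily = per_scan * 100_000            # ≈ $12/day serverless
monolith_daily = 0.005 * 100_000      # $500/day for the monolith
savings = 1 - per_scan / 0.005        # ≈ 97.7% cost reduction
```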
The Current Architecture (Today)
Here’s what Vibe Eval looks like now:
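Roughly, the flow — sketched from the services and data stores described in this post:

```
            ┌─────────────┐
 Client ───▶│ API Gateway │  auth, rate limiting
            └──────┬──────┘
                   ▼  scan.started
            ┌─────────────┐
            │   Scanner   │  async Playwright page capture
            └──────┬──────┘
                   ▼  scan.completed (fan-out: one event per check)
            ┌─────────────┐
            │ 200+ Lambda │  independent check functions
            │   checks    │
            └──────┬──────┘
                   ▼  findings.ready
            ┌─────────────┐     ┌──────────────┐
            │   Report    │────▶│ Notification │  email / Slack / webhooks
            └─────────────┘     └──────────────┘
```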
Data Layer:
- Event Store (DynamoDB) — immutable log of all events, enables replay
- Read Models (DynamoDB) — optimized views for dashboard queries
- Cache (Redis) — scan results, sessions, rate limits
- Object Storage (S3) — reports, screenshots, page snapshots
What I Learned
1. Start with the simplest thing that works
The weekend hack was a 300-line script. It worked. Don’t over-engineer early.
2. Measure before you optimize
I didn’t know parallel execution would give 5x gains until I profiled the monolith and saw sequential checks eating 90% of runtime.
3. Incremental migration beats big rewrites
The strangler fig pattern let me ship new features every two weeks while rebuilding the backend. Users never noticed.
4. Serverless is a cheat code for spiky workloads
Vibe Eval gets 100K scans/day but 80% happen between 9am-5pm PT. Lambda auto-scales to zero at night. Servers would cost 10x more.
5. Events > Synchronous calls
Message queues decouple services and make debugging way easier. I can replay failed scans from the event log without re-running Playwright.
The Roadmap Ahead
Next 3 months:
- Multi-region deployment — run scanners in 5 AWS regions, route scans to nearest region
- Check Marketplace — let third-party devs build and sell custom checks
- Real-time subscriptions — WebSocket updates so dashboards show scan progress live
- AI scenario generation — use LLMs to generate edge-case test flows
Next 12 months:
- Global event sourcing — DynamoDB Global Tables for cross-region replay
- Edge caching — CloudFront for report delivery, <100ms latency worldwide
- Enterprise SSO — SAML/OAuth for large teams
- Compliance packs — pre-built check bundles for SOC2, GDPR, HIPAA
Try It Yourself
If you’re shipping apps with Cursor, Lovable, Replit, or Bolt — connect your staging URL to Vibe Eval and run the “Quick Scan” preset. You’ll get a report in under 10 seconds showing auth bypasses, exposed secrets, and prompt injection risks.
Most founders fix the critical issues in under 5 minutes. The time you save not dealing with a production breach pays for a year of scans.
Start vibing, ship knowing agents already poked every sharp edge.