Intro
- Vibe-Eval spins up AI agents to run Playwright-style browser tests against your deployed vibe-coded app, exercising the real UI and API.
- It layers security probes for auth bypasses, exposed endpoints, insecure defaults, and secrets that often leak from AI-generated scaffolding.
- Functional validation catches regressions after every “vibed” regeneration, so solo builders can ship fast without breaking core flows.
- Pair it with periodic human-led “vibe coding audits” for architecture and debt fixes that automation will miss.
What Counts as a Vibe-Coded App?
Anything shipped rapidly via Lovable, Bolt.new, Cursor, Replit Agent, or similar prompt-first builders. They generate a lot of scaffolding and glue code that changes with each prompt. The upside is speed; the downside is unstable auth, fragile state, and surprise regressions when you re-prompt.
Why Vibe-Eval Exists
- Prompt drift breaks happy-path flows after “regenerate component” or “rewrite backend” requests.
- Auth and role logic are brittle: missing middleware, broad CORS, and default API keys linger.
- Manual smoke tests miss edge states (multiple tabs, expired tokens, slow API) that real users hit.
- Small teams need something lighter than a full QA department yet stronger than a single scripted test.
How Vibe-Eval Works
- Attach a target: Provide a staging URL, environment variables, and any seeded test accounts or fixtures.
- Spin up agents: Parallel AI agents drive real browsers (headful/headless) using Playwright-like controls.
- Scenario generation: Agents infer flows from your sitemap/schema or run from provided checklists (e.g., “signup → verify email → create record → export”).
- Stateful replay: Tokens, cookies, and local storage are shared so the agent can jump between users/roles to probe authorization.
- Signal capture: Console errors, network traces, API responses, screenshots, and DOM diffs are stored as evidence.
- Verdicts and prompts: Findings are summarized with reproduction steps and suggested prompt fixes to prevent repeat regressions.
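To make the checklist-driven path concrete, here is a minimal Python sketch of how a flow such as “signup → verify email → create record → export” could be split into steps and turned into a finding with reproduction steps. It is not Vibe-Eval's actual interface: `parse_flow`, `run_flow`, `StepResult`, and `Finding` are hypothetical names, and the `execute` callback is an assumed stand-in for whatever drives the browser.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StepResult:
    """Outcome of one step; evidence holds console errors, traces, etc."""
    ok: bool
    evidence: list[str] = field(default_factory=list)


@dataclass
class Finding:
    """A failed flow plus the steps needed to reproduce the failure."""
    flow: str
    failed_step: str
    reproduction: list[str]
    evidence: list[str]


def parse_flow(checklist: str) -> list[str]:
    """Split 'signup → verify email' (or 'a -> b') into ordered steps."""
    normalized = checklist.replace("->", "→")
    return [step.strip() for step in normalized.split("→") if step.strip()]


def run_flow(checklist: str,
             execute: Callable[[str], StepResult]) -> Optional[Finding]:
    """Run steps in order; stop at the first failure and report it."""
    steps = parse_flow(checklist)
    for index, step in enumerate(steps):
        result = execute(step)
        if not result.ok:
            return Finding(
                flow=checklist,
                failed_step=step,
                # Every step up to and including the failing one.
                reproduction=steps[: index + 1],
                evidence=result.evidence,
            )
    return None  # The whole flow passed.
```

In a real run, `execute` would wrap browser actions and collect the signals listed above; here it can be any function that maps a step name to a `StepResult`.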
Security Scanning Focus Areas
- Auth bypass: Tries unauthenticated access, downgraded roles, and tampered JWTs/session cookies.
- Exposed endpoints: Looks for unprotected admin/router paths, open CORS, default API keys, and debug flags.
- Input handling: Probes for missing validation, insecure redirects, SSRF-like fetches, and file upload pitfalls.
- Secret leaks: Checks rendered pages, source maps, and API responses for keys, tokens, or credentials.
- State abuse: Exercises multi-tab/session races (e.g., cancel vs. submit) that vibe-coded frontends often forget.
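As one illustration of the auth-bypass probes, the sketch below builds a tampered JWT: it rewrites the header to `alg: none`, overrides selected claims, and drops the signature. This is a generic, well-known test technique, not a description of how Vibe-Eval implements it; `tamper_jwt` and the `role` claim are assumptions chosen for the example. A correctly configured server should reject the resulting token.

```python
import base64
import json


def b64url_encode(data: bytes) -> str:
    """Base64url without padding, as used in JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring stripped padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def tamper_jwt(token: str, claim_overrides: dict) -> str:
    """Return an unsigned copy of `token` with selected claims replaced.

    The output has an empty signature segment. Only use it against
    systems you are authorized to test.
    """
    header_b64, payload_b64, _signature = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))

    header["alg"] = "none"           # claim that no signature is required
    payload.update(claim_overrides)  # e.g. escalate a hypothetical role claim

    new_header = b64url_encode(
        json.dumps(header, separators=(",", ":")).encode("utf-8"))
    new_payload = b64url_encode(
        json.dumps(payload, separators=(",", ":")).encode("utf-8"))
    return f"{new_header}.{new_payload}."
```

A probe would replay a request with `tamper_jwt(member_token, {"role": "admin"})` in place of the original token and record a finding if the response is anything other than a rejection such as 401 or 403.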
Functional & Regression Coverage
- Critical paths: Signup/login, role switching, CRUD, payments, and exports covered by generated scripts plus your “must not break” checklist.
- UI drift: Detects missing buttons/labels and 404/500s introduced by re-prompted components.
- Data contracts: Monitors shape changes in API responses that silently break client-side parsing.
- Performance basics: Flags slow first interactions and bundle sizes that spike after codegen.
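The data-contract check can be pictured as a structural diff between a baseline API response and the current one. The following is a minimal sketch under that assumption; `shape_of` and `diff_shapes` are hypothetical helpers, lists are summarized by their first element only, and the `$.path` notation is just a reporting convention chosen here.

```python
def shape_of(value):
    """Reduce a JSON value to its structure: keys and type names only."""
    if isinstance(value, dict):
        return {key: shape_of(item) for key, item in value.items()}
    if isinstance(value, list):
        # Assumes homogeneous lists; an empty list carries no shape info.
        return [shape_of(value[0])] if value else []
    return type(value).__name__


def diff_shapes(baseline, current, path="$"):
    """List human-readable differences between two shapes."""
    changes = []
    if isinstance(baseline, dict) and isinstance(current, dict):
        for key in sorted(baseline.keys() - current.keys()):
            changes.append(f"{path}.{key}: removed")
        for key in sorted(current.keys() - baseline.keys()):
            changes.append(f"{path}.{key}: added")
        for key in sorted(baseline.keys() & current.keys()):
            changes.extend(
                diff_shapes(baseline[key], current[key], f"{path}.{key}"))
    elif isinstance(baseline, list) and isinstance(current, list):
        if baseline and current:
            changes.extend(diff_shapes(baseline[0], current[0], f"{path}[]"))
    elif baseline != current:
        changes.append(f"{path}: {baseline} -> {current}")
    return changes


# Example: a regeneration renamed a field and turned a numeric id into a string.
before = {"id": 1, "total": 9.5, "items": [{"sku": "a", "qty": 2}]}
after = {"id": "1", "total": 9.5, "items": [{"sku": "a", "quantity": 2}]}

for change in diff_shapes(shape_of(before), shape_of(after)):
    print(change)
# $.id: int -> str
# $.items[].qty: removed
# $.items[].quantity: added
```

Comparing shapes rather than values keeps the check stable across runs where the data differs but the contract should not.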
When You Still Need a Vibe Coding Audit
A “vibe coding audit” is a broader, usually manual or hybrid service (e.g., VibeAudits.com, “vibe coding rescue” shops) that reviews architecture, secrets management, tenancy boundaries, observability, and long-term debt. Vibe-Eval automates dynamic testing; the audit fixes systemic issues and rewrites shaky foundations.
Key Alternatives (Late 2025 Snapshot)
- Vibe-Eval: Agent-driven browser tests + targeted security probes; best for rapid releases and regen-heavy codebases.
- QA Wolf / Reflect.run AI modes: Strong hosted Playwright with some AI-assisted recovery from flaky tests; less security depth than Vibe-Eval.
- Checkly + LLM co-pilot: Great synthetic monitoring; security coverage is minimal, and flows need manual hints.
- LangSmith/Traceloop + Playwright DIY: Maximum control and observability; more setup/maintenance, but flexible for bespoke stacks.
- StackHawk/OWASP ZAP/Detectify: Strong DAST for web apps and APIs; great companion for backend-heavy services, but no UX flow validation.
- Manual vibe coding audits (VibeAudits.com, boutique “vibe rescue” firms): Human-led reviews that fix architecture, tenancy, and prompt practices; pair with Vibe-Eval for continuous coverage.
Quick Start Recipe
- Point Vibe-Eval at staging with seeded test users (admin, member, no-scope) and toggle flags mirroring production.
- List your non-negotiable flows (“checkout”, “workspace invite”, “data export”) plus must-not-open endpoints (admin APIs, debug tools).
- Run nightly and on every regen; require green runs before prod deploys.
- Triage findings weekly: fix prompt patterns that reintroduce issues, and add guardrails (middleware, schema validation) where Vibe-Eval repeatedly flags risk.
- Schedule a quarterly vibe coding audit to pay down architecture and security debt the bots cannot rewrite.
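The recipe's seeded users and must-not-open endpoints can be written down as an access matrix, with the "green run" requirement reduced to "no violations". The sketch below assumes exactly that and nothing about Vibe-Eval's real configuration format: the endpoint paths, `EXPECTED_ACCESS`, `find_violations`, and `deploy_allowed` are all hypothetical, and `observed` stands in for status codes collected by whatever probe you run.

```python
ROLES = ("admin", "member", "no-scope")

# Hypothetical expectations: which seeded roles may reach each endpoint.
EXPECTED_ACCESS = {
    "/api/admin/users": {"admin"},
    "/api/workspace/invite": {"admin", "member"},
    "/debug/vars": set(),  # must-not-open for everyone
}


def find_violations(expected, observed):
    """Compare observed HTTP statuses against the expected access matrix.

    `observed` maps (role, path) to the status code returned when that
    role requested that path.
    """
    violations = []
    for path, allowed_roles in expected.items():
        for role in ROLES:
            status = observed.get((role, path))
            if status is None:
                violations.append(f"{role} {path}: not probed")
                continue
            granted = 200 <= status < 300
            if granted and role not in allowed_roles:
                violations.append(
                    f"{role} {path}: unexpected access ({status})")
            elif not granted and role in allowed_roles:
                violations.append(f"{role} {path}: blocked ({status})")
    return violations


def deploy_allowed(violations):
    """Gate: only a run with zero violations counts as green."""
    return not violations
```

Treating an unprobed role and path pair as a violation is a deliberate choice in this sketch: a gap in coverage should block the deploy rather than pass silently.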