Vibe-Eval: Automated Regression, Security, and Functional Checks for Vibe-Coded Apps

Intro

Vibe-Eval spins up AI agents to run Playwright-style browser tests against your deployed vibe-coded app, exercising the real UI and API.
It layers security probes for auth bypasses, exposed endpoints, insecure defaults, and secrets that often leak from AI-generated scaffolding.
Functional validation catches regressions after every “vibed” regeneration, so solo builders can ship fast without breaking core flows.
Pair it with periodic human-led “vibe coding audits” for architecture and debt fixes that automation will miss.

What Counts as a Vibe-Coded App?

Anything shipped rapidly via Lovable, Bolt.new, Cursor, Replit Agent, or similar prompt-first builders. They generate a lot of scaffolding and glue code that changes with each prompt. The upside is speed; the downside is unstable auth, fragile state, and surprise regressions when you re-prompt.

Why Vibe-Eval Exists

Prompt drift breaks happy-path flows after “regenerate component” or “rewrite backend” requests.
Auth and role logic are brittle: missing middleware, broad CORS, and default API keys linger.
Manual smoke tests miss edge states (multiple tabs, expired tokens, slow API) that real users hit.
Small teams need something lighter than a full QA department yet stronger than a single scripted test.

How Vibe-Eval Works

Attach a target: Provide staging URL, environment variables, and any seeded test accounts or fixtures.
Spin up agents: Parallel AI agents drive real browsers (headed or headless) using Playwright-like controls.
Scenario generation: Agents infer flows from your sitemap/schema or run from provided checklists (e.g., “signup → verify email → create record → export”).
Stateful replay: Tokens, cookies, and local storage are captured per account and replayed, so an agent can switch between users and roles to probe authorization.
Signal capture: Console errors, network traces, API responses, screenshots, and DOM diffs are stored as evidence.
Verdicts and prompts: Findings are summarized with reproduction steps and suggested prompt fixes to prevent repeat regressions.
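
To make the checklist-driven path concrete, here is a minimal sketch of how a flow string such as "signup → verify email → create record → export" could be split into steps, run in order, and turned into a verdict with reproduction steps. Everything in it is hypothetical and for illustration only: the names StepResult, parse_checklist, run_flow, and summarize are not part of Vibe-Eval, and the lambda actions stand in for real browser steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepResult:
    name: str
    ok: bool
    evidence: List[str] = field(default_factory=list)


def parse_checklist(flow: str) -> List[str]:
    """Split 'signup → verify email' (or 'signup -> verify email') into step names."""
    normalized = flow.replace("->", "→")
    return [part.strip() for part in normalized.split("→") if part.strip()]


def run_flow(flow: str, actions: Dict[str, Callable[[], List[str]]]) -> List[StepResult]:
    """Run steps in order and stop at the first failure.

    Each action returns a list of error strings; an empty list means the step passed.
    """
    results: List[StepResult] = []
    for name in parse_checklist(flow):
        action = actions.get(name)
        if action is None:
            results.append(StepResult(name, False, [f"no action registered for step '{name}'"]))
            break
        errors = action()
        results.append(StepResult(name, not errors, errors))
        if errors:
            break
    return results


def summarize(results: List[StepResult]) -> str:
    """Produce a one-line verdict; on failure, list the steps needed to reproduce it."""
    path = " → ".join(r.name for r in results)
    failed = [r for r in results if not r.ok]
    if not failed:
        return f"PASS: {path}"
    first = failed[0]
    return f"FAIL at '{first.name}' (repro: {path}); evidence: {'; '.join(first.evidence)}"


if __name__ == "__main__":
    # Stand-in actions; a real run would drive a browser and collect console/network errors.
    demo_actions = {
        "signup": lambda: [],
        "verify email": lambda: [],
        "create record": lambda: ["POST /api/records returned 500"],
        "export": lambda: [],
    }
    print(summarize(run_flow("signup → verify email → create record → export", demo_actions)))
```

Stopping at the first failed step keeps the reproduction path short, which is the part a builder would paste back into a prompt.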

Security Scanning Focus Areas

Auth bypass: Tries unauthenticated access, downgraded roles, and tampered JWTs/session cookies.
Exposed endpoints: Looks for unprotected admin/router paths, open CORS, default API keys, and debug flags.
Input handling: Probes for missing validation, insecure redirects, SSRF-like fetches, and file upload pitfalls.
Secret leaks: Checks rendered pages, source maps, and API responses for keys, tokens, or credentials.
State abuse: Exercises multi-tab/session races (e.g., cancel vs. submit) that vibe-coded frontends often forget.
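
As an illustration of the secret-leak check, the sketch below scans captured text (a rendered page, a source map, or an API response body) against a few regular expressions. The pattern set is an assumption and deliberately small; the names SECRET_PATTERNS and scan_text are hypothetical, and a real scanner would need a broader, tuned rule set to keep false positives down.

```python
import re
from typing import List, Tuple

# Illustrative patterns only (assumed, not Vibe-Eval's actual rules).
SECRET_PATTERNS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # JWT-like strings: three base64url segments, the first starting with "eyJ".
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    # PEM private key headers.
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    # Assignments such as apiKey: "..." or "secret" = '...' with a long quoted value.
    "generic_api_key": re.compile(
        r"(?i)\b(?:api[_-]?key|secret)\b[\"']?\s*[:=]\s*[\"'][A-Za-z0-9_\-]{16,}[\"']"
    ),
}


def scan_text(source: str, text: str) -> List[Tuple[str, str, str]]:
    """Return (source, pattern label, redacted match) for each suspected secret."""
    findings = []
    for label, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            # Keep only a short prefix so the report itself does not leak the secret.
            redacted = match.group(0)[:8] + "..."
            findings.append((source, label, redacted))
    return findings


if __name__ == "__main__":
    sample = 'const cfg = {apiKey: "sk_live_0123456789abcdef"}; // AKIAIOSFODNN7EXAMPLE'
    for finding in scan_text("bundle.js.map", sample):
        print(finding)
```

Redacting the match before storing it matters here, because the findings end up in reports and screenshots that may be shared more widely than the app itself.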

Functional & Regression Coverage

Critical paths: Signup/login, role switching, CRUD, payments, and exports are covered by generated scripts plus your “must not break” checklist.
UI drift: Detects missing buttons/labels and 404/500s introduced by re-prompted components.
Data contracts: Monitors shape changes in API responses that silently break client-side parsing.
Performance basics: Flags slow first meaningful interaction and oversized bundles that spike after codegen.
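
The data-contract check can be pictured as a structural diff between a baseline API response and the one seen after a regeneration. The following is a minimal sketch under stated assumptions: shape_of and diff_shapes are hypothetical helper names, lists are assumed to be homogeneous (only the first element is inspected), and the sample payloads are made up.

```python
from typing import Any, List


def shape_of(value: Any) -> Any:
    """Reduce a decoded JSON value to its shape: keys plus type names, no data."""
    if isinstance(value, dict):
        return {key: shape_of(child) for key, child in value.items()}
    if isinstance(value, list):
        # Assumption: lists are homogeneous, so the first element represents them.
        return [shape_of(value[0])] if value else []
    return type(value).__name__


def diff_shapes(baseline: Any, current: Any, path: str = "$") -> List[str]:
    """List added, removed, and retyped fields between two shapes."""
    changes: List[str] = []
    if isinstance(baseline, dict) and isinstance(current, dict):
        for key in sorted(set(baseline) | set(current)):
            child = f"{path}.{key}"
            if key not in current:
                changes.append(f"{child}: removed")
            elif key not in baseline:
                changes.append(f"{child}: added")
            else:
                changes.extend(diff_shapes(baseline[key], current[key], child))
    elif isinstance(baseline, list) and isinstance(current, list):
        # Only comparable when both sides have an element to describe.
        if baseline and current:
            changes.extend(diff_shapes(baseline[0], current[0], f"{path}[]"))
    elif baseline != current:
        changes.append(f"{path}: {baseline} -> {current}")
    return changes


if __name__ == "__main__":
    before = {"id": 1, "owner": {"name": "a"}, "tags": ["x"], "total": 9.5}
    after = {"id": "1", "owner": {"full_name": "a"}, "tags": ["x"], "total": 9.5}
    for change in diff_shapes(shape_of(before), shape_of(after)):
        print(change)
```

Comparing shapes rather than values keeps the check stable across runs with different seed data, while still catching a renamed field or a number that became a string.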

When You Still Need a Vibe Coding Audit

A “vibe coding audit” is a broader, usually manual or hybrid service (e.g., VibeAudits.com, “vibe coding rescue” shops) that reviews architecture, secrets management, tenancy boundaries, observability, and long-term debt. Vibe-Eval automates dynamic testing; the audit fixes systemic issues and rewrites shaky foundations.

Key Alternatives (Late 2025 Snapshot)

Vibe-Eval: Agent-driven browser tests + targeted security probes; best for rapid releases and regen-heavy codebases.
QA Wolf / Reflect.run AI modes: Strong hosted Playwright with some AI-assisted recovery from flaky tests; less security depth than Vibe-Eval.
Checkly + LLM co-pilot: Great synthetic monitoring; security coverage is minimal, and flows need manual hints.
LangSmith/Traceloop + Playwright DIY: Maximum control and observability; more setup/maintenance, but flexible for bespoke stacks.
StackHawk/OWASP ZAP/Detectify: Strong DAST for web apps and APIs; a good companion for backend-heavy services, but no UX flow validation.
Manual vibe coding audits (VibeAudits.com, boutique “vibe rescue” firms): Human-led reviews that fix architecture, tenancy, and prompt practices; pair with Vibe-Eval for continuous coverage.

Quick Start Recipe

Point Vibe-Eval at staging with seeded test users (admin, member, no-scope) and feature flags set to mirror production.
List your non-negotiable flows (“checkout”, “workspace invite”, “data export”) plus must-not-open endpoints (admin APIs, debug tools).
Run nightly and on every regen; require green runs before prod deploys.
Triage findings weekly: fix prompt patterns that reintroduce issues, and add guardrails (middleware, schema validation) where Vibe-Eval repeatedly flags risk.
Schedule a quarterly vibe coding audit to pay down architecture and security debt the bots cannot rewrite.
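
The "require green runs before prod deploys" step could be wired up as a small gate like the one sketched here. This is an assumption about how one might consume run results, not Vibe-Eval's actual output format: RunReport, gate, and the endpoint paths are hypothetical, and treating 401, 403, and 404 as "closed" is a judgment call you may want to tighten.

```python
from dataclasses import dataclass
from typing import Dict, List

# Non-negotiable flows from the recipe above; endpoint paths are hypothetical examples.
REQUIRED_FLOWS = ["checkout", "workspace invite", "data export"]
MUST_NOT_OPEN = ["/api/admin/users", "/debug"]

# Statuses treated as "closed" for an unauthenticated request (assumed policy).
DENIED_STATUSES = {401, 403, 404}


@dataclass
class RunReport:
    flow_results: Dict[str, bool]    # flow name -> did it pass?
    endpoint_status: Dict[str, int]  # path -> HTTP status seen without credentials


def gate(report: RunReport) -> List[str]:
    """Return the list of blockers; an empty list means the deploy may proceed."""
    blockers: List[str] = []
    for flow in REQUIRED_FLOWS:
        # A flow that was never run counts as a failure, not a pass.
        if not report.flow_results.get(flow, False):
            blockers.append(f"flow failed or missing: {flow}")
    for path in MUST_NOT_OPEN:
        status = report.endpoint_status.get(path)
        if status is None:
            blockers.append(f"endpoint not probed: {path}")
        elif status not in DENIED_STATUSES:
            blockers.append(f"endpoint reachable without auth ({status}): {path}")
    return blockers


if __name__ == "__main__":
    nightly = RunReport(
        flow_results={"checkout": True, "workspace invite": True, "data export": False},
        endpoint_status={"/api/admin/users": 200, "/debug": 404},
    )
    for blocker in gate(nightly):
        print(blocker)
```

Failing closed on missing data (an unrun flow or an unprobed endpoint) is the design choice that makes the gate trustworthy after a regen that silently drops a test.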