Testing AI Apps Is Different (And Most People Get It Wrong)
Here’s something nobody wants to admit: testing vibe-coded applications with traditional methods is like trying to debug quantum mechanics with a ruler.
I’ve watched teams spend weeks writing unit tests for AI-generated code that changes every sprint. The tests become technical debt faster than you can say “regression suite.”
The real problem? AI-generated apps don’t fail the way hand-coded apps fail. They fail in weird, emergent ways that unit tests never catch. A chatbot that suddenly starts giving financial advice. A form that validates everything except the one field users actually care about.
You need a different approach.
The Vibe Eval Framework: Recording Reality
Here’s what we built at Vibe Eval. It’s dead simple, which is why it works.
The core idea: use an AI agent to generate Playwright test scripts based on real user scenarios. Record the interaction. Save the file. Replay it as many times as you want.
No brittle selectors. No flaky waits. Just the actual user flow, captured and crystallized.
How It Actually Works
The framework uses the Agno agent library with GPT-4o-mini (though you can swap models). You describe what you want to test in plain English:
“Test the login flow with an invalid email, then a valid one. Make sure error messages show up.”
The agent generates a Playwright script. The script runs headless. You get a recording you can replay forever.
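In code, that step looks roughly like this. A minimal sketch assuming Agno's `Agent` and `OpenAIChat` interface; the instruction strings and variable names are illustrative, not the framework's actual source:

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

# Assumed setup: an Agno agent wrapping GPT-4o-mini (swap the model id if you like).
test_agent = Agent(
    model=OpenAIChat(id="gpt-4o-mini"),
    instructions=[
        "You write browser tests using Playwright's sync Python API.",
        "Return only Python code, no explanations.",
    ],
)

scenario = (
    "Test the login flow with an invalid email, then a valid one. "
    "Make sure error messages show up."
)

response = test_agent.run(scenario)
print(response.content)  # the generated Playwright script, ready for validation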
Here’s the magic: the agent validates the Python code before executing it. If there’s a syntax error, it retries up to three times. This auto-correction loop means you’re not debugging AST parse errors at 2 AM.
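Sketched out, the loop is only a few lines. This assumes the agent object from the snippet above and a validate_python_code helper along the lines described in the next section:

```python
MAX_ATTEMPTS = 3

def generate_test_script(agent, scenario: str) -> str:
    """Ask the agent for a Playwright script, retrying with the parse
    error as extra context whenever the code fails validation."""
    prompt = scenario
    for attempt in range(1, MAX_ATTEMPTS + 1):
        response = agent.run(prompt)
        ok, result = validate_python_code(response.content)
        if ok:
            return result  # clean, parseable Playwright code
        # Feed the failure back so the next attempt can self-correct.
        prompt = (
            f"{scenario}\n\nYour previous script failed validation "
            f"(attempt {attempt}): {result}\nReturn corrected Python only."
        )
    raise RuntimeError(f"No valid script after {MAX_ATTEMPTS} attempts")
```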
The Code Is Embarrassingly Simple
The validation function uses Python’s ast module to parse code without executing it. It handles code blocks wrapped in triple backticks, which LLMs love to generate.
If the code is valid, you get back clean Playwright Python. If it’s not, the agent tries again with context about what failed.
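A validator along those lines fits in a dozen lines of standard library. This is a sketch, not the framework's exact code: strip the fences, parse with ast, and return either the clean script or the error text to feed back into the retry prompt:

```python
import ast
import re

def validate_python_code(raw: str) -> tuple[bool, str]:
    """Check LLM-generated Python for syntax errors without executing it."""
    # Pull the code out of Markdown code fences if the model added them.
    match = re.search(r"```(?:python)?\s*(.*?)```", raw, re.DOTALL)
    code = (match.group(1) if match else raw).strip()
    try:
        ast.parse(code)      # parses only; nothing runs
        return True, code    # valid: hand back the cleaned script
    except SyntaxError as exc:
        return False, f"SyntaxError: {exc}"  # invalid: error text for the next attempt
```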
The template structure is intentionally minimal. It uses sync_playwright() because async introduces complexity you don’t need for testing. The agent fills in the actual test actions based on your description.
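For reference, here's a skeleton in that spirit. The URL, selectors, and assertion are placeholders for whatever your app actually exposes; the agent replaces the marked block with actions derived from your description:

```python
from playwright.sync_api import sync_playwright

def run_test():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # headless, same as CI
        page = browser.new_page()

        # --- agent-generated actions go here (placeholder example) ---
        page.goto("https://example.com/login")
        page.fill("input[name='email']", "not-an-email")
        page.click("button[type='submit']")
        assert page.is_visible("text=Please enter a valid email")
        # -------------------------------------------------------------

        browser.close()

if __name__ == "__main__":
    run_test()
```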
Why This Beats Traditional Testing for Vibe-Coded Apps
Traditional testing assumes stable code. Vibe coding breaks that assumption completely.
When your AI generates a new component structure every time you tweak a prompt, your Jest tests become archaeology. “This test references a component that hasn’t existed for three sprints.”
But user flows? Those are stable. Login is still login. Checkout is still checkout. The implementation changes, but the user journey doesn’t.
Setting Up Vibe Eval Testing
Get AI-powered Playwright testing running in your vibe-coded project
1. Install Dependencies: run `pip install playwright agno-ai`, then `playwright install` to grab browser binaries.
2. Create the Validation Module: the ast-based syntax check described above.
3. Configure the Agent: point Agno at GPT-4o-mini, or whichever model you prefer.
4. Generate and Run Tests: describe the scenario in plain English, let the agent write the script, and replay it headless.
5. Integrate into CI/CD: replay the saved scripts on every build (a simple runner sketch follows below).
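For the CI step, one simple option is a runner that replays every saved script and fails the build if any of them fails. A hypothetical sketch, assuming the generated tests are saved under tests/generated/:

```python
# run_generated_tests.py -- replay every recorded Playwright script and
# return a nonzero exit code if any of them fails, so CI fails the build.
import subprocess
import sys
from pathlib import Path

def main() -> int:
    scripts = sorted(Path("tests/generated").glob("*.py"))
    if not scripts:
        print("No generated tests found.")
        return 1

    failures = 0
    for script in scripts:
        print(f"Replaying {script} ...")
        result = subprocess.run([sys.executable, str(script)])
        if result.returncode != 0:
            failures += 1
            print(f"FAILED: {script}")

    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```

Your pipeline just installs the dependencies from step 1 and then runs this script.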
The Production Reality Check
We’ve run this in production for months now. Testing vibe-coded SaaS apps that change weekly.
The framework catches real issues. Not theoretical edge cases, but actual “users can’t log in” problems. Because it tests what users do, not what your code structure looks like.
The retry mechanism is clutch. LLMs mess up syntax about 15% of the time on the first try. By attempt three, you’re at 95%+ success rate. That’s good enough for automated testing.
One unexpected benefit: the generated tests serve as documentation. When a new developer asks “how does the checkout flow work?”, you can just show them the Playwright script. It’s readable. It’s accurate. It’s always up to date.
The Limits (Because Nothing Is Perfect)
This isn’t a silver bullet. It won’t catch logic errors in your backend. It won’t find SQL injection vulnerabilities. It won’t test your WebSocket reconnection logic.
What it does test is the user experience. Can users accomplish their goals? Do the flows work? Are error messages showing up?
For vibe-coded apps where the implementation is fluid but user journeys are fixed, that’s exactly what you need.
The other limitation: it requires decent prompt engineering. Garbage in, garbage out. If you describe a test scenario poorly, the agent generates a poor test.
But that’s also an advantage. It forces you to think clearly about user journeys. What should happen? In what order? With what feedback?
FAQ
Why not just use Playwright's built-in codegen?
What happens if the agent generates invalid code three times?
Can I use this with non-AI-generated apps?
Does this replace all other testing?
Which AI model works best for generating tests?
Conclusion
Key Takeaways
- Traditional testing breaks down for vibe-coded apps because implementation changes constantly
- User flows are stable even when code structure isn’t—test those instead
- AI agents can generate Playwright scripts from natural language descriptions
- Automatic code validation with retry loops handles LLM syntax errors reliably
- Generated tests double as always-accurate documentation of user journeys
- This approach catches real user-facing issues in production environments
- It’s not a replacement for all testing, but it fills a critical gap for AI-generated codebases
The bottom line? If you’re building with AI code generators and still trying to maintain traditional test suites, you’re fighting the wrong battle. Test user behavior, not code structure. Let AI generate the tests. Replay them religiously.
It’s the only testing approach I’ve found that actually works when your codebase is in constant flux.