Testing AI Apps Is Different (And Most People Get It Wrong)
Here’s something nobody wants to admit: testing vibe-coded applications with traditional methods is like trying to debug quantum mechanics with a ruler.
I’ve watched teams spend weeks writing unit tests for AI-generated code that changes every sprint. The tests become technical debt faster than you can say “regression suite.”
The real problem? AI-generated apps don’t fail the way hand-coded apps fail. They fail in weird, emergent ways that unit tests never catch. A chatbot that suddenly starts giving financial advice. A form that validates everything except the one field users actually care about.
You need a different approach.
The Vibe Eval Framework: Recording Reality
Here’s what we built at Vibe Eval. It’s dead simple, which is why it works.
The core idea: use an AI agent to generate Playwright test scripts based on real user scenarios. Record the interaction. Save the file. Replay it as many times as you want.
No brittle selectors. No flaky waits. Just the actual user flow, captured and crystallized.
How It Actually Works
The framework uses the Agno agent library with GPT-4o-mini (though you can swap models). You describe what you want to test in plain English:
“Test the login flow with an invalid email, then a valid one. Make sure error messages show up.”
The agent generates a Playwright script. The script runs headless. You get a recording you can replay forever.
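In code, that step looks roughly like this. A minimal sketch assuming Agno's `Agent` and `OpenAIChat` interface; the instruction strings and variable names are illustrative, not the framework's actual source:

```python
from agno.agent import Agent
from agno.models.openai import OpenAIChat

# Assumed setup: an Agno agent wrapping GPT-4o-mini (swap the model id if you like).
test_agent = Agent(
    model=OpenAIChat(id="gpt-4o-mini"),
    instructions=[
        "You write browser tests using Playwright's sync Python API.",
        "Return only Python code, no explanations.",
    ],
)

scenario = (
    "Test the login flow with an invalid email, then a valid one. "
    "Make sure error messages show up."
)

response = test_agent.run(scenario)
print(response.content)  # the generated Playwright script, ready for validation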
Here’s the magic: the agent validates the Python code before executing it. If there’s a syntax error, it retries up to three times. This auto-correction loop means you’re not debugging AST parse errors at 2 AM.
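Sketched out, the loop is only a few lines. This assumes the agent object from the snippet above and a validate_python_code helper along the lines described in the next section:

```python
MAX_ATTEMPTS = 3

def generate_test_script(agent, scenario: str) -> str:
    """Ask the agent for a Playwright script, retrying with the parse
    error as extra context whenever the code fails validation."""
    prompt = scenario
    for attempt in range(1, MAX_ATTEMPTS + 1):
        response = agent.run(prompt)
        ok, result = validate_python_code(response.content)
        if ok:
            return result  # clean, parseable Playwright code
        # Feed the failure back so the next attempt can self-correct.
        prompt = (
            f"{scenario}\n\nYour previous script failed validation "
            f"(attempt {attempt}): {result}\nReturn corrected Python only."
        )
    raise RuntimeError(f"No valid script after {MAX_ATTEMPTS} attempts")
```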
The Code Is Embarrassingly Simple
The validation function uses Python’s ast module to parse code without executing it. It handles code blocks wrapped in triple backticks, which LLMs love to generate.
If the code is valid, you get back clean Playwright Python. If it’s not, the agent tries again with context about what failed.
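A validator along those lines fits in a dozen lines of standard library. This is a sketch, not the framework's exact code: strip the fences, parse with ast, and return either the clean script or the error text to feed back into the retry prompt:

```python
import ast
import re

def validate_python_code(raw: str) -> tuple[bool, str]:
    """Check LLM-generated Python for syntax errors without executing it."""
    # Pull the code out of Markdown code fences if the model added them.
    match = re.search(r"```(?:python)?\s*(.*?)```", raw, re.DOTALL)
    code = (match.group(1) if match else raw).strip()
    try:
        ast.parse(code)      # parses only; nothing runs
        return True, code    # valid: hand back the cleaned script
    except SyntaxError as exc:
        return False, f"SyntaxError: {exc}"  # invalid: error text for the next attempt
```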
The template structure is intentionally minimal. It uses sync_playwright() because async introduces complexity you don’t need for testing. The agent fills in the actual test actions based on your description.
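For reference, here's a skeleton in that spirit. The URL, selectors, and assertion are placeholders for whatever your app actually exposes; the agent replaces the marked block with actions derived from your description:

```python
from playwright.sync_api import sync_playwright

def run_test():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # headless, same as CI
        page = browser.new_page()

        # --- agent-generated actions go here (placeholder example) ---
        page.goto("https://example.com/login")
        page.fill("input[name='email']", "not-an-email")
        page.click("button[type='submit']")
        assert page.is_visible("text=Please enter a valid email")
        # -------------------------------------------------------------

        browser.close()

if __name__ == "__main__":
    run_test()
```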
Why This Beats Traditional Testing for Vibe-Coded Apps
Traditional testing assumes stable code. Vibe coding breaks that assumption completely.
When your AI generates a new component structure every time you tweak a prompt, your Jest tests become archaeology. “This test references a component that hasn’t existed for three sprints.”
But user flows? Those are stable. Login is still login. Checkout is still checkout. The implementation changes, but the user journey doesn’t.
Setting Up Vibe Eval Testing
Get AI-powered Playwright testing running in your vibe-coded project
1. Install Dependencies: run `pip install playwright agno-ai`, then `playwright install` to grab browser binaries.
2. Create the Validation Module: the ast-based syntax check described above.
3. Configure the Agent: point Agno at GPT-4o-mini, or whichever model you prefer.
4. Generate and Run Tests: describe the scenario in plain English, let the agent write the script, and replay it headless.
5. Integrate into CI/CD: replay the saved scripts on every build (a simple runner sketch follows below).
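For the CI step, one simple option is a runner that replays every saved script and fails the build if any of them fails. A hypothetical sketch, assuming the generated tests are saved under tests/generated/:

```python
# run_generated_tests.py -- replay every recorded Playwright script and
# return a nonzero exit code if any of them fails, so CI fails the build.
import subprocess
import sys
from pathlib import Path

def main() -> int:
    scripts = sorted(Path("tests/generated").glob("*.py"))
    if not scripts:
        print("No generated tests found.")
        return 1

    failures = 0
    for script in scripts:
        print(f"Replaying {script} ...")
        result = subprocess.run([sys.executable, str(script)])
        if result.returncode != 0:
            failures += 1
            print(f"FAILED: {script}")

    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```

Your pipeline just installs the dependencies from step 1 and then runs this script.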
The Production Reality Check
We’ve run this in production for months now. Testing vibe-coded SaaS apps that change weekly.
The framework catches real issues. Not theoretical edge cases, but actual “users can’t log in” problems. Because it tests what users do, not what your code structure looks like.
The retry mechanism is clutch. LLMs mess up syntax about 15% of the time on the first try. By attempt three, you’re at 95%+ success rate. That’s good enough for automated testing.
One unexpected benefit: the generated tests serve as documentation. When a new developer asks “how does the checkout flow work?”, you can just show them the Playwright script. It’s readable. It’s accurate. It’s always up to date.
The Limits (Because Nothing Is Perfect)
This isn’t a silver bullet. It won’t catch logic errors in your backend. It won’t find SQL injection vulnerabilities. It won’t test your WebSocket reconnection logic.
What it does test is the user experience. Can users accomplish their goals? Do the flows work? Are error messages showing up?
For vibe-coded apps where the implementation is fluid but user journeys are fixed, that’s exactly what you need.
The other limitation: it requires decent prompt engineering. Garbage in, garbage out. If you describe a test scenario poorly, the agent generates a poor test.
But that’s also an advantage. It forces you to think clearly about user journeys. What should happen? In what order? With what feedback?
FAQ
Why not just use Playwright's built-in codegen?
What happens if the agent generates invalid code three times?
Can I use this with non-AI-generated apps?
Does this replace all other testing?
Which AI model works best for generating tests?
Conclusion
Key Takeaways
- Traditional testing breaks down for vibe-coded apps because implementation changes constantly
- User flows are stable even when code structure isn’t—test those instead
- AI agents can generate Playwright scripts from natural language descriptions
- Automatic code validation with retry loops handles LLM syntax errors reliably
- Generated tests double as always-accurate documentation of user journeys
- This approach catches real user-facing issues in production environments
- It’s not a replacement for all testing, but it fills a critical gap for AI-generated codebases
The bottom line? If you’re building with AI code generators and still trying to maintain traditional test suites, you’re fighting the wrong battle. Test user behavior, not code structure. Let AI generate the tests. Replay them religiously.
It’s the only testing approach I’ve found that actually works when your codebase is in constant flux.