We recently teamed up with one of our clients, Yaser Mahmoud, Chief Product Officer at FrontrowMD, to experiment with ChatGPT agent for QA automation. Our livestream effort proved the overall concept works: AI can identify QA issues without explicit test cases. But for production use, we wanted something programmable to check dozens of client websites.
To continue our QA-with-AI exploration, Chad and Development Team Lead Clarissa Borges moved beyond ChatGPT agent, which can only be driven interactively through its chat interface, to find a scriptable solution. Watch the full replay on YouTube or read on for the highlights of what worked and what didn’t.
Where we started
In case you missed our previous post on QA using AI, let’s start with a little context. FrontrowMD enables health brands to build trust with shoppers through doctor ratings and reviews for health products. The company embeds widgets on ecommerce sites to bring these insights to consumers. Unfortunately, things sometimes break: Badges get squished, reviews disappear, and CTAs stop working.
The ChatGPT agent successfully identified these problems in test cases, but it doesn’t expose an API we could script against. To work around this limitation, we explored an open source library called Browser-Use. It uses ChatGPT to control Chrome on your computer, take screenshots, and identify elements on the page. That gave us a path for programming in Python, but unfortunately, it was slower and less accurate than using ChatGPT agent alone.
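For context, driving Browser-Use from Python looks roughly like this. It’s a minimal sketch based on the library’s Agent interface; the task wording, URL, and model choice are illustrative, and the exact imports can vary between Browser-Use versions.

```python
# Minimal Browser-Use sketch: an LLM-driven agent checks a product page for a review widget.
# Assumptions: the task wording, URL, and model choice are illustrative; API details vary by version.
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        task=(
            "Open https://example.com/product and verify the doctor-review widget "
            "renders correctly: the badge is visible, reviews are present, and the "
            "CTA button is clickable. Report any issues you find."
        ),
        llm=ChatOpenAI(model="gpt-4o"),  # the LLM that drives the browser
    )
    result = await agent.run()
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```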
Programmability with Playwright MCP
We needed to find something faster and more accurate, and after some research, we found Playwright MCP from Microsoft. This discovery changed our whole approach to AI QA testing. Model Context Protocol (MCP) servers expose external tools and data sources for LLMs to use directly. Microsoft built an MCP server for Playwright, its open source web app testing tool, and it works from the page’s accessibility tree instead of screenshots.
The difference versus Browser-Use was dramatic. QA checks completed faster, and because this approach relies on accessibility APIs, if something on the page isn’t accessible, the automation won’t work either. You can run Playwright MCP in a standalone mode and write scripts against it, or connect it to your LLM. In addition, Microsoft provides ready-made integrations for clients like Visual Studio Code, Gemini CLI, and others.
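To make the “write scripts against it” path concrete, here’s a hedged sketch that launches Playwright MCP over stdio from the MCP Python SDK and calls a couple of its tools. The tool names (browser_navigate, browser_snapshot) come from Playwright MCP’s tool list, but treat the details as assumptions and confirm them against your installed version with list_tools().

```python
# Sketch: drive Playwright MCP from a Python script via the MCP Python SDK (stdio transport).
# Assumptions: tool names and argument shapes are illustrative; verify them against your
# installed @playwright/mcp version by inspecting list_tools().
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

server = StdioServerParameters(command="npx", args=["@playwright/mcp@latest"])

async def main():
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # See what the server exposes (navigation, clicks, snapshots, etc.).
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

            # Navigate to a page and grab its accessibility snapshot for the LLM to review.
            await session.call_tool("browser_navigate", {"url": "https://example.com/product"})
            snapshot = await session.call_tool("browser_snapshot", {})
            print(snapshot.content[0].text[:500])  # first chunk of the accessibility tree

asyncio.run(main())
```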
Choosing the right LLM
As we started testing this approach, we found different models produced very different results. GPT-4o failed to identify some missing reviews on a webpage. When we moved to GPT-5 with the same prompt, it took a completely different approach: it opened both pages in tabs, captured full screenshots, and checked console logs. This slowed things down but produced results as accurate as when we originally used ChatGPT agent. We ran these comparisons inside Visual Studio Code for ease of setup, but we could just as easily have been driving the model directly from a Ruby or Python script.
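If we did move this into a script, the model comparison itself is just a loop over model names with the same prompt. Here’s a hedged sketch using the OpenAI Python SDK; the model identifiers and prompt text are placeholders, and a real run would also need the Playwright MCP tools wired in (as in the earlier sketch) so the model can actually drive the browser.

```python
# Sketch: send the same QA prompt to different models and compare their answers.
# Assumptions: model names and the prompt are placeholders; on its own this only returns
# text, since actual browser control requires the Playwright MCP tools to be connected.
from openai import OpenAI

client = OpenAI()

QA_PROMPT = (
    "Compare https://example.com/product against its staging counterpart and report "
    "any missing reviews, squished badges, or broken CTAs."
)

for model in ["gpt-4o", "gpt-5"]:  # the two models we compared
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QA_PROMPT}],
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content)
```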
Next, we tried Claude. It was chattier, with a completely different tone than the other models, but it wasn’t consistent about checking pages, and we hit context window limits multiple times. The breakthrough came when we asked Claude to generate its own QA prompt. It produced a detailed prompt with very specific checks that we weren’t sure were correct (e.g., exact pixel sizes). When we asked it to make the instructions more generic, assuming it wouldn’t know in advance what the issues would be, it came back with a prompt covering these broader areas, each with several bullets underneath (a sketch of this meta-prompting step follows the list):
- Discovery and inventory phase
- Comprehensive dimensional analysis
- Constraint detection matrix
- Content accessibility evaluation
- Technical performance inspection
- Cross-component consistency check
- User experience impact assessment
- Issue classification framework
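Here’s roughly what that meta-prompting step could look like from a script, sketched with the Anthropic Python SDK. The instruction wording and model name are our assumptions, and the prompt Claude actually generated was far more detailed than anything shown here.

```python
# Sketch: ask Claude to write its own QA prompt, then reuse that prompt for page checks.
# Assumptions: the model name and instruction wording are illustrative, not what we ran verbatim.
import anthropic

client = anthropic.Anthropic()

meta_request = (
    "Write a QA prompt for checking embedded review widgets on ecommerce pages. "
    "Assume you don't know in advance what the issues will be, so keep the checks "
    "generic: layout, missing content, broken interactions, console errors."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model name
    max_tokens=2000,
    messages=[{"role": "user", "content": meta_request}],
)

generated_qa_prompt = response.content[0].text
print(generated_qa_prompt)  # review it, then feed it into the QA run itself
```

The generated prompt then becomes the input to the actual QA run, whether that happens in an editor or in a script.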
This meta-approach of having the AI design its own detection strategy produced better results than our hand-crafted instructions: the Claude-generated prompt found all the issues across our test pages. However, to turn this into a script, we’d have to plan for the context growing too large and figure out how to compress it. Overall, this exercise proved the concept in our minds, and we see real potential with further exploration.
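On the context problem, one direction we’d likely try first is to compress as we go: run each page check in isolation, summarize its findings, and carry only the short summaries into any cross-page analysis. A rough sketch of that idea, with entirely hypothetical helper functions:

```python
# Sketch: keep the context small by carrying forward only short per-page summaries.
# run_qa_check() and summarize() are hypothetical stand-ins for the MCP-driven page
# check and an LLM summarization call; nothing here is FrontrowMD's actual pipeline.
def qa_many_pages(urls, run_qa_check, summarize):
    summaries = []
    for url in urls:
        full_report = run_qa_check(url)                       # verbose: snapshots, console logs, findings
        summaries.append(f"{url}: {summarize(full_report)}")  # compress to a few lines
    # Only the compressed summaries feed the final cross-page review prompt.
    return "\n".join(summaries)
```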
Ready to leverage AI to improve your internal processes like QA? Let’s talk about what’s possible.