ChatGPT agent: Experimenting with QA automation

Manual QA is tedious, especially when you’re checking the same widgets across dozens of client websites. As part of our AI in Focus series, Chad sat down with one of our clients, Yaser Mahmoud, Chief Product Officer at FrontrowMD, to experiment with ChatGPT agent for QA automation.

FrontrowMD enables health brands to build trust with shoppers through doctor recommendations: think customer ratings and reviews for health products, but from medical experts. The company embeds widgets on a wide variety of ecommerce sites to bring these insights to the point of purchase. Unfortunately, these widgets can break in unpredictable ways: badges get squished, reviews disappear, and CTAs stop working.

Manual checking doesn’t scale, but can ChatGPT agent handle real product QA? Watch the full video or read on to find out when ChatGPT agent succeeds and where it falls short. You may also want to check out our follow-up video where we take this process even further with Playwright MCP.

Why ChatGPT Agent?

ChatGPT’s agent mode runs a virtual computer on OpenAI’s servers. Unlike traditional automation tools, it can see and interact with web pages like a human QA tester would, but it’s worth noting that it only supports Chrome. We wanted to find out whether it could identify visual and functional issues without writing explicit test cases for every scenario.

The experiment

We began our automated QA experiment with a straightforward prompt: “You are a quality assurance agent for FrontrowMD.” For baseline context, we directed it to a webpage with a working badge, banner, and review in place, along with some basic direction about where each element appears (e.g., upper left of the product image). We also asked the agent to remember this example as a working implementation.
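To make that setup concrete, here’s a rough sketch of how such a baseline prompt could be assembled before pasting it into agent mode. The URL and exact wording are placeholders, not the prompt we actually used:

```python
# Hypothetical sketch: assemble a baseline "working reference" prompt
# to paste into ChatGPT agent mode. The URL and wording are placeholders.
REFERENCE_URL = "https://example-client.com/products/working-example"

baseline_prompt = f"""
You are a quality assurance agent for FrontrowMD.
Visit {REFERENCE_URL}. This page is a correct, working implementation:
- A FrontrowMD badge appears in the upper left of the product image.
- A banner appears near the top of the page.
- A doctor review section appears below the product description.
Remember this page as the working reference for later comparisons.
""".strip()

print(baseline_prompt)
```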

Next, we provided a URL to a page we’d manually broken and asked it to identify and report on any differences. The results were surprisingly good. ChatGPT agent correctly identified:

  • Incorrect badge sizing and padding issues
  • Missing review section
  • Non-functional CTAs
  • Layout problems with scrollable containers

What impressed us most? It assessed issues holistically rather than pixel-peeping. Instead of flagging a “2-pixel padding difference,” it identified that the badge was “confined to a fixed height box with its own vertical scroll bar, making text hard to read.”

To keep experimenting, we fed it another broken page and asked for “issues” instead of “differences.” It still compared against the page we’d identified as the working implementation and spotted the main issues, even with less precise language. To scale this method, we gave it an entire list of URLs to check and asked it to report back on problem pages only. This also worked well overall: ChatGPT agent started to go off course but self-corrected.
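For a sense of how that batch check could be set up, here’s a hedged sketch that builds the follow-up prompt from a list of URLs. The URLs are hypothetical and the wording is a paraphrase of our ask, not the exact prompt from the experiment:

```python
# Hypothetical sketch: build a batch-check prompt from a list of candidate
# pages. URLs are placeholders; the wording paraphrases our instructions.
urls_to_check = [
    "https://example-client.com/products/item-1",
    "https://example-client.com/products/item-2",
    "https://example-client.com/products/item-3",
]

url_list = "\n".join(f"- {url}" for url in urls_to_check)

batch_prompt = f"""
Check each of the following pages against the working reference you saved earlier:
{url_list}
Report back only on pages with problems, and describe each issue you find.
""".strip()

print(batch_prompt)
```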

The trade-offs

Agent mode worked well for exploration, but there are limitations:

Programmability: There’s no API. You can only use agent mode by typing into the ChatGPT interface, which doesn’t scale like something you could program against.

Cost: Each command consumed agent credits, so testing hundreds of client sites daily could add up quickly.

Privacy: Unless you’re self-hosting an open source model, there can be data and privacy concerns, especially in highly regulated industries like healthcare and finance.

Black box: You’re not explicitly seeing all the steps the agent is taking, so it’s much harder to make small tweaks to that process vs. other testing solutions.

What we learned

ChatGPT agent proved the concept works: AI can identify QA issues across diverse website implementations without explicit test cases. The challenge isn’t high-level capability; it’s making this approach programmable and cost-effective.

For teams with similar needs, agent mode works well for:

  • One-off QA audits
  • Exploring what’s possible with AI-powered testing
  • Validating whether AI can catch the types of issues you care about

But for production use? We needed something more scriptable. That’s where our exploration took an interesting turn, which we covered in a second livestream.
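To make the contrast concrete, here’s a minimal sketch of the kind of scriptable check we mean, using Playwright’s Python API. The selectors and URLs are hypothetical and stand in for whatever markup a real widget uses:

```python
# Hypothetical sketch of a scriptable widget check with Playwright.
# Selectors and URLs are placeholders, not FrontrowMD's real markup.
from playwright.sync_api import sync_playwright

PAGES = ["https://example-client.com/products/item-1"]
WIDGET_SELECTORS = {
    "badge": ".frontrowmd-badge",
    "review_section": ".frontrowmd-reviews",
    "cta": ".frontrowmd-cta",
}

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    for url in PAGES:
        page.goto(url)
        for name, selector in WIDGET_SELECTORS.items():
            locator = page.locator(selector)
            # Flag widgets that are missing or hidden on the page.
            if locator.count() == 0 or not locator.first.is_visible():
                print(f"{url}: {name} missing or not visible")
    browser.close()
```

A script like this only catches what you explicitly assert, which is exactly the trade-off against the agent’s more holistic judgment.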

Try it yourself

If you want to experiment with agent mode for QA, here are a few starting points:

  • Identify a working reference implementation
  • Describe what should be present (badges, reviews, CTAs, etc.)
  • Give it pages to check with an instruction like: “Compare with the working example and identify issues or differences”
  • Provide guidance about what matters (functional vs. cosmetic issues)
  • Treat AI like a junior QA engineer: give clear examples, explain what to look for, and refine based on what it misses.

If you want support exploring AI automation use cases for your team, we’d love to work with you.

About thoughtbot

We've been helping engineering teams deliver exceptional products for over 20 years. Our designers, developers, and product managers work closely with your team to solve your toughest software challenges through collaborative design and development. Learn more about us.