From 380 to 700+ Tests: How We Built an Autonomous QA Team with Claude Code


We built the "Council of Sub Agents" - eight specialized AI agents powered by Claude Code that automate our entire E2E testing pipeline at OpenObserve. Feature analysis dropped from 45-60 minutes to 5-10 minutes, flaky tests reduced by 85%, and test coverage grew from 380 to 700+ tests. The best part? The Council caught a production bug while writing tests - a silent ServiceNow integration failure that no customer had reported yet. This is how we did it.
If you're a QA engineer at a fast-moving startup, you know this struggle: developers ship features faster than you can automate tests for them. The backlog grows. Manual test creation is tedious and slow. Edge cases slip through. Automation always lags behind.
At OpenObserve, we faced the same bottleneck. Our observability platform was evolving rapidly, but our QA process had clear pain points: feature analysis alone took 45-60 minutes, we had 30+ flaky tests causing false failures, and our test coverage (around 380 tests) couldn't keep pace with development velocity. We needed to think bigger.
The breakthrough? Stop trying to make humans faster. Build a system where AI agents are the QA team - analyzing features, writing tests, auditing code, debugging failures, and documenting everything with minimal human intervention.
Enter the Council of Sub Agents.
We designed the Council as a 6-phase pipeline where each agent has a specialized role, like assembling a dream QA team where nobody sleeps and everyone loves debugging.
┌─────────────────────────────────────────┐
│             THE ORCHESTRATOR            │
│        (Pipeline Manager & Router)      │
└──────────────┬──────────────────────────┘
               │
    ┌──────────▼──────────┐
    │  PHASE 1: ANALYSIS  │
    │   👔 The Analyst    │
    └──────────┬──────────┘
               │  Feature Design Doc
    ┌──────────▼──────────┐
    │  PHASE 2: PLANNING  │
    │  🏗️ The Architect   │
    └──────────┬──────────┘
               │  Test Plan (P0/P1/P2)
    ┌──────────▼──────────┐
    │ PHASE 3: GENERATION │
    │   ⚙️ The Engineer   │
    └──────────┬──────────┘
               │  Playwright Tests
    ┌──────────▼──────────┐
    │   PHASE 4: AUDIT    │
    │   🛡️ The Sentinel   │
    │  ★ QUALITY GATE ★   │
    └──────────┬──────────┘
               │  BLOCKS if issues found
    ┌──────────▼──────────┐
    │  PHASE 5: HEALING   │
    │    🔧 The Healer    │
    └──────────┬──────────┘
               │  Iterates until passing
    ┌──────────▼──────────┐
    │   PHASE 6: DOCS     │
    │   📝 The Scribe     │
    └─────────────────────┘

    ┌───────────────────┐
    │ 🔍 Test Inspector │
    │   (PR Reviewer)   │
    └───────────────────┘
The Orchestrator runs the show - routing features to the right agents, deciding whether we're testing OSS or Enterprise functionality, and keeping the pipeline moving.
The Analyst (Phase 1) acts as our business analyst. It dives into the source code, extracts every data-test selector, maps user workflows, and identifies edge cases we'd likely miss. Its output is a Feature Design Document that becomes the foundation for everything downstream.
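To make that handoff concrete, here's an illustrative excerpt of what a Feature Design Document captures - the selectors and structure below are hypothetical, not actual Analyst output:
Feature Design Document (excerpt)
Verified selectors (extracted from source):
  [data-test="destination-name-input"]   → name field
  [data-test="destination-save-btn"]     → save action
User workflows:
  create → verify in list → edit → delete
Edge cases:
  duplicate names, invalid webhook URLs, navigating away with unsaved changes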
The Architect (Phase 2) is our QA strategist. It takes the analysis and creates a prioritized test plan - P0 for critical paths, P1 for core functionality, P2 for edge cases.
The Engineer (Phase 3) writes the actual Playwright test code, following our Page Object Model patterns (reusable UI component abstractions) and using only verified selectors from The Analyst.
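As a rough sketch of what that output looks like - the page object names, selectors, and route below are illustrative, not our actual framework code:
// alertDestinationsPage.js - minimal page object (illustrative names and selectors)
export class AlertDestinationsPage {
  constructor(page) {
    this.page = page;
    this.nameInput = page.locator('[data-test="destination-name-input"]');
    this.saveButton = page.locator('[data-test="destination-save-btn"]');
    this.destinationList = page.locator('[data-test="destination-list"]');
  }

  async createDestination(name) {
    await this.nameInput.fill(name);
    await this.saveButton.click();
  }
}

// destinations.spec.js - a generated test that only touches page-object methods
import { test, expect } from '@playwright/test';
import { AlertDestinationsPage } from './alertDestinationsPage';

test('P0: create a destination and see it listed', async ({ page }) => {
  const destinations = new AlertDestinationsPage(page);
  await page.goto('/alerts/destinations'); // hypothetical route
  await destinations.createDestination('slack-prod');
  await expect(destinations.destinationList).toContainText('slack-prod');
});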
The Sentinel (Phase 4) is our favorite - the quality guardian who doesn't compromise. Audits generated code for framework violations (raw selectors in tests, missing assertions), anti-patterns (no awaits, brittle locators), and security issues (hardcoded credentials). If it finds critical problems, it blocks the entire pipeline. No exceptions.
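The kinds of findings it raises look roughly like this (illustrative examples, not actual audit output; destinations.successToast is a hypothetical page-object field):
// Flagged: raw, positional selector and an unawaited action inside a spec
page.locator('.q-dialog .q-btn:nth-child(3)').click();  // brittle locator, missing await

// Flagged: hardcoded credentials
await page.fill('[data-test="password"]', 'SuperSecret123');

// Acceptable: page-object method, explicit assertion, secrets pulled from the environment
await destinations.createDestination(process.env.TEST_DESTINATION_NAME);
await expect(destinations.successToast).toBeVisible();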
The Healer (Phase 5) is where the magic happens. Runs tests, diagnoses failures, fixes issues like selector problems or timing bugs, and iterates up to 5 times until tests pass. This is what makes the system truly autonomous - it doesn't just generate code, it makes it work.
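Conceptually, the loop looks something like the sketch below - the real Healer is a Claude Code agent, not this script, and runTests, diagnose, and applyFix are hypothetical helpers standing in for its reasoning:
const MAX_ATTEMPTS = 5;

for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
  const result = await runTests(specFile);            // e.g. shell out to `npx playwright test`
  if (result.passed) break;                           // healed - stop iterating

  const diagnosis = await diagnose(result.failures);  // selector drift? timing? bad test data?
  await applyFix(specFile, diagnosis);                // patch the test code, then try again
}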
The Scribe (Phase 6) closes the loop by documenting everything in TestDino, our test management system, ensuring we have a single source of truth.
The Test Inspector operates independently, reviewing GitHub PRs that contain E2E test changes and applying the same audit rules as The Sentinel.
Here's where things got interesting. We set out to automate our new "Prebuilt Alert Destinations" feature - integrations with Slack, Discord, Teams, PagerDuty, and ServiceNow. Standard CRUD operations: create, read, update, delete.
The Engineer and Architect generated the tests. Everything looked good. Then The Healer ran them.
The ServiceNow edit form hung, stuck on "Loading destination data..." indefinitely. The Healer analyzed the failure and traced it back to the Vue component handling ServiceNow URLs. Here's what it found:
Before (broken):
hostname.split('.').slice(-3, -1).join('.') === 'service-now'
// For "dev12345.service-now.com" → returns "dev12345.service-now"
// Does NOT equal "service-now" ❌
After (fixed):
hostname.endsWith('.service-now.com') ✅
The bug was subtle - the URL parsing logic tried to extract "service-now" from hostnames, but the array slicing returned the wrong substring. Every ServiceNow destination edit silently failed in production. (PR #10154)
This bug was LIVE in production. No customer had reported it yet - it was a silent failure hiding in plain sight. Our E2E test automation, running through the Council's pipeline, caught it before it became a support nightmare.
We came to write tests. We accidentally became better debuggers.
That's when we realized the Council wasn't just automation - it was a quality amplifier. The combination of The Engineer generating comprehensive test coverage and The Healer's diagnostic capabilities meant we were testing paths that humans might skip and finding issues we'd never think to look for.
Six months in, the Council transformed our QA workflow:
Speed & Efficiency
Quality & Coverage
Team Velocity
The ServiceNow bug was our proof of concept, but the real win is systemic: we're catching issues earlier, testing more comprehensively, and freeing our QA team to focus on strategic quality decisions instead of writing repetitive test code.
The entire Council runs as Claude Code slash commands - markdown files in .claude/commands/ that define each agent's role, responsibilities, and guardrails. Think of it as infrastructure-as-code for AI agents.
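For a sense of the shape, an agent definition in .claude/commands/ reads like a job description with guardrails. The sketch below is illustrative, not our actual command file:
# .claude/commands/sentinel.md (illustrative)
Role: audit generated Playwright specs before they reach human review.
Check for: raw selectors in specs, missing assertions, unawaited calls, hardcoded credentials.
Output: PASS, or BLOCK with a numbered list of violations and file/line references.
Never: rewrite tests yourself - handing fixes back is The Healer's job.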
Why Claude Code?
The agent definitions live as version-controlled markdown alongside the test code itself, and each agent can be invoked with exactly the context it needs - a natural fit for the bounded, specialized roles the pipeline depends on.
The Council integrates with our existing stack: Playwright for E2E testing, our Page Object Model for maintainability, TestDino for test case management, and GitHub for PR reviews. It supports both OSS and Enterprise testing paths.
The key insight? Specialization over generalization. Early iterations tried using one "super agent" to do everything. It failed. Bounded agents with clear roles work far better - The Analyst focuses solely on feature analysis, The Sentinel only audits, The Healer only debugs. This separation of concerns mirrors good software architecture and makes the prompts more effective.
The Quality Gate Changed Everything
The Sentinel blocking the pipeline for critical issues was controversial initially. "What if it's wrong? What if we need to override it?" But that hard gate forced us to improve our Page Object Model, standardize patterns, and write better prompts. The friction created long-term quality.
Iteration is the Secret Weapon
The Healer's ability to iterate up to 5 times per test is what makes autonomous testing real. Tests rarely pass on the first try - selectors change, APIs evolve, timing issues emerge. A system that gives up after one failure isn't autonomous; it's just automated.
Context Chaining is Critical
Each agent receives rich context from previous phases. The Engineer sees The Analyst's feature doc AND The Architect's test plan. The Healer sees the test code AND failure logs. This context is what enables intelligent decisions instead of blind generation.
Human Review Still Matters
The Council is autonomous, not unsupervised. We review final output, especially for P0 tests. But review time dropped from hours to minutes because code is consistent, well-structured, and pre-audited.
We're expanding the Council's capabilities:
Full CI/CD Integration
Making the pipeline trigger automatically on feature PR merges via GitHub Actions. The workflow: developer merges feature → Council pipeline triggers → generates test PR → QA reviews and merges. Zero manual kickoff, complete automation from feature to test coverage.
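The trigger itself is standard GitHub Actions wiring; a minimal sketch might look like this (the workflow name, script path, and argument are hypothetical, not our actual pipeline):
# .github/workflows/council.yml (sketch)
name: council-pipeline
on:
  pull_request:
    types: [closed]
    branches: [main]
jobs:
  generate-tests:
    if: github.event.pull_request.merged == true   # only run for merged feature PRs
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the Council pipeline
        run: ./scripts/run-council.sh "${{ github.event.pull_request.number }}"  # hypothetical entrypoint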
Visual Regression Testing
Teaching The Engineer to generate screenshot comparison tests, catching UI regressions that functional tests miss.
Performance Test Generation
Extending the pipeline to load testing scenarios - The Architect planning performance tests, The Engineer generating them, The Healer diagnosing bottlenecks.
API Test Coverage
Applying the same 6-phase approach to backend API testing, ensuring comprehensive coverage beyond just E2E UI tests.
Self-Improvement Loop
Agents that learn from past failures and update their own prompts - a QA system that gets smarter with every test run.
The vision? A fully autonomous QA workflow that doesn't just automate testing but continuously improves the quality of automation itself, integrated seamlessly into our development pipeline.
The Council of Sub Agents represents our philosophy at OpenObserve: AI-first engineering amplifies what humans can do, it doesn't replace them. We didn't eliminate QA engineers - we eliminated the tedious parts so our team focuses on strategic decisions, exploratory testing, and continuous improvement.
The results speak for themselves: 6-10x faster analysis, 85% fewer flaky tests, 84% more coverage, and a production bug caught by automation before customers noticed.
If you're drowning in manual test creation or evaluating AI tools for your engineering team, we'd love to share what we learned.
Want to see the Council in action? Contact us at hello@openobserve.ai to schedule a demo and discuss how AI-powered testing can work for your team.
Join the conversation: Connect with our engineering team in our community Slack to discuss AI testing approaches and learn from teams pushing the boundaries of automated QA.
Try OpenObserve: Experience what an AI-first observability platform can do. Get started here - built by a team that uses AI to move fast and ship quality.

Shrinath Rao is Lead QA Engineer at OpenObserve, running QA through an AI-driven operating model. He architects autonomous testing systems using Claude Code and AI sub-agents for analysis, validation, and release workflows. With 9+ years in automation-first testing across observability and enterprise platforms, he's an active code contributor who embeds AI capabilities into CI/CD pipelines to strengthen quality engineering.