From 380 to 700+ Tests: How We Built an Autonomous QA Team with Claude Code


We built the "Council of Sub Agents" - eight specialized AI agents powered by Claude Code that automate our entire E2E testing pipeline at OpenObserve. Feature analysis dropped from 45-60 minutes to 5-10 minutes, flaky tests reduced by 85%, and test coverage grew from 380 to 700+ tests. The best part? The Council caught a production bug while writing tests - a silent ServiceNow integration failure that no customer had reported yet. This is how we did it.
If you're a QA engineer at a fast-moving startup, you know this struggle: developers ship features faster than you can automate tests for them. The backlog grows. Manual test creation is tedious and slow. Edge cases slip through. Automation always lags behind.
At OpenObserve, we faced the same bottleneck. Our observability platform was evolving rapidly, but our QA process had clear pain points: feature analysis alone took 45-60 minutes, we had 30+ flaky tests causing false failures, and our test coverage (around 380 tests) couldn't keep pace with development velocity. We needed to think bigger.
The breakthrough? Stop trying to make humans faster. Build a system where AI agents are the QA team - analyzing features, writing tests, auditing code, debugging failures, and documenting everything with minimal human intervention.
Enter the Council of Sub Agents.
We designed the Council as a 6-phase pipeline where each agent has a specialized role, like assembling a dream QA team where nobody sleeps and everyone loves debugging.
┌─────────────────────────────────────────┐
│             THE ORCHESTRATOR            │
│        (Pipeline Manager & Router)      │
└──────────────┬──────────────────────────┘
               │
    ┌──────────▼──────────┐
    │  PHASE 1: ANALYSIS  │
    │   👔 The Analyst    │
    └──────────┬──────────┘
               │  Feature Design Doc
    ┌──────────▼──────────┐
    │  PHASE 2: PLANNING  │
    │  🏗️ The Architect   │
    └──────────┬──────────┘
               │  Test Plan (P0/P1/P2)
    ┌──────────▼──────────┐
    │ PHASE 3: GENERATION │
    │   ⚙️ The Engineer   │
    └──────────┬──────────┘
               │  Playwright Tests
    ┌──────────▼──────────┐
    │   PHASE 4: AUDIT    │
    │   🛡️ The Sentinel   │
    │  ★ QUALITY GATE ★   │
    └──────────┬──────────┘
               │  BLOCKS if issues found
    ┌──────────▼──────────┐
    │  PHASE 5: HEALING   │
    │    🔧 The Healer    │
    └──────────┬──────────┘
               │  Iterates until passing
    ┌──────────▼──────────┐
    │   PHASE 6: DOCS     │
    │   📝 The Scribe     │
    └─────────────────────┘

    ┌───────────────────┐
    │ 🔍 Test Inspector │
    │   (PR Reviewer)   │
    └───────────────────┘
The Orchestrator runs the show - routing features to the right agents, deciding whether we're testing OSS or Enterprise functionality, and keeping the pipeline moving.
The Analyst (Phase 1) acts as our business analyst. It dives into the source code, extracts every data-test selector, maps user workflows, and identifies edge cases we'd likely miss. Its output is a Feature Design Document that becomes the foundation for everything downstream.
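To make that handoff concrete, here's an illustrative excerpt of what a Feature Design Document captures - the selectors and structure below are hypothetical, not actual Analyst output:
Feature Design Document (excerpt)
Verified selectors (extracted from source):
  [data-test="destination-name-input"]   → name field
  [data-test="destination-save-btn"]     → save action
User workflows:
  create → verify in list → edit → delete
Edge cases:
  duplicate names, invalid webhook URLs, navigating away with unsaved changes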
The Architect (Phase 2) is our QA strategist. It takes the analysis and creates a prioritized test plan - P0 for critical paths, P1 for core functionality, P2 for edge cases.
The Engineer (Phase 3) writes the actual Playwright test code, following our Page Object Model patterns (reusable UI component abstractions) and using only verified selectors from The Analyst.
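As a rough sketch of what that output looks like - the page object names, selectors, and route below are illustrative, not our actual framework code:
// alertDestinationsPage.js - minimal page object (illustrative names and selectors)
export class AlertDestinationsPage {
  constructor(page) {
    this.page = page;
    this.nameInput = page.locator('[data-test="destination-name-input"]');
    this.saveButton = page.locator('[data-test="destination-save-btn"]');
    this.destinationList = page.locator('[data-test="destination-list"]');
  }

  async createDestination(name) {
    await this.nameInput.fill(name);
    await this.saveButton.click();
  }
}

// destinations.spec.js - a generated test that only touches page-object methods
import { test, expect } from '@playwright/test';
import { AlertDestinationsPage } from './alertDestinationsPage';

test('P0: create a destination and see it listed', async ({ page }) => {
  const destinations = new AlertDestinationsPage(page);
  await page.goto('/alerts/destinations'); // hypothetical route
  await destinations.createDestination('slack-prod');
  await expect(destinations.destinationList).toContainText('slack-prod');
});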
The Sentinel (Phase 4) is our favorite - the quality guardian who doesn't compromise. Audits generated code for framework violations (raw selectors in tests, missing assertions), anti-patterns (no awaits, brittle locators), and security issues (hardcoded credentials). If it finds critical problems, it blocks the entire pipeline. No exceptions.
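The kinds of findings it raises look roughly like this (illustrative examples, not actual audit output; destinations.successToast is a hypothetical page-object field):
// Flagged: raw, positional selector and an unawaited action inside a spec
page.locator('.q-dialog .q-btn:nth-child(3)').click();  // brittle locator, missing await

// Flagged: hardcoded credentials
await page.fill('[data-test="password"]', 'SuperSecret123');

// Acceptable: page-object method, explicit assertion, secrets pulled from the environment
await destinations.createDestination(process.env.TEST_DESTINATION_NAME);
await expect(destinations.successToast).toBeVisible();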
The Healer (Phase 5) is where the magic happens. Runs tests, diagnoses failures, fixes issues like selector problems or timing bugs, and iterates up to 5 times until tests pass. This is what makes the system truly autonomous - it doesn't just generate code, it makes it work.
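Conceptually, the loop looks something like the sketch below - the real Healer is a Claude Code agent, not this script, and runTests, diagnose, and applyFix are hypothetical helpers standing in for its reasoning:
const MAX_ATTEMPTS = 5;

for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
  const result = await runTests(specFile);            // e.g. shell out to `npx playwright test`
  if (result.passed) break;                           // healed - stop iterating

  const diagnosis = await diagnose(result.failures);  // selector drift? timing? bad test data?
  await applyFix(specFile, diagnosis);                // patch the test code, then try again
}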
The Scribe (Phase 6) closes the loop by documenting everything in TestDino, our test management system, ensuring we have a single source of truth.
The Test Inspector operates independently, reviewing GitHub PRs that contain E2E test changes and applying the same audit rules as The Sentinel.
Here's where things got interesting. We set out to automate our new "Prebuilt Alert Destinations" feature - integrations with Slack, Discord, Teams, PagerDuty, and ServiceNow. Standard CRUD operations: create, read, update, delete.
The Engineer and Architect generated the tests. Everything looked good. Then The Healer ran them.
The ServiceNow edit form hung, stuck on "Loading destination data..." indefinitely. The Healer analyzed the failure and traced it back to the Vue component handling ServiceNow URLs. Here's what it found:
Before (broken):
hostname.split('.').slice(-3, -1).join('.') === 'service-now'
// For "dev12345.service-now.com" → returns "dev12345.service-now"
// Does NOT equal "service-now" ❌
After (fixed):
hostname.endsWith('.service-now.com') ✅
The bug was subtle - the URL parsing logic tried to extract "service-now" from hostnames, but the array slicing returned the wrong substring. Every ServiceNow destination edit silently failed in production. (PR #10154)
This bug was LIVE in production. No customer had reported it yet - it was a silent failure hiding in plain sight. Our E2E test automation, running through the Council's pipeline, caught it before it became a support nightmare.
We came to write tests. We accidentally became better debuggers.
That's when we realized the Council wasn't just automation - it was a quality amplifier. The combination of The Engineer generating comprehensive test coverage and The Healer's diagnostic capabilities meant we were testing paths that humans might skip and finding issues we'd never think to look for.
Six months in, the Council transformed our QA workflow:
Speed & Efficiency
Quality & Coverage
Team Velocity
The ServiceNow bug was our proof of concept, but the real win is systemic: we're catching issues earlier, testing more comprehensively, and freeing our QA team to focus on strategic quality decisions instead of writing repetitive test code.
The entire Council runs as Claude Code slash commands - markdown files in .claude/commands/ that define each agent's role, responsibilities, and guardrails. Think of it as infrastructure-as-code for AI agents.
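For a sense of the shape, an agent definition in .claude/commands/ reads like a job description with guardrails. The sketch below is illustrative, not our actual command file:
# .claude/commands/sentinel.md (illustrative)
Role: audit generated Playwright specs before they reach human review.
Check for: raw selectors in specs, missing assertions, unawaited calls, hardcoded credentials.
Output: PASS, or BLOCK with a numbered list of violations and file/line references.
Never: rewrite tests yourself - handing fixes back is The Healer's job.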
Why Claude Code?
The agent definitions live as version-controlled markdown alongside the test code itself, and each agent can be invoked with exactly the context it needs - a natural fit for the bounded, specialized roles the pipeline depends on.
The Council integrates with our existing stack: Playwright for E2E testing, our Page Object Model for maintainability, TestDino for test case management, and GitHub for PR reviews. It supports both OSS and Enterprise testing paths.
The key insight? Specialization over generalization. Early iterations tried using one "super agent" to do everything. It failed. Bounded agents with clear roles work far better - The Analyst focuses solely on feature analysis, The Sentinel only audits, The Healer only debugs. This separation of concerns mirrors good software architecture and makes the prompts more effective.
The Quality Gate Changed Everything
The Sentinel blocking the pipeline for critical issues was controversial initially. "What if it's wrong? What if we need to override it?" But that hard gate forced us to improve our Page Object Model, standardize patterns, and write better prompts. The friction created long-term quality.
Iteration is the Secret Weapon
The Healer's ability to iterate up to 5 times per test is what makes autonomous testing real. Tests rarely pass on the first try - selectors change, APIs evolve, timing issues emerge. A system that gives up after one failure isn't autonomous; it's just automated.
Context Chaining is Critical
Each agent receives rich context from previous phases. The Engineer sees The Analyst's feature doc AND The Architect's test plan. The Healer sees the test code AND failure logs. This context is what enables intelligent decisions instead of blind generation.
Human Review Still Matters
The Council is autonomous, not unsupervised. We review final output, especially for P0 tests. But review time dropped from hours to minutes because code is consistent, well-structured, and pre-audited.
We're expanding the Council's capabilities:
Full CI/CD Integration
Making the pipeline trigger automatically on feature PR merges via GitHub Actions. The workflow: developer merges feature → Council pipeline triggers → generates test PR → QA reviews and merges. Zero manual kickoff, complete automation from feature to test coverage.
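The trigger itself is standard GitHub Actions wiring; a minimal sketch might look like this (the workflow name, script path, and argument are hypothetical, not our actual pipeline):
# .github/workflows/council.yml (sketch)
name: council-pipeline
on:
  pull_request:
    types: [closed]
    branches: [main]
jobs:
  generate-tests:
    if: github.event.pull_request.merged == true   # only run for merged feature PRs
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the Council pipeline
        run: ./scripts/run-council.sh "${{ github.event.pull_request.number }}"  # hypothetical entrypoint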
Visual Regression Testing
Teaching The Engineer to generate screenshot comparison tests, catching UI regressions that functional tests miss.
Performance Test Generation
Extending the pipeline to load testing scenarios - The Architect planning performance tests, The Engineer generating them, The Healer diagnosing bottlenecks.
API Test Coverage
Applying the same 6-phase approach to backend API testing, ensuring comprehensive coverage beyond just E2E UI tests.
Self-Improvement Loop
Agents that learn from past failures and update their own prompts - a QA system that gets smarter with every test run.
The vision? A fully autonomous QA workflow that doesn't just automate testing but continuously improves the quality of automation itself, integrated seamlessly into our development pipeline.
The Council of Sub Agents represents our philosophy at OpenObserve: AI-first engineering amplifies what humans can do, it doesn't replace them. We didn't eliminate QA engineers - we eliminated the tedious parts so our team focuses on strategic decisions, exploratory testing, and continuous improvement.
The results speak for themselves: 6-10x faster analysis, 85% fewer flaky tests, 84% more coverage, and a production bug caught by automation before customers noticed.
If you're drowning in manual test creation or evaluating AI tools for your engineering team, we'd love to share what we learned.
Want to see the Council in action? Contact us at hello@openobserve.ai to schedule a demo and discuss how AI-powered testing can work for your team.
Join the conversation: Connect with our engineering team in our community Slack to discuss AI testing approaches and learn from teams pushing the boundaries of automated QA.
Try OpenObserve: Experience what an AI-first observability platform can do. Get started here - built by a team that uses AI to move fast and ship quality.

Shrinath Rao is Lead QA Engineer at OpenObserve, running QA through an AI-driven operating model. He architects autonomous testing systems using Claude Code and AI sub-agents for analysis, validation, and release workflows. With 9+ years in automation-first testing across observability and enterprise platforms, he's an active code contributor who embeds AI capabilities into CI/CD pipelines to strengthen quality engineering.