From Zero to 90%: How We Unlocked Business Function Monitoring for Australia's Largest Bank
We transformed CBA's dormant ObStack platform from zero foundational service onboardings to comprehensive business function monitoring across 71 critical services, shifting from diagnostic health checks to proactive detection of issues affecting tens of thousands of employees.
PISR: Problem, Impact, Solution, Result
- Problem: Commonwealth Bank of Australia (CBA), an enterprise leader in financial services, had successfully built its ObStack platform (Prometheus and Grafana, with long-term storage for SLI metrics and SLO tracking) but faced critical challenges in onboarding services across the organisation. Despite the platform's capabilities, internal teams and partner organisations were unable to onboard any services to monitor business functions, relying instead on diagnostic metrics from health endpoints that provided insufficient coverage of user-impacting issues.
- Business Impact: Zero foundational services were onboarded to ObStack in FYQ1 2024, despite significant engineering effort and platform readiness. Business functions could not be monitored across critical internal and external services affecting tens of thousands of employees, and teams continued to rely on diagnostic health endpoints that failed to detect user-impacting issues. Teams had no visibility into SLO breaches for business functions, resulting in delayed incident detection and unclear routing of issues to responsible teams.
- Our Solution: Over 12 months, ClearRoute's team of 2 QCAs, 2 SQCEs, and 2 QCEs partnered with CBA to re-engineer their service onboarding approach. We implemented a standardised guided delivery framework with active project management, developed a Kubernetes-based Probe Fabric platform enabling teams without deployment capability to host monitoring workloads, and established Critical User Journey (CUJ) monitoring deployed as close to customer ingress points as possible to ensure authentic user experience validation rather than internal system diagnostics.
- Tangible Result: The transformation drove successful onboarding of 71 foundational services (representing 90% of CBA's critical infrastructure, up from 0), implementation of 508 SLOs and 82 Critical User Journeys, and near-instantaneous Mean Time to Discovery with direct SLO breach notifications. This directly enabled CBA to shift from diagnostic health endpoint monitoring to comprehensive business function visibility across their most critical services affecting tens of thousands of employees.
The Challenge
Business & Client Context
- Primary Business Goal: Enable foundational service teams to successfully onboard to the existing ObStack platform with business function monitoring, moving beyond diagnostic health endpoints to comprehensive Critical User Journey validation for CBA's most critical infrastructure and internal services.
- Pressures: CTO mandate for FY24 observability implementation following FYQ1 delivery failure, regulatory compliance requirements for financial services infrastructure monitoring, and competitive pressure to achieve modern SRE practices comparable to leading technology organisations.
- Technology Maturity: CBA had successfully built the ObStack platform infrastructure with Prometheus, Grafana, and long-term storage capabilities, but teams across the organisation were unable to onboard services for business function monitoring. Existing monitoring relied on diagnostic metrics from health endpoints, providing insufficient visibility into user-impacting issues affecting its 55,000+ employees and millions of customers.
Current State Assessment: Key Pain Points
- Onboarding Capability Gap: Despite having built the ObStack platform, internal teams and partner organisations lacked the capability to onboard services for business function monitoring, resulting in zero successful foundational service onboardings in FYQ1 2024 after significant effort and platform readiness.
- Documentation Fragmentation: Existing onboarding documentation was scattered across multiple sources, contained conflicting information, and was often irrelevant to actual implementation needs, creating confusion and blocking team progress despite substantial previous documentation efforts.
- Political Resistance and Team Apprehension: Foundational service teams exhibited strong resistance to ObStack platform adoption, viewing it as additional overhead rather than value-adding capability. Many teams were nervous about running synthetic workloads in production environments, particularly on databases and critical platforms, despite synthetic monitoring being standard practice for validating customer-facing services.
- SaaS Tooling Team Conflicts: Teams managing SaaS platforms with existing monitoring capabilities (Office 365, VPN systems) strongly resisted synthetic testing approaches. They argued that existing log monitoring was sufficient and did not see the customer-perspective value that business function probes provide over passive log observation.
- Service Complexity Ambiguity: No standardised approach existed for categorising services by onboarding complexity, leading to unrealistic expectations, failed implementation attempts, and inconsistent delivery timelines across diverse foundational services.
- Platform Diversity Challenges: Teams struggled with the heterogeneous nature of foundational services, lacking generic monitoring solutions that could be adapted across different platform types whilst maintaining consistent deployment and operational patterns.
Baseline Metrics (Where Available)
| Metric Category | Baseline | Notes |
|---|---|---|
| Foundational Services Onboarded to ObStack | 0 services | Despite platform readiness and significant FYQ1 effort |
| Business Function Monitoring Coverage | 0% | Only diagnostic health endpoints monitored |
| Mean Time to Discovery | Unknown/Extended | No SLO breach detection for business functions |
| Team Onboarding Capability | Self-service blocked | Most teams unable to deploy probe workloads |
| Critical User Journeys Monitored | 0 CUJs | No user experience validation |
| SLO Implementation | 0 SLOs | No business function thresholds established |
Solution Overview
Engagement Strategy & Phases
- Phase 1: Foundation & Documentation Alignment: Conducted comprehensive assessment of ObStack platform readiness and team resistance factors. Consolidated fragmented and conflicting documentation into coherent implementation guidance. Ran observability workshops to educate teams on CUJ principles and synthetic monitoring value. Developed service complexity categorisation framework to set realistic onboarding expectations. This foundational phase was critical as initial attempts at self-service onboarding failed due to insufficient team capability, requiring us to pivot to a more hands-on consulting approach.
- Phase 2: Framework Development & First Success: Established standardised onboarding framework with active project management. Built Kubernetes-based Probe Fabric platform enabling teams without deployment capability to host monitoring workloads. Achieved first successful foundational service onboarding by week 8, demonstrating the guided delivery approach's effectiveness after months of failed self-service attempts.
- Time to First Value: Delivered first successful service onboarding with business function monitoring in week 8, proving the consulting approach after self-service failures.
- Phase 3: Scaled Delivery & Process Optimisation: With proven framework established, accelerated delivery across multiple foundational service teams. Key to success was maintaining tight control over the onboarding pipeline, managing expectations, and keeping new services on track through active project management. Deployed probes as close to customer ingress points as possible to ensure monitoring reflected actual user experience rather than internal system diagnostics.
- Phase 4: Platform Optimisation & Velocity Tracking: Extended Probe Fabric capabilities to support complex network environments and integrated comprehensive SLO breach notifications. Created velocity tracking dashboard within ObStack providing real-time visibility into onboarding progress across services, SLOs, and CUJs, enabling proactive identification of bottlenecks and clear demonstration of delivery acceleration after framework establishment.
- Phase 5: Knowledge Transfer & Sustainment: Embedded QCE practices within SRE teams, documented architectural patterns and operational runbooks, and established a Centre for Enablement for ongoing platform evolution with target expansion to 200+ services in FY25.
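The direct SLO breach notifications introduced in Phase 4 can be illustrated with a minimal sketch. It is not ObStack's actual alerting logic: the `Slo` and `BreachNotification` types, their field names, and the `evaluate_slo` function are hypothetical, and the example assumes each SLO records an owning team so that a breach can be routed straight to it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO record for one business function."""
    service: str
    journey: str
    target: float      # e.g. 0.99 means 99% of probe runs must succeed
    owner_team: str    # team that receives breach notifications


@dataclass(frozen=True)
class BreachNotification:
    route_to: str
    message: str


def evaluate_slo(slo: Slo, good: int, total: int) -> Optional[BreachNotification]:
    """Return a notification for the owning team if the SLI is below target."""
    if total <= 0:
        return None  # no probe runs in the window, so nothing to judge
    sli = good / total
    if sli >= slo.target:
        return None
    message = (f"SLO breach: {slo.service}/{slo.journey} "
               f"SLI {sli:.2%} is below target {slo.target:.2%}")
    return BreachNotification(route_to=slo.owner_team, message=message)
```

Keeping the owner on the SLO itself means the notification needs no separate lookup to decide who should respond, which is the routing property the engagement was aiming for.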
Architectural Overview
Reusable Probe Archetypes
A critical breakthrough was developing standardised probe archetypes that teams could adopt based on their service characteristics. Rather than each team building custom monitoring solutions, we created three containerised probe patterns that covered 90% of foundational service monitoring requirements:
Promwright: UI Journey Testing Archetype
- Purpose: End-to-end user interface validation using Playwright automation
- Best For: Web applications, customer portals, mainframe terminal interfaces, SaaS platforms with UI components
- Implementation: Containerised Playwright scripts executing multi-step Critical User Journeys from customer perspective
- Key Features: Screenshot capture on failure, detailed step timing, network request validation, accessibility checks
- CBA Examples: Internet banking portals, mainframe terminal access, Office 365 login flows
- Team Feedback: "Super easy to get started on and have a synthetic job scripted and running quickly" - teams appreciated familiar Playwright syntax
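The core idea of the Promwright archetype, timing each step of a multi-step journey and exposing the outcome as metrics, can be sketched as follows. This is an illustration rather than Promwright's implementation: the step callables stand in for Playwright actions, and the function names and the metric names `cuj_success` and `cuj_step_duration_seconds` are assumptions made for this example. The output follows the Prometheus text exposition format.

```python
import time
from typing import Callable, Dict, List, Tuple

# A step is a (name, action) pair. In a real probe the action would drive a
# browser through Playwright; here it is any zero-argument callable.
Step = Tuple[str, Callable[[], None]]


def run_journey(name: str, steps: List[Step],
                clock: Callable[[], float] = time.monotonic) -> Dict:
    """Run steps in order, timing each one and stopping at the first failure."""
    timings: List[Tuple[str, float]] = []
    success = True
    for step_name, action in steps:
        start = clock()
        try:
            action()
        except Exception:
            success = False
        timings.append((step_name, clock() - start))
        if not success:
            break  # later steps depend on earlier ones, so do not continue
    return {"journey": name, "success": success, "steps": timings}


def to_prometheus(result: Dict) -> str:
    """Render a journey result in the Prometheus text exposition format."""
    journey = result["journey"]
    lines = [f'cuj_success{{journey="{journey}"}} {int(result["success"])}']
    for step_name, seconds in result["steps"]:
        lines.append(
            f'cuj_step_duration_seconds{{journey="{journey}",step="{step_name}"}} {seconds:.3f}'
        )
    return "\n".join(lines) + "\n"
```

Recording a duration for the failing step as well as the successful ones shows where in the journey a user would have been blocked, not only that the journey failed.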
Terraprom: Infrastructure Provisioning Archetype
- Purpose: Infrastructure-as-Code validation through actual resource provisioning and teardown
- Best For: Cloud platforms, virtualisation infrastructure, database-as-a-service, compute platforms
- Implementation: Containerised Terraform workflows that provision, validate, and destroy test resources
- Key Features: Real infrastructure testing, cost-controlled ephemeral resources, infrastructure drift detection
- CBA Examples: VMware VM provisioning, Oracle database creation, AWS resource allocation, storage provisioning
- Validation Approach: Full lifecycle testing ensuring infrastructure services can actually deliver what teams promise to customers
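The provision, validate, and destroy lifecycle described above can be sketched as a small wrapper around the Terraform CLI. This is a minimal sketch under stated assumptions, not Terraprom's actual code: `lifecycle_probe`, `shell_runner`, and the `validate` callback are hypothetical names, and the Terraform working directory is assumed to be the current directory. The command runner is injectable so the flow can be exercised without real infrastructure.

```python
import subprocess
from typing import Callable, Dict, Sequence

Runner = Callable[[Sequence[str]], int]


def shell_runner(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def lifecycle_probe(validate: Callable[[], bool],
                    run: Runner = shell_runner) -> Dict[str, bool]:
    """Provision test resources, validate them, and always attempt teardown."""
    result = {"provisioned": False, "validated": False, "destroyed": False}
    if run(["terraform", "init", "-input=false"]) != 0:
        return result  # nothing was created, so there is nothing to destroy
    try:
        apply_cmd = ["terraform", "apply", "-auto-approve", "-input=false"]
        result["provisioned"] = run(apply_cmd) == 0
        if result["provisioned"]:
            try:
                result["validated"] = bool(validate())
            except Exception:
                result["validated"] = False
    finally:
        # Teardown runs even when apply or validation fails, so a failed
        # probe does not leave ephemeral resources (and their cost) behind.
        result["destroyed"] = run(["terraform", "destroy", "-auto-approve"]) == 0
    return result
```

The three booleans map naturally onto separate metrics, so a dashboard can distinguish "the platform could not provision" from "the platform provisioned something that did not work" from "teardown failed and resources may be leaking".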
Cloudprober: API & Network Testing Archetype
- Purpose: Generic HTTP/HTTPS, TCP, and custom protocol validation using Google's proven Cloudprober framework
- Best For: REST APIs, network services, authentication systems, legacy protocols, SaaS integrations
- Implementation: Containerised Cloudprober configurations with custom metric collection and alerting
- Key Features: High-frequency testing, protocol flexibility, minimal resource overhead, reliable metric collection
- CBA Examples: VPN connectivity, API gateway health, authentication services, network infrastructure
- Reliability: Cloudprober's Google heritage provided confidence for production deployment across critical services
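To show how high-frequency probe results become an availability SLI, the sketch below derives a success ratio from two samples of cumulative probe counters. It assumes counters along the lines of Cloudprober's default `total` and `success` probe metrics; the `ProbeSample` type and `window_success_ratio` function are hypothetical names for this example, and in practice the same calculation would more likely be expressed as a query in the metrics backend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeSample:
    """One scrape of a probe's cumulative counters."""
    total: int    # probes sent since the prober started
    success: int  # probes that succeeded since the prober started


def window_success_ratio(prev: ProbeSample, curr: ProbeSample) -> Optional[float]:
    """Success ratio for the window between two samples, or None if no probes ran."""
    if curr.total < prev.total or curr.success < prev.success:
        # Counters went backwards, which indicates a reset (e.g. a prober
        # restart), so count from zero instead of taking a negative delta.
        sent, ok = curr.total, curr.success
    else:
        sent, ok = curr.total - prev.total, curr.success - prev.success
    if sent == 0:
        return None
    return ok / sent
```

Working from deltas rather than raw counter values keeps the ratio tied to the evaluation window, so a long history of successful probes cannot mask a recent run of failures.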
Containerisation & Deployment Strategy
Each probe archetype was packaged as a lightweight container deployable to the Probe Fabric platform:
- Standardised Base Images: Common dependency management and security scanning
- Configuration-Driven: Teams modified YAML configurations rather than rebuilding containers
- Resource Optimisation: Efficient resource allocation across Kubernetes nodes
- Network Positioning: Strategic deployment as close to customer ingress points as possible
- Auto-scaling: Dynamic scaling based on test frequency and resource requirements
This archetype approach solved the "every team builds their own monitoring" problem by providing proven patterns whilst maintaining flexibility for specific service requirements.
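The configuration-driven approach can be illustrated with a small validation sketch. The schema here is entirely hypothetical (the field names `name`, `archetype`, `target`, and `interval_seconds`, and the 60-second default, are assumptions, not Probe Fabric's real format), and the input is the mapping that a YAML loader would produce rather than raw YAML, to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# The three archetypes named in this case study.
ARCHETYPES = {"promwright", "terraprom", "cloudprober"}


@dataclass(frozen=True)
class ProbeConfig:
    name: str
    archetype: str
    target: str
    interval_seconds: int


def parse_probe_config(raw: Mapping[str, Any]) -> ProbeConfig:
    """Validate a parsed probe config and return a typed ProbeConfig."""
    missing = [k for k in ("name", "archetype", "target") if not raw.get(k)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    archetype = str(raw["archetype"]).lower()
    if archetype not in ARCHETYPES:
        raise ValueError(f"unknown archetype: {raw['archetype']}")
    interval = int(raw.get("interval_seconds", 60))  # assumed default
    if interval <= 0:
        raise ValueError("interval_seconds must be positive")
    return ProbeConfig(str(raw["name"]), archetype, str(raw["target"]), interval)
```

Rejecting a bad configuration before deployment gives teams fast feedback on the one artefact they edit, which fits the goal of letting them change configuration without rebuilding containers.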
QCE Disciplines Applied
- Platform Engineering: Delivered Probe Fabric as a Kubernetes-based enablement infrastructure with three standardised probe archetypes (Promwright for UI journeys, Terraprom for infrastructure provisioning, Cloudprober for API/network testing) that covered 90% of foundational service monitoring requirements. Reduced onboarding friction through containerised, configuration-driven approaches whilst maintaining flexibility for diverse service types. Ensured probes deployed as close to customer ingress points as possible for authentic user experience monitoring rather than internal system diagnostics.
- Quality Engineering: Embedded business function monitoring philosophy through Critical User Journey definition and generic probe patterns, shifting focus from diagnostic health endpoints to user experience validation. Addressed team concerns about synthetic workloads by emphasising that testing what customers experience should be a confidence-building exercise, not a source of anxiety: if teams aren't comfortable testing what customers use, how can they be confident in their service delivery?
- Developer Experience: Created streamlined onboarding processes through the standardised guided delivery framework with active project management, and consolidated the fragmented documentation. Hands-on consulting addressed teams' capability gaps and gave clear guidance for navigating the political and technical adoption challenges that self-service approaches had failed to overcome.
The Results: Measurable & Stakeholder-Centric Impact
Headline Success Metrics
The ObStack velocity dashboard tracked our progress in real time across three critical metrics, demonstrating the transformation from zero foundational service onboardings to achieving FY25 targets:
| Metric | Before Engagement | After Engagement | Improvement |
|---|---|---|---|
| Foundational Services Onboarded | 0 services | 71 services (90% coverage) | +71 services |
| Total SLOs Implemented | 0 SLOs | 508 SLOs | +508 SLOs |
| Critical User Journeys Monitored | 0 CUJs | 82 CUJs | +82 CUJs |
| Business Function Coverage | 0% (health endpoints only) | Comprehensive CUJ monitoring | Complete transformation |
| Mean Time to Discovery | Unknown/Extended | Near-instantaneous | Paradigm shift |
| Team Onboarding Success Rate | 0% (self-service failed) | 100% (guided delivery) | Process transformation |
Value Delivered by Stakeholder
- For the CTO / CIO:
  - Unlocked the value of existing ObStack platform investment by enabling successful service onboarding across 71 foundational services representing 90% of critical infrastructure affecting tens of thousands of employees. (platform_utilisation: "ObStack adoption achieved after FYQ1 failure")
  - Mitigated business risk by establishing comprehensive business function monitoring capability providing visibility into user-impacting issues rather than diagnostic health metrics, ensuring employee productivity infrastructure reliability. (risk_mitigation: "CUJ monitoring vs health endpoints")
  - Enabled direct SLO breach notifications ensuring immediate routing to responsible teams for business function failures affecting organisational operations, meeting CTO mandate for FY24 observability implementation. (operational_excellence: "Near-instantaneous incident routing")
- For the VP/Director of Engineering:
  - Enabled foundational service teams to successfully adopt ObStack platform for business function monitoring after months of failed self-service attempts, demonstrating that guided delivery succeeds where documentation alone fails. (adoption_success: "0 to 71 service onboardings with guided approach")
  - Provided teams with deployment infrastructure (Probe Fabric) when they lacked capability to host monitoring workloads themselves, removing technical barriers to platform adoption. (capability_enablement: "Kubernetes-based monitoring infrastructure")
  - Established consulting model that successfully transfers Critical User Journey definition and probe development skills to internal teams through hands-on delivery rather than failed self-service documentation approaches. (knowledge_transfer: "Active project management with skill embedding")
- For the Platform Engineering / SRE Manager:
  - Delivered Probe Fabric platform with three standardised probe archetypes (Promwright, Terraprom, Cloudprober) enabling teams without deployment capability to leverage ObStack for business function monitoring, providing proven patterns for 90% of monitoring requirements while maintaining service-specific flexibility. (platform_enablement: "Containerised archetype-based monitoring infrastructure")
  - Achieved comprehensive monitoring coverage for foundational services including complex internal systems (Office 365, VPNs) alongside core infrastructure, with velocity tracking dashboard providing real-time visibility into progress and clear demonstration of delivery acceleration after framework establishment. (coverage_expansion: "71 services with progress visibility")
  - Established repeatable guided delivery framework and reusable probe patterns that teams can follow with active project management, transitioning from failed self-service to successful consulting model with clear path to team capability development. (pattern_replication: "Standardised framework with archetype-based capability transfer")
Client Testimonials
"Promwright is super easy to get started on and have a synthetic job scripted and running quickly. It's proving to be reliable, and our team intends to continue developing Promwright jobs for more customer facing applications."
— Yong Tie, Systems Engineer, Commonwealth Bank of Australia
"Adapting Promwright enabled us to exercise several multi-step CUJs so we can confidently track end-to-end operational state and performance of our applications. Promwright's detailed documentation and its use of an already familiar testing tool, Playwright, made it easy to onboard without any external assistance."
— Shaun Mansell, Staff SRE, Commonwealth Bank of Australia
Lessons, Patterns & Future State
- What Worked Well: The standardised guided delivery framework with active project management overcame both technical and organisational barriers that self-service approaches consistently failed to address. Three containerised probe archetypes (Promwright, Terraprom, Cloudprober) provided teams with proven patterns covering 90% of monitoring requirements, eliminating the "build your own monitoring" problem whilst maintaining flexibility for specific service needs. Deploying probes as close to customer ingress points as possible ensured monitoring reflected actual user experience rather than internal system diagnostics. Consolidating conflicting documentation into coherent guidance within a repeatable framework reduced confusion and resistance. The velocity tracking dashboard provided clear visibility into delivery acceleration, proving the framework's effectiveness.
- Challenges Overcome: The initial self-service approach failed completely due to insufficient team capability, requiring us to pivot to a guided delivery model with active project management. This was a significant strategic shift that proved essential for success. Strong political resistance to ObStack platform adoption was addressed through stakeholder alignment, proven value demonstration, and careful management of synthetic workload concerns. Fragmented documentation was consolidated, but more importantly, we learned that documentation alone doesn't overcome capability gaps. Organisational approval bottlenecks (mainframe security reviews, service account provisioning) often took longer than the technical implementation itself, requiring proactive stakeholder management throughout the process.
- Key Takeaway for Similar Engagements: For enterprise platform adoption at scale, teams often lack the capability for self-service onboarding regardless of documentation quality. This is a fundamental insight that should influence engagement strategy from the start. Active project management and guided delivery frameworks are essential for success when teams have capability gaps. Customer perspective monitoring requires deployment as close to ingress points as possible, and synthetic workload concerns dissolve when teams understand they're validating their actual service offerings. Don't underestimate organisational bottlenecks: they often take longer than the technical implementation and require dedicated stakeholder management.
- Replicable Assets Created:
  - Three Standardised Probe Archetypes: Promwright (containerised Playwright for UI journeys), Terraprom (containerised Terraform for infrastructure provisioning), and Cloudprober (API/network testing) covering 90% of service monitoring requirements with configuration-driven deployment
  - Standardised Six-Step Onboarding Framework: Proven guided delivery process with active project management (steps include workshop → CUJ definition → infrastructure setup → probe deployment → observability configuration)
  - Service Complexity Categorisation Framework: Systematic approach for assessing onboarding difficulty and setting realistic implementation expectations across diverse platform types
  - Kubernetes-based Probe Fabric Platform: Enabling infrastructure for teams without deployment capability, with strategic network positioning as close to customer ingress points as possible
  - ObStack Velocity Dashboard: Real-time tracking of onboarding progress across services, SLOs, and CUJs with clear visualisation of delivery acceleration phases
  - Consolidated Documentation: Rationalised implementation guidance eliminating conflicting information and focusing on proven archetype patterns
- Client's Future State / Next Steps: With 71 foundational services (90% of critical infrastructure) successfully onboarded using the proven guided delivery framework, CBA is positioned to expand observability to application-tier services and implement advanced SLO-based reliability engineering. The established Centre for Enablement will scale these patterns across business units with target expansion to 200+ services in FY25, leveraging the guided delivery approach and their growing ex-Google SRE talent pool to drive cultural transformation towards customer-perspective monitoring.