Improving Service Stability Through Layered Testing, Monitoring, and On-Call Response

Our testing strategy has long placed a strong emphasis on the pull-request and pre-deployment stages. Every change is validated before it reaches production, combining multiple layers of automated and human review to minimise the risk of regressions. This approach has served us well, but it also comes with an important limitation: even thoroughly tested code can still fail at runtime once it is deployed into a live, interconnected system.

In the wake of two incidents of service instability we experienced in Q4 of 2025, we took the opportunity to reassess how we both prevent issues from being introduced and detect them quickly when they occur in production. Alongside resolving the root cause of the instability, we invested time in strengthening our testing protocols, monitoring, and incident response processes to better mitigate service downtime going forward.

On the pre-deployment side, our pull-request workflow applies a comprehensive set of automated checks. Each PR is validated through a combination of static type checking, linting, and unit testing, alongside container-level system-under-test (SUT) integration tests that exercise services in an environment close to how they run in production. These automated checks are complemented by both manual and AI-assisted code review, providing additional scrutiny for logic errors, edge cases, and maintainability concerns. All required checks must pass, and at least one human reviewer must approve the PR before it can be merged. In addition, any newly introduced dependency is subject to proactive security analysis to identify known vulnerabilities or supply-chain risks early.

While this layered pre-deployment testing significantly reduces risk, it does not eliminate the possibility of runtime failures caused by configuration issues, external dependencies, data-specific edge cases, or unexpected interactions between services. To address this gap, we have expanded our investment in automated runtime testing and monitoring.

We now run several tiers of automated tests against live systems to continuously validate critical functionality:

  • Continual automated smoke testing of authentication and authorization, including evaluation of roles and permissions. These checks run every five minutes to quickly detect access-related regressions.
  • Hourly automated smoke testing of invite creation, using browser automation to validate the full end-to-end flow exactly as a user would experience it.
  • Nightly automated testing of critical vID flows, again using browser automation to closely simulate real user behaviour. These tests perform comprehensive checks across vID web and Cortex functionality, including email delivery and webhook verification, helping us catch issues that would be difficult to detect through API-level testing alone.
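As a concrete illustration of the first tier, a permissions smoke check can compare what the live system actually grants each role against a known-good matrix. The sketch below is illustrative only: the role names and the `fetch_granted` callback stand in for a real API call, and are not taken from our actual implementation.

```python
# Hedged sketch of a role/permission smoke check. EXPECTED_PERMISSIONS and
# fetch_granted are hypothetical placeholders for the real access model.
EXPECTED_PERMISSIONS = {
    "admin": {"read", "write", "invite"},
    "member": {"read", "write"},
    "viewer": {"read"},
}

def check_permissions(fetch_granted):
    """Compare each role's granted permissions against the expected matrix.

    `fetch_granted(role)` returns the set of permissions the live system
    currently grants that role. Returns a list of (role, missing,
    unexpected) tuples; an empty list means the check passed.
    """
    failures = []
    for role, expected in EXPECTED_PERMISSIONS.items():
        granted = set(fetch_granted(role))
        missing = expected - granted      # permissions the role should have but lacks
        unexpected = granted - expected   # permissions the role should not have
        if missing or unexpected:
            failures.append((role, missing, unexpected))
    return failures
```

Run on a five-minute schedule, a non-empty result from a check like this is what would trigger the alerting flow described below.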

If any of these automated tests fail, our on-call system is immediately notified and alerts are also sent to our internal communications platform, ensuring visibility across the team. During working hours, engineers can quickly collaborate to assess and resolve the issue; outside of them, an on-call engineer is automatically notified via email and push notification to their mobile device, enabling a timely and coordinated response around the clock. Clear escalation paths and shared context allow incidents to be triaged efficiently and resolved with minimal disruption.
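The routing logic above can be sketched as a small pure function. The working-hours window and channel names here are assumptions for illustration, not our actual configuration.

```python
from datetime import time

# Assumed, illustrative working-hours window (Mon-Fri).
WORKING_HOURS = (time(9, 0), time(17, 30))

def alert_channels(now_time, weekday):
    """Return the notification channels for a failed runtime test.

    A chat alert always goes out for team-wide visibility; outside
    working hours (or at weekends, weekday >= 5) the on-call engineer
    is additionally paged via email and mobile push.
    """
    channels = ["chat"]
    in_hours = weekday < 5 and WORKING_HOURS[0] <= now_time <= WORKING_HOURS[1]
    if not in_hours:
        channels += ["email", "push"]
    return channels
```

Keeping this decision as a pure function of the clock makes the escalation policy itself easy to unit test before it is ever needed in a real incident.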

Looking ahead, we plan to continue expanding both our automated runtime test coverage and our pre-deployment testing strategy. By combining strong preventative controls with realistic, user-centric runtime testing and a responsive on-call process, our goal is not only to reduce the likelihood of incidents, but also to detect, communicate, and resolve them as quickly as possible when they do occur.
