
Design for Testability — why testability is an architectural decision

by Rainer Haupt

TL;DR: Testability is an architectural property with two dimensions — controllability and observability. It is created not by good tests but by design decisions taken long before: structured logging with correlation IDs, dependency injection instead of static-method coupling, idempotent APIs for repeatable tests, feature flags for safe releases, and Architecture Decision Records framed to be testable. Teams that don’t pay this “testability tax” buy it back later as regression cost — usually at a multiple.

Reading time approx. 14 min · As of: 2026-04


In most projects testing starts when the architecture is already in place. That is the most expensive variant. When a microservice is testable only through the GUI because nobody thought about logging at the interfaces; when classes instantiate dependencies via new and resist mocking; when requests run through half a dozen services and nobody can correlate them — these are not test problems. They are architecture problems showing up as test problems.

This article gathers the patterns that build testability in from the start, organised by architectural layer.

The two dimensions: controllability and observability

ISO/IEC 25010:2023 defines testability as a sub-characteristic of maintainability: “Degree of effectiveness and efficiency with which test criteria can be established for a system, product or component and tests can be performed to determine whether those criteria have been met.” Peter Zimmerer (Siemens AG) split this into two dimensions that have shaped the architecture conversation since:

  • Controllability — how well can inputs and system state be controlled in isolation? Can tests precisely set a particular input dataset, trigger an error mode, replace external services with stubs?
  • Observability — how well can outputs, states, error conditions and resource usage be observed? Does the system emit enough signal for a test assertion to be formulated cleanly?

Both dimensions are architectural properties, not test properties. Ignoring them produces systems where tests are structurally hard. arc42 adds an important critique of the ISO model: testability does not belong only under maintainability — it cuts across reliability, usability and security too. The standard provides no direct software metrics; organisations have to define their own contextually.

Headless systems: no logging, no verification

Backend services, batch jobs and message-driven systems have no visual surface. Without consistent logging at the interfaces, there is no way to verify that inputs were processed correctly, outputs produced properly and errors handled. Integration tests then degrade to plain DB-state checks or file comparisons — both brittle, both blind to processing logic.

The Google Testing Blog has recommended logging requests and responses between major components, plus all significant state changes, since 2013. Pete Hodgson sharpened that into the Domain Probes pattern in 2019: instrumentation is wrapped in domain-oriented abstractions instead of written directly against a logging framework. Tests then verify probe invocations, not logger calls — making observability itself testable.
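
A minimal sketch of the Domain Probes idea in Python; the class and event names (OrderPlacementProbes, order_placed) are hypothetical. Domain code talks to a probe abstraction, and the test asserts on probe invocations rather than on logger calls.

```python
import structlog


class OrderPlacementProbes:
    """Domain-oriented instrumentation facade (hypothetical name)."""

    def __init__(self, logger=None):
        self._logger = logger or structlog.get_logger()

    def order_placed(self, order_id: str, total_cents: int) -> None:
        self._logger.info("order_placed", order_id=order_id, total_cents=total_cents)

    def payment_declined(self, order_id: str, reason: str) -> None:
        self._logger.warning("payment_declined", order_id=order_id, reason=reason)


class OrderService:
    """Domain code talks to the probe, never to a logging framework directly."""

    def __init__(self, probes):
        self._probes = probes

    def place(self, order_id: str, total_cents: int) -> None:
        # ... domain logic ...
        self._probes.order_placed(order_id, total_cents)


class FakeProbes:
    """Test double: records domain events instead of logging them."""

    def __init__(self):
        self.placed = []

    def order_placed(self, order_id: str, total_cents: int) -> None:
        self.placed.append((order_id, total_cents))

    def payment_declined(self, order_id: str, reason: str) -> None:
        pass


def test_placing_an_order_is_instrumented():
    probes = FakeProbes()
    OrderService(probes).place("o-42", 1999)
    assert probes.placed == [("o-42", 1999)]
```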

For message-driven systems with Kafka or RabbitMQ, logging at the producer and consumer is the only observable artefact. Sparse logging here removes the option for integration assertions entirely.

Structured logs as a test-assertion substrate

Unstructured plain-text logs cannot be parsed reliably. Regex-based log assertions break on every text change. The ThoughtWorks Technology Radar has Structured Logging in the Adopt ring: “Treating logs as data gives us greater insight into operational activity.”

Concretely: JSON-structured logs allow field-based assertions on correlationId, orderId and event instead of substring searches. The practical rule: don’t assert on exact messages — assert on keywords, fields and identifiers from the input request. Established log-capture helpers and test doubles per language:

  • Python (pytest-structlog): log.has("event_name", field=value), ordered event verification
  • Python (Eliot): @capture_logging, assertHasAction(), assertContainsFields()
  • Java (MemoryAppender, Logback): in-memory appender with search(text, level)
  • .NET (FakeLogger<T> + Snapshooter): structured-log capture, snapshot verification
  • Node.js (Pino): JSON-native, stdout capture with JSON parse
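
A sketch of a field-based assertion using the pytest-structlog log fixture; the event name order_created and its fields are made up for illustration.

```python
import structlog

logger = structlog.get_logger()


def create_order(order_id: str, correlation_id: str) -> None:
    # ... processing ...
    logger.info("order_created", order_id=order_id, correlation_id=correlation_id)


def test_order_creation_logs_identifiers(log):  # `log` fixture from pytest-structlog
    create_order(order_id="o-42", correlation_id="c-123")

    # Assert on the event name and fields, not on an exact message string.
    assert log.has("order_created", order_id="o-42", correlation_id="c-123")
```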

A related pattern that bridges test logs and production logs: Temporary Log Queues. During processing an in-memory queue accumulates events. On success the queue is discarded and a summary logged; on failure the entire queue is written out. The effect: production logs stay slim, in failure cases the full trace is available. spf4j-slf4j-test implements this at test level with automatic dump on test failure.
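
A minimal sketch of the Temporary Log Queue idea (class and function names hypothetical): detailed events are buffered during processing, discarded on success and dumped in full on failure.

```python
import logging

logger = logging.getLogger("orders")


class TemporaryLogQueue:
    """Buffers detailed events; they are emitted only if processing fails."""

    def __init__(self) -> None:
        self._events: list[str] = []

    def add(self, message: str) -> None:
        self._events.append(message)

    def discard(self) -> None:
        self._events.clear()

    def dump(self) -> None:
        for message in self._events:
            logger.error("buffered: %s", message)


def process_order(order: dict, queue: TemporaryLogQueue) -> None:
    try:
        queue.add(f"validating order {order['id']}")
        # ... validation, pricing, persistence ...
        queue.add("order persisted")
    except Exception:
        queue.dump()      # failure: the full trace goes to the log
        raise
    else:
        queue.discard()   # success: production logs stay slim
        logger.info("order %s processed", order["id"])
```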

Correlation IDs and distributed tracing

The Microsoft Engineering Fundamentals Playbook frames the problem precisely: “In a distributed system architecture, it is highly difficult to understand a single end-to-end customer transaction flow through the various components.” Without a correlation mechanism, logs from different services are isolated. The lifecycle of a request across service boundaries can no longer be reconstructed.

Until 2020 this space was vendor-fragmented: Zipkin B3, Google’s X-Cloud-Trace-Context, proprietary Datadog headers. The W3C Trace Context standard ended that — traceparent and tracestate are now the common ground. OpenTelemetry, the second-most-active CNCF project after Kubernetes, propagates these headers automatically across HTTP, gRPC and message queues.

For tests this opens concrete patterns:

  • Inject + Assert: the test injects a known correlation ID in the header → after the API call, query the log/trace system by that ID → assert on service calls, ordering, parameters
  • Custom ID + W3C in parallel: X-Correlation-Id (human-readable, searchable) plus W3C traceparent (machine-oriented) — tests grep logs, trace tools assert deeper
  • Propagation validation in CI: tests verify specifically that correlation IDs exist in all log outputs — missing data breaks the pipeline
  • Test isolation via correlation ID: each test run generates a unique ID → parallel test execution without interference, post-mortem via ID query
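
A sketch of the Inject + Assert pattern, assuming a hypothetical /orders endpoint and a log store that can be queried over HTTP; URLs, field names and the expected service list are illustrative only.

```python
import uuid

import requests


def test_order_flow_is_traceable_by_correlation_id():
    correlation_id = str(uuid.uuid4())

    # 1. Inject a known correlation ID into the request.
    response = requests.post(
        "https://api.example.test/orders",
        json={"sku": "ABC-1", "quantity": 2},
        headers={"X-Correlation-Id": correlation_id},
        timeout=10,
    )
    assert response.status_code == 201

    # 2. Query the (hypothetical) log store by that ID.
    logs = requests.get(
        "https://logs.example.test/search",
        params={"correlationId": correlation_id},
        timeout=10,
    ).json()

    # 3. Assert on the services involved and their ordering.
    services = [entry["service"] for entry in logs]
    assert services == ["order-api", "payment-service", "shipping-service"]
```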

The OpenTelemetry demo documents this extensively: 26 trace-based tests for 10 services verify statements like “Order was placed”, “User was charged”, “Product was shipped” — and during test development uncovered a real bug (HTTP 500 on direct EmailService call due to JSON casing mismatch).

Patterns that pay the “testability tax”

Six architectural patterns cost effort to build but pay back directly in testability.

Dependency injection instead of new. Miško Hevery (Google) put it succinctly in 2008: “There are no tricks to writing tests, there are only tricks to writing testable code.” Classes that ask for their dependencies via constructor are testable in isolation. Classes that fetch singletons or use service locators are not. Hevery’s four critical flaws — mixing object-graph construction with logic, “looking for things” instead of “asking for things”, work in constructors, global state — have been standard code-review repertoire ever since.
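
A before/after sketch in Python (class names hypothetical): the first variant constructs its dependency itself and resists substitution, the second asks for it in the constructor and accepts a fake in tests.

```python
class SmtpMailer:
    def send(self, to: str, body: str) -> None:
        ...  # real SMTP traffic


# Hard to test: the dependency is constructed inside the class.
class InvoiceServiceHardwired:
    def __init__(self) -> None:
        self.mailer = SmtpMailer()   # hidden coupling via `new`


# Testable: the dependency is injected ("asking for things").
class InvoiceService:
    def __init__(self, mailer) -> None:
        self.mailer = mailer

    def send_invoice(self, customer_email: str) -> None:
        self.mailer.send(customer_email, "Your invoice ...")


class FakeMailer:
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))


def test_invoice_is_mailed():
    mailer = FakeMailer()
    InvoiceService(mailer).send_invoice("a@example.test")
    assert mailer.sent and mailer.sent[0][0] == "a@example.test"
```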

Interfaces for test doubles. Concrete-class coupling forces tests to instantiate real databases, HTTP clients and file systems. Final/sealed classes block subclassing for test doubles — in Working Effectively with Legacy Code, Michael Feathers calls object seams “pretty much the most useful seams available in OOP.” Martin Fowler distinguishes four test-double types — Mock, Stub, Fake, Self-Initializing Fake — all of which require programming to interfaces.
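
In Python the “interface” can be a typing.Protocol, so production code and test doubles share only a structural contract; a small sketch with hypothetical names:

```python
from typing import Protocol


class PaymentGateway(Protocol):
    def charge(self, customer_id: str, amount_cents: int) -> str: ...


class StubGateway:
    """Test double: always approves and returns a fixed transaction ID."""

    def charge(self, customer_id: str, amount_cents: int) -> str:
        return "tx-test-0001"


def checkout(gateway: PaymentGateway, customer_id: str, amount_cents: int) -> str:
    # Depends on the protocol, never on a concrete HTTP client.
    return gateway.charge(customer_id, amount_cents)


def test_checkout_uses_the_gateway_result():
    assert checkout(StubGateway(), "c-7", 4999) == "tx-test-0001"
```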

Feature flags for test scenarios. Hodgson’s Feature Toggles (2017) provides a toggle taxonomy directly relevant to testability: Release Toggles for unfinished code, Experiment Toggles for A/B tests, Ops Toggles as kill switches, Permissioning Toggles for user segments. Critically: automated tests must exercise both toggle paths. Per-request overrides via header or cookie allow test scenarios against production machines without changing the global toggle state.
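
A sketch of a toggle router with a per-request override; the X-Feature-Overrides header and its format are assumptions for illustration, not a specific product’s API.

```python
def is_enabled(flag: str, global_flags: dict[str, bool], headers: dict[str, str]) -> bool:
    """Per-request override wins over the global toggle state (sketch)."""
    overrides = headers.get("X-Feature-Overrides", "")
    # Header format assumed here: "new-checkout=on,legacy-export=off"
    for item in filter(None, overrides.split(",")):
        name, _, value = item.partition("=")
        if name.strip() == flag:
            return value.strip() == "on"
    return global_flags.get(flag, False)


def test_both_toggle_paths():
    global_flags = {"new-checkout": False}
    # Path 1: toggle off (global state untouched)
    assert is_enabled("new-checkout", global_flags, {}) is False
    # Path 2: toggle forced on for this request only
    headers = {"X-Feature-Overrides": "new-checkout=on"}
    assert is_enabled("new-checkout", global_flags, headers) is True
```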

Idempotent APIs. Non-idempotent APIs make tests unrepeatable: run the same test twice and you get duplicate bookings, duplicate records, divergent results. Stripe’s Idempotency-Key pattern is the canonical implementation: the client sends a UUID, the server processes once and replays the cached response on retry. HTTP method semantics help: GET, PUT and DELETE are idempotent by definition; POST is not.
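
A minimal sketch of the idempotency-key idea, with an in-memory dict standing in for the persistent response store; names are hypothetical and this is not Stripe’s implementation.

```python
import uuid

_responses_by_key: dict[str, dict] = {}   # stand-in for a persistent store
_bookings: list[dict] = []


def create_booking(payload: dict, idempotency_key: str) -> dict:
    """Processes at most once per key; retries replay the cached response."""
    if idempotency_key in _responses_by_key:
        return _responses_by_key[idempotency_key]

    booking = {"id": str(uuid.uuid4()), **payload}
    _bookings.append(booking)                 # the side effect happens exactly once
    _responses_by_key[idempotency_key] = booking
    return booking


def test_repeated_calls_do_not_duplicate_bookings():
    key = str(uuid.uuid4())
    first = create_booking({"sku": "ABC-1"}, idempotency_key=key)
    second = create_booking({"sku": "ABC-1"}, idempotency_key=key)
    assert first == second
    assert len(_bookings) == 1   # the test run is repeatable
```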

Test-specific configuration profiles. Spring Boot’s application-test.properties with H2 in-memory DB and mock auth, appsettings.Test.json for .NET, NODE_ENV=test with dotenv. Industry studies report that around 54% of applications experience configuration issues when switching environments. A dedicated test profile — not a reused local one — sidesteps an entire class of these problems.
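
The same idea outside Spring and .NET, as a plain Python sketch: the test profile is selected via an environment variable instead of reusing the local one (file names hypothetical).

```python
import json
import os
from pathlib import Path


def load_settings() -> dict:
    """Pick the config file by APP_ENV; 'test' is its own profile, not 'local'."""
    profile = os.environ.get("APP_ENV", "local")   # e.g. local, test, prod
    return json.loads(Path(f"config/settings.{profile}.json").read_text())


# config/settings.test.json (hypothetical) would point at an in-memory DB and a
# mock auth provider, mirroring application-test.properties in Spring Boot.
```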

Seams for legacy code. Feathers’ seam definition: “A seam is a place where you can alter behavior in your program without editing in that place.” Object seams (polymorphism via interfaces) are OOP’s most powerful tool for breaking entrenched dependencies. Fowler’s 2024 extension: seams matter not only for legacy code but for every new system, because every new system becomes legacy sooner or later.
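
A small object-seam sketch: the class calls its collaborator through a method that a test subclass overrides, so behaviour changes without editing the original call site (names hypothetical).

```python
class ReportJob:
    """Legacy-style class: the seam is the overridable fetch method."""

    def run(self) -> int:
        rows = self._fetch_rows()   # enabling point: behaviour can be swapped here
        return len(rows)

    def _fetch_rows(self) -> list[dict]:
        raise ConnectionError("talks to the production database")


class ReportJobWithCannedRows(ReportJob):
    """Test subclass: alters behaviour without editing ReportJob.run()."""

    def _fetch_rows(self) -> list[dict]:
        return [{"id": 1}, {"id": 2}]


def test_run_counts_rows():
    assert ReportJobWithCannedRows().run() == 2
```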

Trace-based testing — the new E2E layer

Ted Young (Lightstep) coined Trace Driven Development at KubeCon 2018: “Trace Tests can span across multiple network calls, languages, and services, while still retaining unit-test-like ability to observe fine-grained internal behavior.” The approach uses distributed traces as the primary verification mechanism instead of classical mock-based integration tests.

The workflow:

  1. Trigger an operation against the system, capture the trace ID
  2. Wait until the system has reported the full trace to the telemetry store
  3. Fetch the trace by ID
  4. Validate both the API response and the trace data — span attributes, timing, status codes, parent-child relationships
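
A sketch of steps 1 to 4 in Python against a hypothetical checkout endpoint and a Jaeger-style trace API; URLs, span names and attributes are illustrative, not the API of any specific tool.

```python
import time
import uuid

import requests


def test_checkout_produces_the_expected_trace():
    # 1. Trigger the operation with a known trace ID (W3C traceparent format).
    trace_id = uuid.uuid4().hex
    traceparent = f"00-{trace_id}-{'0' * 15}1-01"
    response = requests.post(
        "https://api.example.test/checkout",
        json={"cart_id": "c-1"},
        headers={"traceparent": traceparent},
        timeout=10,
    )
    assert response.status_code == 200

    # 2. Wait until the full trace has reached the telemetry store.
    # 3. Fetch the trace by ID (Jaeger-style HTTP API, hypothetical host).
    spans = []
    for _ in range(20):
        trace = requests.get(
            f"https://jaeger.example.test/api/traces/{trace_id}", timeout=10
        )
        if trace.status_code == 200:
            spans = trace.json()["data"][0]["spans"]
            break
        time.sleep(1)

    # 4. Validate trace data: the payment span exists and reports no error tag.
    payment_spans = [s for s in spans if s["operationName"] == "charge-card"]
    assert payment_spans, "payment service was never called"
    assert all(t["key"] != "error" for s in payment_spans for t in s["tags"])
```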

Tracetest (Kubeshop) is the most prominent tool: no code changes needed, works with existing OTel instrumentation, its own selector language for spans (span[tracetest.span.type="database"]), YAML test definitions for CI/CD. Backends: Jaeger, Tempo, Honeycomb, Datadog, Elastic. Sample assertions:

  • All DB queries under 100ms
  • All gRPC return codes = 0
  • A specific downstream service was called
  • The message queue received the expected message

The advantage: the most painful part of classical integration tests — the “test rig” with auth, plumbing code and service access — falls away. On test failure the full distributed trace is already captured; MTTR drops sharply. Trace-based testing positions itself in the integration/E2E tier of the test pyramid, not as a replacement for unit tests.

Anchoring testability in architecture reviews

Architecture Decision Records (ADRs) per Michael Nygard (2011) sit in the ThoughtWorks Radar’s Adopt ring. Kyle Brown (IBM) sharpens the point:

“Every architectural decision should be testable and should have a test written to accompany it. If a decision is not testable, then it is merely an opinion or a suggestion and not a decision.”

That line yields concrete ADR examples with testable verification:

  • “Apps are cloud-native on OpenShift” → verify deployment against the specified master node
  • “Apps are written in Java or Node.js” → validate repo structure and build-file dependencies
  • “All Node programs use Mocha” → scan template pipelines for Mocha configuration
  • “We use hexagonal architecture” → verify separation of domain logic and external dependencies
  • “Pact for consumer-driven contract testing” → check contract-test presence in CI

Neal Ford and Rebecca Parsons formalised this further with architectural fitness functions: ADRs document the decision, fitness functions guard it. Tools like ArchUnit (Java) or ArchUnitTS allow architecture rules to live as unit tests in the CI/CD pipeline.
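
The same idea sketched as a plain pytest rather than ArchUnit itself: a fitness function that fails the build when domain code imports infrastructure code (package layout hypothetical).

```python
import ast
from pathlib import Path

# Hypothetical layout: src/myapp/domain must not depend on src/myapp/infrastructure.
DOMAIN_DIR = Path("src/myapp/domain")
FORBIDDEN_PREFIX = "myapp.infrastructure"


def imported_modules(source_file: Path) -> set[str]:
    """Collect absolute module names imported by a source file."""
    tree = ast.parse(source_file.read_text())
    names: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module)
    return names


def test_domain_does_not_depend_on_infrastructure():
    violations = [
        (path, name)
        for path in DOMAIN_DIR.rglob("*.py")
        for name in imported_modules(path)
        if name.startswith(FORBIDDEN_PREFIX)
    ]
    assert not violations, f"architecture rule violated: {violations}"
```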

Zimmerer’s testability review checklist complements this at the architecture level: are control points and observation points well defined? Do scriptable interfaces, ports, hooks, mocks and interceptors exist? Is testability embedded in design specs, with quality gates at milestones?

What the investment is worth

The cost of testability ignorance is documented. The SEI reports testing accounting for over 50% of project budget and schedule in critical systems. Zimmerer describes the vicious cycle: low testing priority in design → system hard to test → automation hard → less investment in automation → priority drops further. Regression cost dominates: “When analyzing the cost of change factors, we usually see that the need to verify nothing has been broken is actually the dominant cost.”

On the other side stand the concrete returns:

  • DI frameworks and interface abstractions → isolated unit testing, swappable components
  • Test infrastructure (containers, CI/CD) → fast feedback, automated regression
  • Contract testing (Pact) → independent deployment, service compatibility
  • Observability (logging, tracing) → debuggability, trace-based testing
  • Test-double architecture → fast, deterministic tests
  • ADRs + fitness functions → continuous architecture governance

Jeremy Miller sums it up: “Testability is all about creating rapid and effective feedback cycles.” Continuous Delivery presupposes those feedback cycles. Fowler: “For CD to be possible, we need a solid foundation of Testing.” Anyone wanting Continuous Delivery cannot avoid Design for Testability — one is the prerequisite for the other.

Three architectural patterns that structurally carry these investments:

  • Hexagonal Architecture / Ports and Adapters — business logic at the core, infrastructure plugged in via ports
  • Functional Core, Imperative Shell — pure logic separated from side effects
  • Domain Probes — testable observability abstractions, decoupled from the logging framework

These patterns are not new. What may be new is the willingness to treat them as architectural duty, not as test-engineering detail.

Running architecture reviews, or building a new system where testability shouldn’t surface only at the end? In a UTAA workshop we map Design-for-Testability patterns against your architecture and prioritise them for your project. Learn more about the method or get in touch directly.
