Field notes · April 2026

Testing strategy

From Unit Tests to Operational Stress

How to structure tests for solo, AI-first development.

The safety net that makes autonomy possible.

01 — The Problem

When a solo developer works with coding agents, the question is no longer "should I write tests?" — it's "which tests, in what order, and to protect what?"

Ezkey is a cryptographic MFA platform, developed solo with an AI-first approach. Code is modified by both the human and the agent. Without a clear testing strategy, every change is a gamble. With the right strategy, every change becomes verifiable — quickly, cheaply, and repeatably.

What follows is the result of a year of iteration. It's not an academic recipe — it's what works in practice.

02 — Unit Tests: First Line of Defense

Unit tests are the starting point. They're fast, cheap, and they offer a concrete double advantage: instant feedback for the human after every change, and an autonomous validation loop for the agent.

In Ezkey, unit tests cover business services, mappers, validation logic, and cryptographic utilities. They run without a database, without Docker, without a network. This lightness is what makes them usable as a reflex after every change — whether it's the human running a targeted mvn test, or the agent automatically validating its own work.

Unit tests are the bare minimum. If a change breaks a unit test, you know in seconds — not minutes after a full Docker rebuild.
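To make the tier concrete, here is a minimal sketch in the spirit of such a test. The validator below is hypothetical — it is not Ezkey's actual code — but it shows the defining property of this tier: pure logic, no database, no Docker, no network, a verdict in milliseconds.

```java
// Hypothetical validator in the spirit of Ezkey's unit-tested utilities —
// pure logic that runs with no database, Docker, or network.
class EnrollmentIdValidator {

    // Accepts identifiers like "enr-3f9a12": a short prefix plus lowercase hex.
    static boolean isValid(String id) {
        return id != null && id.matches("enr-[0-9a-f]{6,32}");
    }
}

public class Main {
    public static void main(String[] args) {
        // The whole check completes in milliseconds — the point of this tier.
        if (!EnrollmentIdValidator.isValid("enr-3f9a12"))
            throw new AssertionError("expected valid");
        if (EnrollmentIdValidator.isValid("ENR-3F9A12"))
            throw new AssertionError("expected invalid (uppercase)");
        if (EnrollmentIdValidator.isValid(null))
            throw new AssertionError("expected invalid (null)");
        System.out.println("ok");
    }
}
```

Either the human or the agent can run a check like this reflexively after a change, with no environment setup at all.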

03 — Tag-Based Segmentation

As the number of tests grows, being able to precisely target what you run becomes essential. Ezkey uses JUnit 5 tags to segment tests into subgroups. Each test carries one or more tags, and Maven profiles let you combine these tags from the command line.

The benefit is direct: you can run only encryption tests after a cryptographic change, or only smoke tests after a deployment, without executing the full suite.

Tag                 Role
fast                Tests completing in < 5 seconds — default CI execution
slow                Tests taking 30+ seconds — nightly builds, pre-release
smoke               Quick validation of critical functionality — post-deployment sanity check
encryption          Key rotation, encryption, keyset synchronization
enrollment          Enrollment flow and lifecycle
authentication      Auth attempts, challenge-response, tokens
admin               Admin authentication, tokens, administrative operations
security            Security boundaries — 401 and 403 failures
multi-tenant        Tenant isolation, permissions, boundaries
cross-instance      Admin API ↔ Auth API interaction — distributed behavior
audit-integrity     HMAC signatures, chain checkpoints, tamper detection
database            Tests involving direct SQL access to the database
time-dependent      Timing-sensitive tests — potentially flaky under load
elective            On-demand spot checks — never run automatically
operational-churn   Long-running loops simulating production activity
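The selection semantics behind these tags can be pictured with a toy model. This is an illustration of the include/exclude logic that `-Dgroups` and `-DexcludedGroups` express, not JUnit 5's actual discovery engine:

```java
import java.util.Set;

// Toy model of tag-based selection — illustrates the semantics of
// -Dgroups / -DexcludedGroups, not JUnit 5's real tag-expression engine.
class TagFilter {
    // A test is selected if it carries at least one included tag
    // (an empty include set means "everything") and none of the excluded tags.
    static boolean selected(Set<String> testTags, Set<String> include, Set<String> exclude) {
        boolean included = include.isEmpty() || testTags.stream().anyMatch(include::contains);
        boolean excluded = testTags.stream().anyMatch(exclude::contains);
        return included && !excluded;
    }
}

public class Main {
    public static void main(String[] args) {
        // A key-rotation test might carry both "encryption" and "slow".
        Set<String> keyRotationTest = Set.of("encryption", "slow");

        // Targeted run after a cryptographic change: selected.
        System.out.println(TagFilter.selected(keyRotationTest, Set.of("encryption"), Set.of()));
        // CI default run, which excludes "slow": skipped.
        System.out.println(TagFilter.selected(keyRotationTest, Set.of(), Set.of("slow")));
    }
}
```

Because a test can carry several tags, exclusions always win: the same key-rotation test is in scope for a targeted encryption run but out of scope for the fast CI default.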

In practice, the most common Maven profiles:

# Quick validation (CI default — excludes slow, elective, operational)
mvn test -pl ezkey-tests

# All standard tests (excl. elective and operational)
mvn test -pl ezkey-tests -P all-tests

# Target a specific family
mvn test -pl ezkey-tests -Dgroups="encryption"
mvn test -pl ezkey-tests -Dgroups="authentication,enrollment"
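Under the hood, profiles like these map tags onto Maven Surefire's `groups` / `excludedGroups` settings. The sketch below is one plausible wiring — the profile names match the commands above, but the exact pom.xml is an assumption, not Ezkey's actual build file:

```xml
<!-- Sketch of one plausible wiring — not Ezkey's actual pom.xml. -->
<profiles>
  <profile>
    <id>default-tests</id>
    <activation><activeByDefault>true</activeByDefault></activation>
    <properties>
      <!-- CI default: skip slow, elective and operational tests -->
      <excludedGroups>slow,elective,operational-churn</excludedGroups>
    </properties>
  </profile>
  <profile>
    <id>all-tests</id>
    <properties>
      <!-- Everything standard; still no elective or operational runs -->
      <excludedGroups>elective,operational-churn</excludedGroups>
    </properties>
  </profile>
</profiles>
```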

04 — The Docker Environment: the Clean Room

Before discussing functional tests, we need to explain how we get a reliable environment. The answer in Ezkey is the clean start.

A single script (clean-start.sh) does all the work:

  1. Full stop of the Docker stack, including volumes — a complete reset.
  2. Compilation of all Java modules with an optimized Docker build (BuildKit).
  3. Ordered startup of all services: PostgreSQL, Flyway migrations, Admin API, Auth API, Crypto API, Demo Device.
  4. Automatic extraction of bootstrap credentials from Docker logs.
  5. Admin token initialization for tests.

The result: a clean, complete, reproducible environment. This is the foundation on which all functional tests rest. The coding agent can trigger a clean start, run the tests, and get a reliable verdict — without human intervention.

05 — Functional Tests: the Core of the Strategy

This is where things get interesting. Functional tests in Ezkey are end-to-end tests that talk to the real APIs, on a real Docker stack, with a real database. These are not mocks — this is the actual system.

The 80-20 rule

We don't chase exhaustive coverage of every edge case. We target 80% of the value with 20% of the effort. Every functional test must justify its existence by the value it protects, not by an abstract coverage target.

Three fundamental principles

Independence. Each test can run alone, in any order. It creates its own data (tenants, integrations, enrollments) with unique identifiers. No implicit dependency on execution order.

Idempotence. Running a test twice gives the same result. No surprises from residual data, no assumptions about pagination or pre-existing state.

Opportunistic helpers. Tests don't limit themselves to validating the HTTP response. When relevant, they query the database directly via SQL to confirm that a service actually did its job. This is a complementary validation angle — not a bypass of the APIs, but an independent verification that side effects are correct.
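The independence principle can be as mundane as never reusing an identifier. A sketch of that idea — the helper name is hypothetical, not Ezkey's API:

```java
import java.util.UUID;

// Hypothetical helper in the spirit of the independence principle:
// every test creates its own uniquely named data, so tests never collide
// with each other or with residue from earlier runs — which also makes
// re-running a test idempotent by construction.
class TestData {
    static String uniqueTenantName(String testName) {
        // e.g. "tenant-enrollflow-9f1c2a3b" — unique on every invocation.
        return "tenant-" + testName + "-" + UUID.randomUUID().toString().substring(0, 8);
    }
}

public class Main {
    public static void main(String[] args) {
        String first = TestData.uniqueTenantName("enrollflow");
        String second = TestData.uniqueTenantName("enrollflow");
        // Two calls, two distinct tenants — no dependency on execution order
        // or on what a previous run left in the database.
        System.out.println(!first.equals(second));
    }
}
```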

Why sequential?

Functional tests run sequentially by default. This is a deliberate choice, not a technical constraint.

The reason is pragmatic: when a test fails, you want to identify the cause quickly. A sequential log is readable. You can copy-paste it into a conversation with a coding agent and get an efficient root cause analysis. With parallel tests, logs interleave, side effects multiply, and investigation becomes a headache.

Sequential execution isn't a limitation — it's a decision to maximize investigation speed when something breaks.

Data stays

Another intentional choice: functional tests don't clean up their data. Each run adds tenants, enrollments, and authentication attempts to the database. This isn't negligence — it's a strategy.

The accumulated volume gives a chance to spot SQL performance issues that would never appear on an empty database. It's a first, modest but free, tier of performance validation. And it's this enriched database that elective and operational tests later leverage.

06 — Elective Tests: On-Demand Spot Checks

Some scenarios are important but change rarely, take time, or only make sense on a system that has accumulated data. Running them on every build would be wasteful. Ignoring them would be a risk.

The solution: elective tests. They never run automatically. You launch them explicitly, when the context justifies it — after several days of uptime, before a release, or after a targeted change in the relevant domain.

In Ezkey, launching them takes a single command:

# Run elective tests
mvn test -pl ezkey-tests -P elective-tests

07 — Operational Tests: Time Compression

This is where the approach becomes, I believe, genuinely distinctive.

When I evaluated whether a classic load testing framework (JMeter, Gatling) made sense, I realized the real need wasn't to measure requests per second. The real need was to simulate realistic production activity over an extended duration — and observe what happens when time is compressed.

The concept

An operational test in Ezkey uses the same JUnit 5 framework as every other test. No additional framework, no new dependency. The difference: it lives much longer.

Concretely, the current operational test:

  1. Creates a tenant, an admin, integrations, enrollments — a complete setup.
  2. Then enters a loop of repeated authentications, for a default duration of two hours, with configurable volume (light, medium, sustained).

You can start multiple instances in parallel in separate terminals and let them run. Each instance is independent — it creates its own data.
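Stripped of the Ezkey-specific calls, the loop's shape is roughly the following. The durations and pacing knobs are illustrative, and the stubbed operation stands in for a real authentication round-trip against the Docker stack:

```java
import java.time.Duration;
import java.time.Instant;

// Skeleton of a time-boxed churn loop. In the real operational test the body
// would be a full authentication round-trip; here it is stubbed out.
class ChurnLoop {
    static long run(Duration total, Duration pauseBetweenOps, Runnable operation) {
        Instant deadline = Instant.now().plus(total);
        long count = 0;
        while (Instant.now().isBefore(deadline)) {
            operation.run();  // e.g. one authentication attempt
            count++;
            try {
                // "light" / "medium" / "sustained" volume = longer or shorter pauses
                Thread.sleep(pauseBetweenOps.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return count;
    }
}

public class Main {
    public static void main(String[] args) {
        // Demo: 200 ms instead of two hours, with a stubbed operation.
        long ops = ChurnLoop.run(Duration.ofMillis(200), Duration.ofMillis(10),
                () -> { /* authenticateOnce() would go here */ });
        System.out.println(ops > 0);
    }
}
```

Because the loop is plain JUnit-runnable Java, each parallel terminal instance is just another invocation of the same test with its own data.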

Why not a load testing framework?

Because the cost of introducing a dedicated framework doesn't justify the return. Ezkey's operational tests accomplish something different: they don't measure raw performance, they exercise parallelism and concurrency under realistic conditions.

Standard functional tests are sequential — that's their strength for investigation. But it also means they never test mutual exclusion mechanisms, distributed locks, or concurrency issues between multiple sessions. Operational tests fill that gap.

Time compression

This is the most interesting extension of this approach. The idea: take operations that, in production, happen rarely — a key rotation per quarter, a monthly re-encryption batch — and compress them into a horizon of a few hours.

For example: introduce a new cryptographic key every 5 minutes instead of every 3 months, with the corresponding re-encryption batches, while authentication load continues in parallel.
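The mechanics of that compression can be sketched with standard `java.util.concurrent` scheduling — no extra framework. The rotation and authentication bodies below are stubs standing in for the real operations, and the periods are illustrative:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of time compression: an operation that is quarterly in production
// (key rotation) is rescheduled to fire every few time units, while
// authentication load keeps running in parallel the whole time.
class CompressedSchedule {
    static int run(long rotationPeriodMs, long totalMs, Runnable rotateKey, Runnable authenticate) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
        AtomicInteger rotations = new AtomicInteger();
        // Quarterly in production; every rotationPeriodMs under compression.
        scheduler.scheduleAtFixedRate(
                () -> { rotateKey.run(); rotations.incrementAndGet(); },
                0, rotationPeriodMs, TimeUnit.MILLISECONDS);
        // Authentication load runs concurrently with every rotation.
        scheduler.scheduleAtFixedRate(authenticate, 0, 5, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(totalMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        scheduler.shutdownNow();
        return rotations.get();  // how many "quarters" of key lifecycle were compressed
    }
}

public class Main {
    public static void main(String[] args) {
        // Demo: rotate every 50 ms for 300 ms → several rotations under live load.
        int n = CompressedSchedule.run(50, 300, () -> {}, () -> {});
        System.out.println(n >= 2);
    }
}
```

Running rotation and load on a shared scheduler is exactly what gives concurrency bugs — locks, coordination between instances — a chance to surface.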

Over a few hours, you compress months of operations. Behaviors that would normally surface only once per quarter manifest dozens of times. Race conditions, deadlocks, coordination issues between instances — all of these get a chance to reveal themselves.

Time compression turns a long-running test into a discovery accelerator. In a few hours, you simulate months of operations and force the emergence of problems that would take weeks to surface naturally.

Data accumulation — an intentional choice

Like functional tests, operational tests don't clean up after themselves. After a few two-hour sessions with one or more parallel instances, the database contains a representative volume: thousands of enrollments, authentication attempts, audit entries.

This volume serves two purposes: it keeps performance validation honest, because queries run against realistic data rather than an empty schema, and it gives elective tests a rich, production-like substrate to verify against.

08 — The Pyramid: Big Picture

The logic follows a progressive escalation:

  1. Unit tests — feedback in seconds, no infrastructure. First reflex after every change.
  2. Functional tests — a Docker clean start, then sequential end-to-end tests. The reference verdict.
  3. Elective tests — specialized checks launched on demand, for stable but critical topics.
  4. Operational tests — long-running, parallel, time-compressed. Where subtle problems reveal themselves.

At each level, you add complexity and execution time. The rule: start with the fastest and cheapest tests, stabilize at that level, then escalate progressively. If a problem can be caught by a unit test, it shouldn't require a two-hour operational test to be found.

Composable building blocks

A key aspect of this architecture: each test level is an independent component. You can chain them freely, combine them, distribute them. A human can run unit tests while the coding agent handles a clean start and functional tests. Or the reverse. Operational tests run in separate terminals without interfering with anything else.

This composability is essential for AI-first development. The agent needs clear, autonomous validation blocks with unambiguous verdicts. The tests give it exactly that.

Scenario                Command                                               Typical duration
Quick validation (CI)   mvn test -pl ezkey-tests                              ~2 min
Full standard suite     mvn test -pl ezkey-tests -P all-tests                 ~10 min
Smoke check             mvn test -pl ezkey-tests -P smoke-tests               <1 min
Elective tests          mvn test -pl ezkey-tests -P elective-tests            ~10 min
Operational churn       mvn test -pl ezkey-tests -P operational-churn-tests   2+ hours

09 — What I Learned Along the Way

This strategy didn't appear overnight. Ezkey is a solo project, and I started like many do: with naive vibe coding, hardly any tests, and overconfidence that "it works on my machine."

The introduction of coding agents in 2025 changed the game. When an agent modifies code autonomously, without a safety net, regressions arrive fast. A clear contract between human and agent was needed: here are the guardrails, here's how we verify, here are the verdicts.

The progression was gradual: unit tests first, then the functional suite on a clean Docker start, then elective checks, and finally the operational tier.

Each step was motivated by a concrete need — never by a theoretical ambition for coverage. And each step made the next cycle more effective: the more reliable guardrails the agent has, the more work you can delegate with confidence.

The next frontier: UI tests with Playwright. A foundation already exists for validating critical Admin UI scenarios, but it's intentionally limited. The goal is to grow it in an orderly fashion, building on the same foundations — Docker, clean start, test independence — and the same pragmatic values learned along the way. No big bang, the same incremental path.

Ezkey's testing strategy follows a simple thread: start small, validate fast, escalate in complexity only when justified. Each test level has a precise and complementary role. And together they form a system where human and agent collaborate with confidence grounded in concrete verifications — not assumptions.

It's a virtuous cycle: the stronger the tests, the more autonomy is possible. The more autonomy is possible, the more you invest in tests. And it's this cycle that makes a solo AI-first project not just viable, but productive.