
AI-native testing tools in regulated industries — why keyword-driven is not yet replaceable

by Rainer Haupt

TL;DR: AI-native testing tools such as testRigor, mabl and Virtuoso cannot replace keyword-driven frameworks for compliance-critical test execution in regulated industries today. The core problem is structural, not a maturity issue: self-healing is non-deterministic and collides with FDA, ISO 26262, DO-178C, IEC 62304 and DORA. Keyword-driven frameworks deliver determinism, reproducibility and traceable change management by design. With ISO/IEC/IEEE 29119-5:2024 as a standard vocabulary, the gap widens rather than closes.

Reading time approx. 13 min · As of: 2026-04


The AI-augmented-testing vendor pitch sounds like efficiency: write tests in natural language, selectors heal themselves, the tool generates test cases from user stories. For non-regulated web applications those are real benefits. In industries where every test run has to be legally traceable — pharma, medical devices, automotive, aerospace, banking — the assessment changes fundamentally. This analysis compares the three most prominent AI-native tools with keyword-driven frameworks against the concrete demands of FDA, ISO 26262, DO-178C, IEC 62304 and DORA.

Compliance claims are thin and mostly aspirational

| Tool | Certifications | FDA claim | On-premise | Regulated customers |
|---|---|---|---|---|
| testRigor | ISO 27001, SOC 2 Type II, HIPAA, GDPR | 21 CFR Part 11 reporting support (not self-validated) | in the Enterprise plan | Medrio (eClinical / pharma) |
| mabl | SOC 2 Type II | none | no, cloud-only | ToS forbids PHI and PCI data |
| Virtuoso | SOC 2 Type II | mentions 21 CFR Part 11 in marketing, no certification | no, cloud-only (AWS EU) | last funding November 2021, valuation around USD 5.1 million |

Three findings cut across the table:

  • No tool has formal tool qualification under ISO 26262-8 Clause 11 or DO-330.
  • No tool ships IQ/OQ/PQ validation packages for GxP environments.
  • No tool holds IEC 62304 compliance claims.

For comparison: deterministic tools such as Parasoft C/C++test carry TÜV-SÜD certification for IEC 62304. The certification gap is not a detail, it is a structural difference — and it has a technical reason.

Self-healing versus determinism — the fundamental conflict

What regulators demand:

  • FDA (Software Validation) — requirements must be “consistently fulfilled”.
  • ISO 26262 — tools must “comply with specified requirements” demonstrated through validation testing.
  • DO-178C — provable bidirectional traceability at four points.
  • GAMP 5 — validation rests on control, reproducibility, traceability.

What the AI tools do:

| Tool | Self-healing approach | Problem for regulation |
|---|---|---|
| mabl | 30+ attributes per UI element, GenAI for semantic matching (e.g. “Confirm” → “Approve”); on a plan run, applied automatically without user approval | The same test can take different interaction paths across runs |
| Virtuoso | around 95 % self-healing accuracy; own docs: “If your test step can’t be healed with confidence… the test step will fail.” | A 5 % uncertainty rate is incompatible with safety-critical testing |
| testRigor | resolves elements from the end-user perspective (visible labels); self-healing optional, marked with “fixed-by-ai” tags, rollback possible | The most stable of the three — DOM interpretation still varies dynamically |

The FDA explicitly distinguishes “locked” algorithms (same result every time) from “adaptive” ones. All three tools fall into the second category. PMC research on pharma GMP puts it bluntly: “regulatory agencies remain cautious due to the non-deterministic nature of AI/ML algorithms.”
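The distinction can be made concrete with a toy sketch. Everything here is illustrative: the element table, the names, and the use of difflib as a stand-in for a vendor's semantic matching.

```python
import difflib

# Illustrative UI snapshot: element IDs as they existed at validation time.
UI_ELEMENTS = {"submit-button": "Submit", "cancel-button": "Cancel"}

def resolve_locked(element_id: str) -> str:
    """Locked behaviour: exact lookup, identical result on every run."""
    if element_id not in UI_ELEMENTS:
        raise LookupError(f"element '{element_id}' not found -- test fails")
    return UI_ELEMENTS[element_id]

def resolve_adaptive(element_id: str) -> str:
    """Adaptive behaviour: fall back to the closest match ('self-healing')."""
    if element_id in UI_ELEMENTS:
        return UI_ELEMENTS[element_id]
    close = difflib.get_close_matches(element_id, UI_ELEMENTS, n=1, cutoff=0.6)
    if close:
        return UI_ELEMENTS[close[0]]  # which path is taken depends on the current DOM
    raise LookupError(element_id)

# After a rename in the application, the two behaviours diverge:
print(resolve_adaptive("submit-btn"))   # -> Submit (healed silently)
# resolve_locked("submit-btn")          # -> LookupError (fails deterministically)
```

The locked lookup fails loudly the moment the UI changes; the adaptive lookup silently takes a different path, which is exactly what validation frameworks cannot tolerate.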

Change control — where AI tools fail most critically

In regulated environments, every change to a validated test must run through formal change control: documented change request, risk assessment, approval by qualified personnel, implementation, verification. That sequence is non-negotiable under FDA 21 CFR Part 11, GAMP 5, ISO 26262 and DO-178C.

AI tools do the opposite — “change first, review later”:

| Tool | Behaviour | Regulatory issue |
|---|---|---|
| mabl | Auto-heal plus automatic element-model update on a passing plan run; review afterwards on the Insights page | No approval gate, no dual authorisation, no formal CR workflow |
| Virtuoso | Manual acceptance of self-heals required (“click of a button”); revisions history available | No formal change-impact analysis, no electronic signatures |
| testRigor | “fixed-by-ai” tags and rollback | Visibility yes — but no pre-execution approval workflow |

The FDA's 2025 PCCP guidance for AI/ML is even sharper: even approved AI adaptations must specify upfront what changes, how each change is validated, what acceptance criteria apply and how rollback works. Self-healing tests that adapt ad hoc structurally fail that framework.

Standardised keyword vocabularies as an unassailable advantage

ISO/IEC/IEEE 29119-5:2024 (second edition, December 2024) formalises keyword-driven testing with a standardised vocabulary, hierarchical keyword structures and a common data exchange format for tool interoperability. Robot Framework’s keyword-driven architecture aligns structurally with the model described in 29119-5 — a de-facto implementation, without the Robot Framework Foundation asserting any formal conformance claim.

The traceability chain in regulated industries then looks like this:

Requirements → Standardised Keywords → Test Specifications → Execution Logs → Results
     ↑                  ↑                       ↑                  ↑               ↑
  Jira / ALM   29119-5-aligned            .robot files         output.xml      report.html
                vocabulary                (Git-versioned)     (step by step)   (audit-ready)
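To illustrate how that chain can be queried, the sketch below extracts a requirements trace from an execution log. The embedded XML is a simplified stand-in for Robot Framework's real output.xml schema, trimmed to the elements the sketch needs.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a Robot Framework output.xml (the real schema is richer).
OUTPUT_XML = """
<robot>
  <suite name="Login">
    <test name="Valid login">
      <tag>REQ-001</tag>
      <status status="PASS"/>
    </test>
    <test name="Lockout after three failures">
      <tag>REQ-002</tag>
      <status status="FAIL"/>
    </test>
  </suite>
</robot>
"""

def requirement_trace(xml_text: str) -> dict[str, list[tuple[str, str]]]:
    """Map each REQ-* tag to the tests and verdicts that cover it."""
    trace: dict[str, list[tuple[str, str]]] = {}
    for test in ET.fromstring(xml_text).iter("test"):
        verdict = test.find("status").get("status")
        for tag in test.iter("tag"):
            if tag.text and tag.text.startswith("REQ-"):
                trace.setdefault(tag.text, []).append((test.get("name"), verdict))
    return trace

print(requirement_trace(OUTPUT_XML))
# -> {'REQ-001': [('Valid login', 'PASS')], 'REQ-002': [('Lockout after three failures', 'FAIL')]}
```

Because the log is plain, versioned XML, the requirement-to-verdict mapping is reproducible from the artefact alone — no vendor dashboard required.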

Side by side:

| Capability | Robot Framework | testRigor | mabl | Virtuoso |
|---|---|---|---|---|
| Native RTM | [Tags] REQ-001 per test | no, only via TestRail / Jira / Xray | no | native requirements feature |
| Execution logs | output.xml and log.html with step-by-step traces, timestamps, screenshots | PDF / Word / CSV reports | screenshots + steps | revisions history |
| Electronic signatures | no (via CI/CD infrastructure) | no | no | no |
| Version control | .robot = plain text → Git-native | proprietary platform | cloud-based | cloud-based |
| Change history / diff | standard Git diff | audit trail (since April 2023) | limited | revisions history |

Vendor lock-in as an existential risk

Product lifecycles in regulated industries are long:

  • Medical devices: 10–30+ years
  • Automotive ECUs: 15–25 years
  • Pharma: keep test evidence for product lifetime plus years
  • Aerospace: certification data for aircraft lifetime (30+ years)

What happens if the tool vendor disappears, gets acquired or pivots the platform during that time? The export options of the three tools:

| Tool | Export | Verdict |
|---|---|---|
| testRigor | no export to Selenium, Playwright, Cypress or standard languages | total lock-in |
| mabl | export via CLI to Playwright (TS), Selenium IDE, JSON, CSV | lossy — no auto-heal, no visual tests, no flows |
| Virtuoso | claims “various formats” without specification | proprietary NL format, hard to port |
| Robot Framework | plain text, Apache 2.0, Foundation-backed, no vendor | indefinitely maintainable |

In financial services, DORA gives this concentration risk regulatory weight: its third-party risk-management provisions require financial firms to assess concentration risk for critical ICT providers. Depending on a single SaaS testing vendor for compliance-critical test infrastructure is exactly the kind of risk DORA addresses.

Data protection rules out cloud-only tools for many scenarios

| Requirement | testRigor | mabl | Virtuoso |
|---|---|---|---|
| On-premise | in the Enterprise plan | no | no |
| Air-gapped deployment | unclear | no | no |
| PHI processing allowed | yes (HIPAA) | no, ToS forbids it explicitly | unclear |
| PCI data allowed | unclear | no, ToS forbids it explicitly | unclear |
| EU data residency | unclear | no | yes (AWS EU) |

That makes mabl and Virtuoso architecturally incompatible with healthcare, financial and defence applications — independent of AI quality.

Verdict, with nuance

The clear claim first: AI-native testing tools cannot replace keyword-driven frameworks for compliance-critical test execution in regulated industries today. But not everything in a regulated organisation has the same compliance demand:

| Scenario | AI tools possible? | Recommendation |
|---|---|---|
| Safety-critical tests (ASIL D, SIL 3/4) | no | deterministic keywords, formally qualified |
| GxP validation (IQ / OQ / PQ) | no | keyword-driven with audit trail |
| Compliance-relevant regression tests | only with self-healing disabled | testRigor with AI features off + on-premise — conceivable |
| Low-risk smoke and exploratory tests | yes | AI tools as a complement next to a validated core |
| Non-regulated subsystems | yes | AI tools genuinely useful for efficiency |

Among the three tools, testRigor comes architecturally closest to the compliance bar: no locator-based testing, self-healing optional and disableable, an on-premise option, HIPAA compliance. With AI features disabled, however, it is essentially a natural-language keyword layer, which largely undoes the AI value proposition.

The regulatory landscape is moving. The FDA PCCP framework, GAMP 5 in its second edition and the developing ISO 29119 Part 11 for AI-based systems acknowledge AI/ML as reality. They add requirements rather than relaxing determinism expectations.

When organisations use standardised keywords aligned with ISO/IEC/IEEE 29119-5, the package becomes hard to replace from a regulatory perspective:

  • Each keyword has defined preconditions and postconditions with documented outcomes.
  • Keywords map directly to requirement IDs.
  • Plain text is Git-versionable — full change control included.
  • Deterministic execution returns identical results for identical inputs.
  • Vendor-independent and maintainable across 30+ year lifecycles.
  • Audit-ready HTML reports without additional tooling.
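The precondition/postcondition discipline can be sketched in a few lines. Both the keyword and the stand-in application below are hypothetical illustrations, not Robot Framework syntax:

```python
class FakeApp:
    """Minimal stand-in for a system under test."""
    def __init__(self) -> None:
        self.page = "login"

    def submit_credentials(self, user: str, password: str) -> None:
        self.page = "dashboard" if password == "secret" else "login"

def keyword_login(app: FakeApp, user: str, password: str) -> None:
    """Keyword 'Login': documented pre- and postconditions, checked on every run."""
    assert app.page == "login", "precondition: login page must be displayed"
    app.submit_credentials(user, password)
    assert app.page == "dashboard", "postcondition: dashboard must be displayed"

app = FakeApp()
keyword_login(app, "alice", "secret")  # identical inputs -> identical verdict
```

Identical inputs always produce the identical verdict; there is no healing path that could silently change the interaction.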

The honest version of the conclusion: AI-augmented testing has its place in regulated industries — as a complement in areas where the compliance bar is lower. As a replacement for the validated deterministic core, it is not a viable option today.

Evaluating test tooling for a regulated environment, or assessing an existing setup against FDA, ISO 26262, DO-178C, IEC 62304 or DORA? In the UTAA workshop we define the compliance architecture specifically for your project. Learn more about the method, or get in touch directly.
