Robot Framework is a test framework, not a programming language
by Rainer Haupt
TL;DR: Robot Framework delivers its greatest value when used purely as a test specification format. At the test-case level there are only keywords and variables; all technical logic lives in Python libraries. This architectural principle is not a private opinion: ISTQB standards, IEEE 829, the test-automation literature, and RF creator Pekka Klärck all converge on it. A test case with IF/ELSE is, formally, not a single test case but several implicit ones that should be split apart.
Reading time approx. 12 min · As of: 2026-04
Since version 4.0 Robot Framework has systematically introduced programming-language constructs: IF/ELSE, TRY/EXCEPT, WHILE, VAR. The community asked for these features and they have a place in specific scenarios. For test automation, however, the RF maintainers’ recommendation remains unchanged: no complex logic at the test-case level. This article shows why that recommendation is anchored in international standards — and how the architecture works in practice.
Test cases are specifications, not programs
The entire BDD movement rests on one premise: tests describe behaviour, not implementation. Dan North, the founder of behaviour-driven development, described the move from «tests» to «scenarios» as a deep paradigm shift. The Agile Alliance uses the terms «scenario» and «specification» rather than «test». Gojko Adzic documented the principle even more sharply in his Jolt-Award-winning book «Specification by Example»: automation must not change the specification.
What does that mean for Robot Framework? RF test cases like Submit Application or Verify Notice For Correctness are behavioural descriptions in domain language. They state what the system should do. How it does so lives in the Python libraries underneath. As soon as IF/ELSE, FOR loops, or VAR syntax appear at the test-case level, that boundary blurs. The test case becomes a program. And as a programming language, Python is simply better than RF.
The Cucumber documentation phrases the yardstick this way: scenarios describe the intended behaviour of the system, not the implementation. The decisive test reads: does this text need to change when the implementation changes? If yes, it is not a specification but a script.
What ISTQB, IEEE, and ISO say
The demand for logic-free test cases is not a matter of taste. It is anchored in international standards.
The ISTQB Glossary (version 3.7) defines a test case as a set of preconditions, inputs, actions, expected results, and postconditions. That is a fixed data structure, not an algorithm, with no room for branching. Given the preconditions and inputs, there is exactly one expected result. The ISTQB CTFL v4.0 syllabus draws the line even more sharply: test cases describe the «what», test procedures the «how».
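The glossary definition can be rendered as a plain data structure. The sketch below is illustrative only — the field names follow the ISTQB wording, not any official schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TestCase:
    """A test case per the ISTQB definition: fixed data, no algorithm."""
    preconditions: tuple[str, ...]
    inputs: dict
    actions: tuple[str, ...]
    expected_results: tuple[str, ...]
    postconditions: tuple[str, ...]

# Given these preconditions and inputs, exactly one expected result.
tc = TestCase(
    preconditions=("User is logged in",),
    inputs={"date": "10.12.2024"},
    actions=("Submit Application With Date",),
    expected_results=("Notice is available",),
    postconditions=("Application is stored",),
)
```

Nothing in this structure has room for a branch: a second possible outcome would be a second test case.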
The ISTQB Advanced Level Test Analyst Syllabus (CTAL-TA v4.0) demands unambiguity: only one interpretation per test case should exist. Vague terms such as «suitable», «as needed», or «several» should be avoided. A test case with an IF/ELSE branch would necessarily carry multiple interpretations depending on system state at execution time. That contradicts the requirement directly.
IEEE 829 (Standard for Software Test Documentation) specifies test cases with «exact input values» and «exact output values». ISO/IEC/IEEE 29119 Part 3 follows the same scheme: preconditions, concrete inputs, steps, expected results. Part 5 explicitly addresses keyword-driven testing as a method for declarative test description.
None of these standards provides for conditional branching inside a test case.
The gTAA architecture
The Generic Test Automation Architecture (gTAA) from the ISTQB Advanced Level Test Automation Engineer Syllabus defines the architectural separation in layers. The Test Definition Layer holds test cases and test data (declarative). The Test Adaptation Layer connects to the system under test via APIs and protocols (imperative). Control structures only appear with the «structured scripting» technique and belong to the execution layer, not to test definition.
Keyword-driven testing is presented in the TAE syllabus as the highest maturity level of test automation. The reasoning: test definitions are formulated such that test analysts can easily understand them.
The RF Certified Professional (RFCP) syllabus maps Robot Framework explicitly onto this gTAA. .robot files correspond to the Definition Layer. Python libraries correspond to the Adaptation Layer. The architecture is no accident — it is deliberately aligned with the ISTQB standard.
Conditional test logic as a formal test smell
Gerard Meszaros classified conditional test logic as a formal test smell in his standard work «xUnit Test Patterns» (2007). His argument: code with a single execution path always behaves the same. Code with multiple paths makes it considerably harder to gain confidence in correctness. Tests with branches or loops are not fully deterministic. According to Meszaros, control structures in test methods deserve «extreme suspicion».
The Google Testing Blog published the often-cited «Don’t Put Logic in Tests» by Erik Kuefler in 2014. The core message: in tests, simplicity matters more than flexibility. Production code describes general strategies for computing outputs. Tests are concrete examples of input-output pairs. The more operators, loops, and conditions a test contains, the harder it becomes to ensure its correctness.
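The difference is easy to demonstrate. In the hypothetical sketch below, format_price stands in for production code; the first test mirrors the implementation's logic (a shared bug would pass unnoticed), while the second states literal input-output pairs:

```python
def format_price(cents: int) -> str:
    """Stand-in production function: formats a cent amount as a price string."""
    return f"{cents // 100}.{cents % 100:02d} EUR"

def test_with_logic():
    # Smell: the expected value is computed with the same logic as the
    # implementation, so the test can never catch a bug in that logic.
    for cents in (100, 1999):
        assert format_price(cents) == f"{cents // 100}.{cents % 100:02d} EUR"

def test_with_examples():
    # Better: concrete input-output pairs, readable at a glance.
    assert format_price(100) == "1.00 EUR"
    assert format_price(1999) == "19.99 EUR"

test_with_logic()
test_with_examples()
```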
«Software Engineering at Google» (chapter 12) elevates «Don’t put logic in tests» to a core principle, ranked equally with «Test via public APIs» and «Test behaviours, not methods».
Martin Fowler argues fundamentally in «Eradicating Non-Determinism in Tests» (2011): non-deterministic tests are useless and a virulent infection that can ruin an entire test suite.
Path explosion under conditional logic
The German test-quality firm Qualicen has quantified the problem. Each IF condition in a test case doubles the possible paths. Four conditions yield 16 possible paths. The test result becomes irreproducible because it remains unclear which path was actually taken. The recommendation: split into separate test cases, one path per test case.
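The doubling is simple combinatorics. A few lines of Python enumerate the paths (execution_paths is an illustrative helper, not part of any tool mentioned here):

```python
from itertools import product

def execution_paths(n_conditions: int) -> list[tuple[bool, ...]]:
    """Enumerate every path through a test with n independent IF conditions."""
    return list(product([True, False], repeat=n_conditions))

print(len(execution_paths(1)))  # 2
print(len(execution_paths(4)))  # 16
```

Each of those 16 tuples is, in effect, its own implicit test case hidden inside one nominal one.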
This principle is so established that it has been formalised in static analysis rules. The ESLint plugin for Jest contains no-conditional-in-test. Playwright has the same rule. QUnit calls its variant no-conditional-assertions and justifies it explicitly with the detection of non-deterministic tests.
The official RF position confirms the approach
The strongest confirmation comes directly from the RF creator. The official document «How To Write Good Test Cases», maintained by Pekka Klärck and the RF core team, leaves no room for doubt: no complex logic at the test-case level, no FOR loops or IF/ELSE constructs. Test cases should not look like scripts. At most 10 steps, domain language, understandable for product owners and customers.
In the RF forum Klärck confirmed: anyone writing the top-level test description in Robot can implement all keywords in Python. Robin Mackaij, an active RF community contributor, puts it even more directly: for test cases Python is a poor choice. He himself puts the bulk of the logic into Python and uses RF as a runner framework.
QaSkills states it explicitly: complex logic belongs in Python keywords, not in RF syntax. The tabular format is not designed for programming.
Role separation resolves the two-language problem
A common objection to RF: you have to learn two languages, RF syntax and Python. Tesena describes RF as less developer-friendly because getting started demands more up-front effort. In the RF forum a user asks: why not just use Python as the programming language and RF only as a support library?
The resolution is simple when the architecture is applied consistently: nobody learns both languages.
Business testers learn the RF syntax: how to call keywords and pass variables. That takes two hours. Python developers write the keyword libraries in Python — they know it anyway. Product owners read RF tests as domain-level specifications. Each role uses the language it is strong in.
Every new programming-language construct in RF (VAR, IF, WHILE, TRY/EXCEPT) raises the learning curve for business testers. With the pure keyword approach, that curve drops to a minimum. Xebia warns in its three-part RF analysis: when non-programmers start to code, considerable stability and maintainability risks follow.
Python keywords in practice
The architecture has two layers. .robot files hold the test cases: domain language, only keywords and variables, no control structures. Python libraries hold all technical logic: arbitrary complexity, the full Python feature set.
A common counter-argument: user keywords (defined in .robot files) cannot be called from Python directly. That is true. It is also irrelevant when keywords are written as Python methods with the @keyword decorator. Such keywords are regular Python functions. They can be called directly from other Python methods, with no detour via BuiltIn().run_keyword(). Full IDE support — type hints, autocomplete, refactoring. The intermediate layer of .robot user keywords disappears entirely.
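A minimal sketch of the idea — ApplicationLibrary and its keywords are hypothetical names, and the fallback decorator merely lets the snippet run where Robot Framework is not installed:

```python
try:
    from robot.api.deco import keyword  # the real decorator when RF is installed
except ImportError:
    def keyword(name=None):  # stand-in so the sketch runs without RF
        def decorator(func):
            func.robot_name = name
            return func
        return decorator

class ApplicationLibrary:
    @keyword("Submit Application With Date")
    def submit_application_with_date(self, date: str) -> str:
        # A keyword is an ordinary method; the return value is a plain string.
        return f"submitted:{date}"

    @keyword("Resubmit Application")
    def resubmit_application(self, date: str) -> str:
        # Direct Python call — no BuiltIn().run_keyword() detour needed.
        return self.submit_application_with_date(date)

lib = ApplicationLibrary()
assert lib.resubmit_application("10.12.2024") == "submitted:10.12.2024"
```

Because both keywords are regular methods on one class, the IDE can type-check, autocomplete, and refactor the call from one to the other like any other Python code.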
A concrete example shows the approach. The .robot file holds only the domain description:
*** Test Cases ***
Application Submission With Valid Date Is Processed
    ${Date} =    Set Variable    10.12.2024
    Submit Application With Date    ${Date}
    Notice Should Be Available
    Check Notice Date    ${Date}
The Python library implements all the logic. Validation, database access, interface calls, error handling — all in Python, all testable with standard Python tooling.
from robot.api.deco import keyword

@keyword("Submit Application With Date")
def submit_application_with_date(self, date: str):
    # Parse and validate the business-format date before calling the API.
    parsed = self._parse_date(date)
    response = self.api_client.submit(application_date=parsed)
    assert response.status == 200, f"Submission failed: {response.status}"
A further counter-argument concerns logging: Python logic, the objection goes, appears in log.html only as an opaque keyword call. In practice that is not the case. Python keywords use robot.api.logger and can log at any level of granularity. Domain-named keywords like Submit Application or Check Notice are more meaningful in the log than IF/ELSE branches with technical conditions.
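A hedged sketch of such logging — check_notice_date is a hypothetical keyword, and the fallback to the standard logging module only serves to make the snippet runnable outside an RF run:

```python
try:
    from robot.api import logger  # writes into log.html when run under RF
except ImportError:
    import logging
    logging.basicConfig(level=logging.DEBUG)
    logger = logging.getLogger("demo")  # stand-in outside an RF run

def check_notice_date(expected: str, actual: str) -> None:
    """Keyword-level assertion that logs its intermediate steps."""
    logger.debug(f"Comparing notice date: expected={expected}, actual={actual}")
    if actual != expected:
        logger.info(f"Mismatch detected for expected date {expected}")
    assert actual == expected, f"Notice date {actual} != {expected}"

check_notice_date("10.12.2024", "10.12.2024")
```

Every debug and info call lands in log.html under the keyword that produced it, so the log stays as fine-grained as the library author chooses.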
RF beyond GUI tests
Public perception of Robot Framework is dominated by GUI testing. The most popular libraries are SeleniumLibrary and Browser Library. Most tutorials cover web automation. Comparison articles pit RF against Playwright or Cypress.
That comparison is a category error. Playwright and Cypress are GUI test frameworks. In scenarios with heterogeneous interfaces (API, database, messaging, batch, mainframe) they don’t apply. RF with Python libraries can address any interface, because the library layer is arbitrarily extensible.
Enterprise reality looks different from the tutorial landscape. KONE has tested embedded software for elevators with RF for over 10 years. Texas Instruments runs more than 2 million test cases for calculator firmware via XML-RPC API. Altran (now Capgemini Engineering) developed the Mainframe3270 library for IBM mainframe tests. SnapLogic published a complete RF integration covering API tests, database operations, Docker, JMS, Kafka, and AWS. Sogeti reports experience with RF in complex system landscapes, including API tests, end-to-end tests, and backend tests.
The invisibility has a simple reason. Texas Instruments states it directly: the software is proprietary and cannot be shared. Enterprise backend test automation lives behind closed doors. There is no quantitative survey breaking down RF usage by test type.
The library ecosystem for non-GUI scenarios is broad. For REST APIs there is RequestsLibrary; for databases the DatabaseLibrary with support for Oracle, MySQL, PostgreSQL, SQLite, and DB2. Mainframe tests are covered by the Mainframe3270 library, originally developed by Altran for IBM 3270 terminals. For messaging there are libraries for RabbitMQ and Apache Kafka. From 2026 a new MQLibrary for IBM MQ joins them, presented at RoboCon 2026 in Helsinki. SSH, file systems, and processes are addressed via SSHLibrary, OperatingSystem, and Process.
Any pip-installable Python library can be exposed as an RF keyword in a few lines using the @keyword decorator. Extensibility is unlimited because the adaptation layer is plain Python.
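As an illustration, the sketch below wraps the standard-library statistics module instead of a pip package; StatisticsLibrary and Median Of are made-up names, and the fallback decorator stands in where RF is not installed:

```python
import statistics

try:
    from robot.api.deco import keyword
except ImportError:
    def keyword(name=None):  # stand-in decorator when RF is not installed
        def decorator(func):
            func.robot_name = name
            return func
        return decorator

class StatisticsLibrary:
    """Exposes an existing Python library as RF keywords in a few lines."""

    @keyword("Median Of")
    def median_of(self, *values: float) -> float:
        # Delegate straight to the wrapped library; the keyword adds no logic.
        return statistics.median(float(v) for v in values)

lib = StatisticsLibrary()
assert lib.median_of(1, 3, 2) == 2
```

In a .robot file the test would then simply read `Median Of    1    3    2` — the entire wrapping effort is the class above.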
Keyword-driven tests and LLM comprehension
RF keywords use natural language. User Should Be Logged In or Submit Application With Date are sentences that an LLM understands as domain description. That differs fundamentally from Python test code, which an LLM interprets as programming logic.
An ACM study (EASE 2024) confirms it: structured, keyword-like test specifications lead to faster and qualitatively better AI-generated test scripts than unstructured approaches. A Springer study (2025) shows that combining RAG (retrieval-augmented generation) with LLMs significantly improves RF test quality and reliability.
LLMs understand Python as code better than RF syntax. That is not in dispute. In the context of domain semantics the relationship reverses, however. A keyword Cancel Contract End Of Month carries domain meaning immediately. The equivalent Python method cancel_contract_end_of_month() requires the LLM to translate code semantics into domain meaning first.
RF evolves; the test approach stays
Robot Framework has systematically introduced programming-language constructs since version 4.0. IF/ELSE (RF 4.0, 2021), TRY/EXCEPT and WHILE (RF 5.0, 2022), VAR syntax (RF 7.0, 2024). The community asked for these features: the GitHub issues for IF/ELSE and TRY/EXCEPT were classified as «priority: critical».
This evolution has its reasons. Run Keyword If was objectively problematic: no nesting, no multi-step logic. Many RF users work without a Python foundation. For RPA scenarios (Robocorp’s earlier core market) conditional logic at the RF level is indispensable.
For the «RF as test specification format + Python as the technical layer» approach, nothing changes. Set Variable still works. VAR is not enforced. The new control structures are available but not mandatory. Existing code does not break. The official best-practice recommendation from Pekka Klärck remains unchanged: no complex logic at the test-case level.
Anyone using RF consistently as a test specification format benefits from the framework’s continued development (performance, JSON output, Python 3.14 compatibility) without having to use programming-language features at the test-case level.
Sources
- ISTQB Glossary v3.7 — Test Case
- ISTQB CTAL-TA v4.0 Syllabus
- ISTQB TAE Syllabus (gTAA)
- IEEE 829 — Standard for Software Test Documentation
- RFCP Syllabus — Architecture chapter
- Pekka Klärck — «How To Write Good Test Cases»
- Gerard Meszaros — «Conditional Test Logic»
- Google Testing Blog — «Don’t Put Logic in Tests»
- Martin Fowler — «Eradicating Non-Determinism in Tests»
- Qualicen — Conditionals in test cases
- KONE — Robot Framework at KONE
- ACM EASE 2024 — Keyword-driven LLM test generation
- Dan North — «Introducing BDD»
- Cucumber — «Writing Better Gherkin»
Are your tests filled with IF/ELSE at the test-case level and turning into a maintenance burden? In the UTAA workshop we separate test specification from test mechanics along the gTAA architecture. Learn more about the method or get in touch directly.