
Robot Framework and LLM code generation

by Rainer Haupt

TL;DR: Robot Framework’s keyword-driven syntax hits the sweet spot for LLM generation — natural-language enough to leverage the models’ language ability, structured enough to minimise syntactic errors. Concretely, an LLM only has to master around five rules in Robot Framework (RF) versus around 15 in pytest — and produces around 40 % fewer tokens for the same test logic. Research shows: a DSL with explicit, canonical syntax can deliver 99.9 % syntactically correct output even without training data (Anka study, 2024). The Achilles heel: the invisible whitespace semantics (two or more spaces as separator) — exactly the opposite of what LLMs do well.

Reading time approx. 13 min · As of: 2026-04


What an LLM has to get right in pytest — and not in RF

The decisive difference is not in expressive power but in error surface: how many things can an LLM get wrong before the test no longer runs?

In pytest with Selenium an LLM has to generate the following correctly: exact import paths (from selenium.webdriver.common.by import By — not from selenium import By), fixture decorators with the right scope (@pytest.fixture(scope="module") — not scope="test"), the yield mechanism for teardown, conftest.py in the right place, @pytest.mark.parametrize spelt correctly (not parameterize — a notorious LLM mistake), the distinction between plain assert and unittest-style self.assertEqual, plus plugin fixtures like mocker or page that fail with “fixture not found” unless the corresponding plugin is installed.

In Robot Framework this reduces to four section headers (*** Settings ***, *** Test Cases ***, *** Keywords ***, *** Variables ***), library names (Library SeleniumLibrary), keyword names, the two-space rule and variable syntax (${var}). That is it.

Here is the same login test in both frameworks — note the amount of “syntactic knowledge” each requires:

Robot Framework:

*** Settings ***
Library    SeleniumLibrary

*** Test Cases ***
Valid Login
    Open Browser    http://localhost:7272    Chrome
    Input Text      username_field    demo
    Input Text      password_field    mode
    Click Button    login_button
    Title Should Be    Welcome Page
    [Teardown]    Close Browser

pytest + Selenium:

import pytest
from selenium import webdriver
from selenium.webdriver.common.by import By

@pytest.fixture
def browser():
    driver = webdriver.Chrome()
    yield driver
    driver.quit()

def test_valid_login(browser):
    browser.get("http://localhost:7272")
    browser.find_element(By.ID, "username_field").send_keys("demo")
    browser.find_element(By.ID, "password_field").send_keys("mode")
    browser.find_element(By.ID, "login_button").click()
    assert browser.title == "Welcome Page"

The RF test takes around 75 tokens, the pytest test around 130 — roughly 70 % more. More importantly: the pytest code has at least ten places where an LLM can break the test (wrong imports, missing yield, wrong By selector, forgotten driver.quit()). The RF code has three to four (wrong library name, wrong keyword spelling, missing [Teardown]). Even keyword spelling is forgiving: Click Button, click_button, CLICK BUTTON and Click_Button all work — RF keyword matching is case-insensitive and ignores spaces and underscores.
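A minimal sketch of that tolerance (the test name is illustrative, SeleniumLibrary imported as above): all four rows below resolve to the same keyword.

*** Test Cases ***
Keyword Spelling Tolerance
    # All four rows call SeleniumLibrary's "Click Button" keyword
    Click Button    login_button
    click_button    login_button
    CLICK BUTTON    login_button
    Click_Button    login_button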

Linguistic reasons — why LLMs handle natural language better than code syntax

Three concepts from computational linguistics explain why RF is structurally favourable for LLMs.

Semantic overlap with training data. LLMs were trained on billions of tokens of natural language. RF keywords like Open Browser, Click Button, Page Should Contain or User Should Be Logged In are ordinary English sentences. The model has strong representations for these words and their relationships. When an LLM produces Click Button login, it uses the same semantic links as the sentence “Click the login button”. For driver.find_element(By.ID, "login-btn").click() it has to reproduce a specific API syntax that occurs in only a fraction of training data.

Low ambiguity through canonical forms. In Python there are many paths to the goal — an LLM might produce browser.get(), page.goto(), requests.Session(), driver.navigate() or half a dozen other APIs for a login test. RF has exactly one canonical form per action: Open Browser, Input Text, Click Button. The Anka DSL study (2024) demonstrated this principle: a DSL with explicit, canonical syntax achieved 100 % accuracy on complex tasks while Python plateaued at 60 % — a 40-percentage-point lead, with zero prior training on the DSL.

Flat grammar near regular languages. The ChomskyBench study (2025) showed that LLM performance drops with increasing complexity in the Chomsky hierarchy — regular and simple context-free languages are processed better than context-sensitive ones. RF’s grammar sits low in that hierarchy, between regular and simple context-free: sections → test names → indented rows of keywords and arguments. Python has context-sensitive elements (scoping, indentation semantics, type inference). RF has an estimated 20–30 grammar rules; Python has over 100.
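To make the flat structure concrete, here is an annotated skeleton (URL and names are placeholders) with the three structural levels and nothing else:

# Level 1: section headers
*** Settings ***
Library    SeleniumLibrary

*** Variables ***
${URL}    http://localhost:7272

*** Test Cases ***
# Level 2: a test name on its own line
Example Test
    # Level 3: indented rows, keyword first, arguments after, separated by two or more spaces
    Open Browser    ${URL}    Chrome
    Title Should Be    Welcome Page
    [Teardown]    Close Browser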

These three factors together explain why RF keywords work as a kind of “API in natural language” — they leverage what LLMs do best (language understanding) instead of what they do worst (precise syntax reproduction).

Direct comparison — four test scenarios in five frameworks

API test

Robot Framework:

*** Settings ***
Library    RequestsLibrary

*** Test Cases ***
Get User Details
    ${response}=    GET    https://jsonplaceholder.typicode.com/posts/1
    Status Should Be    200    ${response}
    Should Be Equal As Strings    1    ${response.json()}[id]

pytest + requests:

import requests

def test_get_user_details():
    response = requests.get("https://jsonplaceholder.typicode.com/posts/1")
    assert response.status_code == 200
    assert response.json()["id"] == 1

Cypress:

it('should get user details', () => {
    cy.request('GET', 'https://jsonplaceholder.typicode.com/posts/1')
        .then((response) => {
            expect(response.status).to.eq(200)
            expect(response.body.id).to.eq(1)
        })
})

For API tests the RF advantage is smaller, because pytest with requests is also compact. But: Status Should Be 200 is self-documenting, while in pytest an LLM has to know whether to use assert response.status_code == 200, assert response.ok or response.raise_for_status().

Setup and teardown — where the difference becomes clear

Robot Framework:

*** Settings ***
Library           SeleniumLibrary
Suite Setup       Open Browser    ${URL}    Chrome
Suite Teardown    Close Browser
Test Setup        Go To    ${URL}/login

pytest:

@pytest.fixture(scope="module")
def browser():
    driver = webdriver.Chrome()
    yield driver
    driver.quit()

@pytest.fixture(autouse=True)
def navigate_to_login(browser):
    browser.get("http://localhost:7272/login")
    yield

RF declares setup and teardown — four lines in the settings, no fixtures, no yield, no scope parameter, no autouse. In pytest an LLM has to understand the entire fixture cascade: what does scope="module" mean versus scope="session"? When is yield called? What happens if fixture A depends on fixture B? Those questions simply do not exist in RF.

Data-driven tests — RF’s strength

Robot Framework with Test Template:

*** Settings ***
Test Template    Login Should Fail

*** Test Cases ***                USERNAME        PASSWORD
Invalid Username                 invalid         ${VALID PASS}
Invalid Password                 ${VALID USER}   invalid
Both Invalid                     invalid         invalid
Empty Username                   ${EMPTY}        ${VALID PASS}
Empty Password                   ${VALID USER}   ${EMPTY}

*** Keywords ***
Login Should Fail
    [Arguments]    ${username}    ${password}
    Input Text     username_field    ${username}
    Input Text     password_field    ${password}
    Click Button   login_button
    Page Should Contain    Error

pytest with parametrize:

@pytest.mark.parametrize("username,password", [
    ("invalid", "valid_pass"),
    ("valid_user", "invalid"),
    ("invalid", "invalid"),
    ("", "valid_pass"),
    ("valid_user", ""),
])
def test_login_should_fail(browser, username, password):
    browser.find_element(By.ID, "username_field").send_keys(username)
    browser.find_element(By.ID, "password_field").send_keys(password)
    browser.find_element(By.ID, "login_button").click()
    assert "Error" in browser.page_source

The RF version reads like a data table — every row has its own test name and is shown as a separate test case in the report. An LLM only has to know the Test Template concept. In pytest it has to spell @pytest.mark.parametrize correctly (frequent mistake: parameterize), master the tuple syntax in the list and pass parameter strings correctly inside the decorator.

Gherkin is also natural-language — but has a fundamental problem

Gherkin (Cucumber, BDD) is the most obvious comparison point because it is also natural-language:

Feature: Login
  Scenario: Successful login
    Given the user is on the login page
    When the user enters "admin" as username
    And the user enters "secret" as password
    And the user clicks the login button
    Then the welcome page should be displayed

LLMs generate this with high reliability — GPT-4 produces a syntax error in only 1 in 50 feature files (2024 study). But Gherkin is not executable. Every step needs a step definition in a programming language:

@Given("the user is on the login page")
public void theUserIsOnTheLoginPage() {
    driver.get("http://example.com/login");
}

@When("the user enters {string} as username")
public void theUserEntersUsername(String username) {
    driver.findElement(By.id("username")).sendKeys(username);
}
// ... plus three more step definitions

That is the glue-code gap: the LLM has to work correctly in two different syntax worlds and synchronise them. The industrial AutoUAT study (2024, Critical TechWorks / BMW) quantified this: LLMs generated Gherkin scenarios with 95 % acceptance, but conversion to executable Cypress scripts had only 60 % initial accuracy. Four out of ten generated tests were unusable on arrival.

RF resolves this by composing keywords from other keywords — without leaving RF syntax:

*** Test Cases ***
Successful Login
    Given The User Is On The Login Page
    When The User Enters Valid Credentials
    Then The Welcome Page Should Be Displayed

*** Keywords ***
The User Is On The Login Page
    Open Browser    ${URL}    ${BROWSER}

The User Enters Valid Credentials
    Input Text    id=username    admin
    Input Text    id=password    secret
    Click Button  id=login-btn

The Welcome Page Should Be Displayed
    Title Should Be    Welcome

Everything in one file, in one syntax. No language switch, no regex pattern matching, no step-definition files. Where Gherkin loses its natural-language LLM advantage to glue-code overhead, RF keeps it throughout.

Gherkin does have one advantage: more training data. Cucumber/Gherkin has around 6.5 % market share versus RF’s roughly 5 % (6sense data), and .feature files are common in repositories. As a pure specification language — feature files without step definitions — Gherkin is even simpler for LLMs.

Weaknesses — where RF syntax causes real LLM problems

RF is not a paradise for LLM generation. Three weaknesses weigh heavily.

The whitespace trap is RF’s biggest LLM problem. RF uses two or more spaces as a column separator, but a single space is a normal character inside keywords and arguments. Click Button    login (four spaces → keyword + argument) and Click Button login (one space → a single, unknown keyword called “Click Button login”) look almost identical but mean entirely different things. LLMs generate text token by token, and BPE tokenisers handle whitespace inconsistently — multiple spaces are often compressed into a single token or split unpredictably. The RF User Guide itself acknowledges: “The biggest problem of the space delimited format is that visually separating keywords from arguments can be tricky.” For LLMs the problem is worse than for humans because the distinction is invisible and RF error messages do not point to wrong spacing but to “wrong arguments”. Workaround: the pipe format (| Keyword | arg1 | arg2 |) makes separators explicit and visible — much more LLM-friendly, as shown below.
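For illustration, the login test from the beginning in pipe-separated format: every cell boundary is a visible pipe character, so a swallowed space can no longer silently merge keyword and argument.

| *** Settings ***   |
| Library            | SeleniumLibrary |

| *** Test Cases *** |
| Valid Login        |
|                    | Open Browser    | http://localhost:7272 | Chrome |
|                    | Input Text      | username_field        | demo   |
|                    | Input Text      | password_field        | mode   |
|                    | Click Button    | login_button          |
|                    | Title Should Be | Welcome Page          |
|                    | [Teardown]      | Close Browser         |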

Less training data than Python. RF uses .robot files — a niche extension typically not prioritised in LLM training. The Grammar Prompting study (2023) confirms: “DSLs are by definition specialized and thus unlikely to have been encountered often enough during pretraining.” However, the Anka study shows that with appropriate prompting (few-shot examples, grammar descriptions) the problem is almost fully compensated — a DSL with no training data at all reached 99.9 % parse success.
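What such prompting can look like in practice, as a rough sketch: a short grammar description plus one few-shot example in the prompt (the wording below is illustrative, not taken from the study).

You write Robot Framework (.robot) files. Rules:
- Sections: *** Settings ***, *** Variables ***, *** Test Cases ***, *** Keywords ***
- Separate keyword and arguments with FOUR spaces, never a single space.
- Variables are written ${name}; libraries are imported with "Library    SeleniumLibrary".

Task: Verify that the start page shows the title "Welcome Page".
Output:

*** Settings ***
Library    SeleniumLibrary

*** Test Cases ***
Start Page Shows Title
    Open Browser    http://localhost:7272    Chrome
    Title Should Be    Welcome Page
    [Teardown]    Close Browser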

Variable syntax and outdated APIs. RF’s four variable prefixes (${scalar}, @{list}, &{dict}, %{env}) are unique in the programming landscape. The semantic differences are subtle: ${my_list} returns the list object, @{my_list} unpacks it — Log Many @{my_list} works, but Log @{my_list} fails because the second element is interpreted as the level parameter. Add to this that RF 5.0+ introduced RETURN instead of [Return] and RF 7.0+ VAR instead of Set Variable — LLMs trained on older code often produce the deprecated syntax.
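A compact sketch of these points (keyword name and values are illustrative; VAR requires RF 7.0 or newer, RETURN requires 5.0 or newer):

*** Keywords ***
Variable Syntax Demo
    VAR    ${name}    demo                     # scalar; replaces the older Set Variable
    VAR    @{users}    alice    bob            # list variable
    VAR    &{config}    retries=3    browser=Chrome    # dict variable
    Log    %{PATH}                             # environment variable
    Log Many    @{users}                       # unpacks the list: two log entries
    Log    ${users}                            # logs the whole list object in one entry
    RETURN    ${name}                          # replaces the deprecated [Return]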

Research and open gaps

The evidence is thin but consistent. There is no direct benchmark comparing LLM generation quality across test frameworks — a serious research gap. What does exist:

The Sezgin et al. (INFUS 2025) study is the only paper that explicitly investigates LLM generation of RF code. It uses Retrieval-Augmented Generation (RAG) and evaluates with CodeBLEU and CI/CD pass/fail rates. Core finding: RAG integration significantly improved test quality and reliability. The Anka DSL study (2024) provides the strongest theoretical argument: a DSL with RF-like properties (canonical forms, explicit syntax, natural-language keywords) reached 100 % accuracy on multi-step tasks versus 60 % for Python — without training on the DSL. The AutoUAT study (2024) quantifies the glue-code gap in Gherkin: 95 % scenario acceptance, but only 60 % executable scripts.

From the community there is practical evidence: in the Robot Framework forum users report successful test generation with VS Code + Copilot (Agent Mode, Claude Sonnet 4) combined with MCP servers for library documentation. The RF MCP server (Many Kasiriha, presented at RoboCon 2026) addresses the hallucination problem specifically by using LibDoc to expose only actually-existing keywords to the LLM. At RoboCon 2026, 8 of around 20 sessions were devoted explicitly to AI + RF — a sign of how central the topic has become to the community.

Verdict — RF’s token-efficient niche in the LLM era

Dimension               | Robot Framework | pytest     | Gherkin / Cucumber
Tokens for login test   | around 75       | around 130 | around 50 (feature) + around 200 (steps)
Error sources per test  | around 5        | over 15    | around 3 (feature) + around 15 (steps)
Case-sensitive          | No              | Yes        | Partial
Self-executable         | Yes             | Yes        | No (needs glue code)
Natural-language        | High            | Low        | Very high (feature only)
Training-data coverage  | Low             | High       | Medium
Whitespace risk         | High            | Low        | None

The core thesis reduces to one sentence: RF minimises the distance between what an LLM naturally produces (natural language) and what the machine can execute (formal test logic). No other test syntax hits this sweet spot as well. pytest is more powerful but syntactically more fragile. Gherkin is more natural-language but not self-executable. Cypress and Playwright are modern but require full programming-language proficiency.

RF’s real innovation for the LLM era is not that it is simple — it is that it is fault-tolerant. Case insensitivity, underscore tolerance and the keyword-driven abstraction create a wide corridor in which an LLM can be “approximately right” and the test still runs. In pytest everything has to be exact. In RF “close enough” is often sufficient — and that is exactly what probabilistic language models do best.

Evaluating AI-assisted test generation for an existing RF setup, or comparing frameworks from an LLM angle? In the UTAA workshop we assess toolchain, prompt strategy and MCP integration against your project.
