
Evaluating Robot Framework tests with AI

by Rainer Haupt

TL;DR: RF keywords are natural language — sentence embeddings (SBERT) detect business-level redundancy at F-score 87 % (Viggiato et al., IEEE TSE 2023). LLM-as-a-Judge evaluates test coverage against Jira acceptance criteria: GPT-4o Mini reaches 6.07 MAAE at around USD 1.01 per 1,000 evaluations. RF tags and keyword names form a semantic bridge between requirements and tests, with no code understanding required. The biggest open gap: as of 2026, no tool combines an RF suite + Jira stories + LLM into an automated gap analysis — an unfilled niche. Counter-argument: RF keyword abstraction hides implementation details an LLM would need for certain assessments.

Reading time approx. 18 min · As of: 2026-04


Business-level redundancy detection — why RF syntax is more transparent for LLMs

In grown test suites with 500+ tests, business redundancies arise not from copy-paste but from semantic overlap: two tests cover the same business scenario, written by different people at different times. The names differ, the implementation differs — but functionally both test the same thing.

Why RF has a structural advantage

An LLM or embedding model has to recognise the business intent of a test when looking for redundancy. In RF that intent is on the surface.

RF test pair — functionally redundant:

*** Test Cases ***
Verify User Can Login With Valid Credentials
    Open Browser To Login Page
    Input Username    demo
    Input Password    mode
    Submit Credentials
    Welcome Page Should Be Open

Verify Successful Authentication With Correct Data
    Navigate To Login
    Enter Username    demo
    Enter Password    mode
    Click Login
    Dashboard Should Be Visible

A sentence-embedding model (SBERT) computes a high cosine similarity for the test names (around 0.85+) because both consist of common English words. The keyword sequences are also semantically close: “Open Browser To Login Page” sits near “Navigate To Login”, “Submit Credentials” near “Click Login”.

The same pair in pytest:

def test_login_valid(driver):
    driver.get("http://localhost:7272")
    driver.find_element(By.ID, "username_field").send_keys("demo")
    driver.find_element(By.ID, "password_field").send_keys("mode")
    driver.find_element(By.ID, "login_button").click()
    assert driver.current_url.endswith("/welcome.html")

def test_auth_success(browser):
    browser.get("http://localhost:7272")
    browser.find_element(By.CSS_SELECTOR, "#username").send_keys("demo")
    browser.find_element(By.CSS_SELECTOR, "#password").send_keys("mode")
    browser.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
    WebDriverWait(browser, 10).until(EC.title_is("Dashboard"))

Here the LLM has to understand that By.ID, "username_field" and By.CSS_SELECTOR, "#username" point to the same element, that .click() on login_button versus button[type=submit] is functionally identical, and that assert driver.current_url.endswith("/welcome.html") checks the same thing as EC.title_is("Dashboard"). That is technical analysis, not business analysis — and exactly the kind that is hard for embedding models because the tokens carry no natural-language context.
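The vocabulary effect can be sketched in a few lines. The code below uses a deliberately crude bag-of-words cosine as a stand-in for SBERT (a real pipeline would use sentence-transformers, which additionally pulls paraphrases such as “Submit Credentials” and “Click Login” together); even this stand-in finds measurable overlap between the two RF tests, purely because their names and keywords share plain English words:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding': a crude stand-in for real SBERT vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Test name plus keyword sequence, flattened to one string per test
test_a = ("Verify User Can Login With Valid Credentials "
          "Open Browser To Login Page Input Username Input Password "
          "Submit Credentials Welcome Page Should Be Open")
test_b = ("Verify Successful Authentication With Correct Data "
          "Navigate To Login Enter Username Enter Password "
          "Click Login Dashboard Should Be Visible")

print(f"{cosine(embed(test_a), embed(test_b)):.2f}")
```

A trained sentence-embedding model scores this pair higher than lexical overlap alone, since it also captures synonyms; the point here is that RF gives even the simplest text model something to work with.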

What the research says

The Viggiato study (IEEE TSE 2023) is the central reference: F-score 87.39 % on clustering similar test steps and 83.47–86.13 % on test cases — using SBERT + cosine similarity + hierarchical clustering. The study ran on natural-language test cases, which corresponds exactly to the RF format.

For code-based tests, the LTM study (Pan et al., IEEE TSE 2024) recommends specialised code-embedding models like UniXcoder — which work but have a higher entry barrier and cover the business level less well.

The decisive point: RF tests fall directly into the “natural-language test cases” category for which SBERT embeddings are optimised. Pytest tests fall into the “code” category, which requires specialised code embeddings. For business-level redundancy detection one wants the natural-language level — and that is RF’s sweet spot.

Test coverage against requirements — Jira stories, acceptance criteria, wikis

A team has 200 RF tests and 50 Jira stories with acceptance criteria. The question: which acceptance criteria are covered by tests? Which are not? Where are the gaps?

Why RF makes LLM matching easier

Jira story:

As a user I want to be able to reset my password
so that I regain access to my account.

Acceptance criteria:
- AC1: User can request reset via email link
- AC2: Link expires after 24 hours
- AC3: New password must have at least 8 characters
- AC4: Old password becomes invalid after reset
- AC5: User receives confirmation email after change

RF test suite:

*** Test Cases ***
User Requests Password Reset Via Email    [Tags]    password-reset    AC1
    Navigate To Login Page
    Click Forgot Password Link
    Enter Email Address    user@example.com
    Submit Reset Request
    Reset Email Should Be Received

Reset Link Expires After 24 Hours    [Tags]    password-reset    AC2
    Request Password Reset
    Wait 24 Hours    # simulated
    Open Reset Link
    Page Should Contain    Link expired

New Password Must Meet Minimum Length    [Tags]    password-reset    AC3
    Open Valid Reset Link
    Enter New Password    short
    Submit New Password
    Error Message Should Be Visible    Minimum 8 characters

An LLM can match directly: AC1 ↔ “User Requests Password Reset Via Email”, AC2 ↔ “Reset Link Expires After 24 Hours”, AC3 ↔ “New Password Must Meet Minimum Length”. The tags (AC1, AC2, AC3) provide an extra signal, but even without tags the semantic match via test names would be highly accurate.

What is missing? AC4 and AC5 have no matching test — gap identified.
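With consistently applied AC tags, this particular gap check does not even need an LLM. A few lines over the tag columns suffice (a sketch using the suite above; real code would read the tags from parsed .robot files or output.xml):

```python
import re

# Acceptance criteria from the Jira story (AC4/AC5 have no test on purpose)
acceptance_criteria = {"AC1", "AC2", "AC3", "AC4", "AC5"}

# Tag lines as they appear in the suite above
suite = """\
User Requests Password Reset Via Email    [Tags]    password-reset    AC1
Reset Link Expires After 24 Hours    [Tags]    password-reset    AC2
New Password Must Meet Minimum Length    [Tags]    password-reset    AC3
"""

covered = set(re.findall(r"\bAC\d+\b", suite))
gaps = sorted(acceptance_criteria - covered)
print(gaps)  # → ['AC4', 'AC5']
```

The LLM is only needed when tags are missing or inconsistent and the match has to run over test-name semantics instead.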

The same in pytest:

def test_password_reset_request(driver): ...
def test_reset_link_expiry(driver): ...
def test_password_min_length(driver): ...

The abbreviated function names (test_password_reset_request) carry less semantic signal than the RF test names. Without docstrings or comments, the LLM has to read the code to understand the business intent. Possible, but more error-prone and effortful.

LLM-as-a-Judge — weighted rubric for test coverage

The LAJ framework (LLM-as-a-Judge) uses a four-dimensional weighted rubric: scenario completeness (40 %), acceptance-criteria alignment (30 %), HTTP-method-specific aspects (20 %) and assertion quality (10 %).

The study evaluated a three-stage pipeline:

  1. Jira ticket with acceptance criteria — created by product owners
  2. Gherkin tests — generated or hand-written
  3. LLM evaluation — GPT-4o Mini scores coverage against criteria

Result: GPT-4o Mini reaches the best accuracy (6.07 MAAE) at high reliability (96.6 % ECR@1) and low cost (around USD 1.01 per 1,000 evaluations) — a 78× cost reduction over GPT-5.

RF relevance: the study uses Gherkin (Given-When-Then), which RF supports natively. RF tests with BDD syntax (Given ... When ... Then ...) fall straight into this scheme. Without BDD syntax, RF keyword names are still semantically close enough for the matching.

Multi-source analysis — Jira + wiki + PDF + tests

The real scenario is more complex: requirements scattered across Jira stories, Confluence wikis, PDF specifications and email threads. A RAG-based approach (Retrieval-Augmented Generation) can load all sources into a vector index and match RF tests against it.

Pipeline sketch:

  1. Jira API → extract stories and acceptance criteria
  2. Confluence API → wiki pages with functional specs
  3. PDFs → extract text
  4. Embed everything in a vector DB (e.g. Chroma, Pinecone)
  5. Parse RF tests → test name + keywords + tags + [Documentation] as text
  6. For each test: nearest-neighbour search in the requirements DB
  7. For each requirement: check whether a test exists with high similarity
  8. Gaps = requirements with no close test match

In step 5 RF tests produce significantly better text representations than pytest, because test name and keyword sequence are already natural language. In pytest the code would first have to be translated into a natural-language description (e.g. via LLM summary), introducing an extra error-prone step.
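Step 5 can be sketched in a few lines. The parser below is intentionally naive (unindented line = test name, indented line = keyword call, first four-space-separated cell = keyword name); a real pipeline would use the official robot.api parser instead:

```python
def rf_tests_to_text(robot_source: str) -> dict[str, str]:
    """Step 5 of the pipeline: map test name -> natural-language text
    (name plus keyword names), ready for embedding. A line-parsing
    sketch only; robot.api handles the full grammar."""
    texts, current, in_tests = {}, None, False
    for line in robot_source.splitlines():
        if line.startswith("***"):
            in_tests = "Test Cases" in line
        elif in_tests and line.strip() and not line.lstrip().startswith("#"):
            if not line[0].isspace():       # unindented line: a test name
                current = line.strip()
                texts[current] = current
            elif current:                   # indented line: a keyword call
                texts[current] += ". " + line.strip().split("    ")[0]
    return texts

suite = """\
*** Settings ***
Library    SeleniumLibrary

*** Test Cases ***
Verify User Can Login With Valid Credentials
    Open Browser To Login Page
    Input Username    demo
    Submit Credentials
"""

texts = rf_tests_to_text(suite)
print(texts["Verify User Can Login With Valid Credentials"])
# → Verify User Can Login With Valid Credentials. Open Browser To
#   Login Page. Input Username. Submit Credentials
```

The resulting strings go straight into the vector DB of step 4; no LLM summarisation pass is needed, which is exactly the advantage over pytest.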

Equivalence classes and boundary-value analysis — can an LLM find missing tests?

A requirement says: “The user’s age must be between 18 and 65.” An LLM should check whether the test suite covers the right equivalence classes and boundary values.

Expected test cases per ISTQB:

Equivalence class             Values          Expected
Below minimum (invalid)       17, 0, -1       Reject
Minimum boundary (valid)      18              Accept
Valid (middle)                30, 42          Accept
Maximum boundary (valid)      65              Accept
Above maximum (invalid)       66, 100, 999    Reject

RF makes the data values visible — pytest hides them

RF with Test Template:

*** Settings ***
Test Template    Age Validation Should

*** Test Cases ***            AGE    EXPECTED
Below Minimum                 17     Reject
Minimum Boundary              18     Accept
Valid Middle                  30     Accept
Maximum Boundary              65     Accept
Above Maximum                 66     Reject
Zero                          0      Reject
Negative                      -1     Reject

*** Keywords ***
Age Validation Should
    [Arguments]    ${age}    ${expected}
    Enter Age    ${age}
    Submit Form
    Result Should Be    ${expected}

pytest with parametrize:

@pytest.mark.parametrize("age,expected", [
    (17, "Reject"),
    (18, "Accept"),
    (30, "Accept"),
    (65, "Accept"),
    (66, "Reject"),
    (0, "Reject"),
    (-1, "Reject"),
])
def test_age_validation(driver, age, expected):
    driver.find_element(By.ID, "age").clear()
    driver.find_element(By.ID, "age").send_keys(str(age))
    driver.find_element(By.ID, "submit").click()
    result = driver.find_element(By.ID, "result").text
    assert result == expected

At first glance coverage looks the same in both. But three points make the difference for LLM analysis.

Speaking test names. “Below Minimum”, “Minimum Boundary”, “Maximum Boundary” — the LLM immediately recognises the test-design intent. In pytest all variants are called the same: test_age_validation[17-Reject], test_age_validation[18-Accept]. The functional classification is missing.

Tabular structure. Data is formatted as a table — AGE and EXPECTED as columns. An LLM can read this table directly as an equivalence-class matrix. In pytest the data is hidden in a Python tuple list inside the decorator.

LLM prompt for gap analysis. With RF syntax the LLM can answer a prompt such as the following directly:

Here is a requirement: "Age must be between 18 and 65."
Here are the current RF test cases: [insert test list]

Check:
1. Are all equivalence classes covered?
2. Are the boundary values 17, 18, 65, 66 tested?
3. Which test cases are missing?

The LLM understands the test names and values as natural language. A recent study on LLM-generated boundary-value analysis shows 63.5 % positive ratings (4–5 on a five-point Likert scale), with software professionals appreciating clear structure and practical examples. A second study evaluates LLMs on generating tests with equivalence partitions and boundary values and shows that effectiveness depends strongly on precise requirements and well-designed prompts.
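The boundary-value half of this check is deterministic and can be pre-computed, so the LLM only has to reason about the fuzzier questions (invalid types, security inputs). A sketch, assuming the bounds 18 and 65 have already been extracted from the requirement:

```python
def missing_boundary_values(minimum: int, maximum: int,
                            tested: set[int]) -> list[int]:
    """Two-value boundary analysis per ISTQB: min-1, min, max and max+1
    must each appear in the test data."""
    required = {minimum - 1, minimum, maximum, maximum + 1}
    return sorted(required - tested)

# Values from the Test Template table above
tested = {17, 18, 30, 65, 66, 0, -1}
print(missing_boundary_values(18, 65, tested))         # → [] (all covered)
print(missing_boundary_values(18, 65, {18, 30, 65}))   # → [17, 66]
```

Feeding this pre-computed result into the prompt also reduces the hallucination risk discussed below: the LLM no longer has to invent the boundary values itself.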

What an LLM might flag as missing in this suite

A capable LLM would point out:

  • Missing equivalence class: non-numeric inputs (letters, special characters, empty string)
  • Missing boundary value: decimal numbers (17.5, 18.0, 65.5)
  • Missing edge case: very large numbers (9999999), leading zeros (“018”)
  • Missing negative tests: SQL injection in the age field, XSS

In RF the LLM identifies these gaps more easily because the existing coverage is laid out as a functional table. In pytest it has to parse the tuple list first and reconstruct the functional meaning of the values.

RF tags as a requirements traceability mechanism

RF offers a built-in mechanism with tags that is ideal for LLM-based analysis:

*** Test Cases ***
Admin Creates User    [Tags]    JIRA-1234    admin    user-mgmt    smoke
    ...

User Edits Own Profile    [Tags]    JIRA-1235    user    profile    regression
    ...

Admin Deletes User    [Tags]    JIRA-1236    admin    user-mgmt    critical
    ...

What an LLM can do with tags:

  • Generate a traceability matrix — which test belongs to which Jira story
  • Coverage report — JIRA-1234 has three tests, JIRA-1237 has none → gap
  • Risk analysis — tests tagged critical that have not run for 30 days
  • Story-level redundancy — two stories with overlapping tests

RF generates tag statistics automatically in output.xml (total / passed / failed / skipped per tag). An LLM can combine those statistics with Jira data to produce a coverage report. Pytest has a similar concept with @pytest.mark, but markers sit inside the code and have to be extracted by AST parsing. RF tags sit directly in output.xml — machine-readable without code analysis.
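That combination can be sketched as follows: parse the per-tag statistics out of output.xml and diff them against the sprint's Jira keys. The XML fragment below mimics the shape of the statistics/tag section (RF 4+ records pass/fail/skip counts per tag); a real output.xml contains far more:

```python
import xml.etree.ElementTree as ET

# Fragment shaped like the <statistics><tag> section of output.xml;
# illustrative sample data, not a real run.
stats_xml = """
<statistics>
  <tag>
    <stat pass="3" fail="0" skip="0">JIRA-1234</stat>
    <stat pass="1" fail="1" skip="0">JIRA-1235</stat>
    <stat pass="2" fail="0" skip="0">user-mgmt</stat>
  </tag>
</statistics>
"""

jira_stories = {"JIRA-1234", "JIRA-1235", "JIRA-1236"}  # keys from the Jira API

root = ET.fromstring(stats_xml)
per_tag = {s.text: {"pass": int(s.get("pass")), "fail": int(s.get("fail"))}
           for s in root.iter("stat")}

covered = {tag for tag in per_tag if tag in jira_stories}
print(sorted(jira_stories - covered))  # → ['JIRA-1236'] (story without tests)
```

Note that no code analysis was required at any point: the traceability signal lives entirely in tags and result XML.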

Practical example — full business-level test-review pipeline

An LLM-based test-review pipeline for RF could look like this.

Step 1 — collect data

Jira API → stories + acceptance criteria for sprint 42
RF parser → test suites from /tests/*.robot
Confluence API → functional spec "Registration process"

Step 2 — LLM analysis with a structured prompt

You are a senior QA engineer. Analyse the business-level
quality of this test suite.

REQUIREMENTS:
[insert Jira stories + acceptance criteria]

TEST SUITE:
[insert RF tests]

Check:
1. COVERAGE: which acceptance criteria have matching tests?
   Which are not covered?
2. REDUNDANCY: are there tests that functionally check the same?
3. TEST DESIGN: are equivalence classes and boundary values
   considered for numeric inputs?
4. MISSING SCENARIOS: which functional scenarios are missing?
   (negative tests, edge cases, permissions)

Step 3 — LLM output (example)

COVERAGE:
- AC1 (request reset) → test "User Requests Password Reset" (covered)
- AC2 (link expiry) → test "Reset Link Expires After 24 Hours" (covered)
- AC3 (password length) → test "New Password Must Meet Minimum Length" (covered)
- AC4 (old password invalid) → NO TEST
- AC5 (confirmation email) → NO TEST

REDUNDANCY:
- "Verify Login With Valid User" and "Successful Authentication Test"
  both check the same happy path with the same data.

TEST DESIGN:
- Password length: only negative test for "too short". Missing:
  - boundary: exactly 8 characters (minimum)
  - boundary: 7 characters (one below minimum)
  - equivalence class: special characters, spaces
  - maximum: is there an upper bound?

MISSING SCENARIOS:
- What happens with an invalid email address on reset?
- What about repeated reset requests in quick succession?
- Rate limiting for reset requests?

Why this would be harder with pytest

The same pipeline with pytest would work, but:

  • Step 2 needs more tokens — the LLM context must contain the entire Python code, not just names and keywords.
  • Lower matching accuracy: def test_pw_reset_req() carries less semantic signal than “User Requests Password Reset Via Email”.
  • Equivalence-class analysis requires code understanding — the LLM has to parse @pytest.mark.parametrize decorators instead of reading an RF data table.

Comparison — RF, pytest and Gherkin for business-level LLM analysis

Dimension                                    RF                     pytest                         Gherkin
Test names as functional description         standard               optional (often abbreviated)   scenario names
Keyword sequence readable                    natural language       API calls                      Given/When/Then
Data values visible (equivalence classes)    Test Template table    parametrize tuple              Scenario Outline
Tags / traceability                          [Tags] + output.xml    @pytest.mark                   @tags in feature
Self-executable                              yes                    yes                            no (step definitions)
LLM tokens per test                          ~75                    ~130                           ~50 (feature) + ~200 (steps)
Embedding quality (SBERT)                    high (NL)              low (code)                     high (NL, feature only)
Requirements matching                        direct                 via code analysis              direct (feature only)

Gherkin and Cucumber are as good as RF for the purely functional evaluation of feature files — often even more readable. But: as soon as you want to check whether the tests actually do what the feature files promise, you have to analyse the step definitions — and those are code. RF does not have that gap: what is in the test gets executed.

Limits and counter-arguments

Keyword abstraction as a downside. Login To System in an RF test says nothing about whether the login goes via UI, API or database. For some functional assessments this is irrelevant (Is login working at all? Yes). For others it is critical (Is the login flow tested from the user’s perspective, or only the API?). An LLM often cannot answer that question from the .robot file alone — it would need to see the keyword implementation.

Quality depends on keyword naming. RF tests are only as analysable as the keywords are named. Step 1, Do Thing, Check Result are just as bad as test_001() in pytest. RF does not enforce good names.
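This, too, can be partially automated: a naming lint that flags obviously generic keyword or test names before they ever reach an LLM-based analysis. The patterns below are illustrative heuristics, not an established rule set:

```python
import re

# Heuristic patterns for non-descriptive names; extend per project.
GENERIC = re.compile(r"^(step\s*\d+|do\s+\w+|check\s+result|test[_ ]?\d+)$",
                     re.IGNORECASE)

def flag_generic_names(names: list[str]) -> list[str]:
    """Return the names that carry no business meaning."""
    return [n for n in names if GENERIC.match(n.strip())]

print(flag_generic_names(
    ["Step 1", "Do Thing", "Check Result", "Submit Credentials"]))
# → ['Step 1', 'Do Thing', 'Check Result']
```

A check like this fits naturally into a pre-commit hook or CI step, so the suite stays analysable as it grows.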

LLMs hallucinate during test review. If asked whether all boundary values are covered, an LLM may invent boundary values that are not in the requirement. Human review remains indispensable.

No tool fills the niche. As of April 2026, no tool checks RF test suites automatically against Jira, Confluence and PDFs for functional completeness. The building blocks exist (Jira API, RF parser, LLM APIs, vector DBs), but the integration is missing. The Result Companion tool (PyPI, 2026) analyses output.xml for failure root causes but does not perform requirements coverage analysis. Testomat.io offers AI duplicate detection, but primarily for Cypress and Jest.

Verdict

RF tests fall directly into the category for which NL embeddings and LLMs work best. For redundancy detection, requirements mapping and gap analysis the bar is low — the building blocks are available, the research base is solid. What is missing is an integrated tool. Anyone with a large RF suite and good acceptance criteria can build a custom pipeline with manageable effort.

Three recommendations for use. First, name keywords descriptively, because analysis quality depends on naming. Second, use tags consistently — ideally with Jira IDs as tags, because traceability then arises automatically. Third, do not replace human review, because LLMs can invent boundary values that were never specified.

Looking to check an RF suite against Jira stories automatically or to run a redundancy analysis? In the UTAA workshop we assess pipeline architecture, tooling and prompt strategy against your project.
