Evaluating Robot Framework tests with AI
by Rainer Haupt
TL;DR: RF keywords are natural language — sentence embeddings (SBERT) detect business-level redundancy at F-score 87 % (Viggiato et al., IEEE TSE 2023). LLM-as-a-Judge evaluates test coverage against Jira acceptance criteria: GPT-4o Mini reaches 6.07 MAAE at around USD 1.01 per 1,000 evaluations. RF tags and keyword names form a semantic bridge between requirements and tests, with no code understanding required. The biggest open gap: as of 2026, no tool combines an RF suite + Jira stories + LLM into an automated gap analysis — an unfilled niche. Counter-argument: RF keyword abstraction hides implementation details an LLM would need for certain assessments.
Reading time approx. 18 min · As of: 2026-04
Business-level redundancy detection — why RF syntax is more transparent for LLMs
In grown test suites with 500+ tests, business redundancies arise not from copy-paste but from semantic overlap: two tests cover the same business scenario, written by different people at different times. The names differ, the implementation differs — but functionally both test the same thing.
Why RF has a structural advantage
An LLM or embedding model has to recognise the business intent of a test when looking for redundancy. In RF that intent is on the surface.
RF test pair — functionally redundant:
*** Test Cases ***
Verify User Can Login With Valid Credentials
    Open Browser To Login Page
    Input Username    demo
    Input Password    mode
    Submit Credentials
    Welcome Page Should Be Open

Verify Successful Authentication With Correct Data
    Navigate To Login
    Enter Username    demo
    Enter Password    mode
    Click Login
    Dashboard Should Be Visible
A sentence-embedding model (SBERT) computes a high cosine similarity for the test names (typically above 0.85) because both consist of common English words. The keyword sequences are also semantically close: “Open Browser To Login Page” sits near “Navigate To Login”, “Submit Credentials” near “Click Login”.
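A minimal sketch of that similarity check with sentence-transformers; the model name and the 0.8 threshold are illustrative choices, not values from the study:

from sentence_transformers import SentenceTransformer, util

# Any general-purpose SBERT model works; all-MiniLM-L6-v2 is a common default.
model = SentenceTransformer("all-MiniLM-L6-v2")

test_names = [
    "Verify User Can Login With Valid Credentials",
    "Verify Successful Authentication With Correct Data",
]
embeddings = model.encode(test_names, convert_to_tensor=True)

# Cosine similarity close to 1.0 suggests both names describe the same scenario.
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
if similarity > 0.8:  # the threshold is a tuning choice, not a fixed rule
    print(f"Possible business-level redundancy (cos = {similarity:.2f})")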
The same pair in pytest:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def test_login_valid(driver):
    driver.get("http://localhost:7272")
    driver.find_element(By.ID, "username_field").send_keys("demo")
    driver.find_element(By.ID, "password_field").send_keys("mode")
    driver.find_element(By.ID, "login_button").click()
    assert driver.current_url.endswith("/welcome.html")


def test_auth_success(browser):
    browser.get("http://localhost:7272")
    browser.find_element(By.CSS_SELECTOR, "#username").send_keys("demo")
    browser.find_element(By.CSS_SELECTOR, "#password").send_keys("mode")
    browser.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
    WebDriverWait(browser, 10).until(EC.title_is("Dashboard"))
Here the LLM has to understand that By.ID, "username_field" and By.CSS_SELECTOR, "#username" point to the same element, that .click() on login_button versus button[type=submit] is functionally identical, and that assert driver.current_url.endswith("/welcome.html") checks the same thing as EC.title_is("Dashboard"). That is technical analysis, not business analysis — and exactly the kind that is hard for embedding models because the tokens carry no natural-language context.
What the research says
The Viggiato study (IEEE TSE 2023) is the central reference: F-score 87.39 % on clustering similar test steps and 83.47–86.13 % on test cases — using SBERT + cosine similarity + hierarchical clustering. The study ran on natural-language test cases, which corresponds exactly to the RF format.
For code-based tests, the LTM study (Pan et al., IEEE TSE 2024) recommends specialised code-embedding models like UniXcoder — which work but have a higher entry barrier and cover the business level less well.
The decisive point: RF tests fall directly into the “natural-language test cases” category for which SBERT embeddings are optimised. Pytest tests fall into the “code” category, which requires specialised code embeddings. For business-level redundancy detection one wants the natural-language level — and that is RF’s sweet spot.
Test coverage against requirements — Jira stories, acceptance criteria, wikis
A team has 200 RF tests and 50 Jira stories with acceptance criteria. The question: which acceptance criteria are covered by tests? Which are not? Where are the gaps?
Why RF makes LLM matching easier
Jira story:
As a user I want to be able to reset my password
so that I regain access to my account.
Acceptance criteria:
- AC1: User can request reset via email link
- AC2: Link expires after 24 hours
- AC3: New password must have at least 8 characters
- AC4: Old password becomes invalid after reset
- AC5: User receives confirmation email after change
RF test suite:
*** Test Cases ***
User Requests Password Reset Via Email
    [Tags]    password-reset    AC1
    Navigate To Login Page
    Click Forgot Password Link
    Enter Email Address    user@example.com
    Submit Reset Request
    Reset Email Should Be Received

Reset Link Expires After 24 Hours
    [Tags]    password-reset    AC2
    Request Password Reset
    Wait 24 Hours    # simulated
    Open Reset Link
    Page Should Contain    Link expired

New Password Must Meet Minimum Length
    [Tags]    password-reset    AC3
    Open Valid Reset Link
    Enter New Password    short
    Submit New Password
    Error Message Should Be Visible    Minimum 8 characters
An LLM can match directly: AC1 ↔ “User Requests Password Reset Via Email”, AC2 ↔ “Reset Link Expires After 24 Hours”, AC3 ↔ “New Password Must Meet Minimum Length”. The tags (AC1, AC2, AC3) provide an extra signal, but even without tags the semantic match via test names would be highly accurate.
What is missing? AC4 and AC5 have no matching test — gap identified.
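Where the AC IDs double as tags, that gap check needs only a few lines of Python. A minimal sketch, assuming the tagging scheme shown above:

# Acceptance criteria from the Jira story (IDs assumed to match the RF tags).
acceptance_criteria = {"AC1", "AC2", "AC3", "AC4", "AC5"}

# Test name -> tags, e.g. extracted from the suite source or from output.xml.
tests = {
    "User Requests Password Reset Via Email": {"password-reset", "AC1"},
    "Reset Link Expires After 24 Hours": {"password-reset", "AC2"},
    "New Password Must Meet Minimum Length": {"password-reset", "AC3"},
}

covered = set().union(*tests.values()) & acceptance_criteria
gaps = sorted(acceptance_criteria - covered)
print(f"Covered: {sorted(covered)}")  # ['AC1', 'AC2', 'AC3']
print(f"Gaps:    {gaps}")             # ['AC4', 'AC5']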
The same in pytest:
def test_password_reset_request(driver): ...
def test_reset_link_expiry(driver): ...
def test_password_min_length(driver): ...
The abbreviated function names (test_password_reset_request) carry less semantic signal than the RF test names. Without docstrings or comments, the LLM has to read the code to understand the business intent. Possible, but more error-prone and effortful.
LLM-as-a-Judge — weighted rubric for test coverage
The LAJ framework (LLM-as-a-Judge) uses a four-dimensional weighted rubric: scenario completeness (40 %), acceptance-criteria alignment (30 %), HTTP-method-specific aspects (20 %) and assertion quality (10 %).
The study evaluated a three-stage pipeline:
- Jira ticket with acceptance criteria — created by product owners
- Gherkin tests — generated or hand-written
- LLM evaluation — GPT-4o Mini scores coverage against criteria
Result: GPT-4o Mini reaches the best accuracy (6.07 MAAE) at high reliability (96.6 % ECR@1) and low cost (around USD 1.01 per 1,000 evaluations) — a 78× cost reduction over GPT-5.
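Translated into code, the rubric is a plain weighted sum. A small sketch that assumes each dimension is scored 0 to 10 by the judge; the function and key names are illustrative, not taken from the study:

RUBRIC_WEIGHTS = {
    "scenario_completeness": 0.40,
    "acceptance_criteria_alignment": 0.30,
    "http_method_aspects": 0.20,
    "assertion_quality": 0.10,
}

def weighted_coverage_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension judge scores (0-10) into a single weighted score."""
    return sum(RUBRIC_WEIGHTS[dim] * score for dim, score in dimension_scores.items())

# Strong scenario coverage, weak assertions:
# 0.4*8 + 0.3*7 + 0.2*6 + 0.1*3 = 6.8
print(weighted_coverage_score({
    "scenario_completeness": 8,
    "acceptance_criteria_alignment": 7,
    "http_method_aspects": 6,
    "assertion_quality": 3,
}))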
RF relevance: the study uses Gherkin (Given-When-Then), which RF supports natively. RF tests with BDD syntax (Given ... When ... Then ...) fall straight into this scheme. Without BDD syntax, RF keyword names are still semantically close enough for the matching.
Multi-source analysis — Jira + wiki + PDF + tests
The real scenario is more complex: requirements scattered across Jira stories, Confluence wikis, PDF specifications and email threads. A RAG-based approach (Retrieval-Augmented Generation) can load all sources into a vector index and match RF tests against it.
Pipeline sketch:
- Jira API → extract stories and acceptance criteria
- Confluence API → wiki pages with functional specs
- PDFs → extract text
- Embed everything in a vector DB (e.g. Chroma, Pinecone)
- Parse RF tests → test name + keywords + tags + [Documentation] as text
- For each test: nearest-neighbour search in the requirements DB
- For each requirement: check whether a test exists with high similarity
- Gaps = requirements with no close test match
In step 5 RF tests produce significantly better text representations than pytest, because test name and keyword sequence are already natural language. In pytest the code would first have to be translated into a natural-language description (e.g. via LLM summary), introducing an extra error-prone step.
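For step 5, the parsing and embedding could look like the following sketch, using robot.api, sentence-transformers and an in-memory Chroma collection; the collection name, the text layout and the recursive suite walk are choices of this sketch, not fixed parts of the pipeline:

from robot.api import TestSuiteBuilder
from sentence_transformers import SentenceTransformer
import chromadb

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.Client().create_collection("rf_tests")  # in-memory; swap for a persistent client

def iter_tests(suite):
    # Walk nested suites and yield every test case.
    for child in suite.suites:
        yield from iter_tests(child)
    yield from suite.tests

suite = TestSuiteBuilder().build("tests/")
for test in iter_tests(suite):
    # Name, [Documentation], keyword names and tags are already natural language.
    keyword_names = [getattr(step, "name", "") for step in test.body]
    text = " | ".join(filter(None, [
        test.name,
        str(test.doc),
        " -> ".join(k for k in keyword_names if k),
        " ".join(str(tag) for tag in test.tags),
    ]))
    collection.add(
        ids=[test.longname],
        documents=[text],
        embeddings=[model.encode(text).tolist()],
    )

# Steps 6/7: nearest-neighbour lookup per requirement.
hits = collection.query(
    query_embeddings=[model.encode("Old password becomes invalid after reset").tolist()],
    n_results=1,
)

For each requirement, the query returns the closest test together with a distance; requirements whose nearest test is still far away end up on the gap list of step 7.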
Equivalence classes and boundary-value analysis — can an LLM find missing tests?
A requirement says: “The user’s age must be between 18 and 65.” An LLM should check whether the test suite covers the right equivalence classes and boundary values.
Expected test cases per ISTQB:
| Equivalence class | Values | Expected |
|---|---|---|
| Below minimum (invalid) | 17, 0, -1 | Reject |
| Minimum boundary (valid) | 18 | Accept |
| Valid (middle) | 30, 42 | Accept |
| Maximum boundary (valid) | 65 | Accept |
| Above maximum (invalid) | 66, 100, 999 | Reject |
RF makes the data values visible — pytest hides them
RF with Test Template:
*** Settings ***
Test Template    Age Validation Should

*** Test Cases ***    AGE    EXPECTED
Below Minimum         17     Reject
Minimum Boundary      18     Accept
Valid Middle          30     Accept
Maximum Boundary      65     Accept
Above Maximum         66     Reject
Zero                  0      Reject
Negative              -1     Reject

*** Keywords ***
Age Validation Should
    [Arguments]    ${age}    ${expected}
    Enter Age    ${age}
    Submit Form
    Result Should Be    ${expected}
pytest with parametrize:
import pytest
from selenium.webdriver.common.by import By


@pytest.mark.parametrize("age,expected", [
    (17, "Reject"),
    (18, "Accept"),
    (30, "Accept"),
    (65, "Accept"),
    (66, "Reject"),
    (0, "Reject"),
    (-1, "Reject"),
])
def test_age_validation(driver, age, expected):
    driver.find_element(By.ID, "age").clear()
    driver.find_element(By.ID, "age").send_keys(str(age))
    driver.find_element(By.ID, "submit").click()
    result = driver.find_element(By.ID, "result").text
    assert result == expected
At first glance coverage looks the same in both. But three points make the difference for LLM analysis.
Descriptive test names. “Below Minimum”, “Minimum Boundary”, “Maximum Boundary” — the LLM immediately recognises the test-design intent. In pytest all variants share one name: test_age_validation[17-Reject], test_age_validation[18-Accept]. The functional classification is missing.
Tabular structure. Data is formatted as a table — AGE and EXPECTED as columns. An LLM can read this table directly as an equivalence-class matrix. In pytest the data is hidden in a Python tuple list inside the decorator.
LLM prompt for gap analysis. With RF syntax the LLM can answer a prompt such as the following directly:
Here is a requirement: "Age must be between 18 and 65."
Here are the current RF test cases: [insert test list]
Check:
1. Are all equivalence classes covered?
2. Are the boundary values 17, 18, 65, 66 tested?
3. Which test cases are missing?
The LLM understands the test names and values as natural language. A recent study on LLM-generated boundary-value analysis shows 63.5 % positive ratings (4–5 on a five-point Likert scale), with software professionals appreciating clear structure and practical examples. A second study evaluates LLMs on generating tests with equivalence partitions and boundary values and shows that effectiveness depends strongly on precise requirements and well-designed prompts.
What an LLM might flag as missing in this suite
A capable LLM would point out:
- Missing equivalence class: non-numeric inputs (letters, special characters, empty string)
- Missing boundary value: decimal numbers (17.5, 18.0, 65.5)
- Missing edge case: very large numbers (9999999), leading zeros (“018”)
- Missing negative tests: SQL injection in the age field, XSS
In RF the LLM identifies these gaps more easily because the existing coverage is laid out as a functional table. In pytest it has to parse the tuple list first and reconstruct the functional meaning of the values.
RF tags as a requirements traceability mechanism
RF offers a built-in mechanism with tags that is ideal for LLM-based analysis:
*** Test Cases ***
Admin Creates User
    [Tags]    JIRA-1234    admin    user-mgmt    smoke
    ...

User Edits Own Profile
    [Tags]    JIRA-1235    user    profile    regression
    ...

Admin Deletes User
    [Tags]    JIRA-1236    admin    user-mgmt    critical
    ...
What an LLM can do with tags:
- Generate a traceability matrix — which test belongs to which Jira story
- Coverage report — JIRA-1234 has three tests, JIRA-1237 has none → gap
- Risk analysis — tests tagged critical that have not run for 30 days
- Story-level redundancy — two stories with overlapping tests
RF generates tag statistics automatically in output.xml (total / passed / failed / skipped per tag). An LLM can combine those statistics with Jira data to produce a coverage report. Pytest has a similar concept with @pytest.mark, but markers sit inside the code and have to be extracted by AST parsing. RF tags sit directly in output.xml — machine-readable without code analysis.
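A sketch of collecting those per-tag numbers from output.xml with the Robot Framework result API; the dictionary layout and the report format are up to the pipeline:

from robot.api import ExecutionResult, ResultVisitor

class TagStats(ResultVisitor):
    """Collect per-tag totals and pass counts from an output.xml."""

    def __init__(self):
        self.stats = {}

    def visit_test(self, test):
        for tag in test.tags:
            entry = self.stats.setdefault(str(tag), {"total": 0, "passed": 0})
            entry["total"] += 1
            if test.passed:
                entry["passed"] += 1

result = ExecutionResult("output.xml")
visitor = TagStats()
result.visit(visitor)

# e.g. {'JIRA-1234': {'total': 3, 'passed': 3}, 'critical': {'total': 1, 'passed': 0}, ...}
print(visitor.stats)

Joined with the Jira story list, this dictionary becomes the coverage report described above.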
Practical example — full business-level test-review pipeline
An LLM-based test-review pipeline for RF could look like this.
Step 1 — collect data
Jira API → stories + acceptance criteria for sprint 42
RF parser → test suites from /tests/*.robot
Confluence API → functional spec "Registration process"
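A hedged sketch of the Jira part of this step via the REST search endpoint; the JQL, the custom field that holds the acceptance criteria and the authentication details are project-specific placeholders:

import requests

JIRA_URL = "https://example.atlassian.net"   # placeholder instance
AUTH = ("user@example.com", "API_TOKEN")     # placeholder: basic auth with an API token

def fetch_stories(jql: str) -> list[dict]:
    """Fetch stories and acceptance criteria for one sprint via the Jira search API."""
    response = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={"jql": jql, "fields": "summary,description,customfield_10100"},
        auth=AUTH,
        timeout=30,
    )
    response.raise_for_status()
    stories = []
    for issue in response.json()["issues"]:
        fields = issue["fields"]
        stories.append({
            "key": issue["key"],
            "summary": fields["summary"],
            # customfield_10100 is a placeholder; acceptance criteria live in an
            # instance-specific custom field or inside the description.
            "acceptance_criteria": fields.get("customfield_10100") or fields.get("description") or "",
        })
    return stories

stories = fetch_stories("sprint = 42 AND issuetype = Story")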
Step 2 — LLM analysis with a structured prompt
You are a senior QA engineer. Analyse the business-level
quality of this test suite.
REQUIREMENTS:
[insert Jira stories + acceptance criteria]
TEST SUITE:
[insert RF tests]
Check:
1. COVERAGE: which acceptance criteria have matching tests?
Which are not covered?
2. REDUNDANCY: are there tests that functionally check the same?
3. TEST DESIGN: are equivalence classes and boundary values
considered for numeric inputs?
4. MISSING SCENARIOS: which functional scenarios are missing?
(negative tests, edge cases, permissions)
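Wiring this prompt to a model is then a single chat-completion call. A sketch with the OpenAI Python client; the model name, temperature and message layout are choices of this sketch:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_suite(requirements: str, test_suite: str) -> str:
    """Send the structured review prompt from step 2 to the model."""
    prompt = (
        "You are a senior QA engineer. Analyse the business-level quality of this test suite.\n\n"
        f"REQUIREMENTS:\n{requirements}\n\n"
        f"TEST SUITE:\n{test_suite}\n\n"
        "Check: 1. COVERAGE 2. REDUNDANCY 3. TEST DESIGN 4. MISSING SCENARIOS"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content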
Step 3 — LLM output (example)
COVERAGE:
- AC1 (request reset) → test "User Requests Password Reset" (covered)
- AC2 (link expiry) → test "Reset Link Expires After 24 Hours" (covered)
- AC3 (password length) → test "New Password Must Meet Minimum Length" (covered)
- AC4 (old password invalid) → NO TEST
- AC5 (confirmation email) → NO TEST
REDUNDANCY:
- "Verify Login With Valid User" and "Successful Authentication Test"
both check the same happy path with the same data.
TEST DESIGN:
- Password length: only negative test for "too short". Missing:
- boundary: exactly 8 characters (minimum)
- boundary: 7 characters (one below minimum)
- equivalence class: special characters, spaces
- maximum: is there an upper bound?
MISSING SCENARIOS:
- What happens with an invalid email address on reset?
- What about repeated reset requests in quick succession?
- Rate limiting for reset requests?
Why this would be harder with pytest
The same pipeline with pytest would work, but:
- Step 2 needs more tokens — the LLM context must contain the entire Python code, not just names and keywords.
- Lower matching accuracy — def test_pw_reset_req() carries less semantic signal than “User Requests Password Reset Via Email”.
- Equivalence-class analysis requires code understanding — the LLM has to parse @pytest.mark.parametrize decorators instead of reading an RF data table.
Comparison — RF, pytest and Gherkin for business-level LLM analysis
| Dimension | RF | pytest | Gherkin |
|---|---|---|---|
| Test names as functional description | standard | optional (often abbreviated) | scenario names |
| Keyword sequence readable | natural language | API calls | Given/When/Then |
| Data values visible (equivalence classes) | Test Template table | parametrize tuple | Scenario Outline |
| Tags / traceability | [Tags] + output.xml | @pytest.mark | @tags in feature |
| Self-executable | yes | yes | no (step definitions) |
| LLM tokens per test | around 75 | around 130 | around 50 (feature) + around 200 (steps) |
| Embedding quality (SBERT) | high (NL) | low (code) | high (NL, feature only) |
| Requirements matching | direct | via code analysis | direct (feature only) |
Gherkin and Cucumber are as good as RF for the purely functional evaluation of feature files — often even more readable. But: as soon as you want to check whether the tests actually do what the feature files promise, you have to analyse the step definitions — and those are code. RF does not have that gap: what is in the test gets executed.
Limits and counter-arguments
Keyword abstraction as a downside. Login To System in an RF test says nothing about whether the login goes via UI, API or database. For some functional assessments this is irrelevant (Is login working at all? Yes). For others it is critical (Is the login flow tested from the user’s perspective, or only the API?). An LLM often cannot answer that question from the .robot file alone — it would need to see the keyword implementation.
Quality depends on keyword naming. RF tests are only as analysable as the keywords are named. Step 1, Do Thing, Check Result are just as bad as test_001() in pytest. RF does not enforce good names.
LLMs hallucinate during test review. If asked whether all boundary values are covered, an LLM may invent boundary values that are not in the requirement. Human review remains indispensable.
No tool fills the niche. As of April 2026, no tool checks RF test suites automatically against Jira, Confluence and PDFs for functional completeness. The building blocks exist (Jira API, RF parser, LLM APIs, vector DBs), but the integration is missing. The Result Companion tool (PyPI, 2026) analyses output.xml for failure root causes but does not perform requirements coverage analysis. Testomat.io offers AI duplicate detection, but primarily for Cypress and Jest.
Verdict
RF tests fall directly into the category for which NL embeddings and LLMs work best. For redundancy detection, requirements mapping and gap analysis the bar is low — the building blocks are available, the research base is solid. What is missing is an integrated tool. Anyone with a large RF suite and good acceptance criteria can build a custom pipeline with manageable effort.
Three recommendations for use. First, name keywords descriptively, because analysis quality depends on naming. Second, use tags consistently — ideally with Jira IDs as tags, because traceability then arises automatically. Third, do not replace human review, because LLMs can invent boundary values that were never specified.
Sources
- Viggiato et al. — Identifying Similar Test Cases Specified in Natural Language (IEEE TSE 2023)
- Pan et al. — LTM: Scalable Test Suite Minimization based on Language Models (IEEE TSE 2024)
- LLM-as-a-Judge — Scalable Test Coverage Evaluation (arXiv:2512.01232)
- Understanding on the Edge — LLM-generated Boundary Test Explanations (arXiv:2601.22791)
- Rodríguez et al. — LLMs for Unit Tests with Equivalence Partitions and Boundary Values (Springer 2026)
- Ali et al. — RAG + LLM for Requirements Traceability (ER 2024)
- GenIA-E2ETest — Generative AI for E2E Test Automation with Robot Framework
- Xebia — Robot Framework and the Keyword-Driven Approach
- Eficode — AI with Robot Framework
- Robot Framework Forum — Result Companion Tool
Looking to check an RF suite against Jira stories automatically or to run a redundancy analysis? In the UTAA workshop we assess pipeline architecture, tooling and prompt strategy against your project. Read more about the method, or send us your request directly.