Test Runs

Overview

Test runs validate your LLM functions against a set of test cases. They help ensure that:

- Functions produce expected outputs
- Changes don’t break existing behavior
- Quality is maintained across versions

Creating a Test Run

1. Navigate to a function
2. Click the Test Runs tab
3. Click New Test Run
4. Configure the test run settings
5. Click Run

Input Sets

An input set is the collection of test cases that a test run executes against your function.

Creating an Input Set

1. Go to a function’s Test Runs tab
2. Click Input Sets
3. Click Create Input Set
4. Add test cases as JSON

Input Set Format

[
  {
    "text": "My name is John Doe"
  },
  {
    "text": "Jane Smith is a software engineer"
  },
  {
    "text": "Dr. Robert Johnson, PhD"
  }
]
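Before pasting an input set into the portal, it can help to check its shape locally. The sketch below assumes, based on the example above, that an input set is a JSON array in which every test case is an object. The helper name validateInputSet is hypothetical and is not part of any SDK.

```javascript
// Hypothetical helper: returns a list of problems found in an input set.
// Assumption: a valid input set is a non-empty array of plain objects.
function validateInputSet(inputSet) {
  if (!Array.isArray(inputSet)) {
    return ["input set must be a JSON array"];
  }
  const problems = [];
  if (inputSet.length === 0) {
    problems.push("input set is empty");
  }
  inputSet.forEach((testCase, index) => {
    const isPlainObject =
      testCase !== null &&
      typeof testCase === "object" &&
      !Array.isArray(testCase);
    if (!isPlainObject) {
      problems.push(`test case ${index} is not an object`);
    }
  });
  return problems;
}

const inputSet = JSON.parse(`[
  { "text": "My name is John Doe" },
  { "text": "Jane Smith is a software engineer" }
]`);
console.log(validateInputSet(inputSet)); // prints []
```

An empty result means no structural problems were found. It does not confirm that the keys match what your function expects.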

Importing Input Sets

Import input sets from:

- CSV files: Each row becomes a test case
- JSON files: Array of input objects
- Existing traces: Use real inputs from production
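To illustrate how CSV rows could map to test cases, here is a minimal sketch. It assumes the first row names the input fields and that no cell contains quotes or embedded commas. Neither assumption is confirmed by the import feature itself, and csvToTestCases is a hypothetical helper.

```javascript
// Hypothetical converter from CSV text to an array of test-case objects.
// Assumptions: the first non-empty line is a header row, and cells are
// plain comma-separated values with no quoting.
function csvToTestCases(csvText) {
  const lines = csvText.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const headers = lines[0].split(",").map((header) => header.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(",");
    const testCase = {};
    headers.forEach((header, i) => {
      // Missing cells become empty strings.
      testCase[header] = (cells[i] ?? "").trim();
    });
    return testCase;
  });
}

const csv = "text\nMy name is John Doe\nJane Smith is a software engineer\n";
console.log(JSON.stringify(csvToTestCases(csv), null, 2));
```

An input such as "Dr. Robert Johnson, PhD" contains a comma, so this naive split would break it into two cells. Real data like that needs quoted fields and a proper CSV parser.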

Success Criteria

Success criteria define what counts as a passing test case.

Built-in Criteria

Criterion	Description
No Errors	The function completes without errors
Valid Output	The output matches the expected schema
Contains Field	A specific field is present in the output

Custom Criteria

Write custom success criteria as JavaScript expressions. The function result is available as output:

// Check that firstName is not empty
output.firstName && output.firstName.length > 0

// Check that confidence is above threshold
output.confidence > 0.8

// Check that output matches expected pattern
/^[A-Z][a-z]+$/.test(output.firstName)
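You can try criteria like these against a sample output before saving them. The sketch below assumes that each criterion is a single expression and that a truthy value counts as a pass. How the platform actually evaluates criteria is not specified here, so evaluateCriterion is only a local stand-in.

```javascript
// Hypothetical local evaluator for custom criteria.
// Assumptions: a criterion is one JavaScript expression that reads `output`,
// a truthy result means "passed", and a thrown error means "failed".
function evaluateCriterion(expression, output) {
  try {
    const check = new Function("output", `return (${expression});`);
    return { expression, passed: Boolean(check(output)) };
  } catch (error) {
    return { expression, passed: false, error: String(error) };
  }
}

const criteria = [
  "output.firstName && output.firstName.length > 0",
  "output.confidence > 0.8",
  "/^[A-Z][a-z]+$/.test(output.firstName)",
];

// Made-up sample output for illustration only.
const sampleOutput = { firstName: "John", confidence: 0.93 };
for (const expression of criteria) {
  console.log(evaluateCriterion(expression, sampleOutput));
}
```

Wrapping the result in Boolean matters for the first criterion: with an empty firstName the raw expression yields an empty string, not false.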

Running Tests

Manual Test Runs

1. Select an input set
2. Configure success criteria
3. Click Run
4. Wait for results

Comparing Versions

Compare outputs across function versions:

1. Create a test run
2. Select multiple versions to test
3. Review side-by-side results
4. Identify regressions or improvements
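The last step can be pictured as a per-test-case diff of pass/fail outcomes. The sketch below is illustrative only: the result shape ({ input, passed }) and the function compareVersions are assumptions, and the sample results are made up.

```javascript
// Hypothetical comparison of two versions' results for the same input set.
// Assumption: both arrays list test cases in the same order.
function compareVersions(baselineResults, candidateResults) {
  const regressions = [];
  const improvements = [];
  baselineResults.forEach((baseline, index) => {
    const candidate = candidateResults[index];
    if (!candidate) return; // test case missing from the candidate run
    if (baseline.passed && !candidate.passed) regressions.push(baseline.input);
    if (!baseline.passed && candidate.passed) improvements.push(baseline.input);
  });
  return { regressions, improvements };
}

// Made-up outcomes for two versions of the same function.
const baseline = [
  { input: "My name is John Doe", passed: true },
  { input: "Dr. Robert Johnson, PhD", passed: false },
];
const candidate = [
  { input: "My name is John Doe", passed: false },
  { input: "Dr. Robert Johnson, PhD", passed: true },
];
console.log(compareVersions(baseline, candidate));
```

A test case that passed on the earlier version and fails on the newer one is a regression. The reverse is an improvement.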

Test Results

Result Summary

Status	Description
Passed	All success criteria met
Failed	One or more criteria not met
Error	Function execution failed
Pending	Test is still running
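One way to read the table is as a decision order. The sketch below assumes Pending is checked first, then Error, then the criteria. That precedence, the field names, and the function statusFor are assumptions made for illustration.

```javascript
// Hypothetical mapping from a test case's state to a status label.
// Assumed fields: finished (boolean), error (string or null),
// criteriaResults (array of { passed: boolean }).
function statusFor(testCase) {
  if (!testCase.finished) return "Pending";
  if (testCase.error) return "Error";
  const allMet = testCase.criteriaResults.every((criterion) => criterion.passed);
  return allMet ? "Passed" : "Failed";
}

console.log(
  statusFor({
    finished: true,
    error: null,
    criteriaResults: [{ passed: true }, { passed: false }],
  })
); // prints Failed
```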

Result Details

For each test case, view:

- Input: The test input
- Output: The function output
- Expected: Expected output (if defined)
- Criteria Results: Which criteria passed/failed
- Execution Time: How long the test took

Aggregate Metrics

- Pass Rate: Percentage of passing tests
- Average Latency: Mean execution time
- Token Usage: Total tokens consumed
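These three metrics follow directly from the per-test results. The sketch below shows the arithmetic. The field names (status, latencyMs, tokens) are assumptions, and the sample numbers are made up.

```javascript
// Hypothetical aggregation over per-test results.
// Assumed fields per result: status (string), latencyMs (number), tokens (number).
function aggregateMetrics(results) {
  const total = results.length;
  if (total === 0) {
    return { passRate: 0, averageLatencyMs: 0, totalTokens: 0 };
  }
  const passed = results.filter((r) => r.status === "Passed").length;
  const latencySum = results.reduce((sum, r) => sum + r.latencyMs, 0);
  const totalTokens = results.reduce((sum, r) => sum + r.tokens, 0);
  return {
    passRate: (passed / total) * 100, // percentage of passing tests
    averageLatencyMs: latencySum / total, // mean execution time
    totalTokens, // total tokens consumed
  };
}

const sampleResults = [
  { status: "Passed", latencyMs: 100, tokens: 10 },
  { status: "Passed", latencyMs: 200, tokens: 20 },
  { status: "Failed", latencyMs: 300, tokens: 30 },
  { status: "Passed", latencyMs: 400, tokens: 40 },
];
console.log(aggregateMetrics(sampleResults));
// prints { passRate: 75, averageLatencyMs: 250, totalTokens: 100 }
```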

Scheduled Tests

Run tests automatically on a schedule:

1. Go to Test Runs > Schedules
2. Click Create Schedule
3. Configure:
   - Function: Which function to test
   - Input Set: Which test cases to use
   - Frequency: How often to run (hourly, daily, weekly)
   - Notifications: Alert on failures
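To make the frequency setting concrete, the sketch below computes when the next run would fall if each frequency were a fixed interval after the previous run. That is an assumption: the actual scheduler may align runs to calendar boundaries instead. The function nextRunTime is hypothetical.

```javascript
// Hypothetical fixed-interval model of the three frequencies.
const FREQUENCY_MS = {
  hourly: 60 * 60 * 1000,
  daily: 24 * 60 * 60 * 1000,
  weekly: 7 * 24 * 60 * 60 * 1000,
};

// Returns the next run time as a Date, given the previous run time.
function nextRunTime(frequency, lastRun) {
  const interval = FREQUENCY_MS[frequency];
  if (interval === undefined) {
    throw new Error(`unknown frequency: ${frequency}`);
  }
  return new Date(lastRun.getTime() + interval);
}

const lastRun = new Date("2024-01-01T00:00:00Z");
console.log(nextRunTime("daily", lastRun).toISOString());
// prints 2024-01-02T00:00:00.000Z
```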

Test Run History

View past test runs:

1. Go to a function’s Test Runs tab
2. Browse the test run history
3. Click on a run to view details
4. Compare runs over time

Best Practices

- Diverse Test Cases: Include edge cases and typical inputs
- Version Testing: Test before publishing new versions
- Regular Runs: Schedule tests to catch regressions
- Clear Criteria: Define specific, measurable success criteria
- Review Failures: Investigate and fix failing tests promptly
