Evaluations

Evaluations help you test an Action against known examples and compare the actual result with the result you expected.

Use evaluations when you want confidence that an Action still behaves correctly after changing prompts, steps, tools, filters, or AI model settings.

What Evaluations Are For

An evaluation dataset is a collection of test cases for one Action. Each test case contains representative input data and the expected result.

Evaluations help answer questions such as:

Does the Action produce the expected final output?
Did the Action call the expected tools with the expected arguments?
Which test cases fail after a configuration change?
Is the behavior stable across common, edge, and failure-prone inputs?
Can a cheaper or faster AI model handle this Action, or does it need a more capable model?

Typical Evaluation Process

Pick an Action you want to validate.
Create or open an evaluation dataset for that Action.
Add test cases with realistic inputs and expected outputs.
Run the dataset as a dry run or live test.
Review passed and failed cases.
Adjust the Action, prompts, model choices, tools, or expectations.
Run the dataset again and compare results.

This gives you a repeatable way to verify changes instead of relying on one manual run.
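The "run again and compare" step can be pictured as a diff between two result sets. The following is a minimal sketch, assuming each run is reduced to a mapping from case name to pass/fail. The function name `compare_runs` and the case names are hypothetical and are not part of Studio.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Classify each case by how its pass/fail status changed between two runs."""
    changes: dict[str, list[str]] = {"regressed": [], "fixed": [], "still_failing": []}
    # Only compare cases that appear in both runs.
    for name in sorted(before.keys() & after.keys()):
        if before[name] and not after[name]:
            changes["regressed"].append(name)
        elif not before[name] and after[name]:
            changes["fixed"].append(name)
        elif not before[name] and not after[name]:
            changes["still_failing"].append(name)
    return changes


# Hypothetical results before and after a prompt change.
baseline = {"standard invoice": True, "missing PO number": False, "scanned PDF": True}
after_prompt_change = {"standard invoice": True, "missing PO number": True, "scanned PDF": False}

changes = compare_runs(baseline, after_prompt_change)
print(changes)
```

Here the change fixed one case and broke another, which a single manual run would likely have missed.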

Creating Test Cases From Runs

The easiest way to create useful test cases is often to start from real runs.

When you inspect a run, you can add it to an evaluation dataset. Studio uses the run data to prefill the test case, including the input context and observed outputs. You can then edit the expected result to describe what should happen in future runs.

This is useful because real runs include realistic input shapes, tool responses, and edge cases that are easy to miss when writing test data manually.
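As a rough illustration of prefilling, the sketch below copies a run's inputs and observed results into a new test case and then corrects the expectation by hand. The record layout, field names, and the `case_from_run` helper are assumptions made for this example; Studio's actual data format may differ.

```python
import copy
from typing import Any


def case_from_run(run: dict[str, Any], name: str) -> dict[str, Any]:
    """Build a test case whose expectations start out equal to what the run produced."""
    return {
        "name": name,
        "inputs": copy.deepcopy(run["inputs"]),
        "expected_tool_calls": copy.deepcopy(run["tool_calls"]),
        "expected_output": copy.deepcopy(run["output"]),
        "enabled": True,
    }


# Hypothetical run record.
observed_run = {
    "inputs": {"email_body": "Please cancel my order 1182."},
    "tool_calls": [{"tool": "lookup_order", "arguments": {"order_id": "1182"}}],
    "output": {"category": "billing"},
}

case = case_from_run(observed_run, "cancellation request")
# The observed output was wrong, so edit the expectation to the intended behavior.
case["expected_output"]["category"] = "cancellation"
```

The observed output is only a starting point: the expected result should describe what the Action ought to do, not necessarily what it did.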

What a Test Case Contains

A test case can include:

input parameters for the Action,
expected tool calls,
expected final output,
optional evaluation metrics,
a setting that controls whether the case is enabled for runs.

Expected tool calls are useful when you need to verify that the Action not only returns the right answer, but also uses the right systems in the right way.

Expected final output is useful for validating the final business result, such as extracted invoice fields, classification decisions, or generated summaries.
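One way to picture these parts together is as a small data structure. This is a sketch only: the class names, field names, and the invoice example are hypothetical and do not reflect Studio's real schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ExpectedToolCall:
    tool: str
    arguments: dict[str, Any] = field(default_factory=dict)


@dataclass
class EvalCase:
    name: str
    inputs: dict[str, Any]                        # input parameters for the Action
    expected_tool_calls: list[ExpectedToolCall] = field(default_factory=list)
    expected_output: Optional[dict[str, Any]] = None
    metrics: dict[str, float] = field(default_factory=dict)  # optional evaluation metrics
    enabled: bool = True                          # disabled cases are skipped in runs


def enabled_cases(cases: list[EvalCase]) -> list[EvalCase]:
    """Return only the cases that should be included in a run."""
    return [c for c in cases if c.enabled]


# Hypothetical invoice-extraction cases.
invoice_case = EvalCase(
    name="invoice with missing PO number",
    inputs={"document_id": "doc-001"},
    expected_tool_calls=[ExpectedToolCall("lookup_vendor", {"vendor_name": "Acme GmbH"})],
    expected_output={"vendor": "Acme GmbH", "total": 1250.0, "po_number": None},
)
draft_case = EvalCase(name="work in progress", inputs={"document_id": "doc-002"}, enabled=False)

to_run = enabled_cases([invoice_case, draft_case])
```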

Running a Dataset

Run a dataset when you want to verify multiple cases together.

Use a dry run, where supported, when you want to avoid making live external changes. Use a live test when the Action must interact with real tools or systems to validate the full Action Flow.

After the run finishes, Studio shows summary counts and case-level results.
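The difference between a dry run and a live test can be illustrated with a tool wrapper that always records the call but only executes it in live mode. This is a conceptual sketch and not a description of how Studio implements dry runs; `ToolRunner` and `send_email` are hypothetical names.

```python
from typing import Any, Callable


class ToolRunner:
    """Records every tool call; executes the tool only when not in dry-run mode."""

    def __init__(self, tools: dict[str, Callable[..., Any]], dry_run: bool) -> None:
        self.tools = tools
        self.dry_run = dry_run
        self.calls: list[dict[str, Any]] = []

    def call(self, name: str, **arguments: Any) -> Any:
        self.calls.append({"tool": name, "arguments": arguments})
        if self.dry_run:
            return None  # side effect skipped, call still recorded for comparison
        return self.tools[name](**arguments)


sent: list[tuple[str, str]] = []


def send_email(to: str, subject: str) -> str:
    """Stand-in for a tool with an external side effect."""
    sent.append((to, subject))
    return "sent"


dry = ToolRunner({"send_email": send_email}, dry_run=True)
dry_result = dry.call("send_email", to="a@example.com", subject="Invoice")
sent_after_dry = list(sent)  # still empty: nothing was actually sent

live = ToolRunner({"send_email": send_email}, dry_run=False)
live_result = live.call("send_email", to="a@example.com", subject="Invoice")
```

In both modes the recorded calls can be checked against the expected tool calls; only the live test confirms that the external system responds as expected.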

Reviewing Failed Cases

The test run view shows which cases passed and failed. For each case, you can inspect mismatches in tool calls, final output, metrics, and error messages.

If a case fails, open the linked test case from the run results and decide whether to:

fix the Action because the output is wrong,
update the expected output because the intended behavior changed,
split the scenario into more specific test cases,
adjust the model or prompt because the result is unstable.

Use failed cases as a focused debugging queue. They tell you exactly which input scenario needs attention.
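To make the idea of a mismatch concrete, the sketch below compares expected and actual tool calls and output fields and lists the differences. It assumes exact equality checks and a simple dictionary layout, both of which are illustrative choices; `find_mismatches` is a hypothetical helper, not a Studio function.

```python
from typing import Any


def find_mismatches(expected: dict[str, Any], actual: dict[str, Any]) -> list[str]:
    """Return human-readable differences between expected and actual results."""
    problems: list[str] = []

    exp_calls = expected.get("tool_calls", [])
    act_calls = actual.get("tool_calls", [])
    if len(exp_calls) != len(act_calls):
        problems.append(f"tool calls: expected {len(exp_calls)}, got {len(act_calls)}")
    # Compare calls position by position.
    for i, (exp, act) in enumerate(zip(exp_calls, act_calls)):
        if exp["tool"] != act["tool"]:
            problems.append(f"tool call {i}: expected {exp['tool']}, got {act['tool']}")
        elif exp["arguments"] != act["arguments"]:
            problems.append(f"tool call {i} ({exp['tool']}): arguments differ")

    exp_out = expected.get("output", {})
    act_out = actual.get("output", {})
    # Only fields named in the expectation are checked.
    for key in sorted(exp_out):
        if key not in act_out:
            problems.append(f"output.{key}: missing")
        elif exp_out[key] != act_out[key]:
            problems.append(f"output.{key}: expected {exp_out[key]!r}, got {act_out[key]!r}")
    return problems


# Hypothetical expected and actual results for one case.
expected = {
    "tool_calls": [{"tool": "lookup_vendor", "arguments": {"vendor_name": "Acme GmbH"}}],
    "output": {"vendor": "Acme GmbH", "total": 1250.0},
}
actual = {
    "tool_calls": [{"tool": "lookup_vendor", "arguments": {"vendor_name": "Acme"}}],
    "output": {"vendor": "Acme GmbH", "total": 125.0},
}

mismatches = find_mismatches(expected, actual)
for line in mismatches:
    print(line)
```

A list like this helps with the decision above: a wrong total points to a fault in the Action, while a harmless wording difference may mean the expectation is too strict.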

Evaluating AI Model Choices

Evaluations are especially useful when comparing AI model configurations.

For example, you might want to know whether an Action can use a cheaper and faster model or whether it requires a more capable model. Instead of guessing, run the same dataset with different model settings and compare pass rates, output quality, latency, and cost.

Practical model comparison questions include:

Does the smaller model still pass the critical test cases?
Are failures limited to edge cases, or do they affect common scenarios?
Does the faster model call tools correctly?
Is the quality gain from a stronger model worth the added cost?
Can different steps use different models based on complexity?
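A comparison like this can be reduced to a few numbers per model configuration. The sketch below assumes per-case results with a pass flag, a latency, and a "critical" marker. The model labels, field names, and figures are made up for illustration, and cost is left out because it depends on the credit rules described elsewhere.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_name: str
    passed: bool
    latency_s: float
    critical: bool = False


def summarize(results: list[CaseResult]) -> dict:
    """Compute pass rate, mean latency, and failed critical cases for one run."""
    total = len(results)
    return {
        "pass_rate": sum(r.passed for r in results) / total if total else 0.0,
        "mean_latency_s": sum(r.latency_s for r in results) / total if total else 0.0,
        "critical_failures": [r.case_name for r in results if r.critical and not r.passed],
    }


# Hypothetical results from running the same dataset with two model settings.
runs = {
    "small-model": [
        CaseResult("standard invoice", True, 1.0, critical=True),
        CaseResult("missing PO number", False, 1.0),
        CaseResult("scanned PDF", True, 2.0),
        CaseResult("multi-currency", True, 2.0),
    ],
    "large-model": [
        CaseResult("standard invoice", True, 3.0, critical=True),
        CaseResult("missing PO number", True, 3.0),
        CaseResult("scanned PDF", True, 4.0),
        CaseResult("multi-currency", True, 4.0),
    ],
}

summaries = {model: summarize(results) for model, results in runs.items()}
for model, summary in summaries.items():
    print(model, summary)
```

In this made-up data the smaller model fails one non-critical edge case but is faster on average, which is the kind of trade-off the questions above are meant to surface.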

For credit and model cost considerations, see Credits.

Good Evaluation Habits

Start with a small dataset of high-value scenarios.
Include both successful and failure-prone examples.
Add real cases from runs whenever possible.
Keep expected outputs specific enough to catch regressions.
Re-run evaluations after changing prompts, tools, models, or step logic.
Keep test cases enabled only when they represent behavior you currently want to enforce.

Evaluations work best when they become part of normal Action maintenance: change, run, review, improve, and repeat.
