What is Subtext?
Subtext is shadow testing for AI agents. Test any change to your AI stack — new models, new prompts, new providers — and see exactly what breaks before your users do.
Run everything from the CLI, or log into the web dashboard to visualize results and share with your team.
No API keys needed to start — one Subtext account gives you access to every major model. Your prompts and traces stay local on your machine by default. Nothing is sent to our servers unless you explicitly push to the dashboard. We never train on your data.
The web dashboard is optional — the CLI works on its own.
Requirements: Python 3.9+
Quick Start
$ pip install subtexts

See what's available:
$ subtext --help
Usage: subtext [command]
Commands:
run       Test a prompt or trace across models (zero-config)
eval      Manage and run pinned eval sets
trace     Import, view, and diff production traces
shadow    Shadow test with your own API keys
provider  Manage API keys and providers
prompt    Manage test prompt sets
login     Connect CLI to web dashboard
push      Sync results to web dashboard

Run your first test — no API keys, no config:
$ subtext run

Paste a prompt, drop a file, or give a path. Subtext auto-detects what you gave it, tests it across 5 models, and shows you a cost comparison table — all in under 90 seconds. First 5 runs are free every month.
Which command should I use?
- subtext run — Starting out or want quick results. Zero config, managed API keys, auto-detects your input.
- subtext trace diff — Compare any two LLM runs side by side. Always free.
- subtext eval — Pin test cases, run every change against them, catch regressions.
- subtext shadow — Power user mode. Your own API keys, specific candidates, full control.
Run
The zero-config entry point. Paste anything — a prompt, a Python file, a JSON trace, a YAML config, or a directory — and Subtext figures out what it is, tests it across 5 models, and shows you exactly how much you could save.
Interactive Mode
$ subtext run

Paste your prompt, drop a file, or give a path:
> You are a customer support agent for Acme Corp...

Detected: system prompt (12 lines, customer support agent)
Generating 5 test inputs...
Testing across 5 models...

Model             Quality  Cost     Latency  Comparison
────────────────────────────────────────────────────────
claude-sonnet-4   94       $0.0048  1.1s     priciest
gpt-4o            93       $0.0038  0.8s     -21%
deepseek-v3       90       $0.0003  0.5s     -94%
gpt-4o-mini       88       $0.0004  0.3s     -92%
gemini-2.0-flash  87       $0.0002  0.2s     -96% 💰
Smart File Detection
Subtext auto-detects whatever you throw at it:
- Plain text / markdown — treated as a system prompt
- Python files — extracts prompt strings from variables and classes
- YAML / JSON configs — extracts prompt and model configuration
- JSON traces — replays tasks against cheaper models
- Eval sets — uses existing test cases directly
- Directories — scans for all prompt and trace files, runs all of them
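The detection rules above can be pictured as a suffix-based dispatch, with JSON needing a second look at the contents. This is an illustrative sketch only — `guess_file_kind` and its categories are hypothetical, not Subtext's actual implementation:

```python
from pathlib import Path

# Hypothetical mapping mirroring the detection rules listed above.
KIND_BY_SUFFIX = {
    ".txt": "system prompt",
    ".md": "system prompt",
    ".py": "python source (extract prompt strings)",
    ".yaml": "config (extract prompt and model)",
    ".yml": "config (extract prompt and model)",
    ".json": "trace, eval set, or prompt list (inspect contents)",
    ".jsonl": "trace, eval set, or prompt list (inspect contents)",
}

def guess_file_kind(filename: str) -> str:
    """Map a file name to one of the detection categories above."""
    return KIND_BY_SUFFIX.get(Path(filename).suffix.lower(), "unknown")
```

For JSON files the real detector also has to look inside, since a trace export, an eval set, and a plain prompt list can all share the `.json` suffix.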
Direct File Path
# Run against a specific file
$ subtext run ./agents/support.py

# Run with production traces
$ subtext run ./logs/january_conversations.json

# Run an entire directory
$ subtext run ./agents/
Free Tier
subtext run uses Subtext's managed API keys — no keys needed. You get 5 free runs per month. After that, recharge your credits and pay as you go. There's no markup — you're charged exactly what the model providers charge us.
Trace Diffs Always Free
Compare any two LLM runs side by side. See exactly what changed in cost, latency, tokens, and tool calls. No account needed, no API keys, no limits.
The Flow
Import your traces from a JSON file, then list and diff:
# Import traces from a file
$ subtext trace import --from prod_logs.json --name jan_prod
Imported 24 traces from prod_logs.json

# See what you have
$ subtext trace list
Name      Traces  Model        Date
──────────────────────────────────────────────
jan_prod  24      gpt-4o       Jan 30
feb_prod  18      gpt-4o-mini  Feb 12

# Diff two traces by name
$ subtext trace diff jan_prod feb_prod

Trace Comparison
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
            jan_prod   feb_prod      Delta
            gpt-4o     gpt-4o-mini
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Cost        $0.024     $0.003        -88%
Latency     1.2s       0.4s          -67%
Tokens      847        215           -75%
Tool calls  2          0             ⚠ -2
Outcome     ✓ Pass     ✓ Pass
Where Traces Come From
- JSON files — export from your logging system and import with subtext trace import
- Auto-Capture — add import subtext to your code and traces accumulate automatically
- Shadow runs — every shadow run produces traces you can diff with subtext shadow diff
Eval Sets
Eval sets are curated collections of test cases — your safety net. Pin critical cases from production, previous runs, or generate them from a prompt. Then run every model or prompt change against this set to catch regressions before they ship.
Create an Eval Set
# Create empty, then add cases
$ subtext eval create my-evals

# Import from a JSON file
$ subtext eval create my-evals --from test_cases.json

# Import + attach system prompt (recommended — required for meaningful scores)
$ subtext eval create my-evals \
    --from test_cases.json \
    --prompt agents/support.py
✓ Created eval set my-evals with 10 cases from test_cases.json
  System prompt attached (637 chars) — judge will score against it

# Generate from a prompt using LLM
$ subtext eval create my-evals --generate --prompt agents/support.py
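The exact schema of test_cases.json isn't shown in this doc. A plausible shape, mirroring the --input and --expected fields that subtext eval add accepts (the field names here are an assumption, not a documented contract), would be:

```json
[
  {
    "input": "Customer asked about crypto refund",
    "expected": "Should explain crypto refunds not supported"
  },
  {
    "input": "Customer disputes a double charge",
    "expected": "Should apologize and open a billing dispute"
  }
]
```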
Add Cases Over Time
# Add a case manually
$ subtext eval add my-evals \
    --input "Customer asked about crypto refund" \
    --expected "Should explain crypto refunds not supported"

# Add from an existing trace
$ subtext eval add my-evals --from-trace conv_10442
Every production bug becomes a regression test. Over time, your eval set becomes your institutional memory for what your agent needs to handle.
Run Against Models
$ subtext eval run my-evals

Running 12 pinned cases across 3 models...

eval_001: billing dispute
├─ gpt-4o       93  ✓
├─ gpt-4o-mini  88  ✓
└─ deepseek-v3  85  ⚠ wrong tool call

eval_002: angry customer
├─ gpt-4o       91  ✓
├─ gpt-4o-mini  72  ⚠ didn't escalate
└─ deepseek-v3  80  ✓

Summary:
Model         Avg   Pass rate
────────────────────────────────────────
gpt-4o        92.2  12/12
gpt-4o-mini   73.6  5/12
deepseek-v3   81.2  7/12
Prompt Diff Mode
Changed your prompt? Test the new version against the same eval set to see what improved and what regressed:
$ subtext eval run my-evals --prompt ./agents/support_v2.txt

Case                  Before  After  Δ
──────────────────────────────────────────────
billing dispute       95      94     -1
angry customer        93      96     +3 ✓
complex proration     97      97     same
enterprise multi-sub  91      88     -3 ⚠
Manage Eval Sets
$ subtext eval list
$ subtext eval show my-evals
$ subtext eval delete my-evals
Auto-Capture
Add one import to your code and every LLM call gets traced automatically. No decorators, no config, no code changes beyond the import.
import subtext  # That's it. Every LLM call is now traced.
# (pip install subtexts — the package is subtexts, the import is subtext)

# Your existing code — no changes needed:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(...)  # ← automatically captured

# Works with Anthropic too:
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(...)  # ← automatically captured

# Works with any OpenAI-compatible SDK (OpenRouter, Together, etc.):
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
response = client.chat.completions.create(...)  # ← also captured
Traces are stored locally in .subtext/traces/, date-partitioned and auto-rotated. Nothing leaves your machine.
Any SDK built on OpenAI's Python client gets captured automatically — that includes OpenRouter, Together, Fireworks, Groq, and any other provider using the OpenAI-compatible format. If it calls client.chat.completions.create(), Subtext sees it.
How It Works
The import automatically instruments your OpenAI and Anthropic clients. Your code runs exactly as before — same responses, same latency, zero side effects. Subtext just records what happened in the background.
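The general mechanic is method wrapping: replace a client method with a wrapper that calls the original, returns its response unchanged, and logs metadata on the side. A simplified, self-contained sketch of that pattern (this is illustrative, not Subtext's actual code — `FakeCompletions` is a stand-in for a real client):

```python
import functools
import time

def record(trace_log, method):
    """Wrap a client method so each call is logged after it returns."""
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        response = method(*args, **kwargs)  # original call, unchanged
        trace_log.append({
            "latency_s": time.perf_counter() - start,
            "kwargs": kwargs,
        })
        return response
    return wrapper

class FakeCompletions:
    """Stand-in for an LLM client, for illustration only."""
    def create(self, **kwargs):
        return {"output": "hi"}

traces = []
client = FakeCompletions()
client.create = record(traces, client.create)

response = client.create(model="gpt-4o", messages=[])
# The caller sees the same response; `traces` now holds one record.
```

The key property, matching the claim above, is that the wrapper changes nothing the caller can observe: same arguments in, same response out, with recording done on the side.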
Controls
# Disable auto-patching entirely
SUBTEXT_DISABLE_AUTOPATCH=1

# Control sampling rate (0.0 to 1.0, default: 1.0)
SUBTEXT_SAMPLE_RATE=0.1  # capture 10% of calls

# Custom trace directory
SUBTEXT_TRACE_DIR=/var/log/subtext/traces

# Disable capture without removing the import
SUBTEXT_CAPTURE_ENABLED=0
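The capture switch and sampling rate compose naturally as a per-call decision. A sketch of how such a check could work, using only the variable names documented above (the function itself is hypothetical):

```python
import os
import random

def should_capture() -> bool:
    """Per-call capture decision: SUBTEXT_CAPTURE_ENABLED turns capture
    off entirely, SUBTEXT_SAMPLE_RATE thins it probabilistically.
    Illustrative sketch, not Subtext's actual code."""
    if os.environ.get("SUBTEXT_CAPTURE_ENABLED", "1") == "0":
        return False
    rate = float(os.environ.get("SUBTEXT_SAMPLE_RATE", "1.0"))
    return random.random() < rate
```

With a sample rate of 0.1, roughly one in ten calls would be recorded; the rest run untraced.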
Programmatic Control
from subtext.sdk.autopatch import enable_autopatch, disable_autopatch

disable_autopatch()  # restore original constructors
enable_autopatch()   # re-enable (idempotent)
Using OpenRouter?
Subtext and OpenRouter solve different problems. OpenRouter is a unified API gateway — it routes your requests to the cheapest provider for a model you've already chosen. Subtext helps you figure out which model to choose in the first place, and catches regressions when models or prompts change.
They work well together. A typical workflow: use subtext run to find the best model for each task, pin an eval set to catch regressions, then deploy that model via OpenRouter for best-price provider routing. Auto-capture traces OpenRouter calls automatically since it uses the OpenAI-compatible SDK.
# 1. Test which model works best for your task
$ subtext run ./agents/support.py

# 2. Deploy via OpenRouter for provider routing
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
response = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # ← Subtext told you this works
    messages=[...]
)

# 3. Auto-capture traces these calls automatically
import subtext  # ← nothing else needed
Shadow Testing
Shadow testing lets you bring your own API keys for full control over which models, providers, and prompt sets you test. Register your keys with subtext provider add, then use the shadow commands below.
Don't have API keys? That's fine — subtext run uses Subtext's managed keys so you can test without registering anything. Shadow testing is for when you want to use your own keys, test specific candidates, or need more control over the process.
# Register your API keys
$ subtext provider add openai --key sk-xxx
$ subtext provider add fireworks --key fw-xxx
$ subtext provider add anthropic --key sk-ant-xxx
$ subtext provider add openrouter --key sk-or-xxx
Compare Models
Test different models against the same prompts. Which model gives the best quality at the lowest cost?
$ subtext shadow models \
--prompts my_agent \
--candidate openai/gpt-4o \
--candidate anthropic/claude-sonnet-4-20250514 \
--candidate google/gemini-2.0-flash \
--candidate fireworks/llama-3.3-70b

Candidate Comparison
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#    Candidate                  Fidelity  Cost/1K  Latency
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🥇   openai/gpt-4o              82.3%     $24.00   1.2s
2    anthropic/claude-sonnet-4  79.1%     $18.00   0.9s
3    google/gemini-2.0-flash    74.5%     $0.30    0.4s
4    fireworks/llama-3.3-70b    70.8%     $6.62    4.9s
5    openai/gpt-4o-mini         61.4%     $0.42    0.4s
Use --baseline to compare against a specific model, --quick for a fast sanity check, or --runs N to repeat each prompt for more stable results.
Test a Local Model
Any OpenAI-compatible endpoint works — vLLM, Ollama, llama.cpp, or your own server:
# Register a local endpoint
$ subtext provider add local \
    --base-url http://localhost:8000/v1 \
    --model my-model

# Test against it
$ subtext shadow models --candidate local/my-model
Inference Providers
You've picked your model. Now find the cheapest or fastest platform to run it. Provider mode keeps the model fixed and compares the inference infrastructure underneath — Fireworks vs Groq vs Together vs DeepInfra.
$ subtext shadow providers \
--baseline fireworks \
--candidate groq \
--candidate together \
--candidate deepinfra

Provider Comparison (Llama 3.3 70B)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Provider   Fidelity  Cost/1K  Latency  vs Baseline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
fireworks  —         $6.62    4.9s     (baseline)
groq       65.8%     $3.55    0.9s     -46% cost
together   42.1%     $3.71    17.5s    -44% cost
Query Results
# Find the cheapest candidate
$ subtext shadow report --cheapest

# Find the fastest
$ subtext shadow report --fastest

# Best fidelity
$ subtext shadow report --best

# Top N sorted by any metric
$ subtext shadow report --top 3 --sort cost
Voice Models Coming Soon
Compare voice and speech-to-text models the same way you compare LLMs. Test latency, accuracy, and cost across providers like OpenAI Whisper, Deepgram, AssemblyAI, and ElevenLabs — with your actual audio data.
Prompts & Traces
Subtext is flexible about file formats — it auto-detects most common structures when you use subtext run. You don't need to match these examples exactly. If your traces use input instead of query, or content instead of output, Subtext will figure it out.
This section is reference material for when you want to understand what Subtext is reading, or when you're building an export from your own system.
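The alias mapping mentioned above (input vs query, content vs output) amounts to normalizing each record to canonical keys. A sketch of that idea — the alias lists here are hypothetical examples, not Subtext's actual detection table:

```python
# Hypothetical aliases; Subtext's real auto-detection may differ.
ALIASES = {
    "query": ["query", "input", "prompt"],
    "output": ["output", "content", "completion"],
}

def normalize(record: dict) -> dict:
    """Rewrite a trace record to canonical field names, taking the
    first alias present for each canonical key."""
    out = {}
    for canonical, names in ALIASES.items():
        for name in names:
            if name in record:
                out[canonical] = record[name]
                break
    return out
```

So a record like `{"input": "...", "content": "..."}` would come out keyed as `query` and `output`, matching the canonical formats shown below in this section.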
System Prompts
The easiest way to get started. Paste directly into subtext run, or point it at a file. Subtext accepts plain text, markdown, YAML configs with prompt fields, or Python files — it extracts the prompt automatically.
# Any of these work:
$ subtext run ./agents/support.md
$ subtext run ./agents/support.txt
$ subtext run ./agents/support.py  # extracts prompt strings from variables
$ subtext run ./config.yaml        # extracts prompt from YAML fields

# Or just paste directly:
$ subtext run
Paste your prompt, drop a file, or give a path:
> You are a customer support agent for Acme Corp...
Prompt Files (Test Inputs)
Prompts are the tasks you want to test your agent with. The simplest format is just a list of user messages:
// prompts.json
[
  "Summarize my unread emails and flag anything urgent",
  "Create a new Jira ticket for the login bug Sarah reported",
  "Find the cheapest direct flight from SFO to NRT in March"
]
$ subtext prompt import --from prompts.json --name my_agent

If you have ground truth (expected responses, tool calls, or token counts), you can include it for more precise scoring. But it's not required — Subtext uses LLM-as-judge scoring or the baseline model's output when ground truth is missing.
// Full format — all extra fields are optional
{
  "prompts": [
    {
      "query": "Find the cheapest direct flight SFO to NRT",
      "messages": [
        {"role": "system", "content": "You are a travel assistant..."},
        {"role": "user", "content": "Find the cheapest direct flight..."}
      ],
      "response": {
        "tool_calls": [{"function": "search_flights", "args": {"origin": "SFO"}}],
        "output": "The cheapest direct flight is..."
      }
    }
  ]
}
Trace Files (Production Data)
A trace is a record of one LLM call your agent made in production. Import them from your logging system, or let Auto-Capture generate them.
Subtext looks for common field names like query, messages, response, model. If your export uses different names, auto-detection will try to map them. Here's a full example:
{
  "query": "Create a Jira ticket for the login bug",
  "messages": [
    {"role": "system", "content": "You are a project management assistant..."},
    {"role": "user", "content": "Create a Jira ticket for the login bug"}
  ],
  "response": {
    "tool_calls": [{"function": "create_jira_issue", "args": {"project": "ENG"}}],
    "output": "Created ENG-1234: Login bug reported by Sarah..."
  },
  "model": "gpt-4o"
}

For bulk import, use a JSON array or JSONL (one trace per line):
$ subtext trace import --from traces.json --name jan_prod
$ subtext trace list
$ subtext trace diff jan_prod feb_prod
Managing Prompts & Traces
$ subtext prompt import --from <file> --name <name>
$ subtext prompt list
$ subtext prompt show <name>

$ subtext trace import --from <file> --name <name>
$ subtext trace list
$ subtext trace show <name>
SDK Integration
Most users should use Auto-Capture — just import subtext and you're done. The SDK gives you explicit control when you need it: wrap specific clients, set per-client shadow targets, or control sampling rates individually.
from subtext.sdk.client import Subtext

st = Subtext()

# Wrap a specific client with shadow testing
client = st.wrap(
    your_client,
    shadow_target="groq",
    sample_rate=0.1,  # shadow 10% of traffic
)
Web Sync
Connect your CLI to the Subtext web dashboard to visualize results, share with your team, and track changes over time.
# Connect to the dashboard
$ subtext login --url https://subtexts.io --key <your-api-key>

# Push latest results
$ subtext push

# Push last 3 shadow runs
$ subtext push --last 3
Shadow runs auto-sync after completion if you're logged in.
CI/CD
Run shadow tests or eval sets as part of your CI pipeline. If quality drops below your threshold, the build fails before regressions reach production.
name: Shadow Test Gate
on: [push]
jobs:
  shadow-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install subtexts
      - run: |
          subtext shadow models --prompts prod_suite --candidate ${{ vars.MODEL }}
          subtext ci --threshold 0.7 --fail-on-regression
        env:
          FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
The --threshold flag sets the minimum fidelity score (0.0 to 1.0) that candidates must meet. If any candidate drops below this score, the CI step exits with a non-zero code and the build fails. --fail-on-regression also fails the build if quality dropped compared to the previous run.
subtext eval run also works in CI — run your pinned eval set against every commit to catch regressions automatically.
Reference
All Commands
# Run (zero-config, managed keys)
$ subtext run [input]

# Eval sets
$ subtext eval create <name> [--from <file>] [--prompt <file>] [--generate]
$ subtext eval run <name> [--models <target>...] [--prompt <file>]
$ subtext eval add <name> --input <text> [--expected <text>]
$ subtext eval add <name> --from-trace <trace-id>
$ subtext eval list | show <name> | delete <name>

# Traces
$ subtext trace import --from <file> --name <name>
$ subtext trace list | show <name> | delete <name>
$ subtext trace diff <id1> <id2>

# Prompts
$ subtext prompt import --from <file> --name <name>
$ subtext prompt list | show <name> | delete <name>

# Shadow testing (bring your own keys)
$ subtext shadow models --candidate <target> [--candidate ...] [--baseline <target>] [--prompts <name>]
$ subtext shadow providers --baseline <provider> --candidate <provider> [--candidate ...]
$ subtext shadow status [--mode models|providers]
$ subtext shadow report [--cheapest|--fastest|--best] [--top N --sort cost|latency|fidelity]
$ subtext shadow diff <run1> <run2>

# Provider setup (for shadow testing)
$ subtext provider add <name> [--key] [--model] [--base-url]
$ subtext provider list
$ subtext provider remove <name>

# Web sync & CI
$ subtext login --url <dashboard-url> --key <api-key>
$ subtext push [--last N]
$ subtext ci --threshold <float> [--fail-on-regression]
Configuration
Subtext stores config globally in ~/.subtext/ and project data locally in .subtext/, both created automatically:
~/.subtext/               # Global — shared across all projects
├── providers.yaml        # Your API keys (subtext provider add)
├── credentials.yaml      # Web dashboard connection (subtext login)
├── managed_keys.yaml     # Subtext managed keys (auto-created)
├── evals/                # Pinned eval sets
├── eval_runs/            # Eval run history
└── usage.json            # Free tier usage tracking

.subtext/                 # Project-local — lives in your repo
├── prompts/              # Test input sets
├── traces/               # Production traces (including auto-captured)
├── shadows/              # Shadow run results
└── results/              # Benchmark results
Targets
A target is a provider + model combination used in shadow testing:
openai                     # uses default model (gpt-4o)
openai/gpt-4o-mini         # specific model
anthropic                  # uses default (claude-sonnet-4)
google/gemini-2.0-flash    # Google Gemini
fireworks/llama-3.3-70b    # open-source via Fireworks
openrouter/openai/gpt-4o   # model via OpenRouter
http://localhost:8000/v1   # local endpoint
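Parsing these target strings comes down to splitting on the first slash, with URLs handled separately — note that OpenRouter targets keep a slash inside the model part. A sketch of that rule (the `parse_target` helper is hypothetical, not part of the Subtext SDK):

```python
def parse_target(target: str) -> tuple:
    """Split a target string into (provider, model). Model is None when
    the provider's default applies; URLs map to a local endpoint.
    Illustrative only."""
    if target.startswith(("http://", "https://")):
        return ("local", target)
    provider, _, model = target.partition("/")
    return (provider, model or None)
```

Splitting at the first slash only is what lets `openrouter/openai/gpt-4o` carry the model name `openai/gpt-4o` through to OpenRouter intact.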
Roadmap
We're just getting started. Here's what's coming next.
Have a feature request? Reach out at hi@subtexts.io.