What is Subtext?
Subtext is shadow testing for AI agents. Test any change to your AI stack — new models, new prompts, new providers — and see exactly what breaks before your users do.
Run everything from the CLI, or log into the web dashboard to visualize results and share with your team.
No API keys needed to start — one Subtext account gives you access to every major model. Your prompts and traces stay local on your machine by default. Nothing is sent to our servers unless you explicitly push to the dashboard. We never train on your data.
The web dashboard is optional — the CLI works on its own.
Requirements: Python 3.9+
Quick Start
$ pip install subtexts

See what's available:
$ subtext --help
Usage: subtext [command]
Commands:
run       Test a prompt or trace across models (zero-config)
eval      Manage and run pinned eval sets
trace     Import, view, and diff production traces
shadow    Shadow test with your own API keys
provider  Manage API keys and providers
prompt    Manage test prompt sets
login     Connect CLI to web dashboard
push      Sync results to web dashboard

Run your first test — no API keys, no config:
$ subtext run

Paste a prompt, drop a file, or give a path. Subtext auto-detects what you gave it, tests it across 5 models, and shows you a cost comparison table — all in under 90 seconds. First 5 runs are free every month.
Which command should I use?
- subtext run — Starting out or want quick results. Zero config, managed API keys, auto-detects your input.
- subtext trace diff — Compare any two LLM runs side by side. Always free.
- subtext eval — Pin test cases, run every change against them, catch regressions.
- subtext shadow — Power user mode. Your own API keys, specific candidates, full control.
Run
The zero-config entry point. Paste anything — a prompt, a Python file, a JSON trace, a YAML config, or a directory — and Subtext figures out what it is, tests it across 5 models, and shows you exactly how much you could save.
Interactive Mode
$ subtext run

Paste your prompt, drop a file, or give a path:
> You are a customer support agent for Acme Corp...

Detected: system prompt (12 lines, customer support agent)
Generating 5 test inputs...
Testing across 5 models...

Model             Quality  Cost     Latency  Comparison
────────────────────────────────────────────────────────
claude-sonnet-4   94       $0.0048  1.1s     priciest
gpt-4o            93       $0.0038  0.8s     -21%
deepseek-v3       90       $0.0003  0.5s     -94%
gpt-4o-mini       88       $0.0004  0.3s     -92%
gemini-2.0-flash  87       $0.0002  0.2s     -96% 💰
Smart File Detection
Subtext auto-detects whatever you throw at it:
- Plain text / markdown — treated as a system prompt
- Python files — extracts prompt strings from variables and classes
- YAML / JSON configs — extracts prompt and model configuration
- JSON traces — replays tasks against cheaper models
- Eval sets — uses existing test cases directly
- Directories — scans for all prompt and trace files, runs all of them
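The detection rules above can be pictured as a suffix-based dispatch, with JSON needing a second look at the contents. This is an illustrative sketch only — `guess_file_kind` and its categories are hypothetical, not Subtext's actual implementation:

```python
from pathlib import Path

# Hypothetical mapping mirroring the detection rules listed above.
KIND_BY_SUFFIX = {
    ".txt": "system prompt",
    ".md": "system prompt",
    ".py": "python source (extract prompt strings)",
    ".yaml": "config (extract prompt and model)",
    ".yml": "config (extract prompt and model)",
    ".json": "trace, eval set, or prompt list (inspect contents)",
    ".jsonl": "trace, eval set, or prompt list (inspect contents)",
}

def guess_file_kind(filename: str) -> str:
    """Map a file name to one of the detection categories above."""
    return KIND_BY_SUFFIX.get(Path(filename).suffix.lower(), "unknown")
```

For JSON files the real detector also has to look inside, since a trace export, an eval set, and a plain prompt list can all share the `.json` suffix.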
Direct File Path
# Run against a specific file
$ subtext run ./agents/support.py

# Run with production traces
$ subtext run ./logs/january_conversations.json

# Run an entire directory
$ subtext run ./agents/
Free Tier
subtext run uses Subtext's managed API keys — no keys needed. You get 5 free runs per month. After that, recharge your credits and pay as you go. There's no markup — you're charged exactly what the model providers charge us.
Trace Diffs Always Free
Compare any two LLM runs side by side. See exactly what changed in cost, latency, tokens, and tool calls. No account needed, no API keys, no limits.
The Flow
Import your traces from a JSON file, then list and diff:
# Import traces from a file
$ subtext trace import --from prod_logs.json --name jan_prod
Imported 24 traces from prod_logs.json

# See what you have
$ subtext trace list
Name      Traces  Model        Date
──────────────────────────────────────────────
jan_prod  24      gpt-4o       Jan 30
feb_prod  18      gpt-4o-mini  Feb 12

# Diff two traces by name
$ subtext trace diff jan_prod feb_prod

Trace Comparison
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
            jan_prod   feb_prod      Delta
            gpt-4o     gpt-4o-mini
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Cost        $0.024     $0.003        -88%
Latency     1.2s       0.4s          -67%
Tokens      847        215           -75%
Tool calls  2          0             ⚠ -2
Outcome     ✓ Pass     ✓ Pass
Where Traces Come From
- JSON files — export from your logging system and import with subtext trace import
- Auto-Capture — add import subtext to your code and traces accumulate automatically
- Shadow runs — every shadow run produces traces you can diff with subtext shadow diff
Eval Sets
Eval sets are curated collections of test cases — your safety net. Pin critical cases from production, previous runs, or generate them from a prompt. Then run every model or prompt change against this set to catch regressions before they ship.
Create an Eval Set
# Create empty, then add cases
$ subtext eval create my-evals

# Import from a JSON file
$ subtext eval create my-evals --from test_cases.json

# Import + attach system prompt (recommended — required for meaningful scores)
$ subtext eval create my-evals \
    --from test_cases.json \
    --prompt agents/support.py
✓ Created eval set my-evals with 10 cases from test_cases.json
  System prompt attached (637 chars) — judge will score against it

# Generate from a prompt using LLM
$ subtext eval create my-evals --generate --prompt agents/support.py
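The exact schema of test_cases.json isn't shown in this doc. A plausible shape, mirroring the --input and --expected fields that subtext eval add accepts (the field names here are an assumption, not a documented contract), would be:

```json
[
  {
    "input": "Customer asked about crypto refund",
    "expected": "Should explain crypto refunds not supported"
  },
  {
    "input": "Customer disputes a double charge",
    "expected": "Should apologize and open a billing dispute"
  }
]
```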
Add Cases Over Time
# Add a case manually
$ subtext eval add my-evals \
    --input "Customer asked about crypto refund" \
    --expected "Should explain crypto refunds not supported"

# Add from an existing trace
$ subtext eval add my-evals --from-trace conv_10442
Every production bug becomes a regression test. Over time, your eval set becomes your institutional memory for what your agent needs to handle.
Run Against Models
$ subtext eval run my-evals

Running 12 pinned cases across 3 models...

eval_001: billing dispute
├─ gpt-4o       93  ✓
├─ gpt-4o-mini  88  ✓
└─ deepseek-v3  85  ⚠ wrong tool call

eval_002: angry customer
├─ gpt-4o       91  ✓
├─ gpt-4o-mini  72  ⚠ didn't escalate
└─ deepseek-v3  80  ✓

Summary:
Model         Avg   Pass rate
────────────────────────────────────────
gpt-4o        92.2  12/12
gpt-4o-mini   73.6  5/12
deepseek-v3   81.2  7/12
Prompt Diff Mode
Changed your prompt? Test the new version against the same eval set to see what improved and what regressed:
$ subtext eval run my-evals --prompt ./agents/support_v2.txt

Case                  Before  After  Δ
──────────────────────────────────────────────
billing dispute       95      94     -1
angry customer        93      96     +3 ✓
complex proration     97      97     same
enterprise multi-sub  91      88     -3 ⚠
Manage Eval Sets
$ subtext eval list
$ subtext eval show my-evals
$ subtext eval delete my-evals
Auto-Capture
Add one import to your code and every LLM call gets traced automatically. No decorators, no config, no code changes beyond the import.
import subtext  # That's it. Every LLM call is now traced.
# (pip install subtexts — the package is subtexts, the import is subtext)

# Your existing code — no changes needed:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(...)  # ← automatically captured

# Works with Anthropic too:
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(...)  # ← automatically captured

# Works with any OpenAI-compatible SDK (OpenRouter, Together, etc.):
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
response = client.chat.completions.create(...)  # ← also captured
Traces are stored locally in .subtext/traces/, date-partitioned and auto-rotated. Nothing leaves your machine.
Any SDK built on OpenAI's Python client gets captured automatically — that includes OpenRouter, Together, Fireworks, Groq, and any other provider using the OpenAI-compatible format. If it calls client.chat.completions.create(), Subtext sees it.
How It Works
The import automatically instruments your OpenAI and Anthropic clients. Your code runs exactly as before — same responses, same latency, zero side effects. Subtext just records what happened in the background.
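The general mechanic is method wrapping: replace a client method with a wrapper that calls the original, returns its response unchanged, and logs metadata on the side. A simplified, self-contained sketch of that pattern (this is illustrative, not Subtext's actual code — `FakeCompletions` is a stand-in for a real client):

```python
import functools
import time

def record(trace_log, method):
    """Wrap a client method so each call is logged after it returns."""
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        response = method(*args, **kwargs)  # original call, unchanged
        trace_log.append({
            "latency_s": time.perf_counter() - start,
            "kwargs": kwargs,
        })
        return response
    return wrapper

class FakeCompletions:
    """Stand-in for an LLM client, for illustration only."""
    def create(self, **kwargs):
        return {"output": "hi"}

traces = []
client = FakeCompletions()
client.create = record(traces, client.create)

response = client.create(model="gpt-4o", messages=[])
# The caller sees the same response; `traces` now holds one record.
```

The key property, matching the claim above, is that the wrapper changes nothing the caller can observe: same arguments in, same response out, with recording done on the side.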
Controls
# Disable auto-patching entirely
SUBTEXT_DISABLE_AUTOPATCH=1

# Control sampling rate (0.0 to 1.0, default: 1.0)
SUBTEXT_SAMPLE_RATE=0.1  # capture 10% of calls

# Custom trace directory
SUBTEXT_TRACE_DIR=/var/log/subtext/traces

# Disable capture without removing the import
SUBTEXT_CAPTURE_ENABLED=0
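The capture switch and sampling rate compose naturally as a per-call decision. A sketch of how such a check could work, using only the variable names documented above (the function itself is hypothetical):

```python
import os
import random

def should_capture() -> bool:
    """Per-call capture decision: SUBTEXT_CAPTURE_ENABLED turns capture
    off entirely, SUBTEXT_SAMPLE_RATE thins it probabilistically.
    Illustrative sketch, not Subtext's actual code."""
    if os.environ.get("SUBTEXT_CAPTURE_ENABLED", "1") == "0":
        return False
    rate = float(os.environ.get("SUBTEXT_SAMPLE_RATE", "1.0"))
    return random.random() < rate
```

With a sample rate of 0.1, roughly one in ten calls would be recorded; the rest run untraced.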
Programmatic Control
from subtext.sdk.autopatch import enable_autopatch, disable_autopatch

disable_autopatch()  # restore original constructors
enable_autopatch()   # re-enable (idempotent)
Using OpenRouter?
Subtext and OpenRouter solve different problems. OpenRouter is a unified API gateway — it routes your requests to the cheapest provider for a model you've already chosen. Subtext helps you figure out which model to choose in the first place, and catches regressions when models or prompts change.
They work well together. A typical workflow: use subtext run to find the best model for each task, pin an eval set to catch regressions, then deploy that model via OpenRouter for best-price provider routing. Auto-capture traces OpenRouter calls automatically since it uses the OpenAI-compatible SDK.
# 1. Test which model works best for your task
$ subtext run ./agents/support.py

# 2. Deploy via OpenRouter for provider routing
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
response = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # ← Subtext told you this works
    messages=[...]
)

# 3. Auto-capture traces these calls automatically
import subtext  # ← nothing else needed
Shadow Testing
Shadow testing lets you bring your own API keys for full control over which models, providers, and prompt sets you test. Register your keys with subtext provider add, then use the shadow commands below.
Don't have API keys? That's fine — subtext run uses Subtext's managed keys so you can test without registering anything. Shadow testing is for when you want to use your own keys, test specific candidates, or need more control over the process.
# Register your API keys
$ subtext provider add openai --key sk-xxx
$ subtext provider add fireworks --key fw-xxx
$ subtext provider add anthropic --key sk-ant-xxx
$ subtext provider add openrouter --key sk-or-xxx
Compare Models
Test different models against the same prompts. Which model gives the best quality at the lowest cost?
$ subtext shadow models \
--prompts my_agent \
--candidate openai/gpt-4o \
--candidate anthropic/claude-sonnet-4-20250514 \
--candidate google/gemini-2.0-flash \
--candidate fireworks/llama-3.3-70b

Candidate Comparison
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
#    Candidate                  Fidelity  Cost/1K  Latency
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🥇   openai/gpt-4o              82.3%     $24.00   1.2s
2    anthropic/claude-sonnet-4  79.1%     $18.00   0.9s
3    google/gemini-2.0-flash    74.5%     $0.30    0.4s
4    fireworks/llama-3.3-70b    70.8%     $6.62    4.9s
5    openai/gpt-4o-mini         61.4%     $0.42    0.4s
Use --baseline to compare against a specific model, --quick for a fast sanity check, or --runs N to repeat each prompt for more stable results.
Test a Local Model
Any OpenAI-compatible endpoint works — vLLM, Ollama, llama.cpp, or your own server:
# Register a local endpoint
$ subtext provider add local \
    --base-url http://localhost:8000/v1 \
    --model my-model

# Test against it
$ subtext shadow models --candidate local/my-model
Inference Providers
You've picked your model. Now find the cheapest or fastest platform to run it. Provider mode keeps the model fixed and compares the inference infrastructure underneath — Fireworks vs Groq vs Together vs DeepInfra.
$ subtext shadow providers \
--baseline fireworks \
--candidate groq \
--candidate together \
--candidate deepinfra

Provider Comparison (Llama 3.3 70B)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Provider   Fidelity  Cost/1K  Latency  vs Baseline
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
fireworks  —         $6.62    4.9s     (baseline)
groq       65.8%     $3.55    0.9s     -46% cost
together   42.1%     $3.71    17.5s    -44% cost
Query Results
# Find the cheapest candidate
$ subtext shadow report --cheapest

# Find the fastest
$ subtext shadow report --fastest

# Best fidelity
$ subtext shadow report --best

# Top N sorted by any metric
$ subtext shadow report --top 3 --sort cost
Voice Models Coming Soon
Compare voice and speech-to-text models the same way you compare LLMs. Test latency, accuracy, and cost across providers like OpenAI Whisper, Deepgram, AssemblyAI, and ElevenLabs — with your actual audio data.
Prompts & Traces
Subtext is flexible about file formats — it auto-detects most common structures when you use subtext run. You don't need to match these examples exactly. If your traces use input instead of query, or content instead of output, Subtext will figure it out.
This section is reference material for when you want to understand what Subtext is reading, or when you're building an export from your own system.
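The alias mapping mentioned above (input vs query, content vs output) amounts to normalizing each record to canonical keys. A sketch of that idea — the alias lists here are hypothetical examples, not Subtext's actual detection table:

```python
# Hypothetical aliases; Subtext's real auto-detection may differ.
ALIASES = {
    "query": ["query", "input", "prompt"],
    "output": ["output", "content", "completion"],
}

def normalize(record: dict) -> dict:
    """Rewrite a trace record to canonical field names, taking the
    first alias present for each canonical key."""
    out = {}
    for canonical, names in ALIASES.items():
        for name in names:
            if name in record:
                out[canonical] = record[name]
                break
    return out
```

So a record like `{"input": "...", "content": "..."}` would come out keyed as `query` and `output`, matching the canonical formats shown below in this section.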
System Prompts
The easiest way to get started. Paste directly into subtext run, or point it at a file. Subtext accepts plain text, markdown, YAML configs with prompt fields, or Python files — it extracts the prompt automatically.
# Any of these work:
$ subtext run ./agents/support.md
$ subtext run ./agents/support.txt
$ subtext run ./agents/support.py  # extracts prompt strings from variables
$ subtext run ./config.yaml        # extracts prompt from YAML fields

# Or just paste directly:
$ subtext run
Paste your prompt, drop a file, or give a path:
> You are a customer support agent for Acme Corp...
Prompt Files (Test Inputs)
Prompts are the tasks you want to test your agent with. The simplest format is just a list of user messages:
// prompts.json
[
  "Summarize my unread emails and flag anything urgent",
  "Create a new Jira ticket for the login bug Sarah reported",
  "Find the cheapest direct flight from SFO to NRT in March"
]
$ subtext prompt import --from prompts.json --name my_agent

If you have ground truth (expected responses, tool calls, or token counts), you can include it for more precise scoring. But it's not required — Subtext uses LLM-as-judge scoring or the baseline model's output when ground truth is missing.
// Full format — all extra fields are optional
{
  "prompts": [
    {
      "query": "Find the cheapest direct flight SFO to NRT",
      "messages": [
        {"role": "system", "content": "You are a travel assistant..."},
        {"role": "user", "content": "Find the cheapest direct flight..."}
      ],
      "response": {
        "tool_calls": [{"function": "search_flights", "args": {"origin": "SFO"}}],
        "output": "The cheapest direct flight is..."
      }
    }
  ]
}
Trace Files (Production Data)
A trace is a record of one LLM call your agent made in production. Import them from your logging system, or let Auto-Capture generate them.
Subtext looks for common field names like query, messages, response, model. If your export uses different names, auto-detection will try to map them. Here's a full example:
{
  "query": "Create a Jira ticket for the login bug",
  "messages": [
    {"role": "system", "content": "You are a project management assistant..."},
    {"role": "user", "content": "Create a Jira ticket for the login bug"}
  ],
  "response": {
    "tool_calls": [{"function": "create_jira_issue", "args": {"project": "ENG"}}],
    "output": "Created ENG-1234: Login bug reported by Sarah..."
  },
  "model": "gpt-4o"
}

For bulk import, use a JSON array or JSONL (one trace per line):
$ subtext trace import --from traces.json --name jan_prod
$ subtext trace list
$ subtext trace diff jan_prod feb_prod
Managing Prompts & Traces
$ subtext prompt import --from <file> --name <name>
$ subtext prompt list
$ subtext prompt show <name>

$ subtext trace import --from <file> --name <name>
$ subtext trace list
$ subtext trace show <name>
SDK Integration
Most users should use Auto-Capture — just import subtext and you're done. The SDK gives you explicit control when you need it: wrap specific clients, set per-client shadow targets, or control sampling rates individually.
from subtext.sdk.client import Subtext

st = Subtext()

# Wrap a specific client with shadow testing
client = st.wrap(
    your_client,
    shadow_target="groq",
    sample_rate=0.1,  # shadow 10% of traffic
)
Web Sync
Connect your CLI to the Subtext web dashboard to visualize results, share with your team, and track changes over time.
# Connect to the dashboard
$ subtext login --url https://subtexts.io --key <your-api-key>

# Push latest results
$ subtext push

# Push last 3 shadow runs
$ subtext push --last 3
Shadow runs auto-sync after completion if you're logged in.
CI/CD
Run shadow tests or eval sets as part of your CI pipeline. If quality drops below your threshold, the build fails before regressions reach production.
name: Shadow Test Gate
on: [push]
jobs:
  shadow-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install subtexts
      - run: |
          subtext shadow models --prompts prod_suite --candidate ${{ vars.MODEL }}
          subtext ci --threshold 0.7 --fail-on-regression
        env:
          FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
The --threshold flag sets the minimum fidelity score (0.0 to 1.0) that candidates must meet. If any candidate drops below this score, the CI step exits with a non-zero code and the build fails. --fail-on-regression also fails the build if quality dropped compared to the previous run.
subtext eval run also works in CI — run your pinned eval set against every commit to catch regressions automatically.
Reference
All Commands
# Run (zero-config, managed keys)
$ subtext run [input]

# Eval sets
$ subtext eval create <name> [--from <file>] [--prompt <file>] [--generate]
$ subtext eval run <name> [--models <target>...] [--prompt <file>]
$ subtext eval add <name> --input <text> [--expected <text>]
$ subtext eval add <name> --from-trace <trace-id>
$ subtext eval list | show <name> | delete <name>

# Traces
$ subtext trace import --from <file> --name <name>
$ subtext trace list | show <name> | delete <name>
$ subtext trace diff <id1> <id2>

# Prompts
$ subtext prompt import --from <file> --name <name>
$ subtext prompt list | show <name> | delete <name>

# Shadow testing (bring your own keys)
$ subtext shadow models --candidate <target> [--candidate ...] [--baseline <target>] [--prompts <name>]
$ subtext shadow providers --baseline <provider> --candidate <provider> [--candidate ...]
$ subtext shadow status [--mode models|providers]
$ subtext shadow report [--cheapest|--fastest|--best] [--top N --sort cost|latency|fidelity]
$ subtext shadow diff <run1> <run2>

# Provider setup (for shadow testing)
$ subtext provider add <name> [--key] [--model] [--base-url]
$ subtext provider list
$ subtext provider remove <name>

# Web sync & CI
$ subtext login --url <dashboard-url> --key <api-key>
$ subtext push [--last N]
$ subtext ci --threshold <float> [--fail-on-regression]
Configuration
Subtext stores config globally in ~/.subtext/ and project data locally in .subtext/, both created automatically:
~/.subtext/               # Global — shared across all projects
├── providers.yaml        # Your API keys (subtext provider add)
├── credentials.yaml      # Web dashboard connection (subtext login)
├── managed_keys.yaml     # Subtext managed keys (auto-created)
├── evals/                # Pinned eval sets
├── eval_runs/            # Eval run history
└── usage.json            # Free tier usage tracking

.subtext/                 # Project-local — lives in your repo
├── prompts/              # Test input sets
├── traces/               # Production traces (including auto-captured)
├── shadows/              # Shadow run results
└── results/              # Benchmark results
Targets
A target is a provider + model combination used in shadow testing:
openai                     # uses default model (gpt-4o)
openai/gpt-4o-mini         # specific model
anthropic                  # uses default (claude-sonnet-4)
google/gemini-2.0-flash    # Google Gemini
fireworks/llama-3.3-70b    # open-source via Fireworks
openrouter/openai/gpt-4o   # model via OpenRouter
http://localhost:8000/v1   # local endpoint
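Parsing these target strings comes down to splitting on the first slash, with URLs handled separately — note that OpenRouter targets keep a slash inside the model part. A sketch of that rule (the `parse_target` helper is hypothetical, not part of the Subtext SDK):

```python
def parse_target(target: str) -> tuple:
    """Split a target string into (provider, model). Model is None when
    the provider's default applies; URLs map to a local endpoint.
    Illustrative only."""
    if target.startswith(("http://", "https://")):
        return ("local", target)
    provider, _, model = target.partition("/")
    return (provider, model or None)
```

Splitting at the first slash only is what lets `openrouter/openai/gpt-4o` carry the model name `openai/gpt-4o` through to OpenRouter intact.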
Roadmap
We're just getting started. Here's what's coming next.
Have a feature request? Reach out at hi@subtexts.io.