Generate evaluations and iterate on your AI applications
The Simforge Python SDK captures your AI function calls to automatically generate evaluations. Re-run your prompts with different models, parameters, and inputs to iterate faster.
Copy this prompt into your coding agent (tested with Cursor and Claude Code using Sonnet 4.5):
```text
Modify existing Python code to add Simforge tracing.

Do NOT browse or web search. Use ONLY the API described below.

Simforge Python SDK (authoritative excerpt):
- Install: `pip install simforge-py` or `poetry add simforge-py` or `uv add simforge-py`
- Init:
    import os
    from simforge import Simforge
    simforge = Simforge(api_key=os.environ["SIMFORGE_API_KEY"])
- Instrumentation (ONLY allowed form - use get_function):
    # Declare trace function key once
    my_service = simforge.get_function("<trace_function_key>")

    # Decorate methods with span
    @my_service.span()
    def method_name(): ...

    # Or with options:
    @my_service.span(name="DisplayName", type="function")
    def method_name(): ...

    # Span types: "llm", "agent", "function", "guardrail", "handoff", "custom"
- Decorator form ONLY; must be placed immediately ABOVE the `def` it instruments.
- DO NOT use context managers or manual span creation.
- DO NOT extract helper methods.

Task:
1) Ensure simforge-py is installed and initialization exists.
2) Read the codebase and identify ALL AI workflows (LLM calls, agent runs, AI-driven decisions).
3) Present me with a numbered list of workflows you found. For each, describe:
   - What it does
   - Why it's worth instrumenting — what visibility tracing gives you into each step
4) After I choose which workflow(s) to instrument:
   - Create a function wrapper with `simforge.get_function("<trace_function_key>")`
   - Add `@my_service.span()` directly ABOVE each method's `def`
   - Instrument intermediate steps (not just the final output) so each trace has enough context to diagnose issues
   - Ensure the simforge client is initialized and accessible
5) Do not change method signature, behavior, or return value. Minimal diff.

Output:
- First: your numbered list of workflows with why each is worth instrumenting
- After my selection: minimal diffs for dependencies, initialization, and the method changes
```
```python
Simforge(api_key: str)

# Disable tracing (functions still execute, but no spans are sent)
Simforge(api_key: str, enabled: bool = True)
```
Missing API key doesn’t crash. If the API key is missing, empty, or whitespace-only, the SDK automatically disables tracing and logs a warning. All decorated functions still execute normally — no spans are sent, no errors are thrown. You don’t need any conditional logic around the API key.
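This graceful degradation can be modeled with a small sketch. The `TracingClient` class below is an illustrative stand-in, not the SDK's real internals — it only shows why decorated functions keep working when the key is blank:

```python
import functools

class TracingClient:
    """Simplified model of the SDK's graceful degradation (not the real implementation)."""
    def __init__(self, api_key):
        # A missing, empty, or whitespace-only key disables tracing instead of raising
        self.enabled = bool(api_key and api_key.strip())
        self.spans_sent = 0

    def span(self):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                result = fn(*args, **kwargs)  # the function always runs
                if self.enabled:
                    self.spans_sent += 1      # spans are only sent when enabled
                return result
            return wrapper
        return decorator

client = TracingClient(api_key="   ")  # whitespace-only key: tracing disabled

@client.span()
def add(a, b):
    return a + b

print(add(2, 3))          # 5 — the decorated function still works
print(client.spans_sent)  # 0 — no spans were sent
```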
For projects with instrumented functions spread across multiple files, create a dedicated file that initializes Simforge and exports the function. Import it wherever you need to instrument.
```python
# lib/simforge_client.py — single source of truth
import os
from simforge import Simforge

simforge = Simforge(api_key=os.environ["SIMFORGE_API_KEY"])
order_service = simforge.get_function("order-processing")
```
Use get_current_span() to get a handle to the active span, then call .add_context() to attach contextual key-value pairs from inside a traced function — useful for runtime values like request IDs, computed scores, or dynamic context.
You can also call .set_prompt() on the current span to record the prompt string. This is stored in span_data.prompt and is useful for capturing the exact prompt text sent to an LLM:
```python
from simforge import get_current_span

@simforge.span("classification", type="llm")
def classify_text(text: str) -> str:
    prompt = f"Classify the following text: {text}"
    get_current_span().set_prompt(prompt)
    result = llm.complete(prompt)
    return result
```
The last set_prompt call wins — it overwrites any previously set prompt on the span. Calling set_prompt outside a span context is a no-op (it never crashes).
If you use BAML for your LLM calls, wrap_baml automatically captures the rendered prompt and LLM metadata (model, provider, token counts, duration) on the current span — no manual set_prompt or add_context calls needed.
```bash
pip install baml-py
```
```python
from baml_client import b

# Pass your BAML client to the constructor
simforge = Simforge(api_key=os.environ["SIMFORGE_API_KEY"], baml_client=b)

# Wrap a BAML method inside a span — prompt and metadata are captured automatically
@simforge.span("classify", type="llm")
async def classify(text: str):
    return await simforge.wrap_baml(b.ClassifyText)(text=text)

result = await classify("Hello world")
```
wrap_baml works by creating a BAML Collector, running the method through a tracked client, then extracting:
Prompt → set_prompt() with the rendered messages (system + user)
LLM metadata → model, provider, token counts, and duration, attached to the current span
Use get_current_trace() to set context that applies to the entire trace (all spans within a single execution). This is useful for grouping traces by session or attaching trace-level metadata:
```python
from simforge import get_current_trace

@simforge.span("order-processing", type="function")
def process_order(order_id: str) -> dict:
    trace = get_current_trace()

    # Set session ID (stored as database column, filterable in dashboard)
    trace.set_session_id("session-123")

    # Set trace metadata (stored in raw trace data)
    trace.set_metadata({"region": "us-west-2", "environment": "production"})

    # Add context entries (stored as key-value pairs, accumulates across calls)
    trace.add_context({"workflow": "checkout-flow", "batch_id": "batch-2024-01"})

    return {"order_id": order_id, "status": "completed"}
```
set_session_id(id) — Groups traces by user session. Stored as a database column for efficient filtering.
set_metadata(dict) — Arbitrary key-value metadata on the trace. Merges with existing metadata.
add_context(dict) — Key-value context entries. Accumulates across multiple calls.
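The merge and accumulate semantics above can be pictured with a toy model (illustrative only — the real SDK stores this data server-side, not in plain dicts):

```python
class TraceModel:
    """Toy model of trace-level metadata/context semantics (not the real SDK)."""
    def __init__(self):
        self.metadata = {}
        self.context = {}

    def set_metadata(self, d):
        self.metadata.update(d)  # merges with existing metadata

    def add_context(self, d):
        self.context.update(d)   # accumulates across multiple calls

t = TraceModel()
t.set_metadata({"region": "us-west-2"})
t.set_metadata({"environment": "production"})  # merged, not replaced
t.add_context({"workflow": "checkout-flow"})
t.add_context({"batch_id": "batch-2024-01"})

print(t.metadata)  # {'region': 'us-west-2', 'environment': 'production'}
print(t.context)   # {'workflow': 'checkout-flow', 'batch_id': 'batch-2024-01'}
```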
Replay historical traces through a function and create a test run with comparison data. This is useful for testing changes to your functions against real production inputs.
```python
@simforge.span("my-function-key")
def my_function(text: str) -> dict:
    return {"processed": text.upper()}

result = simforge.replay(my_function, limit=5)

# Or replay specific traces by ID
result = simforge.replay(my_function, trace_ids=["trace-abc", "trace-def"])

print(f"Test Run: {result['test_run_url']}")
for item in result["items"]:
    print(f"  Input: {item['input']}")
    print(f"  Result: {item['result']}")
    print(f"  Original: {item['original_output']}")
```
Parameters:
fn (required): The function to replay (must be decorated with @span)
limit (optional): Maximum number of traces to replay. Default: 5
trace_ids (optional): List of trace IDs to filter which traces are replayed
Returns:
```python
{
    "items": [
        {
            "input": [...],          # The inputs passed to fn
            "result": ...,           # What fn returned
            "original_output": ...,  # What the original trace produced
            "error": None | str      # Error message if fn raised
        }
    ],
    "test_run_id": "...",
    "test_run_url": "..."
}
```
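With that shape in hand, you can diff replayed results against the recorded outputs. The sketch below uses a hand-written sample payload (the values are fabricated for illustration, not real trace data) to show the comparison pattern:

```python
# Hypothetical replay result, shaped like the return value described above
result = {
    "items": [
        {"input": ["hello"], "result": {"processed": "HELLO"},
         "original_output": {"processed": "HELLO"}, "error": None},
        {"input": ["world"], "result": {"processed": "WORLD!"},
         "original_output": {"processed": "WORLD"}, "error": None},
    ],
    "test_run_id": "run-123",
    "test_run_url": "https://example.invalid/runs/run-123",
}

# Flag items whose replayed output diverges from the recorded one
regressions = [
    item for item in result["items"]
    if item["error"] is None and item["result"] != item["original_output"]
]

print(len(regressions))                # 1
print(regressions[0]["input"])         # ['world']
```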
Notes:
The function must be decorated with @span — the trace function key is read from the decorator
The function can be sync or async (async functions are detected and run automatically)
If the function raises an error for one input, replay continues with the remaining inputs
Each replay creates a test run visible in the Simforge dashboard
Works through nested decorators (e.g. @retry, @cache) — walks the __wrapped__ chain to find @span
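The `__wrapped__` walk in the last note can be sketched in plain Python. Everything here is a hypothetical stand-in — `_trace_key`, `span_marker`, and `retry` are illustrative, not the SDK's actual decorators:

```python
import functools

def span_marker(fn):
    """Stand-in for @span: tags the wrapper with a trace function key."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    wrapper._trace_key = "my-function-key"
    return wrapper

def retry(fn):
    """Stand-in for an unrelated outer decorator that records its inner function."""
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    wrapper.__wrapped__ = fn  # handle to the next layer down
    return wrapper

@retry
@span_marker
def my_function(text):
    return text.upper()

# Walk the __wrapped__ chain until the span-decorated layer is found
fn = my_function
while not hasattr(fn, "_trace_key") and hasattr(fn, "__wrapped__"):
    fn = fn.__wrapped__

print(fn._trace_key)  # "my-function-key"
```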
Simforge’s native functions make prompt tuning more effective: the auto-tuning engine has full access to your prompts. Other agent SDKs often nest user instructions inside system prompts, which hides the prompt text from tracing.
```python
result = simforge.call("ExtractName", text="My name is John Doe")
# Returns a Pydantic model or primitive type
```