The system of context for your AI agents.
Capture every input, every retrieval, every tool call, every output — the full context behind every agent decision. In your own warehouse. Replayable on demand. The substrate beneath audit, debugging, reproduction, erasure, and bias detection.
```python
from stele.sdk import Tracer
from stele.writer import IcebergWriter

writer = IcebergWriter.from_env()
tracer = Tracer(sink=writer, agent_id="sales-bot")

with tracer.trace(input_text=user_msg) as t:
    docs = retriever.invoke(user_msg)
    t.record_retrieval(source="kb", documents=docs)
    response = llm.complete(prompt(docs, user_msg))
    t.set_output(response.text)
```
Pick any agent decision from yesterday. Now answer:
Five questions that come from a board review, a regulator letter, or a Monday-morning incident. Each one needs a row, not a guess.
Every decision becomes a queryable row in your warehouse.
For every agent invocation we capture the full context the model saw — the user input, every retrieval (with the source-data version pinned), every tool call (with response hash), the prompt and parameters, the model output, and any post-hoc outcome — into structured tables in your own warehouse. Replayable from any point.
The answers stop being “we'd have to dig” and start being one SQL query.
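For example, "which model, which prompt, and which source documents sat behind this one decision?" becomes a single lookup. A minimal sketch, assuming the warehouse exposes an agent.retrievals table alongside the agent.traces table named elsewhere on this page (the column names here are illustrative, not the shipped schema):

```sql
-- Full context behind one decision: model, prompt, and every retrieval.
-- agent.retrievals and these column names are illustrative assumptions.
SELECT
  t.trace_id,
  t.model,
  t.prompt_hash,
  r.source,
  r.source_snapshot_id,
  r.document_id
FROM agent.traces t
LEFT JOIN agent.retrievals r
  ON r.trace_id = t.trace_id
WHERE t.trace_id = '<trace_id>';
```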
Every role reaches into the same warehouse for different answers.
Every item below ships today. Each one is a SQL query, a CLI command, or a one-line oc invocation against the warehouse the SDK has been writing to.
- Reproduce any production decision. Same inputs, same model params, same tool responses (looked up by response_hash), same source data (read AS-OF the captured Iceberg snapshot; see the time-travel sketch after this list).
- Per-call replay for agentic loops. Target a specific intermediate model call inside a think → tool → think → respond loop with --call-index N.
- Counterfactual divergence attribution. Replay one trace with each differing layer reverted to the other trace's value; the layer whose reversion converges the output gets the attribution weight.
- Test a new model or prompt against last week's real production traces — before rollout. A/B replay across multiple LLM providers (OpenAI, Anthropic, any OpenAI-protocol endpoint).
- Per-decision audit reports. Six regulator templates: EU AI Act, HIPAA, SOX, GDPR, NIST RMF, and plain. Markdown or PDF. Optional PKCS#7 signature.
- Subject erasure with signed receipt. GDPR Article 17 / FTC consent orders. Row-level deletes + a signed JSON receipt in 4 minutes.
- Disparate-impact analysis. EEOC 4/5ths rule, demographic parity, equalized odds. Demographics never enter our substrate.
- Per-trace cost attribution. Token usage (input, output, cache-read, cache-write, reasoning) captured per trace. Roll up by agent, team, or customer (see the rollup sketch after this list).
- Cost-quality Pareto. Replay against a cheaper model; quantify the quality delta via equivalence_to_source. Find the substitution that maintains quality.
- Catch hallucinated tool use. Find traces where the model claimed tool use that didn't happen. One SQL query against agent.tool_calls.
- Outcome capture, closing the loop. Record post-hoc signals (regression, incident, customer complaint, audit finding). Aggregate over time to find quality drift.
- BYO Iceberg catalog. Use your existing lakehouse — Polaris, AWS S3 Tables, Databricks Unity, Snowflake. Iceberg REST protocol; any compliant catalog.
- Zero sub-processor agreements. Trace data never crosses your VPC boundary. Smaller security-review surface; no DPA / SCC / BAA to negotiate for trace content.
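On the reproduction item above: the "read AS-OF the captured Iceberg snapshot" step is ordinary Iceberg time travel, so you can inspect the source data the agent saw with your existing query engine. A sketch using Trino's Iceberg time-travel clause, assuming an illustrative kb.documents source table and a snapshot ID copied off the captured trace:

```sql
-- Read the knowledge-base table exactly as it stood when the agent ran.
-- kb.documents, the snapshot value, and the document IDs are illustrative;
-- FOR VERSION AS OF reads an Iceberg table at a specific snapshot.
SELECT document_id, body
FROM kb.documents FOR VERSION AS OF 4738291029384756234
WHERE document_id IN ('<doc_id_1>', '<doc_id_2>');
```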
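And on per-trace cost attribution: once token counts live on each trace row, the rollup is plain SQL. A sketch assuming agent.traces carries per-trace token columns and a customer_id under these illustrative names:

```sql
-- Monthly token spend per agent and customer.
-- The token and customer_id column names are illustrative assumptions.
SELECT
  agent_id,
  customer_id,
  date_trunc('month', started_at)        AS month,
  sum(input_tokens + cache_read_tokens)  AS prompt_tokens,
  sum(output_tokens + reasoning_tokens)  AS completion_tokens
FROM agent.traces
GROUP BY 1, 2, 3
ORDER BY completion_tokens DESC;
```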
The question observability cannot answer.
Production agents routinely produce confident output citing data they never looked up and tool calls they never invoked. The page-of-prose answer reads correct; the receipts behind it don't exist.
Observability dashboards log a “successful” trace with output tokens and a green status. They count events — they don't read what the events contain, and they don't compare them against the work the model actually performed. With a system of context, the work performed is one row away. Count the tool calls actually captured for the trace; the answer is zero.
Observability vendors structurally cannot run this query against your operational warehouse — they don't write to your warehouse, only theirs.
```sql
SELECT trace_id
FROM agent.traces t
WHERE NOT EXISTS (
  SELECT 1
  FROM agent.tool_calls
  WHERE trace_id = t.trace_id
);
```
Sorted into what's actually new, what's an extension, and what's operational.
No frozen promises. The split below is honest about which items already have a working version today and which are net-new in the warehouse.
Embedder + Tool layer reversibility
Counterfactual attribution covers 5 of 7 layers cleanly today. EMBEDDER and TOOL reversal close the gap, completing the seven-layer attribution surface.
Real-time anomaly alerting
Webhook / Slack alert when a hallucination flag or output anomaly fires. Today the same query runs on demand against your warehouse.
Cost-quality Pareto, packaged
Doable today via replay + SQL. Coming: a single command that returns “here's the model substitution that holds your quality at lower cost.”
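A sketch of that "doable today" path, assuming replay results land in an agent.replays table (an illustrative name) carrying the candidate model, its token counts, and the equivalence_to_source score mentioned above:

```sql
-- Quality retention vs. token cost per candidate model.
-- agent.replays and these column names are illustrative assumptions;
-- equivalence_to_source is the quality score named above.
SELECT
  candidate_model,
  avg(equivalence_to_source)         AS avg_equivalence,
  sum(input_tokens + output_tokens)  AS total_tokens
FROM agent.replays
GROUP BY candidate_model
ORDER BY avg_equivalence DESC, total_tokens ASC;
```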
Reasoning content capture + drift detection
Capture the model's reasoning text alongside the answer. Detect when reasoning style drifts at the cluster level. None of the agent-observability vendors do this.
Web console
Trace search, divergence reports, audit-report viewer, cost dashboards. The operator surface today is the CLI; the console reads from the same warehouse.
SOC 2 Type II attestation
Most of our compliance surface is structural — trace data never leaves your VPC — but enterprise procurement still asks for the badge. We'll publish the attestation date when the audit window opens.
Questions a security review will actually open with.
Including the answers that say not yet. Your DPO will reach for these on the first call.
Make every agent decision queryable.
30 minutes. We'll walk through the architecture, the compliance surface, and how a pilot would land in your VPC.