Chapter 11: Model Clients, Streaming, and Caching
The Model Client Is More Than HTTP
Calling a model API sounds simple: send messages, receive text. Agent runtimes need more:
- Streaming text, reasoning, and tool-call events.
- Tool schema delivery.
- Token usage accounting.
- Prompt caching metadata.
- Retry and fallback behavior.
- Authentication and provider selection.
- Cancellation.
- Model capability checks.
The model client is the adapter between the runtime's internal representation and a provider-specific protocol.
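As one concrete instance of that adapter role, the sketch below translates a single internal tool definition into two provider-specific shapes. The internal `tool` dictionary layout and both function names are assumptions made for illustration; the output field names follow the publicly documented Responses-style function tool and Anthropic-style tool formats as best understood here, and should be checked against current provider documentation.

```python
# Hypothetical internal tool definition: name, description, JSON Schema.
read_file_tool = {
    "name": "read_file",
    "description": "Read a UTF-8 text file from the workspace.",
    "schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}


def to_responses_tool(tool):
    """Shape a tool for an OpenAI Responses-style request (assumed layout)."""
    return {
        "type": "function",
        "name": tool["name"],
        "description": tool["description"],
        "parameters": tool["schema"],
    }


def to_anthropic_tool(tool):
    """Shape a tool for an Anthropic Messages-style request (assumed layout)."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["schema"],
    }


responses_tool = to_responses_tool(read_file_tool)
anthropic_tool = to_anthropic_tool(read_file_tool)
```

The JSON Schema is shared unchanged; only the envelope around it differs, which is why the translation can live entirely inside the model client.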
A Generic Streaming Adapter
async def stream_model_request(provider, request):
    stream = await provider.open_stream(request)
    async for event in stream:
        if event.type == "text_delta":
            yield TextDelta(event.text)
        elif event.type == "tool_call_done":
            yield ToolCall(event.name, event.arguments)
        elif event.type == "usage":
            yield Usage(event.input_tokens, event.output_tokens)
        elif event.type == "completed":
            yield Completed(event.response_id)
The runtime should not care whether the provider used SSE, WebSocket, or another wire format. It should receive normalized events.
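The adapter above yields `TextDelta`, `ToolCall`, `Usage`, and `Completed` without defining them. A minimal sketch of those types, plus a consumer that folds them into a turn summary, might look like the following. The dataclass fields are inferred from how the adapter constructs each event; `fake_normalized_stream` and `collect_turn` are hypothetical names used only for this example.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TextDelta:
    text: str


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int


@dataclass
class Completed:
    response_id: str


async def fake_normalized_stream():
    # Stand-in for stream_model_request(); a real adapter would yield
    # these as provider events arrive over the wire.
    yield TextDelta("Hello, ")
    yield TextDelta("world")
    yield ToolCall("read_file", {"path": "README.md"})
    yield Usage(input_tokens=12, output_tokens=5)
    yield Completed("resp-1")


async def collect_turn(events):
    """Fold normalized events into one summary of the turn."""
    text_parts, tool_calls, usage, response_id = [], [], None, None
    async for event in events:
        if isinstance(event, TextDelta):
            text_parts.append(event.text)
        elif isinstance(event, ToolCall):
            tool_calls.append(event)
        elif isinstance(event, Usage):
            usage = event
        elif isinstance(event, Completed):
            response_id = event.response_id
    return {
        "text": "".join(text_parts),
        "tool_calls": tool_calls,
        "usage": usage,
        "response_id": response_id,
    }


turn = asyncio.run(collect_turn(fake_normalized_stream()))
```

Because the consumer only inspects event classes, swapping the fake stream for a real provider adapter would not change `collect_turn`.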
Codex: Responses API Client With Session State
Codex integrates with OpenAI-style Responses API flows. It builds requests with instructions, formatted input history, model-visible tools, reasoning settings, tool choice, parallel tool-call support, metadata, and prompt-cache keys.
The model client can maintain turn-scoped session state, prewarm connections, stream over WebSocket or HTTP, and fall back between transports when needed.
Codex Request Shape
def build_codex_model_request(turn, prompt_items, tools):
    return {
        "model": turn.model,
        "instructions": turn.instructions,
        "input": prompt_items,
        "tools": tools,
        "tool_choice": "auto",
        "parallel_tool_calls": True,
        "reasoning": turn.reasoning_config,
        "stream": True,
        "prompt_cache_key": turn.thread_id,
        "metadata": turn.client_metadata,
    }
Codex Transport Fallback
async def run_with_transport_fallback(client, request):
    try:
        return await client.stream_via_websocket(request)
    except RetryBudgetExhausted:
        return await client.stream_via_http(request)
This matters for interactive latency. A warmed streaming session can reduce turn-to-turn overhead, while fallback keeps the agent usable when one transport path fails.
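To make the fallback path concrete, here is a small self-contained sketch that exercises the same function against a fake client. `FakeClient` and its `websocket_healthy` flag are hypothetical test doubles, not part of Codex; the exception class is defined locally so the example runs on its own.

```python
import asyncio


class RetryBudgetExhausted(Exception):
    """Raised when a transport has used up its retry attempts."""


class FakeClient:
    # Hypothetical stand-in for a model client with two transports.
    def __init__(self, websocket_healthy):
        self.websocket_healthy = websocket_healthy
        self.attempts = []

    async def stream_via_websocket(self, request):
        self.attempts.append("websocket")
        if not self.websocket_healthy:
            raise RetryBudgetExhausted("websocket retries exhausted")
        return f"ws:{request}"

    async def stream_via_http(self, request):
        self.attempts.append("http")
        return f"http:{request}"


async def run_with_transport_fallback(client, request):
    try:
        return await client.stream_via_websocket(request)
    except RetryBudgetExhausted:
        # Only a spent retry budget triggers the switch; other errors propagate.
        return await client.stream_via_http(request)


healthy = FakeClient(websocket_healthy=True)
degraded = FakeClient(websocket_healthy=False)
healthy_result = asyncio.run(run_with_transport_fallback(healthy, "turn-1"))
degraded_result = asyncio.run(run_with_transport_fallback(degraded, "turn-1"))
```

Catching only `RetryBudgetExhausted` is a deliberate choice in this sketch: errors such as invalid requests would fail identically on either transport, so retrying them over HTTP would only add latency.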
Claw: Provider Abstraction
Claw's API layer abstracts multiple provider families. It can resolve model aliases, detect provider kind, stream Anthropic-style messages, support OpenAI-compatible providers, and record usage or prompt-cache statistics.
The runtime sees assistant events rather than raw provider deltas.
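Alias resolution and provider detection can be sketched in a few lines. Everything below is illustrative: the alias table, the model ids, the prefix rule, and the names `resolve_model` and `detect_provider_kind` are assumptions for this example and do not reflect Claw's actual tables or logic.

```python
from enum import Enum


class ProviderKind(Enum):
    ANTHROPIC = "anthropic"
    OPENAI_COMPATIBLE = "openai_compatible"


# Hypothetical alias table; real aliases and model ids will differ.
MODEL_ALIASES = {
    "fast": "example-small-model",
    "smart": "claude-example-large",
}


def resolve_model(name):
    """Expand an alias to a full model id; unknown names pass through."""
    cleaned = name.strip()
    return MODEL_ALIASES.get(cleaned.lower(), cleaned)


def detect_provider_kind(model_id):
    # Illustrative rule only: route by model id prefix.
    if model_id.startswith("claude"):
        return ProviderKind.ANTHROPIC
    return ProviderKind.OPENAI_COMPATIBLE


kind_for_alias = detect_provider_kind(resolve_model("Smart"))
kind_for_custom = detect_provider_kind(resolve_model("my-local-model"))
```

Resolving the alias before detecting the provider keeps the two concerns separate, so a user-facing alias can be repointed without touching routing.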
Claw Provider Shape
async def claw_stream_request(provider_client, runtime_request):
    provider_request = translate_to_provider_request(runtime_request)
    stream = await provider_client.stream_message(provider_request)
    events = []
    async for provider_event in stream:
        event = normalize_provider_event(provider_event)
        events.append(event)
    return events
This shape keeps ConversationRuntime provider-agnostic. The runtime asks for a stream of assistant events and then builds the assistant message and tool uses from those events.
Tool Calls Across Providers
Providers represent tool calls differently. A runtime needs a stable internal format:
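import json
from dataclasses import dataclass
from typing import Any


@dataclass
class NormalizedToolCall:
    id: str
    name: str
    arguments: dict
    raw_provider_item: Any  # original provider item, kept alongside the normalized fields


def normalize_tool_call(provider_item):
    if provider_item.api == "responses":
        # Responses-style items carry arguments as a JSON string.
        return NormalizedToolCall(
            id=provider_item.call_id,
            name=provider_item.name,
            arguments=json.loads(provider_item.arguments),
            raw_provider_item=provider_item,
        )
    if provider_item.api == "anthropic_messages":
        # Anthropic-style tool-use blocks carry arguments as a parsed object.
        return NormalizedToolCall(
            id=provider_item.id,
            name=provider_item.name,
            arguments=provider_item.input,
            raw_provider_item=provider_item,
        )
    # Fail loudly instead of silently returning None for an unknown API.
    raise ValueError(f"unsupported provider API: {provider_item.api}")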
This is what lets the same tool registry work regardless of provider-specific wire format.
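A minimal sketch of that shared registry follows. `ToolRegistry`, its `register` and `dispatch` methods, and the result dictionary layout are hypothetical; the point is only that dispatch depends on the normalized fields and never on which provider produced the call.

```python
from dataclasses import dataclass


@dataclass
class NormalizedToolCall:
    id: str
    name: str
    arguments: dict
    raw_provider_item: object = None


class ToolRegistry:
    def __init__(self):
        self._handlers = {}

    def register(self, name, handler):
        self._handlers[name] = handler

    def dispatch(self, call):
        handler = self._handlers.get(call.name)
        if handler is None:
            # Return an error result so the model can recover on the next turn.
            return {"call_id": call.id, "ok": False, "output": f"unknown tool: {call.name}"}
        return {"call_id": call.id, "ok": True, "output": handler(**call.arguments)}


registry = ToolRegistry()
registry.register("add", lambda a, b: a + b)

# Two calls that differ only in provider-assigned id style.
from_responses = NormalizedToolCall(id="call_1", name="add", arguments={"a": 2, "b": 3})
from_anthropic = NormalizedToolCall(id="toolu_1", name="add", arguments={"a": 2, "b": 3})
results = [registry.dispatch(from_responses), registry.dispatch(from_anthropic)]
missing = registry.dispatch(NormalizedToolCall(id="call_2", name="nope", arguments={}))
```

The call id is echoed back in each result so the model client can pair the tool output with the originating call when it builds the next request.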
Usage And Cost Tracking
Agents must track usage because long coding sessions can be expensive. Usage tracking also informs compaction decisions: a running token count shows when the conversation is approaching the model's context limit.
def record_usage(session, usage_event, pricing):
    input_cost = usage_event.input_tokens * pricing.input_per_token
    output_cost = usage_event.output_tokens * pricing.output_per_token
    session.usage.input_tokens += usage_event.input_tokens
    session.usage.output_tokens += usage_event.output_tokens
    session.usage.estimated_cost += input_cost + output_cost
Codex records token counts and emits token events through the session protocol. Claw has a usage tracker and model pricing helpers.
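A short worked example shows how the same numbers feed both cost estimates and a compaction check. The prices, the 200,000-token context window, and the 80% threshold are made-up values for illustration, and `estimate_cost` and `should_compact` are hypothetical helpers rather than functions from either project.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    input_per_token: float
    output_per_token: float


def estimate_cost(input_tokens, output_tokens, pricing):
    """Same arithmetic as record_usage, without mutating a session."""
    return (
        input_tokens * pricing.input_per_token
        + output_tokens * pricing.output_per_token
    )


def should_compact(last_input_tokens, context_window, threshold=0.8):
    # Compact once the most recent request fills most of the window.
    return last_input_tokens >= threshold * context_window


# Illustrative prices: 2 units per million input tokens, 8 per million output.
pricing = Pricing(input_per_token=2 / 1_000_000, output_per_token=8 / 1_000_000)

cost = estimate_cost(150_000, 4_000, pricing)  # about 0.30 + 0.032
compact_now = should_compact(150_000, context_window=200_000)
compact_later = should_compact(170_000, context_window=200_000)
```

The compaction check uses the latest request's input tokens rather than the session total, since each request resends the history and the latest count approximates the current context size.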
Prompt Caching
Prompt caching rewards stable prompt prefixes. The model integration layer has to preserve enough identity across requests for the provider to reuse cached content.
def cache_key_for_session(session):
    return f"agent-thread:{session.thread_id}"


def split_prompt_for_cache(prompt):
    return {
        "stable_prefix": prompt.before_dynamic_boundary,
        "dynamic_suffix": prompt.after_dynamic_boundary,
    }
Codex uses provider request metadata such as a prompt cache key. Claw's prompt builder includes a dynamic boundary so stable Claude-style prompt sections can be separated from volatile context.
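The value of a dynamic boundary is easiest to see by fingerprinting the stable prefix across turns. In this sketch the boundary marker string, the prompt layout, and the helper names are all assumptions for illustration; a real prompt builder would use its own structure.

```python
import hashlib

DYNAMIC_BOUNDARY = "<<dynamic-boundary>>"  # hypothetical marker string


def build_prompt(system_rules, tool_docs, volatile_context):
    """Place stable sections first and volatile context after the boundary."""
    stable = "\n".join([system_rules, tool_docs])
    return f"{stable}\n{DYNAMIC_BOUNDARY}\n{volatile_context}"


def split_prompt_for_cache(prompt):
    stable_prefix, _, dynamic_suffix = prompt.partition(DYNAMIC_BOUNDARY)
    return {"stable_prefix": stable_prefix, "dynamic_suffix": dynamic_suffix}


def prefix_fingerprint(prompt):
    """Short hash of the stable prefix; equal hashes mean a reusable prefix."""
    prefix = split_prompt_for_cache(prompt)["stable_prefix"]
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


turn_1 = build_prompt("Be careful.", "Tools: read_file", "cwd=/repo, branch=main")
turn_2 = build_prompt("Be careful.", "Tools: read_file", "cwd=/repo, branch=fix-123")
turn_3 = build_prompt("Be bold.", "Tools: read_file", "cwd=/repo, branch=fix-123")

same_prefix = prefix_fingerprint(turn_1) == prefix_fingerprint(turn_2)
changed_prefix = prefix_fingerprint(turn_1) != prefix_fingerprint(turn_3)
```

Volatile details such as the working directory or current branch change the suffix only, so the prefix stays byte-identical between turns 1 and 2. Editing anything before the boundary, as in turn 3, changes the prefix and would forfeit reuse.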
Retry Policy
Model requests fail for ordinary reasons: network interruptions, rate limits, provider overload, context-length errors, and malformed tool arguments.
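import asyncio


async def sample_with_retries(request, client):
    for attempt in range(3):
        try:
            return await client.stream(request)
        except RateLimited as error:
            # Prefer the provider's retry hint; otherwise back off by attempt number.
            await asyncio.sleep(error.retry_after or backoff(attempt))
        except ContextTooLarge:
            # Shrink the request and try again on the next iteration.
            request = compact_request(request)
        except TransportError:
            # Switch transports and try again on the next iteration.
            client = client.fallback_transport()
    raise RuntimeError("model request failed after retries")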
Codex has visible transport retry and fallback behavior. Claw's provider client normalizes provider-specific streaming and can fall back to non-streaming paths in some cases.
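The retry loop above calls a `backoff` helper that it does not define. One common choice is exponential backoff with full jitter, sketched below. The base delay, the cap, and the `retry_delay` wrapper are assumptions for this example, not values taken from either project; the `rng` parameter exists only so the randomness can be pinned in tests.

```python
import random


def backoff(attempt, base=0.5, cap=8.0, rng=random.random):
    """Exponential backoff with full jitter, in seconds.

    The ceiling doubles with each attempt up to `cap`; the returned delay
    is a random fraction of that ceiling.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def retry_delay(retry_after, attempt, rng=random.random):
    # Honor an explicit provider hint, including zero; otherwise use jitter.
    if retry_after is not None:
        return retry_after
    return backoff(attempt, rng=rng)


# Pin the random source to 1.0 to expose the ceiling for each attempt.
ceilings = [backoff(attempt, rng=lambda: 1.0) for attempt in range(6)]
```

Jitter spreads retries from many concurrent sessions so they do not all hit a recovering provider at the same instant. Checking `retry_after is not None` differs from `retry_after or backoff(attempt)`: the latter treats a hint of zero as missing.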
Comparison
| Aspect | Codex | Claw |
|---|---|---|
| Main API shape | OpenAI Responses-style request/stream | Provider abstraction, Anthropic and compatible APIs |
| Transport | HTTP streaming and WebSocket paths | Provider-specific streaming abstraction |
| Tool delivery | Model-visible tool schemas in request | Tool definitions translated to provider request |
| Runtime events | Stream events handled during turn | Assistant events returned to conversation runtime |
| Caching | Prompt cache key and session state | Dynamic prompt boundary and prompt-cache stats |
| Usage | Token events and session accounting | Usage tracker and pricing helpers |
| Fallback | Transport fallback and retry budget | Provider abstraction and stream/non-stream handling |
Source Anchors
For Codex, useful filenames are client.rs, turn.rs, and protocol model files.
For Claw, useful filenames are client.rs, conversation.rs, and prompt.rs.