Name: weave
Author: terminal-skills

You are an expert in Weave, the lightweight toolkit by Weights & Biases for tracking and evaluating AI applications. You help developers trace LLM calls, evaluate outputs, compare model versions, track experiments, and debug AI pipelines — with automatic logging via decorators and a visual dashboard for exploring traces, costs, and quality metrics.

Core Capabilities

Automatic Tracing

python

import weave
import openai

weave.init("my-ai-project")               # Initialize with project name

client = openai.OpenAI()

# OpenAI calls are automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain transformers"}],
)
# Weave captures: model, tokens, latency, cost, input/output — viewable in dashboard

# Custom function tracing
@weave.op()
def extract_entities(text: str) -> list[str]:
    """Extract named entities from text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Extract entities from: {text}\nReturn JSON list."}],
    )
    return json.loads(response.choices[0].message.content)

@weave.op()
def rag_pipeline(query: str) -> str:
    """Full RAG pipeline — each step traced as child span."""
    docs = retrieve_documents(query)       # Traced if decorated
    context = "\n".join(docs)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

Evaluations

python

# Define evaluation dataset
eval_dataset = [
    {"query": "What is Python?", "expected": "programming language"},
    {"query": "Who created Linux?", "expected": "Linus Torvalds"},
    {"query": "What is Docker?", "expected": "containerization platform"},
]

# Define scoring functions
@weave.op()
def relevance_scorer(output: str, expected: str) -> dict:
    """Score if output contains expected information."""
    contains = expected.lower() in output.lower()
    return {"relevance": 1.0 if contains else 0.0}

@weave.op()
def length_scorer(output: str) -> dict:
    """Score response length (prefer concise)."""
    words = len(output.split())
    return {"conciseness": min(1.0, 50 / max(words, 1))}

# Run evaluation
evaluation = weave.Evaluation(
    dataset=eval_dataset,
    scorers=[relevance_scorer, length_scorer],
)

results = await evaluation.evaluate(rag_pipeline)
# Results visible in Weave dashboard with per-example scores
# Compare across model versions, prompts, parameters

Model Versioning

python

# Track model/prompt versions
class SupportAgent(weave.Model):
    model_name: str = "gpt-4o"
    system_prompt: str = "You are a helpful support agent."
    temperature: float = 0.7

    @weave.op()
    def predict(self, query: str) -> str:
        response = client.chat.completions.create(
            model=self.model_name,
            temperature=self.temperature,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": query},
            ],
        )
        return response.choices[0].message.content

# Version 1
agent_v1 = SupportAgent(system_prompt="Be concise and helpful.")

# Version 2 — compare in dashboard
agent_v2 = SupportAgent(model_name="gpt-4o-mini", system_prompt="Be detailed and empathetic.")

# Evaluate both versions
for agent in [agent_v1, agent_v2]:
    await evaluation.evaluate(agent)
# Dashboard shows side-by-side comparison: quality, cost, latency

Installation

bash

pip install weave
# Uses your W&B account — set WANDB_API_KEY

Best Practices

@weave.op() decorator — Add to any function to trace it; creates hierarchical spans for nested calls
Auto-instrumentation — OpenAI, Anthropic, LangChain calls traced automatically after weave.init()
Evaluations — Define datasets + scorers; run systematically; compare versions in dashboard
weave.Model — Subclass for versioned models; parameters tracked, comparable across evaluations
W&B integration — Weave data appears in your W&B workspace; share with team, add to reports
Cost tracking — Automatic per-call cost calculation; aggregate by function, model, or user
Production monitoring — Use in production for continuous quality tracking; alert on regressions
Lightweight — Single @weave.op() decorator; no complex setup, no separate infrastructure

weave

Validation

Getting Started

Example Prompts

Documentation

Core Capabilities

Automatic Tracing

Evaluations

Model Versioning

Installation

Best Practices

Information

Use Cases