langsmith
Monitor, trace, debug, and evaluate LLM applications with LangSmith. Use when a user asks to trace LLM calls, debug chain executions, evaluate AI output quality, set up LLM observability, monitor agent performance, run prompt experiments, compare model outputs, create evaluation datasets, track token usage and latency, or build LLM testing pipelines. Covers tracing, datasets, evaluators, annotation queues, prompt hub, and production monitoring.
Usage
Getting Started
- Install the skill using the command above
- Open your AI coding agent (Claude Code, Codex, Gemini CLI, or Cursor)
- Reference the skill in your prompt
- The AI will use the skill's capabilities automatically
Documentation
Overview
LangSmith is the observability and evaluation platform for LLM applications. It traces every step of your chains and agents, helps you build evaluation datasets, run automated quality checks, and monitor production performance. It is essential for moving LLM apps from prototype to production.
Instructions
Step 1: Setup and Configuration
Create a LangSmith account at smith.langchain.com and get an API key.
pip install langsmith
Set environment variables:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="lsv2_pt_..."
export LANGCHAIN_PROJECT="my-project" # Optional, defaults to "default"
Once set, all LangChain calls are automatically traced — no code changes needed.
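A missing variable is easy to overlook, so a startup check can confirm the configuration before you rely on traces. The following is a minimal sketch using only the standard library; the helper names `missing_tracing_vars` and `tracing_enabled` are hypothetical and not part of the LangSmith SDK.

```python
import os

# Variables named in this step. LANGCHAIN_PROJECT is optional and is not checked.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY")


def missing_tracing_vars(env=None):
    """Return the required tracing variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def tracing_enabled(env=None):
    """True when the tracing flag is "true" and no required variable is missing."""
    env = os.environ if env is None else env
    flag = env.get("LANGCHAIN_TRACING_V2", "").strip().lower()
    return flag == "true" and not missing_tracing_vars(env)


if __name__ == "__main__":
    missing = missing_tracing_vars()
    if missing:
        print("Tracing is not configured; missing:", ", ".join(missing))
```

Passing a plain dict instead of `os.environ` makes the check easy to unit test.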
For non-LangChain code, use the SDK directly:
from langsmith import Client
client = Client()
Step 2: Tracing
Automatic Tracing (LangChain)
With LANGCHAIN_TRACING_V2=true, every .invoke(), .stream(), and .batch() call is traced automatically. Each trace shows:
- Input/output at every step
- Token usage and cost
- Latency per component
- Error details with full stack traces
Manual Tracing with @traceable
For custom functions outside LangChain:
from langsmith import traceable

@traceable(name="process_order", tags=["production"])
def process_order(order_id: str, items: list) -> dict:
    # Your business logic
    validated = validate_items(items)
    # generate_summary is assumed to be defined elsewhere and to make the LLM call
    summary = generate_summary(validated)
    return {"order_id": order_id, "summary": summary, "status": "processed"}

@traceable
def validate_items(items: list) -> list:
    # Nested traces automatically link to parent
    return [item for item in items if item["quantity"] > 0]
Tracing with Context Manager
from langsmith import trace

with trace("data-pipeline", inputs={"source": "csv"}) as run:
    data = load_data("input.csv")
    processed = transform(data)
    run.end(outputs={"rows": len(processed)})
Metadata and Tags
# Add metadata to any LangChain call
result = chain.invoke(
    {"question": "..."},
    config={
        "metadata": {"user_id": "u-123", "environment": "staging"},
        "tags": ["beta-test", "gpt4"],
    },
)
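If many call sites attach the same metadata, a small builder keeps the keys consistent. This is only a sketch: `trace_config` is a hypothetical helper, and the metadata keys simply mirror the example above.

```python
def trace_config(user_id, environment, tags=()):
    """Build a config dict shaped like the one passed to chain.invoke() above."""
    return {
        "metadata": {"user_id": user_id, "environment": environment},
        # dict.fromkeys drops duplicate tags while preserving their order
        "tags": list(dict.fromkeys(tags)),
    }


config = trace_config("u-123", "staging", ["beta-test", "gpt4", "beta-test"])
```

The result can then be passed as `config=config` to any LangChain call.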
Step 3: Datasets and Examples
Datasets are collections of input/output pairs used for evaluation:
from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    "customer-support-qa",
    description="Real support questions with expected answers",
)

# Add examples
client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "What's your refund policy?"},
    ],
    outputs=[
        {"answer": "Go to Settings > Security > Reset Password"},
        {"answer": "Full refund within 30 days, no questions asked"},
    ],
    dataset_id=dataset.id,
)

# Also create from existing traces: in the UI, select traces → "Add to Dataset"
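Since create_examples takes parallel inputs and outputs lists, question/answer data kept as pairs has to be split first. A minimal sketch, assuming the pairs are simple (question, answer) tuples; `split_pairs` is a hypothetical helper, not an SDK function.

```python
def split_pairs(pairs, input_key="question", output_key="answer"):
    """Turn (question, answer) tuples into the parallel lists used above."""
    inputs, outputs = [], []
    for question, answer in pairs:
        inputs.append({input_key: question})
        outputs.append({output_key: answer})
    return inputs, outputs


pairs = [
    ("How do I reset my password?", "Go to Settings > Security > Reset Password"),
    ("What's your refund policy?", "Full refund within 30 days, no questions asked"),
]
inputs, outputs = split_pairs(pairs)
```

The two lists can then be passed as `inputs=inputs, outputs=outputs` alongside `dataset_id=dataset.id`.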
Step 4: Evaluation
Run your chain against a dataset and score the results:
from langsmith import evaluate

# Your target function (chain, agent, or any callable)
def my_app(inputs: dict) -> dict:
    # Assumes the chain returns a plain string (for example, it ends in a string output parser)
    result = chain.invoke(inputs)
    return {"answer": result}

# Custom evaluator
def correctness(run, example) -> dict:
    """Check if the answer matches expected output."""
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    score = 1.0 if expected.lower() in predicted.lower() else 0.0
    return {"key": "correctness", "score": score}

def conciseness(run, example) -> dict:
    """Penalize overly long answers."""
    answer = run.outputs["answer"]
    word_count = len(answer.split())
    score = 1.0 if word_count < 100 else max(0.0, 1.0 - (word_count - 100) / 200)
    return {"key": "conciseness", "score": score}

# Run evaluation
results = evaluate(
    my_app,
    data="customer-support-qa",  # dataset name
    evaluators=[correctness, conciseness],
    experiment_prefix="gpt4o-v2",
    max_concurrency=4,
)

# Results visible in LangSmith UI with scores, comparisons, and drill-down
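Because the evaluators above only read run.outputs and example.outputs, they can be exercised offline before running a full experiment. The sketch below repeats the two evaluators and feeds them stand-in objects; using `SimpleNamespace` stubs in place of real run and example objects is an assumption made for illustration, and `stub` is a hypothetical helper.

```python
from types import SimpleNamespace


def correctness(run, example) -> dict:
    """Same containment check as the evaluator above."""
    predicted = run.outputs["answer"]
    expected = example.outputs["answer"]
    score = 1.0 if expected.lower() in predicted.lower() else 0.0
    return {"key": "correctness", "score": score}


def conciseness(run, example) -> dict:
    """Same length penalty as the evaluator above."""
    answer = run.outputs["answer"]
    word_count = len(answer.split())
    score = 1.0 if word_count < 100 else max(0.0, 1.0 - (word_count - 100) / 200)
    return {"key": "conciseness", "score": score}


def stub(answer: str) -> SimpleNamespace:
    """Stand-in exposing only the .outputs attribute the evaluators read."""
    return SimpleNamespace(outputs={"answer": answer})


example = stub("Go to Settings > Security > Reset Password")
good_run = stub("Sure! Go to Settings > Security > Reset Password and follow the prompts.")
long_run = stub(" ".join(["word"] * 150))

print(correctness(good_run, example))
print(conciseness(long_run, example))
```

Checks like these catch key errors and scoring mistakes without spending tokens or creating experiments.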
For LLM-as-judge evaluators, create a function that calls an LLM to rate quality on a 0-1 scale. Use temperature=0 for consistency. For pairwise comparisons, use evaluate_comparative to compare two experiment runs side by side.
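An LLM-as-judge evaluator can follow the same (run, example) shape as the custom evaluators. The sketch below is one possible structure, not a LangSmith API: `make_llm_judge`, `parse_score`, and the prompt text are hypothetical, and `call_llm` stands for any function you supply that sends a prompt to your model (ideally at temperature=0) and returns its text reply.

```python
import re
from types import SimpleNamespace

JUDGE_PROMPT = (
    "Rate how well the answer matches the reference on a scale from 0 to 1.\n"
    "Question: {question}\nReference: {reference}\nAnswer: {answer}\n"
    "Reply with a single number."
)


def parse_score(text):
    """Extract the first number in the reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", text)
    if match is None:
        return None  # the judge did not return a usable number
    return min(1.0, max(0.0, float(match.group())))


def make_llm_judge(call_llm, key="llm_judge"):
    """Wrap a prompt -> text callable as a (run, example) evaluator."""
    def judge(run, example) -> dict:
        prompt = JUDGE_PROMPT.format(
            question=example.inputs["question"],
            reference=example.outputs["answer"],
            answer=run.outputs["answer"],
        )
        return {"key": key, "score": parse_score(call_llm(prompt))}
    return judge


# Offline check with a fake model that always answers "0.8"
fake_judge = make_llm_judge(lambda prompt: "Score: 0.8")
demo_example = SimpleNamespace(
    inputs={"question": "What's your refund policy?"},
    outputs={"answer": "Full refund within 30 days, no questions asked"},
)
demo_run = SimpleNamespace(outputs={"answer": "You can get a refund within 30 days."})
demo_result = fake_judge(demo_run, demo_example)
```

Injecting the model call keeps the evaluator testable and lets you swap providers without touching the scoring logic.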
Step 5: Prompt Hub and Annotation Queues
Use hub.pull("rlm/rag-prompt") to fetch shared prompts and hub.push("my-org/support-prompt", my_prompt) to version your own (hub here is the module imported with from langchain import hub). Annotation queues let you set up human review workflows: create a queue with client.create_annotation_queue(), then filter traces in the UI and send low-scoring ones for review.
Step 6: Production Monitoring
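# Filter and analyze traces
runs = client.list_runs(
    project_name="production",
    filter='and(eq(status, "error"), gt(latency, "5s"))',
    limit=50,
)
for run in runs:
    # Latency is derived from the run's timestamps; end_time may be None for unfinished runs
    latency = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    print(f"Run {run.id}: {run.error} | Latency: {latency}s | Tokens: {run.total_tokens}")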
In the LangSmith dashboard, set up automation rules to auto-flag slow runs, send low-score responses to annotation queues, and alert on error rate spikes.
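For a scripted health check alongside the dashboard rules, the runs returned by list_runs can be reduced to a few numbers. This is a sketch under the assumption that each run object exposes start_time, end_time, error, and total_tokens; `summarize_runs` is a hypothetical helper, and the 5-second threshold mirrors the filter used in this step.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def summarize_runs(runs, slow_after_s=5.0):
    """Aggregate error count, slow-run count, average latency, and token usage."""
    total = errors = slow = tokens = 0
    latencies = []
    for run in runs:
        total += 1
        if run.error:
            errors += 1
        tokens += run.total_tokens or 0
        if run.end_time is not None:  # skip runs that have not finished
            seconds = (run.end_time - run.start_time).total_seconds()
            latencies.append(seconds)
            if seconds > slow_after_s:
                slow += 1
    return {
        "runs": total,
        "errors": errors,
        "error_rate": errors / total if total else 0.0,
        "slow_runs": slow,
        "avg_latency_s": sum(latencies) / len(latencies) if latencies else None,
        "total_tokens": tokens,
    }


# Stand-in run objects for an offline check
t0 = datetime(2024, 1, 1, 12, 0, 0)
sample_runs = [
    SimpleNamespace(start_time=t0, end_time=t0 + timedelta(seconds=2), error=None, total_tokens=100),
    SimpleNamespace(start_time=t0, end_time=t0 + timedelta(seconds=8), error="Timeout", total_tokens=None),
]
summary = summarize_runs(sample_runs)
```

The function consumes any iterable, so the iterator returned by `client.list_runs(...)` can be passed in directly.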
Step 7: Testing in CI/CD
Run evaluations in CI and assert minimum quality scores:
def test_qa_quality():
    results = evaluate(my_app, data="regression-test-set", evaluators=[correctness])
    # Collect the scores once so they can be both summed and counted
    scores = [r["evaluation_results"]["results"][0].score for r in results]
    avg_score = sum(scores) / len(scores)
    assert avg_score >= 0.85, f"Quality dropped to {avg_score:.2f}"
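When several evaluators run together, indexing the first result is fragile; averaging per evaluator key is more robust. The sketch below assumes each result row has the shape used in the test above (row["evaluation_results"]["results"] holding objects with key and score); `mean_scores` and `failing_metrics` are hypothetical helpers.

```python
from collections import defaultdict
from types import SimpleNamespace


def mean_scores(rows):
    """Average score per evaluator key, skipping results without a score."""
    buckets = defaultdict(list)
    for row in rows:
        for result in row["evaluation_results"]["results"]:
            if result.score is not None:
                buckets[result.key].append(result.score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def failing_metrics(rows, thresholds):
    """Return {metric: mean} for metrics that are missing or below their minimum."""
    means = mean_scores(rows)
    return {
        key: means.get(key)
        for key, minimum in thresholds.items()
        if means.get(key) is None or means[key] < minimum
    }


def _row(**scores):
    """Build a stand-in result row from keyword scores (for offline checks)."""
    results = [SimpleNamespace(key=k, score=v) for k, v in scores.items()]
    return {"evaluation_results": {"results": results}}


demo_rows = [
    _row(correctness=1.0, conciseness=1.0),
    _row(correctness=0.0, conciseness=0.5),
    _row(correctness=1.0),
    _row(correctness=1.0),
]
failures = failing_metrics(demo_rows, {"correctness": 0.85, "conciseness": 0.7})
```

In a pytest test, `assert not failures, failures` then reports every metric that regressed rather than only the first.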
Examples
Example 1: Add tracing and evaluation to an existing RAG chatbot
User prompt: "I have a LangChain RAG chatbot answering questions about our HR policies. Add LangSmith tracing and create an evaluation pipeline that checks if answers are correct and concise."
The agent will set the LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY environment variables so all chain invocations are automatically traced. It will then create a LangSmith dataset called hr-policy-qa with 10-15 real question/answer pairs drawn from common employee queries. Next, it will write two evaluators — a correctness evaluator that checks whether the expected answer appears in the predicted output, and a conciseness evaluator that penalizes answers over 100 words. Finally, it will wire up evaluate() to run the chatbot against the dataset with both evaluators and print a summary of scores.
Example 2: Monitor production agent and alert on regressions
User prompt: "Our customer support agent is in production. Set up LangSmith monitoring to track error rates and latency, and add a CI test that fails if answer quality drops below 90%."
The agent will configure the production project in LangSmith with metadata tags for environment: production and service: support-agent. It will write a monitoring script using client.list_runs() with filters for error status and high latency (over 5 seconds), outputting a summary of total tokens, average latency, and error count. Then it will create a regression-test-set dataset from recent production traces and write a pytest test that runs evaluate() against it, asserting the average correctness score stays at or above 0.90.
Guidelines
- Always enable tracing in dev: set LANGCHAIN_TRACING_V2=true from day one
- Use projects to organize: separate dev, staging, and production traces
- Build datasets from production: real data makes the best test sets
- Start with simple evaluators: exact match and contains before LLM judges
- Run evals on every PR: catch regressions before they ship
- Use annotation queues: human review builds trust and better datasets
- Tag everything: metadata makes filtering and analysis possible
- Monitor cost: track token usage per user/feature to control spend
- Compare experiments: A/B test prompts and models systematically
- Version prompts in Hub: never lose a prompt that worked well
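The cost guideline above relies on the metadata attached at call time. As a rough sketch, token usage can be grouped by any metadata field; the record shape here (plain dicts with "metadata" and "total_tokens") is a hypothetical export format chosen for illustration, and `tokens_by` is not an SDK function.

```python
from collections import Counter


def tokens_by(records, field):
    """Sum total_tokens per value of a metadata field; untagged records go to "unknown"."""
    totals = Counter()
    for record in records:
        group = record.get("metadata", {}).get(field, "unknown")
        totals[group] += record.get("total_tokens") or 0
    return dict(totals)


records = [
    {"metadata": {"user_id": "u-123", "feature": "search"}, "total_tokens": 400},
    {"metadata": {"user_id": "u-123", "feature": "chat"}, "total_tokens": 250},
    {"metadata": {"user_id": "u-456", "feature": "chat"}, "total_tokens": 100},
    {"metadata": {}, "total_tokens": None},
]
per_user = tokens_by(records, "user_id")
per_feature = tokens_by(records, "feature")
```

A large "unknown" bucket is a sign that some call sites are not tagging their traces.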
Common Pitfalls
- Forgetting to set env vars: No tracing without LANGCHAIN_TRACING_V2=true
- Huge traces: Logging full documents in metadata slows the UI, so summarize or truncate
- Evaluator flakiness: LLM judges are non-deterministic, so use temperature=0 and run multiple times
- Not separating projects: Dev traces mixed with production make analysis impossible
- Ignoring latency data: Tracing overhead is minimal (<5ms), and the latency insights are worth it
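For the huge-traces pitfall, one option is to trim long string values before attaching them as metadata. This is a minimal sketch; `truncate_metadata` is a hypothetical helper and the 200-character limit is an arbitrary assumption, not a LangSmith limit.

```python
def truncate_metadata(metadata, max_chars=200):
    """Return a copy of metadata with long string values cut to max_chars."""
    trimmed = {}
    for key, value in metadata.items():
        if isinstance(value, str) and len(value) > max_chars:
            dropped = len(value) - max_chars
            trimmed[key] = value[:max_chars] + f"... [{dropped} more chars]"
        else:
            trimmed[key] = value  # short strings and non-string values pass through
    return trimmed


raw = {"document": "x" * 250, "user_id": "u-123", "page_count": 12}
safe = truncate_metadata(raw)
```

The trimmed dict can be used wherever metadata is passed, keeping a hint of how much text was dropped.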
Information
- Version: 1.0.0
- Author: terminal-skills
- Category: Data & AI
- License: Apache-2.0