Overview
Build a systematic approach to prompt engineering. Design test cases, define evaluation rubrics, run prompt variants against edge cases, and compare results to find the best-performing prompt for your use case.
Instructions
1. Define the evaluation criteria
Before testing prompts, establish what "good" looks like:
## Evaluation Rubric: Customer Support Classifier
| Criterion | Weight | Description |
|---------------|--------|------------------------------------------|
| Accuracy | 40% | Correct category assigned |
| Consistency | 25% | Same input → same output across runs |
| Latency | 15% | Response time under threshold |
| Format | 10% | Output matches expected JSON schema |
| Edge cases | 10% | Handles ambiguous/unusual inputs |
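To make the rubric actionable, the weights can be encoded directly in code. The sketch below is a minimal example, assuming each criterion has already been scored on a 0-1 scale; the criterion names and weights mirror the table above, but the `score_prompt` helper is illustrative, not a required API.

```python
# Hypothetical helper: combine per-criterion scores (each 0-1) into a weighted total.
RUBRIC_WEIGHTS = {
    "accuracy": 0.40,
    "consistency": 0.25,
    "latency": 0.15,
    "format": 0.10,
    "edge_cases": 0.10,
}

def score_prompt(criterion_scores: dict) -> float:
    """Weighted sum of per-criterion scores; assumes every rubric key is present."""
    return sum(RUBRIC_WEIGHTS[name] * criterion_scores[name] for name in RUBRIC_WEIGHTS)

# Example: a variant scoring 0.91 accuracy, 0.94 consistency, and so on.
print(score_prompt({"accuracy": 0.91, "consistency": 0.94, "latency": 1.0,
                    "format": 1.0, "edge_cases": 0.8}))
```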
2. Create test cases
Build a test suite covering normal cases, edge cases, and adversarial inputs:
```yaml
test_cases:
  - id: TC-001
    input: "My order hasn't arrived and it's been 2 weeks"
    expected_category: "shipping_delay"
    expected_priority: "high"
    tags: [normal, shipping]
  - id: TC-002
    input: "I love your product! Also my payment failed"
    expected_category: "billing"
    expected_priority: "high"
    tags: [mixed-intent, edge-case]
  - id: TC-003
    input: "asdf jkl; 12345"
    expected_category: "unclassifiable"
    expected_priority: "low"
    tags: [adversarial, garbage-input]
  - id: TC-004
    input: ""
    expected_category: "unclassifiable"
    expected_priority: "low"
    tags: [adversarial, empty-input]
```
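If the suite lives in a YAML file, a short loader can validate it before any evaluation runs. This is a sketch assuming the file is named `test_cases.yaml` and that PyYAML is installed; the field names match the suite above.

```python
import yaml  # PyYAML

REQUIRED_FIELDS = {"id", "input", "expected_category", "expected_priority", "tags"}

def load_test_cases(path: str = "test_cases.yaml") -> list:
    """Load the suite and fail fast if any case is missing a required field."""
    with open(path, encoding="utf-8") as f:
        cases = yaml.safe_load(f)["test_cases"]
    for case in cases:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"{case.get('id', '<no id>')} is missing fields: {missing}")
    return cases
```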
3. Design prompt variants
Create 2-3 prompt variants to compare:
Variant A (Concise):

```
Classify this support ticket into one category: billing, shipping_delay,
product_defect, account_access, feature_request, unclassifiable.
Return JSON: {"category": "...", "priority": "high|medium|low"}
```

Variant B (Detailed with examples):

```
You are a support ticket classifier. Analyze the customer message and
assign exactly one category and priority level.

Categories: billing, shipping_delay, product_defect, account_access,
feature_request, unclassifiable

Rules:
- If the message contains multiple issues, classify by the most urgent
- If the message is gibberish or empty, use "unclassifiable"
- Priority is "high" for payment/shipping issues, "medium" for product
  issues, "low" for feature requests

Examples:
Input: "I was charged twice for my subscription"
Output: {"category": "billing", "priority": "high"}

Input: "It would be nice to have dark mode"
Output: {"category": "feature_request", "priority": "low"}

Now classify this message:
```
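Keeping the variants in one place makes them easy to loop over during evaluation. A minimal sketch, assuming the ticket text is appended after the instructions; the dict keys and the `variant_b_prompt.txt` file name are illustrative, not fixed conventions.

```python
# Illustrative: store each variant's instructions keyed by name so the
# evaluation loop can iterate over them. Variant B's full text is assumed
# to live in a sibling file rather than being repeated inline.
VARIANTS = {
    "A_concise": (
        "Classify this support ticket into one category: billing, shipping_delay,\n"
        "product_defect, account_access, feature_request, unclassifiable.\n"
        'Return JSON: {"category": "...", "priority": "high|medium|low"}'
    ),
    "B_detailed": open("variant_b_prompt.txt", encoding="utf-8").read(),
}

def build_prompt(variant: str, ticket: str) -> str:
    """Append the ticket text to the chosen variant's instructions."""
    return VARIANTS[variant] + "\n\n" + ticket
```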
4. Run the evaluation
Execute each prompt variant against all test cases and score:
## Evaluation Results
| Metric | Variant A | Variant B |
|--------------|-----------|-----------|
| Accuracy | 72% | 91% |
| Consistency | 85% | 94% |
| Format match | 100% | 100% |
| Edge cases | 40% | 80% |
| Avg tokens | 12 | 18 |
### Detailed Results
| Test Case | Variant A | Variant B | Expected |
|-----------|---------------------|---------------------|---------------------|
| TC-001 | ✅ shipping_delay | ✅ shipping_delay | shipping_delay |
| TC-002 | ❌ general_inquiry | ✅ billing | billing |
| TC-003 | ❌ feature_request | ✅ unclassifiable | unclassifiable |
| TC-004 | ❌ (error) | ✅ unclassifiable | unclassifiable |
Winner: Variant B (+19% accuracy, +40% edge case handling)
Tradeoff: ~50% more tokens per request
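A small harness can produce results like the tables above. The sketch below is illustrative rather than a full framework: it assumes a `classify(prompt)` function that calls your model and returns its raw text, repeats each case a few times to surface inconsistency, and counts `json.loads` failures as format errors.

```python
import json
from collections import defaultdict

RUNS_PER_CASE = 3  # repeated runs surface non-determinism (see Guidelines)

def evaluate(variants: dict, test_cases: list, classify) -> dict:
    """Score each variant on accuracy, format compliance, and run-to-run consistency.

    `classify(prompt)` is assumed to call the model and return its raw text output.
    """
    scores = defaultdict(lambda: {"accuracy": 0, "format": 0, "consistency": 0})
    for name, template in variants.items():
        for case in test_cases:
            parsed = []
            for _ in range(RUNS_PER_CASE):
                raw = classify(template + "\n\n" + case["input"])
                try:
                    parsed.append(json.loads(raw))
                except json.JSONDecodeError:
                    parsed.append(None)
            first = parsed[0]
            if all(p is not None for p in parsed):
                scores[name]["format"] += 1
            if first is not None and first.get("category") == case["expected_category"]:
                scores[name]["accuracy"] += 1
            if all(p == first for p in parsed):
                scores[name]["consistency"] += 1
    n = len(test_cases)
    return {name: {k: v / n for k, v in s.items()} for name, s in scores.items()}
```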
5. Iterate and refine
Based on failures, improve the winning variant:
- Analyze which test cases failed and why
- Add specific rules for failure patterns
- Rerun only the failed cases to verify fixes
- Document the final prompt with its test results
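Re-running only the failures keeps iteration cheap. A minimal sketch, assuming the previous run recorded the IDs of failed cases and reusing the `evaluate` helper from the step 4 sketch:

```python
def rerun_failures(failed_ids: set, test_cases: list, variants: dict, classify) -> dict:
    """Filter the suite down to previously failed cases and re-evaluate only those."""
    subset = [case for case in test_cases if case["id"] in failed_ids]
    return evaluate(variants, subset, classify)  # `evaluate`/`classify` from step 4 sketch

# Example: verify the fix for the two edge cases that failed last round.
# rerun_failures({"TC-003", "TC-004"}, test_cases, VARIANTS, classify)
```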
Examples
Example 1: Summarization Prompt
Input: "I need a prompt that summarizes meeting transcripts into action items. Help me test it."
Output:
## Test Suite: Meeting Summary Prompt
Test cases created: 8
- 3 normal meetings (standup, planning, retrospective)
- 2 edge cases (meeting with no action items, all-chat meeting)
- 2 long transcripts (>5000 words)
- 1 adversarial (transcript in mixed languages)
Rubric:
- Action item extraction: 40% (did it find all action items?)
- Owner assignment: 25% (did it identify who owns each item?)
- Brevity: 20% (summary under 200 words?)
- No hallucination: 15% (no invented action items?)
Running 3 variants... Results ready.
Example 2: Code Review Prompt
Input: "Compare these two system prompts for our AI code reviewer"
Output:
## Comparison: Code Review Prompts
Test cases: 6 code snippets (SQL injection, race condition, clean code,
style-only issues, empty file, 500-line file)
| Metric | Prompt A | Prompt B |
|---------------------|----------|----------|
| Bug detection | 4/6 | 6/6 |
| False positives | 3 | 1 |
| Actionable feedback | 60% | 90% |
| Handles large files | ❌ | ✅ |
Prompt B is better: fewer false positives, catches all bugs,
and handles edge cases. Main improvement: explicit severity levels
and "only report issues you are confident about" instruction.
Guidelines
- Always define evaluation criteria BEFORE testing — prevents post-hoc rationalization
- Test at least 8-10 cases: 50% normal, 30% edge cases, 20% adversarial
- Run each variant 3 times to check consistency (LLMs are non-deterministic)
- Track token usage alongside quality — cost matters at scale
- Keep a prompt changelog: version, date, changes, test results
- The winning prompt isn't always the longest — sometimes concise prompts outperform
- Document failure modes: knowing when a prompt breaks is as valuable as knowing when it works
- For production prompts, add regression tests and rerun when updating the model version
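For that last guideline, the same suite can double as a regression test that runs whenever the prompt or model version changes. A sketch using pytest; the module name, threshold, and helpers are assumptions carried over from the earlier sketches, not fixed requirements.

```python
import pytest

# Assumed module collecting the earlier sketches (loader, harness, variants, model call).
from prompt_eval import load_test_cases, evaluate, VARIANTS, classify

ACCURACY_THRESHOLD = 0.85  # assumed minimum bar; tune for your use case

@pytest.mark.parametrize("variant", ["B_detailed"])
def test_prompt_regression(variant):
    """Fail the build if the production prompt drops below the accuracy bar."""
    cases = load_test_cases("test_cases.yaml")
    scores = evaluate({variant: VARIANTS[variant]}, cases, classify)
    assert scores[variant]["accuracy"] >= ACCURACY_THRESHOLD
```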