The Situation
Dmitri is a DevOps engineer at a SaaS company running 12 microservices on AWS. When an alert fires, his current workflow is:
- SSH into the affected server (2 min)
- Check application logs for errors (5 min)
- Check system metrics — CPU, memory, disk (3 min)
- Identify root cause from logs (10 min)
- Restart service or apply fix (5 min)
- Verify recovery (5 min)
- Create Jira ticket with full incident report (15 min)
Total: ~45 minutes per incident. With 3-4 incidents per week, that's 6-8 hours of repetitive work.
Goal: Build a Goose-powered agent that handles steps 1-7 from a single chat, cutting response time to under 10 minutes.
What You'll Build
A Goose profile called "incident-responder" with:
- SSH access to production servers
- Jira integration via MCP for ticket creation
- Custom health-check extension
- Pre-built incident response playbooks
Step 1 — Install Goose
# macOS
brew install block/tap/goose
# Linux / cross-platform
pipx install goose-ai
# Verify
goose --version
Set up your LLM provider:
export GOOSE_PROVIDER=anthropic
export ANTHROPIC_API_KEY=sk-ant-...
# Or use the interactive setup
goose configure
Step 2 — Configure SSH Extension
Goose's built-in developer extension already has shell access. For remote servers, set up SSH keys:
# Ensure SSH keys are configured for your servers
ssh-add ~/.ssh/prod_key
# Test direct access
ssh deploy@prod-web-01 "hostname && uptime"
No special Goose extension needed — the developer extension executes shell commands including SSH.
Step 3 — Add Jira MCP Tool
Install the Jira MCP server:
npm install -g @modelcontextprotocol/server-jira
Get a Jira API token from: https://id.atlassian.com/manage-profile/security/api-tokens
export JIRA_URL="https://yourteam.atlassian.net"
export JIRA_EMAIL="dmitri@company.com"
export JIRA_TOKEN="your-api-token"
Step 4 — Create the Incident Responder Profile
Create ~/.config/goose/profiles.yaml:
incident-responder:
provider: anthropic
model: claude-sonnet-4-20250514
extensions:
- name: developer
- name: jira
type: mcp
command: npx
args: ["-y", "@modelcontextprotocol/server-jira"]
env:
JIRA_URL: "${JIRA_URL}"
JIRA_EMAIL: "${JIRA_EMAIL}"
JIRA_TOKEN: "${JIRA_TOKEN}"
system_prompt: |
You are an incident response agent for a SaaS platform.
## Servers
- prod-web-01, prod-web-02: Nginx + Node.js frontend
- prod-api-01, prod-api-02: Python FastAPI backend
- prod-db-01: PostgreSQL primary
- prod-redis-01: Redis cache
## SSH Access
User: deploy, Key: default SSH agent
## Log Locations
- Nginx: /var/log/nginx/access.log, error.log
- App: /var/log/app/app.log
- System: journalctl -u <service>
## Playbook
1. SSH into affected server
2. Check recent error logs (last 15 minutes)
3. Check system resources (free -h, df -h, top -bn1)
4. Identify root cause
5. Apply fix (restart service if needed: sudo systemctl restart <service>)
6. Verify recovery (curl health endpoint)
7. Create Jira ticket in OPS project with:
- Title: [INCIDENT] <brief description>
- Priority based on severity
- Full timeline and root cause in description
Step 5 — Test with a Real Scenario
Launch the incident responder:
goose session --profile incident-responder --name "incident-$(date +%s)"
Scenario: API Returning 502 Errors
You: Monitoring alert: prod-api-01 returning 502 errors.
Rate: 23% of requests in last 5 minutes. Investigate and fix.
Goose: I'll investigate the 502 errors on prod-api-01.
[Executing: ssh deploy@prod-api-01 "tail -100 /var/log/app/app.log | grep -i error"]
Found multiple OOM errors:
MemoryError: Cannot allocate 256MB for request processing
Worker pid:4523 exited with code 137 (OOM killed)
[Executing: ssh deploy@prod-api-01 "free -h"]
Total: 8GB, Used: 7.6GB, Available: 180MB
Root cause: Memory exhaustion. Checking what's consuming memory...
[Executing: ssh deploy@prod-api-01 "ps aux --sort=-%mem | head -10"]
Found: 3 zombie worker processes consuming 2GB each from a failed deployment.
[Executing: ssh deploy@prod-api-01 "sudo systemctl restart app-api"]
Service restarted. Verifying...
[Executing: ssh deploy@prod-api-01 "curl -s localhost:8000/health"]
{"status": "healthy", "memory_usage": "42%"}
[Executing: curl -s https://api.company.com/health]
{"status": "healthy", "response_time_ms": 45}
Service recovered. Creating Jira ticket...
[Creates Jira ticket OPS-847]:
Title: [INCIDENT] prod-api-01 502 errors — OOM from zombie workers
Priority: High
Description: Full timeline, root cause, and resolution steps
✅ Incident resolved. Jira ticket: OPS-847
Total time: ~3 minutes
Step 6 — Add Custom Health Check Extension
For proactive monitoring, create a custom extension:
Save as ~/.config/goose/extensions/health_checker.py:
import json
import subprocess
def check_all_services():
"""Check health of all production services."""
services = {
"web-01": "https://prod-web-01.internal:443/health",
"web-02": "https://prod-web-02.internal:443/health",
"api-01": "http://prod-api-01.internal:8000/health",
"api-02": "http://prod-api-02.internal:8000/health",
}
results = {}
for name, url in services.items():
try:
r = subprocess.run(
["curl", "-s", "-o", "/dev/null", "-w", "%{http_code}", "--max-time", "5", url],
capture_output=True, text=True
)
code = int(r.stdout.strip())
results[name] = {"status": "healthy" if code == 200 else "unhealthy", "code": code}
except Exception as e:
results[name] = {"status": "error", "error": str(e)}
return json.dumps(results, indent=2)
if __name__ == "__main__":
print(check_all_services())
Add to the profile:
incident-responder:
extensions:
# ... existing extensions ...
- name: health-check
type: mcp
command: python
args: ["~/.config/goose/extensions/health_checker.py"]
Results
| Metric | Before (Manual) | After (Goose) |
|---|---|---|
| Response time | ~45 min | ~8 min |
| SSH + log check | 10 min | 30 sec (automated) |
| Root cause analysis | 10 min | 2 min (AI-assisted) |
| Jira documentation | 15 min | 30 sec (auto-generated) |
| Weekly time spent | 6-8 hours | 1-1.5 hours |
| Consistency | Varies by engineer | Standardized playbook |
Key win: Dmitri now handles incidents from his phone using Goose's chat interface. The agent does the SSH work, he reviews and approves the fix. Jira tickets are created automatically with full context — no more writing incident reports at 3 AM.
Tips
- Start small: Begin with read-only access (log checking), then add write actions (restarts)
- Use named sessions:
--name "incident-123"lets you resume context if disconnected - Audit trail: Goose logs every command it runs — export session logs for compliance
- Safety: Add confirmation prompts for destructive actions in the system prompt
- Multiple profiles: Create separate profiles for different severity levels (P1 gets more aggressive, P3 is read-only)