Overview

Set up and manage Datadog for full-stack observability including infrastructure metrics, APM traces, log aggregation, dashboards, and alerting. Covers agent installation, integration configuration, monitor creation, and API usage.

Instructions

Task A: Install and Configure the Datadog Agent

Install the agent on the target host
Configure the main datadog.yaml with API key and site
Enable relevant integrations

yaml

# /etc/datadog-agent/datadog.yaml — Main agent configuration
api_key: "<YOUR_DD_API_KEY>"
site: "datadoghq.com"
hostname: "web-server-01"
tags:
  - env:production
  - service:web-api
  - team:platform
logs_enabled: true
apm_config:
  enabled: true
  env: production
process_config:
  process_collection:
    enabled: true

bash

# Install Datadog Agent on Ubuntu/Debian
DD_API_KEY="<YOUR_DD_API_KEY>" DD_SITE="datadoghq.com" \
  bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"

# Verify agent status
sudo datadog-agent status

Task B: Configure Integrations

yaml

# /etc/datadog-agent/conf.d/postgres.d/conf.yaml — PostgreSQL integration
init_config:

instances:
  - host: localhost
    port: 5432
    username: datadog
    password: "<DB_PASSWORD>"
    dbname: myapp_production
    tags:
      - env:production
      - service:database
    collect_activity_metrics: true
    collect_database_size_metrics: true

yaml

# /etc/datadog-agent/conf.d/nginx.d/conf.yaml — Nginx integration
init_config:

instances:
  - nginx_status_url: http://localhost:8080/nginx_status
    tags:
      - env:production
      - service:web-proxy

Task C: Create Monitors and Alerts

bash

# Create a metric monitor via API — High CPU alert
curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
    "name": "High CPU on {{host.name}}",
    "type": "metric alert",
    "query": "avg(last_5m):avg:system.cpu.user{env:production} by {host} > 85",
    "message": "CPU usage above 85% on {{host.name}}.\n\n@slack-ops-alerts @pagerduty-infra",
    "tags": ["env:production", "team:platform"],
    "options": {
      "thresholds": {
        "critical": 85,
        "warning": 70
      },
      "notify_no_data": true,
      "no_data_timeframe": 10,
      "renotify_interval": 30,
      "escalation_message": "CPU still elevated on {{host.name}} — escalating."
    }
  }'

bash

# Create a log-based monitor — Error rate spike
curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
    "name": "Error log spike in payment-service",
    "type": "log alert",
    "query": "logs(\"service:payment-service status:error\").index(\"main\").rollup(\"count\").by(\"service\").last(\"5m\") > 50",
    "message": "More than 50 error logs in 5 minutes for payment-service.\n\n@slack-payments-team",
    "options": {
      "thresholds": { "critical": 50, "warning": 25 },
      "enable_logs_sample": true
    }
  }'

Task D: Build Dashboards

bash

# Create a dashboard via API — Service overview
curl -X POST "https://api.datadoghq.com/api/v1/dashboard" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
    "title": "Web API Service Overview",
    "layout_type": "ordered",
    "widgets": [
      {
        "definition": {
          "type": "timeseries",
          "title": "Request Rate",
          "requests": [
            {
              "q": "sum:trace.http.request.hits{service:web-api,env:production}.as_count()",
              "display_type": "bars"
            }
          ]
        }
      },
      {
        "definition": {
          "type": "query_value",
          "title": "P99 Latency",
          "requests": [
            {
              "q": "p99:trace.http.request.duration{service:web-api,env:production}"
            }
          ],
          "precision": 2
        }
      },
      {
        "definition": {
          "type": "toplist",
          "title": "Top Endpoints by Error Rate",
          "requests": [
            {
              "q": "sum:trace.http.request.errors{service:web-api,env:production} by {resource_name}.as_count()"
            }
          ]
        }
      }
    ]
  }'

Task E: APM Instrumentation

python

# app.py — Python APM auto-instrumentation with ddtrace
from ddtrace import tracer, patch_all

# Patch all supported libraries (requests, flask, sqlalchemy, etc.)
patch_all()

tracer.configure(
    hostname="localhost",
    port=8126,
    service="payment-service",
    env="production",
    version="2.1.0",
)

from flask import Flask
app = Flask(__name__)

@app.route("/charge", methods=["POST"])
def charge():
    with tracer.trace("payment.process", service="payment-service") as span:
        span.set_tag("payment.provider", "stripe")
        result = process_payment()
        span.set_metric("payment.amount", result["amount"])
        return {"status": "ok"}

bash

# Run with ddtrace auto-instrumentation
pip install ddtrace
ddtrace-run python app.py

Task F: Log Collection and Pipelines

yaml

# /etc/datadog-agent/conf.d/python.d/conf.yaml — Custom log collection
logs:
  - type: file
    path: /var/log/myapp/*.log
    service: web-api
    source: python
    tags:
      - env:production
    log_processing_rules:
      - type: multi_line
        name: python_traceback
        pattern: "Traceback \\(most recent call last\\)"

bash

# Query logs via API — Find errors in last hour
curl -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
    "filter": {
      "query": "service:web-api status:error",
      "from": "now-1h",
      "to": "now"
    },
    "sort": "-timestamp",
    "page": { "limit": 25 }
  }'

Best Practices

Use consistent tagging: env, service, team on all resources
Set notify_no_data on critical monitors to catch silent failures
Use composite monitors to reduce alert noise by correlating signals
Configure log exclusion filters to control ingestion costs
Use Unified Service Tagging (DD_ENV, DD_SERVICE, DD_VERSION) across APM, logs, and metrics

datadog

Usage

Getting Started

Example Prompts

Information

Documentation

Overview

Instructions

Task A: Install and Configure the Datadog Agent

Task B: Configure Integrations

Task C: Create Monitors and Alerts

Task D: Build Dashboards

Task E: APM Instrumentation

Task F: Log Collection and Pipelines

Best Practices