Run an Autonomous Penetration Test with AI Agents

Tomás leads security for a fintech startup with 12 microservices, 3 web apps, and a mobile API. The team does manual penetration testing once a year — it costs $40K, takes 3 weeks, and by the time the report arrives, half the findings are already outdated because the codebase changed. He needs continuous security testing that runs against every staging deployment, catches regressions immediately, and builds institutional knowledge about what breaks and why. He deploys PentAGI as an autonomous security testing platform.

Step 1: Deploy PentAGI on Isolated Infrastructure

Security testing tools must never run on the same network as production. Tomás provisions a dedicated testing VPS with Docker.

bash

# Provision a dedicated pentest server (isolated from production)
# Requirements: 8GB RAM, 4 CPU cores, 100GB SSD

# Clone and configure PentAGI
git clone https://github.com/vxcontrol/pentagi.git
cd pentagi
cp .env.example .env

bash

# .env — Configuration for continuous security testing
# LLM — Using Anthropic for strong reasoning on complex attack chains
LLM_PROVIDER=anthropic
LLM_MODEL=claude-sonnet-4-20250514
ANTHROPIC_API_KEY=sk-ant-...

# Search — Tavily for CVE lookups and exploit research
TAVILY_API_KEY=tvly-...

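# Database
# Placeholder value: replace with a generated password before deploying
POSTGRES_PASSWORD=pentest-db-2026-secure

# Docker Compose does not run command substitution inside .env files, so
# generate each secret on the shell with `openssl rand -hex 32` and paste
# the literal output here.
SECRET_KEY=<output of: openssl rand -hex 32>

# Monitoring
GRAFANA_ADMIN_PASSWORD=grafana-secure-2026
LANGFUSE_SECRET_KEY=<output of: openssl rand -hex 32>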

# Network — bind only to internal network
BIND_ADDRESS=10.0.50.10
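
Binding only to an internal address is easy to get wrong when the `.env` file is edited by hand. A minimal pre-deploy check, sketched here in TypeScript, could confirm that the value of `BIND_ADDRESS` falls in a private IPv4 range (RFC 1918). The helper name `isPrivateIPv4` is illustrative and not part of PentAGI.

```typescript
// Hypothetical helper (not part of PentAGI): returns true when the address
// lies in one of the RFC 1918 private IPv4 ranges.
function isPrivateIPv4(address: string): boolean {
  const parts = address.split('.')
  if (parts.length !== 4) return false

  const octets: number[] = []
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return false // digits only, no signs or blanks
    const value = Number(part)
    if (value > 255) return false
    octets.push(value)
  }

  const a = octets[0]
  const b = octets[1]
  if (a === 10) return true                        // 10.0.0.0/8
  if (a === 172 && b >= 16 && b <= 31) return true // 172.16.0.0/12
  return a === 192 && b === 168                    // 192.168.0.0/16
}

console.log(isPrivateIPv4('10.0.50.10')) // true
console.log(isPrivateIPv4('0.0.0.0'))    // false
```

A check like this could run before `docker compose up -d` and abort the deploy when the address is public or a wildcard.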

bash

# Deploy the full stack
docker compose up -d

# Verify all services are healthy
docker compose ps
# NAME              STATUS       PORTS
# pentagi-api       Up (healthy) 3000/tcp
# pentagi-ui        Up           3001/tcp
# pentagi-postgres  Up (healthy) 5432/tcp
# pentagi-neo4j     Up           7474/tcp, 7687/tcp
# pentagi-grafana   Up           3002/tcp
# pentagi-langfuse  Up           3003/tcp
# pentagi-scraper   Up           9222/tcp

Step 2: Define the Engagement Scope

The first assessment targets the staging environment. Tomás defines clear boundaries — what to test, what to avoid, and what the AI agents are allowed to do.

typescript

// scripts/create-engagement.ts — Define and launch security assessment
const PENTAGI_URL = 'http://10.0.50.10:3000/graphql'

const engagement = await fetch(PENTAGI_URL, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.PENTAGI_TOKEN}`,
  },
  body: JSON.stringify({
    query: `
      mutation CreateTask($input: CreateTaskInput!) {
        createTask(input: $input) {
          id
          status
        }
      }
    `,
    variables: {
      input: {
        name: 'Staging Full Assessment — Sprint 47',
        target: 'staging.internal.finpay.dev',
        objective: `Perform a comprehensive security assessment of the FinPay staging environment.

Target services:
- Web application (React + Next.js) at staging.internal.finpay.dev
- REST API at api-staging.internal.finpay.dev
- Mobile API at mobile-staging.internal.finpay.dev
- Admin panel at admin-staging.internal.finpay.dev

Focus areas:
1. Authentication and session management flaws
2. API authorization bypass (IDOR, privilege escalation)
3. Input validation (SQL injection, XSS, SSRF)
4. Business logic flaws in payment flows
5. Exposed sensitive data in API responses
6. Misconfigured security headers
7. Known CVEs in dependencies`,

        scope: [
          'port-scan',
          'service-enum',
          'web-app-test',
          'api-fuzz',
          'auth-test',
          'vuln-scan',
        ],
        constraints: [
          'no-dos',                         // don't run denial-of-service tests
          'no-data-exfil',                  // don't extract real user data
          'no-brute-force-production',      // staging only
          'max-concurrent-requests:50',     // don't overwhelm staging infra
          'test-accounts-only',             // use provided test credentials
        ],
        credentials: {
          testUser: { email: 'pentest-user@test.finpay.dev', password: 'TestPass2026!' },
          testAdmin: { email: 'pentest-admin@test.finpay.dev', password: 'AdminTest2026!' },
        },
      },
    },
  }),
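}).then(r => r.json())

// GraphQL reports failures in an "errors" array, often alongside HTTP 200,
// so check it before reading the data field
if (engagement.errors?.length) {
  throw new Error(`createTask failed: ${JSON.stringify(engagement.errors)}`)
}
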
console.log(`Engagement started: ${engagement.data.createTask.id}`)
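
Because the agents act on whatever hosts the engagement names, a guard that rejects out-of-scope targets before the mutation is sent is a cheap safety net. The sketch below assumes every permitted host ends in `.internal.finpay.dev` (the suffix shared by the targets above). The function name `findOutOfScope` and the production-style host `api.finpay.dev` are hypothetical.

```typescript
// Hypothetical pre-flight check: return every target whose hostname falls
// outside the allowed suffix, so the script can abort before launch.
// allowedSuffix is expected to start with a dot, e.g. ".internal.finpay.dev".
function findOutOfScope(targets: string[], allowedSuffix: string): string[] {
  const suffix = allowedSuffix.toLowerCase()
  return targets.filter((target) => {
    const host = target.trim().toLowerCase()
    const isApex = host === suffix.slice(1)
    // Comparing the tail including the leading dot enforces a label boundary,
    // so "evil-internal.finpay.dev" does not pass as in scope.
    const isSubdomain =
      host.length > suffix.length && host.slice(-suffix.length) === suffix
    return !(isApex || isSubdomain)
  })
}

const requestedTargets = [
  'staging.internal.finpay.dev',
  'api-staging.internal.finpay.dev',
  'api.finpay.dev', // hypothetical production host, included to show a rejection
]

const rejected = findOutOfScope(requestedTargets, '.internal.finpay.dev')
console.log(rejected) // [ 'api.finpay.dev' ]
```

If the returned list is non-empty, the launch script could throw instead of calling the API.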

Step 3: AI Agents Execute the Assessment

Once launched, PentAGI's multi-agent system works autonomously. The primary agent orchestrates the assessment, delegating to specialized agents.

Phase 1: Reconnaissance (agents work in parallel)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[Primary Agent] Planning reconnaissance phase for staging.internal.finpay.dev
[Infra Agent]   Running: nmap -sV -sC -p- staging.internal.finpay.dev
                → Found 7 open ports: 22, 80, 443, 3000, 5432, 6379, 8080
[Research Agent] Querying CVE database for detected service versions
                → nginx/1.25.3: 2 known CVEs (low severity)
                → PostgreSQL 16.1: 1 known CVE (medium, auth bypass)
                → Redis 7.2.3 exposed without auth ⚠️ CRITICAL
[Research Agent] Web scraping: checking robots.txt, sitemap.xml, .well-known
                → Found /api/docs (Swagger UI exposed)
                → Found /.env.example (information disclosure)

Phase 2: Vulnerability Scanning
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[Primary Agent] Prioritizing targets based on reconnaissance
[Infra Agent]   Running: nikto -h https://staging.internal.finpay.dev
[Infra Agent]   Running: sqlmap --crawl=3 -u https://api-staging.internal.finpay.dev
[Dev Agent]     Testing API authorization: trying test-user credentials on admin endpoints
                → FINDING: /api/admin/users accessible with regular user token ⚠️ HIGH
[Dev Agent]     Testing payment flow for logic flaws
                → FINDING: negative amount accepted in transfer API ⚠️ CRITICAL

Phase 3: Exploitation Attempts
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[Primary Agent] Attempting exploitation of confirmed vulnerabilities
[Dev Agent]     Redis exposed without auth → connected, read session data
                → CONFIRMED: Session hijacking possible via Redis
[Dev Agent]     IDOR on /api/users/{id}/transactions — can read other users' data
                → CONFIRMED: Full transaction history accessible
[Research Agent] Searching knowledge graph for similar patterns
                → Previous engagement found same IDOR pattern in /api/users/{id}/settings
                → Checking: /api/users/{id}/settings still vulnerable → YES ⚠️

Phase 4: Report Generation
━━━━━━━━━━━━━━━━━━━━━━━━━━
[Primary Agent] Compiling findings into vulnerability report
                → 3 Critical, 4 High, 6 Medium, 8 Low findings
                → Report generated with evidence and remediation steps
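
The summary line in Phase 4 is a tally of findings by severity. A sketch of that tally follows, assuming each finding carries a lowercase `severity` string. The `Finding` type is an assumption for illustration, not PentAGI's schema, and the sample entries are taken from the log above.

```typescript
type Severity = 'critical' | 'high' | 'medium' | 'low'

// Assumed shape of a finding; the real report format may differ.
interface Finding {
  title: string
  severity: Severity
}

// Count findings per severity, keeping zero entries so the summary is stable.
function tallyBySeverity(findings: Finding[]): Record<Severity, number> {
  const tally: Record<Severity, number> = { critical: 0, high: 0, medium: 0, low: 0 }
  for (const finding of findings) {
    tally[finding.severity] += 1
  }
  return tally
}

const sampleFindings: Finding[] = [
  { title: 'Redis exposed without auth', severity: 'critical' },
  { title: 'Negative amount accepted in transfer API', severity: 'critical' },
  { title: '/api/admin/users accessible with regular user token', severity: 'high' },
]

console.log(tallyBySeverity(sampleFindings))
// { critical: 2, high: 1, medium: 0, low: 0 }
```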

Step 4: Integrate into CI/CD Pipeline

Run PentAGI automatically against every staging deployment. Fail the pipeline if critical vulnerabilities are found.

yaml

# .github/workflows/security-test.yml — Automated security testing
name: Security Assessment

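on:
  # deployment_status has no activity types to filter on, so the success
  # check lives in the job-level `if` below
  deployment_status:

jobs:
  pentest:
    if: github.event.deployment_status.state == 'success' && github.event.deployment.environment == 'staging'
    # PentAGI listens on a private address (10.0.50.10) that GitHub-hosted runners
    # cannot reach; a self-hosted runner inside that network is assumed here
    runs-on: self-hosted
    timeout-minutes: 120          # max 2 hours for security assessment
    env:
      # Assumed secret holding the base URL without /graphql, e.g. http://10.0.50.10:3000
      PENTAGI_URL: ${{ secrets.PENTAGI_URL }}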

    steps:
      - uses: actions/checkout@v4

      - name: Launch PentAGI Assessment
        id: pentest
        run: |
          TASK_ID=$(curl -s -X POST $PENTAGI_URL/graphql \
            -H "Authorization: Bearer ${{ secrets.PENTAGI_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d '{
              "query": "mutation { createTask(input: { target: \"staging.internal.finpay.dev\", objective: \"Quick security regression test: auth, API authorization, input validation\", scope: [\"web-app-test\", \"api-fuzz\", \"auth-test\"], constraints: [\"no-dos\", \"max-duration:60m\"] }) { id } }"
            }' | jq -r '.data.createTask.id')
          echo "task_id=$TASK_ID" >> $GITHUB_OUTPUT

      - name: Wait for Assessment
        run: |
          while true; do
            STATUS=$(curl -s -X POST $PENTAGI_URL/graphql \
              -H "Authorization: Bearer ${{ secrets.PENTAGI_TOKEN }}" \
              -H "Content-Type: application/json" \
              -d "{\"query\": \"{ task(id: \\\"${{ steps.pentest.outputs.task_id }}\\\") { status progress } }\"}" \
              | jq -r '.data.task.status')

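            echo "Status: $STATUS"
            if [ "$STATUS" = "completed" ]; then break; fi
            # Stop the job when the assessment itself failed, so a broken run
            # is not read as "no findings"
            if [ "$STATUS" = "failed" ]; then
              echo "::error::PentAGI assessment failed before completing"
              exit 1
            fi
            sleep 60
          done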

      - name: Check Findings
        run: |
          CRITICAL=$(curl -s -X POST $PENTAGI_URL/graphql \
            -H "Authorization: Bearer ${{ secrets.PENTAGI_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d "{\"query\": \"{ task(id: \\\"${{ steps.pentest.outputs.task_id }}\\\") { findings { severity } } }\"}" \
            | jq '[.data.task.findings[] | select(.severity == "critical")] | length')

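          echo "Critical findings: $CRITICAL"
          # Fail closed: an empty or non-numeric count means the query went wrong
          case "$CRITICAL" in
            ''|*[!0-9]*) echo "::error::Could not read findings from PentAGI"; exit 1 ;;
          esac
          if [ "$CRITICAL" -gt 0 ]; then
            echo "::error::$CRITICAL critical vulnerabilities found! Blocking deployment."
            exit 1
          fi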

      - name: Download Report
        if: always()
        run: |
          # -f makes curl exit non-zero on an HTTP error instead of saving the error body as the PDF
          curl -fsS -H "Authorization: Bearer ${{ secrets.PENTAGI_TOKEN }}" \
            "$PENTAGI_URL/api/v1/tasks/${{ steps.pentest.outputs.task_id }}/report" \
            -o security-report.pdf

      - name: Upload Report Artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: security-report-${{ github.sha }}
          path: security-report.pdf
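
The pass/fail rule is spread across several workflow steps, which makes it hard to see at a glance. Stated in one place, a strict version of the gate might look like the sketch below. `gateDecision` is a hypothetical function, not a PentAGI API. It assumes the status values `completed` and `failed` used in the workflow, and it fails closed on any status other than `completed`.

```typescript
interface GateResult {
  pass: boolean
  reason: string
}

// Decide whether the pipeline may continue, given the final task status and
// the severity of each finding. Anything other than a completed run blocks.
function gateDecision(status: string, severities: string[]): GateResult {
  if (status !== 'completed') {
    return { pass: false, reason: `assessment ended with status "${status}"` }
  }
  const critical = severities.filter((s) => s.toLowerCase() === 'critical').length
  if (critical > 0) {
    return { pass: false, reason: `${critical} critical finding(s)` }
  }
  return { pass: true, reason: 'no critical findings' }
}

console.log(gateDecision('completed', ['high', 'critical']))
// { pass: false, reason: '1 critical finding(s)' }
console.log(gateDecision('completed', ['medium', 'low']))
// { pass: true, reason: 'no critical findings' }
```

Keeping the rule in one function makes it straightforward to tighten later, for example by also blocking on high-severity findings.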

Step 5: Knowledge Graph Compounds Over Time

After 3 months of continuous testing (2 assessments/week):

Knowledge Graph Statistics:
├── 847 vulnerability patterns stored
├── 234 unique service fingerprints
├── 156 successful exploitation paths
├── 89 technology stack profiles
└── 12 recurring vulnerability categories

The AI agents now:
- Skip reconnaissance on known services (saves 15min per run)
- Immediately test for patterns that recurred in past sprints
- Correlate new findings with historical data
  ("This IDOR is the same pattern we found in Sprint 41 — the fix was incomplete")
- Predict which new features are likely to introduce specific vulnerability types
  ("Payment endpoint changed → testing for amount manipulation and race conditions first")
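
The correlation step can be pictured as a streak count: how many sprints in a row did the same pattern show up? The sketch below is not how PentAGI's knowledge graph works internally. It is a plain TypeScript illustration over an assumed `{ sprint, pattern }` record, with hypothetical pattern names and sprint numbers (41 to 44).

```typescript
// Assumed record shape: one entry per finding, tagged with the sprint it was found in.
interface SprintFinding {
  sprint: number
  pattern: string
}

// For each pattern, compute the longest run of consecutive sprints in which it appeared.
function longestStreaks(findings: SprintFinding[]): Record<string, number> {
  const sprintsByPattern: Record<string, number[]> = {}
  for (const finding of findings) {
    if (!sprintsByPattern[finding.pattern]) sprintsByPattern[finding.pattern] = []
    sprintsByPattern[finding.pattern].push(finding.sprint)
  }

  const streaks: Record<string, number> = {}
  for (const pattern of Object.keys(sprintsByPattern)) {
    const sorted = sprintsByPattern[pattern].slice().sort((a, b) => a - b)
    let best = 0
    let run = 0
    let previous = Number.NaN
    for (const sprint of sorted) {
      if (sprint === previous) continue // same sprint reported twice
      run = sprint === previous + 1 ? run + 1 : 1
      if (run > best) best = run
      previous = sprint
    }
    streaks[pattern] = best
  }
  return streaks
}

const history: SprintFinding[] = [
  { sprint: 41, pattern: 'idor-user-resource' },
  { sprint: 42, pattern: 'idor-user-resource' },
  { sprint: 43, pattern: 'idor-user-resource' },
  { sprint: 44, pattern: 'idor-user-resource' },
  { sprint: 41, pattern: 'xss-reflected' },
  { sprint: 44, pattern: 'xss-reflected' },
]

console.log(longestStreaks(history))
// idor-user-resource: 4, xss-reflected: 1
```

A streak of several sprints for one pattern is a hint that the fix is being applied per endpoint rather than at a shared layer.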

Results

Security testing frequency goes from once per year to twice per week. The time to surface a critical vulnerability drops from 3 weeks (the length of the annual pentest) to 2 hours (the first CI/CD run after the vulnerable code ships). The Redis exposure, which had gone undetected for 8 months, is found in the first automated assessment. The knowledge graph catches a recurring IDOR pattern across 4 consecutive sprints, showing that the root cause is a bug in shared authorization middleware rather than in individual endpoints, so the team fixes the middleware once instead of patching endpoints one by one.

Annual security testing cost drops from $40K (external pentest firm) to $3K (LLM API costs plus server hosting), while coverage increases from 1 assessment per year to 100+. The engineering team fixes critical findings within 48 hours because the report arrives while the code is still fresh in their minds, not 3 weeks later.
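
The coverage and cost figures can be checked with a few lines of arithmetic. The sketch assumes a 52-week year and that the $3K figure covers a full year at the twice-weekly cadence. The function name `costPerAssessment` is illustrative.

```typescript
// Cost of a single assessment, given the annual spend and the number of runs per year.
function costPerAssessment(annualCostUsd: number, assessmentsPerYear: number): number {
  return annualCostUsd / assessmentsPerYear
}

const automatedRunsPerYear = 2 * 52 // two assessments per week

console.log(automatedRunsPerYear)                                        // 104
console.log(costPerAssessment(40000, 1))                                 // 40000
console.log(costPerAssessment(3000, automatedRunsPerYear).toFixed(2))    // 28.85
```

Under those assumptions, twice-weekly testing gives 104 runs per year, which matches the "100+" figure, at roughly $29 per run.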
