Overview
theHarvester is a passive OSINT tool that aggregates information about a target domain from multiple public sources. It finds email addresses, subdomains, hostnames, and IP ranges without making any direct requests to the target — making it ideal for stealth recon during the pre-engagement phase of penetration tests or OSINT investigations.
Sources include: Google, Bing, DuckDuckGo, LinkedIn, Shodan, Hunter.io, CertSpotter, DNSDumpster, VirusTotal, and more.
Instructions
Step 1: Install theHarvester
bash
# Option 1: pip (in a virtual environment recommended)
pip install theHarvester
# Option 2: Clone from GitHub (most up-to-date)
git clone https://github.com/laramies/theHarvester.git
cd theHarvester
pip install -r requirements/base.txt
# Option 3: Docker
docker pull ghcr.io/laramies/theharvester
docker run ghcr.io/laramies/theharvester -d example.com -b google
Step 2: Basic usage
bash
# Syntax: theHarvester -d <domain> -b <source> [options]
# -d target domain
# -b data source(s)
# -l limit results (default: 500)
# -f output filename (supports XML and JSON)
# -n DNS lookup on discovered hosts
# -v verify host via DNS resolution
# Search a single source
theHarvester -d example.com -b google
# Search all available sources
theHarvester -d example.com -b all
# Limit results, enable DNS lookup, save output
theHarvester -d example.com -b google,bing,linkedin -l 200 -n -f results_example
# Run from cloned repo
python3 theHarvester.py -d example.com -b all -l 500 -f output
Step 3: Choose sources strategically
bash
# Email harvesting — best sources
theHarvester -d example.com -b google,bing,hunter,linkedin
# Subdomain enumeration — best sources
theHarvester -d example.com -b certspotter,dnsdumpster,virustotal,shodan
# Comprehensive (slower, uses all sources)
theHarvester -d example.com -b all -l 1000 -f full_recon_example
# LinkedIn employee discovery (requires LinkedIn API key in api-keys.yaml)
theHarvester -d example.com -b linkedin -l 200
Step 4: Configure API keys
yaml
# api-keys.yaml (place in theHarvester directory or specify with -c flag)
apikeys:
hunter:
key: YOUR_HUNTER_IO_KEY
shodan:
key: YOUR_SHODAN_KEY
virustotal:
key: YOUR_VIRUSTOTAL_KEY
binaryedge:
key: YOUR_BINARYEDGE_KEY
fullhunt:
key: YOUR_FULLHUNT_KEY
securityTrails:
key: YOUR_SECURITYTRAILS_KEY
github:
key: YOUR_GITHUB_TOKEN
Step 5: Parse and process output with Python
python
import json
import subprocess
import re
def run_harvester(domain, sources="google,bing,certspotter,dnsdumpster", limit=500):
"""Run theHarvester and return parsed results."""
output_file = f"harvester_{domain.replace('.', '_')}"
cmd = [
"theHarvester",
"-d", domain,
"-b", sources,
"-l", str(limit),
"-f", output_file,
]
print(f"Running: {' '.join(cmd)}")
result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
print(result.stdout)
# Parse JSON output
json_file = f"{output_file}.json"
try:
with open(json_file) as f:
data = json.load(f)
return data
except FileNotFoundError:
# Fall back to parsing stdout
return parse_stdout(result.stdout)
def parse_stdout(output):
"""Extract emails, hosts, and IPs from raw stdout."""
emails = set(re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', output))
# Filter out false positives
emails = {e for e in emails if not e.endswith(('.png', '.jpg', '.css', '.js'))}
hosts = set(re.findall(r'[\w\.-]+\.\w{2,}', output))
ips = set(re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', output))
return {"emails": list(emails), "hosts": list(hosts), "ips": list(ips)}
def deduplicate_and_report(data, domain):
"""Clean and summarize harvested data."""
emails = sorted(set(data.get("emails", [])))
hosts = sorted(set(data.get("hosts", [])))
ips = sorted(set(data.get("ips", [])))
# Filter to target domain
domain_emails = [e for e in emails if domain in e]
domain_hosts = [h for h in hosts if domain in h]
print(f"\n=== Harvest Report: {domain} ===")
print(f"Emails found: {len(domain_emails)}")
print(f"Subdomains: {len(domain_hosts)}")
print(f"IP addresses: {len(ips)}")
if domain_emails:
print("\nEmails:")
for e in domain_emails[:20]:
print(f" {e}")
if domain_hosts:
print("\nSubdomains:")
for h in domain_hosts[:20]:
print(f" {h}")
return {
"emails": domain_emails,
"subdomains": domain_hosts,
"ips": ips,
}
# Usage
results = run_harvester("target-company.com", sources="google,bing,certspotter,hunter")
clean = deduplicate_and_report(results, "target-company.com")
# Save cleaned results
with open("clean_results.json", "w") as f:
json.dump(clean, f, indent=2)
Step 6: Combine with other tools
bash
# Pass discovered subdomains to nmap (only with explicit authorization)
theHarvester -d example.com -b all -f hosts
cat hosts.json | python3 -c "
import json, sys
data = json.load(sys.stdin)
for host in data.get('hosts', []):
print(host)
" > subdomains.txt
# Feed subdomains into amass for deeper DNS enumeration
cat subdomains.txt | amass enum -df - -passive
# Check emails against breach databases
cat emails.txt | while read email; do
curl -s "https://haveibeenpwned.com/api/v3/breachedaccount/$email" \
-H "hibp-api-key: YOUR_HIBP_KEY"
done
Available Sources Reference
| Source | Data Type | API Key Required |
|---|---|---|
google | Emails, subdomains | No |
bing | Emails, subdomains | No |
duckduckgo | Emails, subdomains | No |
linkedin | Employees, emails | Optional |
hunter | Emails | Yes |
certspotter | Subdomains (SSL certs) | No |
dnsdumpster | Subdomains, IPs | No |
virustotal | Subdomains | Yes |
shodan | IPs, open ports | Yes |
securitytrails | Subdomains, DNS | Yes |
github | Emails, code | Yes |
binaryedge | IPs, services | Yes |
Guidelines
- Always get authorization before running theHarvester against a target — passive does not mean invisible. Data queries may be logged by third-party services.
- Rate limits: Without API keys, theHarvester relies on scraping search engines which may throttle or block requests. Add API keys for reliable results.
- Combine sources: No single source is complete. Use multiple sources and deduplicate.
- Email format detection: Once you have a few emails (e.g.,
jsmith@corp.com,john.smith@corp.com), infer the naming convention and use it to generate a target list. - DNS verification: Always use
-nor-vto verify discovered hosts are live before reporting.