# PHP Web Scraping Examples

Production-ready PHP web scraping examples covering session management, async scraping, anti-bot bypass, and data storage. Each script is self-contained and ready to run.

Companion repository to: Modern PHP Web Scraping in 2026: Complete Technical Guide
## Table of Contents

- Features
- Requirements
- Installation
- Project Structure
- Examples
- Key Techniques
- Performance Benchmarks
- Disclaimer
## Features

- **Session Management**: cookie jars, persistent sessions, login handling
- **XPath Parsing**: 585% faster than CSS selectors
- **Async Scraping**: 3.3x faster with Guzzle Pool
- **Anti-Bot Bypass**: proxy rotation, header spoofing, rate limiting
- **Data Storage**: PostgreSQL JSONB, CSV, JSON, JSONL
- **HasData API**: AI-powered extraction, JS rendering, proxy rotation
## Requirements

- PHP 8.1+
- Composer
- Extensions: curl, mbstring, dom, libxml

PHP 8.3+ is recommended for best performance.
## Installation

```bash
git clone https://github.com/hasdata/php-scraper.git
cd php-scraper
composer install
```

Run any example directly:

```bash
php 01-session-management/login_cookie_jar.php
php 02-dom-parsing/extract_products_xpath.php
php 03-async-scraping/async_pool.php
```

## Project Structure

```
php-scraper/
│
├── 01-session-management/
│   ├── login_cookie_jar.php        # Cookie jar basics
│   └── persistent_session.php      # FileCookieJar for cron jobs
│
├── 02-dom-parsing/
│   └── extract_products_xpath.php  # XPath product extraction
│
├── 03-async-scraping/
│   └── async_pool.php              # Guzzle Pool with concurrency control
│
├── 04-anti-bot/
│   └── anti_bot_demo.php           # Proxy rotation, headers, rate limits
│
├── 05-data-storage/
│   └── storage_examples.php        # CSV, JSON, JSONL, duplicate detection
│
├── 06-hasdata-api/
│   └── hasdata_examples.php        # AI extraction, JS rendering
│
├── composer.json
└── README.md
```
## Examples

### Session Management: Cookie Jar Basics

Maintains a session across multiple requests:

```bash
php 01-session-management/login_cookie_jar.php
```

Key concepts:

- Automatic cookie capture from `Set-Cookie` headers
- Session persistence across requests
- Login flow handling
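The cookie-capture idea can be sketched in plain PHP. This is a simplified illustration of what Guzzle's `CookieJar` automates for you; the class name and parsing below are invented for the example and skip domain, path, and expiry handling:

```php
<?php
// Minimal sketch of a cookie jar: capture cookies from Set-Cookie
// response headers and replay them on subsequent requests.
// Guzzle's CookieJar additionally handles domains, paths, expiry,
// and Secure/HttpOnly flags.

class SimpleCookieJar
{
    /** @var array<string,string> cookie name => value */
    private array $cookies = [];

    // Store a cookie from a raw Set-Cookie header value.
    public function storeFromHeader(string $setCookie): void
    {
        // The name=value pair is everything before the first ';'
        $pair = explode(';', $setCookie, 2)[0];
        [$name, $value] = array_map('trim', explode('=', $pair, 2));
        $this->cookies[$name] = $value;
    }

    // Build the Cookie request header for the next request.
    public function toRequestHeader(): string
    {
        $parts = [];
        foreach ($this->cookies as $name => $value) {
            $parts[] = "$name=$value";
        }
        return implode('; ', $parts);
    }
}

$jar = new SimpleCookieJar();
$jar->storeFromHeader('PHPSESSID=abc123; Path=/; HttpOnly');
$jar->storeFromHeader('csrf_token=xyz789; Secure');
echo $jar->toRequestHeader(); // PHPSESSID=abc123; csrf_token=xyz789
```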
### Session Management: Persistent Sessions

Saves cookies to disk for cron jobs:

```bash
php 01-session-management/persistent_session.php
```

Key concepts:

- `FileCookieJar` for disk storage
- Session expiration detection
- Re-authentication on timeout
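The disk-persistence idea behind `FileCookieJar` is simple: serialize the jar to JSON so the next cron run can resume an authenticated session. A minimal sketch (the file name and array shape are illustrative, not Guzzle's on-disk format):

```php
<?php
// Sketch of disk-persisted cookies, the idea behind Guzzle's
// FileCookieJar: save the jar as JSON so a cron job can resume
// an authenticated session on its next invocation.

function saveCookies(string $path, array $cookies): void
{
    file_put_contents($path, json_encode($cookies, JSON_PRETTY_PRINT));
}

function loadCookies(string $path): array
{
    if (!is_file($path)) {
        return []; // first run: no saved session yet
    }
    return json_decode(file_get_contents($path), true) ?? [];
}

$path = sys_get_temp_dir() . '/session_cookies.json';

// First run: store the session cookie received after login.
saveCookies($path, ['PHPSESSID' => 'abc123']);

// Later run (the next cron invocation): restore it.
$cookies = loadCookies($path);
echo $cookies['PHPSESSID']; // abc123
```

In the real script you would also detect an expired session (e.g. a redirect back to the login page) and re-authenticate before continuing.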
### DOM Parsing: XPath Extraction

Extracts product data using XPath (585% faster than CSS selectors):

```bash
php 02-dom-parsing/extract_products_xpath.php
```

Output: `products.json`

Key concepts:

- XPath for performance-critical code
- Optional field handling with `count()` checks
- Broken HTML handling

Sample output:

```
6 products found:

1. Acer 5750
   Price: $1,400.00 (was $1,600.00)
   Description: Acer Aspire 5750
   URL: https://electronics.nop-templates.com/acer-5750
```
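The same approach can be sketched with PHP's built-in `DOMXPath`, no dependencies required. The HTML snippet, class names, and the deliberately broken markup below are invented for illustration (the repo script targets a real demo shop); note the length check before reading the optional "old price" field:

```php
<?php
// DOMXPath extraction sketch on inline HTML. The old-price span is
// deliberately left unclosed: DOMDocument repairs broken markup, and
// libxml_use_internal_errors() silences the parser warnings.

$html = <<<HTML
<div class="product"><h2>Acer 5750</h2>
  <span class="price">$1,400.00</span>
  <span class="old-price">$1,600.00
</div>
<div class="product"><h2>Lenovo T400</h2>
  <span class="price">$690.00</span>
</div>
HTML;

libxml_use_internal_errors(true);   // suppress broken-HTML warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$products = [];
foreach ($xpath->query('//div[@class="product"]') as $node) {
    $name  = $xpath->query('.//h2', $node)->item(0)->textContent;
    $price = $xpath->query('.//span[@class="price"]', $node)->item(0)->textContent;

    // Optional field: check the result length before dereferencing
    // (the DOMXPath analogue of the repo's count() checks).
    $old = $xpath->query('.//span[@class="old-price"]', $node);
    $products[] = [
        'name'      => trim($name),
        'price'     => trim($price),
        'old_price' => $old->length > 0 ? trim($old->item(0)->textContent) : null,
    ];
}
echo json_encode($products, JSON_PRETTY_PRINT);
```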
### Async Scraping: Guzzle Pool

Scrapes 10 URLs with controlled concurrency:

```bash
php 03-async-scraping/async_pool.php
```

Benchmark results:

- Synchronous: 4.1s
- Async (concurrency 10): 1.25s
- Speedup: 3.3x faster

Key concepts:

- Guzzle Pool for concurrent requests
- Memory leak prevention with `unset()` and `gc_collect_cycles()`
- Concurrency control (10 simultaneous requests)
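The pool pattern looks roughly like this. A minimal sketch assuming `guzzlehttp/guzzle` is installed via Composer; the URLs and callbacks are illustrative, not the repo script itself:

```php
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

// Illustrative URL list; the repo script uses its own targets.
$urls = ['https://example.com/page1', 'https://example.com/page2'];

$client = new Client(['timeout' => 10]);

// Lazy generator: requests are created only as the pool consumes
// them, which keeps memory flat even for thousands of URLs.
$requests = function (array $urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests($urls), [
    'concurrency' => 10, // at most 10 requests in flight
    'fulfilled' => function ($response, $index) use ($urls) {
        echo $urls[$index] . ': ' . $response->getStatusCode() . PHP_EOL;
    },
    'rejected' => function ($reason, $index) use ($urls) {
        echo $urls[$index] . ' failed: ' . $reason->getMessage() . PHP_EOL;
    },
]);

// Block until every request has settled.
$pool->promise()->wait();
```

The `concurrency` option is the knob behind the 3.3x speedup above: requests overlap their network wait instead of queuing behind each other.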
### Anti-Bot Techniques

Demonstrates proxy rotation, header spoofing, and rate limiting:

```bash
php 04-anti-bot/anti_bot_demo.php
```

Key concepts:

- `ProxyRotator` class with failure tracking
- User-Agent rotation (5 current browsers)
- `RateLimiter` class for per-domain delays
- Exponential backoff for 429 responses

Included classes:

- `ProxyRotator`: cycles through the proxy pool, skips failed proxies
- `RateLimiter`: per-domain rate limiting
- `fetchWithBackoff()`: exponential backoff (1s, 2s, 4s, 8s, 16s)
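The backoff logic can be sketched with the fetcher and sleeper injected as callables, so the retry loop is testable without a network. This mirrors the 1s → 2s → 4s → 8s → 16s schedule above; the function signatures are illustrative, not the repo's exact implementation:

```php
<?php
// Exponential backoff sketch. On a 429 (Too Many Requests) response
// the delay doubles each attempt: 1s, 2s, 4s, 8s, 16s.

// Delay in seconds before retrying after 0-based attempt N.
function backoffDelay(int $attempt): int
{
    return 2 ** $attempt;
}

/**
 * Retry $fetch while it reports 429, sleeping an exponentially
 * growing delay between attempts.
 *
 * @param callable(): int    $fetch returns an HTTP status code
 * @param callable(int): void $sleep receives the delay in seconds
 */
function fetchWithBackoff(callable $fetch, callable $sleep, int $maxRetries = 5): int
{
    for ($attempt = 0; $attempt < $maxRetries; $attempt++) {
        $status = $fetch();
        if ($status !== 429) {
            return $status;
        }
        $sleep(backoffDelay($attempt));
    }
    return 429; // still rate limited after all retries
}

// Simulated server: rate-limits twice, then succeeds.
$responses = [429, 429, 200];
$delays = [];
$status = fetchWithBackoff(
    function () use (&$responses) { return array_shift($responses); },
    function (int $seconds) use (&$delays) { $delays[] = $seconds; },
);
echo $status;                                   // 200
echo ' after delays ' . implode(',', $delays);  // 1,2
```

In production the `$sleep` callable would simply be `fn (int $s) => sleep($s)`.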
### Data Storage

Demonstrates CSV, JSON, JSONL, and duplicate detection:

```bash
php 05-data-storage/storage_examples.php
```

Generated files:

- `products.csv`: tabular data
- `products.json`: nested structures
- `products.jsonl`: one object per line (large datasets)
- `processed_urls.txt`: duplicate tracking

Key concepts:

- PostgreSQL JSONB example (SQL only)
- CSV with `fputcsv()`
- Hash-based duplicate detection
- Resumable scraping with a processed-URLs file
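Hash-based duplicate detection plus JSONL output can be sketched in a few lines of plain PHP. File names and sample data below are illustrative; the repo script writes `products.jsonl` and `processed_urls.txt`:

```php
<?php
// Sketch of JSONL output with hash-based duplicate detection.
// A restarted run reloads processed_urls.txt and skips anything
// it has already scraped, making the scrape resumable.

$jsonlPath = sys_get_temp_dir() . '/products.jsonl';
$seenPath  = sys_get_temp_dir() . '/processed_urls.txt';
@unlink($jsonlPath);
@unlink($seenPath);

// Load already-processed URL hashes (empty on the first run).
$seen = is_file($seenPath)
    ? array_flip(file($seenPath, FILE_IGNORE_NEW_LINES))
    : [];

$products = [
    ['url' => 'https://example.com/a', 'name' => 'Acer 5750'],
    ['url' => 'https://example.com/b', 'name' => 'Lenovo T400'],
    ['url' => 'https://example.com/a', 'name' => 'Acer 5750'], // duplicate
];

foreach ($products as $product) {
    $hash = md5($product['url']);
    if (isset($seen[$hash])) {
        continue; // already scraped: skip without rewriting
    }
    // JSONL: one JSON object per line, appendable and streamable,
    // so large datasets never need to fit in memory at once.
    file_put_contents($jsonlPath, json_encode($product) . PHP_EOL, FILE_APPEND);
    file_put_contents($seenPath, $hash . PHP_EOL, FILE_APPEND);
    $seen[$hash] = true;
}

echo count(file($jsonlPath)); // 2 -- the duplicate was skipped
```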
### HasData API

Uses the HasData API for JS rendering and AI extraction:

```bash
# Add your API key to the script first
php 06-hasdata-api/hasdata_examples.php
```

Key features demonstrated:

- Basic JS rendering with proxy rotation
- AI extraction rules (no CSS selectors needed)
- Output formats (HTML, Markdown, JSON, text)

Example AI extraction:

```php
'aiExtractRules' => [
    'articles' => [
        'type' => 'list',
        'output' => [
            'title' => ['description' => 'article title', 'type' => 'string'],
            'author' => ['type' => 'string'],
            'publishDate' => ['type' => 'string']
        ]
    ]
]
```

Get your API key: hasdata.com
## Key Techniques

### XPath over CSS Selectors

Performance benchmark (1,000 iterations):

| Selector Type | Time | Performance |
|---|---|---|
| CSS Selectors | 3.865s | Baseline |
| XPath | 0.564s | 585% faster |

Why? Symfony DomCrawler converts CSS selectors to XPath internally; using XPath directly skips this conversion.
### Memory Management

For async scraping of 1,000+ pages:

```php
// Solution 1: Explicit cleanup
unset($crawler, $html, $products);

// Solution 2: Periodic garbage collection
if ($processedCount % 100 === 0) {
    gc_collect_cycles();
}

// Solution 3: Process in batches
$batchSize = 1000;
// Restart script between batches
```

### Rate Limiting

- Simple delays: `usleep(1000 * 1000)` for a 1-second pause
- Exponential backoff: 1s → 2s → 4s → 8s → 16s
- Per-domain limiting: `RateLimiter` class tracks the last request per domain
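A minimal per-domain limiter might look like this. The clock is injected so the wait arithmetic is testable without real sleeping; this is a sketch of the idea, not the repo's `RateLimiter` class:

```php
<?php
// Per-domain rate limiter sketch: remember when each domain was
// last hit and tell the caller how long to sleep before the next
// request to that domain. Other domains are unaffected.

class SimpleRateLimiter
{
    /** @var array<string,float> domain => timestamp of next free slot */
    private array $lastRequest = [];

    public function __construct(
        private float $minInterval, // seconds between requests per domain
        private \Closure $clock,    // returns current time in seconds
    ) {}

    // Seconds the caller should sleep before hitting $url's domain.
    public function waitTime(string $url): float
    {
        $domain = parse_url($url, PHP_URL_HOST);
        $now = ($this->clock)();
        $wait = 0.0;
        if (isset($this->lastRequest[$domain])) {
            $elapsed = $now - $this->lastRequest[$domain];
            $wait = max(0.0, $this->minInterval - $elapsed);
        }
        $this->lastRequest[$domain] = $now + $wait; // when the request fires
        return $wait;
    }
}

// Fake clock stuck at t=0 to make the arithmetic visible.
$limiter = new SimpleRateLimiter(1.0, fn () => 0.0);
echo $limiter->waitTime('https://a.com/page1'); // 0 (first hit on a.com)
echo ' ';
echo $limiter->waitTime('https://a.com/page2'); // 1 (same domain, too soon)
echo ' ';
echo $limiter->waitTime('https://b.com/');      // 0 (different domain)
```

In real use you would pass `fn () => microtime(true)` as the clock and `usleep((int)($wait * 1e6))` before each request.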
## Performance Benchmarks

Real-world results from the article:
| Task | Method | Time | Speedup |
|---|---|---|---|
| 10 URLs | Synchronous | 4.1s | 1x |
| 10 URLs | Async (concurrency 10) | 1.25s | 3.3x |
| 1,000 queries | CSS Selectors | 3.865s | 1x |
| 1,000 queries | XPath | 0.564s | 5.85x |
## Disclaimer

These examples are for educational purposes only.

- Respect `robots.txt` and Terms of Service
- Use scrapers ethically and legally
- Don't overload target servers
- Some sites prohibit automated access

For legal context, see: Is Web Scraping Legal?
## Resources

- Full Guide: Modern PHP Web Scraping in 2026
- HasData API: Web Scraping API
- Discord Community: Join HasData
- Star this repo if helpful ⭐
