PHP Web Scraping Examples (2026)

Production-ready PHP web scraping examples covering session management, async scraping, anti-bot bypass, and data storage. Each script is self-contained and ready to run.

Companion repository to: Modern PHP Web Scraping in 2026: Complete Technical Guide

Features

Session Management. Cookie jars, persistent sessions, login handling
XPath Parsing. 585% faster than CSS selectors (1,000-iteration benchmark)
Async Scraping. 3.3x faster with Guzzle Pool (10 URLs, concurrency 10)
Anti-Bot Bypass. Proxy rotation, header spoofing, rate limiting
Data Storage. PostgreSQL JSONB, CSV, JSON, JSONL
HasData API. AI-powered extraction, JS rendering, proxy rotation

Requirements

PHP 8.1+
Composer
Extensions: curl, mbstring, dom, libxml

PHP 8.3+ is recommended for best performance.

Installation

1. Clone the repository

git clone https://github.com/hasdata/php-scraper.git
cd php-scraper

2. Install dependencies

composer install

3. Run examples

php 01-session-management/login_cookie_jar.php
php 02-dom-parsing/extract_products_xpath.php
php 03-async-scraping/async_pool.php

Project Structure

php-scraper/
│
├── 01-session-management/
│   ├── login_cookie_jar.php           # Cookie jar basics
│   └── persistent_session.php         # FileCookieJar for cron jobs
│
├── 02-dom-parsing/
│   └── extract_products_xpath.php     # XPath product extraction
│
├── 03-async-scraping/
│   └── async_pool.php                 # Guzzle Pool with concurrency control
│
├── 04-anti-bot/
│   └── anti_bot_demo.php              # Proxy rotation, headers, rate limits
│
├── 05-data-storage/
│   └── storage_examples.php           # CSV, JSON, JSONL, duplicate detection
│
├── 06-hasdata-api/
│   └── hasdata_examples.php           # AI extraction, JS rendering
│
├── composer.json
└── README.md

Examples

1. Session Management

Cookie Jar Basics

Maintains session across multiple requests:

php 01-session-management/login_cookie_jar.php

Key concepts:

Automatic cookie capture from Set-Cookie headers
Session persistence across requests
Login flow handling

Persistent Sessions

Saves cookies to disk for cron jobs:

php 01-session-management/persistent_session.php

Key concepts:

FileCookieJar for disk storage
Session expiration detection
Re-authentication on timeout

2. DOM Parsing

XPath Product Extraction

Extracts product data using XPath (585% faster than CSS selectors):

php 02-dom-parsing/extract_products_xpath.php

Output: products.json

Key concepts:

XPath for performance-critical code
Optional field handling with count() checks
Broken HTML handling

Sample output:

6 products found:

1. Acer 5750
   Price: $1,400.00 (was $1,600.00)
   Description: Acer Aspire 5750
   URL: https://electronics.nop-templates.com/acer-5750
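The optional-field and broken-HTML handling listed above can be sketched with PHP's built-in DOM extension. This is a minimal illustration, not the repository's script: it swaps in DOMDocument and DOMXPath, and the class names (product-item, price, old-price), the helper names, and the second sample product are assumptions made for the example.

```php
<?php
// Sketch only: PHP's built-in DOM extension, not the repository's script.
// The class names (product-item, price, old-price) are hypothetical.

function firstText(DOMXPath $xpath, string $expr, DOMNode $context): ?string
{
    $nodes = $xpath->query($expr, $context);
    // count() check: a missing optional field yields null instead of an error.
    if ($nodes === false || count($nodes) === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

function extractProducts(string $html): array
{
    // Buffer libxml warnings so malformed markup does not emit PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $products = [];
    foreach ($xpath->query('//div[contains(@class, "product-item")]') as $node) {
        $products[] = [
            'name'      => firstText($xpath, './/h2', $node),
            'price'     => firstText($xpath, './/span[@class="price"]', $node),
            'old_price' => firstText($xpath, './/span[@class="old-price"]', $node),
        ];
    }
    return $products;
}

// The second product has no old price and its <div> is never closed.
$sampleHtml = '<div class="product-item"><h2>Acer 5750</h2>'
    . '<span class="price">$1,400.00</span>'
    . '<span class="old-price">$1,600.00</span></div>'
    . '<div class="product-item"><h2>Example Item</h2>'
    . '<span class="price">$10.00</span>';

print_r(extractProducts($sampleHtml));
```

Returning null for a missing field keeps every record the same shape, which simplifies the CSV and JSON export steps later on.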

3. Async Scraping

Concurrent Requests with Guzzle Pool

Scrapes 10 URLs with controlled concurrency:

php 03-async-scraping/async_pool.php

Benchmark results:

Synchronous: 4.1s
Async (concurrency 10): 1.25s
Speedup: 3.3x faster

Key concepts:

Guzzle Pool for concurrent requests
Memory leak prevention with unset() and gc_collect_cycles()
Concurrency control (10 simultaneous requests)

4. Anti-Bot Protection

Complete Anti-Bot Demo

Demonstrates proxy rotation, header spoofing, and rate limiting:

php 04-anti-bot/anti_bot_demo.php

Key concepts:

ProxyRotator class with failure tracking
User-Agent rotation (5 current browsers)
RateLimiter class for per-domain delays
Exponential backoff for 429 responses

Included classes:

ProxyRotator. Cycles through proxy pool, skips failed proxies
RateLimiter. Per-domain rate limiting
fetchWithBackoff(). Exponential backoff (1s, 2s, 4s, 8s, 16s)
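A proxy rotator with failure tracking might look roughly like the sketch below. The repository's ProxyRotator may be implemented differently; the method names, the failure threshold, and the proxy URLs are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch; the repository's ProxyRotator may differ.
final class ProxyRotator
{
    /** @var string[] */
    private array $proxies;
    /** @var array<string, int> failure count per proxy */
    private array $failures = [];
    private int $cursor = 0;
    private int $maxFailures;

    public function __construct(array $proxies, int $maxFailures = 3)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('At least one proxy is required.');
        }
        $this->proxies = array_values($proxies);
        $this->maxFailures = $maxFailures;
    }

    /** Return the next proxy in round-robin order, skipping failed ones. */
    public function next(): string
    {
        $count = count($this->proxies);
        for ($i = 0; $i < $count; $i++) {
            $proxy = $this->proxies[$this->cursor];
            $this->cursor = ($this->cursor + 1) % $count;
            if (($this->failures[$proxy] ?? 0) < $this->maxFailures) {
                return $proxy;
            }
        }
        throw new RuntimeException('All proxies have reached the failure limit.');
    }

    public function reportFailure(string $proxy): void
    {
        $this->failures[$proxy] = ($this->failures[$proxy] ?? 0) + 1;
    }

    public function reportSuccess(string $proxy): void
    {
        // A successful request clears the proxy's failure history.
        unset($this->failures[$proxy]);
    }
}

// Placeholder proxy addresses, not real endpoints.
$rotator = new ProxyRotator([
    'http://proxy1.example:8080',
    'http://proxy2.example:8080',
], 2);
echo $rotator->next(), PHP_EOL;
```

Call reportFailure() when a request through a proxy errors out and reportSuccess() when it works, so a proxy that recovers is not penalised for old failures.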

5. Data Storage

Multiple Storage Formats

Demonstrates CSV, JSON, JSONL, and duplicate detection:

php 05-data-storage/storage_examples.php

Generated files:

products.csv. Tabular data
products.json. Nested structures
products.jsonl. One object per line (large datasets)
processed_urls.txt. Duplicate tracking

Key concepts:

PostgreSQL JSONB example (SQL only)
CSV with fputcsv()
Hash-based duplicate detection
Resumable scraping with processed URLs file
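As a rough sketch of how these pieces fit together, the function below appends only unseen records to CSV and JSONL files and records what it has stored so a later run can resume. The file names follow the list above; the function name, the flat record shape, and the choice to store SHA-256 hashes of URLs in processed_urls.txt are assumptions, not necessarily what the repository's script does.

```php
<?php
/**
 * Append unseen products to products.csv and products.jsonl in $dir.
 * Sketch only: assumes flat records that each contain a 'url' key.
 * Returns the number of records written.
 */
function storeNewProducts(array $products, string $dir): int
{
    // Load hashes from earlier runs so the scrape is resumable.
    $seenFile = $dir . '/processed_urls.txt';
    $seen = is_file($seenFile)
        ? array_flip(file($seenFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES))
        : [];

    $csv = fopen($dir . '/products.csv', 'a');
    $jsonl = fopen($dir . '/products.jsonl', 'a');
    $seenHandle = fopen($seenFile, 'a');

    // Write the CSV header only when the file is still empty.
    $writeHeader = fstat($csv)['size'] === 0;
    $written = 0;

    foreach ($products as $product) {
        $hash = hash('sha256', $product['url']);
        if (isset($seen[$hash])) {
            continue; // duplicate: already stored in this or an earlier run
        }
        if ($writeHeader) {
            fputcsv($csv, array_keys($product), ',', '"', '\\');
            $writeHeader = false;
        }
        fputcsv($csv, array_values($product), ',', '"', '\\');
        fwrite($jsonl, json_encode($product, JSON_UNESCAPED_SLASHES) . "\n");
        fwrite($seenHandle, $hash . "\n");
        $seen[$hash] = true;
        $written++;
    }

    fclose($csv);
    fclose($jsonl);
    fclose($seenHandle);

    return $written;
}

// Usage with a temporary directory and placeholder data.
$demoDir = sys_get_temp_dir() . '/scraper_demo_' . uniqid();
mkdir($demoDir);
$count = storeNewProducts([
    ['name' => 'Acer 5750', 'price' => '1400.00', 'url' => 'https://example.com/acer-5750'],
    ['name' => 'Acer 5750', 'price' => '1400.00', 'url' => 'https://example.com/acer-5750'],
], $demoDir);
echo "Stored {$count} new product(s) in {$demoDir}", PHP_EOL;
```

Hashing the URL keeps each line of the tracking file a fixed length, and loading the hashes into array keys makes each duplicate check a constant-time isset() lookup.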

6. HasData API Integration

AI-Powered Extraction

Uses HasData API for JS rendering and AI extraction:

# Add your API key to the script first
php 06-hasdata-api/hasdata_examples.php

Key features demonstrated:

Basic JS rendering with proxy rotation
AI extraction rules (no CSS selectors needed)
Output formats (HTML, Markdown, JSON, text)

Example AI extraction:

'aiExtractRules' => [
    'articles' => [
        'type' => 'list',
        'output' => [
            'title' => ['description' => 'article title', 'type' => 'string'],
            'author' => ['type' => 'string'],
            'publishDate' => ['type' => 'string']
        ]
    ]
]

Get your API key: hasdata.com

Key Techniques

XPath vs CSS Selectors

Performance benchmark (1,000 iterations):

Selector Type	Time	Performance
CSS Selectors	3.865s	Baseline
XPath	0.564s	585% faster

Why? Symfony DomCrawler converts CSS to XPath internally. Using XPath directly skips this conversion.
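A ratio and a percentage describe the same result in different ways and are easy to mix up. The quick check below uses the times from the table above; the function names are illustrative only.

```php
<?php
// How many times faster the improved run is (baseline time / improved time).
function speedup(float $baselineSeconds, float $improvedSeconds): float
{
    return $baselineSeconds / $improvedSeconds;
}

// The same result expressed as "percent faster": (ratio - 1) * 100.
function percentFaster(float $baselineSeconds, float $improvedSeconds): float
{
    return (speedup($baselineSeconds, $improvedSeconds) - 1) * 100;
}

printf(
    "XPath vs CSS: %.2fx the speed, %.0f%% faster\n",
    speedup(3.865, 0.564),
    percentFaster(3.865, 0.564)
);
```

For the benchmark times above, a ratio of about 6.85x and "585% faster" are the same measurement.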

Memory Leak Prevention

For async scraping of 1,000+ pages:

// Solution 1: Explicit cleanup
unset($crawler, $html, $products);

// Solution 2: Periodic garbage collection
if ($processedCount % 100 === 0) {
    gc_collect_cycles();
}

// Solution 3: Process in batches
$batchSize = 1000;
// Restart script between batches

Rate Limiting Strategies

Simple delays: usleep(1000 * 1000) pauses for 1 second (usleep() takes microseconds)
Exponential backoff: 1s → 2s → 4s → 8s → 16s
Per-domain limiting: RateLimiter class tracks last request per domain
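The backoff schedule and per-domain limiting above can be sketched as follows. The class and function names mirror those mentioned in this README, but the bodies are assumptions written for illustration and the repository's versions may differ. The optional $now parameter is there only to make the timing logic easy to test.

```php
<?php
// Delay in seconds before retry number $attempt (0-based): 1, 2, 4, 8, 16.
function backoffDelay(int $attempt, int $baseSeconds = 1, int $maxSeconds = 16): int
{
    return (int) min($maxSeconds, $baseSeconds * (2 ** max(0, $attempt)));
}

// Hypothetical sketch of a per-domain rate limiter.
final class RateLimiter
{
    /** @var array<string, float> timestamp of the last request per host */
    private array $lastRequest = [];
    private float $minInterval;

    public function __construct(float $minIntervalSeconds = 1.0)
    {
        $this->minInterval = $minIntervalSeconds;
    }

    private function host(string $url): string
    {
        return parse_url($url, PHP_URL_HOST) ?: $url;
    }

    /** Seconds still to wait before the next request to this URL's domain. */
    public function waitTime(string $url, ?float $now = null): float
    {
        $now ??= microtime(true);
        $last = $this->lastRequest[$this->host($url)] ?? null;
        return $last === null ? 0.0 : max(0.0, $this->minInterval - ($now - $last));
    }

    /** Record that a request to this URL's domain was just made. */
    public function record(string $url, ?float $now = null): void
    {
        $this->lastRequest[$this->host($url)] = $now ?? microtime(true);
    }

    /** Sleep if needed, then record the request. */
    public function throttle(string $url): void
    {
        $wait = $this->waitTime($url);
        if ($wait > 0) {
            usleep((int) round($wait * 1000000));
        }
        $this->record($url);
    }
}

echo implode('s, ', array_map('backoffDelay', range(0, 4))), 's', PHP_EOL;
```

Call throttle() before each request; requests to different domains do not delay each other, because the timestamps are tracked per host.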

Performance Benchmarks

Real-world results from the article:

Task	Method	Time	Speedup
10 URLs	Synchronous	4.1s	1x
10 URLs	Async (concurrency 10)	1.25s	3.3x
1,000 queries	CSS Selectors	3.865s	1x
1,000 queries	XPath	0.564s	6.85x

Disclaimer

These examples are for educational purposes only.

Respect robots.txt and Terms of Service
Use scrapers ethically and legally
Don't overload target servers
Some sites prohibit automated access

For legal context: Is Web Scraping Legal?

📎 More Resources

Full Guide: Modern PHP Web Scraping in 2026
HasData API: Web Scraping API
Discord Community: Join HasData
Star this repo if helpful ⭐
