HasData/php-scraper

PHP Web Scraping Examples (2026)

Production-ready PHP web scraping examples covering session management, async scraping, anti-bot bypass, and data storage. Each script is self-contained and ready to run.

Companion repository to: Modern PHP Web Scraping in 2026: Complete Technical Guide

Table of Contents

  1. Features
  2. Requirements
  3. Installation
  4. Project Structure
  5. Examples
  6. Key Techniques
  7. Performance Benchmarks
  8. Disclaimer

Features

  • Session Management. Cookie jars, persistent sessions, login handling
  • XPath Parsing. 585% faster than CSS selectors
  • Async Scraping. 3.3x faster with Guzzle Pool
  • Anti-Bot Bypass. Proxy rotation, header spoofing, rate limiting
  • Data Storage. PostgreSQL JSONB, CSV, JSON, JSONL
  • HasData API. AI-powered extraction, JS rendering, proxy rotation

Requirements

  • PHP 8.1+
  • Composer
  • Extensions: curl, mbstring, dom, libxml

PHP 8.3+ is recommended for best performance.

Installation

1. Clone the repository

git clone https://github.com/hasdata/php-scraper.git
cd php-scraper

2. Install dependencies

composer install

3. Run examples

php 01-session-management/login_cookie_jar.php
php 02-dom-parsing/extract_products_xpath.php
php 03-async-scraping/async_pool.php

Project Structure

php-scraper/
│
├── 01-session-management/
│   ├── login_cookie_jar.php           # Cookie jar basics
│   └── persistent_session.php         # FileCookieJar for cron jobs
│
├── 02-dom-parsing/
│   └── extract_products_xpath.php     # XPath product extraction
│
├── 03-async-scraping/
│   └── async_pool.php                 # Guzzle Pool with concurrency control
│
├── 04-anti-bot/
│   └── anti_bot_demo.php              # Proxy rotation, headers, rate limits
│
├── 05-data-storage/
│   └── storage_examples.php           # CSV, JSON, JSONL, duplicate detection
│
├── 06-hasdata-api/
│   └── hasdata_examples.php           # AI extraction, JS rendering
│
├── composer.json
└── README.md

Examples

1. Session Management

Cookie Jar Basics

Maintains a session across multiple requests:

php 01-session-management/login_cookie_jar.php

Key concepts:

  • Automatic cookie capture from Set-Cookie headers
  • Session persistence across requests
  • Login flow handling
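
The login flow above can be sketched with Guzzle's CookieJar. This is a minimal illustration, not the repository's script: the helper name loginAndFetch, the URLs, and the form field names are hypothetical, and it assumes Guzzle has been installed through Composer.

```php
<?php
declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Cookie\CookieJar;

/**
 * Logs in, then requests a protected page with the same cookie jar.
 * Hypothetical helper: adapt the URLs and form fields to the target site.
 */
function loginAndFetch(
    Client $client,
    CookieJar $jar,
    string $loginUrl,
    array $credentials,
    string $protectedUrl
): string {
    // Guzzle stores cookies from Set-Cookie response headers in $jar.
    $client->post($loginUrl, [
        'form_params' => $credentials,
        'cookies'     => $jar,
    ]);

    // Passing the same jar sends the stored cookies back on this request.
    $response = $client->get($protectedUrl, ['cookies' => $jar]);

    return (string) $response->getBody();
}
```

A call might look like loginAndFetch(new Client(), new CookieJar(), 'https://example.com/login', ['username' => 'demo', 'password' => 'secret'], 'https://example.com/account'), where every value is a placeholder.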

Persistent Sessions

Saves cookies to disk for cron jobs:

php 01-session-management/persistent_session.php

Key concepts:

  • FileCookieJar for disk storage
  • Session expiration detection
  • Re-authentication on timeout
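
A rough sketch of these three ideas with Guzzle's FileCookieJar follows. It assumes the site signals an expired session with a 401/403 status or a redirect, which varies between sites; the helper name fetchWithPersistentSession and the $login callback are hypothetical and not taken from the repository.

```php
<?php
declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Cookie\FileCookieJar;
use Psr\Http\Message\ResponseInterface;

/**
 * Fetches $url with cookies persisted on disk and logs in again when the
 * session looks expired. Assumption: the site answers 401/403 or redirects
 * once the session is gone; $login is a hypothetical callback.
 */
function fetchWithPersistentSession(
    Client $client,
    string $cookieFile,
    string $url,
    callable $login
): ResponseInterface {
    // Second argument true: also store session cookies (those without expiry).
    $jar = new FileCookieJar($cookieFile, true);
    $options = [
        'cookies'         => $jar,
        'http_errors'     => false, // inspect status codes instead of throwing
        'allow_redirects' => false, // a redirect to the login page stays visible
    ];

    $response = $client->get($url, $options);
    if (in_array($response->getStatusCode(), [301, 302, 401, 403], true)) {
        $login($client, $jar); // re-authenticate into the same jar
        $response = $client->get($url, $options);
    }

    $jar->save($cookieFile); // write cookies to disk for the next cron run

    return $response;
}
```

The $login callback receives the client and the jar, so it can post credentials with ['cookies' => $jar] and have the fresh session cookie land in the same file.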

2. DOM Parsing

XPath Product Extraction

Extracts product data using XPath (585% faster than CSS selectors):

php 02-dom-parsing/extract_products_xpath.php

Output: products.json

Key concepts:

  • XPath for performance-critical code
  • Optional field handling with count() checks
  • Broken HTML handling
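
The concepts above can be illustrated with PHP's built-in DOM extension. This sketch deliberately uses DOMDocument and DOMXPath rather than whatever the repository's script uses, and the markup (div.product, h2, span.price, span.old-price) is a made-up example, not the demo site's real HTML.

```php
<?php
declare(strict_types=1);

/**
 * Extracts products with XPath. The class names queried here are
 * hypothetical; adjust the expressions to the real page structure.
 */
function extractProducts(string $html): array
{
    // Collect parser warnings internally so broken HTML does not emit notices.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $products = [];

    foreach ($xpath->query('//div[@class="product"]') as $node) {
        // Relative queries (leading ".") search only inside this product block.
        $name = $xpath->query('.//h2', $node);
        $price = $xpath->query('.//span[@class="price"]', $node);
        $oldPrice = $xpath->query('.//span[@class="old-price"]', $node);

        if (count($name) === 0) {
            continue; // skip blocks without a name
        }

        // Optional fields: check count() before touching item(0).
        $products[] = [
            'name'     => trim($name->item(0)->textContent),
            'price'    => count($price) > 0 ? trim($price->item(0)->textContent) : null,
            'oldPrice' => count($oldPrice) > 0 ? trim($oldPrice->item(0)->textContent) : null,
        ];
    }

    return $products;
}

// The second product has no old price and its <div> is never closed.
$html = <<<'HTML'
<div class="product"><h2>Acer 5750</h2><span class="price">$1,400.00</span><span class="old-price">$1,600.00</span></div>
<div class="product"><h2>Second item</h2><span class="price">$99.00</span>
HTML;

echo json_encode(extractProducts($html), JSON_PRETTY_PRINT), PHP_EOL;
```

The parser repairs the unclosed element, and the missing old price comes back as null instead of raising an error.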

Sample output:

6 products found:

1. Acer 5750
   Price: $1,400.00 (was $1,600.00)
   Description: Acer Aspire 5750
   URL: https://electronics.nop-templates.com/acer-5750

3. Async Scraping

Concurrent Requests with Guzzle Pool

Scrapes 10 URLs with controlled concurrency:

php 03-async-scraping/async_pool.php

Benchmark results:

  • Synchronous: 4.1s
  • Async (concurrency 10): 1.25s
  • Speedup: 3.3x faster

Key concepts:

  • Guzzle Pool for concurrent requests
  • Memory leak prevention with unset() and gc_collect_cycles()
  • Concurrency control (10 simultaneous requests)
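
A minimal sketch of a Guzzle Pool with a concurrency limit is shown here for orientation. The helper name fetchAll and its return shape are assumptions made for this example; the repository's async_pool.php may be organized differently.

```php
<?php
declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use Psr\Http\Message\ResponseInterface;

/**
 * Fetches all URLs with at most $concurrency requests in flight.
 * Returns the status code per URL index, or null when the request failed.
 */
function fetchAll(Client $client, array $urls, int $concurrency = 10): array
{
    // A generator creates Request objects lazily, as the pool asks for them.
    $requests = static function () use ($urls) {
        foreach ($urls as $index => $url) {
            yield $index => new Request('GET', $url);
        }
    };

    $results = [];
    $pool = new Pool($client, $requests(), [
        'concurrency' => $concurrency,
        'fulfilled' => static function (ResponseInterface $response, $index) use (&$results): void {
            $results[$index] = $response->getStatusCode();
        },
        'rejected' => static function ($reason, $index) use (&$results): void {
            $results[$index] = null; // connection error or 4xx/5xx response
        },
    ]);

    $pool->promise()->wait(); // block until every request has settled
    ksort($results);          // callbacks fire in completion order

    return $results;
}
```

Calling fetchAll(new Client(['timeout' => 10]), $urls, 10) would mirror the concurrency of 10 mentioned above; the timeout value is only a placeholder.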

4. Anti-Bot Protection

Complete Anti-Bot Demo

Demonstrates proxy rotation, header spoofing, and rate limiting:

php 04-anti-bot/anti_bot_demo.php

Key concepts:

  • ProxyRotator class with failure tracking
  • User-Agent rotation (5 current browsers)
  • RateLimiter class for per-domain delays
  • Exponential backoff for 429 responses

Included classes:

  • ProxyRotator. Cycles through proxy pool, skips failed proxies
  • RateLimiter. Per-domain rate limiting
  • fetchWithBackoff(). Exponential backoff (1s, 2s, 4s, 8s, 16s)
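
To make the rotation and backoff ideas concrete, here is a dependency-free sketch. The names SimpleProxyRotator, backoffDelay, and fetchWithRetries are hypothetical stand-ins written for this example; the repository's ProxyRotator and fetchWithBackoff() may have different signatures and behavior.

```php
<?php
declare(strict_types=1);

/** Round-robin proxy pool that skips proxies with too many failures. */
final class SimpleProxyRotator
{
    /** @var list<string> */
    private array $proxies;
    /** @var array<string,int> failure count per proxy */
    private array $failures = [];
    private int $cursor = 0;
    private int $maxFailures;

    public function __construct(array $proxies, int $maxFailures = 3)
    {
        if ($proxies === []) {
            throw new InvalidArgumentException('Proxy pool is empty');
        }
        $this->proxies = array_values($proxies);
        $this->maxFailures = $maxFailures;
    }

    public function next(): string
    {
        $total = count($this->proxies);
        for ($i = 0; $i < $total; $i++) {
            $proxy = $this->proxies[$this->cursor];
            $this->cursor = ($this->cursor + 1) % $total;
            if (($this->failures[$proxy] ?? 0) < $this->maxFailures) {
                return $proxy;
            }
        }
        throw new RuntimeException('All proxies have reached the failure limit');
    }

    public function reportFailure(string $proxy): void
    {
        $this->failures[$proxy] = ($this->failures[$proxy] ?? 0) + 1;
    }
}

/** Delay in seconds before retry number $attempt (0-based): 1, 2, 4, 8, 16. */
function backoffDelay(int $attempt, int $baseSeconds = 1): int
{
    return $baseSeconds * (2 ** $attempt);
}

/**
 * Calls $fetch (which returns an HTTP status code) and retries on 429.
 * $sleep is injectable so the loop can be exercised without waiting.
 */
function fetchWithRetries(callable $fetch, int $maxRetries = 5, ?callable $sleep = null): int
{
    $sleep ??= static function (int $seconds): void {
        sleep($seconds);
    };
    for ($attempt = 0; ; $attempt++) {
        $status = $fetch();
        if ($status !== 429 || $attempt >= $maxRetries) {
            return $status;
        }
        $sleep(backoffDelay($attempt));
    }
}

echo implode(', ', array_map('backoffDelay', range(0, 4))), PHP_EOL; // prints 1, 2, 4, 8, 16
```

With Guzzle, the string returned by next() could be passed as the 'proxy' request option, and reportFailure() called when that request fails.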

5. Data Storage

Multiple Storage Formats

Demonstrates CSV, JSON, JSONL, and duplicate detection:

php 05-data-storage/storage_examples.php

Generated files:

  • products.csv. Tabular data
  • products.json. Nested structures
  • products.jsonl. One object per line (large datasets)
  • processed_urls.txt. Duplicate tracking

Key concepts:

  • PostgreSQL JSONB example (SQL only)
  • CSV with fputcsv()
  • Hash-based duplicate detection
  • Resumable scraping with processed URLs file
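
The following sketch combines fputcsv(), JSONL output, and hash-based duplicate detection in a single helper using only built-in functions. The function name storeNewProducts, the field names, and the file layout are assumptions made for this example rather than the repository's actual code.

```php
<?php
declare(strict_types=1);

/**
 * Appends unseen products to CSV and JSONL files and records their URLs so
 * a later run can skip them. Returns the number of newly stored products.
 */
function storeNewProducts(array $products, string $csvPath, string $jsonlPath, string $seenPath): int
{
    clearstatcache();

    // Load already processed URLs and index them by hash for fast lookups.
    $seen = [];
    if (is_file($seenPath)) {
        foreach (file($seenPath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $url) {
            $seen[hash('sha256', $url)] = true;
        }
    }

    $writeHeader = !is_file($csvPath) || filesize($csvPath) === 0;
    $csv = fopen($csvPath, 'a');
    if ($csv === false) {
        throw new RuntimeException("Cannot open $csvPath");
    }
    if ($writeHeader) {
        fputcsv($csv, ['name', 'price', 'url'], ',', '"', '');
    }

    $stored = 0;
    foreach ($products as $product) {
        $key = hash('sha256', $product['url']);
        if (isset($seen[$key])) {
            continue; // duplicate within this batch or from an earlier run
        }
        $seen[$key] = true;

        fputcsv($csv, [$product['name'], $product['price'], $product['url']], ',', '"', '');
        // JSONL: one JSON object per line, appended without rewriting the file.
        file_put_contents($jsonlPath, json_encode($product, JSON_UNESCAPED_SLASHES) . "\n", FILE_APPEND | LOCK_EX);
        // Recording the URL immediately makes an interrupted run resumable.
        file_put_contents($seenPath, $product['url'] . "\n", FILE_APPEND | LOCK_EX);
        $stored++;
    }
    fclose($csv);

    return $stored;
}
```

Running the helper twice with overlapping input stores each URL only once, because the second run reloads the processed URLs file before writing.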

6. HasData API Integration

AI-Powered Extraction

Uses HasData API for JS rendering and AI extraction:

# Add your API key to the script first
php 06-hasdata-api/hasdata_examples.php

Key features demonstrated:

  1. Basic JS rendering with proxy rotation
  2. AI extraction rules (no CSS selectors needed)
  3. Output formats (HTML, Markdown, JSON, text)

Example AI extraction:

'aiExtractRules' => [
    'articles' => [
        'type' => 'list',
        'output' => [
            'title' => ['description' => 'article title', 'type' => 'string'],
            'author' => ['type' => 'string'],
            'publishDate' => ['type' => 'string']
        ]
    ]
]

Get your API key: hasdata.com

Key Techniques

XPath vs CSS Selectors

Performance benchmark (1,000 iterations):

Selector Type    Time      Performance
CSS Selectors    3.865s    Baseline
XPath            0.564s    585% faster

Why? Symfony DomCrawler converts CSS to XPath internally. Using XPath directly skips this conversion.
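
The conversion step can be made visible with a short sketch. It assumes symfony/dom-crawler and symfony/css-selector are installed through Composer, and the sample markup is hypothetical. It shows two equivalent queries and does not attempt to reproduce the benchmark figures.

```php
<?php
declare(strict_types=1);

require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;
use Symfony\Component\DomCrawler\Crawler;

$html = '<div class="product"><h2>Acer 5750</h2></div>'
      . '<div class="product"><h2>Second item</h2></div>';
$crawler = new Crawler($html);

// The CSS selector is translated into an XPath expression like this one.
$converter = new CssSelectorConverter();
echo $converter->toXPath('div.product h2'), PHP_EOL;

// CSS selector: translated to XPath before the query runs.
$viaCss = $crawler
    ->filter('div.product h2')
    ->each(fn (Crawler $node) => $node->text());

// Hand-written XPath: no translation step.
$viaXPath = $crawler
    ->filterXPath('//div[contains(concat(" ", normalize-space(@class), " "), " product ")]//h2')
    ->each(fn (Crawler $node) => $node->text());

var_dump($viaCss === $viaXPath);
```

Both queries select the same nodes, so switching a hot path to filterXPath() should not change the extracted data.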

Memory Leak Prevention

For async scraping of 1,000+ pages:

// Solution 1: Explicit cleanup
unset($crawler, $html, $products);

// Solution 2: Periodic garbage collection
if ($processedCount % 100 === 0) {
    gc_collect_cycles();
}

// Solution 3: Process in batches
$batchSize = 1000;
// Restart script between batches
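
As a runnable illustration of the first two solutions, the sketch below processes pages from a generator, releases each page after use, and triggers garbage collection at a fixed interval. The helper name processPages and the placeholder handler are hypothetical, and the pages are synthetic strings rather than real responses.

```php
<?php
declare(strict_types=1);

/**
 * Runs $handle for every page, releasing per-page data and collecting
 * reference cycles every $gcEvery pages. Returns how many pages were handled.
 */
function processPages(iterable $pages, callable $handle, int $gcEvery = 100): int
{
    $processedCount = 0;
    foreach ($pages as $html) {
        $handle($html);
        unset($html); // drop the reference to the page body
        $processedCount++;

        if ($processedCount % $gcEvery === 0) {
            gc_collect_cycles(); // reclaim cycles that refcounting cannot free
        }
    }

    return $processedCount;
}

// A generator yields one page at a time instead of holding all pages in memory.
$pages = (static function () {
    for ($i = 1; $i <= 250; $i++) {
        yield "<html><body>Page $i</body></html>";
    }
})();

$count = processPages($pages, static function (string $html): void {
    strlen($html); // placeholder for real parsing and storage
});

printf("Processed %d pages, peak memory %.1f MB\n", $count, memory_get_peak_usage(true) / 1048576);
```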

Rate Limiting Strategies

  1. Simple delays: usleep(1000 * 1000) pauses for 1 second
  2. Exponential backoff: 1s → 2s → 4s → 8s → 16s
  3. Per-domain limiting: RateLimiter class tracks last request per domain
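
Per-domain limiting can be sketched as follows. PerDomainRateLimiter is a hypothetical class written for this example, not the repository's RateLimiter; it accepts an optional timestamp so the scheduling logic can be checked without actually sleeping.

```php
<?php
declare(strict_types=1);

/** Enforces a minimum interval between requests to the same host. */
final class PerDomainRateLimiter
{
    /** @var array<string,float> time of the last scheduled request per host */
    private array $lastRequest = [];
    private float $minIntervalSeconds;

    public function __construct(float $minIntervalSeconds = 1.0)
    {
        $this->minIntervalSeconds = $minIntervalSeconds;
    }

    /**
     * Reserves the next request slot for the URL's host and returns how many
     * seconds the caller has to wait before sending the request.
     */
    public function reserve(string $url, ?float $now = null): float
    {
        $now ??= microtime(true);
        $host = parse_url($url, PHP_URL_HOST);
        if (!is_string($host)) {
            throw new InvalidArgumentException("URL has no host: $url");
        }
        $host = strtolower($host);

        $last = $this->lastRequest[$host] ?? null;
        $wait = $last === null
            ? 0.0
            : max(0.0, $this->minIntervalSeconds - ($now - $last));

        // Remember when this request will actually go out.
        $this->lastRequest[$host] = $now + $wait;

        return $wait;
    }

    /** Blocks until the request to $url is allowed. */
    public function throttle(string $url): void
    {
        $wait = $this->reserve($url);
        if ($wait > 0) {
            usleep((int) round($wait * 1_000_000));
        }
    }
}
```

Calling throttle($url) right before each request keeps requests to one host at least the configured interval apart, while requests to other hosts proceed without delay.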

Performance Benchmarks

Real-world results from the article:

Task             Method                    Time      Speedup
10 URLs          Synchronous               4.1s      1x
10 URLs          Async (concurrency 10)    1.25s     3.3x
1,000 queries    CSS Selectors             3.865s    1x
1,000 queries    XPath                     0.564s    6.85x

Disclaimer

These examples are for educational purposes only.

  • Respect robots.txt and Terms of Service
  • Use scrapers ethically and legally
  • Don't overload target servers
  • Some sites prohibit automated access

For legal context: Is Web Scraping Legal?
