feat: rewrite north west leicestershire as pure http scraper #2075
InertiaUK wants to merge 2 commits into
Removes the Selenium dependency entirely. The council's Cuttlefish CMS has an address autocomplete endpoint that returns internal IDs, and a cookie-based location system that serves collection dates as server-rendered HTML. Three plain HTTP requests replace the previous flow of launching Chrome, waiting for elements, and clicking links. Postcode + house number is now the only input needed (no UPRN).
📝 Walkthrough
Migrated NorthWestLeicestershire council scraper from Selenium/UPRN automation to HTTP-based postcode and house-number lookup. Removed Selenium imports, implemented address autocomplete resolution with error handling, added HTML parsing for refuse collection dates with relative-date normalization, and updated test configuration to match the new interface.
Changes: NorthWestLeicestershire HTTP Migration
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@           Coverage Diff           @@
##           master    #2075   +/-   ##
=======================================
  Coverage   86.67%   86.67%
=======================================
  Files           9        9
  Lines        1141     1141
=======================================
  Hits          989      989
  Misses        152      152
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@uk_bin_collection/tests/input.json`:
- Around line 1827-1829: Remove the duplicate "postcode" key from the
NorthWestLeicestershire fixture in the JSON object (the object that currently
contains "house_number": "1" and two "postcode" entries); keep a single
"postcode" entry (the correct value "DE74 2FZ") so the object is unambiguous and
valid JSON.
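Python's `json` module accepts duplicate keys without complaint and keeps the last value, which is why a fixture like this can slip through. A minimal sketch of a stricter loader using `object_pairs_hook` follows; the helper name `reject_duplicate_keys` and the first `postcode` value in the sample are hypothetical, not taken from the repository.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises if a JSON object repeats a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key in JSON object: {key!r}")
        seen[key] = value
    return seen


# Hypothetical fixture fragment; "XX0 0XX" is a made-up placeholder value.
ambiguous = '{"house_number": "1", "postcode": "XX0 0XX", "postcode": "DE74 2FZ"}'

# The default loader silently keeps the last "postcode" entry.
assert json.loads(ambiguous)["postcode"] == "DE74 2FZ"

# The strict loader surfaces the ambiguity instead of hiding it.
try:
    json.loads(ambiguous, object_pairs_hook=reject_duplicate_keys)
except ValueError as exc:
    print(exc)
```

A check along these lines could run over the fixture file in CI, though whether the project wants that is a separate decision.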
In `@uk_bin_collection/uk_bin_collection/councils/NorthWestLeicestershire.py`:
- Around line 70-77: The parser currently calls datetime.strptime(date_str, "%a
%d %b") which uses year 1900 (not a leap year) and will fail for "29 Feb";
instead parse against a leap-safe placeholder year (e.g., 2000) by
appending/replacing the year in date_str or using datetime.strptime with a
format that includes a fixed year, then replace the placeholder year with
current_year/current_year+1 when projecting onto the actual collection year;
update the logic around parsed_date, current_date and current_year so
parsed_date = parsed_date.replace(year=current_year) and parsed_date =
parsed_date.replace(year=current_year + 1) work after parsing with the safe year
(ensure date_str, parsed_date, current_date and current_year are the referenced
symbols).
- Around line 37-44: The requests to LOCATION_URL and HOME_URL in the
NorthWestLeicestershire scraper lack timeouts and don’t validate HTTP status, so
update the session.get calls (the one with params={"put": nwl_id,...} and the
one assigning response = session.get(self.HOME_URL)) to include a reasonable
timeout (e.g. timeout=10) and immediately check the response via
response.raise_for_status() (or validate response.status_code) to fail fast on
network/HTTP errors; ensure any exceptions are allowed to propagate or are
converted into a clear upstream error rather than falling back to “No refuse
collection data found.”
- Around line 24-28: Read the "house_number" kwarg and prefer it when resolving
ambiguous addresses: retrieve user_house_number = kwargs.get("house_number")
(keep existing user_paon = kwargs.get("paon") and
check_postcode(user_postcode)), then call self._resolve_address(user_postcode,
user_house_number or user_paon) so callers using the documented house_number
field disambiguate autocomplete hits; update the variables around the existing
call to _resolve_address accordingly.
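For the leap-year comment (lines 70-77), one possible shape of the fix is sketched below. It assumes the site's date strings look like "Thu 29 Feb" (weekday, day, abbreviated month) and that month names parse under the default locale; the function name `project_collection_date` and the constant `PLACEHOLDER_YEAR` are illustrative, not names from the scraper.

```python
from datetime import date, datetime

PLACEHOLDER_YEAR = 2000  # a leap year, so "29 Feb" parses successfully


def project_collection_date(date_str: str, today: date) -> date:
    """Map a yearless string such as 'Thu 29 Feb' to the next matching date."""
    # Drop the weekday token; only day and month are needed for parsing.
    _, day_month = date_str.split(" ", 1)
    parsed = datetime.strptime(
        f"{day_month} {PLACEHOLDER_YEAR}", "%d %b %Y"
    ).date()

    # Try the current year first, then roll over to next year if the
    # date has already passed (e.g. "05 Jan" seen in late December).
    for year in (today.year, today.year + 1):
        try:
            candidate = parsed.replace(year=year)
        except ValueError:
            continue  # 29 Feb does not exist in this (non-leap) year
        if candidate >= today:
            return candidate
    raise ValueError(f"could not place {date_str!r} in a collection year")


print(project_collection_date("Thu 29 Feb", date(2024, 2, 1)))
print(project_collection_date("Sun 05 Jan", date(2024, 12, 20)))
```

Note that replacing the placeholder year can itself fail for 29 Feb in a non-leap year, which is why the sketch wraps `replace` in a `try` rather than calling it unconditionally.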
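For the timeout and status-check comment (lines 37-44), a small wrapper keeps both concerns in one place. This is a sketch only: `fetch` and `DEFAULT_TIMEOUT` are hypothetical names, and `session` is assumed to be a `requests.Session` (or anything with a compatible `get` method).

```python
DEFAULT_TIMEOUT = 10  # seconds; the example value suggested in the review


def fetch(session, url, params=None, timeout=DEFAULT_TIMEOUT):
    """GET a URL with a timeout and fail fast on HTTP error statuses.

    `session` is assumed to behave like requests.Session: its get()
    returns a response object exposing raise_for_status().
    """
    response = session.get(url, params=params, timeout=timeout)
    # Raise on 4xx/5xx so a failed request is not mistaken for
    # "No refuse collection data found."
    response.raise_for_status()
    return response
```

Both calls in the scraper (the `LOCATION_URL` request and the `HOME_URL` request) could then go through this helper, letting network and HTTP errors propagate to the caller.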
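For the `house_number` comment (lines 24-28), the suggested precedence can be isolated in a few lines. The helper name `choose_house_identifier` is hypothetical; the kwarg names `house_number` and `paon` come from the review comment.

```python
from typing import Optional


def choose_house_identifier(kwargs: dict) -> Optional[str]:
    """Prefer the documented "house_number" kwarg, falling back to "paon"."""
    user_house_number = kwargs.get("house_number")
    user_paon = kwargs.get("paon")
    # An empty or missing house_number falls through to paon.
    return user_house_number or user_paon


print(choose_house_identifier({"house_number": "1", "paon": "2"}))
print(choose_house_identifier({"paon": "2"}))
```

The result would then be passed as the second argument to `_resolve_address`, as the comment describes.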
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d88b6ac1-8700-411d-9d4c-840279b47587
📒 Files selected for processing (2)
uk_bin_collection/tests/input.json
uk_bin_collection/uk_bin_collection/councils/NorthWestLeicestershire.py
Summary
- nwl-prefixed address IDs
- Uses the address autocomplete endpoint (/data/ac/addresses.json) to resolve postcode + house number to an internal ID, then sets a session cookie via /location and parses the homepage HTML
The existing Selenium scraper wasn't fundamentally broken; the timeout was caused by the ID mismatch, not a site change. This rewrite fixes the root cause and removes the Selenium dependency.
Testing
DE74 2FZ + paon 1
Summary by CodeRabbit
Refactor