Add AGENTS.md and enrich package docstring #1497
Draft
timsaucer wants to merge 5 commits into apache:main
Add python/datafusion/AGENTS.md as a comprehensive DataFrame API guide for AI agents and users. It ships with pip automatically (Maturin includes everything under python-source = "python"). Covers core abstractions, import conventions, data loading, all DataFrame operations, expression building, a SQL-to-DataFrame reference table, common pitfalls, idiomatic patterns, and a categorized function index. Enrich the __init__.py module docstring from 2 lines to a full overview with core abstractions, a quick-start example, and a pointer to AGENTS.md. Closes apache#1394 (PR 1a) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The root AGENTS.md (symlinked as CLAUDE.md) is for contributors working on the project. Add a pointer to python/datafusion/AGENTS.md which is the user-facing DataFrame API guide shipped with the package. Also add the Apache license header to the package AGENTS.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document that all PRs must follow .github/pull_request_template.md and that pre-commit hooks must pass before committing. List all configured hooks (actionlint, ruff, ruff-format, cargo fmt, cargo clippy, codespell, uv-lock) and the command to run them manually. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Let the hooks be discoverable from .pre-commit-config.yaml rather than maintaining a separate list that can drift. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Clarify that DataFusion works with any Arrow C Data Interface implementation, not just PyArrow.
- Show the filter keyword argument on aggregate functions (the idiomatic HAVING equivalent) instead of the post-aggregate .filter() pattern.
- Update the SQL reference table to show FILTER (WHERE ...) syntax.
- Remove the now-incorrect "Aggregate then filter for HAVING" pitfall.
- Add .collect() to the fluent chaining example so the result is clearly materialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
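For readers who have not seen the filter keyword mentioned in the last commit, here is a minimal sketch of what a filtered aggregate might look like. It is not taken from the AGENTS.md file in this PR; the table contents and the names g, v, big_sum, and filtered_totals are made up for illustration, and it assumes a recent datafusion release in which functions.sum accepts a filter argument.

```python
def filtered_totals():
    """Sum only the values above 2 within each group, using the filter keyword."""
    # Imported inside the function so the sketch can be loaded even when
    # datafusion is not installed.
    from datafusion import SessionContext, col, lit, functions as F

    ctx = SessionContext()
    # Illustrative data; the column names "g" and "v" are hypothetical.
    df = ctx.from_pydict({"g": ["a", "a", "a", "b", "b"], "v": [1, 5, 10, 2, 3]})

    # Rough SQL counterpart:
    #   SELECT g, SUM(v) FILTER (WHERE v > 2) AS big_sum FROM t GROUP BY g ORDER BY g
    result = df.aggregate(
        [col("g")],
        [F.sum(col("v"), filter=col("v") > lit(2)).alias("big_sum")],
    ).sort(col("g").sort(ascending=True))

    # to_pydict() materializes the result as a plain dict of column lists.
    return result.to_pydict()
```

The filter here restricts which rows feed the one aggregate it is attached to, so other aggregates in the same call would still see every row in the group.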
Which issue does this PR close?
Addresses part of #1394 (PR 1a from the implementation plan)
Rationale for this change
AI agents (and humans) that encounter datafusion via pip install currently get a 2-line module docstring and no structured guide to the DataFrame API. This makes it difficult for agents to produce idiomatic DataFrame code, even though they are very capable with SQL. The goal is that any agent -- whether it encounters the package via pip, the docs site, or the repo -- gets enough context to write correct DataFrame code.
What changes are included in this PR?
- python/datafusion/AGENTS.md (new) -- comprehensive DataFrame API guide that ships with pip install datafusion (Maturin includes all files under python-source = "python"). Topics covered include common pitfalls (lit() wrapping, column quoting, immutable DataFrames, window frame defaults, HAVING pattern).
- python/datafusion/__init__.py (modified) -- enriched module docstring from 2 lines to a full overview with core abstractions, a quick-start example, and a pointer to AGENTS.md.
- AGENTS.md (modified, root) -- clarified that the root file is for contributors working on the project, and added a pointer to python/datafusion/AGENTS.md for agents that need to use the DataFrame API.
Are there any user-facing changes?
Yes -- the datafusion package now ships with an AGENTS.md guide and has a richer module docstring visible via help(datafusion). No API changes.
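As a rough illustration of how an agent or user might pick up both pieces after installation, the sketch below reads a package's module docstring and looks for a bundled guide file next to its __init__.py. The helper name package_overview is hypothetical and not part of datafusion; it uses only the standard library (importlib.resources.files, Python 3.9+) and assumes the guide sits at the top level of the installed package, as the description above states.

```python
import importlib
from importlib import resources
from typing import Optional, Tuple


def package_overview(
    package_name: str, guide_name: str = "AGENTS.md"
) -> Tuple[Optional[str], Optional[str]]:
    """Return (module docstring, guide text) for an installed package.

    Either element is None when it is unavailable.
    """
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        # Package is not installed in this environment.
        return None, None

    guide_text = None
    try:
        # Locate the guide among the files shipped inside the package.
        guide = resources.files(package_name).joinpath(guide_name)
        if guide.is_file():
            guide_text = guide.read_text(encoding="utf-8")
    except (TypeError, OSError, ModuleNotFoundError):
        # Not a regular package, or its resources cannot be read.
        pass

    return module.__doc__, guide_text
```

With this PR's changes installed, package_overview("datafusion") would be expected to return the enriched docstring together with the guide text; for a package that ships no such file, the second element is None.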