Add AGENTS.md and enrich package docstring #1497

Draft
timsaucer wants to merge 5 commits into apache:main from timsaucer:feat/create-user-agent-file

timsaucer (Member) commented Apr 15, 2026

Which issue does this PR close?

Addresses part of #1394 (PR 1a from the implementation plan)

Rationale for this change

AI agents (and humans) that encounter datafusion via pip install currently get a 2-line module docstring and no structured guide to the DataFrame API. This makes it difficult for agents to produce idiomatic DataFrame code, even though they are very capable with SQL. The goal is that any agent -- whether it encounters the package via pip, the docs site, or the repo -- gets enough context to write correct DataFrame code.

What changes are included in this PR?

  1. python/datafusion/AGENTS.md (new) -- comprehensive DataFrame API guide that ships with pip install datafusion (Maturin includes all files under python-source = "python"). Covers:

    • What DataFusion is and core abstractions
    • Import conventions and data loading
    • All DataFrame operations with examples (select, filter, join, aggregate, window, sort, limit, set operations, deduplication)
    • Executing and collecting results
    • Expression building (arithmetic, comparisons, boolean logic, null handling, CASE/WHEN, casting, aliasing, BETWEEN, IN)
    • SQL-to-DataFrame reference table (~25 mappings)
    • Common pitfalls (boolean operators, lit() wrapping, column quoting, immutable DataFrames, window frame defaults, HAVING pattern)
    • Idiomatic patterns (fluent chaining, variables as CTEs, window functions for scalar subqueries, semi/anti joins for EXISTS/NOT EXISTS)
    • Categorized function index
  2. python/datafusion/__init__.py (modified) -- enriched module docstring from 2 lines to a full overview with core abstractions, a quick-start example, and a pointer to AGENTS.md.

  3. AGENTS.md (modified, root) -- clarified that the root file is for contributors working on the project, and added a pointer to python/datafusion/AGENTS.md for agents that need to use the DataFrame API.
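Because the guide is described as shipping inside the installed package, a script or agent could locate it at runtime without knowing the site-packages path. The following is a minimal sketch using the standard library's `importlib.resources`. The helper name `read_packaged_guide` is hypothetical, and the demo builds a throwaway package called `demo_pkg` rather than assuming `datafusion` is installed.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path


def read_packaged_guide(package: str, filename: str = "AGENTS.md") -> str:
    """Return the text of a data file that ships inside an importable package.

    Hypothetical helper for illustration; it is not part of datafusion.
    """
    return resources.files(package).joinpath(filename).read_text(encoding="utf-8")


def make_demo_package(root: Path, name: str = "demo_pkg") -> None:
    """Create a tiny package on disk that bundles an AGENTS.md next to __init__.py."""
    pkg = root / name
    pkg.mkdir()
    (pkg / "__init__.py").write_text('"""Demo package."""\n', encoding="utf-8")
    (pkg / "AGENTS.md").write_text("# Demo guide\n", encoding="utf-8")


# Build the stand-in package in a temporary directory and make it importable.
_tmp = tempfile.TemporaryDirectory()
make_demo_package(Path(_tmp.name))
sys.path.insert(0, _tmp.name)
importlib.invalidate_caches()

guide_text = read_packaged_guide("demo_pkg")
print(guide_text)
```

If the package ships the file as this PR describes, calling `read_packaged_guide("datafusion")` in an environment with the package installed would be expected to return the guide text. That call is an assumption and is not exercised here.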

Are there any user-facing changes?

Yes -- the datafusion package now ships with an AGENTS.md guide and has a richer module docstring visible via help(datafusion). No API changes.
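To illustrate how a module docstring surfaces through `help()`, here is a small sketch using `pydoc`, the standard-library module that backs `help()`. The module `demo_pkg_doc` and its docstring text are invented stand-ins. They are not the actual datafusion docstring.

```python
import pydoc
import types

# Hypothetical stand-in for a package whose docstring has a summary line,
# an overview paragraph, and a pointer to a bundled guide.
demo = types.ModuleType(
    "demo_pkg_doc",
    "Demo package.\n\n"
    "Core abstractions: a session context and a lazy DataFrame.\n"
    "See AGENTS.md for the full guide.",
)

# The first docstring line is the short summary shown next to the module name.
summary = demo.__doc__.splitlines()[0]

# render_doc produces the same text help() would page, as a plain string.
rendered = pydoc.render_doc(demo, renderer=pydoc.plaintext)
print(rendered)
```

The rendered output places the summary in the NAME section and the remaining lines in DESCRIPTION, which is why a richer docstring is immediately visible to anyone, or any agent, that calls `help()` on the package.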

timsaucer and others added 2 commits April 15, 2026 09:51
Add python/datafusion/AGENTS.md as a comprehensive DataFrame API guide
for AI agents and users. It ships with pip automatically (Maturin includes
everything under python-source = "python"). Covers core abstractions,
import conventions, data loading, all DataFrame operations, expression
building, a SQL-to-DataFrame reference table, common pitfalls, idiomatic
patterns, and a categorized function index.

Enrich the __init__.py module docstring from 2 lines to a full overview
with core abstractions, a quick-start example, and a pointer to AGENTS.md.

Closes apache#1394 (PR 1a)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The root AGENTS.md (symlinked as CLAUDE.md) is for contributors working
on the project. Add a pointer to python/datafusion/AGENTS.md which is
the user-facing DataFrame API guide shipped with the package. Also add
the Apache license header to the package AGENTS.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

timsaucer marked this pull request as draft April 15, 2026 13:54

timsaucer and others added 3 commits April 15, 2026 10:02
Document that all PRs must follow .github/pull_request_template.md and
that pre-commit hooks must pass before committing. List all configured
hooks (actionlint, ruff, ruff-format, cargo fmt, cargo clippy, codespell,
uv-lock) and the command to run them manually.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Let the hooks be discoverable from .pre-commit-config.yaml rather than
maintaining a separate list that can drift.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Clarify that DataFusion works with any Arrow C Data Interface
  implementation, not just PyArrow.
- Show the filter keyword argument on aggregate functions (the idiomatic
  HAVING equivalent) instead of the post-aggregate .filter() pattern.
- Update the SQL reference table to show FILTER (WHERE ...) syntax.
- Remove the now-incorrect "Aggregate then filter for HAVING" pitfall.
- Add .collect() to the fluent chaining example so the result is clearly
  materialized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>