pygxml

A streaming XML parser for Python that answers gjson-style path queries. The core is written in Rust (quick-xml) and exposed to Python through PyO3.

The original GJSON: tidwall/gjson

Installation

pip install pygxml

Usage examples

import pygxml

xml = """<store>
  <book id="b1"><title>XML in a Nutshell</title><price>30</price></book>
  <book id="b2"><title>The Cathedral and the Bazaar</title><price>20</price></book>
  <book id="b3"><title>Programming Rust</title><price>45</price></book>
</store>"""

# Single-shot path query — returns a typed Result.
pygxml.get(xml, "store.book").type_                         # list (3 books)
pygxml.get(xml, "store.book.0.title").to_str()             # 'XML in a Nutshell'
pygxml.get(xml, "store.book.1.@id").to_str()               # 'b2'
pygxml.get(xml, "store.book.#").to_int()                   # 3
[str(r) for r in pygxml.get(xml, "store.book.#.title")]    # ['XML in a Nutshell', ...]

# A bare child name with multiple matches AND a follow-on step is rejected:
# the user must pick `.N` (single) or `.#` (each).
pygxml.get(xml, "store.book.title")                         # ValueError

# Filters
pygxml.get(xml, "store.book.#(price>=30).title").to_str()           # 'XML in a Nutshell'
[str(r) for r in pygxml.get(xml, "store.book.#(price>=30)#.title")] # all matches
pygxml.get(xml, 'store.book.#(@id=="b2").title').to_str()           # 'The Cathedral...'

# Modifiers
pygxml.get(xml, "store.book.#.title|@count").to_int()               # 3

# Result.get(...) — descend into a captured element fragment.
book = pygxml.get(xml, "store.book.0")
book.get("title").to_str()                                # 'XML in a Nutshell'
book.get("@id").to_str()                                  # 'b1'
book.get("price").to_int()                                # 30

# parse(data) — wrap the input as a top-level Result for chained navigation.
r = pygxml.parse(xml)
r.get("store.book.0.title").to_str()                      # 'XML in a Nutshell'
r.get("store.book.#(price>=30)#.title").value             # [Result('XML in a Nutshell'), Result('Programming Rust')]

# get_many — scan the document once and return multiple Results.
title, price = pygxml.get_many(xml, ["store.book.0.title", "store.book.0.price"])

# compile() — pre-compile a path for reuse across many documents.
path = pygxml.compile("store.book.0.title")
path.get(xml).to_str()                                    # 'XML in a Nutshell'

# Compiled paths can also be passed to get_many / get_many_bytes / get_many_buffer.
title_path = pygxml.compile("store.book.0.title")
price_path = pygxml.compile("store.book.0.price")
title, price = pygxml.get_many(xml, [title_path, price_path])

# mmap input — true zero-copy on huge files. parse(mm) keeps the mmap by
# reference, so subsequent .get() calls re-borrow it without copying.
import mmap
with open('huge.xml', 'rb') as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    title = pygxml.parse(mm).get("store.book.0.title").to_str()

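# Namespace prefix-aware match (atom_xml below is an illustrative sample document)
atom_xml = """<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">
  <atom:entry><atom:title>Example entry</atom:title></atom:entry>
</atom:feed>"""
pygxml.get(atom_xml, "atom:feed.atom:entry.atom:title").to_str()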

# Validate well-formedness without raising.
pygxml.validate(xml)                                        # True

API

Module-level functions

Function	Description
`get(xml, path)`	Query `xml` (str) at `path`; returns `Result`
`get_bytes(xml, path)`	Query `xml` (bytes) at `path`; returns `Result`
`get_buffer(xml, path)`	Query `xml` (buffer protocol) at `path`; returns `Result`
`get_many(xml, paths)`	Query `xml` (str) at each path (str or `Path`); returns `list[Result]`
`get_many_bytes(xml, paths)`	Query `xml` (bytes) at each path (str or `Path`); returns `list[Result]`
`get_many_buffer(xml, paths)`	Query `xml` (buffer protocol) at each path (str or `Path`); returns `list[Result]`
`parse(xml)`	Parse the entire XML document into a `Result`
`validate(xml)`	`True` if `xml` is well-formed; does not raise
`compile(path)`	Pre-compile a path expression; returns a `Path`

Result

get and parse return a Result. Its accessors are listed below.

Properties

Property	Description
`r.type_`	Python type for this value: `None`, `bool`, `int`, `float`, `str`, `list`, `dict`
`r.value`	Value converted to the corresponding Python type: `None` / `int` / `float` / `str` / `list[Result]` / `dict[str, Result]`

gjson-style methods

Method	Description
`r.exists()`	`True` if the value was found in the XML
`r.to_str()`	String representation (text content for elements, or full XML for dict/list elements)
`r.to_int()`,`r.to_float()`	Typed coercions; return `0` / `0.0` when empty
`r.to_bool()`	gjson-style boolean coercion (see below); returns `False` when empty
`r.get(path)`	Sub-query relative to this value
`r.get_many(paths)`	Sub-query at multiple paths; returns `list[Result]`

Result.to_bool() follows gjson semantics:

"1" / "true" → True; "0" / "false" → False
"\"t\"" / "\"T\"" / "\"1\"" → True; "\"f\"" / "\"F\"" / "\"0\"" → False
"\"true\"" / "\"TRUE\"" / "\"True\"" → True; "\"false\"" / "\"FALSE\"" / "\"False\"" → False
Any other value: to_int() != 0 (non-numeric strings → False)
Non-empty dict or list Result → True; empty Result → False
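
The scalar rules above can be restated as a small pure-Python sketch. This is not pygxml's implementation: `to_bool_sketch` is a hypothetical helper, it only covers text values (not dict or list Results), and it assumes that to_int() behaves like Python's `int()` with a fallback of 0 for non-numeric text.

```python
# Hypothetical model of the listed to_bool() rules for scalar text only.
_RAW = {"1": True, "true": True, "0": False, "false": False}
_QUOTED_TRUE = {"t", "T", "1", "true", "TRUE", "True"}
_QUOTED_FALSE = {"f", "F", "0", "false", "FALSE", "False"}


def to_bool_sketch(text: str) -> bool:
    # Rule 1: bare 1/true/0/false.
    if text in _RAW:
        return _RAW[text]
    # Rules 2 and 3: double-quoted boolean spellings.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        inner = text[1:-1]
        if inner in _QUOTED_TRUE:
            return True
        if inner in _QUOTED_FALSE:
            return False
    # Rule 4: anything else is true only if it is a non-zero integer.
    # Assumption: non-numeric (and empty) text counts as 0.
    try:
        return int(text) != 0
    except ValueError:
        return False
```

Under these assumptions, `to_bool_sketch('"T"')` is true, while `to_bool_sketch("hello")` and `to_bool_sketch("")` are both false.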

Result.get(path) only descends into element items — scalar items (attributes, #text, counts, modifier aggregates like @sum) have no children, so .get(...) against them yields an empty Result.

Pythonic methods

Syntax	Description
`str(v)`,`repr(v)`	dict: `<Result type=dict, keys=[...]>`; list: `<Result type=list, value=[...]>`; others: `str(v.value)`
`int(v)`	64-bit integer
`float(v)`	64-bit float
`bool(v)`	Equivalent to `bool(v.value)` — `False` for null/false/0/""/[]/{}
`len(v)`	Character count for str; element count for list/dict elements
`v[key]`	Subscript access
`key in v`	Key membership for dict; string match for list
`iter(v)`	Lazy iterator: chars for str; `Result`s for list; keys for dict
`v.keys()`	Lazy `KeysView` of dict keys (raises `TypeError` for non-dict)
`v.values()`	Lazy `ValuesView` of dict values (raises `TypeError` for non-dict)
`v.items()`	Lazy `ItemsView` of `(key, Result)` pairs (raises `TypeError` for non-dict)
`r == "x"`, `r == ["a", "b"]`, `r == other_result`	Equality with str/list/Result

Path

compile(path) returns a Path — a compiled, reusable path expression. Using a pre-compiled Path avoids re-parsing the path string on every call, which matters when querying many documents with the same path.

Method	Description
`p.get(data)`	Query XML string `data`; equivalent to `get(data, path)`
`p.get_bytes(data)`	Query XML bytes `data`
`p.get_buffer(data)`	Query buffer-protocol `data` (mmap, bytearray, memoryview)

Path objects are also accepted wherever a path string is accepted: get_many, get_many_bytes, get_many_buffer, and Result.get_many.

Path syntax

Syntax	Meaning
`a.b.c`	Descend into child elements (local-name match, ignores namespaces)
`a.0`, `a.1`	N-th same-named sibling
`a.#`	Count of same-named siblings
`a.#.b`	Project `b` over all same-named siblings
`*`, `?`	Wildcards in element name
`@name`	Attribute reference
`#text`	Explicit text content
`\.`, `\@`	Escape
`a.#(expr)`	Filter, first match. `expr ::= path op value`
`a.#(expr)#`	Filter, all matches
`path | @modifier`	Apply modifier (`@reverse`, `@first`, `@last`, `@count`, `@sort`, `@sort_n`, `@unique`/`@uniq`, `@flatten`, `@tostr`, `@sum`, `@avg`/`@mean`, `@min`, `@max`)
`prefix:local`	Prefix-aware match — qualified-name literal compare (matches `<atom:title>`, not `<rss:title>`)
`a.**.b`	Descendant: match every `b` at any depth under `a` (XPath `//` equivalent)
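
For readers who know XPath better than gjson paths, the difference between a child step and a descendant step can be seen with the standard library alone. The sketch below uses `xml.etree.ElementTree`, not pygxml, and the sample document is made up for illustration; it only shows what "any depth" means for the XPath `//` form that `a.**.b` is compared to.

```python
import xml.etree.ElementTree as ET

# Illustrative document: <b> appears at three different depths under <a>.
doc = ET.fromstring("<a><b>1</b><c><b>2</b><d><b>3</b></d></c></a>")

# Descendant step: every <b> at any depth below <a>.
descendant_texts = [b.text for b in doc.findall(".//b")]

# Child step: only <b> elements that are direct children of <a>.
child_texts = [b.text for b in doc.findall("b")]

print(descendant_texts)  # ['1', '2', '3']
print(child_texts)       # ['1']
```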

A bare child name (no .N/.#/filter) is implicit: at the terminal position it returns every match (a list-shaped Result), but at a non-terminal position the engine raises ValueError if more than one element matches. To chain past a multi-match step, pick one (.0, .1, …) or project explicitly with .#. Single-match elements (e.g., a unique root element) chain transparently.

Result.get(path) follows the same rule: it requires the receiver to hold at most one element. To process every element of a multi-match Result, iterate (for item in result: item.get(...)).
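
The bare-name rule can be modelled in a few lines of standard-library Python. This is a sketch of the rule as described, not pygxml's engine: `resolve_bare_path` is a hypothetical helper that handles only plain `a.b.c` paths (no `.N`, `.#`, filters, attributes or namespaces).

```python
import xml.etree.ElementTree as ET

STORE_XML = """<store>
  <book id="b1"><title>XML in a Nutshell</title><price>30</price></book>
  <book id="b2"><title>The Cathedral and the Bazaar</title><price>20</price></book>
  <book id="b3"><title>Programming Rust</title><price>45</price></book>
</store>"""


def resolve_bare_path(xml_text: str, path: str) -> list[ET.Element]:
    """Resolve a dotted path of bare element names under the rule above."""
    names = path.split(".")
    root = ET.fromstring(xml_text)
    # The first name is matched against the root element itself.
    current = [root] if root.tag == names[0] else []
    for name in names[1:]:
        # We are about to step past `current`, so it is non-terminal:
        # more than one match here is ambiguous and is rejected.
        if len(current) > 1:
            raise ValueError(f"multiple matches before {name!r}; use .N or .#")
        current = [child for el in current for child in el if child.tag == name]
    # The terminal step may return any number of matches.
    return current
```

With this model, `resolve_bare_path(STORE_XML, "store.book")` returns all three books, while `"store.book.title"` raises ValueError because the `book` step matched more than once before the path continued.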

Filter operators: == = != < <= > >= % (glob) !% (negative glob). Filter values: "string", number, true/false, or bare unquoted text.
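
To make the first-match versus all-matches distinction concrete, here is a rough standard-library model of filter evaluation. It is not how pygxml evaluates filters: `filter_first` and `filter_all` are hypothetical helpers, numeric comparison is assumed whenever both sides parse as numbers, and `fnmatch` is used as an approximation of the glob operators.

```python
import fnmatch
import operator
import xml.etree.ElementTree as ET

STORE = ET.fromstring("""<store>
  <book id="b1"><title>XML in a Nutshell</title><price>30</price></book>
  <book id="b2"><title>The Cathedral and the Bazaar</title><price>20</price></book>
  <book id="b3"><title>Programming Rust</title><price>45</price></book>
</store>""")

COMPARE = {"==": operator.eq, "=": operator.eq, "!=": operator.ne,
           "<": operator.lt, "<=": operator.le,
           ">": operator.gt, ">=": operator.ge}


def _field(element, path):
    """Read '@name' from the attributes, anything else from a child's text."""
    if path.startswith("@"):
        return element.get(path[1:])
    child = element.find(path)
    return None if child is None else (child.text or "")


def _matches(element, path, op, value):
    text = _field(element, path)
    if text is None:
        return False  # assumption: a missing field never matches
    if op in ("%", "!%"):
        hit = fnmatch.fnmatchcase(text, value)
        return hit if op == "%" else not hit
    try:
        left, right = float(text), float(value)  # numeric when both parse
    except ValueError:
        left, right = text, value                # otherwise compare as strings
    return COMPARE[op](left, right)


def filter_all(parent, name, path, op, value):
    """Model of a.#(expr)#: every same-named child that satisfies expr."""
    return [e for e in parent.findall(name) if _matches(e, path, op, value)]


def filter_first(parent, name, path, op, value):
    """Model of a.#(expr): only the first child that satisfies expr."""
    hits = filter_all(parent, name, path, op, value)
    return hits[0] if hits else None
```

For example, `filter_first(STORE, "book", "price", ">=", "30")` yields the "XML in a Nutshell" book, and `filter_all` with the same arguments yields that book plus "Programming Rust", mirroring the `#(price>=30)` and `#(price>=30)#` examples in the usage section.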

Inputs

Accepted inputs: str, bytes, and any object implementing the buffer protocol (mmap.mmap, bytearray, memoryview). bytes and mmap are zero-copy on the way in; str is copied once to UTF-8.

pygxml.parse(data) keeps a reference to the input object instead of copying it. Two consequences worth knowing:

For mmap input, do not close the mmap while a Result derived from it is still in use: re-borrows during .get() / .to_str() will fault.
For bytearray and other mutable buffers, mutations after parse(...) are observed by subsequent Result accesses.

pygxml.get(data, path) does not retain the input; captured element fragments inside the returned Result are owned copies.
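
The difference between holding a reference and holding an owned copy can be seen with plain Python buffers, no pygxml required. In the sketch below a `memoryview` stands in for the by-reference behaviour described for parse(...), and `bytes(...)` stands in for the owned copies returned by get(...); the variable names are illustrative only.

```python
buf = bytearray(b"<n>1</n>")

view = memoryview(buf)   # by reference: shares memory with buf
owned = bytes(buf)       # owned copy: an independent snapshot

buf[3] = ord("2")        # mutate the buffer in place afterwards

seen_by_view = view.tobytes()
seen_by_copy = owned

print(seen_by_view)  # b'<n>2</n>'  (the mutation is visible)
print(seen_by_copy)  # b'<n>1</n>'  (the snapshot is unchanged)
```

If later mutation of the input is a concern, passing an immutable `bytes` object is one way to rule it out.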

License

MIT
