
feat(data): replace git LFS with S3 for test data downloads#2001

Draft
spomichter wants to merge 2 commits into dev from s3-data-migration

Conversation


@spomichter spomichter commented May 7, 2026

Summary

Replace git LFS with S3 for fetching test data archives, addressing recurring LFS budget exhaustion and slow downloads. Refs #1998.

  • dimos/utils/data.py: lazy S3 download via boto3 (with unsigned-access fallback for public read), replacing git lfs pull smudging
  • bin/s3_push: new compress + upload script (replaces bin/lfs_push)
  • bin/hooks/lfs_check: hook now directs devs to bin/s3_push
  • .gitattributes / .gitignore: drop LFS filters; stop tracking data/.lfs/ in git
  • Bucket / prefix / region overridable via DIMOS_DATA_S3_BUCKET, DIMOS_DATA_S3_PREFIX, DIMOS_DATA_S3_REGION

The LfsPath API and get_data() function are unchanged — all callers continue working without modification.
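
For example, a developer behind a regional mirror could override the defaults before running tests (the bucket, prefix, and region values below are illustrative, not real infrastructure):

```shell
# Hypothetical regional-mirror override; values are examples only.
export DIMOS_DATA_S3_BUCKET=my-regional-mirror
export DIMOS_DATA_S3_PREFIX=.lfs/
export DIMOS_DATA_S3_REGION=eu-west-1
```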

Notes

  • Default bucket: s3://dimos-github-lfs/.lfs/ (us-east-2)
  • 71/75 archives mirrored from LFS to S3
  • 4 files (go2_sf_office, models_graspgen, piper_description, xarm_description) remained as LFS pointer stubs locally and need to be sourced from a teammate's cache or a temp LFS budget bump before tests that depend on them can pass
  • Branch-level data versioning (e.g. dev vs main pointing at different SHAs of the same archive) isn't handled by the flat-key S3 layout in this PR — see issue for discussion of moving to content-addressed keys later

Test plan

  • `pytest dimos/utils/test_data.py` — 12/12 pass (10 mirrored from LFS + 2 new edge cases)
  • Full `pytest` suite: 1615 passed / 26 skipped / 1 failure + 3 errors that all pass in isolation (pre-existing flakiness, not S3-related)
  • End-to-end round-trip via `bin/s3_push` → S3 → `get_data()` → decompress → contents verified
  • Local SHA256 of 70/75 archives verified against current `dev` LFS pointers
  • Verify CI passes
  • Recover the 4 remaining pointer stubs

Move test data fetching from git LFS to S3 to avoid recurring LFS
budget exhaustion and slow downloads. Devs no longer need git-lfs
installed; data is pulled lazily from s3://dimos-github-lfs/.lfs/
on demand via boto3.

Changes:
- dimos/utils/data.py: replace _pull_lfs_archive() with _pull_s3_archive()
  using boto3, with unsigned-access fallback for public read
- bin/s3_push: new compress + upload script (replaces bin/lfs_push)
- bin/hooks/lfs_check: point devs to bin/s3_push instead of bin/lfs_push
- .gitattributes: drop LFS filter rules
- .gitignore: stop tracking data/.lfs/ in git
- dimos/utils/test_data.py: drop LFS-specific pointer-file assertions

Bucket location is overridable via DIMOS_DATA_S3_BUCKET /
DIMOS_DATA_S3_PREFIX / DIMOS_DATA_S3_REGION env vars so devs can
point at a regional mirror.

Refs #1998
@spomichter spomichter marked this pull request as ready for review May 7, 2026 05:58

codecov Bot commented May 7, 2026

❌ 4 Tests Failed:

Tests completed: 4 · Failed: 4 · Passed: 0 · Skipped: 0
::dimos.mapping.osm.test_osm
Stack Traces | 0s run time
.../mapping/osm/test_osm.py:28: in <module>
    _fixture_dir = get_data("osm_map_test")
dimos/utils/data.py:192: in get_data
    archive_path = _decompress_archive(_pull_s3_archive(archive_name))
dimos/utils/data.py:138: in _pull_s3_archive
    client = _get_s3_client()
dimos/utils/data.py:105: in _get_s3_client
    client.head_bucket(Bucket=S3_BUCKET)
.venv/lib/python3.12........./site-packages/botocore/client.py:606: in _api_call
    return self._make_api_call(operation_name, kwargs)
.venv/lib/python3.12.../site-packages/botocore/context.py:123: in wrapper
    return func(*args, **kwargs)
.venv/lib/python3.12........./site-packages/botocore/client.py:1076: in _make_api_call
    http, parsed_response = self._make_request(
.venv/lib/python3.12........./site-packages/botocore/client.py:1100: in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
.venv/lib/python3.12........./site-packages/botocore/endpoint.py:119: in make_request
    return self._send_request(request_dict, operation_model)
.venv/lib/python3.12........./site-packages/botocore/endpoint.py:196: in _send_request
    request = self.create_request(request_dict, operation_model)
.venv/lib/python3.12........./site-packages/botocore/endpoint.py:132: in create_request
    self._event_emitter.emit(
.venv/lib/python3.12........./site-packages/botocore/hooks.py:412: in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
.venv/lib/python3.12........./site-packages/botocore/hooks.py:256: in emit
    return self._emit(event_name, kwargs)
.venv/lib/python3.12........./site-packages/botocore/hooks.py:239: in _emit
    response = handler(**kwargs)
.venv/lib/python3.12....../site-packages/botocore/signers.py:108: in handler
    return self.sign(operation_name, request)
.venv/lib/python3.12....../site-packages/botocore/signers.py:200: in sign
    auth.add_auth(request)
.venv/lib/python3.12.../site-packages/botocore/auth.py:429: in add_auth
    raise NoCredentialsError()
E   botocore.exceptions.NoCredentialsError: Unable to locate credentials
::dimos.mapping.pointclouds.test_occupancy
Stack Traces | 0s run time
.../mapping/pointclouds/test_occupancy.py:33: in <module>
    from dimos.utils.testing.test_moment import Go2Moment
<frozen importlib._bootstrap>:1360: in _find_and_load
    ???
<frozen importlib._bootstrap>:1331: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:935: in _load_unlocked
    ???
.venv/lib/python3.12.../_pytest/assertion/rewrite.py:185: in exec_module
    exec(co, module.__dict__)
.../utils/testing/test_moment.py:27: in <module>
    data_dir = get_data("unitree_go2_office_walk2")
dimos/utils/data.py:192: in get_data
    archive_path = _decompress_archive(_pull_s3_archive(archive_name))
dimos/utils/data.py:138: in _pull_s3_archive
    client = _get_s3_client()
dimos/utils/data.py:105: in _get_s3_client
    client.head_bucket(Bucket=S3_BUCKET)
    (remaining botocore frames identical to the first trace above)
E   botocore.exceptions.NoCredentialsError: Unable to locate credentials
::dimos.mapping.test_voxels
Stack Traces | 0s run time
dimos/mapping/test_voxels.py:27: in <module>
    from dimos.utils.testing.test_moment import Go2Moment
<frozen importlib._bootstrap>:1360: in _find_and_load
    ???
<frozen importlib._bootstrap>:1331: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:935: in _load_unlocked
    ???
.venv/lib/python3.12.../_pytest/assertion/rewrite.py:185: in exec_module
    exec(co, module.__dict__)
.../utils/testing/test_moment.py:27: in <module>
    data_dir = get_data("unitree_go2_office_walk2")
dimos/utils/data.py:192: in get_data
    archive_path = _decompress_archive(_pull_s3_archive(archive_name))
dimos/utils/data.py:138: in _pull_s3_archive
    client = _get_s3_client()
dimos/utils/data.py:105: in _get_s3_client
    client.head_bucket(Bucket=S3_BUCKET)
    (remaining botocore frames identical to the first trace above)
E   botocore.exceptions.NoCredentialsError: Unable to locate credentials
::dimos.utils.testing.test_moment
Stack Traces | 0s run time
.../utils/testing/test_moment.py:27: in <module>
    data_dir = get_data("unitree_go2_office_walk2")
dimos/utils/data.py:192: in get_data
    archive_path = _decompress_archive(_pull_s3_archive(archive_name))
dimos/utils/data.py:138: in _pull_s3_archive
    client = _get_s3_client()
dimos/utils/data.py:105: in _get_s3_client
    client.head_bucket(Bucket=S3_BUCKET)
    (remaining botocore frames identical to the first trace above)
E   botocore.exceptions.NoCredentialsError: Unable to locate credentials



greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR replaces git LFS with direct S3 downloads for test data, addressing LFS budget exhaustion. The get_data() / LfsPath public API is unchanged; internally, _pull_lfs_archive is replaced by _pull_s3_archive using boto3, and a new bin/s3_push script handles compression and upload.

  • dimos/utils/data.py: new _get_s3_client() (with unsigned-access fallback) and _pull_s3_archive() replace the git-lfs subprocess path; get_project_root() no longer clones the repo when running as an installed package.
  • bin/s3_push: new compress-and-sync script that tars each data/* entry and pushes to S3 via aws s3 sync.
  • Config/ignore cleanup: LFS filter attributes removed from .gitattributes, data/ fully ignored in .gitignore, boto3>=1.34 added as a runtime dependency.

Confidence Score: 4/5

Safe to merge after addressing the ClientError mis-classification in _pull_s3_archive and the issues flagged in prior review threads.

The download error handler in _pull_s3_archive catches every ClientError and re-raises it as FileNotFoundError with a not found message, masking real permission and throttling failures.

dimos/utils/data.py — specifically the error-handling branches in _get_s3_client and _pull_s3_archive.

Important Files Changed

  • dimos/utils/data.py: Core S3 download logic replacing git-lfs; all ClientErrors (403, 429, 500) are incorrectly re-raised as FileNotFoundError with a "not found" message, masking real access/throttling failures.
  • bin/s3_push: New upload script replacing lfs_push; missing shebang # (known thread), dead if [ $? -eq 0 ] branch under set -e (known thread), and compressed_entries array populated but never consumed.
  • dimos/utils/test_data.py: Test suite updated to remove git-lfs subprocess calls and add two new S3-specific edge-case tests (missing file, idempotency). No issues found.
  • bin/hooks/lfs_check: Hook message updated to reference bin/s3_push instead of bin/lfs_push. Straightforward string change, no issues.
  • pyproject.toml: Adds boto3>=1.34 runtime dependency and boto3/botocore mypy ignore overrides. Clean change.
  • .gitattributes: LFS filter attributes removed from image/media types and data/.lfs; binary markers retained. No issues.
  • .gitignore: Simplified from "ignore data/* but keep data/.lfs/" to ignoring the entire data/ tree since archives now live in S3. Correct.

Reviews (2). Last reviewed commit: "fix(data): add boto3 dependency and mypy..."

Comment thread bin/s3_push
@@ -0,0 +1,76 @@
!/bin/bash

P1 Missing # in shebang line

The file starts with !/bin/bash instead of #!/bin/bash. Without the #! two-byte magic sequence, the kernel cannot identify this as a script and will return "Exec format error" on most Linux systems, making the upload workflow completely broken for any developer who tries to run ./bin/s3_push directly.

Comment thread dimos/utils/data.py
Comment on lines +99 to 113
def _get_s3_client():  # type: ignore[no-untyped-def]
    """Get a boto3 S3 client, using instance profile / env credentials."""
    try:
        session = boto3.Session(region_name=S3_REGION)
        client = session.client("s3")
        # Quick check that credentials work
        client.head_bucket(Bucket=S3_BUCKET)
        return client
    except ClientError:
        # Fall back to unsigned access (public bucket)
        return boto3.client(
            "s3",
            region_name=S3_REGION,
            config=BotoConfig(signature_version=UNSIGNED),
        )

P1 NoCredentialsError not caught — unsigned fallback unreachable in credential-less environments

botocore.exceptions.NoCredentialsError is a subclass of BotoCoreError, not ClientError. When no AWS credentials are configured (e.g. fresh developer machine, CI without IAM role), client.head_bucket() raises NoCredentialsError, which escapes the except ClientError block and propagates as an unhandled exception — the unsigned public-bucket fallback is never reached. The caller _pull_s3_archive has no handler for this, so tests will crash with a confusing NoCredentialsError instead of transparently using unsigned access as intended.

Comment thread bin/s3_push
Comment on lines +61 to +68
if [ $? -eq 0 ]; then
compressed_size=$(du -sh "$compressed_file" | cut -f1)
echo -e " ${GREEN}✓${NC} Compressed $entry_name (${GREEN}$entry_size${NC} → ${GREEN}$compressed_size${NC})"
compressed_entries+=("$entry_name")
else
echo -e " ${RED}✗${NC} Failed to compress $entry_name"
exit 1
fi

P2 Dead code: if [ $? -eq 0 ] is unreachable under set -e

set -e at line 5 causes the shell to exit immediately if tar returns a non-zero exit code, so execution never reaches the if [ $? -eq 0 ] check when tar fails. The else branch (failure message + exit 1) is dead code.
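
One way to keep the failure message without the dead branch is to test the command directly in the `if` condition, which exempts it from `set -e`. A sketch (the `compress_entry` function name and messages are illustrative, not the script's exact code):

```shell
set -e

# Commands tested in an `if` condition are exempt from `set -e`,
# so the failure branch here is actually reachable.
compress_entry() {
  local entry_name="$1" compressed_file="$2"
  if tar -czf "$compressed_file" "$entry_name" 2>/dev/null; then
    echo "compressed $entry_name -> $compressed_file"
  else
    echo "failed to compress $entry_name" >&2
    return 1
  fi
}
```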

Comment thread dimos/utils/data.py
Comment on lines 116 to 121
@@ -181,46 +121,40 @@ def _decompress_archive(filename: str | Path) -> Path:
return target_dir / filename_path.name.replace(".tar.gz", "")

P2 security tarfile.extractall() without filter — path-traversal risk and deprecation warning

Calling tar.extractall(target_dir) without a filter argument is deprecated since Python 3.12 and produces DeprecationWarning; a future Python release will change the default to filter='data'. More importantly, a crafted archive could use absolute paths or ../ components to write files outside target_dir. Passing filter='data' strips unsafe members and also silences the warning.

mypy was failing on dimos/utils/data.py because boto3 wasn't a
declared dependency. Add it to pyproject.toml dependencies and the
ignore_missing_imports list (boto3 doesn't ship inline type stubs
without boto3-stubs).
Comment thread dimos/utils/data.py
Comment on lines +139 to 145
    try:
        client.download_file(S3_BUCKET, s3_key, str(local_path))
    except ClientError as e:
        raise FileNotFoundError(
            f"Data file '{archive_name}' not found in s3://{S3_BUCKET}/{S3_PREFIX}. "
            f"Make sure it has been uploaded with bin/s3_push. ({e})"
        )

P1 All ClientErrors mis-reported as "not found"

except ClientError catches every S3 error — 403 AccessDenied, 503 SlowDown, 500 InternalError, etc. — and re-raises each one as FileNotFoundError with the message "not found in s3://…". A developer who hits a permission error or throttle will waste time hunting for a supposedly missing S3 object when the archive is present but inaccessible or rate-limited. Only HTTP 404 / NoSuchKey genuinely means the object is absent; other codes should surface their real cause.


leshy commented May 7, 2026

Try S3-backed LFS self-hosting first.

To avoid re-inventing LFS (branching, third-party devs, etc.), benchmark self-hosted LFS against a plain wget from S3 to see if there are real speed differences (LFS seems slow sometimes; not sure if that's GitHub's side). That decides whether we need raw S3 mirroring or just our own LFS server.

branching, third party devs

This needs solving for the above to be viable: how do I test a third-party branch that added a new big file? How do I test a branch that overrides an existing file? When do these files get deleted?

We can solve all of this:

Keep using LFS as the branching/identity mechanism and defer to S3 as a cache: store files in S3 under their hashes, not their filenames.

cat data/.lfs/xarm6.tar.gz 
version https://git-lfs.github.com/spec/v1
oid sha256:71c9990ab779d20b878ec4c6b0ee21b29bbb7963b59375acf9dd635b7241009b
size 1863215

(The actual .lfs stub files store a hash; use that hash as your S3 key.)
Keep using LFS, but our dimos get_data function can try the S3 caching layer first and auto-defer to LFS if it 404s:

GET our_bucket/71c9990ab779d20b878ec4c6b0ee21b29bbb7963b59375acf9dd635b7241009b

This lets us upload whatever we want, whenever we want. If we don't upload, everything still works.

If we upload just the current dev or main to S3, 99% of traffic comes from S3.

@spomichter spomichter marked this pull request as draft May 7, 2026 07:23