
feat(data): replace git LFS with S3 for test data downloads#2001

Draft
spomichter wants to merge 2 commits into dev from s3-data-migration

Conversation


@spomichter spomichter commented May 7, 2026

Summary

Replace git LFS with S3 for fetching test data archives, addressing recurring LFS budget exhaustion and slow downloads. Refs #1998.

  • dimos/utils/data.py: lazy S3 download via boto3 (with unsigned-access fallback for public read), replacing git lfs pull smudging
  • bin/s3_push: new compress + upload script (replaces bin/lfs_push)
  • bin/hooks/lfs_check: hook now directs devs to bin/s3_push
  • .gitattributes / .gitignore: drop LFS filters; stop tracking data/.lfs/ in git
  • Bucket / prefix / region overridable via DIMOS_DATA_S3_BUCKET, DIMOS_DATA_S3_PREFIX, DIMOS_DATA_S3_REGION

The LfsPath API and get_data() function are unchanged — all callers continue working without modification.
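
For example, a developer behind a regional mirror could override the defaults before running tests (the bucket, prefix, and region values below are illustrative, not real infrastructure):

```shell
# Hypothetical regional-mirror override; values are examples only.
export DIMOS_DATA_S3_BUCKET=my-regional-mirror
export DIMOS_DATA_S3_PREFIX=.lfs/
export DIMOS_DATA_S3_REGION=eu-west-1
```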

Notes

  • Default bucket: s3://dimos-github-lfs/.lfs/ (us-east-2)
  • 71/75 archives mirrored from LFS to S3
  • 4 files (go2_sf_office, models_graspgen, piper_description, xarm_description) remained as LFS pointer stubs locally and need to be sourced from a teammate's cache or a temp LFS budget bump before tests that depend on them can pass
  • Branch-level data versioning (e.g. dev vs main pointing at different SHAs of the same archive) isn't handled by the flat-key S3 layout in this PR — see issue for discussion of moving to content-addressed keys later

Test plan

  • `pytest dimos/utils/test_data.py` — 12/12 pass (10 mirrored from LFS + 2 new edge cases)
  • Full `pytest` suite: 1615 passed / 26 skipped / 1 failure + 3 errors that all pass in isolation (pre-existing flakiness, not S3-related)
  • End-to-end round-trip via `bin/s3_push` → S3 → `get_data()` → decompress → contents verified
  • Local SHA256 of 70/75 archives verified against current `dev` LFS pointers
  • Verify CI passes
  • Recover the 4 remaining pointer stubs

Move test data fetching from git LFS to S3 to avoid recurring LFS
budget exhaustion and slow downloads. Devs no longer need git-lfs
installed; data is pulled lazily from s3://dimos-github-lfs/.lfs/
on demand via boto3.

Changes:
- dimos/utils/data.py: replace _pull_lfs_archive() with _pull_s3_archive()
  using boto3, with unsigned-access fallback for public read
- bin/s3_push: new compress + upload script (replaces bin/lfs_push)
- bin/hooks/lfs_check: point devs to bin/s3_push instead of bin/lfs_push
- .gitattributes: drop LFS filter rules
- .gitignore: stop tracking data/.lfs/ in git
- dimos/utils/test_data.py: drop LFS-specific pointer-file assertions

Bucket location is overridable via DIMOS_DATA_S3_BUCKET /
DIMOS_DATA_S3_PREFIX / DIMOS_DATA_S3_REGION env vars so devs can
point at a regional mirror.

Refs #1998
@spomichter spomichter marked this pull request as ready for review May 7, 2026 05:58

codecov Bot commented May 7, 2026

❌ 4 Tests Failed:

Tests completed: 4 · Failed: 4 · Passed: 0 · Skipped: 0
::dimos.mapping.osm.test_osm
Stack Traces | 0s run time
.../mapping/osm/test_osm.py:28: in <module>
    _fixture_dir = get_data("osm_map_test")
dimos/utils/data.py:192: in get_data
    archive_path = _decompress_archive(_pull_s3_archive(archive_name))
dimos/utils/data.py:138: in _pull_s3_archive
    client = _get_s3_client()
dimos/utils/data.py:105: in _get_s3_client
    client.head_bucket(Bucket=S3_BUCKET)
.venv/lib/python3.12........./site-packages/botocore/client.py:606: in _api_call
    return self._make_api_call(operation_name, kwargs)
.venv/lib/python3.12.../site-packages/botocore/context.py:123: in wrapper
    return func(*args, **kwargs)
.venv/lib/python3.12........./site-packages/botocore/client.py:1076: in _make_api_call
    http, parsed_response = self._make_request(
.venv/lib/python3.12........./site-packages/botocore/client.py:1100: in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
.venv/lib/python3.12........./site-packages/botocore/endpoint.py:119: in make_request
    return self._send_request(request_dict, operation_model)
.venv/lib/python3.12........./site-packages/botocore/endpoint.py:196: in _send_request
    request = self.create_request(request_dict, operation_model)
.venv/lib/python3.12........./site-packages/botocore/endpoint.py:132: in create_request
    self._event_emitter.emit(
.venv/lib/python3.12........./site-packages/botocore/hooks.py:412: in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
.venv/lib/python3.12........./site-packages/botocore/hooks.py:256: in emit
    return self._emit(event_name, kwargs)
.venv/lib/python3.12........./site-packages/botocore/hooks.py:239: in _emit
    response = handler(**kwargs)
.venv/lib/python3.12....../site-packages/botocore/signers.py:108: in handler
    return self.sign(operation_name, request)
.venv/lib/python3.12....../site-packages/botocore/signers.py:200: in sign
    auth.add_auth(request)
.venv/lib/python3.12.../site-packages/botocore/auth.py:429: in add_auth
    raise NoCredentialsError()
E   botocore.exceptions.NoCredentialsError: Unable to locate credentials
::dimos.mapping.pointclouds.test_occupancy
Stack Traces | 0s run time
.../mapping/pointclouds/test_occupancy.py:33: in <module>
    from dimos.utils.testing.test_moment import Go2Moment
<frozen importlib._bootstrap>:1360: in _find_and_load
    ???
<frozen importlib._bootstrap>:1331: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:935: in _load_unlocked
    ???
.venv/lib/python3.12.../_pytest/assertion/rewrite.py:185: in exec_module
    exec(co, module.__dict__)
.../utils/testing/test_moment.py:27: in <module>
    data_dir = get_data("unitree_go2_office_walk2")
dimos/utils/data.py:192: in get_data
    archive_path = _decompress_archive(_pull_s3_archive(archive_name))
dimos/utils/data.py:138: in _pull_s3_archive
    client = _get_s3_client()
dimos/utils/data.py:105: in _get_s3_client
    client.head_bucket(Bucket=S3_BUCKET)
    (remaining botocore frames identical to the first trace above)
E   botocore.exceptions.NoCredentialsError: Unable to locate credentials
::dimos.mapping.test_voxels
Stack Traces | 0s run time
dimos/mapping/test_voxels.py:27: in <module>
    from dimos.utils.testing.test_moment import Go2Moment
<frozen importlib._bootstrap>:1360: in _find_and_load
    ???
<frozen importlib._bootstrap>:1331: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:935: in _load_unlocked
    ???
.venv/lib/python3.12.../_pytest/assertion/rewrite.py:185: in exec_module
    exec(co, module.__dict__)
.../utils/testing/test_moment.py:27: in <module>
    data_dir = get_data("unitree_go2_office_walk2")
dimos/utils/data.py:192: in get_data
    archive_path = _decompress_archive(_pull_s3_archive(archive_name))
dimos/utils/data.py:138: in _pull_s3_archive
    client = _get_s3_client()
dimos/utils/data.py:105: in _get_s3_client
    client.head_bucket(Bucket=S3_BUCKET)
    (remaining botocore frames identical to the first trace above)
E   botocore.exceptions.NoCredentialsError: Unable to locate credentials
::dimos.utils.testing.test_moment
Stack Traces | 0s run time
.../utils/testing/test_moment.py:27: in <module>
    data_dir = get_data("unitree_go2_office_walk2")
dimos/utils/data.py:192: in get_data
    archive_path = _decompress_archive(_pull_s3_archive(archive_name))
dimos/utils/data.py:138: in _pull_s3_archive
    client = _get_s3_client()
dimos/utils/data.py:105: in _get_s3_client
    client.head_bucket(Bucket=S3_BUCKET)
    (remaining botocore frames identical to the first trace above)
E   botocore.exceptions.NoCredentialsError: Unable to locate credentials



greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR replaces git LFS with direct S3 downloads for test data, addressing LFS budget exhaustion. The get_data() / LfsPath public API is unchanged; internally, _pull_lfs_archive is replaced by _pull_s3_archive using boto3, and a new bin/s3_push script handles compression and upload.

  • dimos/utils/data.py: new _get_s3_client() (with unsigned-access fallback) and _pull_s3_archive() replace the git-lfs subprocess path; get_project_root() no longer clones the repo when running as an installed package.
  • bin/s3_push: new compress-and-sync script that tars each data/* entry and pushes to S3 via aws s3 sync.
  • Config/ignore cleanup: LFS filter attributes removed from .gitattributes, data/ fully ignored in .gitignore, boto3>=1.34 added as a runtime dependency.

Confidence Score: 4/5

Safe to merge after addressing the ClientError mis-classification in _pull_s3_archive and the issues flagged in prior review threads.

The download error handler in _pull_s3_archive catches every ClientError and re-raises it as FileNotFoundError with a not found message, masking real permission and throttling failures.

dimos/utils/data.py — specifically the error-handling branches in _get_s3_client and _pull_s3_archive.

Important Files Changed

  • dimos/utils/data.py: Core S3 download logic replacing git-lfs; all ClientErrors (403, 429, 500) are incorrectly re-raised as FileNotFoundError with a "not found" message, masking real access/throttling failures.
  • bin/s3_push: New upload script replacing lfs_push; missing shebang # (known thread), dead if [ $? -eq 0 ] branch under set -e (known thread), and compressed_entries array populated but never consumed.
  • dimos/utils/test_data.py: Test suite updated to remove git-lfs subprocess calls and add two new S3-specific edge-case tests (missing file, idempotency). No issues found.
  • bin/hooks/lfs_check: Hook message updated to reference bin/s3_push instead of bin/lfs_push. Straightforward string change, no issues.
  • pyproject.toml: Adds boto3>=1.34 runtime dependency and boto3/botocore mypy ignore overrides. Clean change.
  • .gitattributes: LFS filter attributes removed from image/media types and data/.lfs; binary markers retained. No issues.
  • .gitignore: Simplified from "ignore data/* but keep data/.lfs/" to ignoring the entire data/ tree since archives now live in S3. Correct.

Reviews (2). Last reviewed commit: "fix(data): add boto3 dependency and mypy..."

Comment thread bin/s3_push
@@ -0,0 +1,76 @@
!/bin/bash

P1 Missing # in shebang line

The file starts with !/bin/bash instead of #!/bin/bash. Without the #! two-byte magic sequence, the kernel cannot identify this as a script and will return "Exec format error" on most Linux systems, making the upload workflow completely broken for any developer who tries to run ./bin/s3_push directly.

Comment thread dimos/utils/data.py
Comment on lines +99 to 113
def _get_s3_client():  # type: ignore[no-untyped-def]
    """Get a boto3 S3 client, using instance profile / env credentials."""
    try:
        session = boto3.Session(region_name=S3_REGION)
        client = session.client("s3")
        # Quick check that credentials work
        client.head_bucket(Bucket=S3_BUCKET)
        return client
    except ClientError:
        # Fall back to unsigned access (public bucket)
        return boto3.client(
            "s3",
            region_name=S3_REGION,
            config=BotoConfig(signature_version=UNSIGNED),
        )

P1 NoCredentialsError not caught — unsigned fallback unreachable in credential-less environments

botocore.exceptions.NoCredentialsError is a subclass of BotoCoreError, not ClientError. When no AWS credentials are configured (e.g. fresh developer machine, CI without IAM role), client.head_bucket() raises NoCredentialsError, which escapes the except ClientError block and propagates as an unhandled exception — the unsigned public-bucket fallback is never reached. The caller _pull_s3_archive has no handler for this, so tests will crash with a confusing NoCredentialsError instead of transparently using unsigned access as intended.

Comment thread bin/s3_push
Comment on lines +61 to +68
if [ $? -eq 0 ]; then
compressed_size=$(du -sh "$compressed_file" | cut -f1)
echo -e " ${GREEN}✓${NC} Compressed $entry_name (${GREEN}$entry_size${NC} → ${GREEN}$compressed_size${NC})"
compressed_entries+=("$entry_name")
else
echo -e " ${RED}✗${NC} Failed to compress $entry_name"
exit 1
fi

P2 Dead code: if [ $? -eq 0 ] is unreachable under set -e

set -e at line 5 causes the shell to exit immediately if tar returns a non-zero exit code, so execution never reaches the if [ $? -eq 0 ] check when tar fails. The else branch (failure message + exit 1) is dead code.
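
One way to keep the failure message without the dead branch is to test the command directly in the `if` condition, which exempts it from `set -e`. A sketch (the `compress_entry` function name and messages are illustrative, not the script's exact code):

```shell
set -e

# Commands tested in an `if` condition are exempt from `set -e`,
# so the failure branch here is actually reachable.
compress_entry() {
  local entry_name="$1" compressed_file="$2"
  if tar -czf "$compressed_file" "$entry_name" 2>/dev/null; then
    echo "compressed $entry_name -> $compressed_file"
  else
    echo "failed to compress $entry_name" >&2
    return 1
  fi
}
```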

Comment thread dimos/utils/data.py
Comment on lines 116 to 121
@@ -181,46 +121,40 @@ def _decompress_archive(filename: str | Path) -> Path:
return target_dir / filename_path.name.replace(".tar.gz", "")

P2 security tarfile.extractall() without filter — path-traversal risk and deprecation warning

Calling tar.extractall(target_dir) without a filter argument is deprecated since Python 3.12 and produces DeprecationWarning; a future Python release will change the default to filter='data'. More importantly, a crafted archive could use absolute paths or ../ components to write files outside target_dir. Passing filter='data' strips unsafe members and also silences the warning.

mypy was failing on dimos/utils/data.py because boto3 wasn't a
declared dependency. Add it to pyproject.toml dependencies and the
ignore_missing_imports list (boto3 doesn't ship inline type stubs
without boto3-stubs).
Comment thread dimos/utils/data.py
Comment on lines +139 to 145
    try:
        client.download_file(S3_BUCKET, s3_key, str(local_path))
    except ClientError as e:
        raise FileNotFoundError(
            f"Data file '{archive_name}' not found in s3://{S3_BUCKET}/{S3_PREFIX}. "
            f"Make sure it has been uploaded with bin/s3_push. ({e})"
        )

P1 All ClientErrors mis-reported as "not found"

except ClientError catches every S3 error — 403 AccessDenied, 503 SlowDown, 500 InternalError, etc. — and re-raises each one as FileNotFoundError with the message "not found in s3://…". A developer who hits a permission error or throttle will waste time hunting for a supposedly missing S3 object when the archive is present but inaccessible or rate-limited. Only HTTP 404 / NoSuchKey genuinely means the object is absent; other codes should surface their real cause.


leshy commented May 7, 2026

Try S3-backed LFS self-hosting first.

To avoid re-inventing LFS (branching, third-party devs, etc.), benchmark self-hosted LFS against a plain wget from S3 to see if there are real speed differences (LFS seems slow sometimes; not sure if that's GitHub's side). That decides whether we need raw S3 mirroring or just our own LFS server.

branching, third party devs

This needs solving for the above to be viable: how do I test a third-party branch that added a new big file? How do I test a branch that overrides an existing file? When do these files get deleted?

We can solve all of this:

Keep using LFS as the branching/identity mechanism and defer to S3 as a cache: store files in S3 under their hashes, not their filenames.

cat data/.lfs/xarm6.tar.gz 
version https://git-lfs.github.com/spec/v1
oid sha256:71c9990ab779d20b878ec4c6b0ee21b29bbb7963b59375acf9dd635b7241009b
size 1863215

(The actual .lfs stub files store a hash; use that hash as your S3 key.)
Keep using LFS, but our dimos get_data function can try the S3 caching layer first and auto-defer to LFS if it 404s:

GET our_bucket/71c9990ab779d20b878ec4c6b0ee21b29bbb7963b59375acf9dd635b7241009b

This lets us upload whatever we want, whenever we want. If we don't upload, everything still works.

If we upload just the current dev or main to S3, 99% of traffic comes from S3.

@spomichter spomichter marked this pull request as draft May 7, 2026 07:23