feat(data): replace git LFS with S3 for test data downloads#2001
spomichter wants to merge 2 commits into dev from
Conversation
Move test data fetching from git LFS to S3 to avoid recurring LFS budget exhaustion and slow downloads. Devs no longer need git-lfs installed; data is pulled lazily from s3://dimos-github-lfs/.lfs/ on demand via boto3.

Changes:

- dimos/utils/data.py: replace _pull_lfs_archive() with _pull_s3_archive() using boto3, with unsigned-access fallback for public read
- bin/s3_push: new compress + upload script (replaces bin/lfs_push)
- bin/hooks/lfs_check: point devs to bin/s3_push instead of bin/lfs_push
- .gitattributes: drop LFS filter rules
- .gitignore: stop tracking data/.lfs/ in git
- dimos/utils/test_data.py: drop LFS-specific pointer-file assertions

Bucket location is overridable via DIMOS_DATA_S3_BUCKET / DIMOS_DATA_S3_PREFIX / DIMOS_DATA_S3_REGION env vars so devs can point at a regional mirror.

Refs #1998
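A minimal sketch of how those overrides might be read (the S3_BUCKET / S3_PREFIX / S3_REGION globals appear in the diff hunks below; whether dimos/utils/data.py reads its defaults exactly this way is an assumption):

```python
import os

# Defaults mirror s3://dimos-github-lfs/.lfs/ in us-east-2; each value can
# be overridden to point at a regional mirror.
S3_BUCKET = os.environ.get("DIMOS_DATA_S3_BUCKET", "dimos-github-lfs")
S3_PREFIX = os.environ.get("DIMOS_DATA_S3_PREFIX", ".lfs")
S3_REGION = os.environ.get("DIMOS_DATA_S3_REGION", "us-east-2")
```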
❌ 4 Tests Failed
Greptile Summary

This PR replaces git LFS with direct S3 downloads for test data, addressing LFS budget exhaustion. The
Confidence Score: 4/5

Safe to merge after addressing the ClientError mis-classification in _pull_s3_archive and the issues flagged in prior review threads. The download error handler in _pull_s3_archive catches every ClientError and re-raises it as FileNotFoundError with a "not found" message, masking real permission and throttling failures. See dimos/utils/data.py, specifically the error-handling branches in _get_s3_client and _pull_s3_archive.
Reviews (2): Last reviewed commit: "fix(data): add boto3 dependency and mypy..."
```diff
@@ -0,0 +1,76 @@
+!/bin/bash
```
The file starts with !/bin/bash instead of #!/bin/bash. Without the #! two-byte magic sequence, the kernel cannot identify this as a script and will return "Exec format error" on most Linux systems, making the upload workflow completely broken for any developer who tries to run ./bin/s3_push directly.
```diff
 def _get_s3_client():  # type: ignore[no-untyped-def]
     """Get a boto3 S3 client, using instance profile / env credentials."""
     try:
-        relative_path = file_path.relative_to(repo_root)
-
-        env = os.environ.copy()
-        env["GIT_LFS_FORCE_PROGRESS"] = "1"
-
-        subprocess.run(
-            ["git", "lfs", "pull", "--include", str(relative_path)],
-            cwd=repo_root,
-            check=True,
-            env=env,
+        session = boto3.Session(region_name=S3_REGION)
+        client = session.client("s3")
+        # Quick check that credentials work
+        client.head_bucket(Bucket=S3_BUCKET)
+        return client
     except ClientError:
         # Fall back to unsigned access (public bucket)
         return boto3.client(
             "s3",
             region_name=S3_REGION,
             config=BotoConfig(signature_version=UNSIGNED),
         )
```
NoCredentialsError not caught — unsigned fallback unreachable in credential-less environments
botocore.exceptions.NoCredentialsError is a subclass of BotoCoreError, not ClientError. When no AWS credentials are configured (e.g. fresh developer machine, CI without IAM role), client.head_bucket() raises NoCredentialsError, which escapes the except ClientError block and propagates as an unhandled exception — the unsigned public-bucket fallback is never reached. The caller _pull_s3_archive has no handler for this, so tests will crash with a confusing NoCredentialsError instead of transparently using unsigned access as intended.
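One possible fix, assuming the S3_REGION / S3_BUCKET globals from the hunk above (a sketch of the suggested handling, not the PR's actual code):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config as BotoConfig
from botocore.exceptions import ClientError, NoCredentialsError


def _get_s3_client():
    """Return a signed S3 client when credentials work, else an unsigned one."""
    try:
        session = boto3.Session(region_name=S3_REGION)
        client = session.client("s3")
        client.head_bucket(Bucket=S3_BUCKET)  # raises if creds are bad or missing
        return client
    except (ClientError, NoCredentialsError):
        # NoCredentialsError is a BotoCoreError, not a ClientError, so it must
        # be caught explicitly for the public-bucket fallback to be reachable.
        return boto3.client(
            "s3",
            region_name=S3_REGION,
            config=BotoConfig(signature_version=UNSIGNED),
        )
```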
```bash
if [ $? -eq 0 ]; then
    compressed_size=$(du -sh "$compressed_file" | cut -f1)
    echo -e "  ${GREEN}✓${NC} Compressed $entry_name (${GREEN}$entry_size${NC} → ${GREEN}$compressed_size${NC})"
    compressed_entries+=("$entry_name")
else
    echo -e "  ${RED}✗${NC} Failed to compress $entry_name"
    exit 1
fi
```
```diff
@@ -181,46 +121,40 @@ def _decompress_archive(filename: str | Path) -> Path:
     return target_dir / filename_path.name.replace(".tar.gz", "")
```
tarfile.extractall() without filter — path-traversal risk and deprecation warning
Calling tar.extractall(target_dir) without a filter argument is deprecated since Python 3.12 and produces a DeprecationWarning; a future Python release will change the default to filter='data'. More importantly, a crafted archive could use absolute paths or ../ components to write files outside target_dir. Passing filter='data' rejects such unsafe members and also silences the warning.
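A sketch of the suggested call, reusing the target_dir and filename_path names from the hunk above:

```python
import tarfile

with tarfile.open(filename_path, "r:gz") as tar:
    # filter="data" rejects absolute paths, ".." traversal, and special
    # files, and silences the Python 3.12+ DeprecationWarning.
    tar.extractall(target_dir, filter="data")
```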
mypy was failing on dimos/utils/data.py because boto3 wasn't a declared dependency. Add it to pyproject.toml dependencies and the ignore_missing_imports list (boto3 doesn't ship inline type stubs without boto3-stubs).
```diff
     try:
         client.download_file(S3_BUCKET, s3_key, str(local_path))
     except ClientError as e:
         raise FileNotFoundError(
-            f"Test file '{filename}' not found at {file_path}. "
-            f"Make sure the file is committed to Git LFS in the tests/data directory."
+            f"Data file '{archive_name}' not found in s3://{S3_BUCKET}/{S3_PREFIX}. "
+            f"Make sure it has been uploaded with bin/s3_push. ({e})"
         )
```
All ClientErrors mis-reported as "not found"
except ClientError catches every S3 error — 403 AccessDenied, 503 SlowDown, 500 InternalError, etc. — and re-raises each one as FileNotFoundError with the message "not found in s3://…". A developer who hits a permission error or throttle will waste time hunting for a supposedly missing S3 object when the archive is present but inaccessible or rate-limited. Only HTTP 404 / NoSuchKey genuinely means the object is absent; other codes should surface their real cause.
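A sketch of the narrower handler this suggests, reusing the names from the hunk above (error-code strings come from botocore's ClientError.response; only 404 / NoSuchKey is translated):

```python
try:
    client.download_file(S3_BUCKET, s3_key, str(local_path))
except ClientError as e:
    code = e.response.get("Error", {}).get("Code", "")
    if code in ("404", "NoSuchKey"):
        raise FileNotFoundError(
            f"Data file '{archive_name}' not found in "
            f"s3://{S3_BUCKET}/{S3_PREFIX}. "
            f"Make sure it has been uploaded with bin/s3_push."
        ) from e
    raise  # AccessDenied, SlowDown, etc. surface with their real cause
```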
- try S3-backed LFS self-hosting first, to not re-invent LFS (branching, third-party devs etc)
- benchmark self-hosted LFS vs wget to S3 to compare if there are speed diffs (LFS seems slow sometimes, idk if due to GitHub), to decide if we need raw S3 mirroring or just our own LFS servers

branching, third-party devs: this needs solving for the above to be viable. How do I test a third-party branch that added a new big file? How do I test a branch that overrides an existing file? When do these files get deleted?

We can solve all this: keep using LFS as a branching/identity mechanism, defer to S3 as a cache. Store files as hashes in S3 and not filenames:

```
$ cat data/.lfs/xarm6.tar.gz
version https://git-lfs.github.com/spec/v1
oid sha256:71c9990ab779d20b878ec4c6b0ee21b29bbb7963b59375acf9dd635b7241009b
size 1863215
```

(actual .lfs stub files store a hash - use that hash for your s3 filename)

This allows us to, whenever we want, upload whatever we want. If we don't upload, everything still works. Uploading just current dev or main to S3, 99% of traffic goes from S3.
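A hypothetical helper sketching that idea: parse the oid out of the pointer stub and use it as the S3 object key (the function name and shape are illustrative, not from this PR):

```python
from pathlib import Path


def lfs_pointer_oid(pointer_file: Path) -> str:
    """Return the sha256 oid stored in a git-LFS pointer stub."""
    for line in pointer_file.read_text().splitlines():
        if line.startswith("oid sha256:"):
            return line.removeprefix("oid sha256:").strip()
    raise ValueError(f"{pointer_file} is not a git-LFS pointer file")


# e.g. the stub shown above would yield the S3 key
# "71c9990ab779d20b878ec4c6b0ee21b29bbb7963b59375acf9dd635b7241009b"
```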
Summary
Replace git LFS with S3 for fetching test data archives, addressing recurring LFS budget exhaustion and slow downloads. Refs #1998.
- dimos/utils/data.py: lazy S3 download via boto3 (with unsigned-access fallback for public read), replacing git lfs pull smudging
- bin/s3_push: new compress + upload script (replaces bin/lfs_push)
- bin/hooks/lfs_check: hook now directs devs to bin/s3_push
- .gitattributes / .gitignore: drop LFS filters; stop tracking data/.lfs/ in git
- Bucket location overridable via DIMOS_DATA_S3_BUCKET, DIMOS_DATA_S3_PREFIX, DIMOS_DATA_S3_REGION

The LfsPath API and get_data() function are unchanged; all callers continue working without modification.
Notes

- Data is hosted at s3://dimos-github-lfs/.lfs/ (us-east-2)
- Some archives (go2_sf_office, models_graspgen, piper_description, xarm_description) remained as LFS pointer stubs locally and need to be sourced from a teammate's cache or a temp LFS budget bump before tests that depend on them can pass

Test plan