feat: add progress_data to worker_metadata #202

Open
gasperzgonec wants to merge 3 commits into main from gasperz/ISS-252036

@gasperzgonec
Contributor

This PR adds extraction progress timestamps so the platform can detect looping incremental syncs and stop runs that keep re-processing the same time range.

  • NormalizedAttachment.created_date (optional): lets attachment extraction contribute real source timestamps, not only item extraction.
  • Repo.itemTimestamps: while uploading batches, tracks the oldest and newest created_date (Unix ms) seen per repo/item type.
  • worker_metadata.progress_data: on data and attachment extraction progress/done events, sends per-item-type { min, max } timestamp bounds to the callback.

Data extraction already worked via required NormalizedItem.created_date. Attachments need connectors to populate the new optional field (see Migration note).

Connected Issues

Checklist

  • Tests added/updated and ran with npm run test OR no tests needed.
  • Ran backwards compatibility tests with npm run test:backwards-compatibility.
  • Code formatted and checked with npm run lint.
  • Tested airdrop-template linked to this PR.
  • Documentation updated and provided a link to PR / new docs OR no docs needed.

Migration note

If your connector normalizes attachments (custom normalize on an attachment repo, or a custom attachment processor that builds NormalizedAttachment objects), set created_date on each normalized attachment using the source system’s creation time (RFC3339 string, same as NormalizedItem.created_date).

// In your attachment normalize function (or equivalent)
const normalizedAttachment: NormalizedAttachment = {
  id: attachment.id,
  url: attachment.downloadUrl,
  file_name: attachment.name,
  parent_id: attachment.parentId,
  created_date: attachment.createdAt, // RFC3339, e.g. "2024-03-15T10:30:00Z"
  // ...other fields
};

  • Required for loop detection on attachments: without created_date, that repo’s progress_data entry stays { min: 0, max: 0 } and attachment incremental sync cannot be monitored by timestamp.
  • No change for normal item repos: NormalizedItem already requires created_date.
  • Backwards compatible: created_date is optional on NormalizedAttachment; existing connectors keep working, but attachment timestamp tracking is only accurate once they populate it.
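
If the source system exposes the creation time as something other than an RFC3339 string (for example, epoch seconds), it needs converting before it is assigned to created_date. A minimal sketch, assuming a hypothetical helper named toRfc3339 that is not part of the SDK, and assuming numeric inputs are Unix epoch seconds:

```typescript
// Hypothetical helper, not part of the SDK: converts a source timestamp into
// an RFC3339 string suitable for NormalizedAttachment.created_date.
// Assumption: numeric inputs are Unix epoch seconds, strings are date strings.
function toRfc3339(source: string | number): string | undefined {
  const date =
    typeof source === "number" ? new Date(source * 1000) : new Date(source);
  // Unparseable input yields an Invalid Date, whose getTime() is NaN.
  if (Number.isNaN(date.getTime())) {
    return undefined; // leave created_date unset rather than send a bad value
  }
  return date.toISOString();
}
```

Returning undefined for unparseable input keeps the field optional, so such an attachment simply does not contribute to the timestamp bounds.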

What progress_data does

What it is

progress_data is sent on the callback HTTP payload under worker_metadata, alongside adaas_library_version. It is not part of event_data (artifacts, errors, etc.).

It is attached only for these extraction event types:

  • DATA_EXTRACTION_PROGRESS / DATA_EXTRACTION_DONE
  • ATTACHMENT_EXTRACTION_PROGRESS / ATTACHMENT_EXTRACTION_DONE
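
A receiver could gate on the event type before reading progress_data. The sketch below uses the event names exactly as written in this description; the SDK's actual enum members or string values may differ, and carriesProgressData is a hypothetical helper:

```typescript
// Event type names as written in this PR description (assumption: the real
// SDK enum members or wire values may be spelled differently).
const EVENTS_WITH_PROGRESS_DATA: ReadonlySet<string> = new Set([
  "DATA_EXTRACTION_PROGRESS",
  "DATA_EXTRACTION_DONE",
  "ATTACHMENT_EXTRACTION_PROGRESS",
  "ATTACHMENT_EXTRACTION_DONE",
]);

// Hypothetical helper: true when a callback for this event type is expected
// to carry worker_metadata.progress_data.
function carriesProgressData(eventType: string): boolean {
  return EVENTS_WITH_PROGRESS_DATA.has(eventType);
}
```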

What it returns

Shape:

worker_metadata: {
  progress_data: Record<string, { min: number; max: number }>;
  adaas_library_version: string;
}

  • Keys: repo itemType strings (e.g. "issues", "comments", attachment metadata types).
  • Values: { min, max }, Unix timestamps in milliseconds for the oldest and newest created_date seen in that repo during the current worker run (so far).
  • If nothing with created_date was uploaded for a type, that entry is typically { min: 0, max: 0 }.

Example:

{
  "worker_metadata": {
    "progress_data": {
      "issues": { "min": 1704067200000, "max": 1714521600000 },
      "comments": { "min": 1710000000000, "max": 1714521600000 }
    },
    "adaas_library_version": "..."
  }
}
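
For debugging, the millisecond bounds can be rendered as ISO 8601 strings. This is only a sketch of how a consumer might read the payload above; describeProgress and the type names are hypothetical, and { min: 0, max: 0 } is treated as "no timestamps seen" per the note above:

```typescript
type TimestampBounds = { min: number; max: number };
type ProgressData = Record<string, TimestampBounds>;

// Hypothetical helper: summarizes each item type's bounds as ISO 8601 strings
// and flags { min: 0, max: 0 } entries as carrying no timestamps.
function describeProgress(progressData: ProgressData): Record<string, string> {
  const summary: Record<string, string> = {};
  for (const itemType of Object.keys(progressData)) {
    const { min, max } = progressData[itemType];
    summary[itemType] =
      min === 0 && max === 0
        ? "no timestamps seen"
        : `${new Date(min).toISOString()} .. ${new Date(max).toISOString()}`;
  }
  return summary;
}
```

With the example payload, the "issues" entry would read as 2024-01-01T00:00:00.000Z .. 2024-05-01T00:00:00.000Z.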

What it is based on

  1. On each Repo.upload(), the SDK scans the batch for objects with a non-null created_date.
  2. It parses them with new Date(created_date).getTime() and updates that repo’s running min / max.
  3. On the progress/done events above, WorkerAdapter copies each repo’s itemTimestamps into progress_data.

For items, created_date is already required on NormalizedItem. For attachments, it only counts after connectors set the new optional NormalizedAttachment.created_date.
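
The steps above can be sketched roughly as follows. This is not the code in src/repo/repo.ts; updateItemTimestamps is a hypothetical standalone function, and it assumes that { min: 0, max: 0 } means "nothing recorded yet", so the first valid timestamp seeds both bounds instead of being compared against 0:

```typescript
type ItemTimestamps = { min: number; max: number };

// Sketch of the per-batch scan: O(n) time over the batch, O(1) extra memory.
// Assumption: { min: 0, max: 0 } is the "empty" sentinel.
function updateItemTimestamps(
  current: ItemTimestamps,
  batch: Array<{ created_date?: string | null } | null | undefined>
): ItemTimestamps {
  let { min, max } = current;
  let seeded = !(min === 0 && max === 0);
  for (const item of batch) {
    if (!item?.created_date) continue; // skip missing or empty values
    const createdAtMs = new Date(item.created_date).getTime();
    if (Number.isNaN(createdAtMs)) continue; // skip unparseable dates
    if (!seeded) {
      min = createdAtMs;
      max = createdAtMs;
      seeded = true;
    } else {
      min = Math.min(min, createdAtMs);
      max = Math.max(max, createdAtMs);
    }
  }
  return { min, max };
}
```

One consequence of using 0 as the sentinel is that an item genuinely created at the Unix epoch cannot be told apart from an empty entry; whether that matters depends on the source system.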

Why it exists

The platform can compare min/max across incremental sync runs and progress events. If extraction keeps revisiting the same timestamp window (or the bounds stop advancing), that indicates a loop, so the incremental sync can be stopped instead of re-uploading the same data indefinitely.
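
As an illustration only, a comparison of this kind might look like the sketch below. It is not the platform's actual logic, which this PR does not describe; findStalledItemTypes and the type names are hypothetical:

```typescript
type Bounds = { min: number; max: number };
type Progress = Record<string, Bounds>;

// Hypothetical check: returns the item types whose bounds carry timestamps
// and are identical between the previous and the current run.
function findStalledItemTypes(previous: Progress, current: Progress): string[] {
  const stalled: string[] = [];
  for (const itemType of Object.keys(current)) {
    const before = previous[itemType];
    const now = current[itemType];
    if (!before || !now) continue; // first time this type is reported
    if (now.min === 0 && now.max === 0) continue; // no timestamps to compare
    if (before.min === now.min && before.max === now.max) {
      stalled.push(itemType);
    }
  }
  return stalled;
}
```

A single repeat is weak evidence on its own, so a real check would presumably require the same window to recur over several runs before stopping a sync.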

Started returning `progress_data` in `worker_metadata`.
@gasperzgonec requested review from a team and radovanjorgic as code owners on May 19, 2026, 11:03
Collaborator

@radovanjorgic left a comment

No tests at all? :/

Comment thread src/repo/repo.ts
const itemsToUpload = batch || this.items;

if (itemsToUpload.length > 0) {
for (const item of itemsToUpload) {
Collaborator

Huh, I have a few questions:

  1. Do we really need to do this for each item? How does that look from a performance perspective? Let's say you have 100+ repos at once, each with 5000 items in it?
  2. What if a timeout comes at this point?
  3. Can we offload this work to the backend? After storing the files, extractor-adapter/snap-in manager scans them and picks progress from there.

Contributor Author

For #3, we'll need @GasperSenk's input, but I don't think so.

Contributor Author

For #1 and #2, I think this is fine.
This logic takes O(n) time and O(1) memory.
There are just integer comparisons, and it should be fast enough not to block any isTimeout checks.

No, the adapter doesn't know the normalization function, so it can't find the correct field name for the dates.

Comment thread src/repo/repo.ts Outdated
if (itemsToUpload.length > 0) {
for (const item of itemsToUpload) {
if (
item != null &&
Collaborator

Why not simply if (item?.created_date)?

Contributor Author

This didn't work before I added created_date to the NormalizedAttachment.
Good catch!

Comment thread src/repo/repo.ts Outdated
'created_date' in item &&
item.created_date != null
) {
const created_date = new Date(item['created_date']).getTime();
Collaborator

Please don't use snake case for variable names.

Comment thread src/repo/repo.ts
Comment on lines +28 to +29
min: 0,
max: 0,
Collaborator

What do min and max mean here? Maybe we should use oldest and newest instead?

Contributor Author

These are min and max timestamps. Since they are numbers, we decided to go with min and max.
This was discussed in the ISS comments.

It might make sense to use newest and oldest, since we do that in the backend, but I see why you would use min and max when you are dealing with longs.
