feat: add progress_data to worker_metadata #202
Started returning `progress_data` in `worker_metadata`.
radovanjorgic left a comment
No tests at all? :/
    const itemsToUpload = batch || this.items;

    if (itemsToUpload.length > 0) {
      for (const item of itemsToUpload) {
Huh, I have a few questions:
- Do we really need to do this for each item? How does that perform? Say you have 100+ repos at once, each with 5000 items in it.
- What if a timeout hits at this point?
- Can we offload this work to the backend? After storing the files, the extractor-adapter/snap-in manager scans them and picks up the progress from there.
For #3, we'll need @GasperSenk's input, but I don't think so.
No, the adapter doesn't know the normalization function, so it can't find the correct field name for the dates.
    if (itemsToUpload.length > 0) {
      for (const item of itemsToUpload) {
        if (
          item != null &&
Why not simply `if (item?.created_date)`?
This didn't work before I added `created_date` to `NormalizedAttachment`.
Good catch!
          'created_date' in item &&
          item.created_date != null
        ) {
          const created_date = new Date(item['created_date']).getTime();
Don't use snake case for variable names please.
    min: 0,
    max: 0,
What do min and max mean here? Maybe we should use oldest and newest instead?
These are the minimum and maximum timestamps. Because they are numbers, we decided to go with `min` and `max`.
This was discussed in the ISS comments.
Might make sense to use newest and oldest, we do that in the backend, but I see why you would use min and max when you are dealing with longs.
This PR adds extraction progress timestamps so the platform can detect looping incremental syncs and stop runs that keep re-processing the same time range.

- `NormalizedAttachment.created_date` (optional): lets attachment extraction contribute real source timestamps, not only item extraction.
- `Repo.itemTimestamps`: while uploading batches, tracks the oldest and newest `created_date` (Unix ms) seen per repo/item type.
- `worker_metadata.progress_data`: on data and attachment extraction progress/done events, sends per-item-type `{ min, max }` timestamp bounds to the callback.

Data extraction already worked via the required `NormalizedItem.created_date`. Attachments need connectors to populate the new optional field (see Migration note).

## Connected Issues

## Checklist

- `npm run test` OR no tests needed.
- `npm run test:backwards-compatibility`.
- `npm run lint`.

## Migration note

If your connector normalizes attachments (a custom `normalize` on an attachment repo, or a custom attachment processor that builds `NormalizedAttachment` objects), set `created_date` on each normalized attachment using the source system's creation time (RFC3339 string, same as `NormalizedItem.created_date`).

- Without `created_date`, that repo's `progress_data` entry stays `{ min: 0, max: 0 }` and attachment incremental sync cannot be monitored by timestamp.
- `NormalizedItem` already requires `created_date`.
- `created_date` is optional on `NormalizedAttachment`; existing connectors keep working, but attachment timestamp tracking is only accurate once they populate it.

## What `progress_data` does

### What it is

`progress_data` is sent on the callback HTTP payload under `worker_metadata`, alongside `adaas_library_version`. It is not part of `event_data` (artifacts, errors, etc.).

It is attached only for these extraction event types:

- `DATA_EXTRACTION_PROGRESS` / `DATA_EXTRACTION_DONE`
- `ATTACHMENT_EXTRACTION_PROGRESS` / `ATTACHMENT_EXTRACTION_DONE`

### What it returns

Shape:

- Keys: `itemType` strings (e.g. `"issues"`, `"comments"`, attachment metadata types).
- Values: `{ min, max }`, Unix timestamps in milliseconds for the oldest and newest `created_date` seen in that repo during the current worker run (so far).
- If no item with a `created_date` was uploaded for a type, that entry is typically `{ min: 0, max: 0 }`.

Example:

    {
      "worker_metadata": {
        "progress_data": {
          "issues": { "min": 1704067200000, "max": 1714521600000 },
          "comments": { "min": 1710000000000, "max": 1714521600000 }
        },
        "adaas_library_version": "..."
      }
    }

### What it is based on

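The bookkeeping this section describes can be sketched roughly as follows. This is an illustration only, not the SDK's code: `TimestampBounds` and `updateBounds` are hypothetical names, and treating `0` as "nothing seen yet" is an assumption inferred from the `{ min: 0, max: 0 }` default.

```typescript
// Hypothetical names throughout; this is not the SDK's actual implementation.
interface TimestampBounds {
  min: number; // oldest created_date seen so far, Unix ms (0 = nothing seen yet)
  max: number; // newest created_date seen so far, Unix ms (0 = nothing seen yet)
}

// Fold one uploaded batch into the running bounds and return the updated bounds.
function updateBounds(
  bounds: TimestampBounds,
  batch: ReadonlyArray<{ created_date?: string | null } | null | undefined>,
): TimestampBounds {
  let { min, max } = bounds;
  for (const item of batch) {
    const raw = item?.created_date;
    if (!raw) continue; // skip null items and items without a timestamp
    const createdMs = new Date(raw).getTime();
    if (Number.isNaN(createdMs)) continue; // skip unparseable dates
    // Assumption: 0 means "unset", so the first real timestamp replaces it.
    min = min === 0 ? createdMs : Math.min(min, createdMs);
    max = Math.max(max, createdMs);
  }
  return { min, max };
}

const issueBounds = updateBounds({ min: 0, max: 0 }, [
  { created_date: "2024-05-01T00:00:00Z" },
  null,
  {}, // no created_date, ignored
  { created_date: "2024-01-01T00:00:00Z" },
]);
// issueBounds is { min: 1704067200000, max: 1714521600000 }
```

Returning a new object instead of mutating the input keeps the helper easy to test; a per-repo tracker could just as well update its own fields in place.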
- In `Repo.upload()`, the SDK scans the batch for objects with a non-null `created_date`.
- Each value is converted with `new Date(created_date).getTime()` and used to update that repo's running `min`/`max`.
- `WorkerAdapter` copies each repo's `itemTimestamps` into `progress_data`.

For items, `created_date` is already required on `NormalizedItem`. For attachments, it only counts after connectors set the new optional `NormalizedAttachment.created_date`.

### Why it exists

The platform can compare `min`/`max` across incremental sync runs and progress events. If extraction keeps revisiting the same timestamp window (or the bounds stop advancing), that indicates a loop, so the incremental sync can be stopped instead of re-uploading the same data indefinitely.
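
As a rough illustration of that comparison, the sketch below flags item types whose bounds are identical in two consecutive snapshots. The platform's real detection logic is not shown in this PR, so the function name `stalledItemTypes`, the "identical bounds" criterion, and the choice to ignore `{ min: 0, max: 0 }` entries are all assumptions made for the example.

```typescript
// Hypothetical consumer-side check; not the platform's actual logic.
interface TimestampBounds {
  min: number; // Unix ms
  max: number; // Unix ms
}
type ProgressData = Record<string, TimestampBounds>;

// Return the item types whose { min, max } did not change between two snapshots.
function stalledItemTypes(previous: ProgressData, current: ProgressData): string[] {
  return Object.entries(current)
    .filter(([itemType, curr]) => {
      const prev: TimestampBounds | undefined = previous[itemType];
      if (!prev) return false; // first time this item type is reported
      if (curr.min === 0 && curr.max === 0) return false; // nothing tracked yet
      return prev.min === curr.min && prev.max === curr.max;
    })
    .map(([itemType]) => itemType);
}

const previousRun: ProgressData = {
  issues: { min: 1704067200000, max: 1714521600000 },
  comments: { min: 1710000000000, max: 1714521600000 },
};
const currentRun: ProgressData = {
  issues: { min: 1704067200000, max: 1714521600000 }, // same window again
  comments: { min: 1710000000000, max: 1717200000000 }, // max advanced
  attachments: { min: 0, max: 0 }, // no created_date populated
};
const stalled = stalledItemTypes(previousRun, currentRun);
// stalled is ["issues"]
```

A real policy would likely need more than one repeated window before stopping a sync, since a run with no new source data can legitimately report unchanged bounds.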