Sarcastic Fringehead: Upload Path Pt2 (GSI-1960)#159

Open
TheByronHimes wants to merge 53 commits into main from feature/upload_path_pt2_GSI-1960

Conversation

@TheByronHimes
Member

@TheByronHimes TheByronHimes commented Nov 19, 2025

This epic details the backend work for the remaining portion of the file upload path.

The FileUploadReport has been reworked from the original concept: instead of being generated by the DHFS and maintained by the FIS, it is generated with partial data by the FIS, and the remaining data is supplied by the DHFS after interrogation.

- GET /uploads/{storage_alias}/inbox
  - Returns a 200 status code and a list of `FileUploads`
- GET /uploads/{storage_alias}/archived
  - Returns a 200 status code and a list of file_ids for deletion
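As a sketch of the intended semantics of these two routes, here is an illustrative in-memory version (the `FileUpload` shape and the `UPLOADS` store are hypothetical stand-ins, not the actual FIS models or database):

```python
from dataclasses import dataclass


@dataclass
class FileUpload:
    # Hypothetical minimal record; the real model carries more fields.
    file_id: str
    storage_alias: str
    archived: bool = False


# Hypothetical in-memory store standing in for the FIS database.
UPLOADS = [
    FileUpload("f1", "hd01"),
    FileUpload("f2", "hd01", archived=True),
    FileUpload("f3", "hd02"),
]


def list_inbox(storage_alias: str) -> list[FileUpload]:
    """GET /uploads/{storage_alias}/inbox: uploads still awaiting interrogation."""
    return [u for u in UPLOADS if u.storage_alias == storage_alias and not u.archived]


def list_archived(storage_alias: str) -> list[str]:
    """GET /uploads/{storage_alias}/archived: file_ids already moved to
    permanent storage, i.e. safe for the DHFS to delete from its bucket."""
    return [u.file_id for u in UPLOADS if u.storage_alias == storage_alias and u.archived]
```

A real implementation would back these with the FIS database and expose them through the HTTP layer; the sketch only pins down the intended split: `inbox` lists pending uploads, `archived` lists IDs that are safe to clean up.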
Member


Confused about the deletion part here.
From the route I would expect this to provide a list of files that have been successfully moved to permanent storage.

Member Author


Bad wording on my part. That's exactly what's in the list. The DHFS deletes those files from its interrogation bucket.

Member


Still find the phrasing a bit weird, even though I get that that's the primary purpose of what's done with the information. At the risk of sounding redundant, this could be the list of successfully processed file_ids, or something in that vein, to be explicit about which file IDs we are talking about.

#### FileUploadReport
```python
file_id: UUID4 # Unique identifier for the file upload
secret_id: str | None # The Vault ID of the file secret used for re-encryption
```
Member


This will only be populated after passed_inspection is set to True?

Member Author


The new secret key would be deposited before re-encryption starts, so from a purely technical standpoint the secret_id could be included for both successful and unsuccessful interrogations. The FileUploadReport is only submitted once a conclusion is reached either way.

Member


Thanks, I just wondered when exactly that would be populated.
We don't need to store a key for an unsuccessful interrogation attempt; that would mean more than one wasted HTTP call, considering the Vault interaction that results from it.

Member Author


On the other hand, there's a potential problem if we store the secret after re-encryption. If the file re-encryption is successful but there's a network hiccup or some other error that prevents the secret from being stored, then that re-encrypted data is bricked and re-encryption has to be repeated.

As I step through this, I don't think I accounted for error recovery if something prevents the DHFS from sending the interrogation results to the FIS. Maybe it can write something to disk before it crashes or moves on. I'd like to keep it lightweight and not use a database.

@TheByronHimes TheByronHimes marked this pull request as draft November 21, 2025 11:45
@TheByronHimes TheByronHimes marked this pull request as ready for review November 24, 2025 22:00
Member

@mephenor mephenor left a comment


A much easier-to-read and well-thought-out version of the spec.
There's one minor and one major issue remaining:

- Box ID is mentioned in the models, but not where/when it's populated. It should be added for consistency with the other information.
- The question of where the actual data finally resides needs some more current input from the other reviewers. I think I'm operating on outdated information here, but in my mental model the data does not have to leave the data hub, and only the secrets are stored at central in every case.

The `dcsPersistedEvents` collection needs the following changes to the `payload` field:
- Where `type_` == `drs_object_served`:
  - Replace the value for `file_id` with the value from `target_object_id`
  - **QUESTION**: *Should we update the actual schema so `s3_endpoint_alias` becomes `storage_alias`?*
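A minimal sketch of that payload change over plain dicts (the document shape is an assumption, and the `s3_endpoint_alias` → `storage_alias` rename is left out since it is still an open question; the real migration would iterate the MongoDB collection):

```python
def migrate_doc(doc: dict) -> dict:
    """Apply the dcsPersistedEvents payload change for drs_object_served events.

    Plain dicts stand in for MongoDB documents here; the actual migration
    would update the collection in place.
    """
    if doc.get("type_") != "drs_object_served":
        # Other event types are left untouched.
        return doc
    migrated = dict(doc)
    payload = dict(migrated["payload"])
    # Replace the value for file_id with the value from target_object_id.
    payload["file_id"] = payload["target_object_id"]
    migrated["payload"] = payload
    return migrated
```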
Member


While we're migrating other fields, let's make this consistent.

Member Author


That was my feeling too

Member Author


Per offline discussion, let's hold off on that change until we get some clarity about the final move.

Cito
Cito previously approved these changes Nov 25, 2025
Member

@Cito Cito left a comment


Thanks. Made some comments below, but essentially looks good to me.

AC007 should be adapted afterwards to reflect the current state.

@TheByronHimes TheByronHimes marked this pull request as draft December 12, 2025 08:45
@TheByronHimes TheByronHimes marked this pull request as ready for review December 15, 2025 10:17
- **One alternative** is to manually exfiltrate the data to UCS.
- **A second alternative** is to simply drop the data since the files have already been archived and all relevant information is actually stored by IFRS.

There is one **problem**: IFRS has different object IDs than FIS. If we keep the event data which FIS currently has, we will need to update the information so that the file IDs in FIS are set to the file IDs (object IDs) known to IFRS, using the accession to match. That action would not be reversible, of course.
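The accession-based remapping could look roughly like this (the record shapes are illustrative assumptions, not the actual FIS/IFRS schemas):

```python
def remap_file_ids(
    fis_records: list[dict], ifrs_id_by_accession: dict[str, str]
) -> list[dict]:
    """Replace each FIS file_id with the IFRS object ID sharing its accession.

    Records whose accession is unknown to IFRS are left untouched, so the
    caller can inspect them rather than silently dropping data.
    """
    remapped = []
    for record in fis_records:
        new_record = dict(record)
        ifrs_id = ifrs_id_by_accession.get(record["accession"])
        if ifrs_id is not None:
            new_record["file_id"] = ifrs_id
        remapped.append(new_record)
    return remapped
```

As noted, this rewrite discards the original FIS file IDs, which is what makes the action irreversible unless the old IDs are stored alongside.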
Member


The failure mode here would be IFRS losing information in its internal DB about files that are already archived.

Theoretically we should preserve the information to keep guarantees about what we are able to restore by replaying events. Personally, though, I'd prefer a clean cutoff: there's a separation between the current and the new FIS which, in my mind, makes them two distinct services, and the new one shouldn't be permanently burdened with one-off logic in its code.
Would it make sense to possibly keep this legacy migration separate as a one-off script?

On another note, why does this have to be reversible?

Member Author


> On another note, why does this have to be reversible?

It does not have to be reversible - I wrote that because when I initially looked at the data it seemed that the potential migration would be reversible in nature. And if it's possible to write a reverse migration without unreasonable effort, it's good practice to do so.

> Would it make sense to possibly keep this legacy migration separate as a one-off script?

Sure, that is of course a possible avenue.

> which, in my mind, makes them two distinct services

I had the same thought a few times. "File Ingest Service" doesn't really apply. It's more of an "Assistant to the DHFS". Just a DH-Central liaison or outsourced persistence mechanism. We also have the power to delete the FIS entirely and write a new service with a different identity but the same purpose described in the spec.

Member

@Cito Cito left a comment


Made some suggestions, but I have no major objections.

```python
size: int
storage_alias: str
```

Member


Should have a validator that checks/sets locked = True if archived = True.
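One way to sketch that invariant with a plain dataclass (a pydantic model validator would enforce the same thing on the actual model; the field set here is illustrative):

```python
from dataclasses import dataclass


@dataclass
class FileUpload:
    # Hypothetical subset of the model, just enough to show the invariant.
    file_id: str
    locked: bool = False
    archived: bool = False

    def __post_init__(self) -> None:
        # An archived upload is by definition no longer mutable, so force
        # locked = True whenever archived = True rather than rejecting input.
        if self.archived:
            self.locked = True
```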

Cito
Cito previously approved these changes Dec 16, 2025
Member

@Cito Cito left a comment


See comment regarding passing file IDs as query parameters.

@TheByronHimes TheByronHimes requested a review from Cito January 7, 2026 12:51
Cito
Cito previously approved these changes Jan 7, 2026
mephenor
mephenor previously approved these changes Jan 9, 2026
@TheByronHimes TheByronHimes dismissed stale reviews from mephenor and Cito via 424ee8c January 21, 2026 17:00
@TheByronHimes TheByronHimes force-pushed the feature/upload_path_pt2_GSI-1960 branch from fd6f1e8 to c2f5ade Compare January 23, 2026 12:39
3 participants