Sarcastic Fringehead: Upload Path Pt2 (GSI-1960)#159
TheByronHimes wants to merge 53 commits into `main`.
Conversation
> - GET /uploads/{storage_alias}/inbox
>   - Returns a 200 status code and a list of `FileUploads`
> - GET /uploads/{storage_alias}/archived
>   - Returns a 200 status code and a list of file_ids for deletion
Confused about the deletion part here.
From the route I would expect this to provide a list of files that have been successfully moved to permanent storage.
Bad wording on my part. That's exactly what's in the list. The DHFS deletes those files from its interrogation bucket.
Still find the phrasing a bit weird, even though I get that that's the primary purpose of what's done with the information. At the risk of sounding redundant, this could be the list of successfully processed file_ids, or something in that vein, to be explicit about which file IDs we are talking about.
> #### FileUploadReport
>
> ```python
> file_id: UUID4  # Unique identifier for the file upload
> secret_id: str | None  # The Vault ID of the file secret used for re-encryption
> ```
This will only be populated after passed_inspection is set to True?
The new secret key would be deposited before re-encryption starts, so from a purely technical standpoint the secret_id could be included for both successful and unsuccessful interrogations. The FileUploadReport is only submitted once a conclusion is reached either way.
Thanks, I just wondered when exactly that would be populated.
We don't need to store a key for an unsuccessful interrogation attempt; doing so would waste more than one HTTP call once you account for the resulting Vault interaction.
On the other hand, there's a potential problem if we store the secret after re-encryption. If the file re-encryption is successful but there's a network hiccup or some other error that prevents the secret from being stored, then that re-encrypted data is bricked and re-encryption has to be repeated.
As I step through this, I don't think I accounted for error recovery if something prevents the DHFS from sending the interrogation results to the FIS. Maybe it can write something to disk before it crashes or moves on. I'd like to keep it lightweight and not use a database.
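A minimal sketch of that idea, assuming a JSON-lines spool file on local disk; the file name and function names are hypothetical, not taken from the spec:

```python
import json
from pathlib import Path

# Local spool file for undelivered interrogation results (name is an assumption).
SPOOL = Path("pending_reports.jsonl")


def record_result(report: dict) -> None:
    """Append the interrogation result to disk before attempting delivery."""
    with SPOOL.open("a") as f:
        f.write(json.dumps(report) + "\n")
        f.flush()


def replay_pending(send) -> None:
    """On startup, retry delivery of any results that never reached the FIS."""
    if not SPOOL.exists():
        return
    remaining = []
    for line in SPOOL.read_text().splitlines():
        report = json.loads(line)
        try:
            send(report)
        except Exception:
            # Keep undelivered results for the next attempt.
            remaining.append(line)
    SPOOL.write_text("\n".join(remaining) + ("\n" if remaining else ""))
```

This keeps the DHFS lightweight (no database) while still surviving a crash or network hiccup between re-encryption and report delivery.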
mephenor left a comment:
A much easier to read and well-thought-out version of the spec.
There's one minor and one major issue remaining:
- Box ID is mentioned in the models, but not where/when it's populated. This should be added for consistency with the other information.
- The question of where the actual data finally resides needs some more current input from the other reviewers. I think I'm operating on outdated information here, but in my mental model the data does not have to leave the data hub, and only the secrets are stored at central in every case.
> The `dcsPersistedEvents` collection needs the following changes to the `payload` field:
> - Where `type_` == `drs_object_served`:
>   - Replace the value for `file_id` with the value from `target_object_id`
> - **QUESTION**: *Should we update the actual schema so `s3_endpoint_alias` becomes `storage_alias`?*
While we're migrating other fields, let's make this consistent.
That was my feeling too
Per offline discussion, let's hold off on that change until we get some clarity about the final move.
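The payload change quoted above could be sketched as a pure function over a single persisted event. The exact document layout is an assumption based on the quoted excerpt, and the `s3_endpoint_alias` rename is deliberately left out, since it is on hold:

```python
def migrate_event(event: dict) -> dict:
    """Apply the quoted payload change to one dcsPersistedEvents document.

    Where type_ == "drs_object_served", replace the value of
    payload["file_id"] with the value from payload["target_object_id"].
    All other events pass through unchanged.
    """
    if event.get("type_") != "drs_object_served":
        return event
    payload = dict(event.get("payload", {}))
    payload["file_id"] = payload["target_object_id"]
    return {**event, "payload": payload}
```

In the actual migration this would run over every document in the collection (or be expressed as an aggregation-pipeline update), but the per-document transformation is the part worth pinning down.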
Cito left a comment:
Thanks. Made some comments below, but essentially looks good to me.
AC007 should be adapted afterwards to reflect the current state.
> - **One alternative** is to manually exfiltrate the data to UCS.
> - **A second alternative** is to simply drop the data since the files have already been archived and all relevant information is actually stored by IFRS.
>
> There is one **problem**: IFRS has different object IDs than FIS. If we keep the event data which FIS currently has, we will need to update the information so that the file IDs in FIS are set to the file IDs (object IDs) known to IFRS, using the accession to match. That action would not be reversible, of course.
The failure mode here would be IFRS losing information in its internal DB about files that are already archived.
Theoretically we should preserve the information to keep guarantees about what we are able to restore by replaying events. Personally, though, I'd prefer a clean cutoff: there's a separation between the current and new FIS which, in my mind, makes them two distinct services, and the new one shouldn't be permanently burdened by one-off logic in its code.
Would it make sense to keep this legacy migration separate as a one-off script?
On another note, why does this have to be reversible?
> On another note, why does this have to be reversible?
It does not have to be reversible. I wrote that because, when I initially looked at the data, the potential migration seemed reversible in nature. And if it's possible to write a reverse migration without unreasonable effort, it's good practice to do so.
> Would it make sense to possibly keep this legacy migration separate as a one-off script?
Sure, that is of course a possible avenue.
> which, in my mind, make them two distinct services
I had the same thought a few times. "File Ingest Service" doesn't really apply anymore; it's more of an "Assistant to the DHFS": a DH-Central liaison or outsourced persistence mechanism. We also have the power to delete the FIS entirely and write a new service with a different identity but the same purpose described in the spec.
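The one-off script idea could look roughly like this. The record shapes and the accession-based lookup table are assumptions for illustration; failing loudly on a missing accession matters because, as noted above, the rewrite is not reversible:

```python
def remap_file_ids(
    fis_records: list[dict], ifrs_object_ids_by_accession: dict[str, str]
) -> list[dict]:
    """Rewrite FIS file IDs to the object IDs known to IFRS.

    Matching is done via the accession. Raises if an accession has no
    IFRS counterpart, so a partial (irreversible) rewrite never happens
    silently.
    """
    remapped = []
    for record in fis_records:
        accession = record["accession"]
        try:
            object_id = ifrs_object_ids_by_accession[accession]
        except KeyError:
            raise LookupError(f"No IFRS object ID for accession {accession}")
        remapped.append({**record, "file_id": object_id})
    return remapped
```

Keeping this as a standalone script means the new service's codebase stays free of legacy migration logic.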
Cito left a comment:
Made some suggestions, but I have no major objections.
> ```python
> size: int
> storage_alias: str
> ```
Should have a validator that checks/sets locked = True if archived = True.
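The suggested validator could be sketched in Pydantic v2 as follows. The model here is reduced to the two flags involved; the remaining fields from the spec are omitted for brevity:

```python
from pydantic import BaseModel, model_validator


class FileUpload(BaseModel):
    locked: bool = False
    archived: bool = False

    @model_validator(mode="after")
    def lock_if_archived(self) -> "FileUpload":
        # An archived upload is by definition locked, so enforce the invariant
        # rather than trusting callers to set both flags consistently.
        if self.archived:
            self.locked = True
        return self
```

With this in place, `FileUpload(archived=True)` always comes out with `locked=True`, even if the caller passed `locked=False` explicitly.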
Cito left a comment:
See comment regarding passing file IDs as query parameters.
Co-authored-by: Christoph Zwerschke <c.zwerschke@dkfz-heidelberg.de>
Force-pushed from fd6f1e8 to c2f5ade.
This epic details the backend work for the remaining portion of the file upload path.
The `FileUploadReport` has been augmented from the original concept: instead of being generated by the DHFS and maintained by the FIS, it is generated with partial data by the FIS, and the remaining data is supplied by the DHFS after interrogation.