Ilongin/12872 dataset soft delete#1770
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Deploying datachain with
|
| Latest commit: |
fe1a214
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://b09cc851.datachain-2g6.pages.dev |
| Branch Preview URL: | https://ilongin-12872-dataset-soft-d.datachain-2g6.pages.dev |
amritghimire
left a comment
There was a problem hiding this comment.
I think more general approach on how soft delete are used is, adding a new column deleted_at, so we can clear items in trash for long time if needed by filtering with time. Also, a simple filter on selection to deleted_at to null would do the work and so on,
I did add |
…ain-ai/datachain into ilongin/12872-dataset-soft-delete
If we have removed_at column, why add another status? We can preserve the status as it was before removed imo. |
It's strange to me to leave status field to COMPLETE or similar but dataset is removed. We also have created_at and CREATED status. I would rather have little bit of duplicate than risk of reading wrong information. |
| DatasetStatus.REMOVING, | ||
| DatasetStatus.REMOVED, | ||
| ): | ||
| if keep_metadata and not v.is_soft_deletable: |
There was a problem hiding this comment.
can we call it is_internal? or even is_system?
is soft deletable again is not reusable - we are just leaking removal business logic outside
There was a problem hiding this comment.
I completely removed this method as it's not really needed so no need for figuring out naming. I agree that soft delete should not be used anywhere but not sure what is substitute for that to be honest ..
dreadatour
left a comment
There was a problem hiding this comment.
I would prefer to be able to soft delete listing datasets (to keep meta) as it can be a dependency for all other datasets.
It also can be useful for Knowledge Base.
| ) | ||
| if ( | ||
| not keep_metadata | ||
| and v.status == DatasetStatus.REMOVING |
There was a problem hiding this comment.
What about other statuses? REMOVING_TOTAL, REMOVED?
There was a problem hiding this comment.
This is about checking if someone wants to fully remove / wipe dataset version but in the same time default removing with keeping metadata is present...in that way we should raise.
There was a problem hiding this comment.
so REMOVING is ongoing removal where we keep metadata. REMOVING_TOTAL is ongoing removal status which ends up with removing metadata and actual data table
Why do we need these statuses ( Will it be easier to basically mark datasets (versions) as I see this as more robust and easier logic, rather than to have state machine and all the logic around to process edge cases (failed removal, locks, etc). |
|
|
||
| if dataset.versions and len(dataset.versions) == 1: | ||
| # had only one version, fully deleting dataset | ||
| # Count in DB, not in the in-memory record: GC-shaped paths |
There was a problem hiding this comment.
this is a very bad comment:
- what is
GC-shaped paths? - it is referring to now non-existent line
len(dataset.versions) == 1- it will be impossible to understand w/o PR - can be simpler ... a lot simpler
Please review everything AI generates, clean it up
| self.update_dataset_version(dataset, version, **update_data) | ||
| with self.db.transaction(): | ||
| if version: | ||
| # Update the version row first. If a status guard was requested |
There was a problem hiding this comment.
What is updated "second"?
There was a problem hiding this comment.
It was about updating Dataset object below. I updated comment to make it more clear
We need those statuses because we have 2 different types of delete now:
If any of the process fails GC needs to continue removing but we need to save information of what kind of delete was in action. If it get's stuck in REMOVING, GC will know how to continue. I don't like depending completely on GC for removing warehouse data as it's little bit strange and it won't work on local where user needs to run GC explicitly. |
Closes datachain-ai/studio#12872.
catalog.remove_dataset_versionno longer hard-deletes a COMPLETE user dataset version by default. Instead it:status = REMOVED+removed_at = now(),dataset_dependenciesso dependents can still render lineage,Non-COMPLETE versions (
CREATED/FAILED/STALEleftovers from the GC path) and internal datasets (lst__*,session_*) continue to fully delete — no benefit to keeping their metadata.keep_metadataflagA new
keep_metadata: bool = Trueparameter oncatalog.remove_dataset_version,catalog.remove_dataset, anddc.delete_datasetcontrols the behavior:keep_metadata=True): the behavior above — keeps the REMOVED record.keep_metadata=False: fully wipes the version (rows, dependencies, version row, and the dataset row if it was the last). Reserved for cases like GDPR/PII removal or cleaning up garbage versions.Exposed via CLI as
datachain datasets rm <name> --no-keep-metadata.State machine
Three new/repurposed dataset statuses:
REMOVING = 7(repurposed) — keep-metadata removal in progress; GC resumes to REMOVEDREMOVED = 8(new) — terminal state for keep-metadata path; semver permanently reservedREMOVING_DROP_METADATA = 9(new) — wipe in progress; GC resumes to row deletionremove_dataset_versionroutes resumption purely off current status, so a GC retry (which doesn't have access to the caller's original intent) lands the row in the correct end state for both paths.Other surface changes
DatasetRecord.live_versionsreturns versions excluding REMOVED ones.latest_version/latest_major_version/latest_compatible_version/DatasetListRecord.latest_versionskip REMOVED._max_version/_max_version_value(private) consider all versions including REMOVED, so auto-bump never reclaims a reserved semver.DatasetVersion.removed_atfield added (timestamp of removal).Schema
Column-only:
removed_at(nullable timestamp). OSS handles it via the existing auto-migration in_migrate_table_schema; Studio companion PR adds the corresponding Django migration.