Ilongin/12872 dataset soft delete #1770

Open
ilongin wants to merge 61 commits into main from ilongin/12872-dataset-soft-delete

@ilongin ilongin commented May 14, 2026

Contributor

Closes datachain-ai/studio#12872.

catalog.remove_dataset_version no longer hard-deletes a COMPLETE user dataset version by default. Instead it:

  • drops the warehouse rows table,
  • sets status = REMOVED + removed_at = now(),
  • keeps the version row and all its dataset_dependencies so dependents can still render lineage,
  • permanently reserves the semver — saving the same name again auto-bumps past the removed version (no slot reuse).

Non-COMPLETE versions (CREATED/FAILED/STALE leftovers from the GC path) and internal datasets (lst__*, session_*) are still fully deleted, since there is no benefit to keeping their metadata.
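
A minimal in-memory sketch of the default removal described above, not the actual catalog code. The `Version` dataclass, the `soft_remove` helper, and the `warehouse_tables` set are hypothetical stand-ins for illustration; only the status names and the `removed_at` field come from this description.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Version:
    # Hypothetical stand-in for a dataset version row.
    name: str
    semver: str
    status: str = "COMPLETE"
    removed_at: Optional[datetime] = None
    dependencies: list = field(default_factory=list)


def soft_remove(version: Version, warehouse_tables: set) -> None:
    """Drop the rows table but keep the version row and its dependencies."""
    warehouse_tables.discard(f"{version.name}@{version.semver}")
    version.status = "REMOVED"
    version.removed_at = datetime.now(timezone.utc)
    # version.dependencies is left untouched so dependents can still render lineage.


tables = {"cats@1.0.0"}
v = Version("cats", "1.0.0", dependencies=["lst__bucket@1.0.0"])
soft_remove(v, tables)
```

After the call, the rows table is gone from `tables`, while `v` still exists with status `REMOVED`, a `removed_at` timestamp, and its original dependency list.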

keep_metadata flag

A new keep_metadata: bool = True parameter on catalog.remove_dataset_version, catalog.remove_dataset, and dc.delete_dataset controls the behavior:

  • Default (keep_metadata=True): the behavior above — keeps the REMOVED record.
  • keep_metadata=False: fully wipes the version (rows, dependencies, version row, and the dataset row if it was the last). Reserved for cases like GDPR/PII removal or cleaning up garbage versions.

Exposed via CLI as datachain datasets rm <name> --no-keep-metadata.
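
The routing between the two behaviors could be sketched roughly as below. This is an illustrative model only: `removal_mode` and `INTERNAL_PREFIXES` are hypothetical names, and the real logic lives in `catalog.remove_dataset_version`.

```python
# Prefixes of internal datasets, as named in this description.
INTERNAL_PREFIXES = ("lst__", "session_")


def removal_mode(name: str, status: str, keep_metadata: bool = True) -> str:
    """Return "soft" (keep a REMOVED record) or "hard" (wipe everything)."""
    if not keep_metadata:
        return "hard"  # caller explicitly asked for a full wipe
    if name.startswith(INTERNAL_PREFIXES):
        return "hard"  # internal datasets are always fully deleted
    if status != "COMPLETE":
        return "hard"  # CREATED/FAILED/STALE leftovers are fully deleted
    return "soft"
```

With the default flag, only a COMPLETE user dataset version takes the "soft" path; every other combination falls through to a full delete.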

State machine

Three new/repurposed dataset statuses:

  • REMOVING = 7 (repurposed) — keep-metadata removal in progress; GC resumes to REMOVED
  • REMOVED = 8 (new) — terminal state for keep-metadata path; semver permanently reserved
  • REMOVING_DROP_METADATA = 9 (new) — wipe in progress; GC resumes to row deletion

remove_dataset_version routes resumption purely off current status, so a GC retry (which doesn't have access to the caller's original intent) lands the row in the correct end state for both paths.
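
A small sketch of status-based resumption, assuming only the three values listed above (the real `DatasetStatus` has more members). The `resume_removal` function and its return strings are hypothetical and only illustrate that the end state can be chosen from the stored status alone.

```python
from enum import IntEnum


class DatasetStatus(IntEnum):
    # Only the three values from the list above; other members are omitted.
    REMOVING = 7
    REMOVED = 8
    REMOVING_DROP_METADATA = 9


def resume_removal(status: DatasetStatus) -> str:
    """Decide how an interrupted removal should finish, based on status alone."""
    if status == DatasetStatus.REMOVING:
        return "finish as REMOVED"  # keep-metadata path
    if status == DatasetStatus.REMOVING_DROP_METADATA:
        return "delete the version row"  # wipe path
    if status == DatasetStatus.REMOVED:
        return "nothing to do"  # already terminal
    raise ValueError(f"not a removal status: {status!r}")
```

Because the in-progress status encodes the caller's intent, a retry needs no extra argument to land in the right end state.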

Other surface changes

  • DatasetRecord.live_versions returns versions excluding REMOVED ones. latest_version / latest_major_version / latest_compatible_version / DatasetListRecord.latest_version skip REMOVED.
  • _max_version / _max_version_value (private) consider all versions including REMOVED, so auto-bump never reclaims a reserved semver.
  • DatasetVersion.removed_at field added (timestamp of removal).
  • Checkpoint and delta paths detect REMOVED and fall back to recreate / rebuild instead of trying to read a stub.
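
The difference between "latest" lookups and auto-bump can be illustrated with a toy model. Everything here is an assumption for illustration: versions are plain `(semver, status)` tuples, the bump is a patch bump, and `"1.0.0"` is used as the first version. The function names mirror the ones above but are not the library implementations.

```python
def parse(semver):
    """Turn "1.2.3" into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in semver.split("."))


def live_versions(versions):
    """Versions that are not REMOVED."""
    return [v for v, status in versions if status != "REMOVED"]


def latest_version(versions):
    """Highest live version, skipping REMOVED ones."""
    live = live_versions(versions)
    return max(live, key=parse) if live else None


def next_patch_version(versions):
    """Bump past the highest version ever used, including REMOVED ones."""
    if not versions:
        return "1.0.0"
    major, minor, patch = max(parse(v) for v, _ in versions)
    return f"{major}.{minor}.{patch + 1}"


versions = [("1.0.0", "COMPLETE"), ("1.0.1", "REMOVED")]
```

Here `latest_version(versions)` resolves to `"1.0.0"`, while `next_patch_version(versions)` yields `"1.0.2"`, so the removed `1.0.1` slot is never reused.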

Schema

Column-only: removed_at (nullable timestamp). OSS handles it via the existing auto-migration in _migrate_table_schema; Studio companion PR adds the corresponding Django migration.

@ilongin ilongin marked this pull request as draft May 14, 2026 08:58
@codecov

codecov Bot commented May 14, 2026


Codecov Report

❌ Patch coverage is 93.63636% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/datachain/catalog/catalog.py | 84.78% | 5 Missing and 2 partials ⚠️ |

Comment thread src/datachain/catalog/catalog.py Outdated
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 15, 2026


Deploying datachain with Cloudflare Pages

Latest commit: fe1a214
Status: ✅  Deploy successful!
Preview URL: https://b09cc851.datachain-2g6.pages.dev
Branch Preview URL: https://ilongin-12872-dataset-soft-d.datachain-2g6.pages.dev

@amritghimire amritghimire left a comment

Contributor

I think the more general approach to soft delete is to add a new column, deleted_at, so we can purge items that have been in the trash for a long time by filtering on the timestamp. Also, a simple filter on deleted_at being null in selections would do the work, and so on.

@ilongin

ilongin commented May 15, 2026

Contributor Author

> I think the more general approach to soft delete is to add a new column, deleted_at, so we can purge items that have been in the trash for a long time by filtering on the timestamp. Also, a simple filter on deleted_at being null in selections would do the work, and so on.

I did add a removed_at column.

@ilongin ilongin requested a review from amritghimire May 15, 2026 07:03
Comment thread src/datachain/catalog/catalog.py Outdated
Comment thread src/datachain/data_storage/metastore.py Outdated
Comment thread src/datachain/dataset.py Outdated
Comment thread src/datachain/delta.py Outdated
Comment thread src/datachain/delta.py Outdated
@ilongin ilongin requested a review from shcheklein May 18, 2026 13:28
@ilongin ilongin marked this pull request as ready for review May 18, 2026 13:56
@amritghimire

Contributor

> > I think the more general approach to soft delete is to add a new column, deleted_at, so we can purge items that have been in the trash for a long time by filtering on the timestamp. Also, a simple filter on deleted_at being null in selections would do the work, and so on.
>
> I did add a removed_at column.

If we have a removed_at column, why add another status? We can preserve the status as it was before removal, imo.

@ilongin

ilongin commented May 19, 2026

Contributor Author

> > > I think the more general approach to soft delete is to add a new column, deleted_at, so we can purge items that have been in the trash for a long time by filtering on the timestamp. Also, a simple filter on deleted_at being null in selections would do the work, and so on.
> >
> > I did add a removed_at column.
>
> If we have a removed_at column, why add another status? We can preserve the status as it was before removal, imo.

It seems strange to me to leave the status field as COMPLETE or similar when the dataset is removed. We also have both created_at and a CREATED status. I would rather have a little bit of duplication than risk reading wrong information.

Comment thread src/datachain/catalog/catalog.py Outdated
Comment thread src/datachain/catalog/catalog.py Outdated
Comment thread src/datachain/catalog/catalog.py Outdated
Comment thread src/datachain/catalog/catalog.py Outdated
@shcheklein shcheklein requested a review from Copilot May 20, 2026 00:46
Comment thread src/datachain/catalog/catalog.py Outdated
Comment thread src/datachain/data_storage/metastore.py Outdated
Comment thread src/datachain/catalog/catalog.py Outdated
DatasetStatus.REMOVING,
DatasetStatus.REMOVED,
):
if keep_metadata and not v.is_soft_deletable:

Contributor

Can we call it is_internal, or even is_system?

is_soft_deletable, again, is not reusable; we are just leaking removal business logic outside.

Contributor

was it addressed?

Contributor Author

I completely removed this method as it's not really needed, so there is no need to figure out naming. I agree that soft delete should not be used anywhere, but I'm not sure what the substitute for that is, to be honest.

@ilongin ilongin requested a review from shcheklein June 10, 2026 21:38

@dreadatour dreadatour left a comment

Contributor

I would prefer to be able to soft-delete listing datasets (to keep their metadata), as a listing can be a dependency for all other datasets.

It can also be useful for the Knowledge Base.

Comment thread src/datachain/catalog/catalog.py Outdated
)
if (
not keep_metadata
and v.status == DatasetStatus.REMOVING

Contributor

What about other statuses? REMOVING_TOTAL, REMOVED?

Contributor Author

This checks the case where someone wants to fully remove (wipe) a dataset version while a default removal that keeps metadata is already in progress. In that case we should raise.

Contributor Author

So REMOVING is an ongoing removal where we keep metadata. REMOVING_TOTAL is an ongoing removal that ends with both the metadata and the actual data table removed.

@dreadatour

Contributor

> State machine
>
> Three new/repurposed dataset statuses:
>
>   • REMOVING = 7 (repurposed): keep-metadata removal in progress; GC resumes to REMOVED
>   • REMOVED = 8 (new): terminal state for keep-metadata path; semver permanently reserved
>   • REMOVING_DROP_METADATA = 9 (new): wipe in progress; GC resumes to row deletion

Why do we need these statuses (REMOVING, REMOVING_DROP_METADATA)? If "removing" fails, will the dataset (version) end up in a broken state (REMOVING) forever?

Would it be easier to simply mark datasets (versions) as REMOVED and remove the actual rows tables in the garbage collector? Then the next GC run would remove all stale rows tables, whether or not the previous run failed. Same in SaaS. We could also run GC (remove rows tables) in SaaS only if/when ClickHouse is active.

I see this as more robust and simpler logic than having a state machine and all the surrounding logic to handle edge cases (failed removal, locks, etc.).
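
A rough sketch of the alternative described in this comment, under stated assumptions: versions are plain dicts with hypothetical `status` and `table` keys, and `warehouse_tables` is a set standing in for the warehouse. It is not code from the PR.

```python
def mark_removed(version: dict) -> None:
    """Removal only flips metadata; no warehouse work happens here."""
    version["status"] = "REMOVED"


def gc_sweep(versions: list, warehouse_tables: set) -> list:
    """Drop rows tables of REMOVED versions; safe to re-run after a failure."""
    dropped = []
    for version in versions:
        table = version["table"]
        if version["status"] == "REMOVED" and table in warehouse_tables:
            warehouse_tables.discard(table)
            dropped.append(table)
    return dropped


versions = [
    {"table": "t1", "status": "COMPLETE"},
    {"table": "t2", "status": "COMPLETE"},
]
tables = {"t1", "t2"}
mark_removed(versions[1])
first_run = gc_sweep(versions, tables)
second_run = gc_sweep(versions, tables)
```

The sweep is idempotent: the first run drops the table of the removed version, and a second run finds nothing left to do.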

Comment thread src/datachain/data_storage/metastore.py Outdated

if dataset.versions and len(dataset.versions) == 1:
# had only one version, fully deleting dataset
# Count in DB, not in the in-memory record: GC-shaped paths

Contributor

this is a very bad comment:

  • what are GC-shaped paths?
  • it refers to the now non-existent line len(dataset.versions) == 1; it will be impossible to understand without the PR
  • can be simpler ... a lot simpler

Please review everything AI generates, clean it up

self.update_dataset_version(dataset, version, **update_data)
with self.db.transaction():
if version:
# Update the version row first. If a status guard was requested

Contributor

What is updated "second"?

Contributor Author

It was about updating the Dataset object below. I updated the comment to make it clearer.

@ilongin

ilongin commented Jun 19, 2026

Contributor Author

> > State machine
> >
> > Three new/repurposed dataset statuses:
> >
> >   • REMOVING = 7 (repurposed): keep-metadata removal in progress; GC resumes to REMOVED
> >   • REMOVED = 8 (new): terminal state for keep-metadata path; semver permanently reserved
> >   • REMOVING_DROP_METADATA = 9 (new): wipe in progress; GC resumes to row deletion
>
> Why do we need these statuses (REMOVING, REMOVING_DROP_METADATA)? If "removing" fails, will the dataset (version) end up in a broken state (REMOVING) forever?
>
> Would it be easier to simply mark datasets (versions) as REMOVED and remove the actual rows tables in the garbage collector? Then the next GC run would remove all stale rows tables, whether or not the previous run failed. Same in SaaS. We could also run GC (remove rows tables) in SaaS only if/when ClickHouse is active.
>
> I see this as more robust and simpler logic than having a state machine and all the surrounding logic to handle edge cases (failed removal, locks, etc.).

We need those statuses because we now have 2 different types of delete:

  1. Completely remove dataset metadata and warehouse data from the DB (the current way)
  2. Keep metadata, mark it as REMOVED, but remove warehouse data

If either process fails, GC needs to continue the removal, so we need to record which kind of delete was in progress. If a version gets stuck in REMOVING, GC will know how to continue. I don't like depending completely on GC to remove warehouse data, as it's a little bit strange and it won't work locally, where the user needs to run GC explicitly.

@ilongin ilongin requested review from dreadatour and shcheklein June 19, 2026 13:19

5 participants