Fix fragment cleanup failures on NFS-backed workspaces #139

Open
ErenYurekAgena wants to merge 1 commit into ModelioOpenSource:master from ErenYurekAgena:fix-nfs-fragment-cleanup

Summary

This pull request fixes fragment cleanup failures that can occur when a Modelio workspace is stored on an NFS-backed filesystem.

The issue can happen during module or RAMC update/removal. Modelio closes and deletes the old fragment directories before reinstalling or updating them. On local filesystems this normally works as expected. On NFS-like filesystems, however, recently closed files may remain temporarily busy and can appear as .nfs* files. While that temporary state exists, recursive directory deletion may fail with errors such as DirectoryNotEmptyException or Device or resource busy.

When this happens, the module/RAMC update transaction can roll back even though the failure is only caused by a transient NFS cleanup state.

Problem this prevents

This change prevents module/RAMC update or removal from failing just because the previous fragment directory cannot be removed immediately on an NFS-backed workspace.

Without this fix, the canonical fragment directory may remain in place after a transient NFS delete failure. That prevents the new fragment version from being installed cleanly and can cause the update transaction to roll back.

With this fix, when the failure is identified as a network-filesystem-style transient delete failure, the old fragment directory is moved away from its canonical location into a delete-pending sibling directory. This immediately frees the original fragment path, allowing the update or reinstall operation to continue.

The moved directory is then cleaned up with a bounded number of retries. If NFS still reports busy files, cleanup is retried asynchronously and retried again on a later fragment mount.
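The move-aside step and the bounded retry described above can be sketched as follows. This is a minimal illustration, not the code from this pull request: the class name `DeletePendingSketch`, the `.delete-pending-<uuid>` directory naming, the retry parameters, and the choice to write the marker before the move are all assumptions. `deleteRecursively` stands in for `FileUtils.delete(path)`; only the `.modelio-delete-pending` marker suffix comes from the description.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class DeletePendingSketch {
    static final String MARKER_SUFFIX = ".modelio-delete-pending";

    /** Recursive delete; a stand-in for the FileUtils.delete(path) call named in the PR. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /** Sibling marker path for a delete-pending directory. */
    static Path markerFor(Path pending) {
        return pending.resolveSibling(pending.getFileName() + MARKER_SUFFIX);
    }

    /** Frees the canonical path by renaming the directory to a unique sibling. */
    static Path moveAside(Path fragmentDir) throws IOException {
        // Hypothetical naming scheme; the real pattern is not given in the description.
        Path pending = fragmentDir.resolveSibling(
                fragmentDir.getFileName() + ".delete-pending-" + UUID.randomUUID());
        // Assumption: the marker is written first, so a crash leaves an orphan marker
        // rather than an unmarked directory.
        Files.createFile(markerFor(pending));
        Files.move(fragmentDir, pending, StandardCopyOption.ATOMIC_MOVE);
        return pending;
    }

    /** Tries to delete up to maxAttempts times; returns true once the directory is gone. */
    static boolean deleteWithRetry(Path pending, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                deleteRecursively(pending);
                try {
                    Files.deleteIfExists(markerFor(pending));
                } catch (IOException ignored) {
                    // Marker cleanup is best-effort and must not fail the cleanup.
                }
                return true;
            } catch (IOException e) {
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        return false; // caller may schedule asynchronous cleanup
    }
}
```

A caller would invoke `moveAside` only after the normal delete has failed with a transient network-filesystem error, and would hand the returned path to an asynchronous cleaner when `deleteWithRetry` returns `false`.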

What changed

The standard deletion path is still used first.

If FileUtils.delete(path) succeeds, the method returns immediately and no fallback logic is executed.

Only if deletion fails, and only if the failure matches a network-filesystem-style transient cleanup problem, does the code apply the fallback behavior:

  • detect whether the fragment directory is on an NFS/EFS-like filesystem
  • recognize transient cleanup failures such as DirectoryNotEmptyException, Device or resource busy, or .nfs* files
  • move the old fragment directory to a unique delete-pending sibling path
  • create a Modelio-specific sibling marker file for that temporary directory
  • retry deletion synchronously
  • schedule bounded asynchronous cleanup if NFS still keeps files busy
  • retry stale delete-pending cleanup on later network-backed fragment mount
  • remove orphan marker files when their matching delete-pending directory is already gone
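The first two checks in the list above could look roughly like this. It is a sketch under stated assumptions rather than the logic in this pull request: the class and method names are hypothetical, and it assumes the JVM reports a file store type beginning with `nfs` for NFS and EFS mounts (the string returned by `FileStore.type()` is implementation specific).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.stream.Stream;

final class NfsCleanupChecks {

    /** Assumption: NFS and EFS mounts report a type such as "nfs" or "nfs4". */
    static boolean isNetworkFileStoreType(String type) {
        return type != null && type.toLowerCase(Locale.ROOT).startsWith("nfs");
    }

    static boolean isOnNetworkFilesystem(Path dir) {
        try {
            FileStore store = Files.getFileStore(dir);
            return isNetworkFileStoreType(store.type());
        } catch (IOException e) {
            return false; // unknown store: keep the conservative local behavior
        }
    }

    /** True if any entry below dir has a name starting with ".nfs". */
    static boolean containsNfsPlaceholder(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            return walk.anyMatch(p -> p.getFileName() != null
                    && p.getFileName().toString().startsWith(".nfs"));
        } catch (IOException | UncheckedIOException e) {
            return false;
        }
    }

    /** Matches the failure shapes named in the description. */
    static boolean looksTransient(IOException failure, Path dir) {
        if (failure instanceof DirectoryNotEmptyException) {
            return true;
        }
        String message = failure.getMessage();
        if (message != null && message.contains("Device or resource busy")) {
            return true;
        }
        return containsNfsPlaceholder(dir);
    }

    /** The fallback applies only when both conditions hold; other failures propagate. */
    static boolean shouldUseFallback(IOException failure, Path dir) {
        return isOnNetworkFilesystem(dir) && looksTransient(failure, dir);
    }
}
```

Requiring both conditions keeps the fallback narrow: a `DirectoryNotEmptyException` on a local disk would still be rethrown to the caller.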

Why this should not affect normal local usage

The existing behavior is preserved for normal successful local filesystem deletion.

On a local filesystem, when FileUtils.delete(path) succeeds, the new code simply returns immediately. The delete-pending fallback is not entered.

Delete-pending directory scans are also guarded by a filesystem check and are only run for network-backed fragment parent directories. This means normal local filesystems do not automatically scan and clean delete-pending directories during mount.

The fallback is intentionally conservative:

  • it is scoped to fragment directory cleanup
  • it does not change repository mounting or module installation logic
  • it does not replace the normal delete path
  • it only handles failures that look like transient NFS/EFS cleanup failures
  • unrelated I/O failures are still propagated
  • temporary directories are only treated as cleanup candidates if they have a Modelio-specific sibling marker file
  • a directory that only happens to match the naming pattern is not deleted unless the marker file exists
  • marker cleanup is best-effort and cannot make fragment cleanup fail

Safety of the marker file

The delete-pending marker is stored as a sibling file instead of being placed inside the temporary directory.

This avoids a case where a recursive cleanup could delete an internal marker before the parent directory itself is successfully removed. By keeping the marker next to the temporary directory, later cleanup attempts can still safely identify whether the directory was created by this fallback.

Only directories with the matching .modelio-delete-pending marker are considered delete-pending cleanup candidates.
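To illustrate the sibling-marker rule, the sketch below lists cleanup candidates and removes orphan markers in a fragment parent directory. The class and method names are hypothetical, and the additional name-pattern check mentioned earlier is omitted because the pattern is not given here. Only the `.modelio-delete-pending` suffix is taken from the description.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class DeletePendingScan {
    static final String MARKER_SUFFIX = ".modelio-delete-pending";

    /** Returns the directories in parent that have a matching sibling marker file. */
    static List<Path> findCandidates(Path parent) throws IOException {
        List<Path> candidates = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(parent)) {
            for (Path entry : entries) {
                if (!Files.isDirectory(entry, LinkOption.NOFOLLOW_LINKS)) {
                    continue;
                }
                Path marker = entry.resolveSibling(entry.getFileName() + MARKER_SUFFIX);
                if (Files.isRegularFile(marker)) {
                    candidates.add(entry); // a directory without a marker is never a candidate
                }
            }
        }
        return candidates;
    }

    /** Deletes marker files whose directory no longer exists; returns how many were removed. */
    static int removeOrphanMarkers(Path parent) throws IOException {
        int removed = 0;
        try (DirectoryStream<Path> markers =
                Files.newDirectoryStream(parent, "*" + MARKER_SUFFIX)) {
            for (Path marker : markers) {
                String name = marker.getFileName().toString();
                String dirName = name.substring(0, name.length() - MARKER_SUFFIX.length());
                if (dirName.isEmpty() || Files.exists(marker.resolveSibling(dirName))) {
                    continue;
                }
                try {
                    if (Files.deleteIfExists(marker)) {
                        removed++;
                    }
                } catch (IOException ignored) {
                    // Best-effort: a marker that cannot be removed is left for a later pass.
                }
            }
        }
        return removed;
    }
}
```

Because the marker sits next to the directory rather than inside it, a partially completed recursive delete leaves the marker intact, so the directory is still recognized on the next scan.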

Tested

Tested with an NFS/EFS-backed workspace:

  • module/RAMC update succeeds without rollback
  • old fragment directories are moved to delete-pending sibling paths when NFS temporarily prevents deletion
  • matching marker files are created
  • delete-pending directories and marker files are removed after NFS releases the busy files
  • stale delete-pending directories are cleaned on later project/fragment mount
  • orphan marker files are cleaned when the matching temporary directory no longer exists

Tested with the same Ubuntu 22.04 container runtime using a non-EFS local filesystem workspace:

  • application starts normally
  • successful deletion on the local filesystem still goes through the original FileUtils.delete(path) path
  • delete-pending cleanup is not triggered on startup for non-network-backed fragment directories
  • manually created delete-pending marker artifacts are not removed on startup when the workspace is not on NFS/EFS
