Why the heck are my NetApp volumes stuck deleting?
I am building a project for a client that requires a lot of DevOps-style integration with ONTAP Cloud. In their particular case, I need to create and destroy many volumes to trigger test scenarios. I create tens of new volumes per hour and delete them within minutes. Previously, this type of work was not a big deal, but I recently found an option in Data ONTAP 9.x that is, well, cool but annoying in this situation. I will tell you the back story first.

I developed a really neat Python library that integrates with Docker and the awesome NetApp Docker Volume Plug-in. (I will soon be announcing a new Chef cookbook designed to help streamline deploying Docker and the NetApp Docker Volume Plug-in.) The library handles multiple scenarios designed to trigger issues in ONTAP by artificially replicating the problem. I will have a later article on this exact use case and talk about how I created the solution. The first three use cases revolve around creating volumes and either filling an aggregate, a volume, or a LUN. Early in my testing, I kept running into issues where the volume creation would fail with messages like:
VolumeDriver.Create: Error creating volume: resultStatusAttr: failed resultReasonAttr: Failed to create the volume on node "jgoodrum01-01". Reason: Request to create volume "scenario_aggr_aggr_full_large" failed because there is not enough space in aggregate "aggr_small_1". Either create 7.28GB of free space in the aggregate or select a size of at most 72.7GB for the new volume.
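For context, the churn that triggers this comes from the Docker side: the test harness creates and removes volumes through the NetApp Docker Volume Plug-in. A minimal sketch of that pattern with the Python Docker SDK looks something like this (the driver name "netapp" and the size option reflect my plug-in configuration and are assumptions, not universal defaults):

import docker

client = docker.from_env()

# Create a test volume through the NetApp Docker Volume Plug-in,
# then remove it as soon as the scenario is done with it.
vol = client.volumes.create(
    name="scenario_aggr_aggr_full_large",
    driver="netapp",               # assumption: plug-in registered under this name
    driver_opts={"size": "80g"},   # assumption: size option as exposed by my driver config
)
vol.remove()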
Ok, this is odd, as I told it to delete the volumes and the response from the plug-in showed a successful removal. In fact, when I logged into System Manager, I could see that none of the volumes still existed. Yet when I looked at the aggregate view, it still showed that space as consumed. At first, I assumed that this was some type of bug in System Manager, so I fired up trusty old iTerm and logged into the cluster via SSH. I ran the command volume show and saw some odd output:
jgoodrum01::> vol show
Vserver   Volume       Aggregate    State      Type Size       Available  Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
jgoodrum01-01
          vol0         aggr0        online     RW   69.72GB    59.50GB    14%
svm_jgoodrum01
          scenario_aggr_aggr_full_1270
                       aggr_small_1 offline    DEL  200GB      -          -
svm_jgoodrum01
          scenario_aggr_aggr_full_1274
                       aggr_small_1 offline    DEL  18.63GB    -          -
svm_jgoodrum01
          scenario_aggr_aggr_full_large_1269
                       aggr_small_1 offline    DEL  80GB       -          -
svm_jgoodrum01
          scenario_nas_mysql_data
                       aggr_small_2 online     RW   20GB       10.75GB    46%
Notice that there are three volumes that are currently marked offline and have a type of DEL. What is this DEL type, and why are those volumes just hanging around like those guests at a BBQ who just never leave? I did some quick searching and found a great article from Justin Parisi about the NetApp volume recovery queue. I definitely suggest that you give it a read along with his other great articles. As it turns out, NetApp implemented a feature in ONTAP that marks a deleted volume as soft-deleted. After the deletion retention period expires (the current default is 12 hours), the volume is completely removed and the space becomes available again. I believe that this is an awesome feature that helps protect production environments from accidental deletion, but I am not production, I am DevOps. Now the question becomes: how can I just skip this process altogether?
After a little further research, I found that this recovery queue works on a per-Storage Virtual Machine (SVM) basis and that I can disable it. Since the SVM that I am leveraging for this process will always have a high turnover of transient data and volumes, I decided to disable the feature. Here are the steps:
NOTE: By disabling this feature, you lose the inherent value of the recovery queue for any SVM on which you disable this setting. Please check your requirements before implementing and maybe don’t do this to any production SVM.
# This is a diagnostic-level setting and is not visible at the normal admin privilege level.
jgoodrum01::> set -privilege diagnostic

Warning: These diagnostic commands are for use by NetApp personnel only.
Do you want to continue? {y|n}: y

jgoodrum01::*> vserver show -vserver svm_jgoodrum01 -fields volume-delete-retention-hours
vserver        volume-delete-retention-hours
-------------- -----------------------------
svm_jgoodrum01 12

# Set the retention to zero to disable the process for this SVM
jgoodrum01::*> vserver modify -vserver svm_jgoodrum01 -volume-delete-retention-hours 0
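Since I am doing everything else from Python anyway, I will probably wrap this change in a small helper rather than typing it by hand. Here is a minimal sketch, assuming SSH access to the cluster management LIF with paramiko; the host name and credentials are placeholders, and depending on your ONTAP version you may need to answer the diagnostic-privilege warning, which the sketch pre-sends on stdin:

import paramiko

def disable_recovery_queue(host, user, password, svm):
    """Set volume-delete-retention-hours to 0 for an SVM over SSH.

    Sketch only: the host, credentials, and the 'y' confirmation are assumptions.
    """
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    try:
        # Chain the commands so the privilege change applies to the same session.
        cmd = ("set -privilege diagnostic; "
               "vserver modify -vserver {0} -volume-delete-retention-hours 0").format(svm)
        stdin, stdout, stderr = client.exec_command(cmd)
        stdin.write("y\n")  # answer the diagnostic-privilege warning, if prompted
        stdin.flush()
        return stdout.read().decode()
    finally:
        client.close()

print(disable_recovery_queue("cluster-mgmt.example.com", "admin", "password", "svm_jgoodrum01"))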
Changing the retention only affects new volume deletions. If we look at the recovery queue, you will notice that the previously deleted volumes all still exist and are set to expire after 12 hours.
# Show the existing volume recovery queue.
jgoodrum01::*> vol recovery-queue show
Vserver   Volume      Deletion Request Time    Retention Hours
--------- ----------- ------------------------ ---------------
svm_jgoodrum01
          scenario_aggr_aggr_full_1312
                      Wed Jun 14 17:28:13 2017              12
svm_jgoodrum01
          scenario_aggr_aggr_full_large_1311
                      Wed Jun 14 17:28:14 2017              12
svm_jgoodrum01
          scenario_nas_vol_full_1293
                      Wed Jun 14 16:30:36 2017              12
....
23 entries were displayed.
If we would like to remove only a single volume or a handful of them, the following example shows how to do it.
# Delete a single volume from the recovery queue
jgoodrum01::*> vol recovery-queue purge -vserver svm_jgoodrum01 -volume scenario_aggr_aggr_full_1312
Queued private job: 344

jgoodrum01::*> vol recovery-queue show -volume scenario_aggr_aggr_full_1312
There are no entries matching your query.
In my case, I would like to simply remove all of these volumes, as they are not needed.
jgoodrum01::*> vol recovery-queue purge-all
Initializing

jgoodrum01::*> vol recovery-queue show
This table is currently empty.
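Since my test harness tears volumes down constantly, it also does not hurt to have the same kind of helper for flushing the queue. A quick sketch along the same lines, again assuming paramiko SSH access with placeholder credentials and the same pre-sent confirmation:

import paramiko

def purge_recovery_queue(host, user, password, svm, volume=None):
    """Purge one volume from the recovery queue, or everything if no volume is given.

    Sketch only: the host, credentials, and the 'y' confirmation are assumptions.
    """
    if volume:
        purge = "volume recovery-queue purge -vserver {0} -volume {1}".format(svm, volume)
    else:
        purge = "volume recovery-queue purge-all"
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    try:
        stdin, stdout, stderr = client.exec_command("set -privilege diagnostic; " + purge)
        stdin.write("y\n")  # answer the diagnostic-privilege warning, if prompted
        stdin.flush()
        return stdout.read().decode()
    finally:
        client.close()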
Hopefully, you find this helpful. Now, I am back to writing these test case scenarios. I even decided to include a scenario where the retention is set to 1 hour and triggers a failure to create the volume. I think that it will be a great use case, but more on that later.