Lately, I've been hearing system administrators and managers ask about solutions to keep people from accidentally removing their data. These are very smart and dedicated people asking for a solution so that data isn't lost either by accident or on purpose. A wild idea I've heard to solve the problem is getting rid of user access to the rm command. Is this truly a crazy idea?
Lately, in my day job, I’ve been getting questions from very smart, dedicated, and resource-constrained administrators whose users accidentally, or sometimes absentmindedly, remove data. The problem they face is that a user removes the data, most likely using rm, and then realizes they actually needed it. Of course, the user asks the administrator to recover the data or restore it from a backup, which may or may not be possible (sometimes the requests are, shall we say, “impassioned”). If the recovery doesn’t happen fast enough, or if the user is not getting enough attention, the user does what comes naturally: they escalate the request through their management chain, stressing the importance of the data. Consequently, the administrators have to drop everything they are doing and spend a great deal of time recovering the data (if they can). This means that other tasks fall by the wayside and administrator frustration builds (and from the comments and requests, the frustration level seems to be pretty high).
The requests for solutions are coming from many different places so this isn’t a single administrator with a single user asking for this capability. Actually, I think this is an age-old question from administrators that one admin described as, “how do we help the users from hurting themselves?”
Being an engineer I like to look for solutions so I started examining the request from several angles and then asked questions hoping to clarify the issues. Perhaps this problem is looking for more of a policy solution? Or perhaps it is a problem requiring a technical solution? Or is it both?
One of the common themes I encountered during my discussions was that whatever policies were created and communicated by administrators, sometimes upper management intervened and countered the policies. Sometimes this happened infrequently, typically for a critical project and a critical piece of data, but more often than not, it followed the old adage “the squeaky wheel gets the grease.” While I didn’t conduct a scientific survey one thing I did notice was that when the problem made it to upper levels of management, managers were not aware of the policies that had been set and publicized. To my mind this pointed to one possible solution to the problem – developing policies in conjunction with management while addressing the resource implications of the policies.
To begin the development of policies, one should assume that users will erase data by accident and will need it restored. The resource implications can be quantified based on past experience, data growth, and other factors. Will it require additional staff? Will it require additional hardware? Then the conclusions are presented to management. During the discussion, management should be made aware of the impact of restoring or recovering data and the impact of users erasing data. Then a “management approved” policy is established and communicated to the user base, and any changes to resources are resolved.
I think this approach can help alleviate the problem because management is fully aware of the resource implications of whatever the decision is, and perhaps even more importantly, users are made aware of the policies. The subtle context is that the entire management hierarchy is now aware of the policies so that using the “squeaky wheel” approach will have little impact on operations (although there will always be exceptions).
If you are reading this article, you are likely to be more technically focused which typically means that you like to find technical solutions to problems. I think there might be something of a technical solution to the problem discussed in this article – getting rid of user access to rm directly. But before you write a comment to this article stating that the title is inflammatory (hey, you’re reading this article – right?), please read on to understand my point.
I first started using Unix on an old VAX in 19.., er, a long time ago. We didn’t have much disk space, but perhaps more importantly, it was a new operating system to everyone. Using Unix, one could interactively log into a system while sitting in front of a terminal rather than submit jobs to the big mainframe (that really dates me, doesn’t it?). This was a shift in how we worked, which meant there was an associated period of learning. One of the things people had to learn was how to erase files when they were finished, including the deadly options for rm: “-r”, “-f”, and “-rf *”.
To help people during the adjustment period, one of the things the administrators did was to “alias” the rm command. So when you used rm what actually happened was that the data was moved to a temporary disk location instead of actually being erased. So if you had an “oops” moment, you could go get the files yourself, if you knew the location, or you could send an email to the administrator and they would do it for you. Since disk space was expensive, the data only lived in the temporary disk location for a certain period of time and then it was removed. But this could save your bacon in a pinch and it saved mine on several occasions.
So my thought is to do exactly this – alias the rm command to something else, most likely mv, so that the data is not actually erased, but moved to a different location. Then a cron job or a daemon is used to erase the files based on some policies (e.g. oldest files are erased if a certain usage level is reached (“high water” mark) and/or the files have reached a certain age). It takes disk resources to do this because you need a target location for storing the data but that can be part of the resource planning mentioned in the previous section.
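The clean-up policy described above (purge by age, and purge oldest-first once a "high water" mark is exceeded) could be sketched roughly as follows. This is a minimal illustration in Python, not a tested production tool. The function name `purge_holding_area` and its parameters are hypothetical, and it assumes that the inode change time (`st_ctime`) is a reasonable stand-in for "when the file was removed", since on Linux a rename normally updates it. That assumption is worth verifying on your own file system.

```python
import os
import time


def purge_holding_area(root, max_age_seconds, high_water_bytes, now=None):
    """Delete expired files, then oldest-first until usage is under the mark.

    Returns the list of paths that were deleted. Empty directories are left
    in place; a real job would probably prune those as well.
    """
    now = time.time() if now is None else now

    # Collect (removal time, size, path) for every file in the holding area.
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # ASSUMPTION: st_ctime approximates the time the file was moved here.
            entries.append((st.st_ctime, st.st_size, path))

    entries.sort()  # oldest first
    total = sum(size for _when, size, _path in entries)

    deleted = []
    for when, size, path in entries:
        too_old = (now - when) > max_age_seconds
        over_mark = total > high_water_bytes
        if not (too_old or over_mark):
            break  # everything remaining is newer and usage is under the mark
        os.remove(path)
        total -= size
        deleted.append(path)
    return deleted
```

A cron job could call something like `purge_holding_area("/holding", 14 * 86400, 500 * 2**30)` once a night. The path, the two-week age limit, and the 500 GiB mark are placeholders that would come out of the resource planning discussed earlier.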
The question is, how easy is it to alias rm? If you read the manpage for rm you will see that there are a few options.
- -f, --force ignore nonexistent files, never prompt
- -i prompt before every removal
- -I prompt once before removing more than three files, or when removing recursively. Less intrusive than -i, while still giving protection against most mistakes
- --interactive[=WHEN] prompt according to WHEN: never, once (-I), or always (-i). Without WHEN, prompt always
- --one-file-system when removing a hierarchy recursively, skip any directory that is on a file system different from that of the corresponding command line argument
- --no-preserve-root do not treat “/” specially
- --preserve-root do not remove “/” (default)
- -r, -R, --recursive remove directories and their contents recursively
- -v, --verbose explain what is being done
- --help display this help and exit
- --version output version information and exit
Some of these options are fairly rare in my experience, but since they are part of the tool they need to be considered.
The next thing to examine is the manpage for mv. It too has a few options:
- --backup[=CONTROL] make a backup of each existing destination file
- -b like --backup but does not accept an argument
- -f, --force do not prompt before overwriting
- -i, --interactive prompt before overwrite
- --strip-trailing-slashes remove any trailing slashes from each SOURCE argument
- -S, --suffix=SUFFIX override the usual backup suffix
- -t, --target-directory=DIRECTORY move all SOURCE arguments into DIRECTORY
- -T, --no-target-directory treat DEST as a normal file
- -u, --update move only when the SOURCE file is newer than the destination file or when the destination file is missing
- -v, --verbose explain what is being done
- --help display this help and exit
- --version output version information and exit
There is some similarity between options for the two commands but it will likely take some work to have mv “imitate” rm. Perhaps a bash, python, or perl script can parse the command line and then use mv to perform the operations.
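To make the idea of a wrapper script concrete, here is a rough Python sketch that parses a handful of rm's options and then moves each argument into a holding area instead of deleting it. Everything about it is an assumption for illustration: the function names `build_parser` and `safe_rm`, the choice to mirror the file's absolute path under the holding directory, and the decision to support only -f, -i, -r/-R, and -v. It does not reproduce all of rm's behavior (for example, in real rm the last of -i and -f wins, while here -f always suppresses the prompt).

```python
import argparse
import os
import shutil
import sys


def build_parser():
    """Accept a small, commonly used subset of rm's options."""
    parser = argparse.ArgumentParser(prog="rm")
    parser.add_argument("-f", "--force", action="store_true")
    parser.add_argument("-i", dest="prompt", action="store_true")
    parser.add_argument("-r", "-R", "--recursive", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("paths", nargs="*")
    return parser


def safe_rm(argv, holding_dir):
    """Move each named path under holding_dir; return an rm-style exit status."""
    args = build_parser().parse_args(argv)
    status = 0
    for path in args.paths:
        if not os.path.lexists(path):
            if not args.force:  # -f silently ignores nonexistent files
                print(f"rm: cannot remove '{path}': No such file or directory",
                      file=sys.stderr)
                status = 1
            continue
        is_real_dir = os.path.isdir(path) and not os.path.islink(path)
        if is_real_dir and not args.recursive:
            print(f"rm: cannot remove '{path}': Is a directory", file=sys.stderr)
            status = 1
            continue
        if args.prompt and not args.force:
            if not input(f"rm: remove '{path}'? ").lower().startswith("y"):
                continue

        # Mirror the absolute path under the holding area (hypothetical layout).
        dest = os.path.join(holding_dir, os.path.abspath(path).lstrip(os.sep))
        if os.path.lexists(dest):
            # In this sketch an earlier removal of the same path is overwritten.
            if os.path.isdir(dest) and not os.path.islink(dest):
                shutil.rmtree(dest)
            else:
                os.remove(dest)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(path, dest)  # rename if possible, copy and delete otherwise
        if args.verbose:
            print(f"removed '{path}'")
    return status
```

A small launcher script named rm, placed ahead of /bin/rm in the users' PATH, could call `sys.exit(safe_rm(sys.argv[1:], "/holding"))`, where `/holding` is a placeholder for whatever storage is set aside for this purpose.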
An alternative is to create a very simple rm replacement that drops many of the lesser-used options and maps what remains onto mv. This does create some restrictions for the users, but I think if this was put into production, you would quickly find the small number of users who needed some specific options in rm.
Using mv as a substitute for rm is not a perfect solution and you might be able to find some difficult corner cases. For example, if a user rm-ed a file using the aliased rm command, it would be moved to the temporary disk storage and could be recovered (copied back). If the user then created a new file with the exact same name and used rm to erase that file too, the first file on the temporary storage would be overwritten. Perhaps this is acceptable, perhaps it is not (this could be part of a policy).
However, as part of the “new” rm you could compensate for this scenario by using a different path for the moved data. You could add a time stamp to the front of the user’s directory path to create a unique path. Just use the time when the command is executed (seconds since the epoch) as the unique identifier at the head of the user’s path. But this creates some potential problems because users may not know when they ran the command, meaning that the administrator will have to search for all occurrences of the file within the user’s directory tree and then ask the user which one(s) they want restored. This isn’t an intolerable situation since it’s fairly easy to use find to locate the file, but it does add some work for the administrator (adding administrator resources to the planning).
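As a rough illustration of the time-stamp idea, the Python sketch below builds a destination path whose first component is the removal time in seconds since the epoch, and then searches the holding area for every saved copy of a given file, much as an administrator would with find. The helper names `stamped_destination` and `find_removed_copies` and the directory layout are hypothetical. Note that two removals of the same path within the same second would still collide under this scheme.

```python
import os
import time


def stamped_destination(holding_dir, path, now=None):
    """Return holding_dir/<epoch seconds>/<absolute path of the removed file>."""
    stamp = str(int(time.time() if now is None else now))
    relative = os.path.abspath(path).lstrip(os.sep)
    return os.path.join(holding_dir, stamp, relative)


def find_removed_copies(holding_dir, original_path):
    """List (epoch seconds, saved path) for every copy of original_path, oldest first."""
    relative = os.path.abspath(original_path).lstrip(os.sep)
    matches = []
    for stamp in os.listdir(holding_dir):
        if not stamp.isdigit():
            continue  # skip anything that is not a time-stamp directory
        candidate = os.path.join(holding_dir, stamp, relative)
        if os.path.lexists(candidate):
            matches.append((int(stamp), candidate))
    return sorted(matches)
```

An administrator (or the user) could then loop over the result and print `time.ctime(stamp)` next to each saved path, which makes it easier to ask "which one did you want back?"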
One thing this approach cannot help with are applications that erase or remove files as part of their processing. If this happens, the only real alternative is to have a second copy of the data somewhere. But this scenario should be brought to the attention of management so that policies can be developed (e.g. have two copies of all data at all times, or telling users that there is no recourse if this happens).
One thing I haven’t touched on yet, are backups. Backups can be beautiful things that save your bacon, but they are somewhat limited as I’m sure all administrators are aware. Backups happen at certain intervals whether they are full or incremental backups. In between backups, users create, change, and remove data that backups miss. Backups may be able to restore data but only if the data has, in fact, been backed up. Also, how many times have administrators tried to restore a file from a backup, only to discover that the backup failed or the media, most likely tape, is corrupt? Fortunately, this scenario is becoming more rare, but it still happens (be sure to check your backup logs and be sure to test a data restoration on a small amount of data every so often).
So backups can help with data recovery but they are not perfect. Moreover, given the size of file systems, it may be impossible to do full backups, at least in an economical way. So you may be restricted to performing a single full backup when the system is first installed and then doing incremental backups for the life of the system. However, for petabyte-size file systems even this may be very difficult to accomplish, and it may require more hardware than can be afforded.
Using backups in combination with aliasing the rm command can offer some data protection and perhaps reduce the likelihood that “users will hurt themselves.”
Normally, having users remove data and then yell about getting it recovered quickly is a fairly rare occurrence. I talk to a great number of administrators, and this scenario was something they rarely encountered; when they did, they restored the data, talked to the user to help educate them, or helped develop scripts or tools to alleviate the problem. But in the last 3-6 months, I’ve been hearing a fairly consistent but increasing “hum of discontent” from administrators about users who erase data and then become a squeaky wheel to get management’s attention so the erased data is recovered quickly. In these discussions, the majority of the administrators feel that this problem is only going to get worse because of more users and more data. So they are looking for solutions (since this is Linux Magazine, they are naturally looking for solutions on Linux).
I see two aspects to the problem. The first is a policy aspect where upper level management needs to be brought into discussions to develop appropriate policies. But as part of this people need to remove emotion from the discussions and present real data on the frequency of the data restoration requests and how much work it requires and how much it disrupts normal operations. In essence, the discussion, like many other discussions, should be around resource allocation and associated benefits. But the benefit of having upper management involved is that there is agreement on the policies at the highest levels. Then they can be published to all users with the implication that management is very aware of the issues surrounding data recovery so no more squeaky wheels will be tolerated (score one for the administrators).
The second aspect, which really accompanies the first, is a technical aspect. Are there any tools to help easily restore or recover erased data, or basically prevent a user from accidentally erasing data? (“Hurting themselves” was the phrase I heard several times.) Backups can help but they are only part of the solution (IMHO). Going back to my early UNIX education, I remember the administrators set up the system so that if you used the rm command, the data was moved to a temporary disk location where I could go copy it back if I acted quickly. After a period of time the data was erased from the temporary disk storage (presumably via a cron job or daemon).
I think that using something like this approach, aliasing the rm command so that the data is actually moved to a temporary disk location, coupled with normal backups, could help alleviate some of the problems administrators are having. It will take some work to write the scripts that would implement such a solution, but it’s not that difficult. Then you just need some sort of temporary disk based storage and you are off to the races. Plus, the size of the temporary storage is adjustable so if you need more space, it’s fairly easy to add more hardware, but it does cost money.
So does this make the argument for abolishing user access to the rm command? I think the answer is unique to each system, the administrators, and the users. In many cases this approach can really help users quickly recover needed files, but it takes work to develop the scripts (and test them). However, it can help reduce the workload on administrators. I personally think this is a very useful approach for larger systems but again, the choice is yours.