Bash script for deleting duplicate files


xiphias
04-28-2007, 02:16 PM
I'm having a problem deleting duplicate files after user confirmation. I have written this script, but it's incomplete:


find . -maxdepth 1 -type f -print0 | xargs -0 md5sum |
sort | uniq -w 32 -c | awk '{print $1, $3 }'| grep "^[23]"



As you can see, it displays files that have 2 or 3 copies, but I don't know how to delete them; 'rm -i' doesn't seem to work. Please help.

Thanks for reading the post.

happybunny
04-28-2007, 02:30 PM
-exec rm -f {} \; ?

xiphias
04-28-2007, 02:43 PM
-exec rm -f {} \; ?

I want user confirmation, so I tried -ok rm {} \; in the 'find' part of the pipe, but then it prompts me about every file. If I put it at the end of the pipe, after grep, it gives a syntax error. :(

bwkaz
04-28-2007, 02:46 PM
Well, exec'ing rm would work if that whole command was part of the find. (Because -exec is an argument that find accepts.) But it isn't; there's a lot of md5summing and sorting, etc. going on.

What could work is another pipe to "xargs rm -i" at the end of that pipeline, though. :)

(And I'm assuming you don't have any files whose names might contain newlines. The "find -print0 | xargs -0" will handle those files, but nothing else in that pipeline will. ;))

Edit: just saw the previous post. When you run rm with the -i option, then it will prompt you about every file you tell it to delete. That's what -i does. ;) (And a lot of distros alias rm to rm -i for some reason.) To prevent it from prompting, either remove the alias (unalias rm), or get rid of the -i option; then it'll only prompt about files that you don't have write permission to. (If you have write permission to their directory, you'll still be able to delete them, but it asks just to make sure you want to.) You can also add -f, but that suppresses all prompts.

Neither of these will give you just a single prompt for the whole multi-file delete, though; rm doesn't work that way. If you want a prompt like that, you'll have to ask the user in your script before running the pipeline (or run the pipeline and display the list of files first; that could work too).
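
A minimal sketch of the "display the list of files first, then ask once" idea described above. The helper name confirm_and_delete is hypothetical (it is not from this thread). It takes the file names as arguments rather than on standard input, so the answer to the prompt can still be read from the terminal.

```shell
# Hypothetical helper: show every file that would be removed, then ask once.
confirm_and_delete() {
    # Nothing to do when no file names were passed.
    if [ "$#" -eq 0 ]; then
        echo "Nothing to delete."
        return 0
    fi

    echo "Would delete:"
    printf '  %s\n' "$@"

    # A single prompt covers the whole list.
    printf 'Delete all of these? [y/N] '
    read -r answer

    case "$answer" in
        y|Y|yes|YES) rm -f -- "$@" ;;
        *)           echo "Nothing deleted." ;;
    esac
}
```

It would be called as, for example, confirm_and_delete ./t2 ./t3. Anything other than y or yes leaves the files alone.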

hotcold
04-28-2007, 05:28 PM
Hi.

You need to be careful here. The script you have does not list the names of all the files, only the first one it encounters in each group. I modified your script to search for duplicates only among the set of files named t1, t2, ..., etc., and it printed:
3 ./t1
2 ./t5

There are not 2 or 3 copies of the files, there are 2 or 3 instances. If you delete them all, you will have no instances remaining. If I were doing this, I would get the list except the first, and delete the remainder. You need to decide whether it is easier to set up the script to perhaps just list the duplicates, then you delete them more or less manually, or to set things up to request confirmation on every file, or to display a group and request confirmation on that group, or to simply remove the set without confirmation. Some choices will be more difficult for you to code, others will require more work when you finally execute the script.

There is a variation of your script posted at http://www.linuxquestions.org/questions/showthread.php?p=2727401#post2727401 which appears to list all the files of a group. ... cheers, hotcold
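
As a rough illustration of listing every member of a group (this is not the script from the linked post), GNU uniq can print all lines of each duplicate group instead of just the first one. The function name list_duplicate_groups is made up for this sketch, which assumes GNU find, xargs, md5sum and uniq, and file names without newlines or backslashes.

```shell
# Hypothetical helper: print "md5sum  filename" for every file that has
# at least one duplicate, with a blank line between groups.
list_duplicate_groups() {
    dir=${1:-.}
    find "$dir" -maxdepth 1 -type f -print0 |
        xargs -0 -r md5sum |
        LC_ALL=C sort |
        uniq -w 32 --all-repeated=separate   # compare only the 32-char hash
}
```

With three identical files t1, t2, t3 this prints three lines, one per instance, rather than the single "3 ./t1" summary.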

xiphias
04-29-2007, 12:38 AM
@bwkaz

Thanks, it works. :)


@hotcold

I was aware of the problem of deleting all the files instead of just the extra instances, but I didn't have any idea how to avoid it. :( It would be helpful if you could explain how to select and delete the instances of a file except the original one.

I read the post you pointed me to. I'm not sure, but I think it just lists all the instances, so I would arrive at the same problem as at the start of this thread.

Thanks for helping.

hotcold
04-29-2007, 07:01 AM
Hi, xiphias.
... how to select and delete instances ...
Assuming we are talking about the solution that uses the more complex awk script, one action you could take is to copy & paste with the mouse between terminal windows -- copy the filenames except for the first one, then paste into an rm command in a different window. For something more automatic, I would think about modifying the awk script so that it doesn't print the md5sum -- that seems to be of little value except to compare solutions. Then I would try to think of ways to omit the first instance on each line, either in the awk script or piping into another utility that would cut out the first instance.

I don't see any reason to identify the earliest instance of a set of duplicates -- the original source of the copies -- but with more work you could do that ... cheers, hotcold

( edit 1: clarify )
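
One possible way to "omit the first instance" in awk, offered as a sketch rather than as the script discussed above. The function name list_extra_copies is hypothetical. It assumes GNU find, xargs and md5sum, and file names without newlines or backslashes. Which instance counts as "first" is simply whichever sorts first by name, not the oldest file.

```shell
# Hypothetical helper: print every duplicate except the first of each group.
list_extra_copies() {
    dir=${1:-.}
    find "$dir" -maxdepth 1 -type f -print0 |
        xargs -0 -r md5sum |
        LC_ALL=C sort |
        awk '{
            hash = $1
            # md5sum prints a 32-char hash, two separator characters, then
            # the name, so the name starts at column 35 (spaces are kept).
            name = substr($0, 35)
            if (hash == prev) print name   # same hash as previous line: extra copy
            prev = hash
        }'
}
```

The output contains only names, one per line, so it can be reviewed before anything is handed to rm.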

xiphias
04-29-2007, 07:09 AM
Thanks, I will try that. Just one small question: I've read in many tutorials that the Linux filesystem keeps track of files through their modification date. Would it be possible to know the file-creation date? Does Linux store a file-creation date? Is there a command for finding it?

Thanks.

hotcold
04-29-2007, 09:42 AM
Hi.
would it be possible to know the file-creation date ? does linux store file-creation date. is there a command for knowing the creation date ?
There are many filesystems available in Linux, ext2/3, Reiser, etc. They generally will keep track of some times, but none that I know of will keep track of creation time. See the notes below, and I suggest you obtain a book and become familiar with resources like Wikipedia and Google to help answer your questions ... cheers, hotcold
bookkeeping information is typically associated with each file within a file system. The length of the data contained in a file may be stored as the number of blocks allocated for the file or as an exact byte count. The time that the file was last modified may be stored as the file's timestamp. Some file systems also store the file creation time, the time it was last accessed, and the time that the file's meta-data was changed.

http://en.wikipedia.org/wiki/Filesystem
Timestamps telling when the inode itself was last modified (ctime, change time), the file content last modified (mtime, modification time), and last accessed (atime, access time).

http://en.wikipedia.org/wiki/Inode
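
For reference, a small sketch of reading the three inode timestamps quoted above with GNU stat. The function name show_times is made up for this example. The %W field (birth time) is included only as an illustration: GNU stat prints 0 there when the filesystem or kernel does not report a creation time.

```shell
# Hypothetical helper: print a file's timestamps as seconds since the Epoch.
# Assumes GNU stat; the lowercase forms %y %x %z %w give human-readable dates.
show_times() {
    # %Y = mtime (content modified), %X = atime (last accessed),
    # %Z = ctime (inode changed),   %W = birth time, 0 if unknown
    stat -c 'mtime=%Y atime=%X ctime=%Z birth=%W name=%n' -- "$1"
}
```

Running show_times on two duplicates and comparing the mtime values is one way to guess which instance is older, though copying a file normally gives the copy a new mtime.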