Utility/script for finding duplicates?


romandas
07-21-2003, 02:44 AM
I'm running a file server on Slack9 that holds well over 2 million files. My users have a habit of sharing data by making multiple copies of the same file instead of symlinking to one main file. Needless to say, this practice is filling up the server fast.

I need either a way to get an ls listing of files with duplicate names and file sizes, or a script/utility that can take a large text file (the ls output for the entire drive), sort it by columns, and display the results.

Anyone know of any utilities or a way to tweak ls to do that?

mrBen
07-21-2003, 05:30 AM
Hmmm. Not sure of the complete answer, but here are a few pointers:

1. ls -1 will print in a single column, which might be handy for entering into a script.

2. The cmp utility compares 2 files, and tells you if they are the same.

3. sort will sort a file.

So you could hopefully put these together in a script and do the following:

ls the contents of the disk (hopefully stripping off directory names), then sort into alphabetical order. Pick out all the duplicate file names. Run a cmp to see if they are identical. Output the full directory name (somehow, perhaps using locate?).

OK - I'm waffling now. Hope this helps in some way.
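As a rough illustration of the pointers above (not a tested solution for a 2-million-file server), the "strip directory names, sort, pick out duplicates, then cmp" idea might look like the sketch below. The function names dup_names and same_content are made up for this example, and it assumes file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helper: print every base name that occurs more than once
# anywhere under the directory given as $1.
dup_names() {
    # find prints full paths; sed strips everything up to the last slash;
    # sort puts equal names next to each other; uniq -d keeps repeated ones.
    find "$1" -type f | sed 's|.*/||' | sort | uniq -d
}

# Hypothetical helper: succeed only if the two files are byte-for-byte identical.
same_content() {
    cmp -s "$1" "$2"
}
```

Something like `dup_names /home` would list the repeated names; same_content could then be run on each pair of candidates once their full paths are known.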

chrism01
07-21-2003, 06:14 AM
In shell, try this

Go to the root dir:
cd /

Do a recursive ls and sort:
ls -1R | sort >files.lis

I think that will give you a list of files without paths, which you need in order to get all same-name files listed consecutively.
Then script a check for multiple occurrences, use 'find' to locate each occurrence of a multiple, and check the file sizes.
Basically, you're trying to cross-check your whole disk. There's no easy option.

edit:
Actually, you could write a script to use the find cmd and then use the basename cmd to get the separate components in one pass, creating a file as you go (2 fields, filename, path). sort on the filename field.
/edit
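A minimal sketch of the find-plus-basename idea from the edit above, assuming paths contain no tabs or newlines. The function names name_path_list and dup_name_paths are invented for this example; awk is used here to split off the last path component instead of calling basename once per file.

```shell
#!/bin/sh
# Hypothetical helper: print "filename<TAB>full path" for every regular file
# under $1, sorted on the filename field.
name_path_list() {
    find "$1" -type f |
        awk -F/ '{ print $NF "\t" $0 }' |
        sort -t "$(printf '\t')" -k1,1
}

# Hypothetical helper: keep only the lines whose filename occurs more than once.
dup_name_paths() {
    name_path_list "$1" | awk -F'\t' '
        $1 == prev { if (!shown) print prevline; print; shown = 1 }
        $1 != prev { shown = 0 }
        { prev = $1; prevline = $0 }'
}
```

Running `dup_name_paths /` would then show each repeated filename next to every path where it lives, ready for a size or cmp check.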

I'd be inclined to do it all in one Perl script. It's exactly the sort of thing Perl was designed for, and it's very fast, e.g. about 80% of the speed of C.

I don't mind writing it for you if you can wait a couple of days. (I've got to go away).

romandas
07-21-2003, 08:26 PM
I have no experience with Perl, so I don't even know what the scripts at these links do, let alone whether they'll do what I want.

Here are the links:
http://www.perlmonks.org/index.pl?node_id=49819
http://www.perlmonks.org/index.pl?node_id=224748

Can someone skilled in Perl tell me if these will do what I need? (I don't necessarily need to delete the duplicates, I just need to report where they are, and maybe how much space they're consuming.)

Thanks for your help so far!

chrism01
07-24-2003, 12:37 PM
Here's a solution I wrote over the last day or two.
Ran it on my RH7.3 box: 200MHz, 2GB of files, 128MB RAM. Took 3 mins :-)

Feel free to pm me for more info/customizations.
The cmd line is

./find_duplicates.pl -s /

If you supply no/wrong args, you'll get the help text.

HTH

PS JL upload won't accept '.pl' extension :( so had to rename to '.txt'. You'll have to rename it back after you download it. sigh....