# script for detecting files in a directory but not in another



## fluca1978 (Jun 9, 2014)

Hi all,
I have two very large directory trees (around 60 GB each): the first one, called _src_, has been copied and reorganized into the second one, _dst_. Because of the reorganization, a file under a given path in _src_ could be at a very different path in _dst_. Is there a quick way to see which files are only in _src_ and not in _dst_?
I was thinking of writing a kind of directory scanner that stores each file's checksum and then performs a kind of left join between the two sets, but I hope there is a ready-made tool for this purpose. Please note that the files are binary (images and video).
Any ideas are welcome.

Thanks


----------



## kpa (Jun 9, 2014)

I can't give you a better answer now but net/rsync uses exactly the kind of comparisons you're after when it determines if a file has to be copied over or not. Maybe it can be adapted to your needs with the right options.


----------



## asteriskRoss (Jun 9, 2014)

It's a little rough and ready but how about the following series of commands (tested in tcsh(1)):

Generate a list of files and associated SHA-1 hashes from all files in src directory and descendants with find(1) and sha1(1):

```
# find /path/to/src -type f -exec sha1 {} \; > src.list
```

Generate a list of SHA-1 hashes from all files in dst directory and descendants with find(1) and sha1(1):

```
# find /path/to/dst -type f -exec sha1 -q {} \; > dst.list
```

Print a list of lines in src.list where the hash is not found in dst.list with fgrep(1):

```
# fgrep -v -f dst.list src.list
```

The output could be formatted with sed(1) or similar if required.
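
As one possible way to do that formatting, the sketch below strips the `SHA1 (...) = hash` wrapper that sha1(1) prints without `-q`, leaving only the path. The function name `strip_sha1_wrapper` is made up for illustration, and the sketch assumes paths contain no newlines.

```shell
# Hypothetical helper: reduce "SHA1 (/some/path) = <hash>" lines to "/some/path".
# Assumes the BSD-style output format of sha1(1) when run without -q.
strip_sha1_wrapper() {
    sed -e 's/^SHA1 (\(.*\)) = [0-9a-f]*$/\1/'
}

# Possible usage:
#   fgrep -v -f dst.list src.list | strip_sha1_wrapper
```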


----------



## wblock@ (Jun 9, 2014)

mtree(8) can generate lists of files and checksums.


----------



## SirDice (Jun 10, 2014)

Are the files renamed too? Or are they just moved around?

`find /path/to/src -type f -exec basename {} \; | sort | uniq > src.list`
`find /path/to/dst -type f -exec basename {} \; | sort | uniq > dst.list`
`diff src.list dst.list`

Not really tested but I think you get the idea.
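
Since diff(1) also reports names that exist only in dst, comm(1) may be a closer fit when only the src-only names matter. This is a minimal sketch, assuming files were moved but not renamed; the function name `only_in_src` is illustrative and not part of any tool.

```shell
# Hypothetical helper: print base names found under $1 but nowhere under $2.
# Assumes files were moved but not renamed, and that names contain no newlines.
only_in_src() {
    src_names=$(mktemp) || return 1
    dst_names=$(mktemp) || return 1
    find "$1" -type f -exec basename {} \; | sort -u > "$src_names"
    find "$2" -type f -exec basename {} \; | sort -u > "$dst_names"
    # -2 suppresses lines found only in dst, -3 suppresses lines common to both.
    comm -23 "$src_names" "$dst_names"
    rm -f "$src_names" "$dst_names"
}

# Possible usage:
#   only_in_src /path/to/src /path/to/dst
```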


----------



## fluca1978 (Jun 12, 2014)

This is the script I've produced so far, but quite frankly it is a dirty piece of code, so I'm thinking of rewriting it in Perl.
Thanks to everyone.


```
#!/bin/sh

# Check arguments: at least two are required, and both must be directories.
if [ $# -lt 2 ]
then
    echo "Usage: $0 src_directory dst_directory"
    exit 1
fi

SRC_DIR=$1
DST_DIR=$2

if [ ! -d "$SRC_DIR" ] || [ ! -d "$DST_DIR" ]
then
    echo "Please specify only directories! [$SRC_DIR] [$DST_DIR]"
    exit 2
fi

# Setup: index files holding one "hash  path" line per file.
SRC_DB="/tmp/src.$$"
DST_DB="/tmp/dst.$$"

echo "Indexing source directory [ $SRC_DIR ] => $SRC_DB"
find "$SRC_DIR" -type f -exec md5sum {} + > "$SRC_DB"

echo "Indexing target directory [ $DST_DIR ] => $DST_DB"
find "$DST_DIR" -type f -exec md5sum {} + > "$DST_DB"

echo "Doing cross-lookup..."
# Report every source file whose hash does not appear in the target index.
while read -r hash file
do
    if ! grep -q "$hash" "$DST_DB"
    then
        echo "Source file [ $file $hash ] is missing!"
    fi
done < "$SRC_DB"

echo "All done!"
```


----------



## SirDice (Jun 12, 2014)

That probably works, but it's not very efficient. It runs grep(1) once for every source file, which can have quite an impact if the file list gets large. To speed things up you could write one line per file, with the hash first and the filename after it. Then sort(1) both lists and run diff(1) against the sorted lists. That should be significantly faster.


----------



## fluca1978 (Jun 12, 2014)



			
SirDice said:

> Then sort(1) both files and run a diff(1) against the sorted lists. That should be significantly faster.



Uhm... since file names may have changed, and directory names certainly have, I believe a diff(1) would produce too many false positives. Am I wrong?


----------



## SirDice (Jun 12, 2014)

No, you're not wrong. But you could coax diff(1) into only looking at the first column. The hashes alone should be enough to tell the files apart.
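
One way to compare only the first column is sketched below. It uses cut(1) and comm(1) in place of diff(1), and assumes both index files hold one `hash  path` line per file, as md5sum prints them. The function name `missing_hashes` and the file names in the usage note are illustrative only.

```shell
# Hypothetical helper: $1 is the src index, $2 is the dst index.
# Each line is assumed to start with the hash, followed by a space.
# Prints the hashes that occur in the src index but not in the dst index.
missing_hashes() {
    src_hashes=$(mktemp) || return 1
    dst_hashes=$(mktemp) || return 1
    # Keep only the first column (the hash) and sort it, dropping duplicates.
    cut -d ' ' -f 1 "$1" | sort -u > "$src_hashes"
    cut -d ' ' -f 1 "$2" | sort -u > "$dst_hashes"
    # -23 leaves only the lines unique to the first (src) list.
    comm -23 "$src_hashes" "$dst_hashes"
    rm -f "$src_hashes" "$dst_hashes"
}
```

To get the file names back, the resulting hashes could be saved and looked up in the src index in a single pass, for example `missing_hashes src.db dst.db > missing.list` followed by `fgrep -f missing.list src.db`.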


----------



## junovitch@ (Jun 13, 2014)

I've used sysutils/duff for this basic purpose. It should be a bit more efficient than the other approaches, since it first checks that files are the same size before attempting a more detailed comparison. It looks for duplicate files, so the list it produces would contain everything that is in both trees. Anything outside that list should be a file unique to the src directory.


----------

