Which Photos are Actually Documents?
I had a removable hard disk from an old friend.
I wanted to pass the photos and videos on to someone else.
Unfortunately, it was also full of documents and private information.
It’s easy to get rid of the documents that can be identified by filename extension.
But what about photos of documents?
I could manually look at each of them, but there are more than 100,000 images!
This is how I did it. Note I am using the zsh shell.
When I approach this kind of problem, I imagine script wizards who write a single long line of ridiculous bash, run it, and the job is done.
I am not that person. I am, however, someone who can break a job down into single steps, test them on small batches, then kick off long-running jobs while I work on the next script down the chain.
That is an efficient, effective, accurate and testable approach to the problem. It might not yield the fastest runtime, but I was happy to spend a day or two writing the scripts and a day or two to run them in parallel.
That was still going to be a LOT faster than manually checking the photos!
I can also identify problems more easily if only one requirement is addressed at a time.
1. Establish Reversibility
The most important thing with any project is to be able to reverse any action.
Everyone has learnt this the hard way. Often more than once.
I made a copy of the entire archive and put it to one side.
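The post does not say how the copy was made. As a minimal sketch, assuming a cp that supports -R and -p plus diff -r, a helper like the one below copies the tree and then verifies it. The name backup_archive and both paths are hypothetical.

```shell
#!/bin/sh
# backup_archive SRC DST: copy SRC to DST, then verify the copy.
# Hypothetical helper; the name and paths are illustrative only.
backup_archive() {
    src=$1
    dst=$2
    # Refuse to overwrite an existing backup.
    if [ -e "$dst" ]; then
        echo "refusing to overwrite $dst" >&2
        return 1
    fi
    # -R copies recursively, -p preserves timestamps and permissions.
    cp -Rp "$src" "$dst" || return 1
    # diff -r exits non-zero if any file differs or is missing.
    diff -r "$src" "$dst" >/dev/null
}
```

It could be called as backup_archive /path/to/archive /path/to/archive-backup before any delete step is run.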
2. Identify Non-media Files And Delete Them
I identified the media files (movies and images) by extension that I wanted to keep (after expanding any zip files).
Then I removed the rest. The find command is very powerful and can do a lot (efficiently) in a single command.
# find non-media files (trailing backslashes continue the command)
find . -not -name "*.jpg" -not -name "*.JPG" -not -name "*.HEIC" \
    -not -name "*.json" -not -name "*.mp3" -not -name "*.mpg" -not -name "*.MOV" \
    -not -name "*.GIF" -not -name "*.PNG" -not -name "*.bmp" -not -name "*.wma" \
    -not -name "*.jpeg" -not -name "*.AVI" -not -name "*.m4a" \
    -not -name "*.wmv" -not -name "*.gif" -not -name "*.png" -not -name "*.TIF" \
    -not -name "*.MP3" -not -name "*.mov" -not -name "*.MPG" -not -name "*.avi" \
    -not -name "*.3gp" -not -name "*.BMP" -not -name "*.WMV" -not -name "*.mp4" \
    -not -name "*.mpeg" -not -name "*.wav" -not -name "*.ogg" -not -name "*.m4v" \
    -not -name "*.dxf" -not -name "*.svg" -not -name "*.eps" -not -name "*.ai" \
    -type f > out.txt
# find non-media files and copy
# (--parents is a GNU cp option; /NonMediaFiles must already exist)
find . -not -name "*.jpg" -not -name "*.JPG" -not -name "*.HEIC" \
    -not -name "*.json" -not -name "*.mp3" -not -name "*.mpg" -not -name "*.MOV" \
    -not -name "*.GIF" -not -name "*.PNG" -not -name "*.bmp" -not -name "*.wma" \
    -not -name "*.jpeg" -not -name "*.AVI" -not -name "*.m4a" \
    -not -name "*.wmv" -not -name "*.gif" -not -name "*.png" -not -name "*.TIF" \
    -not -name "*.MP3" -not -name "*.mov" -not -name "*.MPG" -not -name "*.avi" \
    -not -name "*.3gp" -not -name "*.BMP" -not -name "*.WMV" -not -name "*.mp4" \
    -not -name "*.mpeg" -not -name "*.wav" -not -name "*.ogg" -not -name "*.m4v" \
    -not -name "*.dxf" -not -name "*.svg" -not -name "*.eps" -not -name "*.ai" \
    -type f -exec cp --parents {} /NonMediaFiles \;
# delete non-media files
find . -not -name "*.jpg" -not -name "*.JPG" -not -name "*.HEIC" \
    -not -name "*.json" -not -name "*.mp3" -not -name "*.mpg" -not -name "*.MOV" \
    -not -name "*.GIF" -not -name "*.PNG" -not -name "*.bmp" -not -name "*.wma" \
    -not -name "*.jpeg" -not -name "*.AVI" -not -name "*.m4a" \
    -not -name "*.wmv" -not -name "*.gif" -not -name "*.png" -not -name "*.TIF" \
    -not -name "*.MP3" -not -name "*.mov" -not -name "*.MPG" -not -name "*.avi" \
    -not -name "*.3gp" -not -name "*.BMP" -not -name "*.WMV" -not -name "*.mp4" \
    -not -name "*.mpeg" -not -name "*.wav" -not -name "*.ogg" -not -name "*.m4v" \
    -not -name "*.dxf" -not -name "*.svg" -not -name "*.eps" -not -name "*.ai" \
    -type f -exec rm {} \;
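The three commands above repeat the same long extension list. A more compact variant is sketched below, assuming a find that supports -iname (GNU and BSD find both do, although it is not POSIX). The function name list_non_media is hypothetical and the extension list is shortened for illustration, so it is not a drop-in replacement for the full list.

```shell
#!/bin/sh
# list_non_media DIR: print regular files under DIR whose extension is NOT
# in the media list. -iname matches case-insensitively, so '*.jpg' also
# covers '*.JPG'. Hypothetical helper with a deliberately shortened list.
list_non_media() {
    find "$1" -type f ! \( \
        -iname '*.jpg' -o -iname '*.jpeg' -o -iname '*.png' -o \
        -iname '*.gif' -o -iname '*.heic' -o -iname '*.mov' -o \
        -iname '*.mp4' -o -iname '*.avi' \
    \)
}
```

Running list_non_media . > out.txt gives a list to review before anything is copied or deleted, and the pattern list only has to be maintained in one place.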
After this process, I removed any directories that were left empty.
# find empty directories
find . -empty -type d > emptydirs.txt
# delete empty directories
find . -empty -type d -delete
Next we need OCR. I installed Tesseract OCR, https://tesseractocr.org/.
Depending on your Linux installation, you may or may not also need to choose a language pack from about fifty.
I picked the one that ended with -en for English.
3. Identify the Dimensions of the Job
We can run tesseract like this.
# tesseract usage
tesseract inputfile outputfilename
By default this writes the recognized text to outputfilename.txt.
I only want to OCR image files so I identified them with this script.
#!/bin/bash
#scango.sh
# get images, not movies, etc
find . -type f -exec sh -c '
for file do
if file --mime-type -b "$file" | grep -q "image/"; then
echo "$file"
fi
done
' sh {} + > "$1"
I ran the script from the top directory of the archive and got a count of the files I had to run OCR on (about 110,000).
~/scripts/scango.sh allfiles.lst
wc -l allfiles.lst
I created a little script to run against each image file in a list.
In testing, I discovered double-quotes were needed around $arg to handle filenames with spaces in them.
#!/bin/bash
#tessgo.sh
# IFS= and -r keep leading spaces and backslashes in filenames intact
while IFS= read -r arg; do
    tesseract "$arg" "$arg"
done
Then I created a few tests to estimate the length of time of the OCR process.
# testing one file with imagefiles1.lst
~/scripts/tessgo.sh < imagefiles1.lst
# testing a few files with imagefiles.lst
~/scripts/tessgo.sh < imagefiles.lst
tessgo.sh seemed to process about three files per second.
So this process would take about 10 hours.
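The estimate follows from the two figures above, roughly 110,000 files at about three files per second. As a quick check in shell integer arithmetic:

```shell
#!/bin/sh
# Estimate OCR runtime from the figures in the post.
files=110000        # approximate number of image files
rate=3              # files per second measured in the small tests
seconds=$(( files / rate ))
hours=$(( seconds / 3600 ))
minutes=$(( (seconds % 3600) / 60 ))
echo "about ${hours}h ${minutes}m"
# prints: about 10h 11m
```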
Still better than manually checking the files.
4. Kick off the OCR job
# start main job while continuing to work on other scripts
# expect finish about 10pm
~/scripts/tessgo.sh < allfiles.lst
This job produces a txt file for every image file throughout the directory structure.
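tessgo.sh processes one file at a time. The introduction mentions running jobs in parallel, and one way that might be done is sketched below. It assumes an xargs that supports -0 and -P (GNU and BSD xargs do, although -P is not POSIX). The function name ocr_parallel and the job count are hypothetical, and this is not the command the run above used.

```shell
#!/bin/sh
# ocr_parallel LIST JOBS [CMD...]: run CMD once per path listed in LIST,
# JOBS at a time. CMD defaults to tesseract and is invoked as
# "CMD path path", the same argument pattern as tessgo.sh.
# Hypothetical helper; not part of the original scripts.
ocr_parallel() {
    list=$1
    jobs=$2
    shift 2
    # Default to tesseract when no command is given.
    [ "$#" -gt 0 ] || set -- tesseract
    # Turn newline-separated paths into NUL-separated ones so xargs
    # does not split on spaces or quotes inside filenames.
    tr '\n' '\0' < "$list" | xargs -0 -P "$jobs" -I{} "$@" {} {}
}
```

For example, ocr_parallel allfiles.lst 4 would keep four tesseract processes busy. Whether that actually shortens the run depends on the number of cores and on disk speed.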
5. Processing the Output
The OCR was amazing. It identified a photo as containing words even if they were just a few words on a toy box in the background of the image.
While I waited for the OCR to finish, I wrote and tested little scripts to inspect the output txt files, and remove any images that were potential documents.
#!/bin/bash
# countwords.sh
# print the word count (wc -w) of each listed file
while IFS= read -r arg; do
    wc -w "$arg"
done
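#!/bin/bash
# wcountfilter.sh
# print each listed file that contains at least $1 words
minwords="$1"
while IFS= read -r arg; do
    # read the file via stdin so wc prints only the count, not the filename
    words=$(wc -w < "$arg")
    if [ "$words" -ge "$minwords" ]; then
        echo "$arg"
    fi
done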
#!/bin/bash
# renamefiles.sh
# strip the trailing .txt to recover the image filename
while IFS= read -r arg; do
    echo "${arg%.txt}"
done
#!/bin/bash
# delimgfiles.sh
# delete each listed file; -- protects against names starting with a dash
while IFS= read -r arg; do
    rm -- "$arg"
done
# check word count in each file
~/scripts/countwords.sh < textfiles.lst
# test parameter + input file
~/scripts/wcountfilter.sh 15 < textfiles.lst
# after trying thresholds from 15 to 25, write files with 25 or more words to deltextfiles.lst
~/scripts/wcountfilter.sh 25 < textfiles.lst > deltextfiles.lst
# write list of files with .txt removed
~/scripts/renamefiles.sh < deltextfiles.lst > delimgfiles.lst
# delete image files after spot checking
~/scripts/delimgfiles.sh < delimgfiles.lst
# remove all text files
find . -name "*.txt" -delete
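The commands above read textfiles.lst, but the post does not show how that list was built. One plausible way, sketched here as an assumption, is to derive it from allfiles.lst: tessgo.sh writes its output next to each image as the image path plus .txt, so the list can be produced by appending .txt and keeping only the files that exist. The function name list_ocr_outputs is hypothetical.

```shell
#!/bin/sh
# list_ocr_outputs IMAGELIST: for each image path in IMAGELIST, print the
# matching OCR output path (image path + ".txt") if that file exists.
# Hypothetical helper; the post does not show how textfiles.lst was made.
list_ocr_outputs() {
    while IFS= read -r img; do
        if [ -f "$img.txt" ]; then
            printf '%s\n' "$img.txt"
        fi
    done < "$1"
}
```

It would be run after the OCR job as list_ocr_outputs allfiles.lst > textfiles.lst. Building the list from allfiles.lst rather than searching for every .txt file means unrelated text files in the tree are not picked up.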
With over 100,000 photos I could afford to lose a few rather than let an important document slip through. I set my threshold word count appropriately after testing.
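One way to test a threshold before deleting anything is to count how many files each candidate value would flag. The sketch below is an assumption about how that testing could be done, not the method used in the post; count_flagged is a hypothetical name, and it applies the same "at least N words" rule as wcountfilter.sh.

```shell
#!/bin/sh
# count_flagged LIST MIN: print how many text files named in LIST
# contain at least MIN words. Hypothetical helper for comparing thresholds.
count_flagged() {
    min=$2
    n=0
    while IFS= read -r f; do
        [ -f "$f" ] || continue
        # $(( ... )) normalizes wc output, which may carry leading spaces.
        words=$(( $(wc -w < "$f") ))
        if [ "$words" -ge "$min" ]; then
            n=$((n + 1))
        fi
    done < "$1"
    echo "$n"
}

# Example use (assumes textfiles.lst exists):
#   for t in 10 15 20 25; do
#       echo "$t words: $(count_flagged textfiles.lst "$t") files"
#   done
```

A sharp drop in the count between two neighboring thresholds suggests where incidental text, such as a few words on a toy box, gives way to real documents.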
And there it is. A sanitized directory tree with over 100,000 image files, plus many thousands more home video files.