Module: Duplicate Photo Checker via md5sum

jnash

Joined: 2004-08-02
Posts: 814
Posted: Tue, 2012-04-17 17:55

An attempt at an MD5 module to do some duplicate checking/etc.
http://codex.gallery2.org/Gallery3:Modules:dupcheck

Here is the initial release. It has only been tested locally on my gallery, so YMMV, but I think it's good.

Here is what the module does as of now:

1) Upon activation, it adds a new table to the DB (fullsize_md5sums) and sets a warning that photos need to be scanned for missing md5s
2) Adds a maintenance task that will allow you to scan all photos and insert any missing MD5s in the DB
3) Upon adding (or changing) an image to the gallery, will insert the MD5 into the DB automatically
4) Upon deleting an image from the gallery, will delete the MD5 from the DB for that image
5) Adds a 'Gallery Photo Duplicates' menu item to ADMIN->CONTENT menu which currently loads a dynamic album with duplicate thumbnails
-5a) On the duplicate item album page, shows the thumbnail of the image that has duplicates
-5b) Under the thumbnail there are links to each image location where it's duplicated - the links show which album it's in
-5c) This checks for duplicates across ALL of the gallery, not just within a particular album (I may code a per-album check later, but not now)

ADTL:
6) Shows what a bad coder I am via the php code... :)
7) Works from what I can tell/test/etc.
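
For readers who want to see the idea behind the list above, here is a minimal standalone sketch of duplicate detection by MD5. It is not taken from the dupcheck module: the function name find_duplicate_files is hypothetical, and it works on plain file paths rather than on the fullsize_md5sums table.

```php
<?php
// Hypothetical helper (not from the dupcheck module): group file paths by the
// MD5 of their contents and keep only the groups with more than one file.
function find_duplicate_files(array $paths): array {
  $by_md5 = [];
  foreach ($paths as $path) {
    if (!is_readable($path)) {
      continue;                    // skip unreadable files instead of storing a bad sum
    }
    $md5 = md5_file($path);        // 32-character hex string, or false on failure
    if ($md5 === false) {
      continue;
    }
    $by_md5[$md5][] = $path;
  }
  // A hash shared by two or more paths marks a set of identical files.
  return array_filter($by_md5, function ($group) {
    return count($group) > 1;
  });
}

// Usage: three temporary files, two of them with identical contents.
$dir = sys_get_temp_dir();
$a = tempnam($dir, "dup"); file_put_contents($a, "same bytes");
$b = tempnam($dir, "dup"); file_put_contents($b, "same bytes");
$c = tempnam($dir, "dup"); file_put_contents($c, "other bytes");
$duplicates = find_duplicate_files([$a, $b, $c]);
array_map("unlink", [$a, $b, $c]);   // clean up the temporary files
```

The result maps each shared MD5 to the list of paths that carry it, which is roughly the shape the duplicates album needs: one entry per duplicated image, with every location listed under it.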



--------------------------------------------------------------------------------


CHANGE LOG:
------------------
v7 - Will now provide a warning when uploading a duplicate image via the normal web way and provide an option to delete it. (Will display the warning after you click DONE when uploading) - thanks Sangie
---------------------------------------------------------------------------------------------
v6 - Cosmetic change when rounded percent = 0, changed to 'less than 1%' - thanks AvutonOlrich
---------------------------------------------------------------------------------------------
v5 - Added a warning when the MD5SUM of an image is invalid - now admin will see the offending image to solve in the event of an invalid image or something - thanks AvutonOlrich
---------------------------------------------------------------------------------------------
v4 - Added a task in the maintenance run to automatically remove invalid MD5SUM entries from the DB - thanks AvutonOlrich
---------------------------------------------------------------------------------------------
v3 - Added a 'delete' link in the context menu under each photo - thanks cwallace91472
---------------------------------------------------------------------------------------------
v2 - Fix for when no duplicates are found - thanks floridave!
---------------------------------------------------------------------------------------------
v1 - Initial Release

 
cchiappa

Joined: 2008-08-11
Posts: 101
Posted: Wed, 2012-04-18 03:07

This module would be useful to those of us who came from Gallery2 and used the Links module, which let a picture reside in multiple albums, but which gets imported into G3 as two separate but identical pictures.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Fri, 2012-04-20 22:21

Did I mention I'm not a coder?

 
dyingdemon

Joined: 2009-08-09
Posts: 25
Posted: Sun, 2012-04-22 07:10

dude, you have my full RAWR support :p. I have 17k images, and with a user base of 1.8k users who are uploading 200-300 images a day, odds are that some are duplicates. I'm the only admin, so I spend huge chunks of time just making sure my site is running, and just recently got it open to the public again. Your module will be a god in human form. (I only have one question: what kind of server resources are we talking about here? I'm on a shared host, so I have metered server resources and need to know just how much this would take to use, as md5 scanning/comparing, even through MySQL, can take decent chunks of server resources.)

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Sun, 2012-04-22 18:07

dyingdemon, my tests so far (I have 26k images) take about the same scanning time as the exif modules do (perhaps a bit less). I use PHP's MD5 capabilities for computing the md5sums, which appears not to use too much CPU or memory. Unfortunately, I didn't time the runs or anything.

The initial MD5sum scan will take a while, but after that, it does the scan for each upload/modification, and that's cake.

For the queries, they are pretty quick and not that resource intensive at all. The original method I was using was very slow on the duplicate polling, however, I've refined it quite a bit, and it's speedy and not that hungry.

I've attached the module into the original post. So you're welcome to test it out. (It's only been -tested- on my gallery, so let me know if there are any issues... I think I've cleaned up any specifics to my gallery, but this is my 'first' public module)

 
mattdm

Joined: 2005-07-22
Posts: 181
Posted: Fri, 2012-07-27 18:47

Feature request suggestion for this one: since you're storing checksums, it'd be awesome to have a validation function, which makes sure photos haven't become corrupted on disk.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Fri, 2012-07-27 19:06

mattdm: can you elaborate a bit? I'm guessing that you'd like a task that will check for corrupted files based on MD5 changes?

 
mattdm

Joined: 2005-07-22
Posts: 181
Posted: Fri, 2012-07-27 19:33
jnash wrote:
mattdm: can you elaborate a bit? I'm guessing that you'd like a task that will check for corrupted files based on MD5 changes?

Exactly.

 
floridave

Joined: 2003-12-22
Posts: 27300
Posted: Fri, 2012-07-27 23:06

I like one word details. :-)
Elaborate: Develop or present (a theory, policy, or system) in detail.

Dave

_____________________________________________
Blog & G2 || floridave - Gallery Team

 
mattdm

Joined: 2005-07-22
Posts: 181
Posted: Sat, 2012-07-28 01:26
Quote:
Elaborate: Develop or present (a theory, policy, or system) in detail.

Well, that really was exactly that. This module already saves checksums. It would be nice to have a maintenance task that would verify those checksums.

I can imagine several possible modes of operation:

  • a manual-run option where you select albums to verify
  • a batch option which tests a selected set of albums periodically
  • a batch option which iterates through all files, but only a few per run, so that over the course of a month or so all files are checked but there's not a huge overall io increase on the server.
  • a background helper process which could be scheduled outside of gallery using cron
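
As an illustration of the third mode (a few files per run), here is a minimal sketch in plain PHP. It assumes the stored checksums are available as a path => MD5 array; the function name verify_batch and its offset/limit parameters are hypothetical and are not part of the dupcheck module.

```php
<?php
// Hypothetical sketch: $stored maps file path => MD5 recorded earlier.
// Re-hash at most $limit files starting at $offset and return the paths whose
// current MD5 no longer matches, so a scheduled job can walk the whole
// collection a few files per run.
function verify_batch(array $stored, int $offset, int $limit): array {
  $corrupt = [];
  foreach (array_slice($stored, $offset, $limit, true) as $path => $expected) {
    $actual = is_readable($path) ? md5_file($path) : false;
    if ($actual !== $expected) {
      $corrupt[] = $path;          // missing, unreadable, or contents changed
    }
  }
  return $corrupt;
}

// Usage: record two checksums, then simulate on-disk corruption of one file.
$dir  = sys_get_temp_dir();
$good = tempnam($dir, "chk"); file_put_contents($good, "intact");
$bad  = tempnam($dir, "chk"); file_put_contents($bad, "original");
$stored = [$good => md5_file($good), $bad => md5_file($bad)];
file_put_contents($bad, "flipped bits");
$corrupt = verify_batch($stored, 0, 10);      // check up to 10 files
$first_only = verify_batch($stored, 0, 1);    // a smaller batch sees only $good
array_map("unlink", [$good, $bad]);           // clean up the temporary files
```

A caller would persist the offset between runs and advance it by the batch size, which keeps the extra I/O per run bounded.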
 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Mon, 2012-07-30 00:31

I don't think this module is the place for that function. I haven't heard of anyone having problems with images becoming corrupt, but I suppose this could be useful. Unfortunately, I don't have the time to create a new module at the moment, and I don't think it will fit well inside of this module. However, the data is there, so should someone want to tackle it, I say go for it!

 
mattdm

Joined: 2005-07-22
Posts: 181
Posted: Mon, 2012-07-30 02:13
jnash wrote:
I haven't heard of anyone having problems with images becoming corrupt, but I suppose this could be useful.

It's not unheard of, and as datasets get bigger, random bit flips become more likely. Some new file systems (like zfs) do checksumming inherently, but most don't, yet.

I think probably the ideal architecture would be to have three modules, one which focuses on collecting and managing checksums, one which deals with duplicates (perhaps eventually having clever features for automatically linking them), and a third which uses the checksums for data integrity.

 
mattdm

Joined: 2005-07-22
Posts: 181
Posted: Mon, 2012-07-30 02:47
 
cwallace91472

Joined: 2013-03-08
Posts: 15
Posted: Sat, 2013-03-09 23:28

Is there a way to add a quick 'delete' button to the images to remove them??

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Sun, 2013-03-10 02:33

If you're logged in with administrative/owner credentials, then you should be able to delete via the dropdown menu under the photo. If you're not the owner/admin, then it can't be done.

 
cwallace91472

Joined: 2013-03-08
Posts: 15
Posted: Sun, 2013-03-10 04:41

When I click the duplicates module and it brings up a screen full of images that are apparently duplicates it has a little popup menu on each image and it links to the location of the file...but no option to delete unless it is at the end of the filename and these are long filenames...

I am going over everything...I am new to this system so it might be something simple I am missing...:)

 
cwallace91472

Joined: 2013-03-08
Posts: 15
Posted: Sun, 2013-03-10 04:49

Yeah...no sign of a delete there...logged in as admin...obviously or I would not be able to get to the module...;)

I uploaded all of them so I do in fact own them on the site...

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Sun, 2013-03-10 13:18

hmmm, let me take a look at the module and see what I can do. You are correct.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Sun, 2013-03-10 14:00

Updated v3:

Added 'delete' link in the context menu

See: http://codex.galleryproject.org/Gallery3:Modules:dupcheck

 
cwallace91472

Joined: 2013-03-08
Posts: 15
Posted: Sun, 2013-03-10 16:20

Tada!! Works like a charm! :) Hell of a response time also...:)

Would love a way to bulk delete also...;) hehe...ok...I am pushing it...

THANKS!!
Chris

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Sun, 2013-03-10 18:19

One can only wish! Maybe one day.

 
AvutonOlrich

Joined: 2005-03-10
Posts: 29
Posted: Tue, 2013-04-02 00:39

Found a (two part) bug. Somehow, the gallery didn't have permissions to a directory of photos in my gallery. Gallery3 handles this well; the md5sum scan did not. Apparently, there are invalid md5sums in the database now. Now that I've fixed the permissions, a rescan did not help this problem.

So, the way I see it: first, handle files with bad permissions (or that are otherwise unreadable) gracefully; second, please add a hack to detect invalid md5sums (is the entry empty? Not sure).

Otherwise, got rid of 100+ duplicates here, I really appreciate this module.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Tue, 2013-04-09 02:26

I can't seem to replicate this issue. The only way an MD5SUM is added is if the item exists in the database and is there. Can you specify what you mean by invalid md5sums, and what you mean by gallery not having permissions? (were the permissions messed up after the item was added?)

I think I understand, but can't figure out how to replicate it to fix it.

Also, I'm assuming this is showing up when you go to the maintenance tasks?

 
AvutonOlrich

Joined: 2005-03-10
Posts: 29
Posted: Tue, 2013-04-09 02:49

So what's happening here is I have a number of images where the permissions were messed up after the images were added to the gallery. When I noticed that an entire folder of images in my gallery were marked as duplicates, I began looking around until I found the problem (bad permissions). Now that I've fixed the permissions problem, the images show up in gallery again (not as a broken icon). Now there's no way to get these images marked as not duplicates. In the duplicate images page, the pull up window for duplicates shows them all under one box, which shows the broken image icon with the dialog "photo".

I shouldn't have said they have invalid md5sums, that's my assumption.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Tue, 2013-04-09 13:24

You should be able to empty out the fullsize_md5sums table in the database safely. You'll have to run the maintenance task to extract the md5sums back afterwards.

Unfortunately, there isn't a way for me to 'catch' this that I can think of off hand. However, I have added a little piece in the code under the maintenance to remove any md5sums that aren't 'valid' (not 32 char length) - Other than that, there isn't a way to check if the md5sum is 'wrong' based on any changes afterwards without emptying the table out. I could add a 'rebuild' but I haven't had anyone present the issue you've encountered, and it clutters up the maintenance interface I think... best and easiest is to clear out the table for fullsize_md5sums
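
To make the 'valid' test above concrete, here is a minimal sketch. The post only mentions a 32-character length check; additionally requiring hexadecimal characters is an assumption on my part, and the function name is_valid_md5 and the sample rows are hypothetical.

```php
<?php
// Hypothetical helper: treat a stored sum as well formed only if it is exactly
// 32 hexadecimal characters. The hex requirement goes beyond the length-only
// check described in the post and is an assumption.
function is_valid_md5($sum): bool {
  return is_string($sum) && preg_match('/\A[0-9a-f]{32}\z/i', $sum) === 1;
}

// Usage: sample rows keyed by item id (made-up data for illustration).
$rows = [1 => md5("photo one"), 2 => "0", 3 => ""];
$invalid_ids = array_keys(array_filter($rows, function ($sum) {
  return !is_valid_md5($sum);
}));
// $invalid_ids now lists the item ids whose stored sums should be removed
// and recalculated.
```

Note that a check like this only catches malformed entries. A well-formed sum that no longer matches the file can only be detected by hashing the file again.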

 
AvutonOlrich

Joined: 2005-03-10
Posts: 29
Posted: Thu, 2013-04-11 12:46
Quote:
However, I have added a little piece in the code under the maintenance to remove any md5sums that aren't 'valid' (not 32 char length) - Other than that, there isn't a way to check if the md5sum is 'wrong' based on any changes afterwards without emptying the table out.

OK, I finally got the time to look in the database to see what's going on, the 'itemmd5' for the corresponding items value is '0', so, if you do add something like that I would love to use it and report back.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Thu, 2013-04-11 15:46

That's been corrected in the recent code. I'll post up an update module shortly when I get back home.

James

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Thu, 2013-04-11 15:51
 
AvutonOlrich

Joined: 2005-03-10
Posts: 29
Posted: Fri, 2013-04-12 00:04

v4 fixed most of my problems, but the method it uses shows a new problem (and honestly probably not something your module should handle).

Apparently, I've had filesystem corruption which has wiped away a few of my files (some were backed up, the rest I didn't care that much about). So, I would run the maintenance item repeatedly, and there would be 13 md5sums which needed to be calculated (again, they were all '0's in the database). I hacked it to tell me the 13 problem files (the ones that didn't exist at all), and I'm deleting those items from the gallery database and re-adding them if I had a backup. Anyhow, not sure this is something you should worry about, but in case you care.

Also, more cosmetic, when I look at the maintenance item, it says you have 13 items (0%) to process (the 0% is a little strange ;)).

Anyhow, again, thanks for the excellent module. Extremely helpful, and will continue to be.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Fri, 2013-04-12 01:02

Yea, if 13 items is less than 1% of your gallery images, then yes, it does look weird, but that's (rounded) math :)

I do have some ideas in mind on checking when adding/updating an md5sum, and if invalid, provide a warning. This would warn you of the offending images that the MD5 failed on...

I may work on this tonight.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Fri, 2013-04-12 01:28

^^^^^^^^^^ Fixed in v5 ^^^^^^^^^^ (well, not the cosmetic part: 13 of 13000 is still < 1% and will show as 0%)

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Fri, 2013-04-12 02:22

v6 = cosmetic change: when rounded percent = 0, changed to display 'less than 1%' - see above
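
The rounding behaviour discussed here can be sketched as follows. This is only an illustration of the arithmetic, not the module's actual code, and the function name remaining_percent_label is hypothetical.

```php
<?php
// Hypothetical helper mirroring the v6 cosmetic change: when the rounded
// percentage of items left to process is 0 but items remain, say so in words.
function remaining_percent_label(int $remaining, int $total): string {
  if ($total <= 0 || $remaining <= 0) {
    return "0%";                   // nothing left to process
  }
  $percent = (int) round(100 * $remaining / $total);
  return $percent === 0 ? "less than 1%" : $percent . "%";
}

// Usage: 13 of 13000 items is 0.1%, which rounds to 0.
$label_small = remaining_percent_label(13, 13000);
$label_half  = remaining_percent_label(6500, 13000);
$label_none  = remaining_percent_label(0, 13000);
```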

 
jeffmcclain

Joined: 2010-09-01
Posts: 105
Posted: Fri, 2013-04-26 03:54

This just might be the ticket I've been looking for! Thanks! I want to be able to have server-add images updated, but sometimes I upload duplicates I already put on the server if they are in the same directory (i.e. directory 2013 is still being added to). So, would love to adapt the serveradd to ignore adding images to the gallery if there is an MD5 match already IN the gallery...maybe you could check on how hard to add in there? I may just browse the server add and call the MD5 hash and compare to the database if I get time to look at the server add code and compare to your MD5 crunch. Thanks for adding this!

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Fri, 2013-04-26 13:30

@jeffmcclain: Sounds like a good idea. I'm a bit busy at the moment, but may have a look some time in the future. I'm kinda waiting for 3.1 upgrades before I go in to any more mods of the modules myself. This would probably be a relatively simple addition to the server add module however. I'd say give it a try and see what you come up with. I'll be happy to help when I can.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Fri, 2013-04-26 14:50

@jeffmcclain:
Here is a very quick and rude patch to: modules/server_add/controllers/server_add.php
It works, but could use cleaning up...

        } else {
          try {
            $extension = strtolower(pathinfo($name, PATHINFO_EXTENSION));
            if (legal_file::get_photo_extensions($extension)) {

              // JAMES - PATCH FOR DUPLICATE CHECK
              // If dupcheck is active, look up this file's MD5 among the stored sums.
              $is_duplicate = false;
              if (module::is_active("dupcheck")) {
                $fmd5 = md5_file($entry->path);
                $dupcheck = ORM::factory("fullsize_md5sum")
                  ->where("itemmd5", "=", $fmd5)
                  ->find_all();
                $is_duplicate = $dupcheck->count() > 0;
              }
              if ($is_duplicate) {
                // Already in the gallery: log it, warn the admin, and skip the import.
                $task->log("Skipping duplicate file: {$entry->path}");
                message::warning(t("Did not add (Duplicate): ".$entry->path));
                $entry->item_id = 0;
              } else {
              // END JAMES

                $photo = ORM::factory("item");
                $photo->type = "photo";
                $photo->parent_id = $parent->id;
                $photo->set_data_file($entry->path);
                $photo->name = $name;
                $photo->title = $title;
                $photo->owner_id = $owner_id;
                $photo->save();
                $entry->item_id = $photo->id;

              // JAMES - PATCH DUP 2
              }
              // END JAMES

            } else if (legal_file::get_movie_extensions($extension)) {
              $movie = ORM::factory("item");
              $movie->type = "movie";
              $movie->parent_id = $parent->id;
 
Sangie

Joined: 2010-10-31
Posts: 28
Posted: Sat, 2013-04-27 03:14

Hey jnash. Thanks for programming this. I had wanted something like this for awhile.

Do you think there's a way for it to check on every upload if the image has already been added? So after the upload it could say "Image deleted as it's a dupe of an image already in the gallery" or something along those lines?

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Sat, 2013-04-27 23:59

v7 is now updated to display a warning when a duplicate image is uploaded (via the normal web method) and provide an option to delete it.

(will display once you click DONE from uploading images)

 
Sangie

Joined: 2010-10-31
Posts: 28
Posted: Sun, 2013-04-28 03:47

Wow I didn't think you would. Thank you so much =)

 
jeffmcclain

Joined: 2010-09-01
Posts: 105
Posted: Sun, 2013-04-28 16:09

Thanks, James! I'm traveling right now, but will check it when I get back! Thx again!

 
dyingdemon

Joined: 2009-08-09
Posts: 25
Posted: Wed, 2013-08-07 23:27

Just thought I'd drop in to say thank you. I've been using the plugin for a while now, and like I said in my previous post, it's a god in human form. I did have to restore from a backup a few months ago, though, and for whatever reason it didn't restore the MD5 table for the database. I ended up re-scanning and (at my then current 250,331 images) the only real drain, as you said, was the first time. I ended up using the full resources allotted to me for about 8 min, and after that I rarely spike. When I first scanned at ~34,000 images it found over 300 duplicates which I missed. So once again, thanks.

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Thu, 2013-08-08 02:28

Glad to hear it's being used and appreciated. Thanks!

 
joecooper84

Joined: 2013-08-22
Posts: 7
Posted: Thu, 2013-08-22 00:22

Hi jnash. I love this plugin. It has helped a lot with my OCD and wanting things in order :P
Unfortunately I am migrating a site, and uploading several hundred photos at once. Earlier I ended up duplicating 500 images. Going through one by one to answer "Delete this photo?" then click the "yes" confirmation and wait for the page to reload can take a very long time. I ended up deleting the album and uploading them all again, careful of the duplicates.
Is there a mass-confirm, or mass-select and delete option I've missed?

Thanks!!

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Sun, 2013-08-25 01:49

No, nothing 'mass' included. There isn't a way to really code this in. I was thinking of having a 'ignore duplicate upload' option, but even that isn't elegant enough to code in.

 
arabianfoz

Joined: 2013-07-18
Posts: 1
Posted: Wed, 2013-09-04 18:12

Hi J.E. Nash,
I am from Brazil.
I noticed that some messages in the dupcheck module cannot be translated,
so I corrected them. Could you update the module to include the fixes?
The fixes are attached.

Thanks, Great work

 
jnash

Joined: 2004-08-02
Posts: 814
Posted: Wed, 2013-09-04 23:09

Thanks, I'll get these changed as soon as I have a free moment. I got your PM but haven't had a chance to reply. Appreciate you digging out the specifics, will make it much easier for me.