An attempt at an MD5 module to do some duplicate checking/etc.
http://codex.gallery2.org/Gallery3:Modules:dupcheck
Here is the initial release. It's only been tested locally on my gallery, so YMMV, but I think it's good.
Here is what the module does as of now:
1) Upon activation, it adds a new table to the DB (fullsize_md5sums) and sets a warning that photos need to be scanned for missing md5s
2) Adds a maintenance task that will allow you to scan all photos and insert any missing MD5s in the DB
3) Upon adding an image to the gallery (or changing one), inserts the MD5 into the DB automatically
4) Upon deleting an image from the gallery, will delete the MD5 from the DB for that image
5) Adds a 'Gallery Photo Duplicates' menu item to ADMIN->CONTENT menu which currently loads a dynamic album with duplicate thumbnails
-5a) On the duplicate item album page, shows the thumbnail of the image that has duplicates
-5b) Under the thumbnail there are links to each image location where it's duplicated - the links show which album it's in
-5c) This checks for duplicates across ALL of the gallery, not just in a particular album - (Maybe I will code per-album checking later, but not now)
ADTL:
6) Shows what a bad coder I am via the php code...
7) Works from what I can tell/test/etc.
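For readers who want to see the idea behind steps 1 to 4, here is a minimal sketch in Python (the module itself is PHP, so this is not its actual code). The table name fullsize_md5sums and the itemmd5 column come from this thread; the item_id column, the function names, and the use of SQLite are assumptions made only for illustration.

```python
import hashlib
import sqlite3


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 of a file, reading in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_table(conn):
    """Create the checksum table (item_id is a hypothetical column name)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fullsize_md5sums ("
        "item_id INTEGER PRIMARY KEY, itemmd5 TEXT)"
    )


def scan_missing(conn, items):
    """Insert an MD5 for every (item_id, path) pair not yet in the table.

    Returns the number of rows added, which is what a maintenance task
    would report as its progress.
    """
    known = {row[0] for row in conn.execute("SELECT item_id FROM fullsize_md5sums")}
    added = 0
    for item_id, path in items:
        if item_id in known:
            continue
        conn.execute(
            "INSERT INTO fullsize_md5sums (item_id, itemmd5) VALUES (?, ?)",
            (item_id, file_md5(path)),
        )
        added += 1
    conn.commit()
    return added


def forget_item(conn, item_id):
    """Remove the stored MD5 when an item is deleted from the gallery."""
    conn.execute("DELETE FROM fullsize_md5sums WHERE item_id = ?", (item_id,))
    conn.commit()
```

Running scan_missing a second time over the same items adds nothing, which mirrors how the maintenance task only fills in missing MD5s.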
--------------------------------------------------------------------------------
CHANGE LOG:
------------------
v7 - Will now provide a warning when uploading a duplicate image via the normal web way and provide an option to delete it. (The warning is displayed after you click DONE when uploading) - thanks Sangie
---------------------------------------------------------------------------------------------
v6 - Cosmetic change when rounded percent = 0, changed to 'less than 1%' - thanks AvutonOlrich
---------------------------------------------------------------------------------------------
v5 - Added a warning when the MD5SUM of an image is invalid - the admin will now see the offending image so the problem can be fixed - thanks AvutonOlrich
---------------------------------------------------------------------------------------------
v4 - Added a task in the maintenance run to automatically remove invalid MD5SUM entries from the DB - thanks AvutonOlrich
---------------------------------------------------------------------------------------------
v3 - Added a 'delete' link in the context menu under each photo - thanks cwallace91472
---------------------------------------------------------------------------------------------
v2 - Fix for when no duplicates are found - thanks floridave!
---------------------------------------------------------------------------------------------
v1 - Initial Release
Posts: 101
This module would be useful to those of us who came from Gallery2 and used the Links module, which let a picture reside in multiple albums. Those pictures get imported into G3 as two separate but identical pictures.
Posts: 814
Did I mention I'm not a coder?
Posts: 25
dude, you have my full RAWR support :p. I have 17k images, and with a user base of 1.8k users who are uploading 200-300 images a day, odds are that some are duplicates. I'm the only admin, so I spend huge chunks of time just making sure my site is running, and just recently got it open to the public again. Your module will be a god in human form. (I only have one question: what kind of server resources are we talking about here? I'm on a shared host, so I have metered server resources and need to know just how much this would take to use, as md5 scanning/comparing, even through mysql, can take decent chunks of server resources.)
Posts: 814
dyingdemon, my tests so far (I have 26k images) take about the same scanning time as the exif modules do (perhaps a bit faster). I use PHP's md5sum capabilities to compute the md5sums, which appears not to use too much CPU/memory. Unfortunately, I didn't time the runs or anything.
The initial MD5sum scan will take a while, but after that, it does the scan for each upload/modification, and that's cake.
The queries are pretty quick and not that resource intensive at all. The original method I was using was very slow on the duplicate lookup; however, I've refined it quite a bit, and it's speedy and not that hungry.
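As an illustration of why the duplicate lookup can be cheap, here is a hedged Python/SQLite sketch. It is not the module's actual query: the item_id column, the index name, and find_duplicates are assumptions, while fullsize_md5sums and itemmd5 are names used in this thread. Grouping on an indexed checksum column lets the database do the comparison work instead of comparing images pairwise.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory table shaped like the one described in the thread."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fullsize_md5sums (item_id INTEGER PRIMARY KEY, itemmd5 TEXT)"
    )
    # Hypothetical index: lets GROUP BY / IN lookups avoid a full comparison pass.
    conn.execute("CREATE INDEX md5_idx ON fullsize_md5sums (itemmd5)")
    return conn


def find_duplicates(conn):
    """Return {md5: [item_id, ...]} for every checksum stored more than once."""
    rows = conn.execute(
        """
        SELECT itemmd5, item_id
        FROM fullsize_md5sums
        WHERE itemmd5 IN (
            SELECT itemmd5 FROM fullsize_md5sums
            GROUP BY itemmd5 HAVING COUNT(*) > 1
        )
        ORDER BY itemmd5, item_id
        """
    )
    groups = {}
    for md5sum, item_id in rows:
        groups.setdefault(md5sum, []).append(item_id)
    return groups
```

Each key in the result would correspond to one thumbnail on the duplicates page, with the item list supplying the links to each album location.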
I've attached the module into the original post. So you're welcome to test it out. (It's only been -tested- on my gallery, so let me know if there are any issues... I think I've cleaned up any specifics to my gallery, but this is my 'first' public module)
Posts: 181
Feature request suggestion for this one: since you're storing checksums, it'd be awesome to have a validation function, which makes sure photos haven't become corrupted on disk.
Posts: 814
mattdm: can you elaborate a bit? I'm guessing that you'd like a task that will check for corrupted files based on MD5 changes?
Posts: 181
Exactly.
Posts: 27300
I like one word details.
Elaborate: Develop or present (a theory, policy, or system) in detail.
Dave
_____________________________________________
Blog & G2 || floridave - Gallery Team
Posts: 181
Well, that really was exactly that. This module already saves checksums. It would be nice to have a maintenance task that would verify those checksums.
I can imagine several possible modes of operation:
Posts: 814
I don't think this module is the place for that function. I haven't heard of anyone having problems with images becoming corrupt, but I suppose this could be useful. Unfortunately, I don't have the time to create a new module at the moment, and I don't think it will fit well inside this module. However, the data is there, so should someone want to tackle it, I say go for it!
Posts: 181
It's not unheard of, and as datasets get bigger, random bit flips become more likely. Some new file systems (like zfs) do checksumming inherently, but most don't, yet.
I think probably the ideal architecture would be to have three modules, one which focuses on collecting and managing checksums, one which deals with duplicates (perhaps eventually having clever features for automatically linking them), and a third which uses the checksums for data integrity.
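To make the integrity idea concrete, here is a small Python sketch of what such a verification pass might look like. Nothing like this exists in the dupcheck module as far as this thread shows; verify_checksums, the record layout, and the status names are all hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(records):
    """Compare stored checksums against the files on disk.

    records is an iterable of (item_id, path, stored_md5) tuples.
    Returns {item_id: status} for problem items only, where status is
    "missing", "unreadable", or "changed".
    """
    problems = {}
    for item_id, path, stored_md5 in records:
        try:
            current = file_md5(path)
        except FileNotFoundError:
            problems[item_id] = "missing"
        except OSError:
            # Covers permission errors and other read failures.
            problems[item_id] = "unreadable"
        else:
            if current != stored_md5:
                problems[item_id] = "changed"
    return problems
```

A "changed" result only signals corruption if the stored checksum is refreshed whenever an image is legitimately edited, which the module is described as doing on add or change.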
Posts: 181
See http://clusterbuffer.wordpress.com/checksum/
Posts: 15
Is there a way to add a quick 'delete' button to the images to remove them??
Posts: 814
If you're logged in with administrative/owner credentials, then you should be able to delete via the dropdown menu under the photo. If you're not the owner/admin, then it can't be done.
Posts: 15
When I click the duplicates module and it brings up a screen full of images that are apparently duplicates it has a little popup menu on each image and it links to the location of the file...but no option to delete unless it is at the end of the filename and these are long filenames...
I am going over everything...I am new to this system so it might be something simple I am missing...
Posts: 15
Yeah...no sign of a delete there...logged in as admin...obviously or I would not be able to get to the module...;)
I uploaded all of them so I do in fact own them on the site...
Posts: 814
hmmm, let me take a look at the module and see what I can do. You are correct.
Posts: 814
Updated v3:
Added 'delete' link in the context menu
See: http://codex.galleryproject.org/Gallery3:Modules:dupcheck
Posts: 15
Tada!! Works like a charm! Hell of a response time also...
Would love a way to bulk delete also...;) hehe...ok...I am pushing it...
THANKS!!
Chris
Posts: 814
One can only wish! Maybe one day.
Posts: 29
Found a (two part) bug. Somehow, the gallery didn't have permissions to a directory of photos in my gallery. Gallery3 handles this well; the md5sum scan did not. Apparently, there are invalid md5sums in the database now. Now that I've fixed the permissions, a rescan did not help this problem.
So, the way I see it: first, handle bad permission (or otherwise unreadable) files well; second, please add a hack to detect invalid md5 sums (is the entry empty? Not sure).
Otherwise, got rid of 100+ duplicates here, I really appreciate this module.
Posts: 814
I can't seem to replicate this issue. The only way an MD5SUM is added is if the item exists in the database and is there. Can you specify what you mean by invalid md5sums, and what you mean by gallery not having permissions? (were the permissions messed up after the item was added?)
I think I understand, but can't figure out how to replicate it to fix it.
Also, I'm assuming this is showing up when you go to the maintenance tasks?
Posts: 29
So what's happening here is I have a number of images where the permissions were messed up after the images were added to the gallery. When I noticed that an entire folder of images in my gallery were marked as duplicates, I began looking around until I found the problem (bad permissions). Now that I've fixed the permissions problem, the images show up in gallery again (not as a broken icon). Now there's no way to get these images marked as not duplicates. In the duplicate images page, the pull up window for duplicates shows them all under one box, which shows the broken image icon with the dialog "photo".
I shouldn't have said they have invalid md5sums, that's my assumption.
Posts: 814
You should be able to empty out the fullsize_md5sums table in the database safely. You'll have to run the maintenance task to extract the md5sums back afterwards.
Unfortunately, there isn't a way for me to 'catch' this that I can think of offhand. However, I have added a little piece of code to the maintenance task to remove any md5sums that aren't 'valid' (not 32 characters long). Other than that, there isn't a way to check whether an md5sum is 'wrong' because of later changes without emptying the table out. I could add a 'rebuild' task, but nobody else has reported the issue you've encountered, and I think it would clutter up the maintenance interface. The best and easiest fix is to clear out the fullsize_md5sums table.
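A rough Python sketch of that cleanup step follows. The 32-character rule is from the post above; additionally requiring lowercase hex digits is a stricter assumption of this sketch, and the item_id column and function names are hypothetical.

```python
import re
import sqlite3

# An MD5 hex digest is 32 hexadecimal characters; lowercase is assumed here.
MD5_PATTERN = re.compile(r"[0-9a-f]{32}")


def is_valid_md5(value):
    """True when value looks like a well-formed MD5 hex digest."""
    return isinstance(value, str) and MD5_PATTERN.fullmatch(value) is not None


def purge_invalid(conn):
    """Delete rows whose stored checksum is malformed; return their item ids.

    Removed items then count as 'missing' again, so the next maintenance
    scan recomputes them.
    """
    rows = conn.execute(
        "SELECT item_id, itemmd5 FROM fullsize_md5sums ORDER BY item_id"
    ).fetchall()
    bad = [item_id for item_id, md5sum in rows if not is_valid_md5(md5sum)]
    conn.executemany(
        "DELETE FROM fullsize_md5sums WHERE item_id = ?", [(i,) for i in bad]
    )
    conn.commit()
    return bad
```

This only catches malformed entries. A well-formed checksum that no longer matches its file cannot be detected this way, which is why emptying the table and rescanning remains the fallback.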
Posts: 29
OK, I finally got the time to look in the database to see what's going on, the 'itemmd5' for the corresponding items value is '0', so, if you do add something like that I would love to use it and report back.
Posts: 814
That's been corrected in the recent code. I'll post up an update module shortly when I get back home.
James
Posts: 814
v4 updated on the codex:
http://codex.galleryproject.org/Gallery3:Modules:dupcheck
Posts: 29
v4 fixed most of my problems, but the method it uses shows a new problem (and honestly probably not something your module should handle).
Apparently, I've had filesystem corruption which has wiped away a few of my files (some were backed up, the rest I didn't care that much about). So, I would run the maintenance item repeatedly, and there would always be 13 md5sums which needed to be calculated (again, they were all '0's in the database). I hacked it to tell me the 13 problem files (the ones that didn't exist at all), and I'm deleting those items from the gallery database and re-adding them if I had a backup. Anyhow, not sure this is something you should worry about, but in case you care.
Also, more cosmetic, when I look at the maintenance item, it says you have 13 items (0%) to process (the 0% is a little strange ;)).
Anyhow, again, thanks for the excellent module. Extremely helpful, and will continue to be.
Posts: 814
Yeah, if 13 items is less than 1% of your gallery images, then yes, it does look weird, but that's (rounded) math.
I do have some ideas in mind on checking when adding/updating an md5sum, and if invalid, provide a warning. This would warn you of the offending images that the MD5 failed on...
I may work on this tonight.
Posts: 814
^^^^^^^^^^ Fixed in v5 ^^^^^^^^^^ (well, not the cosmetic part: 13 of 13000 is still < 1% and will show as 0%)
Posts: 814
v6 = cosmetic change: when rounded percent = 0, changed to display 'less than 1%' - see above
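The rounding behaviour discussed above can be sketched in a few lines of Python. The function name and exact wording are illustrative assumptions, not the module's code.

```python
def percent_label(remaining, total):
    """Format the share of items left to process for display.

    Small non-zero shares that round down to 0 are shown as
    'less than 1%' instead of a confusing '0%'.
    """
    if total <= 0 or remaining <= 0:
        return "0%"
    percent = round(100 * remaining / total)
    if percent == 0:
        return "less than 1%"
    return f"{percent}%"
```

With 13 items left out of 13000, the share is 0.1%, which rounds to 0 and is therefore displayed as "less than 1%".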
Posts: 105
This just might be the ticket I've been looking for! Thanks! I want to be able to have server-add images updated, but sometimes I upload duplicates I already put on the server if they are in the same directory (i.e. directory 2013 is still being added to). So, would love to adapt the serveradd to ignore adding images to the gallery if there is an MD5 match already IN the gallery...maybe you could check on how hard to add in there? I may just browse the server add and call the MD5 hash and compare to the database if I get time to look at the server add code and compare to your MD5 crunch. Thanks for adding this!
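The idea of skipping server-add files whose MD5 is already in the gallery can be sketched as below. This is a Python illustration only, not the server_add module's code or any patch posted in this thread; filter_new_files and its arguments are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def filter_new_files(paths, known_md5s):
    """Split candidate files into (to_add, skipped).

    known_md5s holds the checksums already stored for gallery items.
    A file is skipped when its checksum is already known, including when
    it duplicates another file earlier in the same batch.
    """
    seen = set(known_md5s)
    to_add, skipped = [], []
    for path in paths:
        md5sum = file_md5(path)
        if md5sum in seen:
            skipped.append(path)
        else:
            seen.add(md5sum)
            to_add.append(path)
    return to_add, skipped
```

Only the files in to_add would be handed on to the normal import step, so re-running server add on a directory that is still growing would pick up just the new images.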
Posts: 814
@jeffmcclain: Sounds like a good idea. I'm a bit busy at the moment, but may have a look some time in the future. I'm kinda waiting for 3.1 upgrades before I go in to any more mods of the modules myself. This would probably be a relatively simple addition to the server add module however. I'd say give it a try and see what you come up with. I'll be happy to help when I can.
Posts: 814
@jeffmcclain:
Here is a very quick and rude patch to: modules/server_add/controllers/server_add.php
It works, but could use cleaning up...
Posts: 28
Hey jnash. Thanks for programming this. I had wanted something like this for awhile.
Do you think there's a way for it to check on every upload if the image has already been added? So after the upload it could say "Image deleted as it's a dupe of an image already in the gallery" or something along those lines?
Posts: 814
v7 is now updated to display a warning when a duplicate image is uploaded (via the normal web method) and provide an option to delete it.
(will display once you click DONE from uploading images)
Posts: 28
Wow I didn't think you would. Thank you so much =)
Posts: 105
Thanks, James! I'm traveling right now, but will check it when I get back! Thx again!
Posts: 25
Just thought I'd drop in to say thank you. I've been using the plugin for a while now, and like I said in my previous post, it's a god in human form. I did have to restore from a backup a few months ago, though, and for whatever reason it didn't restore the MD5 table for the database. I ended up re-scanning and (at my then current 250,331 images) the only real drain, as you said, was the first time. I ended up using the full resources allotted to me for about 8 min, and after that I rarely spike. When I first scanned at ~34,000 images it found over 300 duplicates which I had missed. So once again, thanks.
Posts: 814
Glad to hear it's being used and appreciated. Thanks!
Posts: 7
Hi jnash. I love this plugin. It has helped a lot with my OCD and wanting things in order :P
Unfortunately I am migrating a site, and uploading several hundred photos at once. Earlier I ended up duplicating 500 images. Going through one by one to answer "Delete this photo?" then click the "yes" confirmation and wait for the page to reload can take a very long time. I ended up deleting the album and uploading them all again, careful of the duplicates.
Is there a mass-confirm, or mass-select and delete option I've missed?
Thanks!!
Posts: 814
No, nothing 'mass' included. There isn't a way to really code this in. I was thinking of having a 'ignore duplicate upload' option, but even that isn't elegant enough to code in.
Posts: 1
Hi J.E. Nash,
I am from Brazil.
I realized that some messages in the dupcheck module cannot be translated,
so I corrected them. I want to know if you can update the module to support this!
The fixes are attached.
Thanks, Great work
Posts: 814
Thanks, I'll get these changed as soon as I have a free moment. I got your PM but haven't had a chance to reply. Appreciate you digging out the specifics, will make it much easier for me.