3.0.2: ?m=nnnnnnnn causing 404 not found on later requests from bots

MarkRH

Joined: 2007-05-25
Posts: 241
Posted: Wed, 2011-06-08 07:11

I'm fairly sure this has been happening with every version of Gallery 3. I discovered it after adding some code to my gallery that looks for individual photos I had moved to other albums, so that I don't have to mess with .htaccess rewrite rules.

The code works by intercepting the Kohana 404 exception and double-checking the database, using the items.name field for images or the slug field for a page, and then doing a 301 redirect to the new location if a match is found.

In /system/libraries/Kohana_404_Exception.php, I added the following check:

	public function __construct($page = NULL)
	{
		if ($page === NULL)
		{
			// Use the complete URI
			$page = Router::$complete_uri;
		}

		// mrh: should this be where I add my call to find a missing page/photo and do a 301 redirect if found?
		if ( ! found_missing_item($page))
		{
			// Nothing matched, so fall through to the normal 404.
			parent::__construct('The page you requested, %page%, could not be found.', array('%page%' => $page));
		}
	}
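
A rough sketch of the kind of lookup described above, not my actual found_missing_item() function. The name find_redirect_target and the two arrays are made up for illustration; the arrays stand in for queries against the items table (name for photos, slug for pages), and the paths in them are placeholder data:

```php
<?php
// Hypothetical helper: decide which field to search and return the new
// location, or null if nothing matches and the 404 should go ahead.
function find_redirect_target(string $last_segment, array $by_name, array $by_slug): ?string
{
    // Anything ending in an image extension is treated as a photo (items.name);
    // everything else is treated as a page (slug).
    $is_image = (bool) preg_match('/\.(jpe?g|png|gif)$/i', $last_segment);
    $table = $is_image ? $by_name : $by_slug;

    return $table[$last_segment] ?? null;
}

// Placeholder data standing in for the database.
$by_name = ['photo.jpg' => '/new-album/photo.jpg'];
$by_slug = ['old-page' => '/new-album/old-page'];

$target = find_redirect_target('photo.jpg', $by_name, $by_slug);
if ($target !== null) {
    // A real handler would send: header('Location: ' . $target, true, 301); exit;
    echo "301 -> $target\n";
} else {
    echo "404\n";
}
```

Returning null for "not found" keeps the decision in one place: the caller either redirects or lets the original 404 continue.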

In my function I log what it's doing, just out of curiosity, and I started noticing things like:

[07-Jun-2011 07:07:00 AM] 66.249.71.203 - ***** begin -- find missing Kohana item: var/resizes/misc-pics/okc_apartment_front.jpg?m=1263029094
[07-Jun-2011 07:07:00 AM] 66.249.71.203 - Actual server request: /var/resizes/misc-pics/okc_apartment_front.jpg%253Fm%253D1263029094
[07-Jun-2011 07:07:00 AM] 66.249.71.203 - User agent: Googlebot-Image/1.0
[07-Jun-2011 07:07:00 AM] 66.249.71.203 - looking for photo: okc_apartment_front.jpg
[07-Jun-2011 07:07:00 AM] 66.249.71.203 - -- end -- found missing photo: http://gallery.markheadrick.com/var/resizes/misc-pics/okc_apartment_front.jpg

[07-Jun-2011 04:19:06 PM] 95.108.158.239 - ***** begin -- find missing Kohana item: var/resizes/nature/lunar-eclipse-december-21-2010-05.jpg_3Fm_3D1292930341amp_
[07-Jun-2011 04:19:06 PM] 95.108.158.239 - Actual server request: /var/resizes/nature/lunar-eclipse-december-21-2010-05.jpg_3Fm_3D1292930341amp_
[07-Jun-2011 04:19:06 PM] 95.108.158.239 - User agent: Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
[07-Jun-2011 04:19:06 PM] 95.108.158.239 - looking for photo: lunar-eclipse-december-21-2010-05.jpg
[07-Jun-2011 04:19:06 PM] 95.108.158.239 - -- end -- found missing photo: http://gallery.markheadrick.com/var/resizes/nature/lunar-eclipse-december-21-2010-05.jpg

In the first case, the Google Image bot is requesting %253Fm%253D (a double URL-encoded "?m=") instead of "?m=", and in the second case the Yandex bot is sending an equally weird request, with the "?m=" mangled into _3Fm_3D. Many times the "?m=" comes across as %3Fm%3D in $_SERVER['REQUEST_URI']. In all of these cases the "?m=..." part ends up as part of the file name instead of being a query string, so Gallery can't find the image even though it is actually there.
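
To illustrate the double encoding, here is a small standalone sketch. The helper name fully_decode is made up for this example and is not part of Gallery; it just calls PHP's rawurldecode() until the string stops changing, with a cap on the number of rounds:

```php
<?php
// Hypothetical helper: undo repeated percent-encoding, at most $max_rounds times.
function fully_decode(string $uri, int $max_rounds = 3): string
{
    for ($i = 0; $i < $max_rounds; $i++) {
        $decoded = rawurldecode($uri);
        if ($decoded === $uri) {
            break; // nothing left to decode
        }
        $uri = $decoded;
    }
    return $uri;
}

// %25 decodes to "%", so "%253F" becomes "%3F" and then "?".
echo fully_decode('/var/resizes/misc-pics/okc_apartment_front.jpg%253Fm%253D1263029094'), "\n";
// prints /var/resizes/misc-pics/okc_apartment_front.jpg?m=1263029094
```

Decoding more than once is only reasonable here because the file names involved contain no literal percent signs; that is an assumption about my gallery, not a general rule.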

My code strips this extra stuff out, looks for just filename.jpg in the items.name field, and then either 301 redirects or continues with the 404 condition accordingly.
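
The stripping step could look something like the following. This is a simplified sketch rather than my exact code; the function name strip_cache_buster and the regular expression are assumptions based only on the request variants shown in the log above:

```php
<?php
// Hypothetical helper: reduce a mangled request path to the bare file name.
function strip_cache_buster(string $uri): string
{
    // Undo up to two layers of percent-encoding (%253F -> %3F -> ?).
    $uri = rawurldecode(rawurldecode($uri));

    // Drop "?m=<digits>" or the mangled "_3Fm_3D<digits>" form, plus
    // anything that trails it (such as "amp_").
    $path = preg_replace('/(\?|_3F)m(=|_3D)\d+.*$/i', '', $uri) ?? $uri;

    // Only the file name is needed for the items.name lookup.
    return basename($path);
}

echo strip_cache_buster('/var/resizes/misc-pics/okc_apartment_front.jpg%253Fm%253D1263029094'), "\n";
// prints okc_apartment_front.jpg
echo strip_cache_buster('/var/resizes/nature/lunar-eclipse-december-21-2010-05.jpg_3Fm_3D1292930341amp_'), "\n";
// prints lunar-eclipse-december-21-2010-05.jpg
```

The pattern is anchored to the end of the string, so a request without any "?m=" residue comes back as just its file name.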

Since adopting this code I've noticed many more of my images being indexed. This may partially explain why indexing was problematic: the Google Image bot was getting 404 responses for images it had seen previously.

Regular browsing doesn't trigger my code, so I know the regular "?m=" is being handled properly. Anyway, I thought you'd like to know what I've uncovered. This is why I've been thinking of turning off the "?m=" output.

Regards,
Mark H.

Using Gallery 3.0.2 - gallery.markheadrick.com