URL rewrite and search engine friendliness issues

ichthyous

Joined: 2006-06-16
Posts: 324
Posted: Thu, 2006-09-21 20:10

Hi there, I am posting a discussion I had with Bharat regarding the way Gallery currently implements URL rewrites. If anyone has direct experience with duplicate content penalties in Google, or has seen some of their pages pushed into Google's supplemental index, please post it here!

Quote:
Hi Bharat,

I have been busy working on a new site using Gallery to replace my old website. I looked at quite a few apps before settling on Gallery. The main reason I chose it was that I liked its URL rewrite module. Now that my Gallery site is almost finished I see that the URL rewrites are not consistently applied in various places throughout the site. I am not writing to complain about it, I just thought that my insight on this topic might be helpful in some way....
Because the app does not apply the URL rewrite module to all page links, pages will most likely be double (or triple, etc.) indexed and then pushed into Google's supplemental index, where nobody will ever find them. You are probably already aware of this issue. I have run searches here to see if anyone is working on these problems, and it seems like some people are aware of the issue but no fixes have been written....

Below is an analysis of the URL issues on my site. Please keep in mind that depending on how each site is designed a different set of issues may arise. For example, I added code to make the album and photo titles under each thumbnail linkable and now the thumbnail link and title link URLs don't match.

---------------------------------------------------------------------------

PROBLEM: Album thumbnails use one URL while album links use another. You can see this at http://www.andrewprokos.com/index.php:

EXAMPLE:

link under thumb:
http://www.andrewprokos.com/photos/new-york-city/

Thumbnail link: http://www.andrewprokos.com/photos/new-york-city/?g2_enterAlbum=1

--------------------------------------------------------------------------------

PROBLEM: Photo thumbnails use one URL (incorrect) while photo links use another (correct):

EXAMPLE:

link under thumb: http://www.andrewprokos.com/photos/new-york-city/skylines/midtown-manhattan-pano-day.html

Thumbnail link:
http://www.andrewprokos.com/photos/new-york-city/skylines/midtown-manhattan-pano-day.html?g2_enterAlbum=0

--------------------------------------------------------------------------------

ALBUM PAGE-NUMBER NAVBAR:

PROBLEM: The bottom page-number navbar (i.e. page 1 2 3 4 5...16) in any album with multiple pages doesn't use the "pretty" URLs at all, while the previous/next buttons at the bottom of the page do. This problem affects every album on the site.

EXAMPLE:

page-number nav URL for the first page of the NYC skylines album:
http://www.andrewprokos.com/index.php?g2_view=core%3AShowItem&g2_itemId=17&g2_page=1

previous button URL for same page: http://www.andrewprokos.com/photos/new-york-city/skylines/

Here's the second page in the album:

Button: http://www.andrewprokos.com/photos/new-york-city/skylines/?g2_page=2
Page-number nav 2: http://www.andrewprokos.com/index.php?g2_view=core%3AShowItem&g2_itemId=17&g2_page=2

--------------------------------------------------------------------------------

BREADCRUMB NAV AT TOP OF SITE:

PROBLEM: This problem occurs whether you are in albums or on a photo page. Once you get into any album the breadcrumb URLs have highlight IDs appended to them. Not sure what function they serve, but the breadcrumb nav URLs work just the same without them. The highlight ID numbers change depending on what album you are in.

EXAMPLE:

In the page http://www.andrewprokos.com/photos/new-york-city/skylines/ if you roll over the breadcrumb link at top:

Home link:
/index.php?g2_highlightId=15

This should just read
/index.php

New York Photography link:
/photos/new-york-city/?g2_highlightId=17

This should just read
/photos/new-york-city/

--------------------------------------------------------------------------------

PHOTO PAGE HTML extensions

PROBLEM: The app appends .html to the end of every photo page. If possible the ".html" should be stripped off. This is not a huge problem (that I know of) and won't cause supplemental or dupe content issues per se but many people think that removing the html extension is more SE friendly.

EXAMPLE:

photos/new-york-city/skylines/midtown-manhattan-pano-day.html

Should read:

photos/new-york-city/skylines/midtown-manhattan-pano-day

 
ichthyous

Joined: 2006-06-16
Posts: 324
Posted: Thu, 2006-09-21 20:11

Here is Bharat's response:

Hi, ichthyous. Please post this in the forums and don't follow up by private message as I won't have time to work on this directly myself so we need to leverage the other developers.

But to get the conversation going (feel free to post this reply in the forums when you repost this)...

I am pretty sure that Google is smart enough to be unaffected by virtually all of these constraints. Do you have empirical evidence that this data shows up multiple times in their index? For example, here's a Google query for a site that has a large G2 and is using the rewrite module:

http://www.google.com/search?q=site%3Achristianjamesphoto.com

(don't know this guy, just found him by doing a Google search). I don't see much duplication showing up there, and it doesn't appear that he's been pushed into the supplemental index. At any rate, we can definitely change some of these things. Others are harder. Let's discuss it in the forums and line up a dev to work on the things that can be fixed.

 
ichthyous

Joined: 2006-06-16
Posts: 324
Posted: Thu, 2006-09-21 20:39

Bharat,

The site you mention above does have pages in Google's supplemental index. If you run a site: search at Google you will see the site has 3,340 indexed pages. After the 79th page (on my results) you see all of the site's remaining pages in the supplemental index...which you can tell because the word "supplemental" appears in green under the link. Those pages will never appear in any Google search. Some supplemental results are caused by duplicate content filtering and others because the page is outdated and has changed. It's hard to know why unless you are the site author. Interestingly, the website you found doesn't have the same problem with the page-number nav not using the rewritten URLs...so perhaps this is an issue with PGtheme? That's why I say that not every site will be indexed in the same way. I will try to find other Gallery sites which may exhibit indexing issues to see if we can come to any conclusions. I hope what you say about Google being smart enough to sort this out on its own is true...but seeing what's been going on with Google's index since the Big Daddy update in spring, I fear it's not.

 
netscan

Joined: 2005-07-16
Posts: 39
Posted: Fri, 2006-09-22 01:26

K, one at a time:

PROBLEM: Album thumbnails use one URL while album links use another

Solution: Looks like a template error - modify the link handling in the template.

PROBLEM: Photo thumbnails use one URL (incorrect) while photo links use another (correct):

Solution: See problem 1

PROBLEM: ALBUM PAGE-NUMBER NAVBAR

Solution: See problem 1

PROBLEM: BREADCRUMB NAV AT TOP OF SITE

Solution: The "highlight" parameter is sticky state so Gallery knows where to return the user, i.e. if a user clicks on a picture on page 4, then clicks on the album name in the breadcrumb, the user will go back to page 4. This will cause a dupe content hit. The breadcrumb code can be modified to remove the feature, but your users will lose their place when navigating. You'll have to weigh your options there...

PROBLEM: PHOTO PAGE HTML extensions

Solution: Leave 'em there, Google doesn't care but it's easier for users, /this_looks_like_a_directory/ and this_looks_like_a_page.html

Try and dig up my How-To about being search engine friendly, 99% of it is still up-to-date, only thing I can think of that's different is the main.php/index.php handling, which is easily solved (thanks valiant) by making main.php the default page in the apache server (ie: www.your-site.com/main.php becomes www.your-site.com/). That is the second biggest thing you can do to avoid duplicate content/302 death in google. (remember to set $gallery->setConfig('baseUri', '/'); in config.php ;)

The #1 thing that will kill your site in Google is having both www.your-site.com and your-site.com accessible. That is bona fide duplicate content, since Google sees them as 2 different sites. Pick one or the other and redirect the one you don't want to the one you do want. Google Sitemaps now has an option to set the default, but your mileage may vary and I feel a lot better handling it myself.

Example: redirect www.your-site.com to your-site.com (in .htaccess):

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.your-site\.com
RewriteRule ^(.*)$ http://your-site.com/$1 [R=permanent,L]
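To sanity-check what a rule like this is expected to do before touching a live server, here is a small Python model of it. This is only a sketch of the rule's logic, not anything Apache runs, and the function name redirect_target is made up for illustration. It assumes the .htaccess context, where the path seen by RewriteRule has no leading slash.

```python
import re

# Same host pattern as the RewriteCond: matches hosts starting with www.your-site.com
WWW_HOST = re.compile(r"^www\.your-site\.com")


def redirect_target(host, path):
    """Return the URL the rule would redirect to, or None if the rule does not apply.

    `path` is the request path without its leading slash, which is what
    RewriteRule matches against in a per-directory (.htaccess) context.
    """
    if not WWW_HOST.search(host):
        return None  # condition fails, request is served as-is
    # ^(.*)$ captures the whole path and $1 re-appends it to the bare host
    return "http://your-site.com/" + path


if __name__ == "__main__":
    print(redirect_target("www.your-site.com", "photos/new-york-city/"))
    print(redirect_target("your-site.com", "photos/new-york-city/"))
```

Requests to the www host get a permanent redirect to the same path on the bare host, and requests already on the bare host are left alone, so only one hostname ends up indexed.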

Google will not and probably cannot fix dupe content, it's up to site owners to either play ball or ignore google.

Browse through your site, note all of the inconsistencies and tackle them just as you have here. You can also run one of those sitemap generators, best damn thing I have used to see what gallery is spewing out to the bots.
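If the sitemap generator can export a plain list of URLs, a short script can do the "note all of the inconsistencies" part for you. The following is a rough Python sketch, not part of Gallery. The function names are made up, and the set of parameters to ignore is just the two discussed in this thread (g2_enterAlbum and g2_highlightId). It also assumes you want www and non-www treated as one site and index.php/main.php treated as the site root.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters from this thread that do not change what the page shows.
# g2_page is deliberately kept: each page number is different content.
IGNORED_PARAMS = {"g2_enterAlbum", "g2_highlightId"}


def canonical(url):
    """Collapse a URL to a grouping key for 'the same page'."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: www and non-www are the same site
    path = parts.path or "/"
    if path in ("/index.php", "/main.php"):
        path = "/"  # assumption: this file is the default page
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, host, path, urlencode(query), ""))


def find_duplicates(urls):
    """Group URLs by canonical key and keep groups with more than one URL."""
    groups = defaultdict(set)
    for url in urls:
        groups[canonical(url)].add(url)
    return {key: sorted(members)
            for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    crawled = [
        "http://www.andrewprokos.com/photos/new-york-city/",
        "http://www.andrewprokos.com/photos/new-york-city/?g2_enterAlbum=1",
        "http://www.andrewprokos.com/photos/new-york-city/?g2_highlightId=17",
        "http://www.andrewprokos.com/photos/new-york-city/skylines/?g2_page=2",
    ]
    for key, members in find_duplicates(crawled).items():
        print(key)
        for member in members:
            print("   ", member)
```

Every group it prints is one page that the bots can reach under several addresses, which gives you a to-do list of links to fix in the templates. Note that it cannot match the raw index.php?g2_view=...&g2_itemId=... URLs to their rewritten paths, since that mapping lives in the Gallery database.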

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Fri, 2006-09-22 12:51

With "template error", netscan means that "g2_enterAlbum" is not an official g2 parameter. It's added by the user contributed (Pedro Gilberto, PGTheme) theme. Other themes don't have this parameter.

edit: fixed typos.

 
ichthyous

Joined: 2006-06-16
Posts: 324
Posted: Fri, 2006-09-22 14:06

Thanks for the clarification valiant...I was wondering why my gallery has them and others don't.

Netscan:

I read your post on making Gallery SE friendly when I first started working on the site. Yes, I do think it's still relevant, and it was partly the reason I decided to turn off slideshows and multiple image size options. My site is not really geared towards dialup users anyway.

As for the ?g2_enterAlbum parameter, I will cross-post this topic in the PGtheme thread and see if anyone knows how to remove it. I don't know PHP at all, which has really slowed things down for me. I figured the highlight ID served some kind of placeholding purpose...I don't know which is worse, frustrating visitors through poor navigation or frustrating Google through dupe content. I will start a post here in the customization forum to see if anyone can remove the highlight IDs, that way people can choose for themselves which version they want to use. As for the html extension...you are right, this is the least of the problems.

 
LFrank

Joined: 2005-02-19
Posts: 1023
Posted: Fri, 2006-09-22 16:42

ichthyous,

look in album.tpl (of PGtheme) and search for "{capture assign=linkUrl}{g->url arg1="view=core.ShowItem" arg2="itemId=`$child.id`" arg3="enterAlbum=`$child.canContainChildren`"}{/capture}", just remove the "arg3". This prevents the g2_enterAlbum being appended to the thumbnail links.
This is true for some other URL constructs in other files of PGtheme, too.
CU
Lutz

Gallery version = 2.2-svn core 1.1.16
PHP version = 5.1.6 apache2handler
Webserver = Apache/2.2.3 (Win32) DAV/2 PHP/5.1.6 mod_ssl/2.2.3 OpenSSL/0.9.8c
Database = mysql 5.1.11 beta-log,
Theme=PGlightbox,
Gallery-URL=http://lf-photodesign.de

 
ichthyous

Joined: 2006-06-16
Posts: 324
Posted: Fri, 2006-09-22 17:12

Thanks for the response Lutz...just to be clear I removed every instance of the following from album.tpl:

arg3="enterAlbum=`$child.canContainChildren`"

This resolved this issue without any problems arising from what I can see. Would you happen to know anything about the problem with the bottom page-number navbar in PG theme? It doesn't use the rewritten URLs at all, just the raw dynamic URLs. Thanks!

 
LFrank

Joined: 2005-02-19
Posts: 1023
Posted: Fri, 2006-09-22 17:21

Yes, exactly - might be necessary in or navigatorthumbs, too.
I changed something for the bottom page numbers too, I have to dig it out ...


 
netscan

Joined: 2005-07-16
Posts: 39
Posted: Fri, 2006-09-22 21:28

Glad you got it sorted :) I should've used "issue" rather than "error" but thanks valiant for pointing to the exact cause.

That's what I like about Gallery, huge and helpful support base :)

 
ichthyous

Joined: 2006-06-16
Posts: 324
Posted: Sun, 2006-09-24 18:13

As for the page number nav URL rewrite issues...I found a discussion about this here:

http://gallery.menalto.com/node/49021

Apparently it's a problem with the URL rewrite module, not PGtheme. I will ask if there has been any development on it so far and post here if there has.

 
ichthyous

Joined: 2006-06-16
Posts: 324
Posted: Wed, 2006-09-27 19:20

Does anyone know how to remove the highlight IDs from the breadcrumb?

 
bharat

Joined: 2002-05-21
Posts: 7985
Posted: Wed, 2006-10-04 22:35

That code is around line 1400 in modules/core/classes/GalleryTheme.class. Comment out this block:

                    if (!empty($theme['parents'][$i + 1]['id'])) {
                        $urlParams['highlightId'] = $theme['parents'][$i + 1]['id'];
                    } else if ($itemId && ($i + 1) == count($theme['parents'])) {
                        $urlParams['highlightId'] = $itemId;
                    }

This will cost you some functionality, though. It means that if you go back up in the breadcrumbs it will always take you to page 1 of the album, not the page containing the item you're currently looking at.
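To make the trade-off concrete, here is a rough Python sketch of what a highlight ID buys you. This is not Gallery's actual code. The function and parameter names are hypothetical, and it simply assumes the album shows a fixed number of children per page in a known order.

```python
def page_for_item(child_ids, highlight_id, per_page):
    """Return the 1-based album page that shows highlight_id.

    Falls back to page 1 when there is no usable highlight ID, which is
    what happens once the highlightId parameter is removed from the URL.
    """
    try:
        position = child_ids.index(highlight_id)  # 0-based position in the album
    except ValueError:
        return 1
    return position // per_page + 1


if __name__ == "__main__":
    album = list(range(100, 140))  # 40 made-up item ids, shown 9 per page
    print(page_for_item(album, 131, 9))   # with a highlight ID
    print(page_for_item(album, None, 9))  # without one
```

With the highlight ID the breadcrumb link can land on the page holding the item you came from. Without it there is nothing to look up, so the album always opens on its first page.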

 
ichthyous

Joined: 2006-06-16
Posts: 324
Posted: Thu, 2007-02-15 22:04

Yes, this is the case, and my visitors have reported having difficulty browsing around the site. You gain on SE friendliness but lose on usability for sure.


 
skunker

Joined: 2005-02-04
Posts: 344
Posted: Thu, 2008-01-10 18:16

Hello guys,
I saw with interest that one of the problems mentioned was how Gallery2 appends ".html" to the URLs for the photo pages. I need to remove these ".html" extensions because I am trying to match the URLs of my old site using the rewrite module. For example, my old site used Gallery1, so the URLs look like this: "www.site.com/gallery/abc_efg". With Gallery2 and the rewrite module, the closest I can get to matching the old URLs is this: "www.site.com/gallery/abc_efg.html". Notice the ".html" is appended.

Does anyone know how I can get Gallery2 to NOT append the HTML extension on these photo pages? Thanks!

 
bharat

Joined: 2002-05-21
Posts: 7985
Posted: Thu, 2008-01-10 18:44

There's no configuration option for that currently, but if you modify modules/rewrite/classes/RewriteSimpleHelper.class and look for this code:

        if (substr($url, -6) != '%path%') { 
            $path = rtrim($path, '/'); 
        } else if (!GalleryUtilities::isA($entity, 'GalleryAlbumItem')) { 
            /* Append .html suffix on non-album paths if rule has no suffix */ 
            $path .= '.html'; 
        } 

Just comment out this line:

            $path .= '.html'; 

So that it looks like this:

        if (substr($url, -6) != '%path%') { 
            $path = rtrim($path, '/'); 
        } else if (!GalleryUtilities::isA($entity, 'GalleryAlbumItem')) { 
            /* Append .html suffix on non-album paths if rule has no suffix */ 
            //$path .= '.html'; 
        }