Is Google Broken? (Robots.txt Hell)

It looks like Google is not correctly obeying robots.txt again. Google states:

To block Googlebot from crawling any URL that includes a ? (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):

User-agent: Googlebot
Disallow: /*?

However, I’ve recently seen a couple of sites using this rule whose root URL dropped out of the index. Removing the rule got the root reindexed quickly.

For example, the robots.txt rule Disallow: /section/*? would end up de-indexing the URL http://example.com/section/. Soon after removing that robots.txt rule, the page would be re-indexed.
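To make the expectation concrete, here is a minimal sketch of the wildcard matching as Google describes it: `*` stands for any string, a trailing `$` anchors the end of the URL, and everything else is a prefix match. The helper name `rule_matches` is hypothetical and this is not Google's actual parser; it only illustrates why, on paper, neither rule should touch a URL that has no question mark.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: does a robots.txt path pattern match a URL path?

    Assumes the documented wildcard semantics: '*' matches any string,
    a trailing '$' anchors the end, otherwise the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' in place of each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, giving prefix semantics.
    return re.match(regex, path) is not None

# The rules discussed above, checked against paths with and without a '?'.
for rule, path in [
    ("/*?", "/"),
    ("/*?", "/page?id=1"),
    ("/section/*?", "/section/"),
    ("/section/*?", "/section/?page=2"),
]:
    print(f"{rule!r} vs {path!r}: {rule_matches(rule, path)}")
```

Under these assumptions, `/` and `/section/` are not matched by either rule, so a de-indexed root would not be explained by the pattern itself.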

Coincidence on two different sites?

Has anyone else seen this?

5 Comments

  1. Posted November 27, 2007 at 2:43 am | Permalink

    From here Disallow: /*? works as described. There are other reasons why a home page doesn’t make it in the index. Did you check the logs for a fetch by Googlebot? Did you test that Google indexes the root without this directive? Did you try
    Disallow: /*?
    Allow: /

  2. Posted November 27, 2007 at 2:47 am | Permalink

    Oops:
    Disallow: /*?
    Allow: /$

  3. Posted November 27, 2007 at 3:05 am | Permalink

    Good ideas. I will grep the logs if I see it again, though I’ve noticed other Google robots.txt problems recently where they were indexing pages that were always blocked by robots.txt.

    Google quickly indexed the root once that /*? rule had been removed.

  4. John
    Posted November 27, 2007 at 3:06 am | Permalink

    Post your URL in the Google Webmaster Help groups and I’m sure people there will figure something out (or someone from Google can take a better look). It’s not really possible to do much without the URL. Thanks.

  5. Posted November 27, 2007 at 3:20 am | Permalink

    Unfortunately I cannot post the URLs in Google’s Webmaster forum.

    It has already been solved by the removal of that rule. If I have time tomorrow I will try to run a test on another domain and see if I can reproduce it for the Webmaster forum.
