Google is not obeying robots.txt files.
I’ve seen this on several sites already. Sections are blocked from spiders with robots.txt, yet Google spiders those pages anyway and indexes them, complete with a cached version of each page. Google even displays the last-crawled date on the cached pages.
I’m sure that the robots.txt rules are correct.
Google should not be accessing pages that are blocked with robots.txt.
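One way to double-check that a set of rules really blocks a given URL is to run them through a robots.txt parser. The snippet below is only a minimal sketch using Python's standard urllib.robotparser module. The rules, the URLs, and the is_blocked helper are hypothetical placeholders, not taken from any of the affected sites, and Python's parser may not match Google's own matching in every case (wildcard patterns, for example).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; paste in the site's real robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules disallow user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs: one under the disallowed path, one outside it.
    for url in (
        "http://example.com/private/page.html",
        "http://example.com/public/page.html",
    ):
        status = "blocked" if is_blocked(ROBOTS_TXT, "Googlebot", url) else "allowed"
        print(url, status)
```

If a page reports as blocked here and still shows up in Google with a cached copy, the rules themselves are probably not the problem.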
If you have seen this problem, please leave a comment below.


3 Comments
What I have seen is something slightly different.
Feedburner provides a way to block search engines from crawling your feed. It is code in the RSS itself.
My Feedburner stats still say that Google is visiting (and no, not just the bot picking up feeds for subscribers).
Hi Andy, thanks for your comment.
Feedburner is saying that Googlebot is spidering your feed? Have you seen it indexed? (e.g., cache:example.com/page/feed/)
On at least one of the sites where I’ve noticed this problem, Google’s robots.txt tool says that the pages are blocked. But Google still spiders and indexes them…
Hmm, it seems it has stopped being indexed now. That is interesting. It might actually be working as intended.