<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Pocket SEO &#187; Site Analysis</title>
	<atom:link href="http://pocketseo.com/category/site-analysis/feed" rel="self" type="application/rss+xml" />
	<link>http://pocketseo.com</link>
	<description>Practical SEO blog</description>
	<lastBuildDate>Wed, 28 Jul 2010 17:12:22 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Robots.txt &#8211; Watching the Minor Details</title>
		<link>http://pocketseo.com/site-analysis/252</link>
		<comments>http://pocketseo.com/site-analysis/252#comments</comments>
		<pubDate>Sat, 22 Mar 2008 19:04:34 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Site Analysis]]></category>

		<guid isPermaLink="false">http://pocketseo.com/site-analysis/252</guid>
		<description><![CDATA[Robots.txt tip - watch out for details.


Related posts:<ol><li><a href='http://pocketseo.com/google/182' rel='bookmark' title='Permanent Link: Is Google is Broken? (Robots.txt Hell)'>Is Google is Broken? (Robots.txt Hell)</a></li>
<li><a href='http://pocketseo.com/msn/177' rel='bookmark' title='Permanent Link: MSN Live Search Only Has Partial Support for Wildcards in Robots.txt'>MSN Live Search Only Has Partial Support for Wildcards in Robots.txt</a></li>
<li><a href='http://pocketseo.com/site-analysis/153' rel='bookmark' title='Permanent Link: How to Spot the Ultimate Robots.txt Mistake'>How to Spot the Ultimate Robots.txt Mistake</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I noticed that Google was indexing one of my blog&#8217;s feeds that I thought I had blocked with robots.txt:</p>
<p><img src='http://pocketseo.com/wp-content/uploads/2008/03/google-rss-feeds.png' alt='Google indexing RSS feeds' /></p>
<p>The indexed URL is <strong>pocketseo.com/domains/7/feed</strong>.  Notice that it doesn&#8217;t have a trailing slash.</p>
<p>My robots.txt rule is:</p>
<pre lang="robots.txt">
Disallow: /*/feed/
</pre>
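<p>WordPress URLs often have trailing slashes, but this feed URL was indexed without one, so the rule above does not match it.  I don&#8217;t want to remove the trailing slash from the rule, because <code>/*/feed</code> would then also block any URL with a path segment that merely starts with &#8220;feed&#8221;.</p>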
<p>It&#8217;s not a major problem, but here is a quick solution for that one indexed URL:</p>
<pre lang="robots.txt">
Disallow: /*/feed/
<span class="highlight">Disallow: /domains/7/feed</span>
</pre>
<p>It&#8217;s not important on a site like pocketseo.com, but I think that robots.txt rules are important on large sites and they can sometimes be tough to get right.</p>
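<p>The mismatch can be checked with a short script.  This is a minimal Python sketch, assuming Google-style wildcard matching (<code>*</code> matches any run of characters, a trailing <code>$</code> anchors the end, and everything else is a prefix match).  The function names and the <code>/tips/feedback</code> path are made up for illustration, and this is not any search engine&#8217;s actual code:</p>
<pre lang="python">
import re


def rule_to_regex(rule):
    """Translate one Disallow path into a regex (assumed Google-style wildcards)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each "*" match any run of characters.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.compile(pattern)


def is_blocked(path, disallow_rules):
    """Return True if any rule matches the path, starting at its first character."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


rules = ["/*/feed/"]
print(is_blocked("/domains/7/feed/", rules))      # URL with the trailing slash
print(is_blocked("/domains/7/feed", rules))       # URL without the trailing slash
print(is_blocked("/tips/feedback", ["/*/feed"]))  # hypothetical URL hit by a rule without the slash
</pre>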

<p>Related posts:<ol><li><a href='http://pocketseo.com/google/182' rel='bookmark' title='Permanent Link: Is Google is Broken? (Robots.txt Hell)'>Is Google is Broken? (Robots.txt Hell)</a></li>
<li><a href='http://pocketseo.com/msn/177' rel='bookmark' title='Permanent Link: MSN Live Search Only Has Partial Support for Wildcards in Robots.txt'>MSN Live Search Only Has Partial Support for Wildcards in Robots.txt</a></li>
<li><a href='http://pocketseo.com/site-analysis/153' rel='bookmark' title='Permanent Link: How to Spot the Ultimate Robots.txt Mistake'>How to Spot the Ultimate Robots.txt Mistake</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://pocketseo.com/site-analysis/252/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>XML Sitemaps and Competitive Intelligence</title>
		<link>http://pocketseo.com/site-analysis/204</link>
		<comments>http://pocketseo.com/site-analysis/204#comments</comments>
		<pubDate>Wed, 26 Dec 2007 13:04:46 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Site Analysis]]></category>

		<guid isPermaLink="false">http://pocketseo.com/site-analysis/204</guid>
		<description><![CDATA[A response to Graywolf's post on XML sitemaps and competitive intelligence.


Related posts:<ol><li><a href='http://pocketseo.com/google/196' rel='bookmark' title='Permanent Link: XML Sitemaps Do Not Affect Your Google Rankings'>XML Sitemaps Do Not Affect Your Google Rankings</a></li>
<li><a href='http://pocketseo.com/resources/24' rel='bookmark' title='Permanent Link: Henk Van Ess and Other Interesting Resources'>Henk Van Ess and Other Interesting Resources</a></li>
<li><a href='http://pocketseo.com/google/93' rel='bookmark' title='Permanent Link: Google Does Not Obey Robots.txt'>Google Does Not Obey Robots.txt</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Graywolf has an interesting post about using <a href="http://www.wolf-howl.com/case-study/using-google-sitemaps-for-competitive-intelligence/">sitemaps for competitive intelligence</a>:</p>
<blockquote><p>&#8230;people who are using automated solutions like popularity contest are telling you their most highly trafficked pages. Other people who are generating Sitemap XML files in a more manual fashion, are telling you the pages they want to rank. Chances are good the pages they want to rank for are the &#8220;money pages&#8221;.</p></blockquote>
<p>It&#8217;s an interesting idea. If you are worried about someone doing this to your sites, you could try to hide your sitemaps by giving them unpredictable names and only using the <a href="http://sitemaps.org/protocol.php#submit_ping">ping technique</a> to tell search engines where the sitemap index is located.</p>
<p>If you want to find your competitors&#8217; sitemaps you could use search engine queries like this:</p>
<ul>
<li><a href="http://www.google.com/search?q=site%3Agoogle.com+filetype%3Axml" rel="nofollow">http://www.google.com/search?q=site%3Agoogle.com+filetype%3Axml</a></li>
<li><a href="http://www.google.com/search?q=site%3Agoogle.com+inurl%3Asitemap" rel="nofollow">http://www.google.com/search?q=site%3Agoogle.com+inurl%3Asitemap</a></li>
</ul>
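<p>The same query URLs can be generated for any domain.  A minimal Python sketch using only the standard library; the helper name is made up for illustration:</p>
<pre lang="python">
from urllib.parse import urlencode


def sitemap_search_urls(domain):
    """Build the two Google queries shown above for a given domain."""
    queries = [
        "site:%s filetype:xml" % domain,
        "site:%s inurl:sitemap" % domain,
    ]
    # urlencode percent-encodes the colon and turns spaces into plus signs.
    return ["http://www.google.com/search?" + urlencode({"q": q}) for q in queries]


for url in sitemap_search_urls("google.com"):
    print(url)
</pre>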
<p>If you want to make sure that search engines can&#8217;t index your sitemap files you might be able to block them with an x-robots-tag HTTP header telling <a href="http://googleblog.blogspot.com/2007/07/robots-exclusion-protocol-now-with-even.html">Google</a> and <a href="http://www.ysearchblog.com/archives/000508.html">Yahoo</a> not to index them.</p>
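<p>As a rough illustration, here is a minimal Python WSGI sketch that serves a sitemap with that header.  The sitemap body and the function name are placeholders, and on a real site the header would more likely be set in the Web server configuration:</p>
<pre lang="python">
# Placeholder body: a real deployment would return the actual sitemap XML.
SITEMAP_BODY = b"(sitemap XML goes here)"


def sitemap_app(environ, start_response):
    """Serve the sitemap and ask search engines not to index the file itself."""
    headers = [
        ("Content-Type", "application/xml"),
        ("Content-Length", str(len(SITEMAP_BODY))),
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [SITEMAP_BODY]


# To try it locally (not run here):
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, sitemap_app).serve_forever()
</pre>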
<p><strong>EDIT:</strong> I removed the idea of blocking the sitemap with a <a href="http://sebastians-pamphlets.com/about-noindex-crawler-directives-in-robots-txt/">robots.txt <em>noindex</em> directive</a>, because it probably won&#8217;t work in this situation.</p>

<p>Related posts:<ol><li><a href='http://pocketseo.com/google/196' rel='bookmark' title='Permanent Link: XML Sitemaps Do Not Affect Your Google Rankings'>XML Sitemaps Do Not Affect Your Google Rankings</a></li>
<li><a href='http://pocketseo.com/resources/24' rel='bookmark' title='Permanent Link: Henk Van Ess and Other Interesting Resources'>Henk Van Ess and Other Interesting Resources</a></li>
<li><a href='http://pocketseo.com/google/93' rel='bookmark' title='Permanent Link: Google Does Not Obey Robots.txt'>Google Does Not Obey Robots.txt</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://pocketseo.com/site-analysis/204/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How to Spot the Ultimate Robots.txt Mistake</title>
		<link>http://pocketseo.com/site-analysis/153</link>
		<comments>http://pocketseo.com/site-analysis/153#comments</comments>
		<pubDate>Wed, 24 Oct 2007 23:40:36 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Site Analysis]]></category>

		<guid isPermaLink="false">http://pocketseo.com/site-analysis/153</guid>
		<description><![CDATA[Don't block your entire Web site with your robots.txt file.


Related posts:<ol><li><a href='http://pocketseo.com/msn/177' rel='bookmark' title='Permanent Link: MSN Live Search Only Has Partial Support for Wildcards in Robots.txt'>MSN Live Search Only Has Partial Support for Wildcards in Robots.txt</a></li>
<li><a href='http://pocketseo.com/site-analysis/252' rel='bookmark' title='Permanent Link: Robots.txt &#8211; Watching the Minor Details'>Robots.txt &#8211; Watching the Minor Details</a></li>
<li><a href='http://pocketseo.com/google/182' rel='bookmark' title='Permanent Link: Is Google is Broken? (Robots.txt Hell)'>Is Google is Broken? (Robots.txt Hell)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve seen this problem a surprising number of times: client accidentally blocks entire Web site with robots.txt file.</p>
<p>It often happens when files are copied from the dev server to the live server, and the dev server has a robots.txt file that blocks all robots.</p>
<p>Here is what it looks like:</p>
<p>The Google SERPs will not show a text snippet, and the title often becomes the words from the domain name (or the query) in lowercase letters:</p>
<p><img src='http://pocketseo.com/wp-content/uploads/2007/10/example-robots-txt.png' alt='Google SERPs for a site that is blocked by robots.txt' /></p>
<p>Here is a live example from <a href="http://pocketseo.com/web-20/130">msplinks.com</a>&mdash;they block the entire domain with robots.txt:</p>
<p><img src='http://pocketseo.com/wp-content/uploads/2007/10/msplinks-robots.png' alt='msplinks robots.txt in Google SERPs' /></p>
<p>So if you see that kind of result in Google&#8217;s SERPs for your site, check the robots.txt file.</p>
<p>This mistake won&#8217;t knock your site out of the SERPs right away, but it will probably reduce traffic very quickly.</p>
<p>One way to prevent it is not to block robots on the dev server with the robots.txt file.  Use password protection instead.</p>
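<p>A quick automated check can also catch the mistake around a deploy.  A minimal Python sketch using the standard library&#8217;s <code>urllib.robotparser</code>; the function name and the sample file contents are only for illustration:</p>
<pre lang="python">
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt, user_agent="Googlebot"):
    """Return True if this robots.txt text disallows the home page for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Hypothetical file contents for a dev server and a live server.
dev_robots = "User-agent: *\nDisallow: /"
live_robots = "User-agent: *\nDisallow: /private/"

print(blocks_entire_site(dev_robots))
print(blocks_entire_site(live_robots))
</pre>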

<p>Related posts:<ol><li><a href='http://pocketseo.com/msn/177' rel='bookmark' title='Permanent Link: MSN Live Search Only Has Partial Support for Wildcards in Robots.txt'>MSN Live Search Only Has Partial Support for Wildcards in Robots.txt</a></li>
<li><a href='http://pocketseo.com/site-analysis/252' rel='bookmark' title='Permanent Link: Robots.txt &#8211; Watching the Minor Details'>Robots.txt &#8211; Watching the Minor Details</a></li>
<li><a href='http://pocketseo.com/google/182' rel='bookmark' title='Permanent Link: Is Google is Broken? (Robots.txt Hell)'>Is Google is Broken? (Robots.txt Hell)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://pocketseo.com/site-analysis/153/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introduction to Advanced SEO Site Analysis for Large Web Sites</title>
		<link>http://pocketseo.com/site-analysis/142</link>
		<comments>http://pocketseo.com/site-analysis/142#comments</comments>
		<pubDate>Thu, 18 Oct 2007 17:02:18 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Site Analysis]]></category>

		<guid isPermaLink="false">http://pocketseo.com/site-analysis/142</guid>
		<description><![CDATA[A few tips on how to start an SEO analysis of a large Web site.


Related posts:<ol><li><a href='http://pocketseo.com/site-analysis/20' rel='bookmark' title='Permanent Link: How to Discover Subdomains on Client Sites'>How to Discover Subdomains on Client Sites</a></li>
<li><a href='http://pocketseo.com/scripts/116' rel='bookmark' title='Permanent Link: How to Turn Proxy Hijacking Into Inbound Links to Your Web Site'>How to Turn Proxy Hijacking Into Inbound Links to Your Web Site</a></li>
<li><a href='http://pocketseo.com/msn/150' rel='bookmark' title='Permanent Link: MSN Live Search Indexing URL Fragments'>MSN Live Search Indexing URL Fragments</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I often hear SEOs and Web developers make broad statements about large Web sites after just a few minutes looking at them.  The statement often goes something like this:</p>
<blockquote><p>&#8220;This site looks well optimized and gets millions of visitors per month.  Your title tags are optimized and you have meta tags.  Just keep building more content for your users and search engines will love it.  There is nothing more to do for the site&#8217;s SEO.&#8221;</p></blockquote>
<p>There is a lot more to SEO than just optimizing title elements (<a href="http://pocketseo.com/markup/14">not title <em>tags</em></a>) and meta tags, or randomly &#8220;adding more content&#8221;.  There is almost always more SEO that can be done to a Web site, no matter how much traffic it already gets.  This post lists some ways that one can start analyzing a large site and find issues that could be addressed to increase search engine referrals.</p>
<h2>The <em>site:</em> Query</h2>
<p>One of the first things to look at on a large Web site is the <em>site:</em> query in Google, Yahoo, and MSN.  Just go to each of the three search engines and enter <strong>site:example.com</strong> and the search engines will return a list of indexed pages.</p>
<p>You can use <a href="http://pocketseo.com/google/33">Google&#8217;s URL parameters</a> to switch to 100 results per page, which allows you to scan the URLs that have been indexed.  Look for titles and text snippets (usually the meta description) that are identical across many pages.  Also scan the indexed URLs for things like:</p>
<ul>
<li>URLs that should not be indexed</li>
<li>URLs that were indexed over HTTPS</li>
<li>URLs that have many parameters</li>
<li>URLs that were indexed on the <em>www</em> subdomain like <strong>http://www.example.com/</strong> as well as without a subdomain like <strong>http://example.com/</strong></li>
</ul>
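<p>Once the indexed URLs have been copied into a list, part of that scan can be automated.  A minimal Python sketch; the function name and the threshold of three parameters are assumptions, and it only covers the HTTPS, parameter-count, and <em>www</em> checks from the list above:</p>
<pre lang="python">
from urllib.parse import parse_qsl, urlsplit

MAX_PARAMS = 3  # assumed threshold, adjust it for the site being analyzed


def flag_indexed_urls(urls):
    """Return (url, problems) pairs for indexed URLs that deserve a closer look."""
    hosts = {urlsplit(url).hostname for url in urls}
    flagged = []
    for url in urls:
        parts = urlsplit(url)
        host = parts.hostname or ""
        problems = []
        if parts.scheme == "https":
            problems.append("indexed over HTTPS")
        if len(parse_qsl(parts.query, keep_blank_values=True)) > MAX_PARAMS:
            problems.append("many parameters")
        # The same site indexed both with and without the www subdomain.
        if host.startswith("www.") and host[4:] in hosts:
            problems.append("also indexed without www")
        if problems:
            flagged.append((url, problems))
    return flagged


sample = [
    "http://www.example.com/",
    "http://example.com/",
    "https://example.com/login",
]
for url, problems in flag_indexed_urls(sample):
    print(url, problems)
</pre>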
<p>Google will only show 1000 results.  To show more than 1000 results you have to get more specific by querying directories.  For example, to find all of the pages in a directory called <em>products</em> use this Google query: <strong>site:http://example.com/products/</strong>.  You can leave the <strong>http://</strong> part off, but be aware that a URL that starts with <strong>http://</strong> is different than one that starts with <strong>https://</strong>.  If you include the <em>www</em> Google will only show results from the <em>www</em> subdomain.  If you leave the <em>www</em> off, you will get indexed pages from all subdomains.</p>
<p>Also run the <em>site:</em> query in Yahoo and MSN.  Yahoo will often reveal strange indexed URLs like the ones created by <a href="http://msdn2.microsoft.com/en-us/library/ms972429.aspx" rel="nofollow">ASP.NET Session Service</a>.  Those ASP.NET URLs look like this, with the highlighted part being a kind of session ID that changes:<br />
<strong>http://example.com/<span class="highlight">(lit3py55t21z5v55vlm25s55)</span>/Application/SessionState.aspx</strong></p>
<p>It&#8217;s also useful to check how MSN Live Search has indexed your site because it can provide additional data about how your site is being crawled.</p>
<h2>Find Subdomains</h2>
<p>You should <a href="http://pocketseo.com/site-analysis/20" title="How to discover subdomains on client Web sites">check for subdomains on the site</a> to see if there is duplicate content or other unwanted sections of the site being indexed.  Large Web sites often have multiple subdomains for load balancing or other reasons.  One common situation is: <strong>www.example.com</strong>, <strong>www2.example.com</strong>, <strong>www3.example.com</strong>&mdash;all with the same content.</p>
<p>Here is <a href="http://www.google.com/search?q=site%3Acnn.com+-www.cnn.com&#038;num=100" rel="nofollow">an example from CNN.com</a>:</p>
<p><img src='http://pocketseo.com/wp-content/uploads/2007/10/cnn-seo.png' alt='CNN SEO problems - duplicate content on subdomains' /></p>
<p>I&#8217;ve marked the subdomains that contain duplicate content.</p>
<p>For example, the subdomain <strong>www-cgi.cnn.com</strong> has <a href="http://www.google.com/search?q=site%3Awww-cgi.cnn.com" rel="nofollow">over 17,000 pages indexed</a> and they look like duplicates of the pages on the main domain.</p>
<p>Notice that <strong>beta.cnn.com</strong> has also been indexed.  It looks like someone tried to block that from robots with <a href="http://beta.cnn.com/robots.txt" rel="nofollow">a robots.txt file</a> that contains the rule <code>Disallow: /beta</code> (hint: that won&#8217;t work).</p>
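<p>Presumably the reason is that a robots.txt rule is matched against the URL path on the host that serves the file, not against the hostname.  A minimal Python sketch with the standard library parser; the <code>User-agent: *</code> line and the second test URL are assumptions, since only the <code>Disallow</code> rule is quoted above:</p>
<pre lang="python">
from urllib.robotparser import RobotFileParser

# Assumed file contents: only the Disallow line comes from the post.
robots_txt = "User-agent: *\nDisallow: /beta"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The rule only covers paths that start with /beta on that host.
home_allowed = parser.can_fetch("Googlebot", "http://beta.cnn.com/")
beta_path_allowed = parser.can_fetch("Googlebot", "http://beta.cnn.com/beta/page.html")
print(home_allowed, beta_path_allowed)
</pre>
<p>Blocking the whole subdomain would take <code>Disallow: /</code> in the robots.txt file served from that subdomain.</p>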
<h2>Off-Domain Duplicate Content</h2>
<p>In addition to duplicate content on subdomains, you may also find the site duplicated on other domains.  Often people will say things like, &#8220;I want my site to be accessible at example.com, example_1.com, and example_2.com&#8221;.  They end up with their entire site duplicated across other domains.</p>
<p>You can search for these duplicate domains by copying a sentence of text (or a unique string of words) from the Web site and searching Google with it.  Put the sentence in quotes to return only exact matches.  If the content appears on other sites, those sites will probably turn up in a Google search.</p>
<p>You can also often find clients&#8217; other domain names by using the <em>ip:</em> query in MSN Live Search.  Use the <a href="http://pocketseo.com/tools/59">Show IP Firefox Extension</a> to find the IP address of the client&#8217;s Web site.  Then go to MSN Live Search and type in:<br />
<strong>ip:nnn.nnn.nnn.nnn</strong> (replace the letters &#8220;n&#8221; with the IP address)</p>
<p>MSN will return a list of other sites at the same IP address.</p>
<p>Also, ask the client, &#8220;Do you have any other domain names that point to the same server/content?&#8221;</p>
<h2>Duplicate Content From Referrers</h2>
<p>URLs with referrer information (e.g., affiliate program tracking) can create multiple URLs for a single page of content.</p>
<p>Here is <a href="http://www.google.com/search?q=intitle:%22CNN.com+-+Breaking+News,+U.S.,+World,+Weather,+Entertainment%22&#038;num=100&#038;filter=0" rel="nofollow">another example from CNN.com</a> showing that the home page is indexed with many different URLs:</p>
<p><img src='http://pocketseo.com/wp-content/uploads/2007/10/cnn-duplicate-homepage.png' alt='CNN duplicate home page URLs' /></p>
<p>You can find these in Google with a query like:<br />
<strong>intitle:&#8221;some text from the page&#8217;s title element&#8221;</strong></p>
<h2>Site Spider</h2>
<p>It is useful to locate URLs that return <em>302</em>, <em>403</em>, <em>404</em>, or other undesired status codes.  It&#8217;s also good to locate URLs that don&#8217;t exist but still send a <em>200 OK</em> response.</p>
<p>I used to write short custom scripts to check response headers on sites, but now I usually use the free <a href="http://www.auditmypc.com/free-sitemap-generator.asp">AuditMyPC.com Sitemap Generator</a> which can generate a useful spreadsheet of HTTP response headers.</p>
<p>If you are using the AuditMyPC.com tool, be sure to respect robots.txt and rel=nofollow as shown below:</p>
<p><img src='http://pocketseo.com/wp-content/uploads/2007/10/sitemap-crawler-options.png' alt='AuditMyPC.com crawler image' /></p>
<p>You can also set a crawl delay so that it doesn&#8217;t put too much load on their servers.</p>
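<p>For a handful of URLs, a short custom script is still enough.  This is a minimal Python sketch of that kind of check, not the AuditMyPC.com tool; the class and function names are made up, and it does not read robots.txt or add a crawl delay, so both would need to be added before pointing it at a real site:</p>
<pre lang="python">
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Report 3xx responses instead of silently following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request to url."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # With redirects disabled, 3xx, 4xx, and 5xx all arrive here.
        return error.code


def describe(status, should_exist=True):
    """Label a status code according to the checks described above."""
    if not should_exist:
        # A URL that does not exist should say so instead of returning 200 OK.
        return "ok" if status in (404, 410) else "missing page returned %d" % status
    if status in (302, 403, 404):
        return "undesired %d" % status
    return "ok"


# Example (makes a network request, so it is left as a comment):
#   print(describe(fetch_status("http://example.com/no-such-page"), should_exist=False))
</pre>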
<h2>Analyzing the Raw Logs</h2>
<p><a href="http://pocketseo.com/analytics/31">Grepping your raw log files</a> can turn up a lot of useful information about how spiders are perceiving your site.  Basically, you want to extract all of the hits that were made by Google&#8217;s, Yahoo&#8217;s, and MSN&#8217;s spiders, and then see what kind of response codes were sent.</p>
<p>This is useful for finding URLs that search engines are requesting but not indexing, or large numbers of internal 302 redirects that you might have missed.</p>
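<p>The same extraction can be sketched in Python.  This assumes the common Apache &#8220;combined&#8221; log format and the user-agent substrings listed in the code; both are assumptions that should be checked against the actual logs:</p>
<pre lang="python">
import re
from collections import Counter

# Assumed user-agent substrings for Google, Yahoo, and MSN spiders.
SPIDERS = ("Googlebot", "Slurp", "msnbot")

# Captures the requested path, the status code, and the last quoted field (user agent).
LINE_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) [^"]*" (\d{3}) .*"([^"]*)"$')


def spider_status_counts(lines):
    """Count (spider, status code) pairs for hits made by the known spiders."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue  # skip lines that are not in the expected format
        path, status, agent = match.groups()
        for spider in SPIDERS:
            if spider in agent:
                counts[(spider, status)] += 1
    return counts


# Example usage with a hypothetical log file name:
#   with open("access.log") as handle:
#       for key, total in sorted(spider_status_counts(handle).items()):
#           print(key, total)
</pre>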
<h2>Site Architecture</h2>
<p>This is a large topic, so this is just a quick overview of where you might start:</p>
<p>Check each page that is linked to from the home page.  Are they category pages?  Do the category pages link to the pages that will bring maximum SEO benefit?  Is the site keeping the important content within a few clicks of the home page?</p>
<h2>Piece by Piece</h2>
<p>Attack the site&#8217;s SEO issues in smaller chunks.  Optimize smaller sections of the site at a time and observe how the search engines respond.  Then move on to the next SEO issue.</p>
<h2>Conclusion: SEO for Large Web Sites</h2>
<p>The techniques above are just a few places to start when analyzing a large Web site.  Even those few analysis methods should give you lots of SEO material to work on that goes beyond the obvious basics like page titles, meta tags, and alt attributes.</p>

<p>Related posts:<ol><li><a href='http://pocketseo.com/site-analysis/20' rel='bookmark' title='Permanent Link: How to Discover Subdomains on Client Sites'>How to Discover Subdomains on Client Sites</a></li>
<li><a href='http://pocketseo.com/scripts/116' rel='bookmark' title='Permanent Link: How to Turn Proxy Hijacking Into Inbound Links to Your Web Site'>How to Turn Proxy Hijacking Into Inbound Links to Your Web Site</a></li>
<li><a href='http://pocketseo.com/msn/150' rel='bookmark' title='Permanent Link: MSN Live Search Indexing URL Fragments'>MSN Live Search Indexing URL Fragments</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://pocketseo.com/site-analysis/142/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>6 Reasons Why Clean URLs Matter</title>
		<link>http://pocketseo.com/site-analysis/125</link>
		<comments>http://pocketseo.com/site-analysis/125#comments</comments>
		<pubDate>Mon, 15 Oct 2007 19:36:22 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Site Analysis]]></category>

		<guid isPermaLink="false">http://pocketseo.com/site-analysis/125</guid>
		<description><![CDATA[6 solid reasons that you should be using clean URLs.


Related posts:<ol><li><a href='http://pocketseo.com/google/181' rel='bookmark' title='Permanent Link: Google&#8217;s Vastly Improved Webmaster Guidelines'>Google&#8217;s Vastly Improved Webmaster Guidelines</a></li>
<li><a href='http://pocketseo.com/site-analysis/252' rel='bookmark' title='Permanent Link: Robots.txt &#8211; Watching the Minor Details'>Robots.txt &#8211; Watching the Minor Details</a></li>
<li><a href='http://pocketseo.com/google/93' rel='bookmark' title='Permanent Link: Google Does Not Obey Robots.txt'>Google Does Not Obey Robots.txt</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been reading SEO forums lately and have read some comments that there is absolutely no difference between using static or dynamic URLs.</p>
<p>A site can get indexed and ranked well if it has dynamic URLs, but that does not mean that dynamic URLs are as good as clean URLs.</p>
<p><strong>Note:</strong> if you have dynamic URLs, don&#8217;t just change them to clean URLs after reading this without knowing what you are doing.  I&#8217;ll cover the risks involved with doing that in another post.  Changing URL structure from dynamic to static would not be recommended in all cases.  This post just contains general advice on why you should start with clean URLs.</p>
<h2>1. The search engines recommend clean URLs</h2>
<p><a href="http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_GuidelinesforOptimizingSite.htm" rel="nofollow">MSN says</a> it very plainly, &#8220;<em>Keep your URLs simple and static.</em>&#8221;</p>
<p><a href="http://www.google.com/support/webmasters/bin/answer.py?hl=en&#038;answer=35769#design">Google</a> &mdash; &#8220;<em>If you decide to use dynamic pages (i.e., the URL contains a &#8220;?&#8221; character), be aware that not every search engine spider crawls dynamic pages as well as static pages. It helps to keep the parameters short and the number of them few.</em>&#8221;</p>
<p><strong>Update: (26 Nov 07)</strong> Google has updated their Webmaster Guidelines to specifically talk about <a href="http://pocketseo.com/google/181">problems with dynamic URLs</a>:</p>
<blockquote><p>Google indexes dynamically generated webpages, including .asp pages, .php pages, and pages with question marks in their URLs. However, these pages can cause problems for our crawler and may be ignored. If you&#8217;re concerned that your dynamically generated pages are being ignored, you may want to consider creating static copies of these pages for our crawler. If you do this, please be sure to include a robots.txt file that disallows the dynamic pages in order to ensure that these pages aren&#8217;t seen as having duplicate content.</p></blockquote>
<p><a href="http://help.yahoo.com/l/us/yahoo/search/siteexplorer/dynamic/">Yahoo&#8217;s answer</a> is more complex because they have a <a href="http://help.yahoo.com/l/us/yahoo/search/siteexplorer/dynamic/">dynamic URL tool</a> in SiteExplorer.  Yahoo clearly states some of the problems that dynamic URLs can create for spiders:</p>
<blockquote><p>
You might use URL parameters in your site to perform various functions apart from modifying the content of the page, such as,</p>
<ul>
<li>session ids &#8211; for tracking user sessions</li>
<li>source trackers &#8211; for tracking the sources which are sending referrals to your pages and site</li>
<li>format modifiers &#8211; for print formats etc</li>
</ul>
<p>In these cases, our crawler will come across different versions of your site&#8217;s URLs that are substantially, if not exactly, similar in content. This causes numerous problems:</p>
<ol>
<li>Such URLs look like different documents to our crawler and create excessive crawling on your site.</li>
<li>We do our best to detect duplicates among these pages, and the detected duplicates are prevented from ranking well. However, if we are unable to detect the duplicate, this results in duplicate results from your site competing for positions in search results</li>
<li>When people link to your site with different versions of these URLs, it fragments the link referrals to your site across multiple different URLs even though it&#8217;s the same page.</li>
</ol>
</blockquote>
<p>Ask.com doesn&#8217;t have much information on dynamic URLs, except for <a href="http://sp.uk.ask.com/en/docs/about/webmasters.shtml#15">a quick mention</a> on their UK site:</p>
<blockquote><p>
<span class="highlight">We include a select number of dynamic URLs in our index.</span> However, they are screened to detect likely duplicates before downloading.
</p></blockquote>
<p>Note that Ask.com carefully uses the phrase, &#8220;a select number&#8221;.</p>
<p>Search engines are not saying that they cannot spider and index dynamic URLs, but they are giving very clear hints that it is easier for them to index static URLs.</p>
<h2>2. Dynamic URLs probably do not pass link value in the same way</h2>
<p>Matt Cutts talked a little bit about this in <a href="http://video.google.com/videoplay?docid=-6860320126300142609">one of his videos</a>.  Here is a transcription, with relevant parts highlighted:</p>
<blockquote><p>Does Google treat Dynamic Pages differently than static pages&#8230;?</p>
<p>Good question. <span class="highlight">To a first approximation</span>, we do treat static and dynamic pages <span class="highlight">in similar ways</span> in ranking. So let me explain that in a little more detail.</p>
<p>Pagerank <span class="highlight">flows to</span> dynamic URL&#8217;s in the same way they flow to static URL&#8217;s. And so, if you&#8217;ve got New York Times linking to a dynamic URL you&#8217;ll still get the Pagerank benefit, and it will still flow the Pagerank Benefit. There are other Search Engines who in the past have said, &#8216;OK, we&#8217;ll go one-level deep from static URLs, so we&#8217;re not going to crawl from a dynamic URL, but we&#8217;re willing to go into the dynamic URL space from a static URL&#8217;. So, the short answer is Pagerank still flows just the same between static and dynamic.</p>
<p>Let&#8217;s go into the more detailed answer.</p>
<p>The example you gave actually has 5 parameters, and one of them is like a Product ID with like 2725. You definitely can use too many parameters.  I would absolutely opt for 2 or 3 at the most, if you have any choice whatsoever. And try to avoid long numbers, because we can think that those are session IDs. Any extra parameters that you can get rid of are always a good idea.</p>
<p><span class="highlight">And remember that Google is not the only Search Engine out there, so if you have the ability to basically say I&#8217;m going to use a little bit of mod_rewrite and I&#8217;m going to make it look like a static URL, that can often be a very good way to tackle the problem.</span> So, PageRank still flows but, experiment! If you don&#8217;t see any URLs that have the same structure, or the same number of parameters as you&#8217;re thinking about doing; it&#8217;s probably better if you can either cut back on the number of parameters or shorten them somehow or try to use mod_rewrite. </p></blockquote>
<p>So Matt Cutts is saying that, &#8220;Pagerank <span class="highlight">flows to</span> dynamic URL&#8217;s in the same way they flow to static URL&#8217;s.&#8221;  Does that mean that PageRank does not flow <em>from</em> dynamic pages the same way as it does from static URLs?  For example, if a dynamic URL links to another dynamic URL, is it the same as if a static URL links to a page?</p>
<p>Yahoo used to link to a good powerpoint presentation from <a href="http://www.ysearchblog.com/archives/000050.html">this page</a> called <a href="http://www.ysearchblog.com/files/wmw2004/search-friendly-design.ppt" rel="nofollow">Search Friendly Design</a> (hopefully they will put it back or update it).</p>
<p>Here is a screenshot from the presentation:</p>
<p><img src='http://pocketseo.com/wp-content/uploads/2007/10/yahoo-dynamic-urls.png' alt='Yahoo recommends clean URLs' /></p>
<blockquote>
<h3>Database-Driven Sites</h3>
<ul>
<li><strong>What gets crawled</strong>
<ul>
<li>Static URLs</li>
<li>Dynamic Pages with in-links from static pages</li>
<li>Links between dynamic pages are problematic for crawlers (some get crawled, some don&#8217;t)</li>
</ul>
</li>
<li><strong>Limit URL depth when using dynamic-to-static</strong></li>
</ul>
</blockquote>
<p>The presentation was from 2004, and it&#8217;s possible that Yahoo has upgraded their indexer to the point where dynamic URLs don&#8217;t matter at all anymore, but I doubt that Yahoo handles dynamic URLs in the exact same way as static URLs.</p>
<p>Here is another slide from the presentation where Yahoo (Tim Mayer) says that Yahoo &#8220;&#8230;won&#8217;t crawl&#8230; spider &#8216;traps&#8217; (dynamic content)&#8221;.</p>
<p><img src='http://pocketseo.com/wp-content/uploads/2007/10/yahoo-indexer-spider.png' alt='Yahoo Spider and Indexer Slide' /></p>
<p>Not all dynamic content is a &#8220;spider trap&#8221;, but the concept of spider trap should be kept in mind when looking at dynamic URLs.  Consider the perspective of a search engine architect: dynamic Web sites can generate an unlimited number of pages.  The spider has to be able to automatically detect when it&#8217;s getting caught in an endless loop of dynamic content (&#8220;spider trap&#8221;).  This is probably why Google strongly recommends not to exceed a couple of URL parameters.</p>
<p>The search engines&#8217; representatives are clearly hinting that you should watch out for dynamic URLs.</p>
<h2>3. Dynamic URLs often work even when the parameters are reversed, added, or removed</h2>
<p>For example, the URL <strong>http://example.com/index.php?name=abc&#038;num=123</strong> might also load at <strong>http://example.com/index.php?num=123&#038;name=abc</strong>.  If search engines find both links you may end up with multiple indexed URLs for the same page of content.</p>
<p>Here is an example of dynamic URLs causing problems on YouTube&mdash;a site that would benefit from some SEO if Google weren&#8217;t automatically inserting the site in the SERPs with Universal Search:</p>
<ul>
<li><a href="http://www.youtube.com/watch?v=hwTaSmS85l4" rel="nofollow">http://www.youtube.com/watch?v=hwTaSmS85l4</a></li>
<li><a href="http://www.youtube.com/watch?v=hwTaSmS85l4&#038;mode=related&#038;search=" rel="nofollow">http://www.youtube.com/watch?v=hwTaSmS85l4&#038;mode=related&#038;search=</a></li>
</ul>
<p>Both URLs load the same page of content and both are indexed by Google.</p>
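<p>When auditing a site you can spot this kind of duplication by reducing each URL to a canonical form and seeing which ones collapse together.  The sketch below uses Python&#8217;s standard <code>urllib.parse</code> module.  The <code>IGNORED_PARAMS</code> set is an assumption made for this example; which parameters are safe to ignore depends entirely on the site.</p>

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption for this example: these parameters do not change the page content.
IGNORED_PARAMS = {"mode", "search"}


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Drop ignored parameters and sort the remaining ones by name."""
    parts = urlsplit(url)
    # keep_blank_values=True so an empty "search=" is parsed (and then dropped).
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = sorted((k, v) for k, v in pairs if k not in ignored)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


urls = [
    "http://www.youtube.com/watch?v=hwTaSmS85l4",
    "http://www.youtube.com/watch?v=hwTaSmS85l4&mode=related&search=",
    "http://example.com/index.php?name=abc&num=123",
    "http://example.com/index.php?num=123&name=abc",
]

for url in urls:
    print(canonical_url(url))
```

<p>Each pair prints the same canonical URL, so any two crawled URLs that share a canonical form are candidates for duplicate content.</p>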
<h2>4. Dynamic URLs are harder to block with robots.txt</h2>
<p>Pages with dynamic URLs are more difficult to block with robots.txt files.  The robots.txt standard (which needs updating) doesn&#8217;t yet officially support wildcards, though Google, Yahoo, <strike>and MSN</strike> all do.  [<strong>UPDATE:</strong> <a href="http://pocketseo.com/msn/177">MSN doesn't fully support wildcards in robots.txt</a>]  For example, Google could block the duplicate URLs above on YouTube with the following robots.txt rules, but wildcards are an extension of the standard and may not work with all robots:</p>
<p><code><br />
Disallow: /*mode=<br />
Disallow: /*search=<br />
</code></p>
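<p>Before deploying wildcard rules like these, it helps to test them against a list of real URLs.  The following sketch approximates wildcard matching by treating <code>*</code> as &#8220;any run of characters&#8221; and matching from the start of the path.  It is an approximation written for this post, not the matching code of any search engine, and the helper names are made up.</p>

```python
import re


def rule_to_regex(rule):
    """Turn a Disallow path containing '*' wildcards into a compiled regex."""
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(pattern)


def is_blocked(path_and_query, disallow_rules):
    """True if any rule matches starting at the beginning of the path."""
    return any(rule_to_regex(rule).match(path_and_query) for rule in disallow_rules)


RULES = ["/*mode=", "/*search="]

print(is_blocked("/watch?v=hwTaSmS85l4", RULES))                        # False
print(is_blocked("/watch?v=hwTaSmS85l4&mode=related&search=", RULES))   # True
```

<p>Note that a rule such as <code>/*mode=</code> matches the text <code>mode=</code> anywhere in the URL, so it would also catch a parameter named <code>xmode=</code>.  Checking your rules against a full URL list is the easiest way to find surprises like that.</p>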
<p>If you only use static URLs for your site, you also have the advantage of being able to easily block weird query strings that might appear on your site.</p>
<p>For example, if all of your main URLs are clean you can block weird URLs that you might not have anticipated like <strong>http://example.com/widgets/5/?ajax_callback=true</strong>.</p>
<p><strong>Update: 27 Nov 2007</strong> Please disregard the following advice for the moment and see <a href="http://pocketseo.com/google/182">this post about Google and robots.txt</a> before implementing it:</p>
<p><strike>A single robots.txt rule should at least block those kinds of unexpected URLs from at least Google and Yahoo:</strike></p>
<p><code><br />
<strike># this blocks all dynamic URLs from Google and Yahoo</strike><br />
<strike>Disallow: /*?</strike><br />
</code></p>
<p><strong>Warning:</strong> Be very careful before implementing that last robots.txt rule though because there may be essential pages on the site that have dynamic URLs.  There are many factors to consider before doing that, such as &#8220;are there linked-to, indexed dynamic URLs on the site that are 301 redirecting to new URLs?&#8221;  Basically, it should only be done on new sites where you are sure that you have constructed the site so that your essential site structure does not contain any dynamic URLs.</p>
<h2>5. Clean URLs are more memorable</h2>
<p>It&#8217;s better to send people the URL <strong>http://example.com/pagename</strong> than <strong>http://example.com/index.php?page=pagename</strong>.</p>
<p>If you have an ecommerce site, it&#8217;s easier for your customers to remember <strong>http://example.com/widgets</strong> than <strong>http://example.com/index.php?category_id=widgets</strong>.</p>
<h2>6. Tim Berners-Lee says so (basically)</h2>
<p>Tim Berners-Lee, the inventor of the WWW, has a great page of information about <a href="http://www.w3.org/Provider/Style/URI">URL structure</a>.  He doesn&#8217;t specifically mention dynamic URLs, but reading the article will give a sense of how to create good URLs.  File name extensions and software mechanisms should not appear in URLs.</p>
<h2>Conclusion: why static URLs are better than dynamic</h2>
<p>I&#8217;ve heard even some well-known SEOs say things like &#8220;there is no difference between static and dynamic URLs&mdash;just look at the dynamic URLs on my well-indexed site as an example&#8221;.</p>
<p>I&#8217;m not saying that dynamic URLs won&#8217;t get indexed or won&#8217;t rank highly.  Search engines can spider and index dynamic URLs.  But anyone who says, &#8220;I already rank highly for my main keywords; I don&#8217;t need SEO&#8221; does not understand the potential of SEO.  Traffic can almost always be increased on a Web site, even on sites that rank #1 for highly competitive keywords.</p>

<p>Related posts:<ol><li><a href='http://pocketseo.com/google/181' rel='bookmark' title='Permanent Link: Google&#8217;s Vastly Improved Webmaster Guidelines'>Google&#8217;s Vastly Improved Webmaster Guidelines</a></li>
<li><a href='http://pocketseo.com/site-analysis/252' rel='bookmark' title='Permanent Link: Robots.txt &#8211; Watching the Minor Details'>Robots.txt &#8211; Watching the Minor Details</a></li>
<li><a href='http://pocketseo.com/google/93' rel='bookmark' title='Permanent Link: Google Does Not Obey Robots.txt'>Google Does Not Obey Robots.txt</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://pocketseo.com/site-analysis/125/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Negative SEO</title>
		<link>http://pocketseo.com/site-analysis/51</link>
		<comments>http://pocketseo.com/site-analysis/51#comments</comments>
		<pubDate>Thu, 16 Aug 2007 19:00:45 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Site Analysis]]></category>

		<guid isPermaLink="false">http://pocketseo.com/site-analysis/51</guid>
		<description><![CDATA[Inside the evil art of negative SEO.


Related posts:<ol><li><a href='http://pocketseo.com/site-analysis/142' rel='bookmark' title='Permanent Link: Introduction to Advanced SEO Site Analysis for Large Web Sites'>Introduction to Advanced SEO Site Analysis for Large Web Sites</a></li>
<li><a href='http://pocketseo.com/google/108' rel='bookmark' title='Permanent Link: Is Google Using &#8220;the Temporal Pattern&#8221; to Detect Paid Links?'>Is Google Using &#8220;the Temporal Pattern&#8221; to Detect Paid Links?</a></li>
<li><a href='http://pocketseo.com/tools/165' rel='bookmark' title='Permanent Link: The Worst Thing About WordPress SEO'>The Worst Thing About WordPress SEO</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Forbes.com has an article called <a href="http://www.forbes.com/2007/06/28/negative-search-google-tech-ebiz-cx_ag_0628seo.html">The Saboteurs of Search</a> which contains a slideshow of &#8220;7 Ways your Site Can be Sabotaged&#8221;.</p>
<p>They quoted Matt Cutts:</p>
<blockquote><p>Matt Cutts, a senior software engineer for Google, says that piling links onto a competitor&#8217;s site to reduce its search rank isn&#8217;t impossible, but it&#8217;s extremely difficult. &#8220;We try to be mindful of when a technique can be abused and make our algorithm robust against it,&#8221; he says. &#8220;I won&#8217;t go out on a limb and say it&#8217;s impossible. But Google bowling is much more inviting as an idea than it is in practice.&#8221;</p></blockquote>
<p>Defense against &#8220;negative SEO&#8221; involves knowing how it is performed.  I&#8217;ve categorized this post under <em>Site Analysis</em> because it involves things you might look for when analyzing a site.</p>
<p>This is Forbes.com&#8217;s list of the 7 ways to sabotage a Web site:</p>
<ol>
<li><strong>Google Bowling</strong> &mdash; Creating many low-quality inbound links to a site.  Easy to discover.</li>
<li><strong>Tattling</strong> &mdash; Reporting paid links to Google through a spam report.  If you don&#8217;t buy links, it probably isn&#8217;t a problem.  People who would use the dark side of SEO probably buy links also.</li>
<li><strong>Google Insulation</strong> &mdash; Creating content that ranks above the competitor&#8217;s site.</li>
<li><strong>Copyright Takedown Notices</strong> &mdash; Reporting another Web site for copyright infringement can have it taken down for 10 days.  This is probably illegal and opens the perpetrator to lawsuits.</li>
<li><strong>Copied Content</strong> &mdash; Creating duplicate content.  More difficult to trace, but you can defend against it by monitoring for duplicate content.  If they are using proxies to duplicate your content you can send alternate content to the proxies by IP address.</li>
<li><strong>Denial of Service</strong> &mdash; Crashing the Web site with a DoS attack.  Definitely illegal.</li>
<li><strong>Click Fraud</strong> &mdash; Simulating click fraud on a competitor&#8217;s Web site.</li>
</ol>
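<p>Monitoring for copied content, mentioned in the list above, can be roughly automated.  One common approach is to compare overlapping word sequences (&#8220;shingles&#8221;) between your page and a suspect page.  The sketch below is a minimal version of that idea; the shingle size of three words is an arbitrary choice, and the sample sentences are invented.</p>

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))


original = "Static URLs are easier for search engines to crawl and index."
copied = "Static URLs are easier for search engines to crawl and index."
unrelated = "Our widgets ship worldwide within two business days of ordering."

print(similarity(original, copied))
print(similarity(original, unrelated))
```

<p>A score near 1.0 suggests a copy and a score near 0.0 suggests unrelated text.  Where to draw the line in between is a judgment call that depends on your content.</p>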
<p>Has anyone been affected by these techniques, or do you know of any others?</p>

<p>Related posts:<ol><li><a href='http://pocketseo.com/site-analysis/142' rel='bookmark' title='Permanent Link: Introduction to Advanced SEO Site Analysis for Large Web Sites'>Introduction to Advanced SEO Site Analysis for Large Web Sites</a></li>
<li><a href='http://pocketseo.com/google/108' rel='bookmark' title='Permanent Link: Is Google Using &#8220;the Temporal Pattern&#8221; to Detect Paid Links?'>Is Google Using &#8220;the Temporal Pattern&#8221; to Detect Paid Links?</a></li>
<li><a href='http://pocketseo.com/tools/165' rel='bookmark' title='Permanent Link: The Worst Thing About WordPress SEO'>The Worst Thing About WordPress SEO</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://pocketseo.com/site-analysis/51/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to Discover Subdomains on Client Sites</title>
		<link>http://pocketseo.com/site-analysis/20</link>
		<comments>http://pocketseo.com/site-analysis/20#comments</comments>
		<pubDate>Thu, 24 May 2007 03:27:47 +0000</pubDate>
		<dc:creator>Josh</dc:creator>
				<category><![CDATA[Site Analysis]]></category>

		<guid isPermaLink="false">http://pocketseo.com/site-analysis/20</guid>
		<description><![CDATA[How to find hidden subdomains on client sites that may be creating duplicate content.  This is frequently a problem on large Web sites.


Related posts:<ol><li><a href='http://pocketseo.com/site-analysis/142' rel='bookmark' title='Permanent Link: Introduction to Advanced SEO Site Analysis for Large Web Sites'>Introduction to Advanced SEO Site Analysis for Large Web Sites</a></li>
<li><a href='http://pocketseo.com/google/33' rel='bookmark' title='Permanent Link: Google Search URL Parameters'>Google Search URL Parameters</a></li>
<li><a href='http://pocketseo.com/site-analysis/153' rel='bookmark' title='Permanent Link: How to Spot the Ultimate Robots.txt Mistake'>How to Spot the Ultimate Robots.txt Mistake</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>On large, complex sites you will often find many &quot;hidden&quot; subdomains that duplicate the content on the main domain.  You might even find that the dev server has been indexed (e.g., http://dev.example.com/).  Checking for subdomains should be part of every site audit.</p>
<p>A simple Google query will show you a list of subdomains on a site by excluding domains that begin with <em>www</em>:</p>
<p><code>site:example.com -www.example.com</code></p>
<p>Change the view to 100 per page and visually scan the SERPs for subdomains.  You can quickly convert the view to 100 results per page by appending the text <em>&amp;num=100</em> to the Google SERPs URL.</p>
<p>Below is an example query that shows the subdomains on Google.com.  Note the <em>&#038;num=100</em> on the end of the URL that tells Google to display 100 results:</p>
<p><a rel="nofollow" href="http://www.google.com/search?q=site%3Agoogle.com+-www.google.com&#038;num=100">http://www.google.com/search?q=site%3Agoogle.com+-www.google.com&#038;num=100</a></p>
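<p>If you audit many sites, the query URL can be built with a few lines of Python.  This is a small convenience sketch using the standard <code>urllib.parse</code> module.  The function names are made up for this example, and <code>distinct_hosts</code> assumes you have already copied the result URLs out of the SERPs yourself.</p>

```python
from urllib.parse import urlencode, urlsplit


def subdomain_query_url(domain, per_page=100):
    """Build the Google search URL for 'site:domain -www.domain'."""
    query = "site:%s -www.%s" % (domain, domain)
    return "http://www.google.com/search?" + urlencode({"q": query, "num": per_page})


def distinct_hosts(result_urls):
    """Return the sorted, de-duplicated hostnames from a list of result URLs."""
    hosts = {urlsplit(url).hostname for url in result_urls}
    hosts.discard(None)  # skip anything that did not parse as a full URL
    return sorted(hosts)


print(subdomain_query_url("google.com"))
# http://www.google.com/search?q=site%3Agoogle.com+-www.google.com&num=100

# Hypothetical URLs copied from a results page:
print(distinct_hosts([
    "http://dev.example.com/page-a",
    "http://blog.example.com/",
    "http://dev.example.com/page-b",
]))
# ['blog.example.com', 'dev.example.com']
```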

<p>Related posts:<ol><li><a href='http://pocketseo.com/site-analysis/142' rel='bookmark' title='Permanent Link: Introduction to Advanced SEO Site Analysis for Large Web Sites'>Introduction to Advanced SEO Site Analysis for Large Web Sites</a></li>
<li><a href='http://pocketseo.com/google/33' rel='bookmark' title='Permanent Link: Google Search URL Parameters'>Google Search URL Parameters</a></li>
<li><a href='http://pocketseo.com/site-analysis/153' rel='bookmark' title='Permanent Link: How to Spot the Ultimate Robots.txt Mistake'>How to Spot the Ultimate Robots.txt Mistake</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://pocketseo.com/site-analysis/20/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
