I often hear SEOs and Web developers make broad statements about large Web sites after just a few minutes looking at them. The statement often goes something like this:
“This site looks well optimized and gets millions of visitors per month. Your title tags are optimized and you have meta tags. Just keep building more content for your users and search engines will love it. There is nothing more to do for the site’s SEO.”
There is a lot more to SEO than just optimizing title elements (not title tags) and meta tags, or randomly “adding more content”. There is almost always more SEO that can be done to a Web site, no matter how much traffic it already gets. This post lists some ways that one can start analyzing a large site and find issues that could be addressed to increase search engine referrals.
The site: Query
One of the first things to look at on a large Web site is the site: query in Google, Yahoo, and MSN. Just go to each of the three search engines and enter site:example.com and the search engines will return a list of indexed pages.
You can use Google’s URL parameters to switch to 100 results per page, which allows you to scan the URLs that have been indexed. Look for titles and text snippets (usually the meta description) that are identical across many pages. Also scan the indexed URLs for things like:
- URLs that should not be indexed
- URLs that were indexed over HTTPS
- URLs that have many parameters
- URLs that were indexed on the www subdomain like http://www.example.com/ as well as without a subdomain like http://example.com/.
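Once you have copied the indexed URLs into a list, part of this scan can be automated. The following is a minimal Python sketch, not a finished tool: the function name, the issue labels, and the threshold of three query parameters are all my own assumptions. It cannot tell you which URLs "should not be indexed"; that still takes a human looking at the list.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

MAX_PARAMS = 3  # assumed threshold for "many parameters"; tune per site


def scan_indexed_urls(urls, max_params=MAX_PARAMS):
    """Group indexed URLs by the issue they exhibit."""
    issues = defaultdict(list)
    hosts_by_page = defaultdict(set)

    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()

        if parts.scheme == "https":
            issues["https"].append(url)

        if len(parse_qsl(parts.query, keep_blank_values=True)) > max_params:
            issues["many_parameters"].append(url)

        # Strip a leading "www." so both host variants map to the same key.
        bare_host = host[4:] if host.startswith("www.") else host
        hosts_by_page[(bare_host, parts.path or "/", parts.query)].add(host)

    # A page seen under more than one host is indexed with and without www.
    for (bare_host, path, query), hosts in hosts_by_page.items():
        if len(hosts) > 1:
            suffix = "?" + query if query else ""
            issues["www_and_non_www"].append(bare_host + path + suffix)

    return dict(issues)
```

Calling scan_indexed_urls() with a list of URL strings returns a dictionary keyed by issue, which you can paste into a spreadsheet for review.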
Google will only show 1000 results. To show more than 1000 results you have to get more specific by querying directories. For example, to find all of the pages in a directory called products use this Google query: site:http://example.com/products/. You can leave the http:// part off, but be aware that a URL that starts with http:// is different than one that starts with https://. If you include the www Google will only show results from the www subdomain. If you leave the www off, you will get indexed pages from all subdomains.
Also run the site: query in Yahoo and MSN. Yahoo will often reveal strange indexed URLs like the ones created by ASP.NET Session Service. Those ASP.NET URLs look like this, with the part in parentheses being a kind of session ID that changes:
http://example.com/(lit3py55t21z5v55vlm25s55)/Application/SessionState.aspx
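If you have a long list of indexed URLs, a regular expression can pick these out. This is a rough Python sketch modeled only on the example URL above; it assumes the session ID is a run of letters and digits wrapped in parentheses, and the function names are made up for illustration. Other session URL formats may need a different pattern.

```python
import re

# A path segment wrapped in parentheses, e.g. "/(lit3py55t21z5v55vlm25s55)/".
# Assumption: the ID contains only letters and digits, as in the example URL.
SESSION_SEGMENT = re.compile(r"/\([A-Za-z0-9]+\)(?=/)")


def has_session_segment(url):
    """Return True if the URL contains a parenthesized session segment."""
    return SESSION_SEGMENT.search(url) is not None


def strip_session_segment(url):
    """Return the URL with the session segment removed."""
    return SESSION_SEGMENT.sub("", url)
```

Stripping the segment from every indexed URL and counting how many collapse to the same address gives a quick sense of how much duplication the session IDs are causing.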
It’s also useful to check how MSN Live Search has indexed your site because it can provide additional data about how your site is being crawled.
Find Subdomains
You should check for subdomains on the site to see if there is duplicate content or other unwanted sections of the site being indexed. Large Web sites often have multiple subdomains for load balancing or other reasons. One common situation is: www.example.com, www2.example.com, www3.example.com—all with the same content.
Here is an example from CNN.com:

I’ve marked the subdomains that contain duplicate content.
For example, the subdomain www-cgi.cnn.com has over 17,000 pages indexed and they look like duplicates of the pages on the main domain.
Notice that beta.cnn.com has also been indexed. It looks like someone tried to block that from robots with a robots.txt file that contains the rule Disallow: /beta (hint: that won’t work).
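The likely reason it won't work: robots.txt rules match URL paths on the host that serves the file, so Disallow: /beta covers a /beta directory, not a beta subdomain. Here is a small sketch using Python's standard robots.txt parser to illustrate; the example.com hostnames are placeholders, and the parser only looks at the path portion of each URL.

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /beta",
])

# A path that starts with /beta is blocked...
blocked_path = rules.can_fetch("*", "http://www.example.com/beta/page.html")

# ...but the home page of a "beta" subdomain has the path "/", so it is allowed.
beta_home = rules.can_fetch("*", "http://beta.example.com/")
```

Here blocked_path comes out False and beta_home comes out True. To keep a subdomain out of the index, the subdomain needs to serve its own robots.txt.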
Off-Domain Duplicate Content
In addition to duplicate content on subdomains, you may also find the site duplicated on other domains. Often people will say things like, “I want my site to be accessible at example.com, example_1.com, and example_2.com”. They end up with their entire site duplicated across other domains.
You can search for these duplicate domains by copying a sentence of text (or a unique string of words) from the Web site and searching Google with it. Put the sentence in quotes to return only exact matches. If the content appears on other sites, those sites will probably turn up in a Google search.
You can also often find clients’ other domain names by using the ip: query in MSN Live Search. Use the Show IP Firefox Extension to find the IP address of the client’s Web site. Then go to MSN Live Search and type in:
ip:nnn.nnn.nnn.nnn (replace the letters “n” with the IP address)
MSN will return a list of other sites at the same IP address.
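If you would rather not install a browser extension, a few lines of Python can look up the address and build the query for you. This is only a sketch: the function name is hypothetical, it returns a single IPv4 address, and a large site may sit behind several addresses.

```python
import socket


def ip_query(hostname):
    """Resolve a hostname to an IPv4 address and return the ip: query string."""
    address = socket.gethostbyname(hostname)
    return "ip:" + address
```

For example, ip_query("www.example.com") returns a string you can paste straight into the search box.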
Also, ask the client, “Do you have any other domain names that point to the same server/content?”
Duplicate Content From Referrers
URLs with referrer information (e.g., affiliate program tracking) can create multiple URLs for a single page of content.
Here is another example from CNN.com showing that the home page is indexed with many different URLs:

You can find these in Google with a query like:
intitle:"some text from the page's title element"
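Once you have a list of the indexed URLs, you can group the ones that differ only by tracking parameters. The sketch below is one way to do that in Python. The parameter names in TRACKING_PARAMS are hypothetical; replace them with whatever the site's affiliate or referrer tracking actually uses.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical tracking parameter names; replace with the ones the site uses.
TRACKING_PARAMS = {"ref", "affid", "source"}


def canonical_url(url, tracking_params=TRACKING_PARAMS):
    """Return the URL with tracking parameters and fragment removed."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in tracking_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/",
                       urlencode(kept), ""))


def group_duplicates(urls):
    """Map each canonical URL to its indexed variants, keeping only duplicates."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_url(url)].append(url)
    return {page: variants for page, variants in groups.items()
            if len(variants) > 1}
```

Any page that shows up in the result of group_duplicates() is indexed under more than one URL.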
Site Spider
It is useful to locate URLs that return 302, 403, 404, or other undesired status codes. It’s also good to locate URLs that don’t exist but still return 200 OK.
I used to write short custom scripts to check response headers on sites, but now I usually use the free AuditMyPC.com Sitemap Generator which can generate a useful spreadsheet of HTTP response headers.
If you are using the AuditMyPC.com tool, be sure to respect robots.txt and rel=nofollow as shown below:

You can also set a crawl delay so that the spider doesn’t put too much load on the site’s servers.
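For anyone who prefers a script, here is a minimal Python sketch of a response code check. It is not a spider: it checks one URL at a time, does not read robots.txt, and has no crawl delay, so add those before pointing it at a real site. The function names, the User-Agent string, and the set of "undesired" codes are my own assumptions.

```python
import http.client
from urllib.parse import urlsplit

UNDESIRED = {302, 403, 404}  # codes named in the text; extend as needed


def fetch_status(url, timeout=10):
    """Send a HEAD request without following redirects.

    Returns (status code, Location header or None).
    """
    parts = urlsplit(url)
    conn_class = (http.client.HTTPSConnection if parts.scheme == "https"
                  else http.client.HTTPConnection)
    conn = conn_class(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path, headers={"User-Agent": "site-audit-sketch"})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def classify(status, should_exist=True):
    """Label a status code as "ok", "undesired", or "soft-404"."""
    if not should_exist:
        # A made-up URL should return 404 (or 410); 200 means a "soft 404".
        if status in (404, 410):
            return "ok"
        return "soft-404" if status == 200 else "undesired"
    if status in UNDESIRED or status >= 500:
        return "undesired"
    return "ok"
```

To test for pages that don't exist but send 200 OK, call fetch_status() on a URL you invented and pass the status to classify() with should_exist=False. Some servers answer HEAD requests differently from GET requests, so treat surprising results with caution.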
Analyzing the Raw Logs
Grepping your raw log files can turn up a lot of useful information about how spiders are perceiving your site. Basically, you want to extract all of the hits that were made by Google’s, Yahoo’s, and MSN’s spiders, and then see what kind of response codes were sent.
This is useful for finding URLs that search engines are requesting but not indexing, or large numbers of internal 302 redirects that you might have missed.
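If grep is not handy, the same extraction can be sketched in Python. This example assumes the logs are in the common "combined" format and that each spider can be recognized by a substring in its user-agent string; the sample lines are made up for illustration. User-agent strings can be faked, so the counts are only a starting point.

```python
import re
from collections import Counter

# Combined log format: ... "METHOD path PROTO" status size "referrer" "agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings assumed to identify each spider's user-agent string.
SPIDERS = {"Google": "Googlebot", "Yahoo": "Slurp", "MSN": "msnbot"}


def spider_status_counts(lines):
    """Count hits per (search engine, response code) for spider requests."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match["agent"].lower()
        for engine, token in SPIDERS.items():
            if token.lower() in agent:
                counts[(engine, int(match["status"]))] += 1
                break
    return counts


# Made-up sample lines for illustration only.
SAMPLE = [
    '192.0.2.1 - - [10/Oct/2007:13:55:36 -0700] "GET /products/ HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.2 - - [10/Oct/2007:13:56:01 -0700] "GET /old-page HTTP/1.1" '
    '302 0 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '192.0.2.3 - - [10/Oct/2007:13:57:12 -0700] "GET /missing HTTP/1.0" '
    '404 512 "-" "msnbot/1.0"',
    '192.0.2.4 - - [10/Oct/2007:13:58:40 -0700] "GET / HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (Windows; U) Firefox/2.0"',
]
```

Running spider_status_counts() over a real log file (pass it the open file object) shows at a glance how many 302s and 404s each search engine is receiving.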
Site Architecture
This is a large topic, so this is just a quick overview of where you might start:
Check each page that is linked to from the home page. Are they category pages? Do the category pages link to the pages that will bring maximum SEO benefit? Is the site keeping the important content within a few clicks of the home page?
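If you have crawl data that lists which pages link to which, the last question can be answered with a breadth-first search from the home page. The sketch below is one way to do it in Python; the link graph format, the function names, and the limit of three clicks are assumptions for illustration.

```python
from collections import deque


def click_depths(links, home="/"):
    """Return the minimum number of clicks from the home page to each page.

    links maps each page to the list of pages it links to.
    """
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def too_deep(links, home="/", max_clicks=3):
    """List reachable pages that are more than max_clicks from the home page."""
    depths = click_depths(links, home)
    return sorted(page for page, depth in depths.items() if depth > max_clicks)


# Hypothetical link graph for illustration only.
SITE = {
    "/": ["/category-a", "/category-b"],
    "/category-a": ["/product-1"],
    "/product-1": ["/product-1/reviews"],
    "/product-1/reviews": ["/product-1/reviews/page-2"],
}
```

Pages that never appear in the result of click_depths() are not reachable from the home page at all, which is worth investigating separately.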
Piece by Piece
Attack the site’s SEO issues in smaller chunks. Optimize smaller sections of the site at a time and observe how the search engines respond. Then move on to the next SEO issue.
Conclusion: SEO for Large Web Sites
The techniques above are just a few places to start when analyzing a large Web site. Even those few analysis methods should give you lots of SEO material to work on that goes beyond the obvious basics like page titles, meta tags, and alt attributes.
