Wednesday, July 1, 2009

Site Architecture and SEO – file/page issues

Source: Bing.com: Search engine optimization (SEO) has three fundamental pillars upon which successful optimization campaigns are run. Like a three-legged stool, take one away, and the whole thing fails to work. The SEO pillars include: content (which we initially discussed in Are you content with your content?), links (which we covered in Links: the good, the bad, and the ugly, Part 1 and Part 2), and last but not least, site architecture. You can have great content and a plethora of high quality inbound links from authority sites, but if your site’s structure is flawed or broken, then it will still not achieve the optimal page rank you desire from search engines.

The search engine web crawler (also known as a robot or, more simply, a bot) is the key to website architecture issues (Bing uses MSNBot). Think of the bot as a headless web browser, one that does not display what it sees, but instead interprets the HTML code it finds on a webpage and sends the content it discovers back to the search engine database so that it can be analyzed and indexed. You can even equate the bot to a very simple user. If you target your site’s content to be readable by that simple user (serving as a lowest common denominator), then more sophisticated users (running browsers like Internet Explorer 8 or Firefox 3) will most certainly keep up. Using that analogy, doing SEO for the bot is very much a usability effort.

If you care about your website being found in search (and I presume you do if you’re reading this column!), you’ll want to help the crawler do its job. Or at the very minimum, you should remove any obstacles under your control that can get in its way. The more efficiently the search engine bot crawls your site, the higher the likelihood that more of its content that will end up in the index. And that, my friend, is how you show up in the search engine results pages (SERPs).

With site architecture issues for SEO, there’s a ton of material to cover. So much so, in fact, that I need to break up this subject into a multi-part series of blog posts. I’ve broken them down into subsets of issues that pertain to: HTML files (pages), URLs and links, and on-page content. I even plan a special post devoted solely to tag optimizations for SEO.

So let’s kick off this multi-part series of posts with a look at SEO site architecture issues and solutions related to files and pages.

Use descriptive file and directory names

Every time you can use descriptive text to help represent your content, the better off your site will be. This even goes for file and directory names. Besides being far more user friendly for end users to remember, the strategic use of keywords in file and directory names will further reinforce their relevance to those pages.

And while you’re examining the names of files and directories, avoid using underscores as word separators. Use hyphens instead. This syntax will help the bot to properly parse the long name you use into individual words instead of having it treated as the equivalent of a meaningless superlongkeyword.

Limit directory depth

Bots don’t crawl endlessly, searching every possible nook and cranny of every website (unless you are an important authority site, where it may probe deeper than usual). For the rest of us, though, creating a deep directory structure will likely mean the bot never gets to your deepest content. To alleviate this possibility, make your site’s directory structure shallow, no deeper than four child directories from the root.

Limit physical page file size

Keep your individual webpage files down under 150 KB each. Anything bigger than that and the bot may abandon the page after a partial crawl or skip crawling the page entirely.

Externalize on-page JavaScript and CSS code

If your pages use JavaScript and or Cascading Style Sheets (CSS), make sure that content is not inline within the HTML page. Search bots want to see the tag content as quickly as possible. If your pages are filled with script and CSS code, you run the risk of making the pages too long to be effectively crawled. In fact, ensure that the tag starts within the first 100 KB of the page’s source code; otherwise, the bot may not crawl the page at all.

Removing JavaScript and CSS code from your pages into external files offers additional advantages beyond just shortening your webpage files. By being external to the content they modify, they can be used by multiple pages simultaneously. Externalizing this content also simplifies code maintenance issues.

Follow these examples on how to reference external JavaScript and CSS code in your HTML pages.

A few notes to consider. External file references are not supported in really old browser versions, such as Netscape Navigator 2.x and Microsoft Internet Explorer 3.x. But if the users of such old browsers are not your target audience, the benefits of externalizing this code will far outweigh that potential audience loss. I also recommend storing your external code files separately from your HTML code, such as in /Scripts and /CSS directories. This helps keep website elements organized, and you can then easily use your robots.txt file to block bot access to all of your code files (after all, sometimes scripts handle business confidential data, so preventing the indexing of those files might be a wise idea!).

Use 301 redirects for moved pages

When you move your site to a new domain or change folder and/or file names within your site, don’t lose all of your previously earned site ranking “link juice.” Search engines are quite literal in that the same pages on different domains or the same content using different file names are regarded as duplicates. Search engines also attribute rank to pages. Search engines have no way of knowing when you intend new page URLs to be considered updates of your old page URLs. So what do you do? Use an automatic redirect to manage this for you.

Automatic redirects are set up on your web server. If you don’t have direct access to your web server, ask your administrators to set this up for you. Otherwise, you’ll need to do a bit of research. First you need to know which type of HTML redirect code you need. Unless your move is very temporary (in which case you’ll want to use a 302 redirect), use a 301 redirect for permanently moved pages. A 301 tells the search engine that the page has moved to a new location and that the new page is not a duplicate of the old page, but instead IS the old page at a new location. Thus when the bot attempts to crawl your old page location, it’ll be redirected to the new location, gather the new page’s content , and apply any and all changes made to the existing page rank standing.

To learn how to set this up, you’ll first need to know which web server software is running your site. Once you know that, click either Windows Server Internet Information Server (IIS) or Apache HTTP Server to learn how you can set up 301 redirects on your website.

Avoid JavaScript or meta refresh redirects

Technically you can also do page redirects with JavaScript or “refresh” tags. However, these are not recommended methods of accomplishing this task and still achieving optimal SEO results. These methods were highly abused in the past for hijacking users away from content that they wanted to web spam that they didn’t want. As a result, search engines take a dim view of these techniques for redirect. To do the job right, to preserve your link juice, and to continue your good standing with search engines, use 301 redirects instead.

Implement custom 404 pages

When a user makes a mistake then typing your URL into the address bar of their browser or an inbound link contains a typo, the typical website pops up a generic HTML 404 File Not Found error page. The most common end user response to that error message is to abandon that webpage. If that user had gone to your website and despite the error, you actually had the information they were seeking, that’s a lost business opportunity.

Instead of letting users go away thinking your site is broken, make an attempt to help them find what they want by showing a custom 404 page. Your page should look like the other page designs on your site, include an acknowledgment that the page the user was looking for doesn’t exist, and offer a link to your site’s home page and more importantly, access to either a site-wide search or an HTML-based sitemap page. At a minimum, make sure your site’s navigation tools are present, enabling the user to search for their content of interest before they leave.

Implementing a custom 404 page is dependent upon which web server you are using: For users of Windows Server IIS, check out the new Bing Web Page Error Toolkit. Otherwise, browse the 404 information for Apache HTTP Server.
Other crawler traps

The search engine bot doesn’t see the Web as do you and I. As such, there are several other page-related issues that can “trap” the bot, preventing it from seeing all of the content you intend to have indexed. For example, there are many page types that the bot doesn’t handle very well. If you use frames on your website (does anyone still use frames?), the bot will only see the frame page elements as individual pages. Thus, when it want to see how each page interrelates to other pages on your site, frame element pages are usually poor performers. This is because frame pages usually separate content from navigation. Thus content pages often become islands of isolated text that are not linked to directly by anything. And with no links to them, they might never get found. But even if the bot finds the frame’s navigation pane page, there’s no context to the links. This is pretty bad in terms of ranking in search engine relevance.

Other types of pages that can trip up search engine bots include forms (there’s typically no useful content on a form page) and authentication pages (bots can’t execute authentication schemes, so they are blocked from seeing all of the pages behind the authentication gateway). Pages that require either session IDs or cookies to be accessed are similar to authentication pages in that the bot’s inability to generate session IDs or accept cookies block them from accessing content requiring such tracking measures.

To keep the search engine bot from going places that might trip it up, see the following information about the “noindex” attribute to prevent indexing of whole pages.

We’re only getting started here on site architecture issues. There’s plenty more to come. If you have any questions, comments, or suggestions, feel free to post them in our SEM forum. See you soon…

– Rick DeJarnette, Bing Webmaster Center