A couple of months ago, I integrated Omniture SiteCatalyst into an Interchange site for one of End Point's clients, CityPass. Shortly after, the client added a blog to their site, which is a standalone WordPress instance that runs separately from the Interchange ecommerce application. I was asked to add SiteCatalyst tracking to the blog.
I've had some experience with WordPress plugin development, and I thought this was a great opportunity to develop a plugin to abstract the SiteCatalyst code from the WordPress theme. I was surprised that there were limited Omniture WordPress plugins available, so I'd like to share my experiences through a brief tutorial for building a WordPress plugin to integrate Omniture SiteCatalyst.
First, I created the base WordPress plugin file to append the code near the footer of the WordPress theme. This file must live in the ~/wp-content/plugins/ directory. I named the file omniture.php.
<?php /*
Plugin Name: SiteCatalyst for WordPress
Plugin URI: http://www.endpoint.com/
Version: 1.0
Author: Steph Powell
*/
function omniture_tag() {
}
add_action('wp_footer', 'omniture_tag');
?>
In the code above, the wp_footer is a specific WordPress hook that runs just before the </body> tag. Next, I added the base Omniture code inside the omniture_tag function:
...
function omniture_tag() {
?>
<script type="text/javascript">
var s_account = 'omniture_account_id';
</script>
<script type="text/javascript" src="/path/to/s_code.js"></script>
<script type="text/javascript"><!--
s.pageName='' //page name
s.channel='' //channel
s.pageType='' //page type
s.prop1='' //traffic variable 1
s.prop2='' //traffic variable 2
s.prop3='' //traffic variable 3
s.prop4= '' //traffic variable 4
s.prop5= '' //traffic variable 5
s.campaign= '' //campaign variable
s.state= '' //user state
s.zip= '' //user zip
s.events= '' //user events
s.products= '' //user products
s.purchaseID= '' //purchase ID
s.eVar1= '' //conversion variable 1
s.eVar2= '' //conversion variable 2
s.eVar3= '' //conversion variable 3
s.eVar4= '' //conversion variable 4
s.eVar5= '' //conversion variable 5
/************* DO NOT ALTER ANYTHING BELOW THIS LINE ! **************/
var s_code=s.t();if(s_code)document.write(s_code)
--></script>
<?php
}
...
To test the footer hook, I activated the plugin in the WordPress admin. A blog refresh should yield the Omniture code (with no variables defined) near the </body> tag of the source code.
After verifying that the code was correctly appended near the footer in the source code, I determined how to track the WordPress traffic in SiteCatalyst. For our client, the traffic was to be divided into the home page, static pages, articles, tag pages, category pages, and archive pages. The Omniture variables pageName, channel, pageType, prop1, prop2, and prop3 were modified to track these pages. The existing WordPress functions is_home, is_page, is_single, is_category, is_tag, is_month, the_title, get_the_category, single_cat_title, single_tag_title, and the_date were used.
...
<script type="text/javascript"><!--
<?php
if(is_home()) { //WordPress functionality to check if page is home page
$pageName = $channel = $pageType = $prop1 = 'Blog Home';
} elseif (is_page()) { //WordPress functionality to check if page is static page
$pageName = $channel = the_title('', '', false);
$pageType = $prop1 = 'Static Page';
} elseif (is_single()) { //WordPress functionality to check if page is article
$categories = get_the_category();
$pageName = $prop2 = the_title('', '', false);
$channel = $categories[0]->name;
$pageType = $prop1 = 'Article';
} elseif (is_category()) { //WordPress functionality to check if page is category page
$pageName = $channel = single_cat_title('', false);
$pageName = 'Category: ' . $pageName;
$pageType = $prop1 = 'Category';
} elseif (is_tag()) { //WordPress functionality to check if page is tag page
$pageName = $channel = single_tag_title('', false);
$pageType = $prop1 = 'Tag';
} elseif (is_month()) { //WordPress functionality to check if page is month page
list($month, $year) = explode(' ', the_date('F Y', '', '', false));
$pageName = 'Month Archive: ' . $month . ' ' . $year;
$channel = $pageType = $prop1 = 'Month Archive';
$prop2 = $year;
$prop3 = $month;
}
echo "s.pageName = '$pageName' //page name\n";
echo "s.channel = '$channel' //channel\n";
echo "s.pageType = '$pageType' //page type\n";
echo "s.prop1 = '$prop1' //traffic variable 1\n";
echo "s.prop2 = '$prop2' //traffic variable 2\n";
echo "s.prop3 = '$prop3' //traffic variable 3\n";
?>
s.prop4 = '' //traffic variable 4
...
The plugin allows you to freely switch between WordPress themes without having to manage the SiteCatalyst code and to track the basic WordPress page hierarchy. Here are example outputs of the SiteCatalyst variables broken down by page type:
Homepage
s.pageName = 'Blog Home' //page name
s.channel = 'Blog Home' //channel
s.pageType = 'Blog Home' //page type
s.prop1 = 'Blog Home' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3
Tag Page
s.pageName = 'chocolate' //page name
s.channel = 'chocolate' //channel
s.pageType = 'Tag' //page type
s.prop1 = 'Tag' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3
Category Page
s.pageName = 'Category: Food' //page name
s.channel = 'Food' //channel
s.pageType = 'Category' //page type
s.prop1 = 'Category' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3
Static Page
s.pageName = 'About' //page name
s.channel = 'About' //channel
s.pageType = 'Static Page' //page type
s.prop1 = 'Static Page' //traffic variable 1
s.prop2 = '' //traffic variable 2
s.prop3 = '' //traffic variable 3
Archive
s.pageName = 'Month Archive: November 2009' //page name
s.channel = 'Month Archive' //channel
s.pageType = 'Month Archive' //page type
s.prop1 = 'Month Archive' //traffic variable 1
s.prop2 = '2009' //traffic variable 2
s.prop3 = 'November' //traffic variable 3
Article
s.pageName = 'Hello world!' //page name
s.channel = 'Test Category' //channel
s.pageType = 'Article' //page type
s.prop1 = 'Article' //traffic variable 1
s.prop2 = 'Hello world!' //traffic variable 2
s.prop3 = '' //traffic variable 3
A follow-up step for this plugin would be to use the wp_options table in WordPress to manage the Omniture account ID, which would allow an admin to set the account ID through the WordPress admin without editing the plugin code. I've uploaded the plugin to a GitHub repository here.
Learn more about End Point's analytics services.
Comments
I'm back at work after last week's PubCon Vegas. I published several articles about specific sessions, but I wanted to provide some nuggets on recurring themes of the conference.
Google Caffeine Update
This year Google rolled out some changes referred to as the Google Caffeine update. This change increases the speed and size of the index, moves Google search to real-time, and improves search results relevancy and accuracy. It was a popular topic at the conference; however, not much light was shed on how algorithm changes would affect your search results, if at all. I'll have to keep an eye on this to see if there are any significant changes in End Point's search performance.
Bing
Bing is gaining traction. They want to get [at least] 51% of the search market share.
Social media
Social media was a hot topic at the conference. An entire track was allocated to Twitter topics on the first day of the conference. However, it still pales in comparison to search. Of all referrals on the web, search still accounts for 98%, and social media referrals account for less than 1% (view referral data here). Dr. Pete from SEOMoz nicely summarized the elephant in the room at PubCon regarding social media: it's important to measure social media response to determine if it provides business value.
Ecommerce Advice
I asked Rob Snell, author of Starting a Yahoo Business for Dummies, for the most important advice for ecommerce SEO he could provide. He explained the importance of content development and link building to target keywords based on keyword conversion. Basically, SEO efforts shouldn't be wasted on keywords that don't convert well. I typically don't have access to client keyword conversion data, but this is great advice.
Internal SEO Processes
Another recurring topic I observed at PubCon was that internal SEO processes are often a much bigger obstacle than the actual SEO work. It's important to get the entire team on your side. Alex Bennert of the Wall Street Journal discussed understanding your audience when presenting SEO. Here are some examples of appropriate topics for a given audience:
- IT Folks: sitemaps, duplicate content (parameter issues, pagination, sorting, crawl allocation, dev servers), canonical link elements, 301 redirects, intuitive link structure
- Biz Dev & Marketing Folks: syndication of content, evaluation of vendor products & integration, assessing SEO value and link equity of partner sites, microsites, leveraging multiple assets
- Content Developers: on page elements best practices, linking, anchor text best practices, keyword research, keyword trends, analytics
- Management: progress, timelines, roadmaps
On the topic of internal processes, I was entertained by the various comments expressing the developer-marketer relationship, for example:
- "Don't ever let a developer control your URL structure."
- "Don't ever let a developer control your site architecture."
- "This site looks like it was designed by a developer."
Apparently developers are the most obvious scapegoat. Back to the point, though: it often requires more effort to get SEO understanding and support than to actually explain what needs to be done.
Search Engine Spam
Search engine spam detection is cool. During a couple of sessions with Matt Cutts, I became interested in writing code to detect search spam. For example:
- Crawling the web to detect links where the anchor text is '.'.
- Crawling the web to identify sites where robots.txt blocks ia_archiver.
- Crawling the web to detect pages with keyword stuffing.
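The first two checks above can be sketched in a few lines of Ruby. This is only a rough illustration that works on a page's HTML and a robots.txt body that have already been fetched; the method names dot_anchor_links and blocks_ia_archiver? are hypothetical, the regular expression is a simplification rather than a real HTML parser, and the robots.txt handling only looks for a blanket "Disallow: /" rule.

```ruby
# Hypothetical helpers for illustration; names and heuristics are assumptions.

# Returns the href of every link whose anchor text is just ".".
# A regex is a simplification here; a real crawler would use an HTML parser.
def dot_anchor_links(html)
  html.scan(/<a\b[^>]*href=["']([^"']+)["'][^>]*>\s*\.\s*<\/a>/i).flatten
end

# Returns true if a robots.txt group naming ia_archiver contains "Disallow: /".
def blocks_ia_archiver?(robots_txt)
  agents = []       # user agents named by the current group
  in_rules = false  # true once the current group has started listing rules
  robots_txt.each_line do |line|
    # Drop comments, then split "Field: value" on the first colon only.
    field, value = line.sub(/#.*/, '').split(':', 2).map(&:strip)
    next if value.nil?
    case field.downcase
    when 'user-agent'
      agents = [] if in_rules  # a User-agent line after rules starts a new group
      in_rules = false
      agents << value.downcase
    when 'disallow'
      in_rules = true
      return true if value == '/' && agents.include?('ia_archiver')
    when 'allow'
      in_rules = true
    end
  end
  false
end
```

Keyword stuffing detection is left out because it needs a scoring heuristic that the sessions did not spell out.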
I've typically been involved in the technical side of SEO (duplicate content, indexation, crawlability), and haven't been involved in link building or content development, but these discussions provoked me to start looking at search spam from an engineer's perspective.
Google Parameter Masking
Apparently I missed the announcement of parameter masking in Google Webmaster Tools. I've helped battle duplicate content for several clients, and at PubCon I heard about parameter masking provided in Google Webmaster Tools. This functionality was announced in October of 2009 and allows you to provide suggestions to the crawler to ignore specific query parameters.
Parameter masking is yet another solution to managing duplicate content in addition to the rel="canonical" tag, creative uses of robots.txt, and the nofollow tag. The ideal solution for SEO would be to build a site architecture that doesn't require the use of any of these solutions. However, as developers we have all experienced how legacy code persists and sometimes a low effort-high return solution is the best short term option.
Learn more about End Point's technical SEO services.
Comments
On day 3 of PubCon Vegas, a great session I attended was Optimizing Forums For Search & Dealing with User Generated Content with Dustin Woodard, Lawrence Coburn, and Roger Dooley. User generated content is content generated by users in the form of message boards, customizable profiles, forums, reviews, wikis, blogs, article submission, question and answer, video media, or social networks.
Some good statistics were presented about why to tap into user generated content. Nielsen research recently released showed that 1 out of every 11 minutes spent online is on a social network and 2/3rds of customer "touch points" are user-generated.
Dustin provided some interesting details about long tail traffic. He looked at HitWise's data of the top 10,000 search terms for a 3 month period. The top 100 terms accounted for 5.7% of all traffic, the top 1000 terms accounted for 10.6% of all traffic, and the entire 10,000 data set accounted for just 18.5% of all traffic. Drawn to scale, this data would look like a lizard with a one inch head and a 221 mile long tail, where the tail represents the long tail traffic.
Dustin gave the following steps for developing a user generated content community:
- Seed it with a few editors and really good initial content.
- Give them a voice.
- Make it easy to contribute.
- Make it cool or trendy.
- Provide ownership.
- Create competition with contests, ranking or by highlighting expertise.
- Build a sense of community or a sense of exclusivity.
- Give the community a purpose.
All SEO best practices apply to user generated content, but throughout the session, I learned several tips specific to user generated content:
- Predefining keyword rich categories, topics and tags will go a long way with optimization. The better the topic structure that is created up front, the better the user generated content can perform in the long run. Users are not inherently good at content organization, so content can be easily buried with poor information architecture.
- Developing automated cross-linking between user generated content helps improve authority, build clusters of content, and enrich the internal link structure. Dustin had experience with building a widget to automatically link to 5 pieces of user generated content and another widget to allow the user to select several pieces of user generated content from a set of related content.
- Examples of battling duplicate content include disallowing duplicate page titles and meta descriptions. Content that is moved, renamed or deleted should be managed well.
- Building a badge or widget to display user involvement helps increase external linking to your site, but this should be carefully managed to avoid appearing spammy. As best practices, widgets should have excellent accessibility, be simple with light branding, and always have fresh content.
- Developing your own tiny URL helps pass and keep intact external links to your site with user generated content. Lawrence suggested "gently tweeting" the highest quality user generated content.
Several of End Point's clients are either in the middle of or considering building a community with user generated content. In ecommerce, blogs, forums, reviews, and Q&A are the most prevalent types of user generated content that I've encountered. Many of the things mentioned in this session were good tips to consider throughout the development of user generated content for ecommerce.
Learn more about End Point's technical SEO services.
Comments
On the second day of PubCon Vegas, I attended several SEO track sessions including "SEO for Ecommerce", "International and European Site Optimization", "Mega Site SEO", and "SEO/SEM Tools". A mini-summary of several of the sessions is presented below.
Derrick Wheeler from Microsoft.com spoke on Mega Site SEO about "taming the beast". Microsoft has 1.2 billion URLs spread across thousands of web properties. For mega site SEO, Derrick highlighted:
- Content is NOT king. Structure is! Content is like the princess-in-waiting after structure has been mastered.
- Developing an overall SEO approach and organization for getting structure, content, and authority SEO completed is more valuable than the actual SEO work. This was a common theme among many of the presentations at PubCon.
- Getting metrics set up at the beginning of SEO work is a very important step to measure and justify progress.
- Don't be afraid to say no to low priority items.
Most developers deal with a large amount of legacy code. Derrick discussed primary issues when working with legacy problems:
- Duplicate and undesirable pages. For Microsoft.com, managing and dealing with 1.2 billion pages results in a lot of duplicate and undesirable pages from the past.
- Multiple redirects.
- Improper error handling (error handling on 404s or 500s).
- International URL structure can be a problem for international sites. Having an appropriate TLD (top level domain) is the best solution, but if that's not possible, a process should be implemented to regulate the international URLs.
- Low Quality Page Titles and Meta Tags. For large sites with hundreds of thousands of pages, it's really important to have unique page titles and meta descriptions or to have a template that forces uniqueness.
In summary, structure and internal processes are areas to focus on for Mega Site SEO. Legacy problems are something to be aware of when a site is so large that changes can't be implemented as quickly as on a small site.
In International and European Search Management, Michael Bonfils, Nelson James, and Andy Atkins-Krueger discussed international SEO and SEM tactics. Takeaways include:
- In terms of international search marketing, it's important to incorporate culture into search optimization and marketing. What works in one country may not work in another, so take care not to offend a culture by failing to understand it. Some examples of content differences for targeting different cultures include emphasizing price points, focusing on product quality, and asserting authority or trust on a site.
- It's also important to understand how linguistics affects your keyword marketing. Automatic translation should not be used (all the speakers mentioned this). A good example of linguistics and search targeting is the use of the search term "soccer cleats", or "football boots". In England, the term "football boot" has a very small portion of the traffic share, but singular terms in other languages ("scarpe de calcio", "botas de futbal") have a much larger percentage of the search market share. Andy shared many other examples of how direct translation would not be the best keywords to target ("car insurance", "healthcare", "30% off", "cheap flights").
- Local hosting is important for metrics, linking, and developing trust. Nelson James shared research showing that 80% of the top 10 results for the top 30 keywords in China had a '.cn' top level domain, and the remaining top sites with '.com' domains were all hosted in China.
- Other technical areas for international search that were mentioned are using the meta language tag, pinyin, charset, and language set. Duplicate content also will become a problem across sites of the same language.
- It's important to understand the search market share. In Russia, Google has 35% of the search market and Yandex has 54%. In China, Baidu has 76% and Google has 22%. There are some reasons that explain these market share differences. Yandex was written to manage the large Russian vocabulary that Google does not handle as well. Baidu handles search for media better than Google, and search traffic in China is much more entertainment driven, rather than business driven as in the US.
In the last session of the day, about 100 tools were discussed in SEO/SEM Tools. I'm planning on writing another blog post with a summary of these tools, but here's a short list of the tools mentioned by multiple speakers:
- SEMRush
- Google: Keyword Ad Tool, Webmaster Tools, Adplanner, SocialGraph API, Google Trends, Analytics, Google Insights
- SpyFu: Kombat, Domain Ad History, Smart Search, Keyword Ad History
- SEOBook
- SEOMoz: Linkscape, Mozbar, Top Pages, etc.
- MajesticSEO
- Raven SEO Tools: Website Analytics, Campaign Reports
Stay tuned for a day 3 and wrap up article!
Learn more about End Point's technical SEO services.
Comments
Some years ago Davor Ocelić redesigned icdevgroup.org, Interchange's home on the web. Since then, most of the attention paid to it has been on content such as news, documentation, release information, and so on. We haven't looked much at implementation or optimization details. Recently I decided to do just that.
Interchange optimizations
There is currently no separate logged-in user area of icdevgroup.org, so Interchange is primarily used here as a templating system and database interface. The automatic read/write of a server-side user session is thus unneeded overhead, as is periodic culling of the old sessions. So I turned off permanent sessions by making all visitors appear to be search bots. Adding to interchange.cfg:
RobotUA *
That would not work for most Interchange sites, which need a server-side session for storing mv_click action code, scratch variables, logged-in state, shopping cart, etc. But for a read-only content site, it works well.
By default, Interchange writes user page requests to a special tracking log as part of its UserTrack facility. It also outputs an X-Track HTTP response header with some information about the visit which can be used by a (to my knowledge) long defunct analytics package. Since we don't need either of those features, we can save a tiny bit of overhead. Adding to catalog.cfg:
UserTrack No
Very few Interchange sites have any need for UserTrack anymore, so this is commonly a safe optimization to make.
HTTP optimizations
Today I ran the excellent webpagetest.org test, and this was the icdevgroup.org test result. Even though icdevgroup.org is a fairly simple site without much bloat, two obvious areas for improvement stood out.
First, gzip/deflate compression of textual content should be enabled. That cuts down on bandwidth used and page delivery time by a significant amount, and with modern CPUs adds no appreciable extra CPU load on either the client or the server.
We're hosting icdevgroup.org on Debian GNU/Linux with Apache 2.2, which has a reasonable default configuration of mod_deflate that does this, so it's easy to enable:
a2enmod deflate
That sets up symbolic links in /etc/apache2/mods-enabled for deflate.load and deflate.conf to enable mod_deflate. (Use a2dismod to remove them if needed.)
I added two content types for CSS & JavaScript to the default in deflate.conf:
AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css application/x-javascript
That used to be riskier when very old browsers such as Netscape 3 and 4 claimed to support compressed CSS & JavaScript but actually didn't. But those browsers are long gone.
The next easy optimization is to enable proxy and browser caching of static content: images, CSS, and JavaScript files. By doing this we eliminate all HTTP requests for these files; the browser won't even check with the server to see if it has the current version of these files once it has loaded them into its cache, making subsequent use of those files blazingly fast.
There is, of course, a tradeoff to this. Once the browser has the file cached, you can't make it fetch a newer version unless you change the filename. So we'll set a cache lifetime of only one hour. That's long enough to easily cover most users' browsing sessions at a site like this, but short enough that if we need to publish a new version of one of these files, it will still propagate fairly quickly.
So I added to the Apache configuration file for this virtual host:
ExpiresActive On
ExpiresByType image/gif "access plus 1 hour"
ExpiresByType image/jpeg "access plus 1 hour"
ExpiresByType image/png "access plus 1 hour"
ExpiresByType text/css "access plus 1 hour"
ExpiresByType application/x-javascript "access plus 1 hour"
FileETag None
Header unset ETag
This adds the HTTP response header "Cache-Control: max-age=3600" for those static files. I also have Apache remove the ETag header, which is not needed given this caching and the Last-Modified header.
There are cases where the above configuration would be too broad, for example, if you have:
- images that differ with the same filename, such as CAPTCHAs
- static files that vary based on logged-in state
- dynamically-generated CSS or JavaScript files with the same name
If the website is completely static, including the HTML, or identical for all users at the same time even though dynamically generated, we could also enable caching the HTML pages themselves. But in the case of icdevgroup.org, that would probably cause trouble with the Gitweb repository browser, live documentation searches, etc.
After those changes, we can see the results of a new webpagetest.org run and see that we reduced the bytes transferred, and the delivery time. It's especially dramatic to see how much faster subsequent page views of the Hall of Fame are, since it has many screenshot thumbnail images.
Optimizing a simple non-commerce site such as icdevgroup.org is easy and even fun. With caution and practicing on a non-production system, complex ecommerce sites can be optimized using the same techniques, with even more dramatic benefits.
Comments
I had a flash of inspiration to write an article about external links in the world of search engine optimization. I've created many SEO reports for End Point's clients with an emphasis on technical aspects of search engine optimization. However, at the end of the SEO report, I always like to point out that search engine performance is dependent on having high quality fresh and relevant content and popularity (for example, PageRank). The number of external links to a site is a large factor in popularity of a site, and so the number of external links to a site can positively influence search engine performance.
After wrapping up a report yesterday, I wondered if the external link data that I provide to our clients is meaningful to them. What is the average response when I report, "You should get high quality external links from many diverse domains"?
So, I investigated some data of well known and less well known sites to display a spectrum of external link and PageRank data. Here is the origin of some of the less well known domains referenced in the data below:
And here is the data:
I retrieved the PageRank from a generic PageRank tool. SEOMoz.org was used to collect external link counts and external linking subdomains. Finally, Yahoo Site Explorer was used to retrieve external link counts to the domain in question. I chose to examine external link counts from both SEOMoz and Yahoo Site Explorer to get a better representation of the data. SEOMoz compiles their data about once a month and does not have as many URLs indexed as Yahoo, which explains why their numbers may lag behind the Yahoo Site Explorer external link counts.
Out of curiosity, I went on to plot the PageRank data vs. the log (base 10) of the other data.
PageRank vs Log of SEOMoz external link count
PageRank vs Log of SEOMoz external linking subdomain count
PageRank vs Log of Yahoo SiteExplorer external link count
PageRank is described as a theoretical probability value on a logarithmic scale and it's based on inbound links, PageRank of inbound links, and other factors such as Google visit data, search click-through rates, etc. The true popularity rank is a rank between 1 and X, where X is equal to the total number of webpages crawled by search engine A. After pages are individually ranked between 1 and X, they are scaled logarithmically between 0 and 10.
The takeaway from this data is when an "SEO report" gives advice to "get more external links", it means:
- If your site has a PageRank of < 4, getting external links on the scale of hundreds may impact your existing PageRank or popularity
- If your site has a PageRank of >= 4 and < 6, getting external links on the scale of thousands may impact your existing PageRank or popularity
- If your site has a PageRank of >= 6 and < 8, getting external links on the scale of tens to hundreds of thousands may impact your existing PageRank or popularity
- If your site has a PageRank of >= 8, you probably are already doing something right...
Furthermore, even if a site improves external link counts, other factors will play into the PageRank algorithm. Additionally, keyword relevance and popularity play key roles in search engine results.
Comments
I was recently tasked with implementing site search using a commercially available site search application for one of our clients (Gear.com). The basic implementation requires that a SOAP request be made and the XML data returned be parsed for display. The SOAP request contains basic search information, and additional information such as product pagination and sort by parameters. During the implementation in a Rails application, I applied a few unique solutions worthy of a blog article :)
The first requirement I tackled was to design the web application in a way that produced search engine friendly canonical URLs. I used Rails routing to implement a basic search:
map.connect ':id', :controller => 'basic', :action => 'search'
Any simple search path would be sent to the basic search query that performed the SOAP request followed by XML data parsing. For example, http://www.gear.com/s/climb is a search for "climb" and http://www.gear.com/s/bike is a search for "bike".
After the initial search, a user can refine the search by brand, merchant, category or price, or choose to sort the items, select a different page, or modify the number of items per page. I chose to force the order of refinement, for example, brand and merchant order were constrained with the following Rails routes:
map.connect ':id/brand/:rbrand', :controller => 'basic', :action => 'search'
map.connect ':id/merch/:rmerch', :controller => 'basic', :action => 'search'
map.connect ':id/brand/:rbrand/merch/:rmerch', :controller => 'basic', :action => 'search'
Rather than allow refinement parameters to appear in any order in the URLs, such as http://www.gear.com/s/climb/brand/Arcteryx/merch/Altrec and http://www.gear.com/s/climb/merch/Altrec/brand/Arcteryx, the order of search refinement is always limited to the Rails routes specified above, so only the former URL is allowed in this example.
For example, http://www.gear.com/s/climb/brand/Arcteryx/merch/Altrec is a valid URL for Arcteryx Altrec climb, http://www.gear.com/s/climb/brand/Arcteryx for Arcteryx climb, and http://www.gear.com/s/climb/merch/Altrec for Altrec climb.
All URLs on any given search result page are built with a single Ruby method to force the refinement and parameter order. The method input requires the existing refinement values, the new refinement key, and the new refinement value. The method builds a URL with all previously existing refinement values and adds the new refinement value. Rather than generating millions of URLs with the various refinement combinations of brand, merchant, category, price, items per page, pagination number, and sort method, this logic minimizes duplicate content. The use of Rails routes and the chosen URL structure also creates search engine friendly URLs that can be targeted for traffic. Below is example pseudocode with the URL-building method:
def build_url(parameters, new_key, new_value)
# set url to basic search information
# append brand info to url if parameters[:brand] exists or if new_key is brand
# append merchant info to url if parameters[:merchant] exists or if new_key is merchant
# append category info to url if parameters[:cat] exists or if new_key is cat
# ...
end
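For readers who want something runnable, here is a minimal sketch of how such a method might look. It is not the code used on Gear.com: the REFINEMENT_SEGMENTS constant, the "cat" URL segment, the "/s/" prefix handling, and the optional arguments are all assumptions made for illustration, and only the brand, merchant, and category refinements from the pseudocode are covered.

```ruby
require 'erb'

# Hypothetical fixed ordering of refinement keys and their URL segments.
# "brand" and "merch" come from the routes above; "cat" is an assumption.
REFINEMENT_SEGMENTS = { brand: 'brand', merchant: 'merch', cat: 'cat' }.freeze

# Builds a canonical search URL from the existing refinements plus one
# optional new refinement, always emitting segments in the same order.
def build_url(parameters, new_key = nil, new_value = nil)
  refinements = parameters.dup
  refinements[new_key] = new_value if new_key

  # Start with the basic search information (the :id route parameter).
  url = "/s/#{ERB::Util.url_encode(refinements[:id].to_s)}"

  # Append each refinement that is present, in the fixed order.
  REFINEMENT_SEGMENTS.each do |key, segment|
    value = refinements[key].to_s
    next if value.empty?
    url << "/#{segment}/#{ERB::Util.url_encode(value)}"
  end
  url
end
```

Because the order comes from the constant rather than from the order in which the user clicked, refining by merchant then brand produces the same URL as refining by brand then merchant.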
The next requirement I encountered was breadcrumb functionality. Breadcrumbs are an important usability feature that provide the ability to navigate backwards in search and refinement history. Because of the canonical URL solution described above, the URL could not be used to indicate the search refinement history. For example, http://www.gear.com/s/climb/brand/Arcteryx/merch/Altrec does not indicate whether the user had refined by brand then merchant, or by merchant then brand. Having implemented similar breadcrumb functionality for other End Point clients, I investigated a few solutions: appending a '#' (hash, or URL fragment) to the end of the URL with details of the user refinement path, using JavaScript to set a cookie containing the user refinement path whenever a link was clicked, and using a session variable to track the user refinement path. In the end, I found it easiest to use a single session variable to track the user refinement path. The session variable contained all information needed to display the breadcrumb with a bit of parsing.
For example, for the URL mentioned above, the session variable of 'brand-Arcteryx:merch-Altrec' would yield the breadcrumb:
"Your search: climb > Arcteryx > Altrec"
And the session variable 'merch-Altrec:brand-Arcteryx' would yield the breadcrumb:
"Your search: climb > Altrec > Arcteryx"
I could have used more than one session variable, but this solution turned out to be simple and took fewer than 10 lines of code.
Another interesting necessity was determining the best way to parse the XML data. I researched several XML parsers including XmlSimple, Hpricot, REXML, and libxml. About a year ago, John Nunemaker reported on some benchmark testing of several of these packages (Parsing XML with Ruby). After some investigative work, I chose Hpricot because it made complex selectors very easy to implement, and they reminded me of jQuery selectors (which are also easy to use). The interesting thing that I noticed throughout the implementation was that the refinement parsing took much more time than the actual product parsing and formatting. For Gear.com, the number of products returned ranged from 20 to 60, and products were quickly parsed. The number of refinements returned ranged from very small for a distinct search such as Moccasym (4 refinement options) to large for a general search such as jacket (50+ refinement options). If performance becomes an issue in the future, I can further investigate libxml-ruby or other Ruby XML parsing tools that may improve performance.
A final point of interest was the decision to tie the Rails application to the same database that drives the product pages (which was easily done). This decision was made to allow access to frontend taxonomy information for the product categorization. For example, if a user chooses to refine a search by a specific category (jacket in Kids Clothing), the Rails app can retrieve all the taxonomy information for that category, such as the display name, the number of products in that category, subcategories, and subsubcategories. This information may be required for additional features, such as providing the ability to view the subcategories in this category or view other products in this category that aren't shown in the search results.
I was happy to see the success of this project after working through the deliverables. Future work includes integration of additional search features common to many site search packages, such as implementing refinement by color and size, or retrieving recommended products or best sellers.
Comments
Last week the SEO world reacted to Matt Cutts' article about the use of nofollow in PageRank sculpting.
Google uses the PageRank algorithm to calculate the popularity of pages on the web. Popularity is only one factor in determining which pages are returned in search results (relevance to search terms is the other major factor). Other major search engines use similar popularity algorithms. Without describing the algorithm in detail, the important takeaways are:
a) PageRank of a single page is influenced by all inbound links (links from other pages)
b) PageRank of a single page is passed on to all outgoing links after being normalized and divided by the total number of outgoing links
So, given page C with inbound links from pages A and B, where pages A and B have equal PageRank X, page A has 3 total outgoing links, and page B has 5 total outgoing links, page C receives more PageRank from page A (X/3) than from page B (X/5).

From an external link perspective, it's great to get as many links as possible from a variety of sources that rank high and have a low number of external links. From an internal site perspective, it's important to examine how PageRank is passed throughout a site to apply the best site architecture. In addition to designing a site architecture that pleases users and passes link juice throughout a site effectively, the rel="nofollow" attribute was adopted by several major search engines and was used as an additional tool to stop the flow of link juice from one page to another. The nofollow attribute can also be used to identify paid links (early implementation) or to avoid passing link juice to external sites completely.
In the example above, rel="nofollow" could be added to 2 links on page B which would result in the same PageRank passed from page B to page C as from page A to page C.

Then, at a recent SEO conference, Matt Cutts (head of Google's webspam team) made a comment about how the PageRank algorithm had changed its use of nofollow, and just last week it was announced that the nofollow attribute would no longer work for PageRank sculpting. A link with the nofollow attribute will no longer reduce the count of outgoing links on a page, so it no longer increases the link juice passed through the page's other links. Link juice will still not be passed from one page to another through a link with the nofollow attribute.
In the ongoing example, the link juice passed from page B to page C will be less than from page A to page C because page B has more outgoing links, even if some of them are nofollow links.
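The arithmetic behind the running example can be sketched in a few lines. This is a deliberately simplified model based only on takeaway (b) above: it ignores the damping factor and the iterative nature of the real algorithm, and the method name and its parameters are made up for illustration.

```ruby
# Share of a page's PageRank passed through each followed link.
# nofollow_counts_in_divisor: false models the old sculpting behavior,
# true models the behavior described in the announcement.
def juice_per_link(page_rank, followed_links, nofollow_links, nofollow_counts_in_divisor:)
  divisor = nofollow_counts_in_divisor ? followed_links + nofollow_links : followed_links
  page_rank.to_r / divisor
end

x = 1 # pages A and B have equal PageRank X

# Page A: 3 outgoing links, none nofollow.
from_a = juice_per_link(x, 3, 0, nofollow_counts_in_divisor: true)

# Page B: 5 outgoing links, 2 of them marked nofollow.
from_b_old = juice_per_link(x, 3, 2, nofollow_counts_in_divisor: false)
from_b_new = juice_per_link(x, 3, 2, nofollow_counts_in_divisor: true)

puts "A -> C:                 #{from_a}"     # X/3
puts "B -> C (old sculpting): #{from_b_old}" # X/3, same as page A
puts "B -> C (after change):  #{from_b_new}" # X/5, nofollow links still dilute
```

Under this model, adding nofollow to two of page B's links used to raise its contribution to page C from X/5 to X/3; after the change it stays at X/5, and the share that would have gone through the nofollow links is simply not passed on.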

One SEOMoz article I read suggests that the SEO best practice will now be to recommend that blog owners disallow comments that may contain external links, to prevent the dilution of link juice. Other potential solutions would be to filter out links from user-generated content (comments or Q&A specifically), use iframes to display any user-generated content, or embed Flash or Java with external links. The nofollow attribute may still be used to stop the flow of link juice to external pages; however, it may no longer be used for internal PageRank sculpting.
Comments

I sent Eugenia and Thom a list of suggestions for improving your standing in search engines. It's really just a list of lessons I learned as I researched the topic for my company. I culled the info from websites, a search engine optimization partner, and various writings on the subject. The thing is, as I wrote in an earlier post, it's really worked for me. I've obviously done something right, because firsttube.com is returning much higher in all search engines in the last few months. Actually updating once in a while is probably a large part of it...
So I decided to implement many of the same things on OSNews. I have not only created "pretty URLs," which are really just URLs that obey PATH_INFO rather than standard GET variables, but I've also added lots of links, included additional internal links, cleaned up the page titles, and added links to the submission forms of popular social networking/social bookmarking sites in an attempt to earn some additional trust in PageRank.
I'm hoping to see some changes within the next 60 days. So, let's check back at Thanksgiving time and see what happens.
Tags: SEO, OSNews, Code, Development
Comments

Some time ago, I decided to learn a little bit about Google PageRank. I wanted to improve my site's rankings in Google's search results. I learned about Google-dancing and many of the sports that involve optimizing your page in search results. So I took the advice and redid a lot of my page to work with what I knew. Those of you who actually follow my blog have seen changes: tags, topics, changes to my XML feeds, new prettier URLs, etc. I keep track of all referrers when someone hits my blog. I see Yahoo! Slurp and the Googlebot crawl me every day. But then I started noticing something. I started seeing really simple Google searches referring to me. As of this writing, by typing:
picasaweb iphoto
into Google, firsttube.com returns as the FIRST result. In fact, if you just search for "picasaweb," I'm the fifth result, ranked only behind Google itself, ZDNet, and Miguel de Icaza. I have a ton of traffic coming to my Picasaweb vs. Flickr article, and lots of traffic goes to my "I want Picasa on Mac" article, even though it's mostly worthless other than me pining away.
Anyway, I definitely want to make the same types of changes on OSNews, because clearly we could always use a boost in search engine results. As a direct result of the boost firsttube.com has seen, I think I'll be adding friendly URLs to OSNews in the next few days.
Tags: SEO, Google, Yahoo, Pagerank, Slurp
Comments