Build Backlinks Online has posted a new item, 'SEO Finds in Your Server Logs,
Part 2: Optimizing for Googlebot'
Posted by timresnik
This is a follow-up to a post I wrote a few months ago that goes over some of
the basics of why server log files are a critical part of your technical SEO
toolkit. In this post, I provide more detail around formatting the data in Excel
in order to find and analyze Googlebot crawl optimization opportunities.
Before digging into the logs, it's important to understand the basics of
how Googlebot crawls your site. There are three basic factors that Googlebot
considers. First is which pages should be crawled. This is determined by factors
such as the number of backlinks that point to a page, the internal link
structure of the site, the number and strength of the internal links that point
to that page, and other internal signals like sitemaps.
Next, Googlebot determines how many pages to crawl. This is commonly referred
to as the "crawl budget." Factors that are most likely considered when
allocating crawl budget are domain authority and trust, performance, load time,
and clean crawl paths (Googlebot getting stuck in your endless faceted search
loop costs them money). For much more detail on crawl budget, check out Ian
Lurie's post on the subject.
Finally, the rate of the crawl (how frequently Googlebot comes back) is
determined by how often the site is updated, the domain authority, and the
freshness of citations, social mentions, and links.
Now, let's take a look at how Googlebot is crawling Moz.com (NOTE: the data I
am analyzing is from SEOmoz.org prior to our site migration to Moz.com. Several
of the potential issues that I point out below are now solved. Wahoo!). The
first step is getting the log data into a workable format. I explained in detail
how to do this in my last server log post. However, this time make sure to
include the parameters with the URLs so we can analyze funky crawl paths. Just
make sure the box below is unchecked when importing your log file.
The first thing that we want to look at is where on the site Googlebot is
spending its time and dedicating the most resources. Now that you have exported
your log file to a .csv file, you'll need to do a bit of formatting and
cleaning of the data.
1. Save the file with an Excel extension, for example .xlsx
2. Remove all the columns except for Page/File, Response Code and User Agent.
It should look something like this (formatted as a table, which can be done by
selecting your data and pressing Ctrl+L):
3. Isolate Googlebot from other spiders by creating a new column and writing a
formula that searches for "Googlebot" in the cells in the 3rd column.
4. Scrub the Page/File column for the top-level directory so we can later run
a pivot table and see which sections Google is crawling the most.
5. Since we left the parameters on the URLs in order to check crawl paths,
we'll want to remove them here so that the data is included in the top-level
directory analysis that we do in the pivot table. The URL parameter string
always starts with "?", so that is what we want to search for in Excel. This is a
little tricky because Excel uses the question mark character as a wildcard. In
order to indicate to Excel that the question mark is literal, use a preceding
tilde, like this: "~?"
6. The data can now be analyzed in a pivot table (data > pivot table). The
number associated with the directory is the total number of times Googlebot
requested a file in the timeframe of the log, in this case a day.
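If you would rather script this than work in Excel, steps 3 through 6 can be sketched in Python. This is only a rough equivalent of the spreadsheet process, not the method used for the Moz data: the column names are assumed to match the ones kept in step 2, the sample rows are invented, and the helper function names are hypothetical.

```python
import csv
import io
from collections import Counter

# Invented sample rows standing in for the exported .csv log file.
SAMPLE_CSV = """Page/File,Response Code,User Agent
/qa/view/123?page=2,200,Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
/qa/,200,Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
/users/profile/42,200,Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
/blog/some-post,200,Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
/webinars,200,Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
"""


def is_googlebot(user_agent):
    # Step 3: flag rows whose user agent mentions Googlebot (case-insensitive).
    return "googlebot" in user_agent.lower()


def strip_parameters(url):
    # Step 5: drop everything from the first "?" onward.
    return url.split("?", 1)[0]


def top_level_directory(url):
    # Step 4: "/qa/view/123" -> "/qa"; the home page "/" stays "/".
    # Root-level files such as "/robots.txt" end up in their own bucket.
    first = strip_parameters(url).strip("/").split("/", 1)[0]
    return "/" + first if first else "/"


def crawl_counts(csv_file):
    # Step 6: the pivot-table equivalent, Googlebot requests per directory.
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if is_googlebot(row["User Agent"]):
            counts[top_level_directory(row["Page/File"])] += 1
    return counts


counts = crawl_counts(io.StringIO(SAMPLE_CSV))
for directory, hits in counts.most_common():
    print(directory, hits)
```

To run it against a real export, pass an open file (for example `open("log.csv", newline="")`, where the filename is a placeholder) instead of the `io.StringIO` sample.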
Is Google allocating crawl budget properly? We can dive deeper into several
different pieces of data here:
Over 70% of Google's crawl budget focuses on three sections, while over 50% goes
towards /qa/ and /users/. Moz should look at search referral data from Google
Analytics to see how much organic search value these sections provide. If it is
disproportionately low, crawl management tactics or on-page optimization
improvements should be considered.
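The percentages here are simply each directory's request count divided by the total number of Googlebot requests. A minimal sketch of that arithmetic, using made-up counts (these are illustrative assumptions, not the figures from the Moz log):

```python
def crawl_share(counts):
    """Return each directory's share of total requests, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {directory: 100.0 * hits / total for directory, hits in counts.items()}


# Hypothetical per-directory Googlebot request counts for one day.
example_counts = {"/qa": 300, "/users": 250, "/blog": 200, "/webinars": 50, "/other": 200}

shares = crawl_share(example_counts)
for directory, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{directory:<12}{pct:5.1f}%")
```

Putting these shares next to each section's share of organic search visits is one way to spot sections that receive far more crawl attention than the traffic they earn.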
Another potential insight from this data is that /page-strength/, a URL used
for posting data for a Moz tool, is being crawled nearly 1,000 times. These
crawls are most likely triggered from external links pointing to the results of
the Moz tool. The recommendation would be to exclude this directory using
robots.txt.
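Before deploying an exclusion like this, it can be sanity-checked with Python's standard-library `urllib.robotparser`. The rule text below is an assumed example of what such an exclusion might look like, not Moz's actual robots.txt, and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content that excludes the tool's results directory.
ROBOTS_TXT = """User-agent: *
Disallow: /page-strength/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_blocked(url, user_agent="Googlebot"):
    # True when the parsed rules forbid this user agent from fetching the URL.
    return not parser.can_fetch(user_agent, url)


print(is_blocked("http://www.example.com/page-strength/result"))
print(is_blocked("http://www.example.com/blog/"))
```

The same `is_blocked` check can be run over the URLs Googlebot requested in the log to see whether any of them fall under rules that are supposed to exclude them.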
On the other end of the spectrum, it is important to understand the directories
that are rarely being crawled. Are there sections being under-crawled?
Let's look at a few of Moz's:
In this example, the directory /webinars pops out as not getting enough Google
attention. In fact, only the top directory is being crawled, while the actual
Webinar content pages are being skipped.
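One way to check for this pattern is to split the Googlebot requests within a section into hits on the section's landing page and hits on the pages beneath it. A rough sketch, assuming you already have a list of URLs requested by Googlebot (the sample paths and the function name are invented for illustration):

```python
def depth_breakdown(urls, section):
    """Count requests for a section's landing page vs. pages beneath it."""
    root = "/" + section.strip("/")
    landing = deeper = 0
    for url in urls:
        # Ignore parameters and a trailing slash when comparing paths.
        path = url.split("?", 1)[0].rstrip("/")
        if path == root:
            landing += 1
        elif path.startswith(root + "/"):
            deeper += 1
    return {"landing": landing, "deeper": deeper}


# Hypothetical Googlebot requests taken from a filtered log.
googlebot_urls = ["/webinars", "/webinars/", "/webinars?page=2", "/qa/view/1", "/qa/"]

print(depth_breakdown(googlebot_urls, "/webinars"))
print(depth_breakdown(googlebot_urls, "/qa"))
```

A section with many landing-page hits and no deeper hits is a candidate for the kind of under-crawling described above.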
These are just a few examples of crawl resource issues that can be found in
server logs. A few additional issues to look for include:
Are spiders crawling pages that are excluded by robots.txt?
Are spiders crawling pages that should be excluded by robots.txt?
Are certain sections consuming too much bandwidth? What is the ratio of the
number of pages crawled in a section to the amount of bandwidth required?
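The bandwidth question needs one more column than the three kept earlier: the number of bytes sent per request. Assuming your log export includes such a field (that is an assumption, as are the sample numbers below), average bytes per Googlebot request in each section can be sketched like this. Note that it reports bytes per request, the inverse of the pages-to-bandwidth ratio described above, which tends to be easier to read.

```python
from collections import defaultdict


def bytes_per_request(rows):
    """rows: iterable of (section, bytes_sent) pairs for Googlebot requests."""
    hits = defaultdict(int)
    total_bytes = defaultdict(int)
    for section, sent in rows:
        hits[section] += 1
        total_bytes[section] += sent
    # Average response size per section.
    return {section: total_bytes[section] / hits[section] for section in hits}


# Hypothetical (section, bytes sent) pairs.
sample_rows = [("/qa", 20000), ("/qa", 30000), ("/users", 5000), ("/blog", 120000)]

ratios = bytes_per_request(sample_rows)
for section, avg in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{section:<10}{avg:>10.0f} bytes per request")
```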
As a bonus, I have done a screencast of the above process for formatting and
analyzing the Googlebot crawl.
In my next post on analyzing log files, I will explain in more detail how to
identify duplicate content and look for trends over time. Feel free to share
your thoughts and questions in the comments below!
You may view the latest post at
http://feedproxy.google.com/~r/seomoz/~3/noIYBEP9OCw/seo-log-file-analysis-part-2