Thursday 5 September 2013

Comparing Rank-Tracking Methods: Browser vs. Crawler vs. Webmaster Tools


Posted by Dr-Pete

Deep down, we all have the uncomfortable feeling that rank-tracking is
unreliable at best, and possibly outright misleading. Then, we walk into our
boss's office, pick up the phone, or open our email, and hear the same
question: "Why aren't we #1 yet?!" Like it or not, rank-tracking is
still a fact of life for most SEOs, and ranking will be a useful signal and
diagnostic for when things go very wrong (or very right) for the foreseeable
future.


Unfortunately, there are many ways to run a search, and once you factor in
localization, personalization, data centers, data removal (such as [not
provided]), and transparency (or the lack thereof), it's hard to know how
any keyword really ranks. This post is an attempt to compare four common
rank-tracking methods:


Browser – Personalized
Browser – Incognito
Crawler
Google Webmaster Tools (GWT)

I'm going to do my best to keep this information unbiased and even
academic in tone. Moz builds rank-tracking tools based in part on crawled data,
so it would be a lie to say that we have no skin in the game. On the other hand,
our main goal is to find and present the most reliable data for our customers. I
will do my best to present the details of our methodology and data, and let you
decide for yourselves.
Methodology

We started by collecting a set of 500 queries from Moz.com's Google
Webmaster Tools (GWT) data for the month of July 2013. We took the top 500
queries for that time period by impression count, which provided a decent range
of rankings and click-through rates. We used GWT data because it's the
most constrained rank-tracking method on our list – in other words, we
needed keywords that were likely to pop up on GWT when we did our final data
collection.
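
For anyone replicating that selection step, it amounts to sorting a GWT "Search Queries" export by impressions and keeping the top 500. Here's a rough Python sketch; the file name and column names are assumptions for illustration, not Moz's actual pipeline:

import pandas as pd

# Hypothetical GWT "Search Queries" export for July 2013 (column names assumed).
gwt = pd.read_csv("gwt_search_queries_july_2013.csv")
top_queries = (
    gwt.sort_values("Impressions", ascending=False)  # rank queries by impression count
       .head(500)["Query"]                           # keep the top 500
       .tolist()
)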


On August 7th, we tracked these 500 queries using four methods:


(1) Browser – Personalized


This is the old-fashioned approach. I personally entered the queries on
Google.com via the Chrome browser (v29) and logged into my own account.


(2) Browser – Incognito


Again, using Google.com on Chrome, I ran the queries manually. This time,
though, I was fully logged out and used Chrome's incognito mode. While
this method isn't perfect, it seems to remove many forms of
personalization.

(3) Crawler

We modified part of the MozCast engine to crawl each of the 500 queries and
parse the results. Crawls occurred across a range of IP addresses (and
C-blocks), selected randomly. The crawler did not emulate cookies or any kind of
login, and we added the personalization parameter ("&pws=0") to
remove other forms of personalization. The crawler also used the
"&near=us" option to remove some forms of localization. We crawled
up to five pages of Google results, which produced data for all but 12 of the
500 queries (since these were queries for which we knew Moz.com had recently
ranked).
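
The MozCast crawler itself isn't public, but the query setup described above looks roughly like the sketch below; the user agent, request handling, and omitted HTML parsing are illustrative assumptions rather than Moz's implementation:

import requests
from urllib.parse import urlencode

def fetch_serp_pages(query, pages=5, results_per_page=10):
    # Fetch up to five pages of Google results for one query, with
    # personalization ("&pws=0") and some localization ("&near=us") disabled.
    html_pages = []
    for page in range(pages):
        params = {
            "q": query,
            "pws": 0,
            "near": "us",
            "num": results_per_page,
            "start": page * results_per_page,
        }
        url = "https://www.google.com/search?" + urlencode(params)
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        html_pages.append(resp.text)
    return html_pages  # parsing rank positions out of the HTML is left out here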

(4) Google Webmaster Tools

After Google made data available for August 7th, we exported average position
data from GWT (via "Search Traffic" > "Search
Queries") for that day, filtering to just "Web" and
"United States", since those were the parameters of the other
methods. While the other methods represent a single data point, GWT "Avg.
position" theoretically represents multiple data points. Unfortunately,
there is very little transparency about precisely how this data is measured.


Once the GWT data was exported and compared to the full list, there were 206
queries left with data from all four rank-tracking methods. All but a handful of
the dropped keywords were due to missing data in GWT's one-day report. Our
analyses were conducted on this set of 206 queries with full data.

Results: Correlations

To compare the four ranking methods, we started with the pair-wise Spearman
rank-order correlations (hat tip to my colleague, Dr. Matt Peters, for his
assistance on this and the following analysis). All correlations were
significant at the p<0.01* level, and r-values are shown in the table below:




*Given that the ranking methods are analogous to a repeated analysis of the
same data set, we applied the Bonferroni correction to all p-values.
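
For those who want to replicate the analysis, the pair-wise test plus correction looks something like the following sketch (toy ranking lists shown; the real input was the 206 queries with complete data):

from itertools import combinations
from scipy.stats import spearmanr

# One list of ranking positions per method, aligned by keyword (toy values).
rankings = {
    "Personalized": [3, 1, 12, 4, 7],
    "Incognito":    [7, 1, 12, 4, 7],
    "Crawler":      [7, 1, 3, 4, 5],
    "GWT":          [6.7, 1, 5, 4.1, 6],
}

pairs = list(combinations(rankings, 2))
for name_a, name_b in pairs:
    rho, p = spearmanr(rankings[name_a], rankings[name_b])
    p_adj = min(p * len(pairs), 1.0)  # Bonferroni: scale p by the six pair-wise tests
    print(f"{name_a} vs. {name_b}: r = {rho:.2f}, corrected p = {p_adj:.3f}")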


Interestingly, almost all of the methods showed very strong agreement, with
Personalized vs. Incognito showing the most agreement (not surprisingly, as both
are browser-based). Here's a scatterplot of that data, plotted on log-log
axes (done only for visualization's sake, since the rankings were grouped
pretty tightly at the upper spots):




Crawler vs. GWT had the lowest correlation, but it's important to note
that none of these differences were large enough to make a strong distinction
between them. Here's the scatterplot of that correlation, which is still
very high/positive by most reasonable standards:




Since the GWT "Average" data is precise to one decimal place,
there's more variation in the Y-values, but the linear relationship
remains very clear. Many of the keywords in this data set had #1 rankings in
GWT, which certainly helped boost the correlations, but the differences in the
methods appear to be surprisingly low.


If you're new to correlation and r-values, check out my quick refresher: the
correlation "mathographic". The statement "p<0.01" means that there is less
than a 1% probability that these r-values were the result of random chance. In
other words, we can be 99% sure that there was some correlation in play (and it
wasn't zero). This doesn't tell us how meaningful the correlation is. In this
particular case, we're just comparing sets of data to see how similar they are
– we're not making any statements about causation.

Results: Agreement

One problem with the pair-wise correlations is that we can only compare any
one method to another. In addition, there's a certain amount of dependence
between the methods, so it's hard to determine what a "strong"
correlation is. During a smaller, pilot study, we decided that what we're
really interested in is how any given method compares to the totality of the
other three methods. In other words, which method agrees or disagrees the most
with the rest of the methods?


With the help of Dr. Peters, I created a metric of agreement (or, more
accurately, disagreement). I'll save the full details for Appendix A at
the end of this article, but here's a short version. Let's say that
the four methods return the following rankings (keeping in mind that GWT is an
average):


2
1
1
2.8

Our disagreement metric produces the following values for each of the methods:

2.89
2.34
2.34
3.58

Since the two #1 rankings show the most agreement, methods (2) and (3) have the
same score, with method (1) showing more disagreement and (4) showing the most
disagreement. The greater the distance between the rankings, the higher the
disagreement score, but any rankings that match will have the same score for any
given keyword.

This yielded a disagreement score for each of the four methods for each of the
206 queries. We then took the mean disagreement score for each method, and got
the following results:


Personal = 1.12

Incognito = 0.82

Crawler = 0.98

GWT = 1.26




GWT showed the highest average disagreement from the other methods, with
incognito rankings coming in on the low end. On the surface, this suggests that,
across the entire set of methods, GWT disagreed with the other three methods the
most often.


Given that we've invented this disagreement metric, though, it's
important to ask if this difference is statistically significant. This data
proved not to be normally distributed (a chunk of disagreement=0 data points
skewed it to one side), so we decided our best bet for comparison was the
non-parametric Mann-Whitney U Test.
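
In code, that comparison is a one-liner with SciPy. The disagreement scores below are made-up stand-ins, since the real ones came from the 206-query data set:

from scipy.stats import mannwhitneyu

incognito_scores = [0.0, 1.0, 0.0, 2.3, 0.0, 1.4]  # hypothetical per-keyword disagreement scores
gwt_scores       = [0.0, 2.9, 1.7, 3.6, 0.0, 2.1]

stat, p = mannwhitneyu(incognito_scores, gwt_scores, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")  # judged against the p < 0.01 bar used here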


Comparing the disagreement data for each pair of methods, the only difference
that approached statistical significance was Incognito vs. GWT (p=0.022). Since
I generally try to keep the bar high (p<0.01), I have to play by my own rules
and say that the disagreement scores were too close to call. Our data cannot
reliably tell the levels of disagreement apart at this point.

Results: Outliers

Even if the statistics told us that one method clearly disagreed more than the
other methods, it still wouldn't answer one very important question
– which method is right? Is it possible, for example, that Google
Webmaster Tools could disagree with all of the other methods, and still be the
correct one? Yes, it's within the realm of possibility.


No statistic will tell us which method is correct if we fundamentally distrust
all of the methods (and I do, at least to a point), so our next best bet is to
dig into some of the specific cases of disagreement and try to sort out
what's happening. Let's look at a few cases of large-scale
disagreement, trying not to bias toward any particular method.

Case 1 – Personalization Boost

Many of the cases where personalization disagreed are what you'd expect
– Moz.com was boosted in my personalized results. For example, a search
for "seo checklist" had Moz.com at #3 in my logged-in results, but
#7 for both incognito and crawled, and an average of 6.7 for GWT (which is
consistent with the #7 ballpark). Even just clicking personalization off dropped
Moz.com to #4, and in a logged-out browser a few days after the original
data collection, it was at #5.


What's fascinating to me is that personalization didn't disagree
even more often. Consider that all of these queries were searches that generated
traffic for Moz.com, and I'm on the site every day and very active in the
SEO community. If personalization has the impact we seem to believe it has, I
would theorize that personalized searches would disagree the most with other
methods. It's interesting that that wasn't the case. While
personalization can have a huge impact on some queries, the number of searches
it affects still seems to be limited.

Case 2 – Personalization Penalty

In some cases, personalization actually produced lower rankings. For example,
a search for "what is an analyst" showed Moz.com at the #12 position
for both personalized and incognito searches. Meanwhile, crawled rankings put us
at #3, and GWT's average ranking was #5. Checking back (semi-manually), I
now see us at #10 on personalized search and up to #2 for crawled rankings.


Why would this happen? Both searches (personalized vs. crawled) show a
definition box for "analyst" at the top, which could indicate some
kind of re-ranking in play, but the top 10 after that box differ by quite a bit.
One would naturally assume that Moz.com would get a boost in any of my
personalized searches, but that's simply not the case. The situation is
much more complex and real-time than we generally believe.

Case 3 – GWT (Ok, Google) Hates Us

Here's one where GWT seems to be out of whack. In our one-day data
collection, a search for "seo" showed Moz at #3 for personalized
rankings and #4 for incognito and crawled. Meanwhile, GWT had us down in the #6
spot. It's not a massive difference, but for such an important head
keyword, it definitely could lead to some soul-searching.


As of this writing, I was seeing Moz.com in the #4 spot, so I called in some
help via social media. I asked people to do a logged-in (personalized) search
for "seo" and report back where they found Moz.com. I removed data
from non-US participants, which left 63 rankings (36 from Twitter, and 27 from
Facebook). The reported rankings ranged from #3 to #8, with an average of 4.11.
These rankings were reported from across the US, and only two participants
reported rankings at #6 or below. Here's the breakdown of the raw data:




You can see the clear bias toward the #4 position across the social data. You
could argue that, since many of my friends are SEOs, we all have similarly
biased rankings, but this quickly leads to speculation. Saying that GWT numbers
don't match because of personalization is a bit like saying that the
universe must be made of dark matter just because the numbers don't add up
without it. In the end, that may be true, but we still need the evidence.

Face Validity

Ultimately, this is my concern – when GWT's numbers disagree,
we're left with an argument that basically boils down to "Just trust
us." This is difficult for many SEOs, given what feels like a concerted
effort by Google to remove critical data from our view. On the one hand, we know
that personalization, localization, etc. can skew our individual viewpoints (and
that browser-based rankings are unreliable). On the other hand, if 56 out of 63
people (89%) all see my site at #3 or #4 for a critical head term and Google
says the "average" is #6, that's a hard pill to swallow with
no transparency around where Google's number is coming from.


In measurement, we call this "face validity". If something
doesn't look right on the surface, we generally want more proof to sort
out why, and that's usually a reasonable instinct. Ultimately,
Google's numbers may be correct – it's hard to prove
they're not. The problem is that we know almost nothing about how
they're measured. How does Google count local and vertical results, for
example? What/who are they averaging? Is this a sample, and if so, how big of a
sample and how representative? Is data from [not provided] keywords included in
the mix?


Without these answers, we tend to trust what we can see, and while we may be
wrong, it's hard to argue that we shouldn't. What's more,
it's nearly impossible to convince our clients and bosses to trust a
number they can't see, right or wrong.

Conclusions

The "good" news, if we're being optimistic, is that the four
methods we considered in this study (Personalized, Incognito, Crawler, and GWT)
really didn't differ that much from each other. They all have their
potential faults, but in most cases they'll give you an answer
that's in the ballpark of reality. If you focus on relative change over
time and not absolute numbers, then all four methods have some value, as long as
you're consistent.


Over time, this situation may change. Even now, none of these methods measure
anything beyond core organic ranking. They don't incorporate local
results, they don't indicate if there are prominent SERP features (like
Answer Boxes or Knowledge Graph entries), they don't tell us anything
about click-through or traffic, and they all suffer from the little white lie of
assumed linearity. In other words, we draw #1 - #10, etc. on a straight line,
even though we know that click-through and impact drop dramatically after the
first couple of ranking positions.


In the end, we need to broaden our view of rankings and visibility, regardless
of which measurement method we use, and we need to keep our eyes open. In the
meantime, the method itself probably isn't critically important for most
keywords, as long as we're consistent and transparent about the
limitations. When in doubt, consider getting data from multiple sources, and
don't put too much faith in any one number.

Appendix A: Measuring Disagreement

During a pilot study, we realized that, in addition to pair-wise comparisons
of any two methods, what we really wanted to know was how any one method
compared to the rest of the methods. In other words, which methods agreed (or
disagreed) the most with the set of methods as a whole? We invented a fairly
simple metric based on the sum of the differences between each of the methods.
Let's take the example from the post – here, the four methods returned the
following rankings (for Keyword X):


2
1
1
2.8

We wanted to reward methods (2) and (3) for being the most similar (it doesn't
matter that they showed Keyword X in the #1 position, just that they agreed),
and slightly penalize (1) and (4) for mismatching. After testing a few options,
we settled (I say "we", but I take full blame for this particular nonsense) on
calculating the sum of the square roots of the absolute differences between each
method and the other three methods.

That sounds a lot more complicated than it actually is. Let's calculate the
disagreement score for method 1, which we'll call "M1" (likewise, we'll call the
other methods M2, M3, and M4). I call it a "disagreement" score because larger
values ended up representing lower agreement. For M1 for Keyword X, the
disagreement score is calculated by:

sqrt(abs(M1-M2)) + sqrt(abs(M1-M3)) + sqrt(abs(M1-M4))

The absolute value is used because we don't care about the direction of the
difference, and the square root is essentially a dampening function. I didn't
want outliers to be overemphasized, or one bad data point for one method could
potentially skew the results. For Method 1 (M1), then, the disagreement value
is:

sqrt(abs(2-1)) + sqrt(abs(2-1)) + sqrt(abs(2-2.8))

...which works out to 2.89. Here are the values for all four methods:


2.89
2.34
2.34
3.58

Let's look at a couple more examples, just so that you don't have to take my
word for how this works. In this second case, two methods still agree, but the
ranking positions are "lower" (which equates to larger numbers), as follows:

12
12
3
5

The disagreement metric yields the following values:

5.65
5.65
7.41
6.71

M1 and M2 are in agreement, so they have the same disagreement value, but all
four values are elevated a bit to show that the overall distance across the four
methods is fairly large. Finally, here's an example where two methods each agree
with one other method:

2
2
5
5

In this case, all four methods have the same disagreement score:

3.46
3.46
3.46
3.46

Again, we don't care very much that two methods ranked Keyword X at #2 and two
at #5 – we only care that each method agreed with one other method. So, in
this case, all four methods are equally in agreement, when you consider the
entire set of rank-tracking methods. If the difference between the two pairs of
methods was larger, the disagreement score would increase, but all four methods
would still share that score.
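
If you'd like to check the math or reuse the metric, here's a minimal Python version of the score exactly as defined above; it reproduces the three worked examples:

from math import sqrt

def disagreement_scores(ranks):
    # For each method's rank, sum the square roots of the absolute
    # differences from the other three methods' ranks.
    return [
        sum(sqrt(abs(r - other)) for j, other in enumerate(ranks) if j != i)
        for i, r in enumerate(ranks)
    ]

print([round(s, 2) for s in disagreement_scores([2, 1, 1, 2.8])])  # [2.89, 2.34, 2.34, 3.58]
print([round(s, 2) for s in disagreement_scores([12, 12, 3, 5])])  # [5.65, 5.65, 7.41, 6.71]
print([round(s, 2) for s in disagreement_scores([2, 2, 5, 5])])    # [3.46, 3.46, 3.46, 3.46]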

Finally, for each method, we took the mean disagreement score across the 206
keywords with full ranking data. This yielded a disagreement measurement for
each method. Again, these measurements turned out not to differ by a
statistically significant margin, but I've presented the details here for
transparency and, hopefully, for refinement and replication by other people down
the road.






