Tuesday, 22 October 2013

[Build Backlinks Online] A [Poorly] Illustrated Guide to Google's Algorithm


Posted by Dr-Pete
Like all great literature, this post started as a bad joke on Twitter on a
Friday night:




If you know me, then this kind of behavior hardly surprises you (and I
probably owe you an apology or two). What's surprising is that Google's Matt
Cutts replied, and fairly seriously:




Matt's concern that even my painfully stupid joke could be misinterpreted
demonstrates just how confused many people are about the algorithm. This tweet
actually led to a handful of very productive conversations, including one with
Danny Sullivan about the nature of Google's "Hummingbird" update.


These conversations got me thinking about how much we oversimplify what "the
algorithm" really is. This post is a journey in pictures, from the most basic
conception of the algorithm to something that I hope reflects the major concepts
Google is built on as we head into 2014.

The Google algorithm

There's really no such thing as "the" algorithm, but that's how we think about
it: as some kind of monolithic block of code that Google occasionally
tweaks. In our collective SEO consciousness, it looks something like this:




So, naturally, when Google announces an "update", all we see are shades of
blue. We hear about a major algorithm update every month or two, and yet Google
confirmed 665 updates (technically, they used the word "launches") in
2012. Obviously, there's something more going on here than just changing a
few lines of code in some mega-program.

Inputs and outputs

Of course, the algorithm has to do something, so we need inputs and outputs.
In the case of search, the most fundamental input is Google's index of the
worldwide web, and the output is search engine result pages (SERPs):




Simple enough, right? Web pages go in, [something happens], search results
come out. Well, maybe it's not quite that simple. Obviously, the algorithm
itself is incredibly complicated (and we'll get to that in a minute), but even
the inputs aren't as straightforward as you might imagine.


First of all, the index really lives in roughly a dozen data centers distributed
across the world. Each data center is a miniature city unto itself, and they are
linked by one of the most impressive global fiber optic networks ever built. So,
let's at least add some color and say it looks something more like this:




Each block in that index illustration is a cloud of thousands of machines and
an incredible array of hardware, software and people, but if we dive deep into
that, this post will never end. It's important to realize, though, that the
index isn't the only major input into the algorithm. To oversimplify, the system
probably looks more like this:




The link graph, local and maps data, the social graph (predominantly Google+),
and the Knowledge Graph (essentially, a collection of entity databases) are all
major inputs that exist beyond Google's core index of the worldwide web. Again,
this is just a conceptualization (I don't claim to know how each of these is
actually structured as physical data), but each of these inputs is a unique and
important piece of the search puzzle.


For the purposes of this post, I'm going to leave out personalization, which
has its own inputs (like your search history and location). Personalization is
undoubtedly important, but it impacts many areas of this illustration and is
more of a layer than a single piece of the puzzle.

Relevance, ranking and re-ranking

As SEOs, we're mostly concerned (i.e. obsessed) with ranking, but we forget
that ranking is really only part of the algorithm's job. I think it's useful to
split the process into two steps: (1) relevance, and (2) ranking. For a page to
rank in Google, it first has to make the cut and be included in the list. Let's
draw it something like this:




In other words, first Google has to pick which pages match the search, and
then they pick which order those pages are displayed in. Step (1) relies on
relevance: a page can have all the links, +1s, and citations in the world,
but if it's not a match to the query, it's not going to rank. The Wikipedia page
for Millard Fillmore is never going to rank for "best iPhone cases," no matter
how much authority Wikipedia has. Once Wikipedia clears the relevance bar,
though, that authority kicks in and the page will often rank well.
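
To make the two-step idea concrete, here is a toy sketch in Python. It is only an illustration of "filter for relevance, then order by authority" and makes no claim about how Google implements either step. The `Page` class, the `authority` number, and the word-matching rule are all assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    authority: float  # hypothetical stand-in for link-based signals


def is_relevant(page: Page, query: str) -> bool:
    """Step 1: a crude relevance gate. Every query term must appear in the page text."""
    words = set(page.text.lower().split())
    return all(term in words for term in query.lower().split())


def search(pages: list[Page], query: str) -> list[str]:
    """Step 2: order only the pages that passed the relevance gate."""
    candidates = [p for p in pages if is_relevant(p, query)]
    ranked = sorted(candidates, key=lambda p: p.authority, reverse=True)
    return [p.url for p in ranked]


pages = [
    Page("wikipedia.org/Millard_Fillmore", "millard fillmore thirteenth president", 0.99),
    Page("smallshop.example/cases", "best iphone cases reviewed", 0.20),
    Page("bigreview.example/iphone", "the best iphone cases of the year", 0.70),
]

results = search(pages, "best iPhone cases")
print(results)
```

In this sketch the Millard Fillmore page has the highest authority of the three, yet it never reaches the ranking step because it fails the relevance gate. Authority only decides the order among the pages that do match.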


Interestingly, this is one reason that our large-scale correlation studies
show fairly low correlations for on-page factors. Our correlation studies only
measure how well a page ranks once it's passed the relevance threshold. In 2013,
it's likely that on-page factors are still necessary for relevance, but they're
not sufficient for top rankings. In other words, your page has to clearly be
about a topic to show up in results, but just being about that topic doesn't
mean that it's going to rank well.


Even ranking isn't a single process. I'm going to try to cover an incredibly
complicated topic in just a few sentences, a topic that I'll call "re-ranking."
Essentially, Google determines a core ranking and what we might call a "pure"
organic result. Then, secondary ranking algorithms kick in. These include
local results, social results, and vertical results (like news and images).
These secondary algorithms rewrite or re-rank the original results:




To see this in action, check out my post on how Google counts local results.
Using the methodology in that post, you can clearly see how Google determines a
base set of rankings, and then the local algorithm kicks in and not only adds
new features but re-ranks the original results. This diagram is only the tip of
the iceberg: Bill Slawski has an excellent three-part series on re-ranking
that covers 40 different ways Google may re-rank results.
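
The "core ranking first, secondary passes second" idea can be sketched as a small pipeline. This is a conceptual toy under loud assumptions: the result dictionaries, the `local` flag, and the "near me" trigger are all made up for illustration, and the real local algorithm is certainly far more involved.

```python
def core_ranking(candidates: list[dict]) -> list[dict]:
    """Produce the 'pure' organic order from a hypothetical base score."""
    return sorted(candidates, key=lambda r: r["score"], reverse=True)


def local_reranker(results: list[dict], query: str) -> list[dict]:
    """Hypothetical local pass: for local-intent queries, move local results
    ahead of the rest while keeping the core order within each group."""
    if "near me" not in query:
        return results
    local = [r for r in results if r["local"]]
    other = [r for r in results if not r["local"]]
    return local + other


def rerank(results: list[dict], query: str, passes) -> list[dict]:
    """Apply each secondary pass, in order, to the output of the previous one."""
    for secondary in passes:
        results = secondary(results, query)
    return results


candidates = [
    {"url": "a.example", "score": 0.9, "local": False},
    {"url": "b.example", "score": 0.5, "local": True},
    {"url": "c.example", "score": 0.7, "local": False},
]

core = core_ranking(candidates)
core_urls = [r["url"] for r in core]
local_urls = [r["url"] for r in rerank(core, "pizza near me", [local_reranker])]
plain_urls = [r["url"] for r in rerank(core, "pizza recipe", [local_reranker])]
print(core_urls, local_urls, plain_urls)
```

The point of the sketch is that the secondary pass never computes its own ranking from scratch. It starts from the core order and rearranges it, which is why the same base set can surface differently depending on the query.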

Special inputs: penalties and disavowals

There are also special inputs (for lack of a better term). For example, if
Google issues a manual penalty against a site, that has to be flagged somewhere
and fed into the system. This may be part of the index, but since this process
is managed manually and tied to Google Webmaster Tools, I think it's useful to
view it as a separate concept.


Likewise, Google's disavow tool is a separate input, in this case one
partially controlled by webmasters. This data must be periodically processed and
then fed back into the algorithm and/or link graph. Presumably, there's a
semi-automated editorial process involved to verify and clean this
user-submitted data. So, that gives us something like this:




Of course, there are many inputs that feed other parts of the system. For
example, XML sitemaps in Google Webmaster Tools help shape the index. My goal is
to give you a flavor for the major concepts. As you can see, even the "simple"
version is quickly getting complicated.
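
As one example of a special input, here is a rough sketch of how disavow data might be applied to a link graph. The file format (one URL or `domain:` entry per line, with `#` comments) matches what the disavow tool accepts, but everything else is an assumption for illustration: the `(source_url, target_site)` link list, the function names, and the exact-host matching rule are hypothetical, and how Google actually processes these files is not public.

```python
from urllib.parse import urlparse


def parse_disavow(text: str) -> tuple[set[str], set[str]]:
    """Split a disavow file into disavowed domains and disavowed URLs."""
    domains, urls = set(), set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("domain:"):
            domains.add(line[len("domain:"):].strip().lower())
        else:
            urls.add(line)
    return domains, urls


def apply_disavow(links, target_site, domains, urls):
    """Drop links to target_site whose source is disavowed.
    Simplification: a domain entry only matches the exact host, not subdomains."""
    kept = []
    for source, target in links:
        host = urlparse(source).hostname or ""
        disavowed = target == target_site and (source in urls or host in domains)
        if not disavowed:
            kept.append((source, target))
    return kept


disavow_text = """# links we did not ask for
domain:spam.example
http://blog.example/bad-post
"""

links = [
    ("http://spam.example/page1", "mysite.example"),
    ("http://blog.example/bad-post", "mysite.example"),
    ("http://blog.example/good-post", "mysite.example"),
    ("http://spam.example/page2", "othersite.example"),
]

domains, urls = parse_disavow(disavow_text)
kept = apply_disavow(links, "mysite.example", domains, urls)
print(kept)
```

Note that the link from spam.example to the other site survives: a disavow file only speaks for the webmaster who submitted it.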

Updates: Panda, Penguin and Hummingbird

Finally, we have the algorithm updates we all know and love. In many cases, an
update really is just a change or addition to some small part of Google's code.
In the past couple of years, though, algorithm updates have gotten a bit more
tricky.


Let's start with Panda, originally launched in February of 2011. The Panda
update was more than just a tweak to the code: it was (and probably still
is) a sub-algorithm with its own data structures, living outside of the core
algorithm (conceptually speaking). Every month or so, the Panda algorithm would
be re-run, Panda data would be updated, and that data would feed what you might
call a Panda ranking factor back into the core algorithm. It's likely that
Penguin operates similarly, in that it's a sub-algorithm and separate data set.
We'll put them outside of the big, blue oval:




I don't mean to imply that Panda and Penguin are the same; they operate
in very different ways. I'm simply suggesting that both of these algorithm
updates rely on their own code and data sources and are only periodically fed
back into the system.
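
Here is a minimal sketch of that "separate data, periodically fed back" pattern. It is purely conceptual: the `QualityScorer` class, the thin-page ratio, and the way the score multiplies into the core score are all invented for illustration and say nothing about what Panda or Penguin actually measure.

```python
class QualityScorer:
    """Stand-in for a Panda-like sub-algorithm that keeps its own data
    and is re-run offline rather than at query time."""

    def __init__(self):
        # Snapshot read by the core ranking between refreshes.
        self.scores: dict[str, float] = {}

    def refresh(self, site_stats: dict[str, dict]) -> None:
        """Batch job: recompute a per-site quality factor from hypothetical stats."""
        self.scores = {
            site: 1.0 - stats["thin_pages"] / stats["total_pages"]
            for site, stats in site_stats.items()
        }


def core_score(base_score: float, site: str, quality: QualityScorer) -> float:
    """The core ranking only reads the latest snapshot; it never recomputes it."""
    return base_score * quality.scores.get(site, 1.0)


quality = QualityScorer()
stats = {"thin.example": {"thin_pages": 75, "total_pages": 100}}

before_first_run = core_score(10.0, "thin.example", quality)

quality.refresh(stats)  # the periodic update runs
after_first_run = core_score(10.0, "thin.example", quality)

stats["thin.example"]["thin_pages"] = 0  # the site cleans up its content...
still_stale = core_score(10.0, "thin.example", quality)  # ...but nothing changes yet

quality.refresh(stats)  # next periodic update
after_second_run = core_score(10.0, "thin.example", quality)

print(before_first_run, after_first_run, still_stale, after_second_run)
```

The stale value in the middle is the behavior site owners noticed in practice: fixing a problem doesn't help until the next time the sub-algorithm's data is refreshed.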


Why didn't Google just re-write the algorithm to account for the Panda and/or
Penguin intent? Part of it is computational: the resources required to
process this data are probably beyond what the real-time infrastructure can
handle. As Google gets faster and more powerful, these sub-algorithms may become
fully integrated (and Panda is probably more integrated than it once was). The
other reason may involve testing and mitigating impact. It's likely that Google
only updates Penguin periodically because of the large impact that the first
Penguin update had. This may not be a process that they want to simply let loose
in real time.


So, what about the recent Hummingbird update? There's still a lot we don't
know, but Google has made it pretty clear that Hummingbird is a fundamental
rewrite of how the core algorithm works. I don't think we've seen the full
impact of Hummingbird yet, personally, and the potential of this new code may be
realized over months or even years, but now we're talking about the core
algorithm(s). That leads us to our final image:






Image credit for hummingbird silhouette: Michele Tobias at Experimental Craft.


The end result surprised even me as I created it. This was the most basic
illustration I could make that didn't feel misleading or simplistic. The reality
of Google today far surpasses this diagram: every piece is really dozens of
smaller pieces. I hope, though, that this gives you a sense of what the
algorithm really is and does.

Additional resources

If you're new to the algorithm and would like to learn more, Google's own "How
Search Works" resource is actually pretty interesting (check out the
sub-sections, not just the scroller). I'd also highly recommend Chapter 1 of our
Beginner's Guide: "How Search Engines Operate." If you just want to know more
about how Google operates, Steven Levy's book "In The Plex" is an amazing read.

Special bonus nonsense!

While writing this post, the team and I kept thinking there must be some way
to make it more dynamic, but all of our attempts ended badly. Finally, I just
gave up and turned the post into an animated GIF. If you like that sort of
thing, then here you go...





You may view the latest post at
http://feedproxy.google.com/~r/seomoz/~3/_rxMm03y4n8/a-poorly-illustrated-guide-to-googles-algorithm
