'Link Popularity Made Simple (sort of...)'
by Steve Clason of TopDog Strategy
This article explains link popularity (specifically Google's use of link
popularity) as it pertains to search engine rankings. The explanation
requires some mathematics, but you don't need a strong mathematical background
to understand it -- high school algebra will get you through. You won't
find specific methods for improving your search engine rankings, but reading
this may help you select the more productive options offered to you for
improving rankings.
Ever since Yahoo! replaced Inktomi as their search engine with Google,
with their link-oriented ranking algorithm, the Web has been abuzz with
the importance of "link popularity". It is now "crucial"
that we attend to our links. We read, over and over, sentences like, "To
push your site towards the top of search results, it's important to have
many web sites point or link to your site." Further, we're told that
some links are better than others, that (for instance) links from .gov
or .edu sites matter more than those from cousin Sally, that free-for-all
link sites don't matter at all (or that they help immensely)... and on
and on.
This buzz, and the information attending it comes from sources of varying
credibility, but almost all offer a specific product or service taking
advantage of the information. Oddly missing has been a popular explanation
of how ranking algorithms employ link analysis in their calculations --
explanations that would provide the background information allowing us
to judge the credibility of the offers .
That lack is odd because of all the features of search engine ranking
algorithms, link popularity may be the easiest to understand. Not because
it is the least complicated, but because it has been so thoroughly described
by its developers. Descriptions of specific implementations of the technique
are widely published in academic circles, and are available on the Web.
This article stems from one of those publications.
Links Measure Importance
Before diving into the details, let's gloss the basics. Link "topology"
is really a better term than "popularity" because the technique
treats 'interconnectedness' as well as count. Whatever you call it, the
analysis measures a page's importance. That's different from relevance.
While relevance means how well the content of a Web page pertains to a
specific query string, "importance" refers to a page's worth,
without regard to the content. A link to a page makes a statement about
the page's perceived worth and so confers some value on it. The more value
conferred through links, the more important the page is.
But we don't all have the same value to confer. Some of our pages are
more important than others, and a link from an important page confers
more value than a link from an unimportant one.
So: An important page is one that has important pages pointing to it.
Circular reasoning? Absolutely—but easy to grasp intuitively. For
instance, a link to your site from the National Institute of Standards
and Technology ought to confer more value than a link from your cousin
Sally. Not because Sally is partial, but because NIST is more important,
in some vague way that we nevertheless seem to understand.
How Importance Is Measured
Although easy enough to grasp with two or three pages, measuring the
relative importance of billions of inter-connected pages seems hopelessly
complicated. And complicated it is, but not hopelessly—it's really
pretty straightforward. It does require a lot of calculations, but luckily
we don't have to invent them. We can just get them from the academic literature.
While graduate students at Stanford, Larry Page and Sergey Brin (the
inventors of Google the search engine and the founders of Google the company)
published "The
Anatomy of a Large-Scale Hypertextual Search Engine", which you
can download in PDF format.
Their paper describes PageRank, Google's technique for defining a page's
importance on the basis of the pages that link to it. It is this method
that I elaborate in this article.
Here's the PageRank formula. It looks difficult, but don't let it intimidate
you. We'll take it a little at a time; patience and high school algebra
will get you through it.
Assume a web page, called A, that has a number of pages that link to
it. Call those pages that link to it T1, T2, T3, and so on to the last
one, called Tn
No math so far -- this just names things so we can talk about them later.Ý
Think of A as your home page, and T1 to Tn as other pages on the Web that
contain hypertext links to yours. T2 can even be your cousin Sally's,
if that helps.
The PageRank of A is calculated by this equation:
PR(A) = (1-d)+d [PR(T1)/C(T1)+PR(T2)/C(T2)+PR(T3)/C(T3)+Ö+PR(Tn)/C(Tn)]
This looks really nasty. But ripped into its three components, it's much
easier.
PR(A) means the PageRank of A; that's what we are trying to find out.
This just states the problem -- the real work happens on the other side
of the equals sign.
(1-d)+d is a damping factor.Ý Never mind what it does. Page and
Brin recommend a setting of 0.85, so we'll set it to that and then forget
about it. Although it probably matters a lot if you run a search engine,
it doesn't matter much for our purposes here.ÝWe're just going
to calculate that big mess between the brackets (these things,Ý
[ and ]), multiply it by 0.85 and add that result to 0.15 to stay true
to the formula.
Now, to the big mess between the brackets.Ý If we reformat this
section, like this:
PR(T1)/C(T1)+
PR(T2)/C(T2)+
PR(T3)/C(T3)+Ö+
PR(Tn)/C(Tn)]
it's easier to recognize T1, T2, and T3 as those pages that link to A,
and (I hope) easy to see that an exceedingly simple calculation is being
done to them.Ý The apparent complication comes from the quantity
of the calculations, not the difficulty.
PR still means PageRank, just like on the left of the equals sign, T1,
T2, etc., just name the pages. The only new thing is the C -- as in C(T2).Ý
This is defined as the number of hypertext links on page T2.Ý The
number of outgoing links.Ý This is an outgoing link:
http://www.adventive.com
On the receiving end (at Adventive's home page), this is an incoming link
(or an in-link), but on this page, it is an out-link. Both matter!
Putting back together the three components we earlier ripped, we can
now list a set of steps for applying the formula to any page:
make a list of all the pages which point to the target page (we'll call
it a link list);
for each page on the link list:
determine the PageRank;
count the number of outgoing links;
divide each page's PageRank by the number of its outgoing links;
sum the results of step 2 for the entire link list;
apply the damping factor to the resulting sum.
Making The Calculations
Four steps. Simple enough, but where do we start?. In order to determine
PageRank, this says we have to already know the PageRank of the link list
-- have to know, that is, exactly what we're trying to find out.
But the PageRank formula achieves its results by repeating the calculation
until the results converge on a stable result.Ý Meaning that we
can just start someplace arbitrary, and everything will work itself out.
This is a remarkably clever idea, which I'm going to belabor. To demonstrate
how it works, I've created a little Web made up of 10 pages, and we'll
employ PageRank calculations to rank these pages according to their importance.
The universe looks like this -- the circles are Web pages, the lines
between them are hyperlinks.Ý The arrows show the direction of
the links:
What a mess, eh?Ý But things clear up fast once we start the calculations.
Stepping through the sequence, first we make a link list. Here's A's:
A has 6 in-links, from B, E, G, H, I, and J
You can do the rest if you want to follow along closely.
Next we find the PageRank for each of the pages on this link list.Ý
Of course, at this point we don't know any PageRanks, so, we'll arbitrarily
assign each page a PageRank of 1 for the first iteration of the algorithm.
Next we count the number of outgoing links for each page on the link
list and divide PageRank by the result of the count.ÝUsing A's
link list, we generate this table:
Page |
PageRank |
# out-links |
PR/out-links |
B |
1 |
6 |
0.1667 |
E |
1 |
4 |
0.2500 |
G |
1 |
3 |
0.3333 |
H |
1 |
2 |
0.5000 |
I |
1 |
4 |
0.2500 |
J |
1 |
3 |
0.3333 |
|
Total |
1.8333 |
As the last step, we apply the damping factor by multiplying the sum
(1.8333) by 0.85, getting 1.5583, then adding (1-0.85), or 0.15 to get
1.7083. After the first iteration, PR(A)=1.7083
Repeating the steps for each of the ten pages in our Web delivers the
following results, listed in order of rank (you can check my work by constructing
a table for each of the other 9 pages just like the table I did for A,
if you've a mind to):
PR(A)= 1.7083
PR(J)= 1.4250
PR(G)= 1.2833
PR(H)= 1.0708
PR(C)= 0.8583
PR(D)= 0.8583
PR(F)= 0.7875
PR(I)= 0.7167
PR(E)= 0.5042
PR(B)= 0.3625
OK, looking it over, the list makes a little sense.Ý A has the
most in-links and is listed as the most important, and B the fewest and
is the least important. But J in second place seems a little odd, since
G has more incoming links (4 versus 3).Ý So, let's run through
the calculations again.
Second Iteration
We use the same sequence of steps, but this time, instead of using an
arbitrary 1 for the PageRank value in calculating each page's links-list,
we'll use the values in the above table; that is, the results of the first
iteration.Ý So, in calculating A's link list for the second iteration,
we generate this table:
Page |
PageRank |
# out-links |
PR/out-links |
B |
0.3625 |
6 |
0.0604 |
E |
0.5042 |
4 |
0.1261 |
G |
1.2833 |
3 |
0.4278 |
H |
1.0708 |
2 |
0.5354 |
I |
0.7167 |
4 |
0.1792 |
J |
1.4250 |
3 |
0.4750 |
|
Total |
1.8039 |
Page PageRank # out-links PR/out-links
B 0.3625 6 0.0604
E 0.5042 4 0.1261
G 1.2833 3 0.4278
H 1.0708 2 0.5354
I 0.7167 4 0.1792
J 1.4250 3 0.4750
Total 1.8039
Adjust that total by the damping factor, and PR(A)=1.6833 after the second
iteration.
Look what happened this second time through.Ý Taking B as a example,
notice that instead of contributing a value of 0.1667 to A's accumulating
PageRank, this time around B added only 0.0604. In other words, after
the first go-round, B's importance diminished from the (arbitrary) starting
value of 1, and so it's referrals diminished in value.Ý Once we
get past the initial, arbitrary PageRank of 1, every page starts to contribute
according to its own "importance".
I won't go through any calculations in detail (you'll have to trust that
I did them right);Ý here are the PageRank values (again in order)
after the second iteration of the algorithm:
PR(A)= 1.6833
PR(G)= 1.5442
PR(J)= 1.4870
PR(H)= 1.3335
PR(F)= 1.0502
PR(C)= 0.7731
PR(D)= 0.7173
PR(I)= 0.5361
PR(E)= 0.3537
PR(B)= 0.2572
There's a little more action this time around.Ý G and J swapped
places and that seems intuitively better, as we mentioned earlier. F went
from 7th to 5thÝ displacing C and D. Let's see if we can figure
that one out.
Look again at the link structure diagram.Ý C, D, and F all have
3 in-links, but notice that one of F's comes from A, while A does not
link to either C or D.Ý A being the most important (highest ranked)
page in this universe, a referral from it carries more weight than referral
from any other page, so F gets a big boost that C and D lack.Ý
Let's see if anything changes the third time through.
Third Iteration
I'll neglect all the detail, and just go to the results. PageRank results
after 3 iterations are:
PR(A)= 1.8020
PR(G)= 1.6515
PR(H)= 1.4019
PR(F)= 0.9920
PR(J)= 0.9496
PR(C)= 0.7774
PR(D)= 0.7389
PR(I)= 0.6328
PR(E)= 0.3004
PR(B)= 0.2260
Only one change: J drops from 3rd to 5th boosting G and H. Why?Ý
Notice that G and H have links from A, while J does not. Again, having
a link from an important page boosts importance. Compare F and H, both
with 3 in-links. F's links come from A, B, and C, while H's are from A,
C, and F. That single difference -- the difference between the weight
of the contribution of B (at the bottom of the list) and F (near the top)
ranks H higher than F.
Although this universe offers no example of this, you can see how a single
in-link from A would count more than links from all three of the bottom
of the list. With a larger universe, that spread would be even wider,
and link quality would play an even larger part than here, where it serves
mainly to decide the relative rankings of pages with the same number of
in-links.
Fourth Iteration: We're Done!
After 4 iterations, the values differ, but the rankings remain the same:
PR(A)= 1.7132
PR(G)= 1.5575
PR(H)= 1.4126
PR(F)= 1.0230
PR(J)= 0.9764
PR(C)= 0.8162
PR(D)= 0.7844
PR(I)= 0.6036
PR(E)= 0.3165
PR(B)= 0.2138
The rankings stay this way through several more iterations, and seem
to be stable, so we'll stop here.ÝThis last list contains what
we'll call the "official" PageRank values of our little 10 page
universe.Ý
Lessons Learned
Simple, right?Ý Not particularly easy, but once you understand
a small example you can expand that understanding to encompass the entire
Web.Ý You'll never really get your mind around it because the complexity
of the interrelationships between a billion or so pages in ungraspable
-- we can, though, understand the calculations, and appreciate how they
determine what pages are "important".
More importantly, for most of us anyway, we can use our new grasp of
the algorithm to make sense of some of the buzz about link popularity.Ý
Let's start with the quote that opened this article: "To push your
site towards the top of search results, it's important to have many web
sites point or link to your site."ÝA true statement, as far
as it goes.ÝIn general, the more in-links you have the higher you'll
be ranked in the search results. But as we've seen, links confer different
amounts of importance, and a high quality link can easily outweigh several
low quality links.
Now this one: "The quality of the link holds more "weight"
than the quantity of links. You will get better results in the search
engines if you have link popularity from sites that have considerable
traffic."Ý The statement starts off dead on, then misses the
point completely.ÝÝ Traffic has nothing to do with link
popularity.Ý The "quality" of a site is nothing more
than its PageRank. All links (of whatever "quality") contribute
to the calculation; "quality" links just contribute more. Traffic
contributes nothing.
Or, consider this: "Free for all sites don't boost your rankings."Ý
Mostly true. Although all incoming links improve your importance, the
FFAs, whose lots of out-links dilute whatever importance they might have,
will contribute very little. Also, to the extent they are considered spam,
Search Engines may keep them out of their index -- a link from a non-indexed
site contributes nothing.
And lastly, how about this one: "Links from .gov and .edu sites
are better than links from your cousin."Ý Maybe.Ý The
.gov or .edu by themselves confer no additional authority, but these sites
might be more likely than your cousin's to have lots of in-links themselves,
and so have a high PageRank.Ý That alone makes a link "better".
The main lesson here seems to be the same old one. If you want the high
search rankings that result from a high PageRank (or whatever other link
popularity algorithm a SE uses) pay attention to your content. Easy to
navigate pages with high quality content will attract links from Webmasters
augmenting their sites by linking to yours, and it is those incoming links,
hopefully from sites that are themselves important, that will boost your
page's importance.
Steve Clason is a freelance writer and the owner of TopDog Strategy,
a consulting firm. He lives in Boulder, Colorado, and can be reached at
steve@clason.org.
Go Back to the
Link Popularity Articles Directory |