The PageRank algorithm described in Python

The existing resources which explain the PageRank algorithm using Python code involve many dependencies, or un-necessarily cloud the core ideas with matrix manipulation and graph theory. This page presents a description of the algorithm based on a fairly literal translation of the mathematical formulæ into Python source code.

def pagerank(graph, damping=0.85, epsilon=1.0e-18):
    inlink_map = {}
    outlink_counts = {}
    
    def new_node(node):
        if node not in inlink_map: inlink_map[node] = set()
        if node not in outlink_counts: outlink_counts[node] = 0
    
    for tail_node, head_node in graph:
        new_node(tail_node)
        new_node(head_node)
        if tail_node == head_node: continue
        
        if tail_node not in inlink_map[head_node]:
            inlink_map[head_node].add(tail_node)
            outlink_counts[tail_node] += 1
    
    all_nodes = set(inlink_map.keys())
    for node, outlink_count in outlink_counts.items():
        if outlink_count == 0:
            outlink_counts[node] = len(all_nodes)
            for l_node in all_nodes: inlink_map[l_node].add(node)
    
    initial_value = 1 / len(all_nodes)
    ranks = {node: initial_value for node in all_nodes}
    
    new_ranks = {}
    delta = 1.0
    n_iterations = 0
    while delta > epsilon:
        for node, inlinks in inlink_map.items():
            new_ranks[node] = ((1 - damping) / len(all_nodes)) \
                + (damping \
                   * sum(ranks[inlink] \
                         / outlink_counts[inlink] for inlink in inlinks))

        delta = sum(abs(new_ranks[node] - ranks[node]) \
                    for node in new_ranks.keys())
        ranks, new_ranks = new_ranks, ranks
        n_iterations += 1
    
    return ranks, n_iterations

The PageRank algorithm views the web as a directed graph with the pages being nodes and hyperlinks being connections between those nodes. It can be used to rank the nodes of any kind of graph (including undirected ones) by importance. This description uses graph terminology and only shows how it is done for a directed graph such as the webgraph.

While it is accurate to say that PageRank will tell you the importance of each page, a more accurate definition is that PageRank assigns a probability to each page. Specifically, the PageRank value of a page is the probability, between 0 and 1, that someone surfing by clicking links at random will end up on that page. It also models boredom through a ‘damping factor’: after so many clicks, the simulated surfer will get bored and stop browsing.

Because finding the PageRank probability value for an individual node in a graph will depend on the PageRank value of all the nodes which connect to it, and those nodes might connect cyclically to the node whose ranking we want, PageRank values are usually assigned using a converging iterative method.

The Python procedure takes its argument in a similar format to tsort: the graph argument is an iterable of two-element sequences. It also accepts optional arguments for the damping factor (damping) and the amount of convergence required (epsilon). epsilon is the smallest cumulative change in the PageRank of all nodes which will be accepted as sufficient convergence.