Archive for May, 2011

There are kinds of ranks and measures based on the link structure of the web to detect good or bad content. Some of interesting ones are described as follows.

  • Truncated PageRank
    In a general form, Google’s pagerank is to calculate PR(P)=Sum{ damping(t)/N * P^t * I | t=0,1,2,…}, where P is the normalized adjacent matrix of the web graph and for any dangling node (0 out-degree node) v, edges from v to all nodes of the web graph are added to P. I is the identity vector. damping(t) is a function, and for Google’s computation, damping(t)=(1-a)a^t, where 1-a is considered as the probability of breaking out in the famous random surfer model and a=0.85 is suggested. We can say the PageRank of a page is collected from its supporters, which have links to it directly or indirectly. It is observed that spam pages collect their PageRanks from close supporters (spammers usually set up thousands of supporting pages), which are within a short distances. The motivation of Truncated PageRank is to make close supporters’ PagaRank contribution small. A simple means is to set damping(t)=0 for small value t, say t < T. If a page’s PageRank and Truncated PagaRank have a big gap, It is likely to be spam.
  • Estimation of Supporters
    A supporter of distance D is a page which has a shortest path of length D to the target page.This idea is based on the same intuition as Truncated PageRank that spam pages have a different supporter distribution. The numeric feature to be used is the so-called bottleneck number,
    Bk(x)= Min{ | supporter(x,j)|/|supporter(x,j-1)| , 0<j<=k} which indicates the smallest increasing rate of supporter numbers from 1 to k. The real challenge comes from the computation and it’s impossible to calculate exact supporter numbers within a given distance for each page. Some probabilistic counting method is employed here. A basic idea is to generate a bitmap for each node each bit of which is set with probability e, and run D Bellman-Ford relaxations, where the relaxation for each a–>b is to make bitmap(b)|=bitmap(a). Finally, we can estimate the supporter numbers by the bitmaps. The basic algorithm may not work well, and some refinement is used, for example, run multiple times of Bellma-Ford using different e.
  • Graph Clustering
    The intuition is that closely connected pages (hosts) are similar.
    One method is to cluster pages(hosts) and use the labeled pages in each cluster to predict the whole cluster. For example, if the average spam probability of the cluster is larger than a predefined threshold, then the whole cluster is predicted as spam. If smaller than another predefined threshold, then the whole cluster is predicted as non-spam.Another method is so-called Stacked Graphical learning. Say we have a learning algorithm to predict each page’s probability of spam, then we calculate the average spam probability of each page’s neighbors( in-link, out-link or both), and use the value as a new input feature to construct another supposed more powerful learning algorithm.
  • TrustRank
    Instead of detect spam, TrustRank is used to recognize non-spam. If the surfer breaks out to a random page according to some pre-defined probability distribution A in the random surfer model, the described previously PageRank’s formula can be generalized as PR(P,A)=Sum{ damping(t)/N * P^t * A | t=0,1,2,…}. Making entries of Acorresponding to some trusted pages non-zeroes and others zeroes, we would get so-called TrustedRank. If a page is not linked by some trusted pages directly or indirectly, it would have a low TrustedRank.Picking up and labeling seeds are very costly. Some tricks can be used to select more efficient seeds to cover more pages. For example, select pages with large inverse PageRank. A large inverse PageRank indicates a page links to many pages directly or indirectly.
  • SpamRank (BadRank)
    The idea is similar to the TrustRank, but the restart distribution A indicates probability of spam pages. It looks like propagating spam probability through in-links.
  • SimRank
    A general measure of similarity between objects using relations. It has a recursive definition
    S(a,b) = C/| Neib(a)|/|Neib(b)| * Sum{ S( c,d ) | c in Neib(a) & d in Neib(b) }
    where Neib(a) is the neighbor set of object a. The neighbor relation can be defined arbitrarily, e.g. web page’s out-linked pages. C is the damping factor. The base case is S(a,a)=1.The underlying intuition is that if two objects’ neighbors are similar, then the two objects are similar too. In terms of a random surfer model, S(a,b) can be considered as the expected hops two surfers will need to meet, if at each step them both move to one of the neighbors of the current objects. More precisely speaking, S(a,b)= Sum { Prob(P(l)) * C ^ l | l=0,1,2,…} where P(l) is a path of the graph G^2 ending with some node [x,x]. Due to the computational space and time complexities, there are some randomized algorithms to approximate SimRank.

Read Full Post »