
## Things about C++ I didn’t know – implicit conversion while binding to a const reference

An object of type A can be bound to a function parameter of type const B& if B has an implicit constructor taking an A. The tricky part is that a temporary B is created and passed to the function as a const reference. If the function returns that argument, or any member of it, as a const reference, the returned reference will refer to an already-destroyed object once the call's full expression ends.

std::pair is a perfect example class for this, because it has a wild converting ctor:

```cpp
template <class _U1, class _U2>
pair(const pair<_U1, _U2>& __p) : first(__p.first), second(__p.second) {}
```

Here is a complete example to show this.

## An algorithm to find Minimum Mean Weighted Cycle

There is a wonderful algorithm to find the minimum mean weight cycle (MMWC) of an edge-weighted directed graph: Karp's algorithm, which appears as a problem in CLRS. First calculate a 2D array dis[][] such that dis[k][v] is the minimum total weight over all k-edge paths from a fake source node to node v, where the fake source node is assumed to have a zero-weight edge to every other node. If no such path exists, the length is taken to be infinite.
Then the mean weight of the MMWC is
Min { Max { (dis[n][v] - dis[k][v]) / (n - k) | 0 <= k < n } | v }.  (*)
The proof of the algorithm is as simple as the algorithm itself.
Assume first that the MMWC of the graph has mean weight 0. Let sdis[v] be the length of the shortest path from the source node to node v.
1. Max { (dis[n][v] - dis[k][v]) / (n - k) | k } must be non-negative: if dis[n][v] is not infinite, the n-edge path must contain a cycle, and removing that cycle yields a path of length dis[k][v] for some specific k. The total weight of the removed cycle is non-negative (its mean weight is at least the MMWC's, which is 0), so dis[n][v] - dis[k][v] >= 0.
2. Take a node v on an MMWC whose shortest-path distance from the source is smallest among all nodes on that cycle. For any node u on the cycle,
sdis[u] = sdis[v] + cycle[v][u], where cycle[v][u] is the length of the cycle path from v to u. (1)
This follows because sdis[u] <= sdis[v] + cycle[v][u], sdis[v] <= sdis[u] + cycle[u][v], and cycle[v][u] + cycle[u][v] = 0, the total weight of the cycle.
We can extend the shortest path from the source node to v along the cycle until we get an n-edge path ending at some cycle node u.
If the shortest path from the source node to u is a k-edge path, then by formula (1) the extended n-edge path also has length sdis[u], so dis[n][u] = dis[k][u] and formula (*) attains 0 at u.
3. If the mean weight of the MMWC is t* instead of 0, subtract t* from every edge's weight: each quotient in (*) drops by exactly t*, and the proof above applies unchanged.
It seems non-trivial, however, to recover such an MMWC itself rather than just its mean weight.

## Link-based spam detection

There are various ranks and measures based on the link structure of the web for detecting good or bad content. Some of the interesting ones are described below.

• Truncated PageRank
In a general form, Google's PageRank is computed as PR(P) = Sum{ damping(t)/N * P^t * I | t = 0,1,2,... }, where P is the normalized adjacency matrix of the web graph, patched so that every dangling node (a node with out-degree 0) gets edges to all nodes of the web graph. I is the all-ones vector and damping(t) is a weighting function; for Google's computation, damping(t) = (1-a)a^t, where 1-a is the probability of restarting in the famous random-surfer model and a = 0.85 is the suggested value. We can say the PageRank of a page is collected from its supporters, the pages that link to it directly or indirectly. It is observed that spam pages collect their PageRank from close supporters (spammers usually set up thousands of supporting pages), which sit within a short distance. The motivation of Truncated PageRank is to make close supporters' PageRank contribution small. A simple means is to set damping(t) = 0 for small values of t, say t < T. If a page's PageRank and Truncated PageRank have a big gap, it is likely to be spam.
• Estimation of Supporters
A supporter at distance d of a target page is a page whose shortest path to the target has length d. This idea rests on the same intuition as Truncated PageRank: spam pages have a different supporter distribution. The numeric feature used is the so-called bottleneck number,
Bk(x) = Min{ |supporter(x, j)| / |supporter(x, j-1)| | 0 < j <= k },
which is the smallest growth rate of the supporter counts over distances 1 to k. The real challenge is the computation: it is infeasible to calculate exact supporter counts within a given distance for every page, so probabilistic counting is employed. A basic idea is to generate a bitmap for each node, each bit of which is set with probability e, and run d rounds of Bellman-Ford-style relaxation, where relaxing an edge a -> b means bitmap(b) |= bitmap(a). Finally, the supporter counts are estimated from the bitmaps. The basic algorithm may not work well on its own, so refinements are used, for example running Bellman-Ford multiple times with different values of e.
• Graph Clustering
The intuition is that closely connected pages (or hosts) are similar.
One method is to cluster pages (hosts) and use the labeled pages in each cluster to predict the whole cluster: if the average spam probability of a cluster is larger than a predefined threshold, the whole cluster is predicted as spam; if it is smaller than another predefined threshold, the whole cluster is predicted as non-spam. Another method is so-called Stacked Graphical Learning: given a learning algorithm that predicts each page's probability of being spam, calculate the average predicted spam probability of each page's neighbors (via in-links, out-links, or both), and feed that value as a new input feature to a second, supposedly more powerful, learner.
• TrustRank
Instead of detecting spam, TrustRank is used to recognize non-spam. If the surfer restarts at a random page drawn from some predefined probability distribution A in the random-surfer model, the PageRank formula described previously generalizes to PR(P, A) = Sum{ damping(t) * P^t * A | t = 0,1,2,... }, where A is normalized to sum to 1 (taking A = I/N recovers the original formula). Making the entries of A corresponding to some trusted pages non-zero and all others zero yields the so-called TrustRank. A page that is not linked, directly or indirectly, by trusted pages will have a low TrustRank. Picking and labeling seeds is very costly, so some tricks are used to select seeds that cover more pages, for example selecting pages with large inverse PageRank: a large inverse PageRank indicates a page that links to many pages directly or indirectly.
• Anti-TrustRank
The idea is similar to TrustRank, but the restart distribution A marks known spam pages instead, which amounts to propagating spam probability through in-links.
• SimRank
A general measure of similarity between objects based on their relations. It has a recursive definition
S(a,b) = C / (|Neib(a)| * |Neib(b)|) * Sum{ S(c,d) | c in Neib(a), d in Neib(b) },
where Neib(a) is the neighbor set of object a; the neighbor relation can be defined arbitrarily, e.g. a web page's out-linked pages. C is the damping factor, and the base case is S(a,a) = 1. The underlying intuition is that if two objects' neighbors are similar, then the two objects are similar too. In terms of a random-surfer model, S(a,b) reflects how soon two surfers are expected to meet if at each step they both move to a random neighbor of their current objects. More precisely, S(a,b) = Sum{ Prob(P(l)) * C^l | l = 0,1,2,... }, where P(l) ranges over the length-l paths of the product graph G^2 that end at some node [x,x]. Due to the space and time complexity of the exact computation, there are randomized algorithms to approximate SimRank.

## R tips

Suppose file.dat is

A,B,C
1,2,3
4,5,6
7,8,9

After data <- read.csv("file.dat"), data$A refers to the column named "A", and so on.
• data1 <- edit(data) : edit your data in a UI. It's pretty cool.
• par(mfrow=c(3, 2)) : set up a 3x2 (rows x columns) grid of panels; succeeding plotting calls fill up the grid.
• plot(data$A) : plot the points (1, data$A[1]), (2, data$A[2]), ...
• plot(data$A, data$B) : plot the points (data$A[1], data$B[1]), ...
• hist(data$A) : a histogram of data$A

## Min hash outline

Min hash is an implementation of Locality-Sensitive Hashing (LSH). We have a set of binary vectors from a high-dimensional feature space, and the similarity between two vectors is their Jaccard similarity: Jaccard(V1, V2) = |V1 ∩ V2| / |V1 ∪ V2|, treating each vector as the set of features where it is 1. We want to find pairs of vectors of high similarity.
To avoid the time-consuming pairwise similarity calculation, a heuristic method is to generate a group of hash values for each vector and only calculate similarities for pairs of vectors that have at least one hash value in common. Min hash is such a hash function: Hmin(V) = min { i | V[P[i]] = 1 }, where P[] is a random permutation of the features. The min hash function has the nice property that
Prob{ Hmin(V1) = Hmin(V2) } = Jaccard(V1, V2),
which guarantees that we won't miss many pairs.

## Getting past the firewall with Tor on Ubuntu

1. Download a Tor Browser from the Tor project. (Which itself requires getting past the firewall...)
2. Unpack it and launch the Tor browser with the start-tor-browser script.
3. In the control panel that appears, under Setup Relaying -> Network, check "My ISP blocks connections ...".
4. Send an email with the subject "get bridges" to bridges@torproject.org to obtain three bridge addresses and ports, then add the bridges through the control panel.
5. Once the connection succeeds, the Firefox browser bundled with Tor Browser will open.
6. Other browsers can be used as well; just configure the proxy server localhost:8118.
7. It is best to use an automatic proxy configuration script like the following.
```javascript
function FindProxyForURL(url, host)
{
    if (
        dnsDomainIs(host, "appspot.com") ||
        dnsDomainIs(host, "blogspot.com") ||
        dnsDomainIs(host, "wordpress.com") ||
        dnsDomainIs(host, "torproject.org") ||
        shExpMatch(url, "*wordpress.com*") ||
        shExpMatch(url, "*torproject.org*")
    )
        return "PROXY 127.0.0.1:8118";
    else
        return "DIRECT";
}
```

## Pearls from Programming Pearls

• The "Rule of 72": if money is invested at r% annual interest for y years and r*y = 72, the investment roughly doubles. The derivation is simple: expand (1 + r/100)^y binomially and approximate with the first three terms. A more useful phrasing: the investment doubles after about 72/r years.
• Little's law: for any system with an input and an output, "the average number of items in the system equals the average rate at which items leave the system times the average time each item spends in the system". If arrivals and departures balance, the departure rate equals the arrival rate.
• Rotating the array ab into ba: ab -> a'b -> a'b' -> ba, where a' denotes the reversal of a. This algorithm is far more cache-friendly than the method that decomposes the rotation into gcd-many cyclic shifts.
• An algorithm to sample m elements uniformly at random from the n elements {0, 1, ..., n-1}:

```cpp
spot = m;
for (int i = 0; i < n; i++)
    if (rand(0, n - i - 1) < spot)  // rand(a, b): uniform integer in [a, b]
        spot--, printf("%d\n", i);
```

In effect, element i is dropped uniformly at random into one of the n - i remaining positions and is selected if that position falls within the remaining spots; positions already taken are removed each round.

• Another algorithm, which needs to generate only m random numbers:

```cpp
int* p = new int[n];
for (int i = 0; i < n; i++) p[i] = i;
for (int i = 0; i < m; i++) swap(p[i], p[rand(i, n - 1)]);  // rand(a, b): uniform in [a, b]
for (int i = 0; i < m; i++) printf("%d\n", p[i]);
```

In essence this runs the classic shuffle to generate a random permutation and takes its first m elements; only the first m positions actually need to be shuffled.

• If the sampling algorithm's time and space complexity must be independent of n, there is the following algorithm:

```python
selected = {}
for i in range(n - m, n):
    t = random.randint(0, i)
    if t in selected:
        selected[i] = 1
    else:
        selected[t] = 1
```

Its correctness can be proved by induction: the uniformity of choosing m-1 elements from the first n-1 elements carries over to the uniformity of choosing m elements from n.

• Augment a singly linked list with an extra pointer p[i] that points to the 2*i-th element; then the n-th element can be accessed in logarithmic time.
• Memory blocks allocated with the new operator carry extra overhead; for example, for structs whose sizeof is at most 12 bytes, new actually always allocates 48 bytes.