document selection and the jensen-shannon divergence
implementing the jensen-shannon divergence (jsd) wasn’t quite as straightforward as i first expected. but then again, are things ever?
essentially, jsd is a way of measuring the distance between n distributions. as such it is nonnegative and zero if all n distributions are equal.
i’m using jsd to decide which document, from a pool of unlabelled documents, is the most informative, given a classifier trained on a small set of annotated documents. the classifier is a committee of boosted decision trees, and the jsd is computed at the token level, then averaged across all tokens in each document so as to provide some sort of measure for the entire document. written in quasi latex, the jsd over a set of n probability distributions p_1, …, p_n with weights w_1, …, w_n is
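jsd(p_1, p_2, …, p_n) = H(\sum_{i=1}^n w_i p_i) - \sum_{i=1}^n w_i H(p_i)
where H(p) = -\sum_k p(k) \log_2 p(k) is the entropy of p, summed over the possible labels k, and the weights w_i sum to 1.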
now, the immediate problem is that, in my case, each token is assigned a probability distribution across the fifteen (15) possible labels, and many of those probabilities are 0. computing the entropy, H, or more precisely the log2, of a 0 probability yields -Infinity. and this is where it gets quick and dirty - the remedy is to remove all zeros, justified by thinking of a 0 in a distribution as a non-vote by one of the committee members. a problem introduced by this approach is that the jsd may assume negative values, which it shouldn’t. since it is a distance, in some sense, that we’re talking about, using the absolute value doesn’t seem like too bad a thing.
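to make the zero problem concrete, here is a minimal sketch of the jsd in python. it assumes the common convention 0 * log2(0) = 0, meaning zero terms are skipped inside the entropy sum only. that is not necessarily the same thing as the zero removal described above, and the function names and the uniform default weights are made up for illustration. with this convention, and weights that sum to 1, the result should not go negative.

```python
import math


def entropy(p):
    """shannon entropy in bits of one distribution (a list of probabilities).

    assumption: terms with probability 0 are skipped, i.e. 0 * log2(0) is
    treated as 0, which avoids the -Infinity from log2(0).
    """
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd(distributions, weights=None):
    """jensen-shannon divergence of n distributions over the same labels.

    assumption: if no weights are given, every distribution gets weight 1/n.
    """
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n
    num_labels = len(distributions[0])
    # the weighted mixture sum_i w_i * p_i, computed label by label
    mixture = [
        sum(w * p[k] for w, p in zip(weights, distributions))
        for k in range(num_labels)
    ]
    # entropy of the mixture minus the weighted mean of the entropies
    weighted_entropies = sum(w * entropy(p) for w, p in zip(weights, distributions))
    return entropy(mixture) - weighted_entropies
```

two committee members that fully disagree, as in jsd([[1.0, 0.0], [0.0, 1.0]]), give 1 bit, while identical distributions give 0.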
anyway. i have tested it locally, and i’m about to deploy it on one of the servers any minute now. relatively speaking, there is not much more to computing the jsd than there is to computing the vote entropy, which i’ve already used in a set of experiments - given the time it takes to complete one round in the active learning loop with vote entropy as the selection metric, i’ll have two rounds of jsd-based results by tomorrow.
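for reference, the selection step described earlier (token-level jsd, averaged per document, then picking the document with the highest score) could be sketched roughly like this. the data layout is an assumption: a document is a list of tokens, and each token is a list with one label distribution per committee member. the function names are hypothetical, the committee members are weighted equally, and zeros are handled with the 0 * log2(0) = 0 convention rather than the zero removal described above.

```python
import math


def entropy(p):
    # shannon entropy in bits; zero probabilities are skipped (assumption)
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd(distributions):
    # equally weighted jsd over the committee members' distributions
    n = len(distributions)
    num_labels = len(distributions[0])
    mixture = [sum(p[k] for p in distributions) / n for k in range(num_labels)]
    return entropy(mixture) - sum(entropy(p) for p in distributions) / n


def document_score(doc):
    # mean token-level jsd; assumes the document has at least one token
    return sum(jsd(token) for token in doc) / len(doc)


def most_informative(pool):
    # index of the unlabelled document the committee disagrees on the most
    return max(range(len(pool)), key=lambda i: document_score(pool[i]))


# toy pool with two labels and a committee of two members
agree = [[[1.0, 0.0], [1.0, 0.0]]]
disagree = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
pool = [agree, disagree]
print(most_informative(pool))
```

averaging over tokens keeps long documents from winning on length alone, which seems to be the point of the per-document mean.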