some say this tumblr thing is the new of the new. the sh*t to be had. it may be so, but i guess those saying that have more time on their hands than i do.
i’ll have to decide whether to keep this blog or not.
but not today.
listening to mars volta really helps when cooking word salad. loud. de-loused.
i’m a bit concerned about my wording, i tend to use the same expressions over and over. admittedly, i’m currently writing up, aggregating if you will, a dozen or so papers, all of which are pretty much on the same subject… there’s only so much variance to be had without things getting unreadable.
another thing. i use the mac os x widget to post to this blog. easy and convenient alright, but why isn’t there an option to include a blog post headline? i mean, it’s easy enough to scribble the blog body, but the benefits of using the widget sort of diminish if i have to go to the tumblr dashboard and create the heading from there. that’s why you’ll probably not see many headlines here anyway.
still thinking about looking into how to add comments to the posts. no time for that yet though.
oh boy, do i have a hard time concentrating on what i should do! this week, i’ll focus on getting some of my thoughts on paper. i’ll start off with active annotation and adaptive tools intended to aid a human in the annotation process. i’ve got the structure of the chapter pretty well figured out. now i only need to get those darn characters going… how hard can it be?
i’ll let you know by the end of the week.
disregard my previous post on the computation of the jensen-shannon divergence. i got crucial parts of it wrong.
i was looking for an example of how to compute it, but found none. so, i guess that now that i know how it should be done, i should write it down and post it. well. eventually, i might do that.
i cannot do without the fatjar plug-in for eclipse. when running the same piece of software on several machines, it can be a pain in the ass setting things up, not to mention maintaining it. with fatjar, i create a single jar containing all the libraries my software depends on. no need to install several jars…
sweet!
a couple of days ago, the first round of active learning based on the decorate algorithm, activedecorate, called home. what puzzles me, and which i guess i’ll have to explain in the thesis later on, is that while query-by-boosting gave good results, the activedecorate approach did not. at all. strange. for the experiments reported in the literature, activedecorate outperforms query-by-boosting. but then again, those experiments i’ve encountered didn’t include using active learning for selecting documents based on the sequential tagging of tokens (think named entity recognition).
active learning for probability estimation using jensen-shannon divergence
implementing the jensen-shannon divergence (jsd) wasn’t quite as straightforward as i first expected. but then again, are things ever?
essentially, jsd is a way of measuring the distance between n distributions. as such it is nonnegative and zero if all n distributions are equal.
i’m using jsd to decide which document, from a pool of unlabelled documents, is the most informative, given a classifier trained on a small set of annotated documents. the classifier is a committee of boosted decision trees, and the jsd is computed at the token level, then averaged across all tokens in each document so as to provide some sort of measure for the entire document. written in quasi latex, the jsd over a set of n probability distributions p_1, …, p_n with weights w_1, …, w_n is
jsd(p_1, p_2, …, p_n) = H(\sum_{i=1}^{n} w_i p_i) - \sum_{i=1}^{n} w_i H(p_i)
where H(p) = -\sum_{j} p(j) \log_2 p(j), the sum running over the possible labels j, and the weights w_i sum to 1
now, the immediate problem is that, in my case, each token is assigned a probability distribution across the fifteen (15) possible labels, in which many of the probabilities are 0. computing the entropy, H, or more precisely the log2, for a 0 probability yields the result -Infinity. and this is where it gets quick and dirty: the remedy is to remove all zeros, justifying it by thinking of a 0 in a distribution as a non-vote by one of the committee members. a problem introduced by this approach is that the jsd may assume negative values, something which it shouldn’t. since it is a distance, in some sense, we’re talking about, using the absolute value doesn’t seem like too bad a thing.
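for reference, here is a minimal python sketch of the formula. note that it is not the remove-the-zeros-and-take-the-absolute-value fix described above: it swaps in the standard convention that a term with p = 0 contributes 0 to the entropy (0 log 0 is taken to be 0), which keeps the jsd nonnegative as long as the weights sum to 1. the function names, the uniform default weights and the list-of-lists input format are my own assumptions for illustration, not the actual implementation.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd(distributions, weights=None):
    """Jensen-Shannon divergence of n distributions over the same labels.

    distributions: list of n lists, each summing to 1 (assumed input format).
    weights: list of n weights summing to 1; uniform if omitted (assumption).
    """
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n
    k = len(distributions[0])
    # weighted mixture of the n distributions, label by label
    mixture = [sum(w * p[j] for w, p in zip(weights, distributions))
               for j in range(k)]
    # entropy of the mixture minus the weighted mean of the entropies
    return entropy(mixture) - sum(w * entropy(p)
                                  for w, p in zip(weights, distributions))


def document_score(token_distributions):
    """Mean token-level jsd for one document (one entry per token)."""
    scores = [jsd(d) for d in token_distributions]
    return sum(scores) / len(scores)
```

a document is then represented as one list of committee distributions per token, and the document with the highest `document_score` in the pool would be the one sent off for annotation.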
anyway. i have tested it locally, and i’m about to deploy it on one of the servers any minute now. relatively speaking, there isn’t much more to computing the jsd than there is to computing the vote entropy, which i’ve already used in a set of experiments. given the time it takes to complete one round in the active learning loop with vote entropy as the selection metric, i’ll have two rounds of jsd-based results by tomorrow.
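for comparison, a rough sketch of vote entropy as it is commonly defined: the entropy of the label votes cast by the committee members for a single token. the function name and the label strings in the example are hypothetical, and this is not necessarily the exact variant used in the experiments.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy (in bits) of the committee's label votes for one token.

    votes: one predicted label per committee member (assumed input format).
    """
    k = len(votes)
    counts = Counter(votes)
    # share of the committee voting for each label that got at least one vote
    return -sum((c / k) * math.log2(c / k) for c in counts.values())
```

a unanimous committee, say `["PER", "PER", "PER"]`, gives 0, while an even split between two labels gives 1 bit. unlike the jsd, it only looks at each member’s top label, not at the full distribution.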
jensen-shannon divergence. there’s one particular thing i can’t get my head around now that i’m about to implement the jensen-shannon divergence as a means to select new data to annotate in my query-by-boosting set-up. ah well. i just need to wait for someone who speaks math to show up at work, then i’ll be good to go.
here are some of the things i digested last night, when i really should’ve been asleep: