<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>smudoscrapbook</title><link>http://smudo.tumblr.com/</link><description>a thesis’ journey from conception to oblivion…

i bookmark thesis-related stuff at del.icio.us</description><generator>Tumblr (smudo)</generator><item><title>some say this tumblr thing is the new of the new. the sh*t to be had. it may be so, but i guess those...</title><description>&lt;p&gt;some say this tumblr thing is the new of the new. the sh*t to be had. it may be so, but i guess those saying that have more time on their hands than i do.&lt;/p&gt;

&lt;p&gt;i’ll have to decide whether to keep this blog or not.&lt;/p&gt;

&lt;p&gt;but not today.&lt;/p&gt;</description><link>http://smudo.tumblr.com/post/27091706</link><guid>http://smudo.tumblr.com/post/27091706</guid><pubDate>Sat, 23 Feb 2008 12:02:53 -0500</pubDate></item><item><title>listening to mars volta really helps when cooking word salad. loud. de-loused.

i’m a bit...</title><description>&lt;p&gt;listening to mars volta really helps when cooking word salad. loud. de-loused.&lt;/p&gt;

&lt;p&gt;i’m a bit concerned about my wording, i tend to use the same expressions over and over. admittedly, i’m currently writing up, aggregating if you will, a dozen or so papers all of which are pretty much on the same subject… there’s only so much variance to be had without things getting unreadable.&lt;/p&gt;</description><link>http://smudo.tumblr.com/post/25033736</link><guid>http://smudo.tumblr.com/post/25033736</guid><pubDate>Tue, 29 Jan 2008 18:25:43 -0500</pubDate></item><item><title>another thing. i use the mac os x widget to post to this blog. easy and convenient alright, but why...</title><description>&lt;p&gt;another thing. i use the mac os x widget to post to this blog. easy and convenient alright, but why isn’t there an option to include a blog post headline? i mean, it’s easy enough to scribble the blog body, but the benefits of using the widget sort of diminish if i have to go to the tumblr dashboard and create the heading from there. that’s why you’ll probably not see many of those here anyway.&lt;/p&gt;

&lt;p&gt;still thinking about looking into how to add comments to the posts. no time for that yet though.&lt;/p&gt;</description><link>http://smudo.tumblr.com/post/24858371</link><guid>http://smudo.tumblr.com/post/24858371</guid><pubDate>Mon, 28 Jan 2008 04:02:57 -0500</pubDate></item><item><title>oh boy, do i have a hard time concentrating on what i should do? this week, i’ll focus on...</title><description>&lt;p&gt;oh boy, do i have a hard time concentrating on what i should do? this week, i’ll focus on getting some of my thoughts on paper. i’ll start off with active annotation and adaptive tools intended to aid a human in the annotation process. i’ve got the structure of the chapter pretty well figured out. now i only need to get those darn characters going… how hard can it be?&lt;/p&gt;

&lt;p&gt;i’ll let you know by the end of the week.&lt;/p&gt;</description><link>http://smudo.tumblr.com/post/24846899</link><guid>http://smudo.tumblr.com/post/24846899</guid><pubDate>Mon, 28 Jan 2008 01:04:01 -0500</pubDate></item><item><title>disregard my previous post on the computation of the jensen-shannon divergence. i got crucial parts...</title><description>&lt;p&gt;disregard my previous post on the computation of the jensen-shannon divergence. i got crucial parts of it wrong.&lt;/p&gt;

&lt;p&gt;i was looking for an example of how to compute it, but found none. so, i guess that now that i know how it should be done, i should write it down and post it. well. eventually, i might do that.&lt;/p&gt;</description><link>http://smudo.tumblr.com/post/24630853</link><guid>http://smudo.tumblr.com/post/24630853</guid><pubDate>Fri, 25 Jan 2008 05:35:17 -0500</pubDate></item><item><title>fatjar and eclipse</title><description>&lt;p&gt;i cannot do without the &lt;a href="http://fjep.sourceforge.net/"&gt;fatjar&lt;/a&gt; plug-in for eclipse. when running the same piece of software on several machines, it can be a pain in the ass setting things up, not to mention maintaining it. with fatjar, i create a single jar containing all the libraries my software depends on. no need to install several jars…&lt;/p&gt;
&lt;p&gt;sweet!&lt;/p&gt;</description><link>http://smudo.tumblr.com/post/24538011</link><guid>http://smudo.tumblr.com/post/24538011</guid><pubDate>Thu, 24 Jan 2008 05:52:30 -0500</pubDate></item><item><title>active decorate. bad choice?</title><description>&lt;p&gt;a couple of days ago, the first round of active learning based on the decorate algorithm, activedecorate, called home. what puzzles me, and which i guess i’ll have to explain in the thesis later on, is that while query-by-boosting gave good results, the activedecorate approach did not. at all. strange. in the experiments reported in the literature, activedecorate outperforms query-by-boosting. but then again, the experiments i’ve encountered didn’t include using active learning for selecting documents based on the sequential tagging of tokens (think named entity recognition).&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.cs.utexas.edu/users/ml/publication/paper.cgi?paper=ape-ecml-05.ps.gz"&gt;active learning for probability estimation using jensen-shannon divergence&lt;/a&gt; &lt;/p&gt;
&lt;p&gt;&lt;a href="http://Creating%20Diverse%20Ensemble%20Classifiers%20to%20Reduce%20Supervision"&gt;creating diverse ensemble classifiers to reduce supervision&lt;/a&gt;&lt;/p&gt;</description><link>http://smudo.tumblr.com/post/24537773</link><guid>http://smudo.tumblr.com/post/24537773</guid><pubDate>Thu, 24 Jan 2008 05:48:20 -0500</pubDate></item><item><title>document selection and the jensen-shannon divergence</title><description>&lt;p&gt;implementing the jensen-shannon divergence (jsd) wasn’t quite as straightforward as i first expected. but then again, are things ever?&lt;br/&gt;&lt;br/&gt;essentially, jsd is a way of measuring the distance between n distributions. as such it is nonnegative, and zero if all n distributions are equal. &lt;br/&gt;&lt;br/&gt;i’m using jsd to decide which document, from a pool of unlabelled documents, is the most informative, given a classifier trained on a small set of annotated documents. the classifier is a committee of boosted decision trees, and the jsd is computed at the token level, then averaged across all tokens in each document so as to provide some sort of measure for the entire document. written in quasi latex, the jsd over a set of n probability distributions p_1, …, p_n with weights w_1, …, w_n is&lt;/p&gt;
&lt;p&gt;jsd(p_1, p_2, …, p_n) = H(\sum_{i=1}^{n} w_i p_i) - \sum_{i=1}^{n} w_i H(p_i)&lt;/p&gt;
&lt;p&gt;where H(p) = - \sum_x p(x) \log_2 p(x) is the shannon entropy of p, summed over all labels x&lt;/p&gt;
&lt;p&gt;now, the immediate problem is that, in my case, each token is assigned a probability distribution across the fifteen (15) possible labels in which many of the probabilities are 0. computing the entropy, H, or more precisely the log2, for a 0 probability yields the result -Infinity. and this is where it gets quick and dirty - the remedy to this is to remove all zeros, justifying it by thinking of a 0 in a distribution as a non-vote by one of the committee members. a problem introduced by this approach is that the jsd may assume negative values, something which it shouldn’t. since it is a distance, in some sense, we’re talking about, using the absolute value doesn’t seem like too bad a thing.&lt;/p&gt;
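&lt;p&gt;a minimal sketch of the formula above, in python purely for convenience. the names entropy, jsd and document_score are hypothetical, and the uniform default weights are an assumption. zero probabilities are skipped in the entropy sum, which is the usual convention that 0 log2(0) counts as 0. when that convention is applied to the weighted mixture as well, and the weights sum to one, the result cannot go negative (up to floating point rounding), since entropy is concave.&lt;/p&gt;

```python
import math


def entropy(dist):
    """Shannon entropy in bits; zero-probability labels contribute nothing."""
    return sum(-p * math.log2(p) for p in dist if p > 0)


def jsd(dists, weights=None):
    """Jensen-Shannon divergence of n distributions over the same label set."""
    n = len(dists)
    if weights is None:
        # Assumption: every committee member gets the same weight.
        weights = [1.0 / n] * n
    # Weighted mixture of the committee members' distributions, label by label.
    mixture = [
        sum(w * d[k] for w, d in zip(weights, dists))
        for k in range(len(dists[0]))
    ]
    # Entropy of the mixture minus the weighted entropies of the members.
    return entropy(mixture) - sum(w * entropy(d) for w, d in zip(weights, dists))


def document_score(token_dists):
    """Average the token-level jsd over all tokens in a document."""
    return sum(jsd(d) for d in token_dists) / len(token_dists)
```

&lt;p&gt;for example, jsd([[1.0, 0.0], [0.0, 1.0]]) gives 1.0 (two committee members in total disagreement over two labels), while jsd([[1.0, 0.0], [1.0, 0.0]]) gives 0.0.&lt;/p&gt;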
&lt;p&gt;anyway. i have tested it locally, and i’m about to deploy it on one of the servers any minute now. relatively speaking, there is not much more to computing the jsd than to computing the vote entropy, which i’ve already used in a set of experiments - given the time it takes to complete one round in the active learning loop with vote entropy as the selection metric, i’ll have two rounds of jsd-based results by tomorrow.&lt;/p&gt;</description><link>http://smudo.tumblr.com/post/24536394</link><guid>http://smudo.tumblr.com/post/24536394</guid><pubDate>Thu, 24 Jan 2008 05:24:00 -0500</pubDate></item><item><title>jensen-shannon divergence</title><description>&lt;p&gt;jensen-shannon divergence. there’s one particular thing i can’t get my head around now that i’m about to implement the jensen-shannon divergence as a means to select new data to annotate in my query-by-boosting set-up. ah well. i just need to wait for someone who speaks math to show up at work, then i’ll be good to go.&lt;/p&gt;
&lt;p&gt;here are some of the things i digested last night, when i really should’ve been asleep:&lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.google.com/url?sa=t&amp;ct=res&amp;cd=1&amp;url=http%3A%2F%2Fwww.math.ku.dk%2F~topsoe%2FISIT2004JSD.pdf&amp;ei=0WGWR9LeKYX2QdnrjA0&amp;usg=AFQjCNFTaKob5LURUh3-e1Uk-JYJMW4cnw&amp;sig2=jLKXE7ReQuCDjI3VWnjYGw"&gt;jensen-shannon divergence and hilbert space embedding&lt;/a&gt; &lt;/p&gt;
&lt;p&gt;&lt;a href="http://www.google.com/url?sa=t&amp;ct=res&amp;cd=1&amp;url=http%3A%2F%2Fwww.mai.liu.se%2F~tikos%2Fkurser%2FTAMS23%2Flindiv.pdf&amp;ei=2F6WR5OoOqbowwHG7dSwDQ&amp;usg=AFQjCNEZEJ5Bmg0XY6lACab9yEl7MxpNQw&amp;sig2=HhLxBXd6xdb1Py8aqY5aKQ"&gt;divergence measures based on the shannon entropy&lt;/a&gt; &lt;/p&gt;</description><link>http://smudo.tumblr.com/post/24419189</link><guid>http://smudo.tumblr.com/post/24419189</guid><pubDate>Wed, 23 Jan 2008 00:35:00 -0500</pubDate></item><item><title>night time. deploy time.</title><description>spent too much time on the couch coding last night. damn those bugs. with a set of farily fresh eyes, i see that the bug was not a bug, only a stupidity not worth getting upset about.&lt;br/&gt;&lt;br/&gt;note to self: always. ALWAYS. test. before. deploying.&lt;br/&gt;&lt;br/&gt;so there.</description><link>http://smudo.tumblr.com/post/24419076</link><guid>http://smudo.tumblr.com/post/24419076</guid><pubDate>Wed, 23 Jan 2008 00:33:00 -0500</pubDate></item><item><title>annocaked</title><description>i’ve had that final piece of cake which remained in the fridge from the past weekend’s festivities. another cup of coffee on that and i’m ready to dive deeper into that pile of papers on the annotation of corpora. but first, is there a way of turning on commenting on this tumblr thang?&lt;br/&gt;&lt;br/&gt;now, how would i expect to get an answer to that?&lt;br/&gt;&lt;br/&gt;smartass.&lt;br/&gt;&lt;br/&gt;one year older today.</description><link>http://smudo.tumblr.com/post/24350177</link><guid>http://smudo.tumblr.com/post/24350177</guid><pubDate>Tue, 22 Jan 2008 03:36:00 -0500</pubDate></item><item><title>choices</title><description>note to self: defending a thesis is to defend the choices made. 
if no choices are made, nothing new is contributed.</description><link>http://smudo.tumblr.com/post/24271425</link><guid>http://smudo.tumblr.com/post/24271425</guid><pubDate>Mon, 21 Jan 2008 03:18:00 -0500</pubDate><category>thesis-related</category></item><item><title>so there...</title><description>&lt;p&gt;guess i need someplace like this to put textual thoughts up, not just images, as i do over at &lt;a href="http://www.smudo.org" title="smudo.org" target="_blank"&gt;smudo.org&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;i’m on vacation. but still at work. need time to work on my thesis. today’s topic is pre-tagging and the creation of text corpora.&lt;/p&gt;
&lt;p&gt;the general idea behind pre-tagging is to have a classifier, possibly trained on texts from a different domain, tag a new text prior to a human annotator dealing with the text, in order to create a high quality marked-up corpus. &lt;/p&gt;
&lt;p&gt;now, there exist two seemingly valid opinions regarding this: use pre-tagging, and do not use pre-tagging. &lt;/p&gt;
&lt;p&gt;the people advocating the former view claim that pre-tagging may turn the process of manually marking up text into one of manually revising the text, which would reduce the burden for the human annotator. &lt;/p&gt;
&lt;p&gt;conversely, people advocating the latter view claim that pre-tagging introduces a bias which affects the human annotator in such a way that he will fail to mark up things in the text that he would have seen had he been presented with the raw text instead. &lt;/p&gt;
&lt;p&gt;i know that it is very unlikely that someone will read this post, and if someone does, the chance that he or she will have anything to say about its content is small, next to none. but anyway, if you know of any references validating either of the points concerning pre-tagging, please let me know.&lt;/p&gt;
&lt;p&gt;cheers,&lt;/p&gt;
&lt;p&gt;f &lt;/p&gt;</description><link>http://smudo.tumblr.com/post/24268121</link><guid>http://smudo.tumblr.com/post/24268121</guid><pubDate>Mon, 21 Jan 2008 02:24:00 -0500</pubDate><category>thesis-related</category></item></channel></rss>
