<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Mega Micro Base</title>
	<atom:link href="http://megamicrobase.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://megamicrobase.wordpress.com</link>
	<description>Adventures in Python, Microarrays, GUI's and Feature Creep</description>
	<lastBuildDate>Thu, 07 May 2009 03:49:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='megamicrobase.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Mega Micro Base</title>
		<link>http://megamicrobase.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://megamicrobase.wordpress.com/osd.xml" title="Mega Micro Base" />
	<atom:link rel='hub' href='http://megamicrobase.wordpress.com/?pushpress=hub'/>
		<item>
		<title>NLP &#8230; Not Like Pixiedust</title>
		<link>http://megamicrobase.wordpress.com/2009/05/06/nlp-not-like-pixiedust/</link>
		<comments>http://megamicrobase.wordpress.com/2009/05/06/nlp-not-like-pixiedust/#comments</comments>
		<pubDate>Thu, 07 May 2009 03:49:36 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[PPI Text-mining]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[NLTK]]></category>
		<category><![CDATA[PPI]]></category>
		<category><![CDATA[pubmed]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=179</guid>
		<description><![CDATA[I discuss a new NLP project for finding protein-protein interactions from text-mining.]]></description>
			<content:encoded><![CDATA[<p>Hi all,</p>
<p>I&#8217;ve been doing mostly Natural Language Processing work lately.  The website and crowd-sourcing are really just a way of presenting the results and gathering known data from other researchers.</p>
<p>Don&#8217;t get me wrong &#8230; I love the NLP work that I&#8217;ve been doing lately, but as the title suggests &#8230; it&#8217;s not pixie dust.  Sprinkling NLTK code over a directory of text-files will not magically extract all relevant information into nice little morsels.</p>
<p>On to a more productive train of thought.  So, my new NLP task is as such:</p>
<ul>
<li>There is a database of protein-protein interactions between Human and HIV-1 proteins &#8230; <a href="http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/">here</a>.</li>
<li>These interactions are hand curated from PUBMED articles</li>
<li>These interactions are separated by &#8220;type&#8221; such as: binding, upregulates, co-regulates, etc.</li>
<li>I need to find more such interactions from new literature (the above link hasn&#8217;t been updated in ~1 year &#8230; a lifetime in HIV research)</li>
</ul>
<p>I&#8217;ll combine these text-mining predictions with our sequence based predictions and database cross-referencing to develop a web-tool for accessing HIV-1 interactions from multiple sources.  Then we&#8217;ll extend this concept to all other viruses and create a giant database of host-pathogen protein-protein interactions.</p>
<p>I&#8217;ve searched through the literature and found a handful of webtools for finding protein-protein interactions from text-data.  However, these tools focus on &#8220;within-species&#8221; interactions and won&#8217;t even let you ask about &#8220;inter-species&#8221; interactions.</p>
<p>To this end I&#8217;ve downloaded all PubmedCentral articles (Title, Abstract, Full Text).  This is a hefty 1.4 gig tar-file of XML files.  All of the text-mining research I&#8217;ve read has only dealt with titles and abstracts &#8230; it&#8217;s not clear whether this is because of a quantity or a quality problem, but I&#8217;ll follow their lead until I decide otherwise.</p>
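<p>A minimal sketch of pulling only the title and abstract out of such an archive without unpacking it to disk.  It assumes each XML file uses an article-title element and an abstract element, as PubMed Central article files generally do; the function name is a placeholder and not part of the project code.</p>
<pre class="brush: python;">
import tarfile
import xml.etree.ElementTree as ET

def iter_title_abstract(tar_path):
    # Yield (member_name, title, abstract) for every .xml file in the archive.
    # tarfile.open() auto-detects compression in its default read mode.
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith('.xml')):
                continue
            handle = archive.extractfile(member)
            try:
                root = ET.parse(handle).getroot()
            except ET.ParseError:
                # skip files that are not well-formed XML
                continue
            title = root.find('.//article-title')
            abstract = root.find('.//abstract')
            # itertext() flattens nested markup such as italics inside a title
            title_text = ''.join(title.itertext()).strip() if title is not None else ''
            abstract_text = ''
            if abstract is not None:
                # collapse runs of whitespace left over from the XML layout
                abstract_text = ' '.join(''.join(abstract.itertext()).split())
            yield member.name, title_text, abstract_text
</pre>
<p>Because it is a generator, only one article is parsed at a time, which matters for an archive of this size.</p>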
<p>I&#8217;ve thought of a multi-step classification process:</p>
<ol>
<li>Use the entire NIAID database to find all &#8220;HIV INTERACTION PAPERS&#8221;</li>
<li>Use the results of #1 and NIAID database grouped by &#8220;interaction type&#8221; to determine the type of proposed interaction.</li>
<li>Use Named-Entity Recognition to identify the genes mentioned in the abstract.</li>
</ol>
<p>I know I&#8217;ve read a few papers advertising tools to do #3 so I&#8217;ll likely just run my abstracts through their tools.</p>
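<p>The three steps above can be sketched as a cascade in which later steps only see abstracts that pass step 1.  This is only an outline under stated assumptions: the three callables are stand-ins for real classifiers and a real NER tool, and the keyword-based toy versions here are hypothetical placeholders, not a working text-miner.</p>
<pre class="brush: python;">
def run_cascade(abstracts, is_interaction_paper, interaction_type, find_genes):
    # abstracts: iterable of (pmid, text) pairs
    results = []
    for pmid, text in abstracts:
        if not is_interaction_paper(text):          # step 1: filter papers
            continue
        results.append({'pmid': pmid,
                        'type': interaction_type(text),   # step 2: label the type
                        'genes': find_genes(text)})       # step 3: find the genes
    return results

# Hypothetical keyword stand-ins, only here so the sketch runs end to end.
def toy_is_interaction_paper(text):
    lowered = text.lower()
    return 'hiv' in lowered and 'interact' in lowered

def toy_interaction_type(text):
    lowered = text.lower()
    for label in ('binds', 'upregulates'):
        if label in lowered:
            return label
    return 'unknown'

def toy_find_genes(text):
    known = ('Tat', 'CCNT1', 'Nef')
    tokens = set(word.strip('.,;()') for word in text.split())
    return [name for name in known if name in tokens]

abstracts = [
    ('1', 'HIV-1 Tat binds CCNT1 and the two proteins interact in vitro.'),
    ('2', 'A study of yeast growth rates.'),
]
hits = run_cascade(abstracts, toy_is_interaction_paper,
                   toy_interaction_type, toy_find_genes)
</pre>
<p>Passing the steps in as functions means each stand-in can be swapped for a trained classifier or an external tool without touching the loop.</p>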
<p>I&#8217;m always open to suggestions if anyone has done research like this,</p>
<p>Will</p>
<br />Posted in Natural Language Processing, PPI Text-mining Tagged: NLP, NLTK, PPI, pubmed]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/05/06/nlp-not-like-pixiedust/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>
	</item>
		<item>
		<title>I don&#8217;t have forever here django</title>
		<link>http://megamicrobase.wordpress.com/2009/05/01/i-dont-have-forever-here-django/</link>
		<comments>http://megamicrobase.wordpress.com/2009/05/01/i-dont-have-forever-here-django/#comments</comments>
		<pubDate>Fri, 01 May 2009 17:52:16 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[django]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[microarray]]></category>
		<category><![CDATA[optimization]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=171</guid>
		<description><![CDATA[I discuss some "risky" ways to improve the speed of multiple database transactions using the django @transaction.commit_on_success decorator.]]></description>
			<content:encoded><![CDATA[<p>Hi all,</p>
<p>I&#8217;ve started loading microarray data into my django app.  I&#8217;m using a django-manage.py extension to load normalized data into the model described in a previous post.  Since I&#8217;d like to be able to use django queries to retrieve the data I need to load every normalized microarray probe measurement, with ForeignKeys to the microarray itself and the probe_id.  For those of you who are not in the bioinformatics field I&#8217;ll give you some relative numbers &#8230; there are ~20,000 to ~50,000 probes per microarray and I&#8217;m looking to load in ~10,000 microarrays.  At the moment I&#8217;m not worried about a space issue (I&#8217;ll cross that bridge if I come to it); I&#8217;m worried about a time-issue.</p>
<p>I load each normalized value and its associated probe from a tab-delimited text-file created elsewhere.  I then use the .get_or_create() function (which in this case almost always creates) to put the value into the database.  My logs show that this takes ~0.3 seconds.  While this seems very fast, it&#8217;s not good enough.  At this speed, loading an entire microarray takes ~1.5 hours &#8230; which is ~8 microarrays per day &#8230; which would mean it would take ~4 years to load in the 10,000 microarrays.</p>
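<p>The estimate above is plain arithmetic.  A small sketch, taking the ~0.3 seconds per row and ~8 arrays per day from the text as given and assuming 20,000 probes per array (the low end of the quoted range); the function name is made up for illustration.</p>
<pre class="brush: python;">
def load_time_estimate(seconds_per_row, rows_per_array, arrays_per_day, total_arrays):
    # hours needed to load one array, one row at a time
    hours_per_array = seconds_per_row * rows_per_array / 3600.0
    # days (and years) needed for the whole collection at the observed daily rate
    days_total = total_arrays / float(arrays_per_day)
    return hours_per_array, days_total, days_total / 365.0

hours, days, years = load_time_estimate(0.3, 20000, 8, 10000)
# hours is about 1.7, days is 1250, years is about 3.4
</pre>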
<p>So I need to find a better way to load these in.  I did some google-ing and came across this <a href="http://www.reddit.com/r/programming/comments/24hil/sqlite_performance_and_django">reddit page</a>.  This took me to the transaction section of the <a href="http://docs.djangoproject.com/en/dev/topics/db/transactions/">django-docs</a>.  In summary, these told me that whenever I used the .save() command I&#8217;d have to wait for the SQLite I/O &#8230; essentially each commit sets up a way to roll back the transaction and safeties in case of power failure.  I decided that I&#8217;m willing to trade some of the safety for speed in this instance &#8230; but like any scientist I like quantitative numbers.  I set up the code below to graph the difference in speed between when I commit on each transaction and when I commit in bulk.</p>
<pre class="brush: python;">
from __future__ import with_statement
from django.core.management.base import BaseCommand, AppCommand
from django.db import transaction
from predictions.models import microarray
from microdata.models import *
from optparse import make_option
from itertools import izip, count
import os, re, time
import logging
import logging.handlers

class Command(BaseCommand):
	option_list = BaseCommand.option_list + (
			make_option('--queuedir', dest = 'queuedir',
			default = os.environ['MICROPATH'] + 'new_microdata'),
			make_option('--numfiles', dest = 'numfiles', type = 'int',
			default = 200),)

	help = 'Loads the formatted and normalized data into the database'

	requires_model_validation = True

	def handle(self, **options):
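		# Without a transaction decorator, each get_or_create() below is
		# committed to the database on its own.
		def commit_each(lines, this_micro, this_plat, probe_queryset):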
			for this_line in lines:
				parts = this_line.strip().split('\t')
				try:
					val = float(parts[1])
				except ValueError:
					val = None
				this_probe, is_created = probe_queryset.get_or_create(probe_name = parts[0],
										defaults = {'platform': this_plat})
				this_measurement, is_new = probe_measurement.objects.get_or_create(
											microarray_measured = this_micro,
											probe_measured = this_probe,
											defaults = {'norm_value': val})

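		# Same loop as commit_each, but the decorator wraps the whole call in
		# one transaction: a single commit on success, a rollback on an exception.
		@transaction.commit_on_success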
		def commit_total(lines, this_micro, this_plat, probe_queryset):
			for this_line in lines:
				parts = this_line.strip().split('\t')
				try:
					val = float(parts[1])
				except ValueError:
					val = None
				this_probe, is_created = probe_queryset.get_or_create(probe_name = parts[0],
										defaults = {'platform': this_plat})
				this_measurement, is_new = probe_measurement.objects.get_or_create(
											microarray_measured = this_micro,
											probe_measured = this_probe,
											defaults = {'norm_value': val})

		# create logger
		logger = logging.getLogger(&quot;simple_example&quot;)
		logger.setLevel(logging.DEBUG)
		# create console handler and set level to debug
		ch = logging.StreamHandler()
		ch.setLevel(logging.DEBUG)
		# create formatter
		formatter = logging.Formatter(&quot;%(asctime)s - %(name)s - %(levelname)s - %(message)s&quot;)
		# add formatter to ch
		ch.setFormatter(formatter)
		# add ch to logger
		logger.addHandler(ch)
		LOG_FILENAME = 'C:\\pyscratch\\logdata\\log_data_KEEP.log'
		handler = logging.handlers.RotatingFileHandler(LOG_FILENAME,
							maxBytes=2000000, backupCount=10)
		handler.setFormatter(formatter)
		logger.addHandler(handler)
		files = os.listdir(options['queuedir'])

		for this_file, ind in izip(files, range(options['numfiles'])):
			with open(options['queuedir'] + os.sep + this_file) as handle:
				gsm_label = this_file.split('.')[0]
				logger.warning(gsm_label)
				this_micro = microarray.objects.get(gsm_label = gsm_label)
				plat_label = handle.next().strip()
				this_plat = platform.objects.get(gpl_id = plat_label)
				probe_queryset = probe.objects.filter(platform = this_plat)
				line_list_len = 10
				line_list = []
				commit_each_val = False
				for this_line in handle:
					line_list.append(this_line)
					if len(line_list) == line_list_len:
						if commit_each_val:
							logger.warning('Starting commit_each size:' + str(line_list_len))
							commit_each(line_list, this_micro, this_plat, probe_queryset)
							commit_each_val = False
							line_list = []
							logger.warning('Finished commit_each size:' + str(line_list_len))
							line_list_len *= 2
						else:
							logger.warning('Starting commit_total size:' + str(line_list_len))
							commit_total(line_list, this_micro, this_plat, probe_queryset)
							commit_each_val = True
							line_list = []
							logger.warning('Finished commit_total size:' + str(line_list_len))
				commit_total(line_list, this_micro, this_plat, probe_queryset)

			time.sleep(5)
			os.remove(options['queuedir'] + os.sep + this_file)
</pre>
<p>This allowed me to create the plot below to show the difference in execution time between the &#8220;commit_each&#8221; and &#8220;commit_total&#8221; &#8230; just to get this out of the way early:  I know the &#8220;variability and complexity&#8221; involved with benchmarking; I was actually writing this post while the code was running, so it was far from controlled.  However, the differences are far outside the noise-level of the .time() function.</p>
<div id="attachment_176" class="wp-caption aligncenter" style="width: 310px"><img class="size-medium wp-image-176" title="Saving Times" src="http://megamicrobase.files.wordpress.com/2009/05/commiting_times1.png?w=300&#038;h=210" alt="The time it takes to save items based on the block size" width="300" height="210" /><p class="wp-caption-text">The time it takes to save items based on the block size</p></div>
<p>As you can see the speed improvement of commit_total() stays constant as the block_size increases.  So the ultimate question is:</p>
<p>&#8220;Do you feel lucky &#8230; Punk&#8221;.</p>
<p>The danger of commit_total is that if there&#8217;s a power-outage, a system reboot, or someone ctrl-c&#8217;s out of the function, then all of the pending .save()&#8217;s are lost.  So obviously I don&#8217;t want to wrap the ENTIRE manage.py function in a @transaction.commit_on_success decorator since this will take days to run, and I&#8217;m not nearly that lucky.  But it&#8217;s reasonable to do on a microarray by microarray basis (about ~11 minutes).  This even makes some &#8220;design logic&#8221; sense &#8230; an entire microarray should be loaded as one action.  This translates to ~80 arrays per-night &#8230; not great but much better than ~4.</p>
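<p>The &#8220;one microarray, one transaction&#8221; idea can be tried outside django with the standard-library sqlite3 module.  This is a sketch under assumptions, not the loader above: the table and column names are invented, and the connection&#8217;s context manager stands in for the decorator by committing once on success and rolling back if an exception escapes.</p>
<pre class="brush: python;">
import sqlite3

def load_array(conn, gsm_label, rows):
    # rows: iterable of (probe_name, norm_value) pairs for ONE microarray.
    # 'with conn' commits once at the end, or rolls everything back on error,
    # so a half-loaded array never stays in the database.
    with conn:
        conn.executemany(
            'INSERT INTO measurement (gsm_label, probe_name, norm_value) '
            'VALUES (?, ?, ?)',
            ((gsm_label, probe, value) for probe, value in rows))

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE measurement '
             '(gsm_label TEXT, probe_name TEXT, norm_value REAL)')
load_array(conn, 'GSM0001', [('probe_a', 1.5), ('probe_b', None)])
</pre>
<p>If the input for an array blows up half-way through, the rows already inserted for that array are rolled back, which is the all-or-nothing behaviour described above.</p>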
<p>Any suggestions would be greatly appreciated.</p>
<p>Will</p>
<br />Posted in django, python Tagged: django, microarray, optimization]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/05/01/i-dont-have-forever-here-django/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>

		<media:content url="http://megamicrobase.files.wordpress.com/2009/05/commiting_times1.png?w=300" medium="image">
			<media:title type="html">Saving Times</media:title>
		</media:content>
	</item>
		<item>
		<title>Haven&#8217;t I seen this before?</title>
		<link>http://megamicrobase.wordpress.com/2009/04/30/havent-i-seen-this-before/</link>
		<comments>http://megamicrobase.wordpress.com/2009/04/30/havent-i-seen-this-before/#comments</comments>
		<pubDate>Thu, 30 Apr 2009 19:23:09 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[defaultdict]]></category>
		<category><![CDATA[featureset]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[NLTK]]></category>
		<category><![CDATA[set]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=163</guid>
		<description><![CDATA[I discuss how I use a defaultdict and frozenset to ensure that there is only one example of a featureset in the training or testing dataset.]]></description>
			<content:encoded><![CDATA[<p>Hi All,</p>
<p>I&#8217;ve found from my machine learning research that the quality and choice of training datasets is critical.  The phrase &#8220;<a href="http://en.wikipedia.org/wiki/Garbage_in,_garbage_out">Garbage In, Garbage Out</a>&#8221; comes to mind.</p>
<p><a href="http://www.last.fm/">Last.fm</a> took this phrase to heart when they made their music recommendation service.  Instead of spending millions of dollars developing a high-dimensional neural-network super amazing predictor they spent a few thousand dollars and improved their data collection routine &#8230; for those that don&#8217;t know, Last.fm &#8220;scrobbles&#8221; your music listening habits from iTunes, iPods and other listening devices.  I&#8217;ve found the recommendations from Last.fm to be far better than <a href="http://www.pandora.com/">Pandora</a> or the <a href="http://www.betanews.com/article/Apple-iTunes-80-A-closer-look-at-Genius/1220994527">Apple Genius</a> but YMMV.</p>
<p>Back to NLP &#8230;</p>
<p>Before I started doing tag predictions I took a look at the sentences that I&#8217;d be predicting.  There are numerous examples of identical, or nearly identical sentences:</p>
<pre class="brush: css;">
GSM301661:
Kidney. none. . . Quality metric = 28S to 18S: 1.6Patient Age: 70-79Gender: MaleEthnic
Background: CaucasianTobacco Use : YesYears of Tobacco Use: 6-10Type of Tobacco Use:
CigarettesAlcohol Consumption?: YesFamily History of Cancer?: NoPathological T: 1b
Pathological N: 0Pathological M: 0Pathological Stage: 1Pathological Grade: 2
Pathological Multiple Tumors: NoPathological Stage During or Following Multimodality
Therapy: NoPrimary Site: KidneyHistology: Conventional (clear cell) renal carcinoma.
Kidney - 421730

GSM138046:
Kidney. none. . . Quality metric = 28S to 18S: 1.2Patient Age: 50-60Gender: MaleEthnic
Background: CaucasianTobacco Use : NoAlcohol Consumption?: NoFamily History of Cancer?:
NoClinical T: 1bClinical N: 0Clinical M: 0Clinical Stage: 1Clinical Grade: 1Clinical Stage
During or Following Multimodality Therapy: NoPrimary Site: KidneyHistology: Conventional
(clear cell) renal carcinoma. Kidney - 215952

GSM53057
Kidney. none. . . Quality metric = 28S to 18S: 1.1Patient Age: 40-50Gender: MaleEthnic
Background: CaucasianTobacco Use : YesYears of Tobacco Use: 21-25Type of Tobacco Use:
CigarettesAlcohol Consumption?: YesFamily History of Cancer?: NoDays from Patient Diagnosis
to Excision: 60Clinical T: 1bClinical N: 0Clinical M: 0Clinical Stage: 1Clinical Grade: 4
Primary Site: KidneyHistology: Conventional (clear cell) renal carcinoma. Kidney - 1271
</pre>
<p>I need to make sure that these doubles do not get fed to the classifier as separate TP instances.  Initially I was using the sentence itself as the key in a &#8220;<a href="http://docs.python.org/library/itertools.html">unique_everseen</a>&#8221; iterator.  This turned out to be too permissive, since the &#8220;nearly identical&#8221; sentences would be labeled as separate instances.  Then I started using the first-half of the sentence as the key &#8230; this was too restrictive, since some of the sentences were identical in the beginning but had important information at the end.</p>
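<p>Both failure modes are easy to reproduce on toy data.  The generator below follows the unique_everseen recipe from the itertools documentation in simplified form; the three sentences are made-up stand-ins for the microarray descriptions, and first_half is a hypothetical key function.</p>
<pre class="brush: python;">
def unique_everseen(iterable, key=None):
    # Yield elements in order, skipping any whose key has been seen before.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

def first_half(sentence):
    return sentence[:len(sentence) // 2]

sentences = [
    'Kidney. none. Age: 70-79. Stage: 1',
    'Kidney. none. Age: 50-60. Stage: 1',
    'Kidney. none. Age: 70-79. Stage: 4',
]

# Whole sentence as key: too permissive, all three near-duplicates survive.
by_sentence = list(unique_everseen(sentences))

# First half as key: too restrictive, the differing stage at the end is lost.
by_first_half = list(unique_everseen(sentences, key=first_half))
</pre>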
<p>I needed something that could contain all of the &#8220;relevant&#8221; information of the sentence but be easily comparable.  Initially I was thinking a list of features &#8230; but lists aren&#8217;t hashable and the order is important in comparisons.  I was trying to use a sorted-tuple as a dictionary key but that became extremely clunky, slow and I&#8217;m convinced it had loop-holes that I couldn&#8217;t see.</p>
<p>I use set() a lot, but sets aren&#8217;t hashable.  I realized that they have a simple relative &#8230; <a href="http://docs.python.org/library/stdtypes.html?highlight=frozenset#frozenset">frozenset</a>().  These support the same non-mutating operations as sets but are &#8220;frozen&#8221; &#8230; there is no .add() or .update(), and operations such as union return a new frozenset instead of changing the current one.  Best of all, they are hashable!  I use a <a href="http://docs.python.org/library/collections.html?highlight=defaultdict#collections.defaultdict">defaultdict</a> to hold sets of TP and TN for each frozenset.  Below is the snippet of code which takes care of all of this:</p>
<pre class="brush: python;">
...
def TrainingMicro_GEN(self):
	&quot;&quot;&quot;
	A utility function which groups sentences together and returns:
	(feat_dict, set(TP), set(TN))
	&quot;&quot;&quot;

	def default_factory():
		&quot;&quot;&quot;
		Returns [TP_set(), TN_set()]
		&quot;&quot;&quot;
		return [set(), set()]

	#get all of the hand curated TPs and TNs
	hand_queryset = hand_annotations.objects.filter(labeled_ont_id__ont_type = self.ont_type)
	#use select related so I don't have to hit the database a whole-lot
	queryset = hand_queryset.order_by('micro_id').select_related(depth = 1)
	total_num = queryset.count()

	#use a default-dictionary which will be keyed by feature-sets and contain
	#the TPs and TNs for each
	keyset_dict = defaultdict(default_factory)

	for this_annot, ind in IT.izip(queryset.iterator(), IT.count()):

		#get the actual sentence for the microarray
		this_sent = this_annot.micro_id.GetPredictableSent()

		#get the featureset
		feat_dict = self._process_token(this_sent)

		#since all of the keys just have True as a value the .keys()
		#gives us everything .. change it to a frozenset for keying
		keyset = frozenset(feat_dict.keys())
		this_tag = this_annot.labeled_ont_id.ont_id

		#the defaultdict takes care of the adding new values for each
		#new frozenset or adding to one that already exists
		if this_annot.is_TP:
			keyset_dict[keyset][0].add(this_tag)
		else:
			keyset_dict[keyset][1].add(this_tag)
		logging.warning('added %(ind)s:%(tot)s' % {'tot': str(total_num),
													'ind':str(ind)})

	#&quot;destructively&quot; yield the data to avoid memory errors
	while len(keyset_dict) &gt; 0:
		keyset, knowns_setlist = keyset_dict.popitem()
		yield keyset, knowns_setlist[0], knowns_setlist[1]
...
</pre>
<p>Since I make the training/testing split downstream of this generator I get an added benefit.  Because each featureset is only yield-ed once, I can ensure that no IDENTICAL featureset appears in both training and testing.  Such an overlap would give me an over-inflated ROC value.</p>
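<p>The grouping idea can be tried on toy data without the django models.  A minimal sketch: the feature dictionaries and tag names are invented, and group_by_featureset is a simplified stand-in for the generator above rather than a copy of it.</p>
<pre class="brush: python;">
from collections import defaultdict

def group_by_featureset(examples):
    # examples: iterable of (feature_dict, tag, is_tp) triples.
    # Returns a dict keyed by frozenset of feature names, each value being
    # a (TP_tags, TN_tags) pair of sets.
    grouped = defaultdict(lambda: (set(), set()))
    for feat_dict, tag, is_tp in examples:
        key = frozenset(feat_dict)  # iterating a dict yields its keys
        grouped[key][0 if is_tp else 1].add(tag)
    return grouped

examples = [
    ({'kidney': True, 'renal carcinoma': True}, 'ONT:1', True),
    # same features in a different order: lands under the same key
    ({'renal carcinoma': True, 'kidney': True}, 'ONT:1', True),
    ({'kidney': True, 'renal carcinoma': True}, 'ONT:2', False),
    ({'liver': True}, 'ONT:3', True),
]
grouped = group_by_featureset(examples)
</pre>
<p>Four input rows collapse to two keys, so whichever side of the training/testing split a key lands on, its featureset cannot also appear on the other side.</p>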
<p>I hope this helps people avoid repeating themselves when they&#8217;re creating training sets.</p>
<p>Will</p>
<br />Posted in Natural Language Processing, programming, python Tagged: defaultdict, featureset, NLP, NLTK, python, set]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/04/30/havent-i-seen-this-before/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>
	</item>
		<item>
		<title>Predictive words and phrases</title>
		<link>http://megamicrobase.wordpress.com/2009/04/29/predictive-words-and-phrases/</link>
		<comments>http://megamicrobase.wordpress.com/2009/04/29/predictive-words-and-phrases/#comments</comments>
		<pubDate>Wed, 29 Apr 2009 21:00:35 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Quick Thoughts]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=158</guid>
		<description><![CDATA[Hi all, So I played around with Wordle for a little bit today. Here is a Wordle I made for all of the words and phrases that are used in the prediction.  The sizes are proportional to the number of occurrences in the dataset, NOT the predictive ability &#8230; I have an idea about how [...]]]></description>
			<content:encoded><![CDATA[<p>Hi all,<br />
So I played around with <a href="http://www.wordle.net">Wordle</a> for a little bit today.  Here is a Wordle I made for all of the words and phrases that are used in the prediction.  The sizes are proportional to the number of occurrences in the dataset, NOT the predictive ability &#8230; I have an idea about how to do that but it&#8217;ll have to wait.<br />
<a href="http://www.wordle.net/gallery/wrdl/793474/microarray_words" title="Wordle: microarray words"><img src="http://www.wordle.net/thumb/wrdl/793474/microarray_words" alt="Wordle: microarray words" style="border:1px solid #ddd;padding:4px;"></a></p>
<br />Posted in Natural Language Processing, Quick Thoughts, Uncategorized]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/04/29/predictive-words-and-phrases/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>

		<media:content url="http://www.wordle.net/thumb/wrdl/793474/microarray_words" medium="image">
			<media:title type="html">Wordle: microarray words</media:title>
		</media:content>
	</item>
		<item>
		<title>SQL Microarray Tables</title>
		<link>http://megamicrobase.wordpress.com/2009/04/29/sql-microarray-tables/</link>
		<comments>http://megamicrobase.wordpress.com/2009/04/29/sql-microarray-tables/#comments</comments>
		<pubDate>Wed, 29 Apr 2009 13:38:13 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[django]]></category>
		<category><![CDATA[Website]]></category>
		<category><![CDATA[microarray]]></category>
		<category><![CDATA[models]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=145</guid>
		<description><![CDATA[Hi All, Now that I&#8217;ve established a sufficient tag prediction routine I&#8217;ve started to load the microarray data into the database.  I ultimately had four design choices: Use a single pickle-file for the gene expression of all genes on a single array. Use a single pickle-file for the gene expression of a single gene across all arrays. [...]]]></description>
			<content:encoded><![CDATA[<p>Hi All,</p>
<p>Now that I&#8217;ve established a sufficient tag prediction routine I&#8217;ve started to load the microarray data into the database.  I ultimately had four design choices:</p>
<ol>
<li>Use a single pickle-file for the gene expression of all genes on a single array.</li>
<li>Use a single pickle-file for the gene expression of a single gene across all arrays.</li>
<li>Some combination of 1 and 2</li>
<li>Place each normalized value into a SQL table with foreign-keys to the microarray and probe-id.</li>
</ol>
<p>Choice 1 would allow quick access to all of the gene values of a microarray but slow access to all of the values of a single gene across multiple arrays.  Choice 2 has the opposite problem.  Choice 3 would resolve the issue but would require double the storage space and a massive amount of programming to keep everything properly synced.</p>
<p>Choice 4 has similar storage problems to Choice 3 but the django model layer should take care of the saving and retrieval.  It would also allow for simple queries using the foreign-keys from the prediction models.</p>
<p>Below is the models.py file; feel free to comment if you have any suggestions:</p>
<p> 
<pre class="brush: python;">
from django.db import models
# microarray is the only name used below that is not defined in this file,
# so import it explicitly rather than with a wildcard.
from predictions.models import microarray

class probe(models.Model):
	"""A single probe, which may appear on several platforms."""
	probe_name = models.CharField('Probe Name', max_length = 40)
	# null has no effect on a ManyToManyField, so it is left off here.
	platform = models.ManyToManyField('platform')
	# Not every probe maps to a known gene, so this link is optional.
	gene_measured = models.ForeignKey('gene', null = True, default = None)

class platform(models.Model):
	"""A microarray platform (chip design), optionally with its NCBI GPL id."""
	platform_name = models.CharField('Platform Name', max_length = 20)
	organism = models.CharField('Target Organism', max_length = 50)
	gpl_id = models.CharField('NCBI ID', max_length = 10,
										default = None, null = True)

class gene(models.Model):
	"""Gene annotation that probes point at."""
	official_gene_name = models.CharField('Official Name', max_length = 100)
	official_gene_symbol = models.CharField('Official Symbol', max_length = 10)
	entrez_gene_id = models.IntegerField('Entrez ID', default = None, null = True)
	synonyms = models.TextField(null = True, default = None)

class probe_measurement(models.Model):
	"""One value of one probe on one microarray (Choice 4 above)."""
	probe_measured = models.ForeignKey('probe')
	microarray_measured = models.ForeignKey(microarray)
	raw_value = models.FloatField(default = None, null = True)
	norm_value = models.FloatField(default = None, null = True)
</pre>
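<p>As a rough illustration of why Choice 4 is attractive, the sketch below uses Python&#8217;s built-in sqlite3 module rather than django, with made-up table names and values (the <code>gsm_id</code> column in particular is an assumption, since the microarray model lives in the predictions app and is not shown here).  A single measurement table answers both the &#8220;all probes on one array&#8221; and the &#8220;one probe across all arrays&#8221; question:</p>

```python
import sqlite3

# Hypothetical schema loosely mirroring the models above; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE probe (id INTEGER PRIMARY KEY, probe_name TEXT);
CREATE TABLE microarray (id INTEGER PRIMARY KEY, gsm_id TEXT);
CREATE TABLE probe_measurement (
    id INTEGER PRIMARY KEY,
    probe_id INTEGER REFERENCES probe(id),
    microarray_id INTEGER REFERENCES microarray(id),
    norm_value REAL
);
CREATE INDEX measurement_by_probe ON probe_measurement (probe_id);
CREATE INDEX measurement_by_array ON probe_measurement (microarray_id);
""")
conn.executemany("INSERT INTO probe VALUES (?, ?)",
                 [(1, "probe_A"), (2, "probe_B")])
conn.executemany("INSERT INTO microarray VALUES (?, ?)",
                 [(1, "GSM_1"), (2, "GSM_2")])
conn.executemany(
    "INSERT INTO probe_measurement (probe_id, microarray_id, norm_value) "
    "VALUES (?, ?, ?)",
    [(1, 1, 0.5), (2, 1, 1.5), (1, 2, 0.7), (2, 2, 1.1)],
)


def values_on_array(conn, microarray_id):
    """Every probe value on one array (the lookup Choice 1 makes fast)."""
    cursor = conn.execute(
        "SELECT probe_id, norm_value FROM probe_measurement "
        "WHERE microarray_id = ? ORDER BY probe_id",
        (microarray_id,),
    )
    return cursor.fetchall()


def values_for_probe(conn, probe_id):
    """One probe's value on every array (the lookup Choice 2 makes fast)."""
    cursor = conn.execute(
        "SELECT microarray_id, norm_value FROM probe_measurement "
        "WHERE probe_id = ? ORDER BY microarray_id",
        (probe_id,),
    )
    return cursor.fetchall()


print(values_on_array(conn, 1))
print(values_for_probe(conn, 1))
```

<p>The two indexes are what keep both lookups cheap; django creates an index for each ForeignKey by default, so the models above should get the equivalent without extra work.</p>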
<p> </p>
<p>Thanks</p>
<p>Will</p>
<br />Posted in django, Website Tagged: django, microarray, models]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/04/29/sql-microarray-tables/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>
	</item>
		<item>
		<title>So, How well do I do?</title>
		<link>http://megamicrobase.wordpress.com/2009/04/27/so-how-well-do-i-do/</link>
		<comments>http://megamicrobase.wordpress.com/2009/04/27/so-how-well-do-i-do/#comments</comments>
		<pubDate>Mon, 27 Apr 2009 16:12:25 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[django]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Website]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[prediction]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=149</guid>
		<description><![CDATA[I discuss the high ROC values I've gotten for my first run at predicting.  I also mention an interesting side-effect of my annotation strategy and look for ways to avoid it in the future.]]></description>
			<content:encoded><![CDATA[<p>Hi All,</p>
<p>So as I&#8217;ve discussed before I&#8217;ve been tagging microarray text annotations with ontology tags.  I&#8217;ve also discussed how I use <a href="http://megamicrobase.wordpress.com/2009/03/25/how-to-define-good/">ROC curves to measure the accuracy of the predictions</a>.  I&#8217;ve finally finished all of the back-end work and now I&#8217;ve been getting results.</p>
<p>First, ROC curves:</p>
<div id="attachment_153" class="wp-caption aligncenter" style="width: 310px"><img class="size-medium wp-image-153" title="ROC Curves" src="http://megamicrobase.files.wordpress.com/2009/04/roc_curves1.png?w=300&#038;h=225" alt="Prediction ability of MMB" width="300" height="225" /><p class="wp-caption-text">Prediction ability of MMB</p></div>
<p>These have AUCs of 0.94, 0.95 and 0.87 for Disease Type, Functional Anatomy and Cell Type respectively.  I have hand-annotated a LARGE number of microarray records, using a method I&#8217;ll describe later in this post.  Some details: I require that a GSM record has been annotated with AT LEAST 1 TP (true positive) and 2 TNs (true negatives) to be included in the ROC calculations.  By default I use 2/3rds of the data for training and 1/3rd for testing.  Also, the implementation of the MaxEntropy classifier that I use cannot train with TN annotations, so they are only used for the &#8220;specificity&#8221; calculation.  This results in the following numbers of annotated examples:</p>
<ul>
<li>Disease Type: 1051 TPs and 96533 TNs</li>
<li>Functional Anatomy: 523 TPs and 31017 TNs</li>
<li>Cell Type: 677 TPs and 16489 TNs</li>
</ul>
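<p>For readers less familiar with the measure: the AUC is the probability that a randomly chosen TP receives a higher predicted probability than a randomly chosen TN.  The sketch below computes it directly from that definition; the scores are invented for illustration and this is not the code behind the curves above.</p>

```python
def auc(positive_scores, negative_scores):
    """Fraction of (TP, TN) pairs where the TP scores higher; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up predicted probabilities for hand-annotated TPs and TNs.
tp_scores = [0.9, 0.8, 0.6]
tn_scores = [0.7, 0.3, 0.2, 0.1]
print(auc(tp_scores, tn_scores))
```

<p>The pairwise loop is quadratic, so with tens of thousands of TNs a rank-based implementation would be the practical choice; the definition is the same either way.</p>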
<p>While I am ECSTATIC about these results I think they are &#8220;suspiciously high&#8221;.  Most of the NLP research that I&#8217;ve combed through reports AUCs (or comparable measures) closer to 0.7-0.8 &#8230; granted, these studies were on POS tagging or &#8220;Named Entity Recognition&#8221; (gene name identification) and relied only on features from a single sentence.  In my search to confirm or deny my prediction abilities I stumbled across something interesting which I&#8217;ll share here:</p>
<p>When I plot a histogram of the predicted probabilities I notice an interesting side-effect of my annotation strategy:</p>
<div id="attachment_154" class="wp-caption aligncenter" style="width: 310px"><img class="size-medium wp-image-154" title="Probability Histogram" src="http://megamicrobase.files.wordpress.com/2009/04/prob_hist.png?w=300&#038;h=225" alt="A histogram of the probabilities of labeled tags" width="300" height="225" /><p class="wp-caption-text">A histogram of the probabilities of labeled tags</p></div>
<p>When I hand-curate the predictions I use the <a href="http://docs.djangoproject.com/en/dev/ref/contrib/admin/actions/">django-admin actions</a> feature.  I&#8217;ve created a simple admin-action in the predictions admin interface which essentially records an &#8220;agreement&#8221; or &#8220;disagreement&#8221; with the predicted tag:</p>
<div id="attachment_155" class="wp-caption aligncenter" style="width: 310px"><img class="size-medium wp-image-155" title="admin-action" src="http://megamicrobase.files.wordpress.com/2009/04/admin-action1.jpg?w=300&#038;h=177" alt="A screen-shot of the admin-action website" width="300" height="177" /><p class="wp-caption-text">A screen-shot of the admin-action website</p></div>
<p>I simply check off the predictions I agree with and then use the admin-action to create a hand-annotation.  When I sort by prediction-probability I get another bonus: similar tags and similar descriptions are grouped together.  This happens because the extracted features for many GSM records are identical, so they have the same predicted tag with the same probability.  Due to this grouping it is trivial for me to scan through the predictions &#8230; I can annotate ~500 GSM records in about 10 minutes.</p>
<p>This is what created the &#8220;blips&#8221; that are circled in this histogram.  When I do my annotations I&#8217;ll usually scroll in the admin interface to a random probability.  Then I&#8217;ll annotate ~2000 GSM records and then scroll to another probability.</p>
<p>After I noticed this I tried to annotate in a &#8220;truly&#8221; random manner by having the admin interface shuffle the predictions it gave me.  However, this DRASTICALLY reduced my annotation throughput &#8230; it took ~45 min to do 500 annotations (roughly 4.5-fold slower than the ~10 min above).  This is because the records weren&#8217;t grouped and I had to read each one more carefully; I couldn&#8217;t just skim it to confirm it was identical to the previous one.</p>
<p>If anyone has any suggestions for how to correct my curation bias yet maintain my throughput I&#8217;d greatly appreciate it.</p>
<p>Will</p>
<br />Posted in django, Natural Language Processing, Website Tagged: django, NLP, prediction, Website]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/04/27/so-how-well-do-i-do/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>

		<media:content url="http://megamicrobase.files.wordpress.com/2009/04/roc_curves1.png?w=300" medium="image">
			<media:title type="html">ROC Curves</media:title>
		</media:content>

		<media:content url="http://megamicrobase.files.wordpress.com/2009/04/prob_hist.png?w=300" medium="image">
			<media:title type="html">Probability Histogram</media:title>
		</media:content>

		<media:content url="http://megamicrobase.files.wordpress.com/2009/04/admin-action1.jpg?w=300" medium="image">
			<media:title type="html">admin-action</media:title>
		</media:content>
	</item>
		<item>
		<title>django evolved</title>
		<link>http://megamicrobase.wordpress.com/2009/04/24/django-evolved/</link>
		<comments>http://megamicrobase.wordpress.com/2009/04/24/django-evolved/#comments</comments>
		<pubDate>Fri, 24 Apr 2009 14:00:34 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[django]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[Website]]></category>
		<category><![CDATA[django-evolution]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=141</guid>
		<description><![CDATA[I discuss django-evolution and how it can automatically add or remove fields from a django application.]]></description>
			<content:encoded><![CDATA[<p>Hi all,</p>
<p>I&#8217;ve been developing with django for ~5 weeks now.  It&#8217;s really helped ease me into the whole SQL database thingy.  The django model layer allows the programmer to define something akin to a simple python class listing datatypes, providing &#8220;row level&#8221; methods, etc.  Mentally I think of the SQL tables as persistent vectors of class instances.  The django &#8220;model layer&#8221; takes care of the background SQL queries, insert, save and delete statements.</p>
<p>The problem I had after about a week of development was that I realized I needed a few extra fields in my model.  I can&#8217;t say I&#8217;ve ever written a code-base where I wrote the underlying data-model correctly the first time.  Initially I made changes to the models.py file and then ran &#8220;syncdb&#8221;.  The command finished without error and I went on programming assuming everything was fine.  I began getting strange error messages and tracked them down to the &#8220;model layer&#8221;.</p>
<p>I searched through the django docs and realized that django would not &#8220;alter tables&#8221;, it would only create them.  The docs suggested issuing custom SQL statements to alter the tables by hand.  The SQL statements provided might just as well have been in Greek for all I understood of them.</p>
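<p>For the simplest case, adding one nullable field, the hand-written SQL is a single ALTER TABLE statement.  The sketch below runs it against an in-memory SQLite database using Python&#8217;s built-in sqlite3 module; the table and column names are placeholders (a real django table would be named after its app and model), so treat it as an illustration rather than a recipe.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table standing in for one that syncdb created earlier.
conn.execute(
    "CREATE TABLE gene (id INTEGER PRIMARY KEY, official_gene_symbol TEXT)"
)
conn.execute("INSERT INTO gene (official_gene_symbol) VALUES ('TP53')")

# The model gained a nullable text field, so add the matching column by hand.
conn.execute("ALTER TABLE gene ADD COLUMN synonyms TEXT")

# Existing rows are kept; the new column is simply NULL for them.
columns = [row[1] for row in conn.execute("PRAGMA table_info(gene)")]
existing = conn.execute(
    "SELECT official_gene_symbol, synonyms FROM gene"
).fetchall()
print(columns)
print(existing)
```

<p>Existing rows survive the change, which is the whole point compared with dumping and reloading the data.</p>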
<p>My initial solution was to dump all of my data into fixtures, then delete the underlying SQLite database file, then syncdb with the new models.py and finally reload the data.  On my large database this would take ~2-3 hours and render my application unusable for programming/debugging purposes.  I figured there had to be a better way.</p>
<p>A few googlings and blog posts later I found <a href="http://code.google.com/p/django-evolution/">django-evolution</a> &#8230; although it seems to have been unmaintained for a few months.  This application intercepts syncdb calls and determines whether an ALTER TABLE command is required.  With a few extra commands (see the documentation for details) it will preview the SQL command and then issue it to the database.</p>
<p>The only problem I&#8217;ve noticed is that it won&#8217;t &#8220;evolve&#8221; a ManyToMany field into the database &#8230; I assume it&#8217;s because that requires both a CREATE and an ALTER command.  If I were a little better at SQL I would attempt to add it to the project, but alas.  My solution was to restore my database (it&#8217;s in SVN) to a time-point before I initially created the table and then re-run syncdb.</p>
<p>If anyone has any suggestions I&#8217;d be glad to hear them.</p>
<p>Will</p>
<br />Posted in django, python, Website Tagged: django, django-evolution, SQL]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/04/24/django-evolved/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>
	</item>
		<item>
		<title>Lifting the Fog Around Feature Requests</title>
		<link>http://megamicrobase.wordpress.com/2009/04/05/lifting-the-fog-around-feature-requests/</link>
		<comments>http://megamicrobase.wordpress.com/2009/04/05/lifting-the-fog-around-feature-requests/#comments</comments>
		<pubDate>Mon, 06 Apr 2009 04:26:44 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[django]]></category>
		<category><![CDATA[Future Work]]></category>
		<category><![CDATA[Project Management]]></category>
		<category><![CDATA[Quick Thoughts]]></category>
		<category><![CDATA[Website]]></category>
		<category><![CDATA[fogbugz]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=135</guid>
		<description><![CDATA[I talk about using Fogbugz and django to create a website corner that shows the expected date of website roll-outs.]]></description>
			<content:encoded><![CDATA[<p>Hi All,</p>
<p>So I had a thought about a killer feature for a website.  In a previous <a href="http://megamicrobase.wordpress.com/2009/02/13/lifting-the-fog-around-time-to-completion/">post</a> I talked a little bit about <a href="http://www.fogcreek.com/FogBugz/">FogBugz</a>.  I&#8217;ve been using it for the past ~6 months and I&#8217;m hooked.  It really helps me to plan out all of my projects.  It&#8217;s been encouraging me to &#8220;think ahead&#8221; and parcel out my program into a small set of &#8220;features&#8221; &#8230; which are often: code GetDescendents() subroutine, code django &#8220;initial_data&#8221; fixture, etc.  Since FogBugz has about 150 completed cases of mine (with time estimates) it&#8217;s been getting VERY good at predicting when my projects will be finished.</p>
<p>So this website that I&#8217;ve been putting together for the NLP predictions should go live within ~4 weeks &#8230; 90% confidence from FogBugz <img src='http://s2.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .  This website release will essentially just show the semi-static (updated nightly) tags of the GSM records &#8230; it will also allow some rudimentary searching, etc.  I&#8217;ve sketched out, in FogBugz of course, the tasks required for: &#8220;downloading raw microarray data&#8221;, &#8220;normalizing ALL microarrays to allow online SAM experiments&#8221;, and ~5 other &#8220;roll-outs&#8221;.</p>
<p>I was thinking that a small corner of the website could be devoted to displaying &#8220;time-to-completion&#8221; for these roll-outs.  Since they&#8217;re in FogBugz I&#8217;ll have an entire confidence distribution &#8230; I wouldn&#8217;t display all of the &#8220;cases&#8221; for each item since that&#8217;s &#8220;private&#8221; but the estimated release date would be useful to all of my site&#8217;s &#8220;regular&#8221; visitors.  This is similar to the &#8220;roadmap&#8221; that you find on most open source project websites &#8230; but this would be automatically and dynamically updated.</p>
<p>I&#8217;ve done a little bit of leg-work and this seems like it would be a pretty simple feature to implement.  I&#8217;ve mentally sketched out a django model and looked through the FogBugz API &#8230; I haven&#8217;t found a simple &#8220;get 50% confidence interval&#8221; call in the FogBugz API but I didn&#8217;t look very hard.</p>
<p>If anyone has any suggestions or feature requests &#8230; since I&#8217;ll put this on google-code once it&#8217;s up and running &#8230; leave them in the comments.</p>
<p>Thanks,</p>
<p>Will</p>
<br />Posted in django, Future Work, Project Management, Quick Thoughts, Website Tagged: django, fogbugz]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/04/05/lifting-the-fog-around-feature-requests/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>
	</item>
		<item>
		<title>I love django, but &#8230;</title>
		<link>http://megamicrobase.wordpress.com/2009/03/30/i-love-django-but/</link>
		<comments>http://megamicrobase.wordpress.com/2009/03/30/i-love-django-but/#comments</comments>
		<pubDate>Tue, 31 Mar 2009 04:50:32 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[django]]></category>
		<category><![CDATA[Website]]></category>
		<category><![CDATA[template]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=132</guid>
		<description><![CDATA[I discuss how much I like django but how I'd love to have a set of basic website templates.]]></description>
			<content:encoded><![CDATA[<p>Hi All,</p>
<p>So I&#8217;ve been looking to make a website for the predictions.  There are numerous python (and non-python) solutions for making a database-backed website.  Since I&#8217;ve had some experience playing around with django I decided to go with that.</p>
<p><a href="http://www.djangoproject.com/">Django</a> has a well developed <a href="http://docs.djangoproject.com/en/dev/topics/templates/#topics-templates">template system</a>, <a href="http://docs.djangoproject.com/en/dev/topics/http/urls/#topics-http-urls">url resolving</a>, <a href="http://docs.djangoproject.com/en/dev/topics/db/models/#topics-db-models">database integration</a>, <a href="http://docs.djangoproject.com/en/dev/topics/forms/modelforms/#topics-forms-modelforms">auto-form</a> creation and a half-dozen other <a href="http://docs.djangoproject.com/en/dev/">&#8220;batteries included&#8221; features</a>.  All of these let me set up a relational database (which, as I&#8217;ve mentioned <a href="http://megamicrobase.wordpress.com/2009/01/21/relation-less-…ation-database/">here</a>, truly scares me) in a matter of minutes.  The admin interface is not particularly useful for my case &#8230; since all of my information is structured and I was able to write simple functions to import it.</p>
<p>The difficult part for me &#8230; since I&#8217;m a programmer &#8230; is the actual website design.  I&#8217;m not talking about &#8220;how do I link each page&#8221; or &#8220;how do I input this form-data into the database&#8221;.  I&#8217;m talking about &#8220;how do I make a snazzy website&#8221;, &#8220;how do I make a Table of Contents in a frame&#8221;, &#8220;how do I make a logo&#8221;, etc.</p>
<p>My programmer&#8217;s experience has taught me how to &#8220;reverse engineer&#8221; what I want from examples that do &#8220;almost&#8221; what I want.  So what would be VERY useful to me would be a set of &#8220;built-in&#8221; templates which produce a super basic, plain vanilla website.  Ideally they would have the &#8220;blocks&#8221; built in &#8230; or at least comments telling me &#8220;put title here&#8221;, &#8220;put TOC here&#8221;, &#8220;put logo image here&#8221;.</p>
<p>I&#8217;ve done a few google searches to try to find basic django templates but I haven&#8217;t been able to find anything useful.  If anyone has any suggestions I&#8217;d be glad to hear them.</p>
<p>-Will</p>
<br />Posted in django, Website Tagged: django, template, Website]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/03/30/i-love-django-but/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>
	</item>
		<item>
		<title>nosegrading MATLAB</title>
		<link>http://megamicrobase.wordpress.com/2009/03/28/nosegrading-matlab/</link>
		<comments>http://megamicrobase.wordpress.com/2009/03/28/nosegrading-matlab/#comments</comments>
		<pubDate>Sat, 28 Mar 2009 06:29:48 +0000</pubDate>
		<dc:creator>willdampier</dc:creator>
				<category><![CDATA[Quick Thoughts]]></category>
		<category><![CDATA[teaching]]></category>
		<category><![CDATA[matlab]]></category>
		<category><![CDATA[mlabwrap]]></category>
		<category><![CDATA[nosetest]]></category>

		<guid isPermaLink="false">http://megamicrobase.wordpress.com/?p=118</guid>
		<description><![CDATA[I discuss the possibility of using a set of nosetest plugins to create a "grading" program for python or MATLAB assignments in a class.]]></description>
			<content:encoded><![CDATA[<p>Hi All,</p>
<p>I had a quick thought.  When I&#8217;m teaching my Computational Biology course, which is basically a short MATLAB programming course, I make sure that all assignments are simple programs.  This makes grading easier since I can just run the code and evaluate the result against my solution &#8230; I ALWAYS make sure I can write the solution code using only the techniques that I&#8217;ve already taught.</p>
<p>In order to actually grade the code I have a few boilerplate functions in MATLAB that check sizes, values, etc. and then grade accordingly.  This usually ends up being a lot of copy-pasting and downloading from Gmail.</p>
<p>If I use <a href="http://mlabwrap.sourceforge.net/">mlabwrap</a>, as I discussed <a href="http://megamicrobase.wordpress.com/2009/03/27/matlab-or-python-why-choose/">here</a>, to interface Python with MATLAB I could take a more integrated approach.  <a href="http://somethingaboutorange.com/mrl/projects/nose/">Nosetest</a> and its <a href="http://somethingaboutorange.com/mrl/projects/nose/doc/">plugin interface</a> do an amazing job of isolating code which may fail (like a minority of my students&#8217; code).</p>
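<p>A minimal sketch of how a size-and-value check like the one described above might look on the Python side, written as a plain test function of the kind nose collects. It assumes the student&#8217;s result has already been pulled out of MATLAB as a nested list; <code>shape_of</code>, <code>score_answer</code> and the half-credit rule are hypothetical names and choices for illustration, not the MATLAB boilerplate mentioned above.</p>

```python
import math


def shape_of(matrix):
    """Return (rows, cols) for a rectangular list of lists."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    return rows, cols


def score_answer(expected, actual, tol=1e-9):
    """Hypothetical rubric: 0.0 for the wrong size, 0.5 for the right size
    with wrong values, 1.0 when the values match as well."""
    if shape_of(expected) != shape_of(actual):
        return 0.0
    values_match = all(
        math.isclose(e, a, abs_tol=tol)
        for exp_row, act_row in zip(expected, actual)
        for e, a in zip(exp_row, act_row)
    )
    return 1.0 if values_match else 0.5


def test_student_answer():
    """nose collects functions named test_*; a failed assert is reported
    as a failure for this test only, so other tests still run."""
    expected = [[1.0, 2.0], [3.0, 4.0]]
    student = [[1.0, 2.0], [3.0, 4.0]]  # stand-in for a real submission
    assert score_answer(expected, student) == 1.0
```

<p>Keeping the scoring in its own function, separate from the test, means the same rubric could later be reused by whatever records the grade.</p>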
<p>The desired features of my solution would include:</p>
<ul>
<li>Use IMAP to download everyone&#8217;s code from a dedicated inbox &#8230; or make an internet submission system.</li>
<li>Use a nosetest plugin to grade each student&#8217;s submission.</li>
<li>Assign some types of partial credit via a regexp search for things like &#8220;used a for-loop&#8221;.</li>
<li>Use a nosetest plugin and django templates to create a response which is automatically emailed to the student.</li>
<li>Use the xls interface to record the grade.</li>
</ul>
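<p>The regexp partial-credit item in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the <code>RUBRIC</code> table, its point values, and the <code>strip_comments</code> and <code>partial_credit</code> helpers are all hypothetical, and the patterns are deliberately simple guesses at what &#8220;used a for-loop&#8221; might mean in MATLAB source.</p>

```python
import re

# Hypothetical rubric: feature name mapped to (pattern, points).
RUBRIC = {
    "used a for-loop": (re.compile(r"^\s*for\s+\w+\s*=", re.MULTILINE), 2),
    "defined a function": (re.compile(r"^\s*function\b", re.MULTILINE), 1),
}


def strip_comments(source):
    """Drop everything after a % on each line so commented-out code earns
    no credit. Naive: a % inside a string literal is also cut off."""
    return "\n".join(line.split("%", 1)[0] for line in source.splitlines())


def partial_credit(source):
    """Return (points, matched feature names) for one submission."""
    code = strip_comments(source)
    matched = [
        name for name, (pattern, _) in RUBRIC.items() if pattern.search(code)
    ]
    points = sum(RUBRIC[name][1] for name in matched)
    return points, matched


submission = (
    "function out = total(v)\n"
    "out = 0;\n"
    "for i = 1:numel(v)\n"
    "    out = out + v(i);\n"
    "end\n"
)
print(partial_credit(submission))
```

<p>A regexp only shows that a construct appears in the file, not that it is used correctly, so a check like this fits best as a supplement to actually running the code.</p>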
<p>I&#8217;ve done some Google searches and haven&#8217;t been able to turn up anything relevant.  I may write a small set of nosetest plugins in my &#8220;free time&#8221;.</p>
<p>If anyone has any suggestions or feature requests please leave them in the comments.</p>
<p>-Will</p>
<br />Posted in Quick Thoughts, teaching Tagged: matlab, mlabwrap, nosetest, teaching]]></content:encoded>
			<wfw:commentRss>http://megamicrobase.wordpress.com/2009/03/28/nosegrading-matlab/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/2cd4d04570b1a24997ffa23a1e553026?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">willdampier</media:title>
		</media:content>
	</item>
	</channel>
</rss>
