
Archive for April, 2009

Hi All,

I’ve found from my machine learning research that the quality and choice of training datasets are critical.  The phrase “Garbage In, Garbage Out” comes to mind.

Last.fm took this phrase to heart when they built their music recommendation service.  Instead of spending millions of dollars developing a high-dimensional neural-network super-predictor, they spent a few thousand dollars improving their data collection routine … for those that don’t know, Last.fm “scrobbles” your music listening habits from iTunes, iPods and other listening devices.  I’ve found the recommendations from Last.fm to be far better than Pandora or the Apple Genius, but YMMV.

Back to NLP …

Before I started doing tag predictions I took a look at the sentences that I’d be predicting on.  There are numerous examples of identical, or nearly identical, sentences:

GSM301661:
Kidney. none. . . Quality metric = 28S to 18S: 1.6Patient Age: 70-79Gender: MaleEthnic 
Background: CaucasianTobacco Use : YesYears of Tobacco Use: 6-10Type of Tobacco Use: 
CigarettesAlcohol Consumption?: YesFamily History of Cancer?: NoPathological T: 1b
Pathological N: 0Pathological M: 0Pathological Stage: 1Pathological Grade: 2
Pathological Multiple Tumors: NoPathological Stage During or Following Multimodality 
Therapy: NoPrimary Site: KidneyHistology: Conventional (clear cell) renal carcinoma. 
Kidney - 421730

GSM138046:
Kidney. none. . . Quality metric = 28S to 18S: 1.2Patient Age: 50-60Gender: MaleEthnic 
Background: CaucasianTobacco Use : NoAlcohol Consumption?: NoFamily History of Cancer?: 
NoClinical T: 1bClinical N: 0Clinical M: 0Clinical Stage: 1Clinical Grade: 1Clinical Stage 
During or Following Multimodality Therapy: NoPrimary Site: KidneyHistology: Conventional 
(clear cell) renal carcinoma. Kidney - 215952

GSM53057:
Kidney. none. . . Quality metric = 28S to 18S: 1.1Patient Age: 40-50Gender: MaleEthnic 
Background: CaucasianTobacco Use : YesYears of Tobacco Use: 21-25Type of Tobacco Use: 
CigarettesAlcohol Consumption?: YesFamily History of Cancer?: NoDays from Patient Diagnosis 
to Excision: 60Clinical T: 1bClinical N: 0Clinical M: 0Clinical Stage: 1Clinical Grade: 4
Primary Site: KidneyHistology: Conventional (clear cell) renal carcinoma. Kidney - 1271

I need to make sure that these doubles do not get fed to the classifier as separate TP instances.  Initially I was using the sentence itself as the key in a “unique_ever_seen” iterator.  This turned out to be too permissive, since the “nearly identical” sentences would still be labeled as separate instances.  Then I started using the first half of the sentence as the key … this was too restrictive, since some of the sentences were identical at the beginning but had important information at the end.
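For reference, here is a minimal sketch of that kind of key-based filtering (the key functions and the sentences iterable are illustrative, not my actual code):

def unique_everseen(iterable, key_func):
	"""Yield only the items whose key_func(item) has not been seen before."""
	seen = set()
	for item in iterable:
		key = key_func(item)
		if key not in seen:
			seen.add(key)
			yield item

#keying on the full sentence lets the near-duplicates through:
#	unique_sents = unique_everseen(sentences, key_func = lambda s: s)
#keying on the first half throws away the informative endings:
#	unique_sents = unique_everseen(sentences, key_func = lambda s: s[:len(s) // 2])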

I needed something that could contain all of the “relevant” information of the sentence but be easily comparable.  Initially I was thinking a list of features … but lists aren’t hashable and their order matters in comparisons.  I then tried using a sorted tuple as a dictionary key, but that became extremely clunky and slow, and I’m convinced it had loopholes that I couldn’t see.

I use set() a lot, but sets aren’t hashable.  Then I realized that they have a simple counterpart … frozenset().  These support all of the same non-mutating operations as sets but are “frozen” … there are no in-place “update” methods; operations like union return new sets instead of modifying the current one.  Best of all, they are hashable!  I use a defaultdict to hold the sets of TPs and TNs for each frozenset.  Below is the snippet of code which takes care of all of this:

...
def TrainingMicro_GEN(self):
	"""
	A utility function which groups sentences together and returns:
	(feat_dict, set(TP), set(TN))
	"""

	def default_factory():
		"""
		Returns [TP_set(), TN_set()]
		"""
		return [set(), set()]

	#get all of the hand curated TPs and TNs
	hand_queryset = hand_annotations.objects.filter(labeled_ont_id__ont_type = self.ont_type)
	#use select related so I don't have to hit the database a whole-lot
	queryset = hand_queryset.order_by('micro_id').select_related(depth = 1)
	total_num = queryset.count()

	#use a default-dictionary which will be keyed by feature-sets and contain
	#the TPs and TNs for each
	keyset_dict = defaultdict(default_factory)

	for this_annot, ind in IT.izip(queryset.iterator(), IT.count()):

		#get the actual sentence for the microarray
		this_sent = this_annot.micro_id.GetPredictableSent()

		#get the featureset
		feat_dict = self._process_token(this_sent)

		#since all of the keys just have True as a value the .keys()
		#gives us everything .. change it to a frozenset for keying
		keyset = frozenset(feat_dict.keys())
		this_tag = this_annot.labeled_ont_id.ont_id

		#the defaultdict takes care of creating a new entry for each
		#new frozenset and adding to one that already exists
		if this_annot.is_TP:
			keyset_dict[keyset][0].add(this_tag)
		else:
			keyset_dict[keyset][1].add(this_tag)
		logging.warning('added %(ind)s:%(tot)s' % {'tot': str(total_num),
													'ind':str(ind)})

	#"destructively" yield the data to avoid memory errors
	while len(keyset_dict) > 0:
		keyset, knowns_setlist = keyset_dict.popitem()
		yield keyset, knowns_setlist[0], knowns_setlist[1]
...

Since I make the training/testing split downstream of this generator, I get an added benefit: because each featureset is yielded only once, I can ensure that no IDENTICAL featureset ends up in both training and testing, which would otherwise give me an over-inflated ROC value.
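As a rough illustration, the downstream split can be as simple as this (a minimal sketch with made-up names, where group_gen is the generator above):

import random

def split_groups(group_gen, train_frac = 2.0 / 3.0):
	"""Randomly assign each unique featureset group to training or testing."""
	train, test = [], []
	for keyset, tp_set, tn_set in group_gen:
		if random.random() < train_frac:
			train.append((keyset, tp_set, tn_set))
		else:
			test.append((keyset, tp_set, tn_set))
	return train, test

Because each keyset is yielded exactly once, a group can never land in both partitions.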

I hope this helps people avoid repeating themselves when they’re creating training sets.

Will


Hi all,
So I played around with Wordle for a little bit today.  Here is a Wordle I made for all of the words and phrases that are used in the prediction.  The sizes are proportional to the number of occurrences in the dataset, NOT the predictive ability … I have an idea about how to do that but it’ll have to wait.
Wordle: microarray words


Hi All,

Now that I’ve established a sufficient tag prediction routine, I’ve started loading the microarray data into the database.  I ultimately had four design choices:

  1. Use a single pickle-file for the gene expression of all genes on a single array.
  2. Use a single pickle-file for the gene expression of a single gene across all arrays.
  3. Some combination of 1 and 2
  4. Place each normalized value into a SQL table with foreign-keys to the microarray and probe-id.

Choice 1 would allow quick access to all of the gene values of a microarray but slow access to all of the values of a single gene across multiple arrays.  Choice 2 has the opposite problem.  Choice 3 would resolve the issue but would require double the memory space and require a massive amount of programming to keep everything properly synced.
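As a concrete illustration of why Choice 1 makes per-gene access slow, here is a rough sketch (the file names and layout are hypothetical):

import cPickle as pickle

def gene_across_arrays(gene_name, gsm_ids):
	"""
	Choice 1 stores one pickle per array (a {probe/gene: value} dict),
	so pulling a single gene across N arrays means opening N files.
	"""
	values = {}
	for gsm in gsm_ids:
		handle = open('%s.pkl' % gsm, 'rb')
		values[gsm] = pickle.load(handle)[gene_name]
		handle.close()
	return values

Choice 2 has the mirror-image problem: one pickle per gene makes this loop trivial but makes “give me everything on array GSM301661” the expensive operation.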

Choice 4 has similar memory problems to Choice 3 but the django model layer should take care of the saving and retrieval.  It would allow for simple queries using the foreign-keys from the prediction models.

Below is the models.py file; feel free to comment if you have any suggestions:

 

from django.db import models
from predictions.models import *

# A probe measures one gene and can appear on several platforms;
# probe_measurement holds the raw and normalized value of a probe
# on a particular microarray.
class probe(models.Model):
	probe_name = models.CharField('Probe Name', max_length = 40)
	platform = models.ManyToManyField('platform', null = True, default = None)
	gene_measured = models.ForeignKey('gene', null = True, default = None)


class platform(models.Model):
	platform_name = models.CharField('Platform Name', max_length = 20)
	organism = models.CharField('Target Organism', max_length = 50)
	gpl_id = models.CharField('NCBI ID', max_length = 10, 
										default = None, null = True)


class gene(models.Model):
	official_gene_name = models.CharField('Official Name', max_length = 100)
	official_gene_symbol = models.CharField('Official Symbol', max_length = 10)
	entrez_gene_id = models.IntegerField('Entrez ID', default = None, null = True)
	synonyms = models.TextField(null = True, default = None)


#microarray comes from predictions.models
class probe_measurement(models.Model):
	probe_measured = models.ForeignKey('probe')
	microarray_measured = models.ForeignKey(microarray)
	raw_value = models.FloatField(default = None, null = True)
	norm_value = models.FloatField(default = None, null = True)
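
As an example of the kind of query this foreign-key layout makes easy, here is a rough sketch (the gene symbol and the this_array variable are purely illustrative, and I’m assuming the models above are imported):

#all normalized values for one gene, across every microarray
gene_values = probe_measurement.objects.filter(
				probe_measured__gene_measured__official_gene_symbol = 'TP53')

#all values measured on a single microarray (this_array is some microarray instance)
array_values = probe_measurement.objects.filter(microarray_measured = this_array)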

 

Thanks

Will


Hi All,

So, as I’ve discussed before, I’ve been tagging microarray text annotations with ontology tags.  I’ve also discussed how I use ROC curves to measure the accuracy of the predictions.  I’ve finally finished all of the back-end work and now I’m getting results.

First, ROC curves:

ROC curves showing the prediction ability of MMB

These have AUCs of 0.94, 0.95 and 0.87 for Disease Type, Functional Anatomy and Cell Type respectively.  I have hand-annotated a LARGE number of microarray records, using a method I’ll describe later in this post.  Some details: I require that a GSM record has been annotated with AT LEAST 1 TP and 2 TNs to be included in the ROC calculations … this results in the following numbers of training annotations … By default I use 2/3rds of the data for training and 1/3rd for testing … also, the implementation of the MaxEntropy classifier that I use cannot train on TN annotations, so they are only used for the “specificity” calculation:

  • Disease Type: 1051 TPs and 96533 TNs
  • Functional Anatomy: 523 TPs and 31017 TNs
  • Cell Type: 677 TPs and 16489 TNs
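For anyone curious about the calculation itself, here is a minimal sketch of how an ROC curve can be built from predicted probabilities and the TP/TN labels (a hypothetical stand-alone function, not my actual pipeline):

def roc_points(scored_labels):
	"""
	scored_labels: a list of (predicted_probability, is_TP) pairs.
	Returns (false_positive_rate, true_positive_rate) points as the
	decision threshold sweeps from high to low.
	"""
	scored_labels = sorted(scored_labels, reverse = True)
	num_tp = sum(1 for _, is_tp in scored_labels if is_tp)
	num_tn = len(scored_labels) - num_tp
	tp = fp = 0
	points = [(0.0, 0.0)]
	for _, is_tp in scored_labels:
		if is_tp:
			tp += 1
		else:
			fp += 1
		points.append((fp / float(num_tn), tp / float(num_tp)))
	return points

The AUC is just the area under those points (e.g. by the trapezoid rule).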

While I am ECSTATIC about these results, I think they are “suspiciously high”.  Most of the NLP research that I’ve combed through reports AUCs (or comparable measures) closer to 0.7-0.8 … granted, those studies were on POS tagging or “Named Entity Recognition” (gene name identification) and relied only on features from a single sentence.  In my search to confirm or deny my prediction abilities I stumbled across something interesting, which I’ll share here:

When I plot a histogram of the predicted probabilities I notice an interesting side-effect of my annotation strategy:

A histogram of the probabilities of labeled tags

When I hand-curate the predictions I use django’s admin actions feature.  I’ve created a simple admin action in the predictions admin interface which essentially creates an “agreement” or “disagreement” with the predicted tag:

A screenshot of the admin-action page
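For the curious, a minimal sketch of what such an admin action can look like (the prediction model and its field names here are stand-ins, not my actual code):

from django.contrib import admin
from predictions.models import prediction, hand_annotations

def agree_with_tag(modeladmin, request, queryset):
	"""Create a hand annotation that agrees with each selected prediction."""
	for pred in queryset:
		hand_annotations.objects.create(micro_id = pred.micro_id,
										labeled_ont_id = pred.labeled_ont_id,
										is_TP = True)
agree_with_tag.short_description = 'Agree with the predicted tag'

class predictionAdmin(admin.ModelAdmin):
	#sorting by probability groups identical featuresets together
	ordering = ('-predicted_probability',)
	actions = [agree_with_tag]

admin.site.register(prediction, predictionAdmin)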

I simply check off the predictions I agree with and then use the admin action to create a hand annotation.  When I sort by prediction probability I get another bonus: it groups similar tags and similar descriptions together.  This happens because the extracted features for many GSM records are identical, so they get the same predicted tag with the same probability.  Due to this grouping it is trivial for me to scan through the predictions … I can annotate ~500 GSM records in about 10 minutes.

This is what created the “blips” that are circled in the histogram.  When I do my annotations I’ll usually scroll in the admin interface to a random probability, annotate ~2000 GSM records, and then scroll to another probability.

After I noticed this I tried to annotate in a “truly” random manner by having the admin interface shuffle the predictions it gave me.  However, this DRASTICALLY reduced my annotation throughput … it took ~45 min to do 500 annotations (nearly 5-fold slower).  This is because the records weren’t grouped, so I had to read each one more carefully; I couldn’t just skim it to check that it was identical to the previous one.

If anyone has any suggestions for how to correct my curation bias yet maintain my throughput I’d greatly appreciate it.

Will


django evolved

Hi all,

I’ve been developing with django for ~5 weeks now.  It’s really helped ease me into the whole SQL database thingy.  The django model layer allows the programmer to define something akin to a simple python class listing datatypes, providing “row level” methods, etc.  Mentally I think of the SQL tables as persistent vectors of class instances.  The django “model layer” takes care of the background SQL queries, inserts, saves and deletes.

The problem I had after about a week of development was that I realized I needed a few extra fields in my models.  I can’t say I’ve ever written a code-base where I got the underlying data model right the first time.  Initially I made changes to the models.py file and then ran “syncdb”.  The command finished without error and I went on programming, assuming everything was fine.  Then I began getting strange error messages and tracked them down to the “model layer”.

I searched through the django docs and realized that django would not “alter tables”; it would only create them.  The docs suggested issuing custom SQL statements and altering the tables by hand.  The SQL statements provided might as well have been in Greek for all I understood of them.

My initial solution was to dump all of my data into fixtures, then delete the underlying SQLite database file, then syncdb with the new models.py and finally reload the data.  On my large database this would take ~2-3 hours and render my application unusable for programming/debugging purposes.  I figured there had to be a better way.

A few googlings and blog posts later I found django-evolve … although it seems to have gone unmaintained for a few months.  This application intercepts syncdb calls and determines whether an ALTER TABLE command is required.  With a few extra commands (see the documentation for details) it will preview the SQL and then issue it to the database.

The only problem I’ve noticed is that it won’t “evolve” a ManyToMany field into the database … I assume it’s because that requires both a CREATE and an ALTER command.  If I were a little better at SQL I would attempt to add it to the project, but alas.  My solution was to restore my database (it’s in SVN) to a time-point before I initially created the table and then re-syncdb’ed the database.

If anyone has any suggestions I’d be glad to hear them.

Will


Hi All,

So I had a thought about a killer feature for a website.  In a previous post I talked a little bit about FogBugz.  I’ve been using it for the past ~6 months and I’m hooked.  It really helps me to plan out all of my projects.  It’s been encouraging me to “think ahead” and parcel out my program into a small set of “features” … which are often things like: code the GetDescendents() subroutine, code the django “initial_data” fixture, etc.  Since FogBugz has about 150 completed cases of mine (with time estimates) it’s been getting VERY good at predicting when my projects will be finished.

So this website that I’ve been putting together for the NLP predictions should go live within ~4 weeks … 90% confidence from FogBugz :).  This release will essentially just show the semi-static (updated nightly) tags of the GSM records … it will also allow some rudimentary searching, etc.  I’ve sketched out, in FogBugz of course, the tasks required for “downloading raw microarray data”, “normalizing ALL microarrays to allow online SAM experiments”, and about ~5 other “roll-outs”.

I was thinking that a small corner of the website could be devoted to displaying the “time-to-completion” for these roll-outs.  Since they’re in FogBugz I’ll have an entire confidence distribution … I wouldn’t display all of the “cases” for each item, since those are “private”, but the estimated release date would be useful to all of my site’s “regular” visitors.  This is similar to the “roadmap” that you find on most open source project websites … but this one would be automatically and dynamically updated.

I’ve done a little bit of leg-work and this seems like it would be a pretty simple feature to implement.  I’ve mentally sketched out a django model and looked through the FogBugz API … I haven’t found a simple “get the 50% confidence interval” call in the FogBugz API, but I didn’t look very hard.
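For concreteness, here is roughly the kind of model I have in mind (every field name is a guess at this point, nothing is final):

from django.db import models

class rollout_estimate(models.Model):
	"""One public-facing roll-out whose release estimate is pulled from FogBugz."""
	rollout_name = models.CharField('Roll-out Name', max_length = 100)
	fogbugz_milestone = models.CharField('FogBugz Milestone', max_length = 100)
	estimated_release = models.DateField('Estimated Release', null = True, default = None)
	confidence = models.FloatField('Confidence of the estimate', null = True, default = None)
	last_updated = models.DateTimeField(auto_now = True)

A nightly job would refresh estimated_release and confidence from FogBugz, and the site would just display the latest row for each roll-out.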

If anyone has any suggestions or feature requests … since I’ll put this on google-code once it’s up and running … leave them in the comments.

Thanks,

Will
