
Hi All,

I’ve found from my machine-learning research that the quality and choice of training datasets are critical. The phrase “Garbage In, Garbage Out” comes to mind.

Last.fm took this phrase to heart when they built their music recommendation service. Instead of spending millions of dollars developing a high-dimensional neural-network super-amazing predictor, they spent a few thousand dollars improving their data-collection routine … for those who don’t know, Last.fm “scrobbles” your music-listening habits from iTunes, iPods and other listening devices. I’ve found the recommendations from Last.fm to be far better than Pandora or the Apple Genius, but YMMV.

Back to NLP …

Before I started doing tag predictions I took a look at the sentences that I’d be predicting from. There are numerous examples of identical, or nearly identical, sentences:

GSM301661:
Kidney. none. . . Quality metric = 28S to 18S: 1.6Patient Age: 70-79Gender: MaleEthnic 
Background: CaucasianTobacco Use : YesYears of Tobacco Use: 6-10Type of Tobacco Use: 
CigarettesAlcohol Consumption?: YesFamily History of Cancer?: NoPathological T: 1b
Pathological N: 0Pathological M: 0Pathological Stage: 1Pathological Grade: 2
Pathological Multiple Tumors: NoPathological Stage During or Following Multimodality 
Therapy: NoPrimary Site: KidneyHistology: Conventional (clear cell) renal carcinoma. 
Kidney - 421730

GSM138046:
Kidney. none. . . Quality metric = 28S to 18S: 1.2Patient Age: 50-60Gender: MaleEthnic 
Background: CaucasianTobacco Use : NoAlcohol Consumption?: NoFamily History of Cancer?: 
NoClinical T: 1bClinical N: 0Clinical M: 0Clinical Stage: 1Clinical Grade: 1Clinical Stage 
During or Following Multimodality Therapy: NoPrimary Site: KidneyHistology: Conventional 
(clear cell) renal carcinoma. Kidney - 215952

GSM53057:
Kidney. none. . . Quality metric = 28S to 18S: 1.1Patient Age: 40-50Gender: MaleEthnic 
Background: CaucasianTobacco Use : YesYears of Tobacco Use: 21-25Type of Tobacco Use: 
CigarettesAlcohol Consumption?: YesFamily History of Cancer?: NoDays from Patient Diagnosis 
to Excision: 60Clinical T: 1bClinical N: 0Clinical M: 0Clinical Stage: 1Clinical Grade: 4
Primary Site: KidneyHistology: Conventional (clear cell) renal carcinoma. Kidney - 1271

I need to make sure that these duplicates do not get fed to the classifier as separate TP instances. Initially I used the sentence itself as the key in a “unique_ever_seen” iterator. This turned out to be too permissive, since the “nearly identical” sentences were still labeled as separate instances. Then I tried using the first half of the sentence as the key … this was too restrictive: some sentences were identical at the beginning but carried important information at the end.
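For reference, the “unique_ever_seen” idea matches the unique_everseen recipe from the itertools documentation. Here is a minimal sketch of the two keying strategies I tried, on toy sentences (the sentences and key functions are illustrative, not from the real data):

```python
def unique_everseen(iterable, key=None):
    """Yield each element the first time its key is seen."""
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element

sents = ["Kidney. Age: 70. Grade: 2",
         "Kidney. Age: 70. Grade: 2",   # exact duplicate
         "Kidney. Age: 70. Grade: 4"]   # differs only at the end

# whole sentence as key: only exact duplicates collapse (too permissive)
full_key = list(unique_everseen(sents))

# first half as key: distinct sentences collapse too (too restrictive)
half_key = list(unique_everseen(sents, key=lambda s: s[:len(s) // 2]))
```

With the full-sentence key the third sentence survives; with the half-sentence key it is wrongly swallowed.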

I needed something that could contain all of the “relevant” information of a sentence yet still be easily comparable. Initially I was thinking a list of features … but lists aren’t hashable, and their order matters in comparisons. I tried using a sorted tuple as a dictionary key, but that became extremely clunky and slow, and I’m convinced it had loopholes I couldn’t see.

I use set() a lot, but sets aren’t hashable. Then I realized they have a simple counterpart … frozenset(). Frozensets support all of the non-mutating set operations but are “frozen”: there are no add() or update() methods, and operations like union() return new frozensets instead of modifying the original. Best of all, they are hashable! I use a defaultdict to hold the sets of TPs and TNs for each frozenset. Below is the snippet of code that takes care of all of this:

...
# imports assumed at the top of the module
import logging
from collections import defaultdict

def TrainingMicro_GEN(self):
    """
    A utility generator which groups sentences together and yields:
    (frozenset_of_features, set_of_TPs, set_of_TNs)
    """

    def default_factory():
        """
        Returns [TP_set, TN_set]
        """
        return [set(), set()]

    #get all of the hand-curated TPs and TNs
    hand_queryset = hand_annotations.objects.filter(labeled_ont_id__ont_type=self.ont_type)
    #use select_related so I don't have to hit the database a whole lot
    queryset = hand_queryset.order_by('micro_id').select_related(depth=1)
    total_num = queryset.count()

    #use a defaultdict which will be keyed by feature-sets and contain
    #the TPs and TNs for each
    keyset_dict = defaultdict(default_factory)

    for ind, this_annot in enumerate(queryset.iterator()):

        #get the actual sentence for the microarray
        this_sent = this_annot.micro_id.GetPredictableSent()

        #get the featureset
        feat_dict = self._process_token(this_sent)

        #since all of the keys just have True as a value, .keys()
        #gives us everything ... change it to a frozenset for keying
        keyset = frozenset(feat_dict.keys())
        this_tag = this_annot.labeled_ont_id.ont_id

        #the defaultdict takes care of creating a fresh entry for each
        #new frozenset or extending one that already exists
        if this_annot.is_TP:
            keyset_dict[keyset][0].add(this_tag)
        else:
            keyset_dict[keyset][1].add(this_tag)
        logging.warning('added %d:%d', ind, total_num)

    #"destructively" yield the data to avoid memory errors
    while len(keyset_dict) > 0:
        keyset, knowns_setlist = keyset_dict.popitem()
        yield keyset, knowns_setlist[0], knowns_setlist[1]
...
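To see the frozenset trick in isolation, here is a standalone demonstration (the feature names are made up, not from the real pipeline):

```python
from collections import defaultdict

# a plain set() raises TypeError when used as a dict key; a frozenset() works
features_a = frozenset(["kidney", "male", "caucasian"])
features_b = frozenset(["male", "caucasian", "kidney"])  # same contents, different order

keyset_dict = defaultdict(lambda: [set(), set()])
keyset_dict[features_a][0].add("tag_A")  # a TP
keyset_dict[features_b][1].add("tag_B")  # a TN

# equal contents hash identically, so both land in the same entry
same_entry = keyset_dict[features_a]
```

Because construction order doesn’t affect equality or hashing, both insertions collapse into a single dictionary entry holding `[{'tag_A'}, {'tag_B'}]`.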

Since I make the training/testing split downstream of this generator, I get an added benefit: because each featureset is only yielded once, I can guarantee that no identical featuresets appear in both training and testing. Duplicates straddling the split would give me an over-inflated ROC value.
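A minimal sketch of that downstream split, assuming the (keyset, TPs, TNs) tuples yielded above (split_featuresets is a hypothetical helper, not from my actual code):

```python
import random

def split_featuresets(grouped, test_frac=0.25, seed=42):
    """Shuffle the deduplicated groups, then cut once: because every
    featureset appears exactly once, none can leak across the split."""
    groups = list(grouped)
    random.Random(seed).shuffle(groups)
    n_test = max(1, int(len(groups) * test_frac))
    return groups[n_test:], groups[:n_test]  # (train, test)

grouped = [
    (frozenset(["kidney", "male"]), {"tag_A"}, set()),
    (frozenset(["kidney", "female"]), set(), {"tag_B"}),
    (frozenset(["lung", "male"]), {"tag_C"}, set()),
    (frozenset(["lung", "female"]), set(), {"tag_D"}),
]
train, test = split_featuresets(grouped)
train_keys = {keyset for keyset, _, _ in train}
test_keys = {keyset for keyset, _, _ in test}
```

Since the generator already collapsed duplicates, the train and test key sets are disjoint by construction.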

I hope this helps people avoid repeating themselves when they’re creating training sets.

Will
