Hi All,
Today’s push has been to implement a method which can evaluate predictions in an “unbiased” way. All of my prediction methods return probability distributions which describe how likely it is that a tag belongs to a record. I also have a set of ~2000 hand-annotated records with “gold-standard” tags that I will use as evaluation cases.
In machine learning research there are a handful of ways to evaluate/rank the performance of a classification technique. These include (but are certainly not limited to) the F-measure, the Matthews Correlation Coefficient, and the Area Under the ROC Curve (AUC). I prefer the ROC curve since I tend to have binary classification problems with a continuously valued predictor variable. The AUC also provides a fair measure when the numbers of positive and negative examples are unequal.
The algorithm behind the AUC is quite intuitive. Assume my classifier is a function C which takes the features F_I extracted from item I and returns a continuously valued score P_I, i.e. C(F_I) = P_I. I make the assumption that higher values indicate a greater likelihood of a tag belonging to an item; think of P_I as akin to a probability, although it does not have to be between 0 and 1. In order to convert these values into an AUC you can use the following algorithm.
- Choose N cutoffs between the max and min of the P_I values. I tend to use N=100, but YMMV.
- For each cutoff, treat all values GREATER than the cutoff as positives and everything at or below it as negatives.
- Determine the number of True Positives, True Negatives, False Positives, and False Negatives.
Next we calculate the Sensitivity (the fraction of actual positives labeled positive, TP/(TP+FN)) and Specificity (the fraction of actual negatives labeled negative, TN/(TN+FP)) for each cutoff.
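The counting at a single cutoff can be sketched in a few lines. This is a toy example with made-up tag scores and gold-standard tags (none of it comes from my actual data):

```python
# Hypothetical scores for four candidate tags on one record, with
# 'sports' and 'news' as the (made-up) gold-standard tags.
scores = {"sports": 0.9, "news": 0.6, "weather": 0.3, "finance": 0.1}
truth = {"sports", "news"}

cutoff = 0.5
test_pos = {tag for tag, p in scores.items() if p > cutoff}
test_neg = set(scores) - test_pos

TP = len(test_pos & truth)  # predicted positive, actually positive
FP = len(test_pos - truth)  # predicted positive, actually negative
TN = len(test_neg - truth)  # predicted negative, actually negative
FN = len(test_neg & truth)  # predicted negative, actually positive

sens = TP / (TP + FN)
spec = TN / (TN + FP)
```

Repeating this for every cutoff traces out the ROC curve.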
You can plot the sens vs. 1-spec to get a visual representation of the ROC curve. I use the trapezoidal rule to calculate the numeric integral. Below is a short function which takes a list of ProbDistI (as created by nltk.classify.naivebayes, or something similar) and a list of sets indicating the correct tags.
[sourcecode language="python"]
import numpy

def setNormalizer(ref):
    """Helper (implied by the ROC docstring below): normalize each REF
    entry to a set, whether it is a single tag or a list/set of tags."""
    for tags in ref:
        if isinstance(tags, (list, tuple, set, frozenset)):
            yield set(tags)
        else:
            yield {tags}

def ROC(TEST, REF, NUM_CUTS=100):
    """
    ROC
    Calculates the sensitivity and specificity of the classifier for
    multiple cutoffs of a continuously valued TEST value.
    @param: TEST      A list of ProbDistI in which TEST[I] is a probability
                      distribution over all tags for the I'th item.
    @param: REF       A list of CORRECT tags for the I'th item. This can
                      be a list of tags, or a list of lists/sets (for
                      items with multiple tags).
    @kwarg: NUM_CUTS  The number of linearly spaced cutoffs to use
                      when determining SENS and SPEC.
    @returns: (SENS, SPEC, AUC) values for this classifier.
    """
    # conf_mat columns: [TP, FP, TN, FN]
    cutoffs = numpy.linspace(0, 1, num=NUM_CUTS)
    conf_mat = numpy.zeros((NUM_CUTS, 4))
    for this_test, this_ref in zip(TEST, setNormalizer(REF)):
        samples = set(this_test.samples())
        for this_ind, this_cut in enumerate(cutoffs):
            test_pos = {x for x in samples if this_test.prob(x) > this_cut}
            test_neg = samples - test_pos
            conf_mat[this_ind, 0] += len(test_pos & this_ref)  # TP
            conf_mat[this_ind, 1] += len(test_pos - this_ref)  # FP
            conf_mat[this_ind, 2] += len(test_neg - this_ref)  # TN
            conf_mat[this_ind, 3] += len(test_neg & this_ref)  # FN
    sens = conf_mat[:, 0] / (conf_mat[:, 0] + conf_mat[:, 3])
    spec = conf_mat[:, 2] / (conf_mat[:, 2] + conf_mat[:, 1])
    # cutoffs ascend, so 1-spec descends; negate trapz for a positive AUC
    auc = -numpy.trapz(sens, 1 - spec)
    return sens, spec, auc
[/sourcecode]
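As a sanity check on the trapezoidal integration, here is a standalone sketch with hand-made scores (not my actual predictions): a classifier whose scores perfectly separate positives from negatives should integrate to an AUC of 1.0.

```python
# Made-up scores: positives all score higher than negatives.
pos_scores = [0.8, 0.9]  # items whose true label is positive
neg_scores = [0.1, 0.2]  # items whose true label is negative

num_cuts = 101
cutoffs = [i / (num_cuts - 1) for i in range(num_cuts)]

sens, fpr = [], []  # fpr = 1 - specificity
for c in cutoffs:
    tp = sum(1 for s in pos_scores if s > c)
    fn = len(pos_scores) - tp
    fp = sum(1 for s in neg_scores if s > c)
    tn = len(neg_scores) - fp
    sens.append(tp / (tp + fn))
    fpr.append(fp / (fp + tn))

# Trapezoidal rule; cutoffs ascend so fpr descends, hence the minus sign.
auc = -sum((fpr[i + 1] - fpr[i]) * (sens[i + 1] + sens[i]) / 2
           for i in range(num_cuts - 1))
# auc -> 1.0 for this perfectly separating score set
```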
There are a few design quirks here. There are numerous ways I could loop through the cutoffs and predictions. In my Matlab implementations I always use the cutoff loop as the outermost loop, which reduces the number of additions made to "sliced" elements of the "conf_mat" array. However, I have one design consideration: I plan to provide the TEST predictions as a generator function (I'll post more on this later). Each prediction may take ~10-30 seconds, and storing the ProbDistI for every prediction in a list causes memory problems. This means that I can only loop through TEST once.
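The single-pass constraint looks roughly like this. A hypothetical sketch, where `predict_one` stands in for whatever expensive per-record prediction the classifier performs:

```python
def prediction_stream(records, predict_one):
    """Yield predictions lazily so only one is in memory at a time."""
    for record in records:
        yield predict_one(record)  # computed on demand, one record at a time

# Toy demo with a cheap stand-in predictor (len) instead of a classifier.
stream = prediction_stream(["a", "bb", "ccc"], len)
first_pass = list(stream)   # consumes the generator: [1, 2, 3]
second_pass = list(stream)  # already exhausted: []
```

This is why the prediction loop has to be the outer loop: a second pass over the generator would come back empty.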
Hope this helps people avoid just using the TP count of their predictor as the "goodness" measure.
-Will