Hi all,
As I’ve discussed before, I’m using the Python NLTK to predict ontology tags from the Ontology Foundry for microarray annotations. I have a background in machine learning and prediction algorithms, but the “natural language” part is new to me, so for now I’m blindly gathering all of the “features” I can from each sentence and will sort them out later.
The nltk.classify modules require a training set of feature-dicts paired with known labels. Below is a small snippet of tag/sentence pairs that I’ve extracted from the ontologies. The first term is the ontology tag; the second is the source of the subsequent sentence/fragment: ON for “official name”, SYN for “synonyms”, DEF for “definitions”, and COM for “comments”.
cl:0000052 ON totipotent stem cell
cl:0000052 DEF a stem cell from which all cells of the body can form.
cl:0000053 ON enamel secreting cell
cl:0000050 ON megakaryocyte erythroid progenitor cell
cl:0000050 DEF a progenitor cell committed to the megakaryocyte and erythroid lineages.
cl:0000050 SYN colony forming unit erythroid megakaryocyte
cl:0000050 SYN cfu-em
cl:0000050 SYN mep
cl:0000051 ON common lymphocyte progenitor
cl:0000051 DEF a progenitor cell committed to the lymphoid lineage.
cl:0000051 SYN early lymphocyte progenitor
cl:0000051 SYN committed lymphopoietic stem cell
cl:0000051 SYN lymphoid stem cell
cl:0000051 SYN lymphopoietic stem cell
cl:0000051 SYN clp
cl:0000051 SYN elp
cl:0000056 ON myoblast
cl:0000056 DEF an embryonic (precursor) cell of the myogenic lineage that develops from the mesoderm. they undergo proliferation, migrate to their various sites, and then differentiate into the appropriate form of myocytes.
cl:0000057 ON fibroblast
cl:0000057 DEF a connective tissue cell which secretes an extracellular matrix rich in collagen and other macromolecules.
cl:0000054 ON bone matrix secreting cell
cl:0000055 ON non-terminally differentiated cell
cl:0000055 DEF a precursor cell with a limited number of potential fates.
cl:0000055 SYN blast cell
cl:0000058 ON chondroblast
cl:0000059 ON ameloblast
cl:0000059 DEF a cylindrical epithelial cell in the innermost layer of the enamal organ. their functions include contribution to the development of the dentinoenamel junction by the deposition of a layer of the matrix, thus producing the foundation for the prisms (the structural units of the dental enamal), and production of the matrix for the enamel prisms and interprismatic substance. (from jablonski's dictionary of dentistry, 1992).
fma:36042 ON proximal epiphyseal plate of fourth metatarsal bone
fma:36043 ON proximal epiphyseal plate of right fourth metatarsal bone
fma:36046 ON distal epiphyseal plate of right fourth metatarsal bone
fma:36047 ON distal epiphyseal plate of left fourth metatarsal bone
fma:36044 ON proximal epiphyseal plate of left fourth metatarsal bone
fma:36045 ON distal epiphyseal plate of fourth metatarsal bone
doid:11525 ON iris and ciliary body vascular disorder
doid:11525 SYN vascular disorders of iris and ciliary body
doid:11525 SYN iris and ciliary body vascular disorders nos (disorder)
doid:11525 SYN vascular disorder of iris and/or ciliary body -retired-
doid:11525 SYN iris and ciliary body vascular disorder (disorder)
doid:11524 ON hyphema of iris and ciliary body
doid:11527 ON laryngostenosis
doid:11527 SYN stenosis of larynx
doid:11527 SYN stenosis of larynx (disorder)
doid:11521 ON hypertensive heart and renal disease with both (congestive) heart failure and renal failure
doid:11521 SYN hypertensive heart and renal disease with both (congestive) heart failure and renal failure (disorder)
doid:11521 SYN hypertensive heart and renal disease, unspecified, with heart failure and renal failure
doid:11520 ON benign hypertensive renal disease
doid:11520 SYN hypertensive renal disease, benign
doid:11520 SYN benign hypertensive renal disease (disorder)
doid:11520 SYN hypertensive renal disease, benign, without mention of renal failure
doid:11523 ON food poisoning due to clostridium perfringens
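For reference, here’s a minimal sketch of how records in that shape could be split back into (tag, source, text) triples even when they run together on one line; the regex and function name are my own illustrations, not part of NLTK:

```python
import re

# One record is: a tag like "cl:0000052", a source code (ON/SYN/DEF/COM),
# then free text running up to the next tag/source pair or end of input.
RECORD = re.compile(
    r'(\w+:\d+)\s+(ON|SYN|DEF|COM)\s+(.*?)'
    r'(?=\s+\w+:\d+\s+(?:ON|SYN|DEF|COM)\s|\s*$)',
    re.S)

def parse_records(text):
    """Return a list of (ontology_tag, source, sentence) triples."""
    return [m.groups() for m in RECORD.finditer(text)]
```

For example, `parse_records("cl:0000052 ON totipotent stem cell cl:0000052 DEF a stem cell.")` yields two triples, one for the ON record and one for the DEF record.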
Based on the examples I’ve seen in the NLTK documentation, I’ve decided to use the presence of certain nouns, adjectives, and noun/adjective phrases as my feature sets. Below is a simple implementation:
import itertools as IT

import nltk

def LabelFeatures(SENT):
    tokens = nltk.tokenize.word_tokenize(SENT)
    tagged_tokens = nltk.tag.pos_tag(tokens)
    valid_tags = set(['NN', 'NNS', 'JJ'])
    feat_dict = {}
    # single-word features: each noun and each adjective
    for this_tok in tagged_tokens:
        if this_tok[1] in ('NN', 'NNS'):
            feat_dict['has-noun:(%s)' % this_tok[0]] = True
        if this_tok[1] == 'JJ':
            feat_dict['has-adj:(%s)' % this_tok[0]] = True
    # phrase features: consecutive runs of nouns/adjectives longer than one token
    for key_val, group in IT.groupby(tagged_tokens, lambda x: x[1] in valid_tags):
        if key_val:
            g_list = list(group)
            if len(g_list) > 1:
                feat_dict['has-phrase:(%s)' % ' '.join(nltk.tag.untag(g_list))] = True
    return feat_dict
The DecisionTree and NaiveBayes algorithms should then select the features (words and word phrases) that occur often across repeated examples of the same tag and rarely under different tags.
I’m planning to run a set of simple tests to see how well this works. In general, I’ll train on 2/3 of the examples and test on the remaining third. I also have a small sample (~400 and slowly growing) of hand-annotated microarray records that I can use as a test set as well.
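A minimal sketch of that 2/3–1/3 split (the function name and holdout scheme are just one way to do it, not my real harness):

```python
def split_train_test(labeled_examples, holdout_every=3):
    """Hold out every third example for testing; train on the rest."""
    train, test = [], []
    for i, example in enumerate(labeled_examples):
        if i % holdout_every == holdout_every - 1:
            test.append(example)
        else:
            train.append(example)
    return train, test
```

The train list of (feature_dict, label) pairs would then go to e.g. nltk.NaiveBayesClassifier.train(train), and accuracy could be checked with nltk.classify.accuracy(classifier, test).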
However, I’m currently running into a MemoryError in the .train methods. I have a total of 157,021 tag/sentence pairs, but I hit the memory error at ~2,100 pairs. I use generators to ensure that I’m not loading the entire file into memory at once, so I suspect the conditional probability tables in NaiveBayes are exploding as I add more training examples.
Any suggestions would be welcome,
-Will
As mentioned in the comments on http://streamhacker.wordpress.com/2009/02/23/chunk-extraction-with-nltk/ … it seems the RegexpParser in NLTK can easily replace the itertools.groupby step and will be much simpler.
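A sketch of what that replacement might look like, chunking already-tagged tokens with a grammar (the grammar and the "TERM" label are my guesses at mirroring the groupby logic, not code from that post):

```python
import nltk

# Chunk any run of two or more nouns/adjectives into a "TERM" phrase,
# mirroring the itertools.groupby run-detection above.
chunker = nltk.RegexpParser('TERM: {<JJ|NN|NNS><JJ|NN|NNS>+}')

def phrase_features(tagged_tokens):
    feat_dict = {}
    tree = chunker.parse(tagged_tokens)
    for subtree in tree.subtrees(lambda t: t.label() == 'TERM'):
        phrase = ' '.join(word for word, tag in subtree.leaves())
        feat_dict['has-phrase:(%s)' % phrase] = True
    return feat_dict
```

Because RegexpParser works on (word, tag) pairs directly, the output of nltk.tag.pos_tag drops straight in.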
Yeah, the NaiveBayesClassifier doesn’t scale, mostly due to the use of the probability classes. You’ll probably be better off implementing your own classifier, maybe backed by a database so you don’t run out of memory. BerkeleyDB is pretty fast (way faster than shelve), but it can be tricky to get working correctly.
I’m also not convinced the featureset model of classification is the way to go for classifying sentences or phrases, but I haven’t figured out a better way yet.
I spent a few hours today trying to use shelve to replace the dictionary of FreqDists for each conditional probability. I got it running, but it is mind-numbingly slow: it took ~4 hours to train on 4,000 pairs. I like your relational database idea; once I have populated the database correctly, it should be trivial to look up the conditional probabilities using table joins. And I only have to create the database once.
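For anyone curious, a minimal sketch of the shelve-backed counting approach (class and method names are illustrative, not my actual implementation):

```python
import shelve

class DiskConditionalCounts(object):
    """Store (label, feature) counts on disk instead of in a dict of FreqDists."""

    def __init__(self, path):
        self.db = shelve.open(path)

    def increment(self, label, feature):
        key = '%s|%s' % (label, feature)   # shelve keys must be strings
        self.db[key] = self.db.get(key, 0) + 1

    def count(self, label, feature):
        return self.db.get('%s|%s' % (label, feature), 0)

    def close(self):
        self.db.close()
```

Every increment goes through the disk-backed store, which is exactly why it is so much slower than in-memory FreqDists.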