Table 2 A text mining approach using an entropy-based scoring function rediscovers the molecular function of proteins sharing PROSITE motifs

From: The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications

| Motif | # of proteins | # of documents | Terms |
|---|---|---|---|
| EF_HAND | 36 | 183 | ef-hand, calcium-bind, calcium, ca 2+, calcium-bind protein, ca, 2+ bind, 2+, ef-hand motif, calmodulin |
| TRYPSIN_SER | 11 | 108 | serin proteinas, proteinas, chymotrypsin, serin, serin proteas, elastase, ser-195, his-57, proteinas especially, proteolyt |
| PROTEIN_KINASE_ST | 15 | 107 | protein kinas, catalyt domain, phosphoryl, substrat, autophosphoryl, phosphoryl site, kinas, threonin, catalyt, constitutively active |

  1. The method extracts text from the abstracts of references annotated in each protein's Swiss-Prot record, pre-processes the text (tokenization into terms, removal of non-content words, and basic stemming to normalize word forms), and scores terms based on their distribution across proteins and their relative significance in the entire corpus of Swiss-Prot referenced documents. With no additional normalization, concept and word redundancy may be observed. Although still very preliminary, the method is able to capture the molecular function for each cluster of proteins shown: "ef-hand" and "calcium binding" for EF_HAND; "serine proteinase", "proteolysis", and the active site residues "ser-195" and "his-57" for TRYPSIN_SER; and "protein kinase", "phosphorylation", "catalytic domain" and the substrate residue "threonine" for PROTEIN_KINASE_ST.
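The pipeline described in the note above (tokenize, drop non-content words, stem, then score terms by how they distribute across a cluster's proteins and how they stand out from the wider corpus) can be sketched roughly as follows. The exact scoring function is not given here, so everything in this block is an illustrative assumption: the stop-word list, the `crude_stem` suffix stripper, and the score itself, which multiplies the normalized entropy of a term's counts across proteins by its log enrichment over a background corpus. It handles single-word terms only, whereas the table also lists multi-word terms.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for", "by", "with"}


def crude_stem(token):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in ("ation", "ing", "es", "e", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, remove stop words, and stem."""
    tokens = re.findall(r"[a-z0-9][a-z0-9+\-]*", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def normalized_entropy(counts):
    """Entropy of a count vector scaled to [0, 1]; 1 means evenly spread."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h / math.log(len(counts))


def score_terms(cluster_docs, background_docs):
    """Rank stemmed terms for one protein cluster.

    cluster_docs: dict mapping protein ID -> list of abstract strings.
    background_docs: list of abstract strings standing in for the full corpus.
    The score (assumed, not the published function) is
    spread across proteins * log enrichment over the background, floored at 0.
    """
    per_protein = {
        pid: Counter(t for doc in docs for t in preprocess(doc))
        for pid, docs in cluster_docs.items()
    }
    cluster_counts = Counter()
    for counts in per_protein.values():
        cluster_counts.update(counts)
    background_counts = Counter(t for doc in background_docs for t in preprocess(doc))

    cluster_total = sum(cluster_counts.values())
    background_total = sum(background_counts.values())

    scores = {}
    for term, count in cluster_counts.items():
        spread = normalized_entropy([c[term] for c in per_protein.values()])
        cluster_freq = count / cluster_total
        # Add-one smoothing so terms absent from the background do not divide by zero.
        background_freq = (background_counts[term] + 1) / (background_total + 1)
        enrichment = math.log(cluster_freq / background_freq)
        scores[term] = spread * max(enrichment, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Under these assumptions, a term scores highly only if it appears across many proteins in the cluster and is also more frequent there than in the background. A term concentrated in a single protein's references, or one common throughout the corpus, scores zero.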