Association of genes to genetically inherited diseases using data mining.
Although approximately one-quarter of the roughly 4,000 genetically inherited diseases currently recorded in the relevant databases (LocusLink, OMIM) have already been linked to a region of the human genome, about 450 of these have no known associated gene. Finding disease-related genes requires laborious examination of hundreds of possible candidate genes, some of which are not even annotated (see, for example, refs 3,4). The public availability of the human genome draft sequence has fostered new strategies to map molecular functional features of gene products to complex phenotypic descriptions, such as those of genetically inherited diseases. Building on recent progress in the systematic annotation of genes using controlled vocabularies, we have developed a scoring system for the possible functional relationships of human genes to 455 genetically inherited diseases that have been mapped to chromosomal regions without assignment of a particular gene. In a benchmark of the system with 100 known disease-associated genes, the disease-associated gene was among the 8 best-scoring genes with a 25% chance and among the 30 best-scoring genes with a 50% chance, showing that there is a relationship between the score of a gene and its likelihood of being associated with a particular disease. The scoring also indicates that, for some diseases, the chance of identifying the underlying gene is higher than for others.
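The benchmark figures quoted above (known gene among the 8 or 30 best-scoring candidates) are top-k hit rates over a set of test cases. The following minimal Python sketch shows how such a rate could be computed from per-gene scores. It is an illustration only, not the authors' scoring system: the function names `rank_of_true_gene` and `top_k_hit_rate`, the gene labels, and all scores are hypothetical, and ties are assumed to be resolved in favour of the known gene.

```python
from typing import Dict, List, Tuple


def rank_of_true_gene(scores: Dict[str, float], true_gene: str) -> int:
    """Return the 1-based rank of true_gene among candidates sorted by
    descending score. Assumption: ties are resolved optimistically, so the
    rank is one plus the number of candidates with a strictly higher score."""
    if true_gene not in scores:
        raise KeyError(f"{true_gene!r} is not among the candidate genes")
    target = scores[true_gene]
    return 1 + sum(1 for s in scores.values() if s > target)


def top_k_hit_rate(ranks: List[int], k: int) -> float:
    """Fraction of benchmark cases whose known gene is ranked within the top k."""
    if not ranks:
        raise ValueError("at least one benchmark case is required")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical toy benchmark: each case pairs the candidate scores for one
# chromosomal region with the gene known to cause the disease.
benchmark: List[Tuple[Dict[str, float], str]] = [
    ({"GENE_A": 0.9, "GENE_B": 0.4, "GENE_C": 0.1}, "GENE_A"),
    ({"GENE_A": 0.2, "GENE_B": 0.7, "GENE_C": 0.5}, "GENE_C"),
    ({"GENE_A": 0.8, "GENE_B": 0.6, "GENE_C": 0.3}, "GENE_C"),
    ({"GENE_A": 0.1, "GENE_B": 0.3, "GENE_C": 0.2}, "GENE_B"),
]

ranks = [rank_of_true_gene(scores, gene) for scores, gene in benchmark]

for k in (1, 2, 3):
    print(f"top-{k} hit rate: {top_k_hit_rate(ranks, k):.2f}")
```

Under this reading, the abstract's result corresponds to a top-8 hit rate of about 0.25 and a top-30 hit rate of about 0.50 over the 100 benchmark genes.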