Our current research is primarily in bioinformatics, a discipline that has emerged at the intersection of biology and computing. While computing is ubiquitous in all areas of science, including biology, it has been used there mainly for numerical analysis, e.g., for solving equations. However, the real strength of modern computing science lies in its ability to manipulate symbols, and a great number of algorithms have been developed for that purpose. It therefore came as a revelation that many biological entities can be treated as symbols (e.g., DNA molecules as strings over the letters A, C, G and T). This revelation, together with the sheer volume of biological data, which has become too large to handle manually, has brought bioinformatics to the forefront in solving many biological problems.

Statistical Learning

             The challenges posed by large-scale heterogeneous data sets in biology are not merely a matter of computation, but more a matter of formulating hypotheses to interpret the data. Probabilistic and statistical approaches are particularly useful for formulating and testing hypotheses when either the system is not understood at the level of first principles or its underlying processes are stochastic by nature. Living cells, viewed as a system, exemplify both of these conditions.

             Statistical modeling and learning methods, such as Markov models and support vector machines (SVMs), have already been applied to the study of DNA and protein sequences with remarkable success. However, current approaches remain, by and large, phenomenological, i.e., they make little use of domain knowledge. Take annotation as an example. Annotation, in the context of genomics, means assigning biochemical and cellular functions to newly discovered genes. The state of the art is to search for sequence similarity with one or more genes whose functions are already known. If a statistically significant similarity is found, the function of the known gene is assigned to the newly discovered one. In principle, however, much more should be possible: according to the central dogma of molecular biology, a gene's sequence specifies its protein product, and the protein's sequence in turn largely determines its structure and function. To move toward predictions of this kind, our research focuses on incorporating more domain knowledge into statistical learning methods and on improving their effectiveness, expressiveness and interpretability. Specifically, we would like to devise more reliable statistical scoring schemes and to develop learning algorithms for acquiring probability distributions, which from a Bayesian point of view are as essential as the observed data.
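
As an illustration of the kind of statistical scoring discussed above, the following is a minimal sketch of a first-order Markov model for DNA sequences, used to compute a log-odds score between two models. It is not the group's own method: the function names, the toy training sequences and the pseudocount value are all assumptions made for the example. The pseudocount plays the role of a very simple prior over transition probabilities, which is one elementary way in which a probability distribution can enter alongside the observed data.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(b | a).

    The pseudocount (an assumed value) acts as a simple symmetric prior,
    so transitions never seen in training still get non-zero probability.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    probs = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        probs[a] = {b: counts[a][b] / total for b in ALPHABET}
    return probs


def log_odds(seq, model_pos, model_neg):
    """Sum of log-ratios of transition probabilities under the two models."""
    return sum(
        math.log(model_pos[a][b] / model_neg[a][b])
        for a, b in zip(seq, seq[1:])
    )


# Hypothetical toy training data: one CG-rich class, one AT-rich class.
POS_MODEL = train_markov(["CGCGCGCG", "GCGCGCGC"])
NEG_MODEL = train_markov(["ATATATAT", "TATATATA"])

if __name__ == "__main__":
    for query in ("CGCGCG", "ATATAT"):
        print(query, round(log_odds(query, POS_MODEL, NEG_MODEL), 3))
```

A positive score means the query looks more like the first class than the second; a real scoring scheme would also need a significance estimate for that score.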

Homology Identification and Protein Family Classification

             Annotation is a central issue in making good use of DNA sequence data. Similarity-based homology detection has been essentially the only computational tool for addressing this issue, the alternative being authoritative but more expensive biochemical experiments. Classifying proteins into families according to their functions can enhance the accuracy of homology detection algorithms. Our research along this line includes graph-theoretic clustering algorithms, to tackle multidomain proteins; support vector machines combined with pairwise similarity, to detect more remote homology; and hidden Markov models (HMMs), to profile specific protein families such as transporter proteins. Inspired by our success with SVMs, our next goal is to exploit the discriminative power of SVMs for general homology detection by incorporating domain-specific knowledge into SVMs via their kernel functions. Because HMMs cannot capture some long-range correlations in protein sequences, we would also like to develop new models.
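
To make the idea of encoding sequence knowledge in a kernel function concrete, here is a minimal sketch of a k-mer "spectrum" kernel between protein sequences. This is a generic string kernel chosen for illustration, not the kernel used in the research described above; the function names, the choice of k = 3 and the example sequences are assumptions.

```python
import math
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=3):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[kmer] for kmer, n in cx.items() if kmer in cy)


def normalized_kernel(x, y, k=3):
    """Cosine-normalized kernel, so that K(x, x) = 1 for any non-trivial x."""
    denom = math.sqrt(spectrum_kernel(x, x, k) * spectrum_kernel(y, y, k))
    return spectrum_kernel(x, y, k) / denom if denom else 0.0


def gram_matrix(seqs, k=3):
    """Pairwise kernel matrix over a list of sequences."""
    return [[normalized_kernel(a, b, k) for b in seqs] for a in seqs]


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    seqs = ["MKVLAAGIVGL", "MKVLAAGLVAL", "GSSHHHHHHSS"]
    for row in gram_matrix(seqs):
        print([round(v, 3) for v in row])
```

A matrix of this form can be handed to any SVM implementation that accepts a precomputed kernel. Domain knowledge would enter by changing what the kernel counts, for example by weighting k-mers or allowing mismatches.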

Comparative Genomics

             To extract more information from primary sequences than simple similarity comparisons of individual proteins can offer, it is desirable to compare whole genomes. Our research has been concerned with comparing genomes from perspectives that go beyond simple sequence similarity. Specifically, genomes have been compared based on their metabolic pathway profiles. Such comparisons can provide valuable insight into areas such as metabolic engineering. The hierarchical profiling methodology developed for this purpose is also useful for other problems that involve comparing profiles whose attributes bear a hierarchical relationship. Further work includes studying pathway evolution.

Biological Networks and Metabolic Pathway Evolution

             A living cell is a complex system. Genes, proteins and the interactions among them collectively carry out cellular functions. To unravel the complex machinery of cellular processes, genetic networks were recently proposed as a possible methodology for modeling genetic interactions using gene expression data from DNA microarrays. Current network models (such as Boolean networks) suffer from either weak inferential power or high computational complexity. Our research explores concepts and techniques for building better network models. We also believe that the evolution of metabolic pathways can be better understood in the context of networks.
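
For readers unfamiliar with Boolean networks, the following minimal sketch shows the basic idea: each gene is either on (1) or off (0), all genes are updated synchronously from the current state, and the long-run behavior is a cycle of states called an attractor. The three-gene network and its update rules are invented for illustration and do not correspond to any real regulatory system.

```python
def step(state, rules):
    """Apply every update rule to the current state at the same time."""
    return tuple(rule(state) for rule in rules)


def find_attractor(state, rules):
    """Follow the trajectory from `state` until a state repeats.

    Returns the list of states that form the cycle (the attractor).
    """
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)
    return trajectory[seen[state]:]


# Hypothetical rules for genes (A, B, C):
#   A is repressed by C; B is activated by A; C needs both A and B.
RULES = (
    lambda s: int(not s[2]),
    lambda s: s[0],
    lambda s: s[0] & s[1],
)

if __name__ == "__main__":
    for s in find_attractor((1, 0, 0), RULES):
        print(s)
```

With n genes there are 2^n possible states, so exhaustively exploring the state space quickly becomes infeasible as networks grow, which is one source of the computational cost mentioned above.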


This website was designed and developed by Robel Kahsay, Ph.D.,
and is maintained by Roger Craig