Mohammad Shabbir Hasan

Ph.D. Student, Computer Science, Virginia Tech



Indels, though differing in allele sequence and position, are biologically equivalent when they lead to the same altered sequence. Storing biologically equivalent indels as distinct entries in databases causes data redundancy and may mislead downstream analysis and interpretation. About 10% of the human indels stored in dbSNP have previously been reported as redundant. It is thus desirable to have a unified system for identifying and representing equivalent indels in publicly available databases. This paper describes UPS-indel, a utility tool that creates a universal positioning system for indels so that equivalent indels can be uniquely determined by their coordinates in the new system. UPS-indel identifies nearly 15% of the indels in dbSNP (version 142) as redundant across all human chromosomes, higher than previously reported. When applied to the COSMIC coding and noncoding indel datasets, UPS-indel identifies nearly 29% and 13% of the indels as redundant, respectively. A comparison with the existing variant normalization tools vt normalize, BCFtools, and GATK LeftAlignAndTrimVariants shows that UPS-indel identifies 456,352 more redundant indels in dbSNP, 2,118 more in the COSMIC coding dataset, and 553 more in the COSMIC noncoding dataset, in addition to the ones reported jointly by these tools. UPS-indel is theoretically proven to find all equivalent indels, and is thus exhaustive. UPS-indel is written in C++ and is freely available to download here. The online version of UPS-indel is available here.
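The definition of equivalence above can be illustrated with a minimal Python sketch. It is not the UPS-indel algorithm: it simply applies each indel to a reference string and compares the altered sequences. The function names, the `(position, kind, sequence)` tuple layout, the 0-based coordinates, and the toy reference are all assumptions made for illustration.

```python
def apply_indel(reference, pos, kind, seq):
    """Return the sequence obtained by applying one indel to `reference`.

    `pos` is a 0-based index (an assumption of this sketch): an insertion
    places `seq` before reference[pos]; a deletion removes len(seq) bases
    starting at reference[pos].
    """
    if kind == "ins":
        return reference[:pos] + seq + reference[pos:]
    if kind == "del":
        if reference[pos:pos + len(seq)] != seq:
            raise ValueError("deleted bases do not match the reference")
        return reference[:pos] + reference[pos + len(seq):]
    raise ValueError("kind must be 'ins' or 'del'")


def are_equivalent(reference, indel_a, indel_b):
    """Two indels are equivalent when they yield the same altered sequence."""
    return apply_indel(reference, *indel_a) == apply_indel(reference, *indel_b)


# Toy reference containing a CA repeat, where equivalent indels are common.
reference = "GGCACACACTT"

del_left = (2, "del", "CA")    # remove the first "CA" of the repeat
del_right = (7, "del", "AC")   # remove "AC" at the end of the repeat
ins_left = (2, "ins", "CA")    # insert "CA" at the start of the repeat
ins_right = (9, "ins", "AC")   # insert "AC" after the repeat
del_other = (2, "del", "C")    # a different deletion, for contrast

print(are_equivalent(reference, del_left, del_right))
print(are_equivalent(reference, ins_left, ins_right))
print(are_equivalent(reference, del_left, del_other))
```

The first two pairs differ in both position and allele sequence yet produce identical altered sequences, which is the kind of redundancy described above. Comparing every pair this way would not scale to a database such as dbSNP; the sketch only makes the definition concrete.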

[Project Page] [Paper (Under Submission)]

Assessment of indel calling tools using real short-read data

Insertion and deletion (indel), a common form of genetic variation, has been shown to cause or contribute to human genetic diseases and cancer. With the advance of Next Generation Sequencing technology, many indel calling tools have been developed; however, evaluations and comparisons of these tools using large-scale real data are still scant. Here we evaluated seven popular and publicly available indel calling tools, GATK Unified Genotyper, VarScan, Pindel, SAMtools, Dindel, GATK HaplotypeCaller, and Platypus, using low-coverage data for 78 human genomes from the 1000 Genomes Project. Comparing the indels called by these tools with a known set of indels, we found that Platypus outperforms the other tools. In addition, a high percentage of known indels remain undetected, and the number of common indels called by all seven tools is very low. These findings indicate the need to improve the existing tools or develop new algorithms to achieve reliable and consistent indel calling results.
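The kind of comparison described above can be sketched in a few lines of Python. This is an illustration only: the abstract does not state which metrics the study used, so the sensitivity and precision definitions, the `(chromosome, position, ref, alt)` tuple representation, the tool names, and all of the records below are assumptions.

```python
def sensitivity(called, known):
    """Fraction of known indels that the tool called."""
    return len(called & known) / len(known) if known else 0.0


def precision(called, known):
    """Fraction of the tool's calls that are in the known set."""
    return len(called & known) / len(called) if called else 0.0


# Hypothetical known set and per-tool call sets, keyed by
# (chromosome, position, reference allele, alternate allele).
known = {
    ("1", 100, "A", "AT"),
    ("1", 200, "GC", "G"),
    ("1", 300, "T", "TAA"),
    ("2", 50, "CAG", "C"),
}
calls = {
    "tool_a": {("1", 100, "A", "AT"), ("1", 200, "GC", "G"), ("2", 75, "G", "GT")},
    "tool_b": {("1", 100, "A", "AT"), ("1", 300, "T", "TAA")},
}

for tool, called in calls.items():
    print(tool, sensitivity(called, known), precision(called, known))

# Indels called by every tool, and known indels that no tool detected.
common_to_all = set.intersection(*calls.values())
undetected = known - set.union(*calls.values())
print(sorted(common_to_all))
print(sorted(undetected))
```

Matching on exact tuples treats equivalent indels reported at different positions as different calls, so a real comparison would normally normalize the calls first.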


SPAI: Single Platform for Analyzing Indels

SPAI, which stands for Single Platform for Analyzing Indels, is a workbench intended to aid research on indel calling. Existing command line tools require Computer Science (CS) expertise to run, so our emphasis here is on creating an environment where users from non-CS backgrounds can run the existing indel calling tools through a Graphical User Interface (GUI). In addition, this interactive tool, written in Java, provides several features: downloading alignment files (BAM files) from the 1000 Genomes Project, viewing alignment files and variant files in a tabular format, cross-validating the indel calling results from different tools, comparing the results from different indel calling tools with the benchmark dataset, and graphically visualizing the comparison results.

[Paper 1] [Paper 2]


Comparison between Dindel and P-Dindel for pooled samples. Comparison between Dindel and P-Dindel for diploid samples.

Insertion and deletion (indel) of DNA bases is the second most common form of genetic variation in human genomes and is linked to various genetic diseases and cancers. With next generation sequencing technology, indels are identified through short read sequence alignment and subsequent indel calling. The open source indel calling program Dindel has relatively high sensitivity but a prohibitive running time. To accelerate Dindel's indel calling, we introduce P-Dindel, a multi-core, multi-threaded implementation of Dindel. Results show that P-Dindel achieves a 4X speedup over Dindel for both diploid and pooled samples while producing the same results as Dindel.

[Project Page] [Paper]

Investigating gene relationships in microarray expression

GO graph for KRAS positive tissues. GO graph for normal tissues.

The aim of this study is to group biologically relevant genes by gene clustering. We developed a combined algorithm in which hierarchical clustering is run first and its output determines the initial number of clusters for k-means clustering. The hierarchical step provides a complete hierarchy of clusters, and the k-means step then produces tighter clusters than hierarchical clustering alone, so the combination overcomes the limitations of both algorithms. We used 40 samples (20 normal tissues and 20 KRAS positive tissues) and 464 genes from an adenocarcinoma dataset; adenocarcinoma is the most frequent type of non-small-cell lung cancer. Applying the combined clustering algorithm yielded 4 clusters for both the normal tissue dataset and the KRAS positive dataset. We then examined the genes in each cluster with respect to their molecular functions, based on Gene Ontology (GO) annotation, to see how the enrichment of molecular functions changes from normal tissues to KRAS positive tissues.
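The combined approach can be sketched in plain Python as follows. This is a hedged illustration, not the study's implementation: the abstract does not say which linkage was used or how the number of clusters is read off the hierarchy, so single linkage, the "largest gap between successive merge distances" rule, seeding k-means with the centroids of the cut, and the toy 2-D points are all assumptions.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def single_linkage(points):
    """Agglomerative clustering; returns [(merge distance, clusters before merge)]."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((d, [list(c) for c in clusters]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return history


def cut_at_largest_gap(history):
    """Assumed rule: stop merging just before the biggest jump in merge distance."""
    gaps = [history[m][0] - history[m - 1][0] for m in range(1, len(history))]
    m = gaps.index(max(gaps)) + 1
    return history[m][1]


def kmeans(points, centroids, iterations=100):
    """Standard k-means starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        updated = [centroid(g) if g else centroids[i] for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return groups, centroids


# Toy data: three well-separated groups of hypothetical expression profiles.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2),
          (9.0, 1.0), (9.2, 1.1), (8.8, 0.9)]

initial_clusters = cut_at_largest_gap(single_linkage(points))
k = len(initial_clusters)
final_groups, final_centroids = kmeans(points, [centroid(c) for c in initial_clusters])
print(k, [len(g) for g in final_groups])
```

The hierarchical step removes the need to guess k in advance, and the k-means step refines the cluster assignments around the resulting centroids. The naive pairwise search is only suitable for small inputs such as this example.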

[Book Chapter] [Paper]

EGID: an ensemble algorithm for improved genomic island detection in genomic sequences

Circular representation of Escherichia coli O157:H7 str. Sakai (NC_002695) showing predicted GIs, with each circle showing the predictions of one program. From the outer to the inner circle, the predicted GIs are from EGID, AlienHunter, COLOMBO SIGI-HMM, INDeGenIUS, IslandPath, and PAI-IDA. The shaded parts show the GIs predicted by EGID and supported by the other programs.

Genomic islands (GIs) are genomic regions that were originally transferred from other organisms. The detection of genomic islands in genomes can lead to many applications in industrial, medical and environmental contexts. Existing computational tools for GI detection suffer from either low recall or low precision, leaving room for improvement. We developed a tool called EGID, which stands for "Ensemble algorithm for Genomic Island Detection". EGID takes the prediction results of existing computational tools, filters them, and generates consensus predictions. Performance comparisons between our ensemble algorithm and existing programs show that it performs 12.14% better than the previously best-known program.

[Project Page] [Paper 1] [Paper 2]
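One simple way to build a consensus from several tools' predictions is per-position voting, sketched below in Python. The actual filtering and consensus rules of EGID are not described here, so this is only an assumed stand-in: the voting threshold, the inclusive `(start, end)` interval format, the tool names, and the coordinates are all hypothetical.

```python
from collections import Counter


def consensus_islands(predictions, min_votes):
    """Merge positions predicted by at least `min_votes` tools into intervals.

    `predictions` maps a tool name to a list of inclusive (start, end)
    intervals. Each tool contributes at most one vote per position.
    """
    votes = Counter()
    for intervals in predictions.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end + 1))
        votes.update(covered)

    supported = sorted(pos for pos, n in votes.items() if n >= min_votes)

    islands = []
    for pos in supported:
        if islands and pos == islands[-1][1] + 1:
            islands[-1][1] = pos          # extend the current island
        else:
            islands.append([pos, pos])    # start a new island
    return [tuple(island) for island in islands]


# Hypothetical predictions from three tools on the same genome.
predictions = {
    "tool_a": [(100, 200), (500, 600)],
    "tool_b": [(150, 250), (520, 580)],
    "tool_c": [(180, 220), (900, 950)],
}

print(consensus_islands(predictions, min_votes=2))
print(consensus_islands(predictions, min_votes=3))
```

Raising `min_votes` trades recall for precision: regions backed by a single tool are dropped, while regions that several tools agree on are kept. Per-base sets keep the sketch short; for whole genomes an interval sweep would use far less memory.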

GIST: Genomic Island Suite of Tools for predicting genomic islands

GIST download window. GIST main window.

Genomic Islands (GIs) are genomic regions that originate from other organisms and are acquired through a process known as Horizontal Gene Transfer (HGT). Detection of GIs plays a significant role in biomedical research, since such alien genomic regions usually contain important features such as pathogenic genes. We have developed a user-friendly graphical user interface, Genomic Island Suite of Tools (GIST), which is a platform for scientific users to predict GIs. This software package includes five commonly used tools: AlienHunter, IslandPath, Colombo SIGI-HMM, INDeGenIUS and Pai-Ida. It also includes an optimization program, EGID, that combines the results of the existing tools for more accurate prediction. The tools in GIST can be used either separately or sequentially. GIST also includes a download feature that collects the input genomes automatically from the FTP server of the National Center for Biotechnology Information (NCBI).

[Project Page] [Paper]

Copyright © 2011- Mohammad Shabbir Hasan. All rights reserved.