Computational Biology with Clicks!

A platform bridging machine learning and omics-biology to unveil the encrypted language of life.

Preprint

bioRxiv

omicML: An Integrative Bioinformatics and Machine Learning Framework for Transcriptomic Biomarker Identification

Joy Prokash Debnath, Kabir Hossen, Md. Sayeam Khandaker, Shawon Majid, Md Mehrajul Islam, Siam Arefin, Preonath Chondrow Dev, Saifuddin Sarker, Tanvir Hossain

bioRxiv, Cold Spring Harbor Laboratory (2025)

DOI: 10.1101/2025.10.25.684517

Graphical Abstract

Abstract

Background

Transcriptomic biomarker discovery remains challenging because of variation across datasets and platforms, the complexity of statistical and computational methods, the need to combine multiple programming languages, and the intricacy of the ML workflows used to evaluate biomarkers. Standard workflows require several stages (quality control, normalization, differential expression), typically executed in R or Python, which creates bottlenecks for non-experts.

Method

We present omicML, an intuitive graphical user interface (GUI) that combines transcriptomic data analysis with machine learning (ML)-based classification by integrating R and Python packages and libraries. It supports both RNA-Seq and microarray data and automates preprocessing and differential expression analysis. Its ML pipeline supports both supervised and unsupervised learning, integrates multiple datasets based on candidate gene signatures, and systematically finalizes the biomarker algorithm.

Result

In a case study, omicML identified a six-gene diagnostic model that distinguishes Mpox (monkeypox virus) infections from those caused by other viruses, including SARS-CoV-2, HIV, Ebola, and varicella-zoster. These results illustrate omicML's capacity to discern clinically relevant biomarkers from complex transcriptome data.

Conclusion

Integrating data normalization, differential gene expression analysis, annotation, heatmap analysis, dataset integration, batch effect removal, machine learning analysis, and functional analysis into a unified system diminishes technical barriers and accelerates the conversion of expression data into diagnostic insights for clinicians and bench scientists.

Key Features

Core capabilities of the omicML framework

Automated Data Preprocessing

Imputation of missing values and batch effect correction, yielding a complete expression matrix suitable for downstream statistical analysis

Detailed information about configuration options, supported data formats, and example workflows is available in the paper.
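As a rough illustration of what imputation and batch correction do to an expression matrix, the sketch below fills missing values with each gene's median and then shifts every batch to a common per-gene mean. This is a deliberately simplified stand-in, not omicML's implementation: the package list includes `mice` and `sva` in R, which implement more sophisticated approaches. The function names, the toy matrix, and the batch labels are all hypothetical.

```python
import numpy as np


def impute_gene_median(expr):
    """Replace NaNs in a genes x samples matrix with each gene's median."""
    expr = np.asarray(expr, dtype=float).copy()
    medians = np.nanmedian(expr, axis=1, keepdims=True)
    missing = np.isnan(expr)
    expr[missing] = np.broadcast_to(medians, expr.shape)[missing]
    return expr


def center_batches(expr, batches):
    """Shift each batch so its per-gene mean equals the overall per-gene mean."""
    expr = np.asarray(expr, dtype=float)
    batches = np.asarray(batches)
    corrected = expr.copy()
    grand_mean = expr.mean(axis=1, keepdims=True)
    for batch in np.unique(batches):
        cols = batches == batch
        batch_mean = expr[:, cols].mean(axis=1, keepdims=True)
        corrected[:, cols] += grand_mean - batch_mean
    return corrected


# Toy data (assumption): 2 genes x 4 samples, two batches, one missing value.
raw = np.array([[1.0, np.nan, 3.0, 12.0],
                [4.0, 6.0, 8.0, 10.0]])
batch_labels = ["A", "A", "B", "B"]

filled = impute_gene_median(raw)
corrected = center_batches(filled, batch_labels)
print(corrected)
```

Mean-centering assumes that batches have a similar biological composition; when batch and condition are confounded, it can remove real signal, which is one reason model-based methods are usually preferred.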

Cross-Platform Compatibility

Supports both RNA-Seq and microarray datasets, enabling cross-platform analysis

Annotation

Broad taxonomic coverage for gene-level annotation across 367 species

Integrated ML Framework

Data standardization, feature selection, benchmarking, nested cross-validation, hyperparameter tuning, feature importance, single-gene model building, multi-gene model (biomarker algorithm) building
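Several of the steps listed above (standardization, feature selection, hyperparameter tuning, nested cross-validation) can be sketched with scikit-learn, which appears in the package list. The example below is a minimal sketch under stated assumptions, not omicML's pipeline: the data are synthetic, and the choice of `SelectKBest`, logistic regression, and the parameter grid is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption) for a samples x genes matrix with binary labels.
X, y = make_classification(
    n_samples=120, n_features=50, n_informative=6, random_state=0
)

# Scaling and feature selection live inside the pipeline so they are refit
# on every training fold and never see the held-out samples.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid (assumption): number of genes kept and regularization strength.
param_grid = {"select__k": [3, 6, 12], "clf__C": [0.1, 1.0, 10.0]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=inner_cv)
outer_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)

print(f"nested CV ROC AUC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The inner loop picks hyperparameters and the outer loop estimates performance on samples that played no part in that choice, which avoids the optimistic bias of tuning and scoring on the same folds.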

Biomarker Discovery and Validation

Gene-model-based identification of biomarkers for distinct conditions

Network Analysis and Functional Enrichment

Contextualization of the candidate biomarkers within biological pathways
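One common way to place candidate genes in a pathway context is over-representation analysis, which asks whether the candidates overlap a pathway more than chance would predict. The sketch below computes the one-sided hypergeometric p-value for that overlap using only the standard library. It illustrates the general technique and is not necessarily the method omicML uses; the gene identifiers and set sizes are hypothetical.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, selected_size, overlap):
    """P(X >= overlap) when drawing selected_size genes without replacement
    from a universe that contains pathway_size pathway genes."""
    max_overlap = min(pathway_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, selected_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical example: 20 measured genes, a 5-gene pathway, 4 candidate genes.
universe = {f"GENE{i}" for i in range(1, 21)}
pathway = {"GENE1", "GENE2", "GENE3", "GENE4", "GENE5"}
candidates = {"GENE1", "GENE2", "GENE10", "GENE11"}

overlap = len(pathway & candidates)
p_value = enrichment_p_value(len(universe), len(pathway), len(candidates), overlap)
print(f"overlap={overlap}, p={p_value:.4f}")
```

In practice the test is repeated for many pathways, so the resulting p-values need a multiple-testing correction before interpretation.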

Workflow

Workflow Diagram

Our Team

The researchers and developers behind omicML

Supervisors

Tanvir Hossain

Principal Investigator

Saifuddin Sarker

Co-Principal Investigator

Preonath Chondrow Dev

Co-Principal Investigator

Research Students

Joy Prokash Debnath

Research Student

Kabir Hossen

Research Student

Shawon Majid

Research Student

Md. Mehrajul Islam

Research Student

Md. Sayeam Khandaker

Research Student

Siam Arefin

Research Student

Package versions

R packages

WGCNA v1.73
DESeq2 v1.44.0
tidyverse v2.0.0
Rtsne v0.17
umap v0.2.10.0
ggplot2 v3.5.1
readr v2.1.5
limma v3.62.2
ape v5.8.1
mice v3.17.0
dplyr v1.1.4
BiocManager v1.30.25
biomaRt v2.60.1
gplots v3.2.0
ggVennDiagram v1.5.2
pheatmap v1.0.12
RColorBrewer v1.1.3
sva v3.52.0
STRINGdb v2.18.0
stringr v1.5.1

Python packages

annotated-types v0.7.0
anyio v4.4.0
bcrypt v4.0.1
certifi v2024.7.4
cffi v1.17.0
click v8.1.7
contourpy v1.3.1
cryptography v43.0.0
cycler v0.12.1
dnspython v2.6.1
ecdsa v0.19.0
email_validator v2.2.0
fastapi v0.112.0
fastapi-cli v0.0.5
fonttools v4.56.0
h11 v0.14.0
httpcore v1.0.5
httptools v0.6.1
httpx v0.27.0
idna v3.7
Jinja2 v3.1.4
joblib v1.4.2
jose v1.0.0
kiwisolver v1.4.8
llvmlite v0.44.0
markdown-it-py v3.0.0
MarkupSafe v2.1.5
matplotlib v3.7.1
mdurl v0.1.2
numba v0.61.0
numpy v1.24.3
packaging v24.2
pandas v2.2.2
passlib v1.7.4
pillow v11.1.0
pyasn1 v0.6.0
pycparser v2.22
pydantic v2.8.2
pydantic_core v2.20.1
Pygments v2.18.0
PyJWT v2.9.0
pynndescent v0.5.13
pyparsing v3.2.1
python-dateutil v2.9.0.post0
python-dotenv v1.0.1
python-jose v3.3.0
python-multipart v0.0.9
pytz v2024.1
PyYAML v6.0.2
rich v13.7.1
rpy2 v3.5.16
rsa v4.9
scikit-learn v1.6.1
scipy v1.15.2
seaborn v0.13.2
shellingham v1.5.4
six v1.16.0
sniffio v1.3.1
SQLAlchemy v2.0.32
starlette v0.37.2
threadpoolctl v3.5.0
tqdm v4.67.1
typer v0.12.3
typing_extensions v4.12.2
tzdata v2024.1
tzlocal v5.2
umap-learn v0.5.7
uvicorn v0.30.5
uvloop v0.21.0
watchfiles v0.23.0
websockets v12.0
xgboost v2.1.4
dask[dataframe] v2024.12.1

Note: Package versions are periodically updated to ensure compatibility and access to the latest features.
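Because the pins change over time, it can help to compare a local Python environment against the listed versions. The following is a minimal sketch using the standard library's `importlib.metadata`; the `PINNED` subset and the `check_pins` helper are illustrative names, not part of omicML.

```python
from importlib.metadata import PackageNotFoundError, version

# A few pins copied from the Python package list; extend as needed.
PINNED = {
    "numpy": "1.24.3",
    "pandas": "2.2.2",
    "scikit-learn": "1.6.1",
    "xgboost": "2.1.4",
    "rpy2": "3.5.16",
}


def check_pins(pins):
    """Return {package: (pinned, installed_or_None, status)} for each pin."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed is None:
            status = "missing"
        elif installed == pinned:
            status = "ok"
        else:
            status = "mismatch"
        report[name] = (pinned, installed, status)
    return report


for name, (pinned, installed, status) in check_pins(PINNED).items():
    print(f"{name:<14} pinned={pinned:<10} installed={installed} status={status}")
```

The comparison is an exact string match, so a compatible but different patch release is reported as a mismatch rather than silently accepted.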

Future Perspectives

To address the gaps in omicML v1.0, future development will prioritize additional modules and data types. Proposed improvements include gene co-expression networks, survival analysis (notably for cancer cohorts), deep-learning frameworks, integration of proteomics data, single-cell RNA-Seq analysis, ChIP-Seq data processing, spatial transcriptomics, multi-omics integration, AI-driven multi-omics modelling, single-cell ATAC-Seq analysis, specialized modules for bulk RNA-Seq, pan-cancer comparative analyses, and AI applications in genomics, primer design, and PCR data analysis. These additions are intended to make omicML a versatile and robust tool for translational omics analysis.