Enrichr-KG: bridging enrichment analysis across multiple libraries

John Erol Evangelista, Zhuorui Xie, Giacomo B Marino, Nhi Nguyen, Daniel J B Clarke, Avi Ma’ayan, Enrichr-KG: bridging enrichment analysis across multiple libraries, Nucleic Acids Research, Volume 51, Issue W1, 5 July 2023, Pages W168–W179, https://doi.org/10.1093/nar/gkad393


Abstract

Gene and protein set enrichment analysis is a critical step in the analysis of data collected from omics experiments. Enrichr is a popular gene set enrichment analysis web-server search engine that contains hundreds of thousands of annotated gene sets. While Enrichr has been useful in providing enrichment analysis with many gene set libraries from different categories, integrating enrichment results across libraries and domains of knowledge can further hypothesis generation. To this end, Enrichr-KG is a knowledge graph database and a web-server application that combines selected gene set libraries from Enrichr for integrative enrichment analysis and visualization. The enrichment results are presented as subgraphs made of nodes and links that connect genes to their enriched terms. In addition, users of Enrichr-KG can add gene-gene links, as well as predicted genes to the subgraphs. This graphical representation of cross-library results with enriched and predicted genes can illuminate hidden associations between genes and annotated enriched terms from across datasets and resources. Enrichr-KG currently serves 26 gene set libraries from different categories that include transcription, pathways, ontologies, diseases/drugs, and cell types. To demonstrate the utility of Enrichr-KG we provide several case studies. Enrichr-KG is freely available at: https://maayanlab.cloud/enrichr-kg.

INTRODUCTION

Gene and protein set enrichment analysis provides context for genes and proteins identified in omics experiments using prior knowledge (1). Enrichment analysis involves querying a gene set against a catalog of annotated gene sets to find significant overlap between the input set and the annotated prior-knowledge gene sets. The results are ranked lists of associated terms, such as pathways, transcription factors, small molecules, diseases and other phenotypes, cell lines, cell types and tissues, and other biological and biomedical terms.
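The overlap test described above can be illustrated with a minimal sketch. It assumes a right-tailed Fisher exact test (equivalently, a hypergeometric tail), which is the statistical method Table 1 lists for Enrichr-KG; the function names, the toy gene identifiers, the toy library, and the background size of 20 000 genes are all hypothetical and chosen only for illustration.

```python
from math import comb


def overlap_p_value(input_genes, annotated_genes, background_size):
    """Right-tailed Fisher exact (hypergeometric) p-value for a set overlap.

    Returns the observed overlap size and the probability of seeing an
    overlap at least that large by chance.
    """
    input_genes, annotated_genes = set(input_genes), set(annotated_genes)
    k = len(input_genes & annotated_genes)  # observed overlap
    n = len(input_genes)                    # size of the query set
    K = len(annotated_genes)                # size of the annotated set
    N = background_size                     # assumed gene universe
    total = comb(N, n)
    # Sum P(X = i) for every overlap size i >= k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return k, tail / total


def rank_terms(input_genes, library, background_size):
    """Rank the terms of one gene set library by ascending p-value."""
    results = []
    for term, genes in library.items():
        k, p = overlap_p_value(input_genes, genes, background_size)
        results.append((term, k, p))
    return sorted(results, key=lambda r: r[2])


# Toy, hypothetical data: identifiers and terms are placeholders.
toy_library = {
    "term_A": {"G1", "G2", "G3", "G4"},
    "term_B": {"G5", "G6", "G7", "G8"},
}
query = {"G1", "G2", "G3", "G9"}
ranked = rank_terms(query, toy_library, background_size=20000)
```

Here `ranked` lists `term_A` first because three of its four genes appear in the query, whereas `term_B` shares none and receives a p-value of 1.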

Enrichr (2–4) is a widely used search engine for gene sets that performs enrichment analysis instantly against many annotated gene sets. In the past 10 years, over 59 million gene sets have been submitted as queries to Enrichr. As of mid-2023, Enrichr has grown to host ∼400 000 annotated gene sets from ∼200 gene set libraries. Such a resource provides a comprehensive collection of knowledge about genes, including their transcriptional and translational regulation, membership in pathways and biological processes, regulation by and binding to drugs, association with diseases and other phenotypes, and expression across cell types, tissues, and cell lines. While Enrichr has been a valuable resource for hypothesis generation in many studies, there is still an opportunity to improve its functionality by, for example, integrating enrichment results across libraries and domains of knowledge. This can be achieved by viewing the results of the enrichment analysis across libraries as an integrated network of genes and their annotations.
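As a rough illustration of what viewing results across libraries as one network can mean, the sketch below flattens per-library enrichment results into gene-term edges and then reports the genes that connect terms from more than one library. The library names, term names, gene identifiers, and function names are hypothetical; this is not the Enrichr-KG implementation, only a sketch of the idea under those assumptions.

```python
from collections import defaultdict


def build_gene_term_edges(results_by_library):
    """Flatten {library: {term: overlapping genes}} into (gene, library, term) edges."""
    edges = []
    for library, terms in results_by_library.items():
        for term, genes in terms.items():
            for gene in sorted(genes):
                edges.append((gene, library, term))
    return edges


def genes_bridging_libraries(edges):
    """Return the genes linked to enriched terms from more than one library."""
    libraries_per_gene = defaultdict(set)
    for gene, library, _term in edges:
        libraries_per_gene[gene].add(library)
    return {
        gene: sorted(libs)
        for gene, libs in libraries_per_gene.items()
        if len(libs) > 1
    }


# Toy, hypothetical enrichment results: top terms per library with the
# query genes that overlap each term.
toy_results = {
    "pathway_library": {"pathway_X": {"G1", "G2"}},
    "disease_library": {"disease_Y": {"G2", "G3"}},
}
edges = build_gene_term_edges(toy_results)
bridges = genes_bridging_libraries(edges)
```

In this toy input, `G2` is the only gene attached to terms from both libraries, so it is the node that would link `pathway_X` to `disease_Y` in an integrated subgraph.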

Network representations of biological molecular systems have been widely applied in biomedical research for abstracting connections between molecular entities (5–10). At the same time, many widely used web-based tools have been developed for network visualization and analysis. For example, STRING provides network visualizations of known and predicted associations between proteins, including physical protein-protein interactions (11). Genes2Networks (G2N) returns a protein interaction subnetwork that connects a set of input genes/proteins based on known protein-protein interactions (12). Another example is GeneMania (13), which visualizes associations between genes using evidence from across domains of knowledge such as co-expression, physical interaction, pathway membership, and shared structural domains. Other notable examples are HumanNet (14) and the DisGeNet Cytoscape app (15), which provide integrated network visualizations centered on disease genes and include predictions and prioritization of gene-disease associations.

Recently, knowledge graphs have gained popularity for integrating and generating hypotheses from connected data (16, 17). Knowledge graphs have been used for studying disease mechanisms (18, 19), mining small molecules for drug discovery (20, 21), and analyzing connections between authors and biomedical entities using PubMed (22). There was also a recent attempt to create a massive knowledge graph that integrates biomedical data for precision medicine (23). Within knowledge graphs, data is stored as triples that describe how a subject entity is related to an object entity. For example, in the statement ‘Drug A’ targets ‘Protein B’, ‘Drug A’ is the subject, ‘Protein B’ is the object, and the connection between them is described by the verb ‘targets’. A collection of such triples, made of different types of entities, forms a network of knowledge that can be navigated and to which graph traversal algorithms and graph completion prediction algorithms can be applied. However, one of the challenges with knowledge graphs is that their size grows rapidly, and querying the graph for useful applications becomes difficult.

At the same time, biomedical and biological knowledge about genes and proteins, as well as other molecular entities, can be stored as annotated gene sets. Such gene sets are useful for performing gene set enrichment analysis (1). Many tools and databases have been developed for performing gene set enrichment analysis, for example, DAVID (24), g:Profiler (25), WebGestalt (26), MSigDB-GSEA (27) and Enrichr (3). Currently, most enrichment analysis tools and databases store knowledge as gene set libraries. While such a storage schema has benefits, for example, performing fast overlap analysis across thousands of gene sets instantly, the comparison of enrichment results across multiple gene set libraries is not trivial. To address this, tools such as EnrichmentMaps (28) visualize gene set enrichment analysis results as ball-and-stick subgraphs that connect genes to their enriched terms. Hence, several gene set enrichment analysis tools with network visualization already exist, each providing different features and advantages. A collection of such tools with a comparison of their features is provided (Table 1).
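The subject-verb-object triples described above, and the graph traversal they enable, can be sketched in a few lines. The first triple is the ‘Drug A’ targets ‘Protein B’ example from the text; the other two triples, the relation name ‘participates in’, and the function names are hypothetical additions made only so that there is something to traverse. Treating every edge as bidirectional is also an assumption of this sketch.

```python
from collections import defaultdict, deque

# Each triple is (subject, predicate, object).
triples = [
    ("Drug A", "targets", "Protein B"),             # example from the text
    ("Protein B", "participates in", "Pathway C"),  # hypothetical
    ("Protein D", "participates in", "Pathway C"),  # hypothetical
]


def build_adjacency(triples):
    """Index triples by node so that neighbors can be looked up quickly."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
        adjacency[obj].append((predicate, subject))  # allow reverse traversal
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _predicate, neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


adjacency = build_adjacency(triples)
path = shortest_path(adjacency, "Drug A", "Protein D")
```

With these toy triples, the search connects ‘Drug A’ to ‘Protein D’ through ‘Protein B’ and the shared ‘Pathway C’, which is the kind of indirect association a knowledge graph makes explicit.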

Table 1. A comparison of features from resources providing enrichment analysis with network representations. If a resource had a broken URL, its features were taken from the relevant literature. Column values are: A: Interactive web server, B: Number of libraries, C: Statistical method, D: Cytoscape enabled, E: Gene set augmentation/predictions, F: URL to site works, G: PPIs, H: Co-expression correlations, I: Multiple edge types, J: Different node types in the same graph, K: Provides enrichment analysis.

Resource	URL	PMID	A	B	C	D	E	F	G	H	I	J	K
Enrichr-KG	maayanlab.cloud/enrichr-kg		✓	24	Fisher exact test	✓	✓	✓	✓	✓	×	✓	✓
EnrichmentMap	baderlab.org/Software/EnrichmentMap	21085593	×	NA	NA	✓	×	✓	×	×	×	✓	×
BioGraph	biograph.pa.icar.cnr.it	30458802	✓	9	Fisher exact test	✓	×	×	×	×	×	✓	✓
MELODI	melodi.biocompute.org.uk	29342271	✓	5	Fisher exact test	×	×	✓	×	×	✓	✓	✓
Reactome graph database	reactome.org/dev/graph-database	29377902	×	1	NA	×	×	✓	×	×	×	×	×
GREG	www.moralab.science/GREG	32055858	✓	6	NA	×	×	×	✓	×	✓	✓	×
Bio4j	bio4j.github.io	NA	×	5	NA	✓	×	✓	✓	×	✓	✓	×
cyNeo4j	apps.cytoscape.org/apps/cyneo4j	26272981	×	NA	NA	✓	×	✓	×	×	✓	✓	×
DGLinker	dglinker.rosalind.kcl.ac.uk	34125897	✓	12	Fisher exact test	×	✓	✓	✓	✓	×	×	✓
AmiGO	amigo.geneontology.org/amigo	19033274	✓	2	Hypergeometric	✓	×	✓	×	×	×	×	✓
Genes2FANs	actin.pharm.mssm.edu/genes2FANs	22748121	✓	15	NA	✓	×	×	✓	✓	×	✓	×
STRING	string-db.org	36370105	✓	12	Kolmogorov–Smirnov	✓	×	✓	✓	✓	✓	×	✓
GeneMANIA	genemania.org	29912392	✓	20	Fisher	✓	✓	✓	✓	✓	✓	×	✓
DAVID	david.ncifcrf.gov	35325185	✓	16	EASE score	✓	×	✓	×	×	×	×	✓
ClueGO	apps.cytoscape.org/apps/cluego	19237447	×	3	Hypergeometric	✓	×	✓	×	×	×	✓	✓
Metascape	metascape.org	30944313	✓	24	Custom*	✓	×	✓	×	×	×	✓	✓
NetworkAnalyst	www.networkanalyst.ca	30931480	✓	11	GSEA	×	×	✓	✓	×	×	✓	✓
MSigDB-GSEA	www.gsea-msigdb.org	26771021	✓	15	GSEA	×	×	✓	✓	×	×	×	✓
