Discovering disease-disease similarities can help us infer the mechanisms of complex diseases. Disease similarity has become an important basis for research on disease-related molecular function. A number of methods for measuring disease similarity have been proposed in recent years, but tools and web servers that compute them are lacking. We present dSimer, an R package that implements nine methods for measuring disease-disease similarity: a standard cosine similarity measure and eight function-based methods. The disease similarity matrices produced by these nine methods can be visualized as heatmaps and networks. dSimer also provides biological data widely used in studies of disease-disease associations.
The quantification of similarity among diseases has drawn more and more attention in modern biology and medicine. Understanding similarity among diseases can not only help us gain deeper insights into complex diseases, but also lead to improvements in disease diagnosis, drug repurposing and new drug development. Due to the growing body of high-throughput biological data such as genomic and proteomic data, a number of methods have been proposed for the computation of similarity among diseases during the past decade.
Here we present dSimer, which computes disease similarity with nine methods: one standard cosine similarity method [1, 2] and eight function-based methods, which compare disease-related gene sets [3]. We also implemented functions for visualizing disease similarity matrices.
To get started with the dSimer package, run the following code:
library(dSimer)
help("dSimer")
The cosine similarity method is a widely used standard measure in disease association analyses [1, 2]. It first constructs a feature vector for each disease from the literature or from other biological data, then takes the cosine of the angle between each pair of normalized vectors as the similarity score between the two diseases. We implemented function CosineDFV for this method:
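Before the packaged example, the calculation itself can be sketched in a few lines of base R. This is only an illustration and not the internals of CosineDFV: the toy feature matrix, its row and column names, and the helper cosine_sim are all made up here, and no particular weighting scheme for the features is assumed.

```r
# Toy disease-by-feature weight matrix (rows: diseases, columns: symptoms).
# All names and values are hypothetical.
features <- rbind(
  diseaseA = c(fever = 2, cough = 1, rash = 0),
  diseaseB = c(fever = 4, cough = 2, rash = 0),
  diseaseC = c(fever = 0, cough = 0, rash = 3)
)

# Cosine similarity between every pair of rows:
# dot products divided by the product of the Euclidean norms.
cosine_sim <- function(x) {
  norms <- sqrt(rowSums(x^2))
  tcrossprod(x) / outer(norms, norms)
}

sim_toy <- cosine_sim(features)
round(sim_toy, 3)
```

In this toy matrix diseaseA and diseaseB point in the same direction and score 1, while diseaseC shares no feature with either and scores 0. The dSimer call on the package's disease-symptom data follows.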
data(d2s_hsdn) #get disease-symptom associations for constructing feature vectors
ds <- sample(unique(d2s_hsdn[,2]), 5) #get disease names sample
simmat <- CosineDFV(ds, ds, d2s_hsdn)
simmat
##                              Neurocirculatory Asthenia
## Neurocirculatory Asthenia                   1.00000000
## Tuberculosis, Osteoarticular                0.01195284
## Azoospermia                                 0.02556797
## Acanthamoeba Keratitis                      0.02467001
## Epidermal Cyst                              0.01315075
##                              Tuberculosis, Osteoarticular Azoospermia
## Neurocirculatory Asthenia                      0.01195284  0.02556797
## Tuberculosis, Osteoarticular                   1.00000000  0.00000000
## Azoospermia                                    0.00000000  1.00000000
## Acanthamoeba Keratitis                         0.00000000  0.00000000
## Epidermal Cyst                                 0.18054960  0.00000000
##                              Acanthamoeba Keratitis Epidermal Cyst
## Neurocirculatory Asthenia                0.02467001     0.01315075
## Tuberculosis, Osteoarticular             0.00000000     0.18054960
## Azoospermia                              0.00000000     0.00000000
## Acanthamoeba Keratitis                   1.00000000     0.20303215
## Epidermal Cyst                           0.20303215     1.00000000
BOG [4] is a method that calculates disease similarity using only disease-gene associations. We designed function BOG to implement it, and function Normalize to normalize the matrix that BOG produces. For example:
data(d2g_separation) #get disease-gene associations
ds<-sample(names(d2g_separation),5)
ds
## [1] "male urogenital diseases"
## [2] "spastic paraplegia, hereditary"        
## [3] "cardiomyopathy, hypertrophic, familial"
## [4] "urogenital abnormalities"              
## [5] "mental retardation, x-linked"
sim<-BOG(ds,ds,d2g_separation)
Normalize(sim) #normalize BOG sim scores
##                                        male urogenital diseases
## male urogenital diseases                           7.316823e-02
## spastic paraplegia, hereditary                     0.000000e+00
## cardiomyopathy, hypertrophic, familial             4.085078e-05
## urogenital abnormalities                           2.155035e-03
## mental retardation, x-linked                       3.632515e-04
##                                        spastic paraplegia, hereditary
## male urogenital diseases                                            0
## spastic paraplegia, hereditary                                      1
## cardiomyopathy, hypertrophic, familial                              0
## urogenital abnormalities                                            0
## mental retardation, x-linked                                        0
##                                        cardiomyopathy, hypertrophic, familial
## male urogenital diseases                                         4.085078e-05
## spastic paraplegia, hereditary                                   0.000000e+00
## cardiomyopathy, hypertrophic, familial                           1.840760e-01
## urogenital abnormalities                                         0.000000e+00
## mental retardation, x-linked                                     1.017413e-03
##                                        urogenital abnormalities
## male urogenital diseases                            0.002155035
## spastic paraplegia, hereditary                      0.000000000
## cardiomyopathy, hypertrophic, familial              0.000000000
## urogenital abnormalities                            0.151733614
## mental retardation, x-linked                        0.000000000
##                                        mental retardation, x-linked
## male urogenital diseases                               0.0003632515
## spastic paraplegia, hereditary                         0.0000000000
## cardiomyopathy, hypertrophic, familial                 0.0010174126
## urogenital abnormalities                               0.0000000000
## mental retardation, x-linked                           0.2777999903
Note that the disease-gene association object d2g_separation is a list, which can be built with function x2y_df2list. Function x2y_df2list takes as input a data.frame with two columns, one holding disease IDs and one holding gene IDs, where each row declares an association between a disease and a gene. See an example below:
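To make the expected list shape concrete first, here is a rough base-R equivalent of that conversion. It is a sketch under assumptions, not the implementation of x2y_df2list: the data frame assoc, its column names and its IDs are hypothetical, and details such as duplicate handling are not addressed.

```r
# Hypothetical two-column association table: one row per disease-gene pair.
assoc <- data.frame(
  disease = c("D1", "D2", "D2", "D3"),
  gene    = c("G1", "G2", "G3", "G1"),
  stringsAsFactors = FALSE
)

# Group gene IDs by disease ID; fixing the factor levels keeps the
# diseases in order of first appearance instead of sorting them.
assoc_list <- split(
  assoc$gene,
  factor(assoc$disease, levels = unique(assoc$disease))
)

str(assoc_list)
```

Each list element is named by a disease ID and holds a character vector of that disease's gene IDs. The package's own example with x2y_df2list follows.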
options(stringsAsFactors = FALSE) #this may be necessary
 
d2g_fundo_sample<-read.table(text = "DOID:5218    IL6
DOID:8649  EGFR
DOID:8649   PTGS2
DOID:8649   VHL
DOID:8649   ERBB2
DOID:8649   PDCD1
DOID:8649   KLRC1
DOID:5214   MPZ
DOID:5214   EGR2
DOID:5210   AMH")
d2g_fundo_list<-x2y_df2list(d2g_fundo_sample)
d2g_fundo_list
## $`DOID:5218`
## [1] "IL6"
## 
## $`DOID:8649`
## [1] "EGFR"  "PTGS2" "VHL"   "ERBB2" "PDCD1" "KLRC1"
## 
## $`DOID:5214`
## [1] "MPZ"  "EGR2"
## 
## $`DOID:5210`
## [1] "AMH"
Function Normalize normalizes a matrix or a vector according to the following formula: \[NormScore=\frac{OrigScore-MinScore}{MaxScore-MinScore}\]
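The formula can be written out directly in base R. The helper min_max below is a hypothetical stand-in used only to illustrate the arithmetic; it is not the source of Normalize, and the input values are made up.

```r
# Min-max scaling: subtract the minimum, divide by the range.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

scores <- c(2, 4, 10)
min_max(scores)
# 2 maps to 0, 10 maps to 1, and 4 maps to (4 - 2) / (10 - 2) = 0.25
```

If all values are equal, the denominator is zero and this sketch returns NaN; how Normalize treats that case is not covered here.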
Here’s an example:
m<-matrix(1:9,3,3)
m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
Normalize(m)
##       [,1]  [,2]  [,3]
## [1,] 0.000 0.375 0.750
## [2,] 0.125 0.500 0.875
## [3,] 0.250 0.625 1.000
PSB [5] uses not only disease-gene associations but also GO-gene associations. We designed function PSB to implement this method. Function PSB needs disease-GO associations, which can be obtained with function HypergeometricTest.
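As a rough illustration of how a hypergeometric test can link a disease to a GO term, the sketch below uses base R's phyper for a one-sided over-representation test. It assumes the usual enrichment setup and says nothing about the arguments or output of HypergeometricTest itself; all counts are hypothetical.

```r
# Hypothetical counts for one disease / GO-term pair.
N <- 20000  # genes in the background
K <- 150    # background genes annotated with the GO term
n <- 40     # genes associated with the disease
k <- 6      # disease genes that also carry the GO term

# P(X >= k): chance of drawing at least k annotated genes
# when n genes are sampled without replacement from the background.
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
p_value
```

A small p-value suggests the GO term is over-represented among the disease genes, so the pair would typically be kept as a disease-GO association after multiple-testing correction.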
Here’s an example for PSB:
## get the data 
data(go2g_sample)
data(d2go_sample)
ds<-names(d2go_sample)
sim<-PSB(ds,ds,d2go_sample,go2g_sample)
sim
##            DOID:5652 DOID:10005  DOID:2290
## DOID:5652  4.1743210 0.44099253 0.51055413
## DOID:10005 0.4409925 5.82708131 0.09897865
## DOID:2290  0.5105541 0.09897865 5.45446576
Normalize(sim)
##             DOID:5652 DOID:10005  DOID:2290
## DOID:5652  0.71146461 0.05970806 0.07185197
## DOID:10005 0.05970806 1.00000000 0.00000000
## DOID:2290  0.07185197 0.00000000 0.93494957
FunSim [3] is a method that takes advantage of gene functional network data from HumanNet [6]. We first obtained the raw data from HumanNet and normalized the log likelihood scores associated with gene pairs. Function FunSim can then be used to measure similarity between diseases (note that we use function LLSn2List to convert the HumanNet data from a data.frame to a list, for more efficient calculation):
## this method requires disease-gene associations in which
## genes are represented by Entrez IDs, to match the
## HumanNet data
data(d2g_fundo_entrezid) ##get disease-gene associations
data(HumanNet_sample)
## we specified 5 DOIDs to match HumanNet_sample
ds<-c("DOID:8176","DOID:2394","DOID:3744","DOID:8466","DOID:5679")
llsnlist<-LLSn2List(HumanNet_sample)
FunSim(ds,ds,d2g_fundo_entrezid,llsnlist)
##              DOID:8176  DOID:2394    DOID:3744    DOID:8466 DOID:5679
## DOID:8176 1.0000000000 0.43350044 0.4476822789 0.0002666029 0.1457867
## DOID:2394 0.4335004444 1.00000000 0.4222353572 0.0166884115 0.1592188
## DOID:3744 0.4476822789 0.42223536 1.0000000000 0.0001286379 0.1293250
## DOID:8466 0.0002666029 0.01668841 0.0001286379 1.0000000000 0.1349979
## DOID:5679 0.1457866621 0.15921884 0.1293250308 0.1349979009 1.0000000
ICod [7] measures disease-disease similarity based on disease-related protein-protein interactions (PPIs). We used the R graph package igraph to import PPI data into R. For example:
## get disease-gene associations and HPRD PPI data
data(d2g_fundo_symbol)
data(PPI_HPRD)
graph_hprd<-graph.data.frame(PPI_HPRD,directed=FALSE) #get an igraph object based on HPRD PPI data
ds<-sample(names(d2g_fundo_symbol),5)
ICod(ds,ds,d2g_fundo_symbol,graph_hprd)
##               DOID:684   DOID:1024   DOID:4989    DOID:227 DOID:11198
## DOID:684   1.000000000 0.006661085 0.005508185 0.009749806          0
## DOID:1024  0.006661085 1.000000000 0.014984122 0.000000000          0
## DOID:4989  0.005508185 0.014984122 1.000000000 0.031015951          0
## DOID:227   0.009749806 0.000000000 0.031015951 1.000000000          0
## DOID:11198 0.000000000 0.000000000 0.000000000 0.000000000          1
Sun et al. [8] proposed three measures that use different biological information: annotation-based, function-based and topology-based.
Sun’s annotation-based measure uses disease-gene association data. We implemented function Sun_annotation to calculate disease-disease similarity:
data(d2g_separation) ## get disease-gene associations
ds<-sample(names(d2g_separation),5)
Sun_annotation(ds,ds,d2g_separation)
##                                glioma heart diseases
## glioma                    1.000000000    0.012931034
## heart diseases            0.012931034    1.000000000
## neurologic manifestations 0.000000000    0.012285012
## liver cirrhosis, biliary  0.000000000    0.000000000
## eye diseases              0.002941176    0.009487666
##                           neurologic manifestations
## glioma                                   0.00000000
## heart diseases                           0.01228501
## neurologic manifestations                1.00000000
## liver cirrhosis, biliary                 0.00000000
## eye diseases                             0.08141962
##                           liver cirrhosis, biliary eye diseases
## glioma                                  0.00000000  0.002941176
## heart diseases                          0.00000000  0.009487666
## neurologic manifestations               0.00000000  0.081419624
## liver cirrhosis, biliary                1.00000000  0.008620690
## eye diseases                            0.00862069  1.000000000
Sun’s function-based measure uses both GO term annotations and disease-gene associations to estimate disease similarity scores. We implemented function Sun_function for this method, used as follows:
## get a sample of disease-GO associations
data(d2go_sample)
ds<-names(d2go_sample)
Sun_function(ds,ds,d2go_sample)
##             DOID:5652 DOID:10005  DOID:2290
## DOID:5652  1.00000000          0 0.02672606
## DOID:10005 0.00000000          1 0.00000000
## DOID:2290  0.02672606          0 1.00000000
Sun’s topology-based measure takes advantage of the topology of the PPI network and disease-gene association data to measure disease similarity. We implemented function Sun_topology for this method. The graphlet signatures of nodes in the PPI network were calculated with the tool orca [9] (see details in [8]).
Here is an example:
data(d2g_fundo_symbol)
data(graphlet_sig_hprd) #get graphlet signatures of genes in HPRD PPI network
data(weight)
ds<-sample(names(d2g_fundo_symbol),5)
Sun_topology(ds,ds,d2g_fundo_symbol,graphlet_sig_hprd,weight)
##            DOID:1785 DOID:9996 DOID:2741 DOID:10534  DOID:987
## DOID:1785  1.0000000 0.4521483         0  0.9503442 0.8500784
## DOID:9996  0.4521483 1.0000000         0  0.4923496 0.3793756
## DOID:2741  0.0000000 0.0000000         1  0.0000000 0.0000000
## DOID:10534 0.9503442 0.4923496         0  1.0000000 0.8183088
## DOID:987   0.8500784 0.3793756         0  0.8183088 1.0000000
Separation is actually a method that measures the network-based separation of disease pairs [10], but its result can be turned into similarity scores. We implemented function Separation2Similarity for this purpose, using the following formula: \[SimScore=\frac{MaxSepScore-SepScore}{MaxSepScore-MinSepScore}\]
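The conversion formula can be checked by hand with a small base-R sketch. The helper sep_to_sim and the separation values below are hypothetical and serve only to illustrate the formula; they are not taken from Separation2Similarity or from real data.

```r
# Reverse and rescale: the largest separation maps to similarity 0,
# the smallest separation maps to similarity 1.
sep_to_sim <- function(sep) {
  (max(sep) - sep) / (max(sep) - min(sep))
}

# Hypothetical separation scores for two diseases.
sep_toy <- matrix(
  c(-0.5, 1.5,
     1.5, 0.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("dA", "dB"), c("dA", "dB"))
)

sep_to_sim(sep_toy)
```

With these values the denominator is 1.5 - (-0.5) = 2, so dA scores 1 against itself, the dA-dB pair scores 0, and dB scores (1.5 - 0) / 2 = 0.75 against itself. Under this formula a disease's similarity to itself is therefore not necessarily 1.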
Here is an example:
## get the disease-gene association data and interactome data
data(d2g_separation)
data(interactome)
## import ppi data to R by igraph
graph_interactome<-graph.data.frame(interactome,directed=FALSE)
## calculate separation of 5 sample diseases
ds<-sample(names(d2g_separation),5)
sep<-Separation(ds,ds,d2g_separation,graph_interactome)
## convert separation into similarity
sim<-Separation2Similarity(sep)
sim
##                                      amino acid metabolism, inborn errors
## amino acid metabolism, inborn errors                           0.65125385
## parkinson disease                                              0.10104460
## coronary artery disease                                        0.07339277
## esophageal diseases                                            0.08631854
## metabolic syndrome x                                           0.00000000
##                                      parkinson disease
## amino acid metabolism, inborn errors         0.1010446
## parkinson disease                            0.9318677
## coronary artery disease                      0.2569307
## esophageal diseases                          0.2689062
## metabolic syndrome x                         0.1841251
##                                      coronary artery disease
## amino acid metabolism, inborn errors              0.07339277
## parkinson disease                                 0.25693072
## coronary artery disease                           0.99929857
## esophageal diseases                               0.28663181
## metabolic syndrome x                              0.25550882
##                                      esophageal diseases
## amino acid metabolism, inborn errors          0.08631854
## parkinson disease                             0.26890615
## coronary artery disease                       0.28663181
## esophageal diseases                           0.89127818
## metabolic syndrome x                          0.17657526
##                                      metabolic syndrome x
## amino acid metabolism, inborn errors            0.0000000
## parkinson disease                               0.1841251
## coronary artery disease                         0.2555088
## esophageal diseases                             0.1765753
## metabolic syndrome x                            1.0000000
Disease-gene associations are the most widely used information for measuring disease similarity in function-based methods. We provide function plot_bipartite so that users can inspect associations between diseases and genes more intuitively. This function can also visualize other associations, such as disease-GO term associations. Here’s an example:
data(d2g_fundo_symbol)
d2g_sample<-d2g_fundo_symbol[1:10]
plot_bipartite(d2g_sample)
For two specific diseases, we implemented function plot_topo to visualize the topological relationship between the two diseases’ gene sets in a gene network. The following is an example:
data("PPI_HPRD")
g<-graph.data.frame(PPI_HPRD,directed = FALSE) #get an igraph graph
data(d2g_fundo_symbol)
a<-d2g_fundo_symbol[["DOID:8242"]] # get gene set a
b<-d2g_fundo_symbol[["DOID:4914"]] # get gene set b
plot_topo(a,b,g)
We implemented two functions, plot_heatmap and plot_net, for visualizing the disease similarity network. Function plot_heatmap reuses function simplot from package DOSE [11] and shows a heatmap of the disease similarity network. Given a cutoff, function plot_net shows the relationships among diseases as a graph. For example:
data(d2g_separation)
data(interactome)
graph_interactome<-graph.data.frame(interactome,directed=FALSE)
ds<-c("myocardial ischemia","myocardial infarction","coronary artery disease",
 "cerebrovascular disorders","arthritis, rheumatoid","diabetes mellitus, type 1",
 "autoimmune diseases of the nervous system","demyelinating autoimmune diseases, cns",
 "respiratory hypersensitivity","asthma","retinitis pigmentosa",
 "retinal degeneration","macular degeneration")
 
sep<-Separation(ds,ds,d2g_separation,graph_interactome)
sim<-Separation2Similarity(sep)
plot_heatmap(sim,font.size = 3)
plot_net(sim,cutoff=0.2)
[1] Zhou X Z, Menche J, Barabasi A L, et al. Human symptoms-disease network[J]. Nature communications, 2014, 5.
[2] Van Driel M A, Bruggeman J, Vriend G, et al. A text-mining analysis of the human phenome[J]. European journal of human genetics, 2006, 14(5): 535-542.
[3] Cheng L, Li J, Ju P, et al. SemFunSim: a new method for measuring disease similarity by integrating semantic and gene functional association[J]. PloS one, 2014, 9(6): e99415.
[4] Mathur S, Dinakarpandian D. Automated ontological gene annotation for computing disease similarity[J]. AMIA Summits on Translational Science Proceedings, 2010, 2010: 12.
[5] Mathur S, Dinakarpandian D. Finding disease similarity based on implicit semantic similarity[J]. Journal of biomedical informatics, 2012, 45(2): 363-371.
[6] Lee I, Blom U M, Wang P I, et al. Prioritizing candidate disease genes by network-based boosting of genome-wide association data[J]. Genome research, 2011, 21(7): 1109-1121.
[7] Paik H, Heo HS, Ban H, et al. Unraveling human protein interaction networks underlying co-occurrences of diseases and pathological conditions[J]. Journal of translational medicine, 2014, 12(1): 99.
[8] Sun K, Gonçalves JP, Larminie C. Predicting disease associations via biological network analysis[J]. BMC bioinformatics, 2014, 15(1): 304.
[9] Hočevar T, Demšar J. A combinatorial approach to graphlet counting[J]. Bioinformatics, 2014, 30(4): 559-565.
[10] Menche J, Sharma A, Kitsak M, et al. Uncovering disease-disease relationships through the incomplete interactome[J]. Science, 2015, 347(6224): 1257601.
[11] Yu G, Wang L G, Yan G R, et al. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis[J]. Bioinformatics, 2015, 31(4): 608-609.