---
title: "Pathway Integrated Regression-based Kernel Association Test (PaIRKAT)"
author: 
    - name: Charlie Carpenter
      email: CHARLES.CARPENTER@CUANSCHUTZ.EDU
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Weiming Zhang
      email: WEIMING.ZHANG@CUANSCHUTZ.EDU
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Lucas Gillenwater
      email: LUCAS.GILLENWATER@CUANSCHUTZ.EDU
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Cameron Severn
      email: CAMERON.SEVERN@CUANSCHUTZ.EDU
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Tusharkanti Ghosh
      email: TUSHARKANTI.GHOSH@CUANSCHUTZ.EDU
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Russel Bowler
      email: BOWLERR@NJHEALTH.ORG
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Katerina Kechris
      email: KATERINA.KECHRIS@CUANSCHUTZ.EDU
      affiliation: University of Colorado Anschutz Medical Campus
    - name: Debashis Ghosh
      email: DEBASHIS.GHOSH@CUANSCHUTZ.EDU
      affiliation: University of Colorado Anschutz Medical Campus
package: pairkat
output: 
  BiocStyle::html_document:
    highlight: "tango"
    code_folding: show
    toc: true
    toc_float: 
      collapsed: false
date: "9/1/2021"
vignette: >
  %\VignetteIndexEntry{using-pairkat}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---


```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

Pathway Integrated Regression-based Kernel Association Test (PaIRKAT) is a 
model framework for assessing statistical relationships between networks 
and some outcome of interest while adjusting for 
potential confounders and covariates.

The PaIRKAT method is motivated by the analysis of networks of metabolites
from a metabolomics assay and the relationship of those networks with a
phenotype or clinical outcome of interest, though the method can be generalized 
to other domains.

PaIRKAT queries the KEGG database to determine interactions between 
metabolites from which network connectivity is constructed. This model 
framework improves testing power on high dimensional data by including 
graph topography in the kernel machine regression setting. Studies on high 
dimensional data can struggle to include the complex relationships between 
variables. The semi-parametric kernel machine regression model is a powerful 
tool for capturing these types of relationships. They provide a framework for 
testing for relationships between outcomes of interest and high dimensional 
data such as metabolomic, genomic, or proteomic pathways. PaIRKAT uses known 
biological connections between high dimensional variables by representing them 
as edges of ‘graphs’ or ‘networks.’ It is common for nodes (e.g. metabolites) 
to be disconnected from all others within the graph, which leads to meaningful 
decreases in testing power whether or not the graph information is included. 
We include a graph regularization or ‘smoothing’ approach for managing 
this issue.

# Installation

```{r, eval=FALSE}
# install from bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("pairkat")
```

# Load pairkat library and example data

PaIRKAT comes with an example data set to use as a reference when formatting 
data inputs to its functions. This synthetic dataset called "smokers" contains 
a set of human subjects with phenotype variables related to 
lung health among smokers. Subjects have associated metabolomics assay data 
that are linked to KEGG pathway database IDs. 

```{r, message=F, warning = F, results='hide'}
# load pairkat library
library(pairkat)
data("smokers")
```

# Create a summarized experiment object

Data need to be added to a SummarizedExperiment class object before they can 
be analyzed with PaIRKAT. There are three components of the 
SummarizedExperiment that need to be added:

1. Phenotype Data: Contains outcomes of interest and any meaningful covariates 
to be adjusted for. Rows should be subjects and columns should be variables.

2. Pathway Data: Maps measured metabolites names to KEGG IDs in order to query 
the KEGG database. Rows are metabolite names and one column should contain 
KEGG IDs.

3. Metabolite Assay: Contains measurements of metabolites for all subjects. 
Rows should be subjects and columns should be metabolite names. One column 
should have subject IDs matching subject IDs in clinical data.

We will walk through each of these data and discuss their formats.

## Phenotype Data

Phenotype data should be formatted such that subject ID's are row names and 
variables of interest are column names. Categorical variables should be 
converted to dummy variables (i.e. one-hot-encoded). 

In our example, we will be working with a data set containing measures related 
to lung health among smokers. Variables in this data set include:

`log_FEV1_FVC_ratio`: [log(FEV1/FVC)] A measure of lung capacity. 
The FEV1/FVC ratio is the ratio of the forced expiratory volume in the first 
one second to the forced vital capacity of the lungs. This measure has been 
log transformed.

`high_log_FEV1_FVC_ratio_binary`: A binarized transformation of 
`log_FEV1_FVC_ratio` which indicates whether `log_FEV1_FVC_ratio` is "high" 
with a 1, otherwise 0.

`race_binary`: A binary indicator of race. 

`age`: Age in years

`bmi`: Body Mass Index

`smoking_status`: Binary indicator of whether a subject is a current smoker 
(1) or not (0)

`pack_years`: A way to measure the amount a person has smoked over a long 
period of time. It is calculated by multiplying the number of packs of 
cigarettes smoked per day by the number of years the person has smoked.

The example data are already packaged as a SummarizedExperiment. We will extract
each of the components to look at their structures. Phenotype data are saved
in the colData partition of the SummarizedExperiment can can be extracted as
follows:

```{r, message=F, warning = F}
phenotype <- SummarizedExperiment::colData(smokers)
head(phenotype)
```

## Pathway Data

Pathway data primarily functions to link metabolite names with KEGG ID's. 
Row names should be metabolite names and one column should contain KEGG IDs. 
Other columns may be present if desired.

Pathway data are saved
in the rowData partition of the SummarizedExperiment can can be extracted as
follows:

```{r}
pathways <- SummarizedExperiment::rowData(smokers)
head(pathways)[, 1:2]
```

## Metabolite Assay

Metabolite assay data should have metabolite names as row names (matching 
pathway data names) and subject IDs as column names.

Metabolite Assay data are saved
in the assays partition of the SummarizedExperiment. Multiple assays can be 
stored in the SummarizedExperiment object. PaIRKAT relies on the assay to be 
named "metabolomics". The assay data can can be extracted as follows:

```{r}
metabolome <- SummarizedExperiment::assays(smokers)$metabolomics
head(metabolome)[, 1:3]
```

## Create the Summarized Experiment Object

Once the three components are formatted correctly, it is straightforward to 
add these dataframes to the SummarizedExperiment object with the 
SummarizedExperiment function.

In this example, we re-save our SummarizedExperiment data to an object 
called `smokers`

```{r}
smokers <-
  SummarizedExperiment::SummarizedExperiment(
    assays = list(metabolomics = metabolome),
    rowData = pathways,
    colData = phenotype
  )
```


# GatherNetworks Function

GatherNetworks queries the KEGG API to discover molecular interactions and 
build a network graph between measured metabolites.

GatherNetworks takes 4 arguments 

`SE` - a SummarizedExperiment object

`keggID` - a string of the column name in pathway data containing KEGG IDs

`species` - a three letter species IDs can be found by 
running `keggList("organism")`

`minPathwaySize` - this argument filters KEGG pathways that contain fewer 
metabolites than the number specified

## Get species ID

Our data were gathered from humans, so we will use the three letter species 
code "hsa".

```{r}
head(KEGGREST::keggList("organism"))[, 2:3]
```
## Run GatherNetworks

```{r}
networks <- GatherNetworks(
  SE = smokers,
  keggID = "kegg_id",
  species = "hsa",
  minPathwaySize = 5
)
```


# PaIRKAT function

PaIRKAT can be used to get a kernel score for all pathways found in 
GatherNetworks. PaIRKAT takes 3 arguments:

`formula.H0` - The null model in the "formula" format used in lm and glm 
functions.

`networks` - networks object obtained with GatherNetworks

`tau` - A parameter to control the amount of smoothing, analogous to a 
bandwidth parameter in kernel smoothing. We found 1 often gave reasonable 
results, as over-smoothing can lead to inflated Type I errors.

```{r, message = F, warning = F}
# run PaIRKAT
output <- PaIRKAT(log_FEV1_FVC_ratio ~ age, 
                  networks = networks)

```

## View Results

The formula call and results of the kernel test can be viewed by calling items 
from the PaIRKAT function output. In our example, the first result can be 
interpreted as follows:

Histidine metabolism has a significant relationship with 
log_FEV1_FVC_ratio when controlling for age (p = 0.0038). 
Similarly, Arginine biosynthesis has a significant relationship with 
log_FEV1_FVC_ratio when controlling for age (p = 0.0070). etc.

```{r}
# view formula call
output$call

# view results
results <- dplyr::arrange(output$results, p.value)
head(results)
```

## Visualize Networks

To look further into significant metabolic pathways, it is possible to plot 
pathway networks using the plotNetworks function and passing the networks 
object along with one of the pathway names as a string. To visualize all 
networks, pass "all" to the pathways argument. 

```{r}
plotNetworks(networks = networks, 
             pathway = "Glycerophospholipid metabolism")
```

# Session Information
```{r}
sessionInfo()
```