---
title: "Pathway Integrated Regression-based Kernel Association Test (PaIRKAT)"
author:
- name: Charlie Carpenter
email: CHARLES.CARPENTER@CUANSCHUTZ.EDU
affiliation: University of Colorado Anschutz Medical Campus
- name: Weiming Zhang
email: WEIMING.ZHANG@CUANSCHUTZ.EDU
affiliation: University of Colorado Anschutz Medical Campus
- name: Lucas Gillenwater
email: LUCAS.GILLENWATER@CUANSCHUTZ.EDU
affiliation: University of Colorado Anschutz Medical Campus
- name: Cameron Severn
email: CAMERON.SEVERN@CUANSCHUTZ.EDU
affiliation: University of Colorado Anschutz Medical Campus
- name: Tusharkanti Ghosh
email: TUSHARKANTI.GHOSH@CUANSCHUTZ.EDU
affiliation: University of Colorado Anschutz Medical Campus
- name: Russel Bowler
email: BOWLERR@NJHEALTH.ORG
affiliation: University of Colorado Anschutz Medical Campus
- name: Katerina Kechris
email: KATERINA.KECHRIS@CUANSCHUTZ.EDU
affiliation: University of Colorado Anschutz Medical Campus
- name: Debashis Ghosh
email: DEBASHIS.GHOSH@CUANSCHUTZ.EDU
affiliation: University of Colorado Anschutz Medical Campus
package: pairkat
output:
BiocStyle::html_document:
highlight: "tango"
code_folding: show
toc: true
toc_float:
collapsed: false
date: "9/1/2021"
vignette: >
%\VignetteIndexEntry{using-pairkat}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction

Pathway Integrated Regression-based Kernel Association Test (PaIRKAT) is a
model framework for assessing statistical relationships between networks
and some outcome of interest while adjusting for
potential confounders and covariates.

The PaIRKAT method is motivated by the analysis of networks of metabolites
from a metabolomics assay and the relationship of those networks with a
phenotype or clinical outcome of interest, though the method can be generalized
to other domains.

PaIRKAT queries the KEGG database to determine interactions between
metabolites from which network connectivity is constructed. This model
framework improves testing power on high dimensional data by including
graph topography in the kernel machine regression setting. Studies on high
dimensional data can struggle to include the complex relationships between
variables. The semi-parametric kernel machine regression model is a powerful
tool for capturing these types of relationships. They provide a framework for
testing for relationships between outcomes of interest and high dimensional
data such as metabolomic, genomic, or proteomic pathways. PaIRKAT uses known
biological connections between high dimensional variables by representing them
as edges of â€˜graphsâ€™ or â€˜networks.â€™ It is common for nodes (e.g. metabolites)
to be disconnected from all others within the graph, which leads to meaningful
decreases in testing power whether or not the graph information is included.
We include a graph regularization or â€˜smoothingâ€™ approach for managing
this issue.

# Installation

```{r, eval=FALSE}
# install from bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("pairkat")
```

# Load pairkat library and example data

PaIRKAT comes with an example data set to use as a reference when formatting
data inputs to its functions. This synthetic dataset called "smokers" contains
a set of human subjects with phenotype variables related to
lung health among smokers. Subjects have associated metabolomics assay data
that are linked to KEGG pathway database IDs.

```{r, message=F, warning = F, results='hide'}
# load pairkat library
library(pairkat)
data("smokers")
```

# Create a summarized experiment object

Data need to be added to a SummarizedExperiment class object before they can
be analyzed with PaIRKAT. There are three components of the
SummarizedExperiment that need to be added:

1. Phenotype Data: Contains outcomes of interest and any meaningful covariates
to be adjusted for. Rows should be subjects and columns should be variables.

2. Pathway Data: Maps measured metabolites names to KEGG IDs in order to query
the KEGG database. Rows are metabolite names and one column should contain
KEGG IDs.

3. Metabolite Assay: Contains measurements of metabolites for all subjects.
Rows should be subjects and columns should be metabolite names. One column
should have subject IDs matching subject IDs in clinical data.

We will walk through each of these data and discuss their formats.

## Phenotype Data

Phenotype data should be formatted such that subject ID's are row names and
variables of interest are column names. Categorical variables should be
converted to dummy variables (i.e. one-hot-encoded).

In our example, we will be working with a data set containing measures related
to lung health among smokers. Variables in this data set include:

`log_FEV1_FVC_ratio`: [log(FEV1/FVC)] A measure of lung capacity.
The FEV1/FVC ratio is the ratio of the forced expiratory volume in the first
one second to the forced vital capacity of the lungs. This measure has been
log transformed.

`high_log_FEV1_FVC_ratio_binary`: A binarized transformation of
`log_FEV1_FVC_ratio` which indicates whether `log_FEV1_FVC_ratio` is "high"
with a 1, otherwise 0.

`race_binary`: A binary indicator of race.

`age`: Age in years

`bmi`: Body Mass Index

`smoking_status`: Binary indicator of whether a subject is a current smoker
(1) or not (0)

`pack_years`: A way to measure the amount a person has smoked over a long
period of time. It is calculated by multiplying the number of packs of
cigarettes smoked per day by the number of years the person has smoked.

The example data are already packaged as a SummarizedExperiment. We will extract
each of the components to look at their structures. Phenotype data are saved
in the colData partition of the SummarizedExperiment can can be extracted as
follows:

```{r, message=F, warning = F}
phenotype <- SummarizedExperiment::colData(smokers)
head(phenotype)
```

## Pathway Data

Pathway data primarily functions to link metabolite names with KEGG ID's.
Row names should be metabolite names and one column should contain KEGG IDs.
Other columns may be present if desired.

Pathway data are saved
in the rowData partition of the SummarizedExperiment can can be extracted as
follows:

```{r}
pathways <- SummarizedExperiment::rowData(smokers)
head(pathways)[, 1:2]
```

## Metabolite Assay

Metabolite assay data should have metabolite names as row names (matching
pathway data names) and subject IDs as column names.

Metabolite Assay data are saved
in the assays partition of the SummarizedExperiment. Multiple assays can be
stored in the SummarizedExperiment object. PaIRKAT relies on the assay to be
named "metabolomics". The assay data can can be extracted as follows:

```{r}
metabolome <- SummarizedExperiment::assays(smokers)$metabolomics
head(metabolome)[, 1:3]
```

## Create the Summarized Experiment Object

Once the three components are formatted correctly, it is straightforward to
add these dataframes to the SummarizedExperiment object with the
SummarizedExperiment function.

In this example, we re-save our SummarizedExperiment data to an object
called `smokers`

```{r}
smokers <-
SummarizedExperiment::SummarizedExperiment(
assays = list(metabolomics = metabolome),
rowData = pathways,
colData = phenotype
)
```

# GatherNetworks Function

GatherNetworks queries the KEGG API to discover molecular interactions and
build a network graph between measured metabolites.

GatherNetworks takes 4 arguments

`SE` - a SummarizedExperiment object

`keggID` - a string of the column name in pathway data containing KEGG IDs

`species` - a three letter species IDs can be found by
running `keggList("organism")`

`minPathwaySize` - this argument filters KEGG pathways that contain fewer
metabolites than the number specified

## Get species ID

Our data were gathered from humans, so we will use the three letter species
code "hsa".

```{r}
head(KEGGREST::keggList("organism"))[, 2:3]
```
## Run GatherNetworks

```{r}
networks <- GatherNetworks(
SE = smokers,
keggID = "kegg_id",
species = "hsa",
minPathwaySize = 5
)
```

# PaIRKAT function

PaIRKAT can be used to get a kernel score for all pathways found in
GatherNetworks. PaIRKAT takes 3 arguments:

`formula.H0` - The null model in the "formula" format used in lm and glm
functions.

`networks` - networks object obtained with GatherNetworks

`tau` - A parameter to control the amount of smoothing, analogous to a
bandwidth parameter in kernel smoothing. We found 1 often gave reasonable
results, as over-smoothing can lead to inflated Type I errors.

```{r, message = F, warning = F}
# run PaIRKAT
output <- PaIRKAT(log_FEV1_FVC_ratio ~ age,
networks = networks)

```

## View Results

The formula call and results of the kernel test can be viewed by calling items
from the PaIRKAT function output. In our example, the first result can be
interpreted as follows:

Histidine metabolism has a significant relationship with
log_FEV1_FVC_ratio when controlling for age (p = 0.0038).
Similarly, Arginine biosynthesis has a significant relationship with
log_FEV1_FVC_ratio when controlling for age (p = 0.0070). etc.

```{r}
# view formula call
output$call

# view results
results <- dplyr::arrange(output$results, p.value)
head(results)
```

## Visualize Networks

To look further into significant metabolic pathways, it is possible to plot
pathway networks using the plotNetworks function and passing the networks
object along with one of the pathway names as a string. To visualize all
networks, pass "all" to the pathways argument.

```{r}
plotNetworks(networks = networks,
pathway = "Glycerophospholipid metabolism")
```

# Session Information
```{r}
sessionInfo()
```