1 Why Do We Need A New Class?
2 Design Philosophy
3 Anatomy of a LongTable
- 3.1 Class Diagram
- 3.2 Object Structure and Cardinality
4 Building a LongTable
- 4.1 Single Assays Table
- 4.2 Multiple Assay Tables
5 LongTable Object
6 Accessor Methods

1 Why Do We Need A New Class?

The current implementation for the @sensitivity slot in a PharmacoSet has some limitations.

Firstly, it does not natively support dose-response experiments with multiple drugs and/or cancer cell lines. As a result, we have not been able to include such data in a PharmacoSet thus far.

Secondly, drug combination data has the potential to scale to high dimensionality. As a result we need an object that is highly performant to ensure computations on such data can be completed in a timely manner.

2 Design Philosophy

The current use case is supporting drug and cell-line combinations in PharmacoGx, but we wanted to create something flexible enough to fit other use cases. As such, the current class makes no mention of drugs or cell-lines, nor anything specifically related to Bioinformatics or Computational Biology. Rather, we tried to design a general purpose data structure which could support high dimensional data for any use case.

Our design takes the best aspects of the SummarizedExperiment and MultiAssayExperiment classes and implements them using the data.table package, which provides an R API to a rich set of tools for high performance data processing implemented in C.

3 Anatomy of a LongTable

3.1 Class Diagram

We have borrowed directly from the SummarizedExperiment class for the rowData, colData, metadata and assays slot names. We also implemented the SummarizedExperiment accessor generics for the LongTable.

3.2 Object Structure and Cardinality

There are, however, some important differences which make this object more flexible when dealing with high dimensional data.

Unlike a SummarizedExperiment, there are three distinct classes of columns in rowData and colData.

The first is the rowKey or colKey. These are implemented internally to keep mappings between each assay and the associated samples or drugs, and they are not returned by the accessors by default. The second is the rowIDs and colIDs. These hold all of the information necessary to uniquely identify a row or column and are used to generate the rowKey and colKey. Finally, there are the rowMeta and colMeta columns, which store any additional data about samples or drugs that is not required to uniquely identify a row in either table.
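To make the three classes of columns concrete, here is a minimal base R sketch of how an integer key could be derived from the ID columns. It is not the internal code of the class. The `tissue` column and the variables `idCols` and `metaCols` are assumptions introduced only for illustration.

```r
# Toy rowData: two ID columns plus one hypothetical metadata column
rowData <- data.frame(
    cell_line1=c('A2058', 'A2780', 'A2780'),
    BatchID=c(1, 1, 2),
    tissue=c('skin', 'ovary', 'ovary'),  # illustrative metadata, not in the Merck table
    stringsAsFactors=FALSE
)
idCols <- c('cell_line1', 'BatchID')            # together these identify a row
metaCols <- setdiff(colnames(rowData), idCols)  # everything else is metadata

# The ID columns must identify each row uniquely before a key is assigned
stopifnot(!anyDuplicated(rowData[, idCols]))

# Assign one integer key per unique combination of ID values
rowData$rowKey <- seq_len(nrow(rowData))
print(rowData)
```

The same idea would apply to colData, with the drug names and doses as ID columns.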

Within the assays the rowKey and colKey are combined to form a primary key for each assay row. This is required because each assay is stored in ‘long’ format, instead of wide format as in the assay matrices within a SummarizedExperiment. Thanks to the fast implementation of binary search within the data.table package, assay tables can scale up to tens or even hundreds of millions of rows while still being relatively performant.
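The long format and its composite key can be sketched with a few toy tables. This example uses base R with made-up values, so lookups here are plain vector scans rather than the binary search that data.table provides. It is only meant to show the shape of the data, not how the class is implemented.

```r
# Toy key tables (values are illustrative)
rowData <- data.frame(rowKey=1:2, cell_line1=c('A2058', 'A2780'),
                      stringsAsFactors=FALSE)
colData <- data.frame(colKey=1:2, drug1=c('5-FU', '5-FU'),
                      drug2=c('ABT-888', 'AZD1775'), stringsAsFactors=FALSE)

# Long format: one row per (rowKey, colKey) pair that was actually measured.
# The pair (2, 2) is simply absent; a wide matrix would need an NA cell for it.
assay <- data.frame(rowKey=c(1L, 1L, 2L), colKey=c(1L, 2L, 1L),
                    viability1=c(0.971, 0.893, 0.950))

# (rowKey, colKey) acts as a composite primary key: no pair may repeat
stopifnot(!anyDuplicated(assay[, c('rowKey', 'colKey')]))

# Look up a single observation by its composite key
hit <- assay[assay$rowKey == 1L & assay$colKey == 2L, ]

# Attach the identifiers by joining on each key in turn
annotated <- merge(merge(assay, rowData, by='rowKey'), colData, by='colKey')
print(annotated)
```

Storing only the measured pairs is what keeps the assay tables compact when many row and column combinations were never tested.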

Also worth noting is the cardinality between rowData and colData for a given assay within the assays list. As indicated by the lower connection between these tables and an assay, for each row or column key there may be zero or more rows in the assay table. Conversely for each row in the assay there may be zero or one key in colData or rowData. When combined, the rowKey and colKey for a given row in an assay become a composite key which maps that to

4 Building a LongTable

The current implementation of the buildLongTable function is able to assemble a LongTable object from two sources. The first is a single large table with all assays, row and column data contained within it. This is the structure of the Merck drug combination data that has been used to test the data structure thus far.

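library(data.table)  # provides fread()

filePath <- '../data/merckLongTable.csv'
# Read the Merck data, treating the string 'NULL' as a missing value
merckDT <- fread(filePath, na.strings=c('NULL'))
colnames(merckDT)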

##  [1] "cell_line"        "combination_name" "BatchID"          "drugA_name"      
##  [5] "drugA Conc (uM)"  "drugB_name"       "drugB Conc (uM)"  "viability1"      
##  [9] "viability2"       "viability3"       "viability4"       "mu/muMax"        
## [13] "X/X0"

knitr::kable(head(merckDT)[, 1:5])

cell_line	combination_name	BatchID	drugA_name	drugA Conc (uM)
A2058	5-FU & ABT-888	1	5-FU	0.35
A2058	5-FU & BEZ-235	1	5-FU	0.35
A2058	5-FU & Bortezomib	1	5-FU	0.35
A2058	5-FU & Dasatinib	1	5-FU	0.35
A2058	5-FU & L778123	1	5-FU	0.35
A2058	5-FU & geldanamycin	1	5-FU	0.35

knitr::kable(head(merckDT)[, 5:ncol(merckDT)])

drugA Conc (uM)	drugB_name	drugB Conc (uM)	viability1	viability2	viability3	viability4	mu/muMax	X/X0
0.35	ABT-888	0.35000	0.971	1.090	0.949	0.996	0.992	0.988
0.35	BEZ-235	0.00450	0.921	0.947	0.915	0.956	0.965	0.953
0.35	Bortezomib	0.00045	0.983	0.962	0.950	0.954	0.978	0.970
0.35	Dasatinib	0.02400	0.798	0.778	0.946	0.312	0.879	0.846
0.35	L778123	0.32500	1.117	1.020	0.920	0.927	0.986	0.981
0.35	geldanamycin	0.02230	1.023	1.018	0.912	0.897	0.982	0.975

We can see that all the data related to the treatment response experiment is contained within this table.

4.1 Single Assays Table

To build a LongTable object from this file:

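# The first vector in each list gives the ID columns (they form the row and
# column names shown below); a second vector, if present, gives metadata columns.
# Names on the left are the column names used inside the LongTable.
rowDataCols <- list(
    c(cell_line1="cell_line", BatchID="BatchID"))
colDataCols <- list(
    c(drug1='drugA_name', drug2='drugB_name',
      drug1dose='drugA Conc (uM)', drug2dose='drugB Conc (uM)'),
    c(comboName='combination_name'))
# Each named element is one assay and lists the source columns it holds
assayCols <- list(viability=paste0('viability', seq_len(4)),
                  viability_summary=c('mu/muMax', 'X/X0'))
longTable <- buildLongTable(from=filePath, rowDataCols,
                            colDataCols, assayCols)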

## < LongTable > 
## dim:  8 583 
## assays(2): viability viability_summary 
## rownames(8): A2058:1 A2780:1 A2780:2 ... A427:1 CAOV3:1 CAOV3:2 
## rowData(2): cell_line1 BatchID 
## colnames(583): 5-FU:ABT-888:0.35:0.35 5-FU:AZD1775:0.35:0.0325 5-FU:BEZ-235:0.35:0.0045 ... geldanamycin:SN-38:0.0223:0.000115 geldanamycin:Sorafenib:0.0223:2.5 geldanamycin:Topotecan:0.0223:0.0223 
## colData(5): drug1 drug2 drug1dose drug2dose comboName 
## metadata(0): none

This function will also work if directly passed a data.table or data.frame object:

## [1] "All equal? TRUE"

4.2 Multiple Assay Tables

The second option for building a LongTable is to pass it a list of different assays with a shared set of row and column identifiers. We haven’t had a chance to test this functionality with real data yet, but we do have a toy example.

assayList <- assays(longTable, withDimnames=TRUE, metadata=TRUE, key=FALSE)
assayList$new_viability <- assayList$viability  # Add a fake additional assay
assayCols$new_viability <-  assayCols$viability  # Add column names for fake assay
longTable2 <- buildLongTable(from=assayList, lapply(rowDataCols, names), lapply(colDataCols, names), assayCols)

## < LongTable > 
## dim:  8 583 
## assays(3): viability viability_summary new_viability 
## rownames(8): A2058:1 A2780:1 A2780:2 ... A427:1 CAOV3:1 CAOV3:2 
## rowData(2): cell_line1 BatchID 
## colnames(583): 5-FU:ABT-888:0.35:0.35 5-FU:AZD1775:0.35:0.0325 5-FU:BEZ-235:0.35:0.0045 ... geldanamycin:SN-38:0.0223:0.000115 geldanamycin:Sorafenib:0.0223:2.5 geldanamycin:Topotecan:0.0223:0.0223 
## colData(5): drug1 drug2 drug1dose drug2dose comboName 
## metadata(0): none

We can see that a new assay has been added to the LongTable object when passed a list of assay tables containing the required row and column IDs. Additionally, any row or column IDs not already in rowData or colData will be appended to these slots automatically!
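One way to picture how identifiers from several assay tables could be pooled is sketched below in base R. This is an assumption about the general approach, not the code used by buildLongTable, and the toy tables and values are invented for illustration.

```r
idCols <- c('cell_line1', 'BatchID')

# Two toy assay tables that share ID columns but cover different rows
assayList <- list(
    viability=data.frame(cell_line1=c('A2058', 'A2780'), BatchID=c(1, 1),
                         viability1=c(0.9, 0.8), stringsAsFactors=FALSE),
    new_viability=data.frame(cell_line1=c('A2780', 'CAOV3'), BatchID=c(1, 2),
                             viability1=c(0.7, 0.6), stringsAsFactors=FALSE)
)

# Stack the ID columns from every assay and keep each combination once
allIDs <- do.call(rbind, lapply(assayList, function(a) a[, idCols]))
rowData <- unique(allIDs)
rownames(rowData) <- NULL

# IDs seen only in the second assay (CAOV3, batch 2) get a key as well
rowData$rowKey <- seq_len(nrow(rowData))
print(rowData)
```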

5 LongTable Object

As mentioned previously, a LongTable has both list-like and table-like behaviours. For table-like operations, a given LongTable can be thought of as a rowKey by colKey rectangular object.

To support data.frame-like subsetting for this object, the constructor makes pseudo row and column names, which are the ID columns for each row of rowData or colData pasted together with a ‘:’.

5.1 Row and Column Names

head(rownames(longTable))

## [1] "A2058:1" "A2780:1" "A2780:2" "A375:1"  "A375:2"  "A427:1"

We see that the rownames for the Merck LongTable are the cell line name pasted to the batch id.

head(colnames(longTable))

## [1] "5-FU:ABT-888:0.35:0.35"        "5-FU:AZD1775:0.35:0.0325"     
## [3] "5-FU:BEZ-235:0.35:0.0045"      "5-FU:Bortezomib:0.35:0.00045" 
## [5] "5-FU:Dasatinib:0.35:0.024"     "5-FU:Dinaciclib:0.35:0.000925"

For the column names, a similar pattern is followed by combining the colID columns in the form ‘drug1:drug2:drug1dose:drug2dose’.
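The naming pattern can be reproduced with a short base R sketch. The helper `makeNames` is hypothetical and exists only for this illustration; the constructor may build the names differently.

```r
# Toy ID columns taken from the first rows of the Merck example
rowIDs <- data.frame(cell_line1=c('A2058', 'A2780'), BatchID=c(1, 1),
                     stringsAsFactors=FALSE)
colIDs <- data.frame(drug1='5-FU', drug2=c('ABT-888', 'AZD1775'),
                     drug1dose=0.35, drug2dose=c(0.35, 0.0325),
                     stringsAsFactors=FALSE)

# Hypothetical helper: paste the ID columns of each row together with ':'
makeNames <- function(ids) do.call(paste, c(ids, sep=':'))

print(makeNames(rowIDs))
print(makeNames(colIDs))
```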

5.2 `data.frame` Subsetting

We can subset a LongTable using the same row and column name syntax as with a data.frame or matrix.

row <- rownames(longTable)[1]
columns <- colnames(longTable)[1:2]
longTable[row, columns]

## < LongTable > 
## dim:  1 2 
## assays(2): viability viability_summary 
## rownames(1): A2058:1 
## rowData(2): cell_line1 BatchID 
## colnames(2): 5-FU:ABT-888:0.35:0.35 5-FU:AZD1775:0.35:0.0325 
## colData(5): drug1 drug2 drug1dose drug2dose comboName 
## metadata(0): none

5.2.1 Regex Queries

However, unlike a data.frame or matrix this subsetting also accepts partial row and column names as well as regex queries.

head(rowData(longTable), 3)

##    cell_line1 BatchID
## 1:      A2058       1
## 2:      A2780       1
## 3:      A2780       2

head(colData(longTable), 3)

##    drug1   drug2 drug1dose drug2dose      comboName
## 1:  5-FU ABT-888      0.35    0.3500 5-FU & ABT-888
## 2:  5-FU AZD1775      0.35    0.0325 5-FU & AZD1775
## 3:  5-FU BEZ-235      0.35    0.0045 5-FU & BEZ-235

For example, if we want to get all instances where ‘5-FU’ is the drug:

longTable[, '5-FU']

## < LongTable > 
## dim:  5 22 
## assays(2): viability viability_summary 
## rownames(5): A2058:1 A2780:1 A375:1 A427:1 CAOV3:1 
## rowData(2): cell_line1 BatchID 
## colnames(22): 5-FU:ABT-888:0.35:0.35 5-FU:AZD1775:0.35:0.0325 5-FU:BEZ-235:0.35:0.0045 ... 5-FU:geldanamycin:0.35:0.0223 MK-4541:5-FU:0.045:0.35 MRK-003:5-FU:0.35:0.35 
## colData(5): drug1 drug2 drug1dose drug2dose comboName 
## metadata(0): none

This has matched all colnames where 5-FU was in either drug1 or drug2. If we only want to match drug1, we have several options:

all.equal(longTable[, '5-FU:*:*:*'], longTable[, '^5-FU'])

## [1] TRUE
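The difference between the unanchored and anchored queries can be seen with plain grepl on a few of the column names above. This sketch assumes the matching behaves like an ordinary regular expression applied to the pasted names. It does not cover the ‘5-FU:*:*:*’ form, whose handling is not described here.

```r
# Three column names in the 'drug1:drug2:drug1dose:drug2dose' pattern
cols <- c('5-FU:ABT-888:0.35:0.35',
          'MK-4541:5-FU:0.045:0.35',
          'Zolinza:MK-8776:0.0925:0.0925')

# Unanchored: matches 5-FU in either the drug1 or the drug2 position
anywhere <- cols[grepl('5-FU', cols)]

# Anchored with '^': matches only names that start with 5-FU (drug1)
drug1Only <- cols[grepl('^5-FU', cols)]

print(anywhere)
print(drug1Only)
```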

5.3 `data.table` Subsetting

In addition to regex queries, a LongTable object supports arbitrarily complex subset queries using the data.table API. To access this API, you will need to use the . function, which allows you to pass raw R expressions to be evaluated inside the i and j arguments for dataTable[i, j].
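As a rough illustration of the idea, the base R sketch below captures an unevaluated expression and evaluates it against the columns of a table. The function `dot` is a hypothetical stand-in for the . function, and the mechanism shown is an assumption rather than the actual implementation.

```r
# Hypothetical stand-in for `.`: return the expression without evaluating it
dot <- function(expr) substitute(expr)

# Toy rowData with the same column names as the Merck example
rowData <- data.frame(cell_line1=c('A2058', 'A2780', 'CAOV3', 'CAOV3'),
                      BatchID=c(1, 1, 1, 2), stringsAsFactors=FALSE)

# The query refers to column names that only exist inside rowData
rowQuery <- dot(cell_line1 == 'CAOV3' & BatchID != 2)

# Evaluate the captured expression with the table's columns in scope
keep <- eval(rowQuery, envir=rowData)
print(rowData[keep, ])
```

Delaying evaluation in this way is what lets a query mention column names such as cell_line1 without those names existing as variables in the calling environment.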

For example, if we want to subset to rows where the cell line is CAOV3 and columns where drug1 is Temozolomide and drug2 is either Lapatinib or Bortezomib:

longTable[.(cell_line1 == 'CAOV3'),  # row query
          .(drug1 == 'Temozolomide' & drug2 %in% c('Lapatinib', 'Bortezomib'))]  # column query

## < LongTable > 
## dim:  1 2 
## assays(2): viability viability_summary 
## rownames(1): CAOV3:1 
## rowData(2): cell_line1 BatchID 
## colnames(2): Temozolomide:Bortezomib:2.75:0.00045 Temozolomide:Lapatinib:2.75:0.055 
## colData(5): drug1 drug2 drug1dose drug2dose comboName 
## metadata(0): none

We can also invert matches or subset on other columns in rowData or colData:

subLongTable <-
  longTable[.(BatchID != 2),
            .(drug1 == 'Temozolomide' & drug2 != 'Lapatinib')]

To show that it works as expected:

print(paste0('BatchID: ', paste0(unique(rowData(subLongTable)$BatchID), collapse=', ')))

## [1] "BatchID: 1"

print(paste0('drug2: ', paste0(unique(colData(subLongTable)$drug2), collapse=', ')))

## [1] "drug2: ABT-888, AZD1775, BEZ-235, Bortezomib, Dasatinib, Dinaciclib, Erlotinib, MK-2206, MK-4827, MK-5108, MK-8669, MK-8776, Oxaliplatin, PD325901, SN-38, Sorafenib, Topotecan, geldanamycin"

6 Accessor Methods

6.1 rowData

head(rowData(longTable), 3)

##    cell_line1 BatchID
## 1:      A2058       1
## 2:      A2780       1
## 3:      A2780       2

head(rowData(longTable, key=TRUE), 3)

##    cell_line1 BatchID rowKey
## 1:      A2058       1      1
## 2:      A2780       1      2
## 3:      A2780       2      3

6.2 colData

head(colData(longTable), 3)

##    drug1   drug2 drug1dose drug2dose      comboName
## 1:  5-FU ABT-888      0.35    0.3500 5-FU & ABT-888
## 2:  5-FU AZD1775      0.35    0.0325 5-FU & AZD1775
## 3:  5-FU BEZ-235      0.35    0.0045 5-FU & BEZ-235

head(colData(longTable, key=TRUE), 3)

##    drug1   drug2 drug1dose drug2dose      comboName colKey
## 1:  5-FU ABT-888      0.35    0.3500 5-FU & ABT-888      1
## 2:  5-FU AZD1775      0.35    0.0325 5-FU & AZD1775      2
## 3:  5-FU BEZ-235      0.35    0.0045 5-FU & BEZ-235      3

6.3 assays

assays <- assays(longTable)
assays[[1]]

##       viability1 viability2 viability3 viability4 rowKey colKey
##    1:      0.971      1.090      0.949      0.996      1      1
##    2:      0.893      1.106      0.907      1.029      1      2
##    3:      0.921      0.947      0.915      0.956      1      3
##    4:      0.983      0.962      0.950      0.954      1      4
##    5:      0.798      0.778      0.946      0.312      1      5
##   ---                                                          
## 2911:      0.824      0.817      0.988      0.835      8    499
## 2912:      0.926      0.871      1.069      0.995      8    560
## 2913:      0.815      0.845      0.753      0.677      8    561
## 2914:      0.670      0.779      0.647      0.822      8    568
## 2915:      1.028      1.020      1.021      1.032      8    574

assays[[2]]

##       mu/muMax  X/X0 rowKey colKey
##    1:    0.992 0.988      1      1
##    2:    0.984 0.977      1      2
##    3:    0.965 0.953      1      3
##    4:    0.978 0.970      1      4
##    5:    0.879 0.846      1      5
##   ---                             
## 2911:    0.757 0.714      8    499
## 2912:    0.947 0.930      8    560
## 2913:    0.683 0.645      8    561
## 2914:    0.581 0.559      8    568
## 2915:    1.032 1.045      8    574

assays <- assays(longTable, withDimnames=TRUE)
colnames(assays[[1]])

##  [1] "drug1"      "drug2"      "drug1dose"  "drug2dose"  "comboName" 
##  [6] "cell_line1" "BatchID"    "viability1" "viability2" "viability3"
## [11] "viability4"

assays <- assays(longTable, withDimnames=TRUE, metadata=TRUE)
colnames(assays[[2]])

## [1] "drug1"      "drug2"      "drug1dose"  "drug2dose"  "comboName" 
## [6] "cell_line1" "BatchID"    "mu/muMax"   "X/X0"

assayNames(longTable)

## [1] "viability"         "viability_summary"

Using these names we can access specific assays within a LongTable.

6.4 assay

colnames(assay(longTable, 'viability'))

## [1] "viability1" "viability2" "viability3" "viability4" "rowKey"    
## [6] "colKey"

assay(longTable, 'viability')

##       viability1 viability2 viability3 viability4 rowKey colKey
##    1:      0.971      1.090      0.949      0.996      1      1
##    2:      0.893      1.106      0.907      1.029      1      2
##    3:      0.921      0.947      0.915      0.956      1      3
##    4:      0.983      0.962      0.950      0.954      1      4
##    5:      0.798      0.778      0.946      0.312      1      5
##   ---                                                          
## 2911:      0.824      0.817      0.988      0.835      8    499
## 2912:      0.926      0.871      1.069      0.995      8    560
## 2913:      0.815      0.845      0.753      0.677      8    561
## 2914:      0.670      0.779      0.647      0.822      8    568
## 2915:      1.028      1.020      1.021      1.032      8    574

colnames(assay(longTable, 'viability', withDimnames=TRUE))

##  [1] "drug1"      "drug2"      "drug1dose"  "drug2dose"  "comboName" 
##  [6] "cell_line1" "BatchID"    "viability1" "viability2" "viability3"
## [11] "viability4"

assay(longTable, 'viability', withDimnames=TRUE)

##              drug1        drug2 drug1dose drug2dose                comboName
##    1:         5-FU      ABT-888    0.3500  0.350000           5-FU & ABT-888
##    2:         5-FU      AZD1775    0.3500  0.032500           5-FU & AZD1775
##    3:         5-FU      BEZ-235    0.3500  0.004500           5-FU & BEZ-235
##    4:         5-FU   Bortezomib    0.3500  0.000450        5-FU & Bortezomib
##    5:         5-FU    Dasatinib    0.3500  0.024000         5-FU & Dasatinib
##   ---                                                                       
## 2911: Temozolomide    Erlotinib    2.7500  0.055000 Temozolomide & Erlotinib
## 2912:      Zolinza   Dinaciclib    0.0925  0.000925     Zolinza & Dinaciclib
## 2913:      Zolinza    Erlotinib    0.0925  0.055000      Zolinza & Erlotinib
## 2914:      Zolinza      MK-8776    0.0925  0.092500        Zolinza & MK-8776
## 2915:      Zolinza Temozolomide    0.0925  2.750000   Zolinza & Temozolomide
##       cell_line1 BatchID viability1 viability2 viability3 viability4
##    1:      A2058       1      0.971      1.090      0.949      0.996
##    2:      A2058       1      0.893      1.106      0.907      1.029
##    3:      A2058       1      0.921      0.947      0.915      0.956
##    4:      A2058       1      0.983      0.962      0.950      0.954
##    5:      A2058       1      0.798      0.778      0.946      0.312
##   ---                                                               
## 2911:      CAOV3       2      0.824      0.817      0.988      0.835
## 2912:      CAOV3       2      0.926      0.871      1.069      0.995
## 2913:      CAOV3       2      0.815      0.845      0.753      0.677
## 2914:      CAOV3       2      0.670      0.779      0.647      0.822
## 2915:      CAOV3       2      1.028      1.020      1.021      1.032

The LongTable Class

27 October 2020

