-
Notifications
You must be signed in to change notification settings - Fork 0
/
1_MedInfCIN2024_Omics.Rmd
458 lines (321 loc) · 13 KB
/
1_MedInfCIN2024_Omics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
---
title: "Download and explore data from GEO - Gene Expression Omnibus"
output:
html_document:
toc: yes
toc_float: yes
editor_options:
markdown:
wrap: 80
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## 1) Installing packages
Dependencies for today's exercises include:
| Package | source | Description |
|:------------------:|:------------------:|:--------------------------------------:|
| `biomaRt` | Bioconductor | Get molecular annotation for features (gene, proteins, etc) |
| `Biobase` | Bioconductor | Handle Expression-Sets (gene expression data) |
| `GEOquery` | Bioconductor | Get data from GEO |
| `qusage` | Bioconductor | Read GMT files |
Install these packages from Bioconductor
```{r ins_dep2, results=F, message=F}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install(c("biomaRt", "Biobase", "GEOquery", "qusage"))
```
Let's now load the libraries:
```{r load_lib, results=F, message=F}
library("biomaRt")
library("Biobase")
library("GEOquery")
library("qusage")
```
## 2) Loading data set from GEO
You can visit the GEO entry for `GSE116436` at
<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116436>
`GSE116436` is the GEO entry for the entire data set. It has transcriptome
profiles of **60 cell lines after exposure to 15 anti-cancer agents** after 2, 6
and 24 hrs and associated drug concentrations (incl. null concentration, i.e.
control). If you want to know more about this datasets you can read the article:
> Monks A et al. The NCI Transcriptional Pharmacodynamics Workbench: A Tool to
> Examine Dynamic Expression Profiling of Therapeutic Response in the NCI-60
> Cell Line Panel. Cancer Res 2018 Dec 15;78(24):6807-6817. PMID:
> [30355619](https://www.ncbi.nlm.nih.gov/pubmed/30355619)
You can also click on one sample (transcriptome profile) to see **sample
metadata**. For instance,
[GSM3231645](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3231645)
Importantly, this **data set is large (7.5k samples)**. Thus the data is
stratified by drug:
| GEO accession | Drug |
|:--------------|:--------------|
| GSE116437 | 5-Azacytidine |
| GSE116438 | bortezomib |
| GSE116439 | cisplatin |
| GSE116440 | dasatinib |
| GSE116441 | doxorubicin |
| GSE116442 | erlotinib |
| GSE116443 | geldanamycin |
| GSE116444 | gemcitabine |
| GSE116445 | lapatinib |
| GSE116446 | paclitaxel |
| GSE116447 | sirolimus |
| GSE116448 | sorafenib |
| GSE116449 | sunitinib |
| GSE116450 | topotecan |
| GSE116451 | vorinostat |
Let's start with one subset of the data to understand how it works.
We choose the `GEO accession number` of one subserie. For instance, we choose
lapatinib (GSE116445).
We use the following command line to download the data
```{r message=FALSE, warning=FALSE, include=FALSE}
Sys.setenv("VROOM_CONNECTION_SIZE" = 1.5 * 1024^2)
```
```{r pressure, cache=TRUE, results=F, message=F}
# Download the lapatinib dataset
lapatinib <- getGEO("GSE116445", AnnotGPL = TRUE)[[1]]
```
You might need to increase the size of the connection buffer. For example to
1.5MB `Sys.setenv("VROOM_CONNECTION_SIZE" = 1.5*1024^2)`
**Try it for yourself. Download the cisplatin dataset and store it in a variable
named `cisplatin`.**
## 3) Exploring the data object
After this, we have a R object named `lapatinib`. This object contains the gene
expression data and sample metadata. To explore this object, we can use:
```{r expl_dim_eset, results=F, message=F}
class(lapatinib)
# It is a Expression-Set, which is a special object to store 'expression data'
# What is the dimension of this dataset?
dim(lapatinib)
# What attributes does the object have?
attributes(lapatinib)
```
**How many Features and Samples are in the cisplatin data?**
### What do the features and samples look like?
```{r expl_dim_eset2, results=F, message=F}
# Features = probes
head(featureNames(lapatinib))
# Samples = transcriptome profiles (under certain experimental conditions)
head(sampleNames(lapatinib))
```
**Are the features of the lapatinib set the same as the cisplatin set?**
## 4) Subsetting the data
We expect to see the strongest patterns of change if we compare control samples
that were not treated with the drug (concentration = 0) and samples that were
treated with the highest dose. Maybe we would also expect that the effect will
be most noticeable if we compare the expressions at the last measured time
point, i.e., after 24 hours of administration. To subset the data to select
samples of interest, we first get the sample metadata for these two features:
drug concentration and time point.
```{r sample_metadata, results=F, message=F}
# the sample metadata contains the drug concentrations within tags
sample_tags <- lapatinib$title
# Have a look at some tags
head(sample_tags)
## Drug concentration
# Extract the part describing drug concentration
drug_conc <- sapply(strsplit(sample_tags, "_"), `[`, 3)
# Replace nanoMolar with empty string
drug_conc <- sub("nM", "", drug_conc)
# Convert to numbers
drug_conc <- as.numeric(drug_conc)
# Now we have the drug concentration information
head(drug_conc)
# Max conc
drug_conc.max <- max(drug_conc)
# Min conc
drug_conc.min <- min(drug_conc)
# Add drug concentrations back to lapatinib set
lapatinib$drug_conc <- drug_conc
### Extract information about the lapatinib time points (element 4). What are the minimum and maximum time points?
## Time point: basically the same as above
drug_tpoint <- sapply(strsplit(sample_tags, "_"), `[`, 4)
drug_tpoint <- sub("h", "", drug_tpoint)
drug_tpoint <- as.numeric(drug_tpoint)
# Find min and max
# drug_tpoint.max =
# drug_tpoint.min =
## Add back to lapatinib set
lapatinib$drug_tpoint <- drug_tpoint
```
Finally we subset the data
```{r subset_prof, results=F, message=F}
# Select samples that have been given the maximum or minimum drug concentration
selected_samples <- (drug_conc == drug_conc.max | drug_conc == drug_conc.min)
lapatinib <- lapatinib[, selected_samples]
# How many samples do we now have in the lapatinib set?
dim(lapatinib)
```
We can further subset the data to contain only information about the last time
point. You can attempt to do that and see the effect of this subsetting on the
analyses that follow.
## 5) Access to the gene expression and annotation data
Expression-Sets are objects to store gene expression data and metadata
associated to both genes and samples. All this information is stored in an
`R object`.
**The Gene Expression data**: Gene expression data is stored as a quantitative
matrix. This matrix contains measurements of the gene-level expression. Each row
corresponds to a gene and each column corresponds to a sample.
**Metadatafor samples and features**: There are two `data.frame`s containing
metadata for samples and features. Phenotype information and complementary
information for the probes.
To access to the gene expression data, we use `exprs()`.
```{r exprs_access, results=F, message=F}
# Access expression set
e_data <- exprs(lapatinib)
head(e_data)
```
`Expresion-Set` objects store the metadata information within the object itself.
To access this information, you can use two functions from the `Biobase`
package:
- `pData()` : Access to phenotypeData. The phenotypeData is the sample
metadata described in the GEO entry of each sample.
- `fData()` : Access to featureData. The featureData is complementary
information for each one of the features (probes).
### Sample metadata
```{r pdata_access, results=F, message=F}
# Access phenotype data
p_data <- pData(lapatinib)
head(p_data)
```
### Feature metadata
```{r fdata_access, results=F, message=F}
# Access feature data
f_data <- fData(lapatinib)
head(f_data)
```
## 6) Identifier conversion
In order to analyse and interpret the data, we need to map this identifiers to
genes and experimental conditions of each sample. Microarrays are
High-Throughput platforms where gene-level expression is measured using one or a
few probes for the same gene.
For instance, we could know which platform was used in our dataset:
```{r platform, results=F, message=F}
# GEO platform identifier
annotation(lapatinib)
# Or using the sample metadata
unique(pData(lapatinib)$hyb_protocol)
```
In order to create a matrix of Genes x Samples, we are going to collapse the
intensity of one or more probes into one single value that represents the
expression of the gene. For this, we need to create a vector that maps each
probe to its corresponding gene
```{r match, cache=T, eval=F, results=F, message=F}
# Loading Mart object for Ensembl Homo sapiens data
mart <- useMart("ENSEMBL_MART_ENSEMBL", "hsapiens_gene_ensembl",
host = "www.ensembl.org"
)
query <- featureNames(lapatinib)
# Using annotation and hyb-platform information above
matching_table <- getBM(c("affy_hg_u133a_2", "external_gene_name"),
filters = "affy_hg_u133a_2", values = query, mart = mart
)
colnames(matching_table) <- c("probe", "gene")
```
### Workaround (if Ensembl servers are down)
Gene information is found in feature metadata
```{r eval=F, results=F, message=F}
# View feature data gene column
head(f_data$`Gene symbol`) %>% head()
# Getting unique IDs (first gsymbol)
matching_table2 <- data.frame(
probe = f_data$ID,
gene = gsub(
"(///|-).*", "",
f_data$`Gene symbol`
)
)
```
### Renaming columns
```{r eval=F, results=F, message=F}
# Look at current head of expression data
e_data[1:5, 1:5]
# Convert columns names to something readable
our_columns <- colnames(e_data)
our_columns <- p_data[our_columns, "title"]
colnames(e_data) <- our_columns
```
### Collapsing probe intensities to gene-level expression
```{r col_expr, eval=F, results=F, message=F}
# Transform into a data.frame
df <- as.data.frame(e_data)
# Matching Identifiers
gsymbols <- as.character(sapply(rownames(df), function(x) {
aux <- matching_table[matching_table$probe == x, "gene"][1]
}))
# We add another column to the data.frame which describe the corresponding gene
df$gene <- gsymbols
# Aggregate values for the same gene by the average
df2 <- aggregate(. ~ gene, data = df, mean)
# Replace row.names with genes
rownames(df2) <- as.character(df2$gene)
# Removing unmapped GeneSymbols
df2 <- df2[!rownames(df2) == "", ]
# Remove the column with the genes
df2 <- df2[, colnames(df2) != "gene"]
```
## 7) Saving the data
```{r save_data, eval=F, results=F, message=F}
# Saving the expression data
saveRDS(df2, "lapatinib_expression.rds")
# Let us save the annotations for the upcoming days
saveRDS(p_data, "lapatinib_phenotype.rds")
```
## 8) Retrieving pathway annotation
Let's take as an example a curated gene set (C2) from
[MSigDB](http://software.broadinstitute.org/gsea/msigdb/).
Download the `.gmt` file from the Biocarta curated pathways and subset the list
of genes corresponding to the MAPK pathway (`BIOCARTA_MAPK_PATHWAY`). \>
Alternatively, you can search directly on the website and dowload only that \>
specific list of
genes([here](http://software.broadinstitute.org/gsea/msigdb/cards/BIOCARTA_MAPK_PATHWAY.html))
```{r load_gset, eval=F, results=F, message=F}
gsets <- read.gmt("data/c2.cp.biocarta.v7.1.symbols.gmt")
gset <- gsets[["BIOCARTA_MAPK_PATHWAY"]]
```
## 9) Obtaining PPI network
Now we will download the protein-protein interaction network from
[OmniPath](http://omnipathdb.org/interactions?genesymbols=1).
```{r opath, eval=F, results=F, message=F}
url <- "http://omnipathdb.org/interactions?genesymbols=1"
opath <- unique(read.table(url, sep = "\t", header = T)[, c(3, 4)])
# Subsetting the PPI to members of the pathway
pathway <- opath[opath[, 1] %in% gset & opath[, 2] %in% gset, ]
```
## 10) Mapping expression to pathway
Finally, we extract the median expression of all nodes in our pathway and save
them in a file as well as the network itself.
```{r map_exp_path, eval=F, results=F, message=F}
unique_genes <- unique(c(
as.character(pathway[, 1]),
as.character(pathway[, 2])
))
measurements <- data.frame()
# Extracting median of measurements of pathway members
for (g in unique_genes) {
measurements[g, 1] <- rowMedians(as.matrix(df2[g, ]))
}
# Storing the results
write.table(pathway, "pathway.txt", sep = "\t", quote = F, row.names = F)
write.csv(measurements, "measurements.csv", quote = F)
```
### Kegg apoptosis
```{r kegg, eval=F, results=F, message=F}
gsets <- read.gmt("data/kegg_apop.gmt")
gset <- gsets[["KEGG_APOPTOSIS"]]
pathway <- opath[opath[, 1] %in% gset & opath[, 2] %in% gset, ]
unique_genes <- unique(c(
as.character(pathway[, 1]),
as.character(pathway[, 2])
))
measurements <- data.frame()
# Extracting median of measurements of pathway members
for (g in unique_genes) {
measurements[g, 1] <- rowMedians(as.matrix(df2[g, ]))
}
# Storing the results
write.table(pathway, "apop_pathway.txt", sep = "\t", quote = F, row.names = F)
write.csv(measurements, "apop_measurements.csv", quote = F)
```