For whatever reason, following on from my despair with normalizing gene expression data from earlier in the week, my most recent challenge has been to take a Bioconductor ExpressionSet of gene expression data measured on an Affymetrix GeneChip® Human Transcriptome Array 2.0 and label each row with its corresponding gene symbol instead of its probe ID.
I have seen a lot of code samples that suggest variations on a theme: use the biomaRt package, or query a SQL database of annotation data directly. With the former I gave up trying; from the latter I ran away to hide, having only recently interacted with a SQL database through Java’s JPA abstraction layer.
It turns out to be very easy to do this using the affycoretools package by James W. MacDonald, which contains ‘various wrapper functions that have been written to streamline the more common analyses that a core Biostatistician might see.’
As you can see below, you can very easily extract a vector of gene symbols for each of your probe IDs and assign it as the rownames of your gene expression matrix.
I hope this will save you the trouble of finding this gem of a package.
# load the required Bioconductor packages
library(GEOquery)       # getGEO()
library(affycoretools)  # annotateEset()
library(pd.hta.2.0)     # annotation platform for the HTA 2.0 array

# retrieve ExpressionSet using GEOquery
gse76250 <- getGEO('GSE76250')[[1]]
# populate the fData slot in the ExpressionSet
# with gene symbols
annot.gse76250 <- annotateEset(gse76250, pd.hta.2.0)
# extract the expression matrix
gse76250.expr <- exprs(annot.gse76250)
# change rownames of the expression matrix
# to gene symbols instead of probe ids
gene.symbols <- fData(annot.gse76250)$SYMBOL
rownames(gse76250.expr) <- gene.symbols
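One caveat worth checking on your own data: probes that cannot be annotated come back with an NA symbol, so you may want to drop those rows before renaming. A minimal sketch with stand-in data (in the real workflow, `expr` would be gse76250.expr and `gene.symbols` the SYMBOL column from fData):

```r
# stand-ins for gse76250.expr and fData(annot.gse76250)$SYMBOL
expr <- matrix(1:6, nrow = 3,
               dimnames = list(c("probe1", "probe2", "probe3"), NULL))
gene.symbols <- c("TP53", NA, "BRCA1")

# keep only the rows with a mapped gene symbol, then rename
keep <- !is.na(gene.symbols)
expr <- expr[keep, , drop = FALSE]
rownames(expr) <- gene.symbols[keep]
```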
Recently I have been familiarising myself with analysing microarray data in R. Statistics and Analysis for Microarrays Using R and Bioconductor by Sorin Draghici is proving to be indispensable in guiding me through retrieving microarray data from the Gene Expression Omnibus (GEO), performing typical quality control on samples, and normalizing expression data across multiple samples.
As an example, I wanted to examine the gene expression profiles from GSE76250, a comprehensive set of 165 Triple-Negative Breast Cancer samples. To perform the quality control on this dataset as detailed in the book, I needed to download the Affymetrix .CEL files and then load them into R as an AffyBatch object:
library(affy)  # list.celfiles() and ReadAffy()

cel.files <- list.celfiles('GSE76250/', full.names=TRUE)
raw.data <- ReadAffy(filenames=cel.files)
The raw.data AffyBatch object representing these data takes over 4 gigabytes of memory once loaded into R. Performing normalization with rma(raw.data) (Robust Multi-Array Average) then creates an ExpressionSet that effectively doubles that memory footprint.
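If you want to see where the memory is going, base R can report both per-object and total usage; in this sketch a large random matrix stands in for the AffyBatch (a numeric matrix costs 8 bytes per cell):

```r
# a 1000 x 1000 matrix of doubles: stand-in for a large AffyBatch
big.object <- matrix(rnorm(1e6), nrow = 1000)

# per-object size
print(object.size(big.object), units = "MB")

# total memory currently held by R, plus a garbage-collection pass
invisible(gc())
```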
This is where I come a bit unstuck. My laptop is an Asus Zenbook 13-inch UX303A which comes with (what I thought to be) a whopping 11 gigabytes of RAM. This meant that after loading and transforming all the data on my laptop, I had effectively maxed out my RAM. The obvious answer would be to upgrade my RAM. Unfortunately, due to the small form factor of my laptop, I only have one accessible RAM slot, meaning my options are limited.
So, I have concluded that I have three other options to resolve this issue.
- Firstly, I could buy a machine that has significantly more memory capacity at the expense of portability. Ideally, I don’t want to do this because it is the most expensive approach to tackling the problem.
- Another option would be to rent a Virtual Private Server (VPS) with lots of RAM and install RStudio Server on it. I’m quite tempted by this, but I don’t like the idea of my programming environment being exposed to the internet. Having said that, the data I am analysing is not sensitive, and any code that I write could be safely committed to a private Bitbucket or GitHub repository.
- Or, I could invest the time in approaching the above problem in a less naive way! That would mean reading the documentation for the various R and Bioconductor packages to uncover a more memory-efficient method or, it could mean scoping my data tactically so that, for instance, the AffyBatch object is garbage collected, freeing up memory once I no longer need it.
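As a sketch of what that tactical scoping could look like, here is the idea in base R only (the large matrix and colMeans() stand in for the AffyBatch and rma() respectively, and summarise.big.data is a name I have made up):

```r
# confine the large intermediate to a function so it becomes eligible
# for garbage collection once the smaller summary has been returned
summarise.big.data <- function() {
  big <- matrix(rnorm(1e6), nrow = 1000)  # stand-in for the AffyBatch
  colMeans(big)                           # stand-in for rma()'s ExpressionSet
}                                         # 'big' goes out of scope here

result <- summarise.big.data()
invisible(gc())  # prompt R to return the freed memory to the OS
```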
In any case, I have learned to be reluctant to follow the final path unless it is absolutely necessary. I don’t particularly want to risk obfuscating my code with extra layers of indirection while, at the same time, leaving myself open to more mistakes as the code grows more convoluted.
The moral of the story is not to buy a laptop for its form factor if your plan is to do some real work on it. Go for the clunkier option that has more than one RAM slot.
Either that or I could Download More Ram.