Following on from my despair with normalizing gene expression data earlier in the week, my most recent challenge has been to take a Bioconductor ExpressionSet of gene expression data, measured on an Affymetrix GeneChip® Human Transcriptome Array 2.0, and label each row with its gene symbol instead of its probe ID.
I have seen a lot of code samples that suggest either using the biomaRt package or querying a SQL database of annotation data directly. With the former, I gave up trying; with the latter, I ran away to hide, having only recently interacted with a SQL database through Java’s JPA abstraction layer.
It turns out to be very easy to do this using the affycoretools package by James W MacDonald, which contains ‘various wrapper functions that have been written to streamline the more common analyses that a core Biostatistician might see.’
As you can see below, you can very easily extract a vector of gene symbols, one for each of your probe IDs, and assign it as the row names of your gene expression matrix (note that exprs() returns a matrix rather than a data.frame).
I hope this will save you the trouble of finding this gem of a package.
# load the packages used below (all from Bioconductor)
library(GEOquery)
library(affycoretools)
library(pd.hta.2.0)
# retrieve ExpressionSet using GEOquery;
# getGEO() returns a list, so take its first element
gse76250 <- getGEO('GSE76250')[[1]]
# populate the fData slot in the ExpressionSet
# with gene symbols
annot.gse76250 <- annotateEset(gse76250, pd.hta.2.0)
# extract the expression matrix
gse76250.expr <- exprs(annot.gse76250)
# change rownames of the expression matrix
# to gene symbols instead of probe ids
gene.symbols <- fData(annot.gse76250)$SYMBOL
rownames(gse76250.expr) <- gene.symbols
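One thing to watch for after this step: some probe IDs may have no gene symbol (NA), and several probe IDs may map to the same symbol, so the new row names are not guaranteed to be unique. The sketch below is one possible way to tidy that up, not something affycoretools does for you. It uses a small made-up matrix and symbol vector as stand-ins for gse76250.expr and gene.symbols, drops the probes without a symbol, and averages the probes that share one. Averaging is my assumption here; keeping the most variable probe per gene is another common choice.

```r
# Toy stand-ins (hypothetical values) for exprs(annot.gse76250)
# and fData(annot.gse76250)$SYMBOL
expr <- matrix(
  c(1, 2,
    3, 4,
    5, 6,
    7, 8),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2"))
)
symbols <- c("TP53", NA, "BRCA1", "TP53")

# drop probes that have no gene symbol
keep <- !is.na(symbols)
expr.kept <- expr[keep, , drop = FALSE]
symbols.kept <- symbols[keep]

# sum the rows that share a symbol, then divide by the number
# of probes per symbol to get one averaged row per gene
sums <- rowsum(expr.kept, group = symbols.kept)
counts <- table(symbols.kept)
expr.by.gene <- sums / as.vector(counts[rownames(sums)])
```

The result has one row per gene symbol, so the row names are unique and can safely be used for lookups such as expr.by.gene["TP53", ].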