Batch Effects: The Road (Often) Not Taken

Saket Choudhary

What are Batch-Effects?

Technical sources of variation that often confound the effects arising from biological differences.

They arise from, but are not limited to:

  • Different processing time
  • Different handlers
  • Amount of reagent
  • Different instrument or different lanes

Why care at all about Batch-Effects?

  • Inherent goal of all high-throughput sequencing: Separate signal from noise to understand underlying biology
  • Further complicated by latent variables or unwanted heterogeneity
  • Most widely recognised latent variable: Batch-effects
  • Severely compromising effect on biological and/or statistical validity

Batch-effects are widespread in literature



modENCODE: Expression in tissues is species-specific (?!)


modENCODE: Experimental Design


Correcting for batch-effects re-establishes a well-known fact

Expression is tissue-specific (mostly) and not species-specific


In-house example: Controls and Knockdown are similar (?!)


In-house example: Correcting batch-effects


Looking for batch effects

PCA/MDS are often not sufficient


Looking for batch effects

RLE plots: Relative Log Expression, i.e. \( \log \) expression values centered on each gene's median across samples
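A minimal sketch of how RLE values could be computed, assuming a genes x samples count matrix and NumPy. The function name `relative_log_expression`, the pseudocount, and the toy counts are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def relative_log_expression(counts, pseudocount=1.0):
    """Return the RLE matrix for a genes x samples count matrix."""
    # Log-transform; the pseudocount (an assumption) avoids log(0)
    log_expr = np.log2(np.asarray(counts, dtype=float) + pseudocount)
    # Median of each gene across all samples
    gene_medians = np.median(log_expr, axis=1, keepdims=True)
    # Deviation of every sample from the per-gene median
    return log_expr - gene_medians

# Toy data: 4 genes x 4 samples; the last two samples carry roughly doubled counts
counts = np.array([
    [10, 12, 20, 24],
    [100, 90, 200, 180],
    [5, 6, 10, 12],
    [50, 55, 100, 110],
])
rle = relative_log_expression(counts)

# One value per sample: in the absence of batch effects these should sit near 0
sample_medians = np.median(rle, axis=0)
print(np.round(sample_medians, 2))
```

In an RLE plot each sample becomes one boxplot of its column in `rle`; boxes that are shifted away from zero or unusually wide hint at unwanted variation.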


Methods for batch-effects correction

  • Remove Unwanted Variation : RUV - Gagnon-Bartsch and Speed, Biostatistics (2012); Risso et al., Nature Biotech (2014)
  • Surrogate Variable Analysis : SVA - Leek et al., PLoS Genetics (2007); Leek et al., Bioinformatics (2012)
  • Correct for measured Batch Effects : ComBat - Johnson et al., Biostatistics (2007)

RUV -- With replicate/negative controls

Assume that some genes can act as negative controls: any difference in their expression values across samples arises from unmodelled factors, not from the biology of interest

For \( J \) genes, \( n \) samples, \( k \) unmodelled factors and \( p \) known covariates (independent of the unmodelled factors): \[ \begin{align*} \log E[Y|W,X,O] &= \underbrace{W_{n \times k}}_{\text{Hidden factors design}} \times \alpha_{k \times J} + \overbrace{X_{n \times p}}^{\text{Known covariates design matrix}} \times \beta_{p \times J} \\ & + \underbrace{O}_{\text{offset}} \end{align*} \]

RUV -- With replicate/negative controls

General Idea, given a pool of \( J_c \) negative control genes:

  • \( Z_c = \log Y_c - O_c \)
  • \( Z_c^* = Z_c - \operatorname{median}(Z_{c,\cdot j}) \), i.e. center each control gene \( j \) on its median across samples
  • \( Z_c^* = U_{n \times n} \Lambda_{n \times J_c} V^T_{J_c \times J_c} \)
  • Assume \( k \) given (how?), estimate \( \widehat{W\alpha_c} = U\Lambda_kV^T \), retaining only the \( k \) largest singular values in \( \Lambda_k \)
  • Substitute \( \hat{W} = U\Lambda_k \) into the full model to estimate \( \alpha \) and \( \beta \)

Can be modified to account for replicates.
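The negative-control steps can be sketched in NumPy as below. This is a simplified illustration and not the published RUV implementation: the function name `estimate_unwanted_factors`, the simulated data, and the choice \( k = 1 \) are assumptions, and the offset is taken to be already removed from the log expression matrix.

```python
import numpy as np

def estimate_unwanted_factors(log_expr, control_idx, k):
    """Estimate W_hat (n x k) from negative control genes.

    log_expr: samples x genes matrix of log expression, offset already removed.
    control_idx: column indices of the assumed negative control genes.
    """
    z_c = log_expr[:, control_idx]
    # Center each control gene on its median across samples
    z_c_centered = z_c - np.median(z_c, axis=0, keepdims=True)
    # SVD of the centered control matrix
    u, s, vt = np.linalg.svd(z_c_centered, full_matrices=False)
    # Keep the k largest singular values: W_hat = U * Lambda_k
    return u[:, :k] * s[:k]

# Simulated example (assumption): 8 samples, 200 genes, one hidden two-level factor
rng = np.random.default_rng(0)
n, J = 8, 200
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
alpha = rng.normal(1.0, 0.2, size=J)  # gene-specific response to the hidden factor
log_expr = rng.normal(5.0, 0.1, size=(n, J)) + np.outer(batch, alpha)

control_idx = np.arange(50)  # pretend the first 50 genes are negative controls
w_hat = estimate_unwanted_factors(log_expr, control_idx, k=1)

# The estimated factor should track the hidden batch up to sign and scale
corr = np.corrcoef(w_hat[:, 0], batch)[0, 1]
print(round(abs(corr), 3))
```

The sign and scale of `w_hat` are arbitrary, which is harmless when it is only used as a covariate in the downstream model.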

SVA -- For any unmodelled factors, not just Batch

Genes 201-500: affected by an independent factor (unmodelled factor, say age), possibly correlated with class

SVA -- For any unmodelled factors, not just Batch

For the \( g^{th} \) gene, the \( j^{th} \) sample and \( L \) 'unmodelled' factors:

\[ \begin{align*} \overbrace{Y_{gj}}^{\text{Expression}} &= \underbrace{\mu_g}_{\text{basal expression}} + \overbrace{f_g(c_j)}^{\text{Dependence on primary variable (say condition)}} \\ &+ \sum_{l=1}^L \underbrace{\gamma_{lg}}_{\text{Gene-specific coeff.}} \times \overbrace{p_{lj}}^{l^{th}\text{ unmodelled factor (say batch)}} + \underbrace{\epsilon_{gj}}_{\text{Noise}} \end{align*} \]

SVA -- For any unmodelled factors, not just Batch

General Idea:

  • Remove the effect of the primary variable by regressing it out, leaving a residual matrix
  • Find an orthogonal basis for the residual matrix (SVD) and identify the singular vectors that explain more variation than expected by chance
  • Identify the subset of genes that drives each significant singular vector
  • Build a 'surrogate' variable for each gene subset from the original expression matrix
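A reduced sketch of the first two steps, assuming NumPy and simulated data. It only regresses out the primary variable and takes the leading singular vector of the residuals; the significance test and the gene-subset steps are deliberately left out, so this is not the full SVA algorithm. The function name `residual_singular_vectors` and the toy design are assumptions.

```python
import numpy as np

def residual_singular_vectors(expr, primary_design, n_sv):
    """Leading singular vectors of the residuals after removing the primary variable.

    expr: genes x samples matrix.
    primary_design: samples x p design matrix (including an intercept column).
    """
    # Least-squares fit of every gene on the primary variable
    coef, *_ = np.linalg.lstsq(primary_design, expr.T, rcond=None)
    residuals = expr.T - primary_design @ coef  # samples x genes
    # Orthogonal basis of the residual matrix
    u, s, vt = np.linalg.svd(residuals, full_matrices=False)
    return u[:, :n_sv], s

# Simulated example (assumption): 300 genes, 8 samples
rng = np.random.default_rng(1)
condition = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)  # primary variable
hidden = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)     # unmodelled factor
expr = rng.normal(0.0, 0.2, size=(300, 8))
expr[:100] += 1.0 * condition  # genes driven by the primary variable
expr[100:] += 1.0 * hidden     # genes driven by the unmodelled factor

design = np.column_stack([np.ones(8), condition])
sv, singular_values = residual_singular_vectors(expr, design, n_sv=1)

# The candidate surrogate variable should track the hidden factor up to sign
corr = np.corrcoef(sv[:, 0], hidden)[0, 1]
print(round(abs(corr), 3))
```

In this toy design the hidden factor is balanced across conditions, which makes it easy to recover; when it is correlated with the primary variable, the later SVA steps matter much more.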

ComBat -- Regress Batch Effects

For the \( g^{th} \) gene in the \( j^{th} \) sample of the \( i^{th} \) batch:

\[ \begin{align*} \overbrace{Y_{ijg}}^{\text{Normalised expression of gene $g$}} &= \underbrace{\alpha_g}_{\text{Overall gene exp.}} + \overbrace{X}^{\text{Design Matrix}}\underbrace{\beta_g}_{\text{Regression coeff.}}\\ &+ \underbrace{\gamma_{ig}}_{\text{Additive effect}} + \overbrace{\delta_{ig}}^{\text{Multiplicative effect}}\epsilon_{ijg}\\ \gamma_{ig} &\sim N(\gamma_i, \tau_i^2) \\ \delta^2_{ig} &\sim \text{Inverse Gamma} (\lambda_i, \theta_i)\\ \underbrace{Y_{ijg}^*}_{\text{Batch adjusted values}} &= \frac{ Y_{ijg}-\hat{\alpha}_g - X\hat{\beta}_g - \hat{\gamma}_{ig} }{ \hat{\delta}_{ig} } + \hat{\alpha}_g + X\hat{\beta}_g \end{align*} \]
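The adjustment formula can be illustrated with a stripped-down location/scale correction in NumPy. This sketch is an assumption-laden simplification and not ComBat itself: it has no covariates \( X \), it uses plain per-batch estimates of \( \hat{\gamma}_{ig} \) and \( \hat{\delta}_{ig} \) with no empirical Bayes shrinkage, and it defines \( \hat{\delta}_{ig} \) relative to the pooled standard deviation of each gene. The function name `location_scale_adjust` is hypothetical.

```python
import numpy as np

def location_scale_adjust(expr, batch):
    """Remove per-batch additive and multiplicative effects, gene by gene.

    expr: genes x samples matrix; batch: one batch label per sample.
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    alpha_hat = expr.mean(axis=1, keepdims=True)  # overall gene expression
    resid = np.empty_like(expr)
    for b in np.unique(batch):
        cols = batch == b
        # Additive batch effect: batch mean minus overall mean
        gamma_hat = expr[:, cols].mean(axis=1, keepdims=True) - alpha_hat
        resid[:, cols] = expr[:, cols] - alpha_hat - gamma_hat
    # Pooled per-gene standard deviation after removing the batch means
    pooled_sd = resid.std(axis=1, keepdims=True)
    adjusted = np.empty_like(expr)
    for b in np.unique(batch):
        cols = batch == b
        # Multiplicative batch effect, relative to the pooled standard deviation
        delta_hat = resid[:, cols].std(axis=1, keepdims=True) / pooled_sd
        adjusted[:, cols] = resid[:, cols] / delta_hat + alpha_hat
    return adjusted

# Simulated example (assumption): batch 1 is shifted by +2 and three times noisier
rng = np.random.default_rng(3)
batch = np.array([0, 0, 0, 1, 1, 1])
expr = rng.normal(5.0, 0.5, size=(50, 6))
expr[:, batch == 1] = 7.0 + 3.0 * (expr[:, batch == 1] - 5.0)

adjusted = location_scale_adjust(expr, batch)
mean_gap = np.abs(adjusted[:, batch == 0].mean(axis=1)
                  - adjusted[:, batch == 1].mean(axis=1)).max()
print(mean_gap)
```

With only three samples per batch the per-gene estimates are very noisy, which is the motivation for pooling information across genes through the priors on \( \gamma_{ig} \) and \( \delta^2_{ig} \).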

SVA -- Example

Assume you have 8 treated/control samples from 4 cell lines. The unmodelled factor is the cell line. Can SVA catch it?


  • SV1 can help differentiate N080611 from the 3 others
  • SV2 can help differentiate N080611 and N61311
  • SV3 should … ?

What to use and when?

  • If batches are not known, RLE plots and heatmaps are a good proxy for making an informed guess
  • Batches are known, no other 'unmodelled factor': ComBat
  • Batch factors are not known, 'unmodelled factors' with intractable relationships: SVA, RUV
  • SVA and RUV do not regress out the batch-effects themselves; use the estimated factors as covariates in the downstream model
  • ComBat regresses out the batch-effects and returns adjusted expression values
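To illustrate what "use as covariates" means, here is a small sketch for a single gene, assuming NumPy and ordinary least squares. The function name `fit_condition_effect` and the simulated numbers are assumptions, and the true hidden factor stands in for a factor that SVA or RUV would have estimated.

```python
import numpy as np

def fit_condition_effect(y, condition, covariates=None):
    """Least-squares estimate of the condition coefficient, optionally adjusted."""
    columns = [np.ones_like(condition), condition]
    if covariates is not None:
        columns.append(covariates)  # estimated factors enter as extra columns
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

# Simulated gene (assumption): true condition effect 2, hidden factor effect 3
rng = np.random.default_rng(2)
condition = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
hidden = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)  # correlated with condition
y = 2.0 * condition + 3.0 * hidden + rng.normal(0.0, 0.05, size=8)

naive = fit_condition_effect(y, condition)
adjusted = fit_condition_effect(y, condition, covariates=hidden)
print(round(naive, 2), round(adjusted, 2))
```

Without the covariate, the part of the hidden factor that is correlated with the condition is absorbed into the condition effect; including it moves the estimate back towards the simulated value.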

