
A proteomics benchmark: method comparison, pipeline behavior, and noise analysis using ESM-2


I wanted to test whether my proteomics pipeline can do two things at the same time:

  • not invent differences where none exist
  • detect differences where they were deliberately introduced

This dataset is ideal for that purpose, because it is a benchmark with designed true positives (spike-ins) defined by the experiment creators. Below, I show step by step how I verified that the results are trustworthy, and where the pipeline has natural limitations.

Data description

I analyzed data from the “Dynamic Range Benchmark” experiment (PXD000279), originally described in Cox et al. (2014). This dataset was designed to validate proteomics pipelines and consists of:

  1. Biological background (E. coli): a bacterial lysate present at a constant concentration in all samples. These proteins should not show differential abundance, making them an ideal negative control for estimating false discoveries (FDR).
  2. Standard proteins (UPS): a set of 48 human proteins (Universal Proteomics Standard) spiked into the background in strictly defined amounts.
    • UPS1: all 48 proteins at the same fixed concentration (5,000 fmol).
    • UPS2: the same proteins split into concentration groups spanning six orders of magnitude (from 50,000 fmol down to 0.5 fmol).

Comparing UPS2 vs UPS1 allows verification that the analysis correctly detects known concentration changes (positive control) while maintaining no change in the E. coli background.

Software (R packages)

The analysis was performed in R using: tidyverse, limma, vsn, imputeLCMD, pheatmap, ggrepel, RColorBrewer, sva, pROC, knitr, kableExtra, patchwork.


Data acquisition

Experiment files were downloaded from the PRIDE repository via FTP.

Preprocessing

  1. From the proteinGroups table, entries flagged as Reverse, Potential contaminant, and Only identified by site were removed.
  2. LFQ intensity values were extracted and assigned to groups: UPS1 (replicates L1–L4) and UPS2 (replicates H1–H4).
  3. A protein intensity matrix was built. Zeros were converted to NA.
  4. Protein identifiers and FASTA header metadata were added. Based on these headers, each protein was labeled as E. coli background or UPS/Human spike-in (key fields typically include OS= and species identifiers). Example classification rule:

    protein_species <- case_when(
      grepl("ECOLI|Escherichia coli", fasta_headers, ignore.case = TRUE) ~ "ECOLI",
      grepl("HUMAN|Homo sapiens|UPS", fasta_headers, ignore.case = TRUE) ~ "UPS",
      TRUE ~ "Unknown"
    )
    

  5. Data were transformed to log2 scale.
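The steps above can be sketched in base R on a toy `proteinGroups`-like table. The column names follow MaxQuant conventions, and the data are purely illustrative:

```r
# Toy proteinGroups-style table (illustrative values only)
protein_groups <- data.frame(
  Protein.IDs             = c("P1", "P2", "REV__P3", "CON__P4"),
  Reverse                 = c("", "", "+", ""),
  Potential.contaminant   = c("", "", "", "+"),
  Only.identified.by.site = c("", "", "", ""),
  LFQ.intensity.L1        = c(1e6, 0, 5e5, 2e6),
  LFQ.intensity.H1        = c(2e6, 3e5, 0, 1e6),
  stringsAsFactors        = FALSE
)

# Step 1: drop reverse hits, contaminants, and site-only identifications
keep <- protein_groups$Reverse != "+" &
        protein_groups$Potential.contaminant != "+" &
        protein_groups$Only.identified.by.site != "+"
filtered <- protein_groups[keep, ]

# Steps 2-3: extract LFQ columns into an intensity matrix, zeros become NA
mat <- as.matrix(filtered[, grep("^LFQ.intensity", names(filtered))])
rownames(mat) <- filtered$Protein.IDs
mat[mat == 0] <- NA

# Step 5: log2 transform
log_mat <- log2(mat)
```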

Missing values and filtering

In proteomics, missing values often arise from detection limits. Before statistical testing, I applied completeness-based filtering. I retained only proteins with at least 3 out of 4 replicates present in each group (UPS1 and UPS2). After filtering, the number of proteins decreased from 2246 to 2074, and the fraction of NA values dropped from 5.3% to 1.0% (Fig. 1). This criterion increases the reliability of log2FC estimates and enables stable variance estimation for comparative tests.

Figure 1. Missingness map before and after filtering. Rows are proteins, columns are samples from UPS1 and UPS2 (with a Group annotation bar). Cell color encodes missingness: light = present (0), dark = missing (1). Filtering required ≥3/4 values in each group (UPS1 and UPS2), reducing missingness from 5.3% to 1.0% and the protein count from 2246 to 2074.
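The completeness filter itself is a short vectorized operation. A minimal sketch on a toy matrix, with L/H column names mirroring the replicate labels above:

```r
# Completeness filter: keep proteins quantified in >= 3 of 4 replicates
# in BOTH groups (toy matrix; values and missingness are illustrative)
set.seed(1)
mat <- matrix(rnorm(40), nrow = 5,
              dimnames = list(paste0("prot", 1:5),
                              c(paste0("L", 1:4), paste0("H", 1:4))))
mat[sample(length(mat), 10)] <- NA        # scatter some missing values

ups1_cols <- paste0("L", 1:4)
ups2_cols <- paste0("H", 1:4)

keep <- rowSums(!is.na(mat[, ups1_cols])) >= 3 &
        rowSums(!is.na(mat[, ups2_cols])) >= 3
filtered <- mat[keep, , drop = FALSE]
```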

Normalization

Two normalization variants were compared:

  • Variant 1: median normalization, no aggressive imputation.
  • Variant 2: VSN normalization. Because VSN requires a complete matrix, missing values were filled with the row median only for the purpose of estimating the VSN transformation, without using QRILC as a default step.

Additionally, quantile normalization was prepared as a fallback option. Results were saved for downstream analyses.
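A minimal sketch of the median-normalization variant in base R. Centering all samples on the grand mean of the per-sample medians is one common convention; it is an assumption here, not necessarily the exact convention used in the pipeline:

```r
# Median normalization on the log2 scale: shift each sample so that all
# columns share one common median (a minimal sketch)
median_normalize <- function(log_mat) {
  meds <- apply(log_mat, 2, median, na.rm = TRUE)
  # subtract each sample's deviation from the grand mean of medians
  sweep(log_mat, 2, meds - mean(meds), FUN = "-")
}

set.seed(1)
m <- matrix(rnorm(40, mean = 20), nrow = 10)   # toy log2 intensities
m_norm <- median_normalize(m)
apply(m_norm, 2, median)                        # all columns now agree
```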


Diagnostics of batch effects and hidden variation

Before differential analysis, I assessed whether between-sample variation aligned with the biological group (UPS1 vs UPS2) and whether any technical factor dominated (batch, duplicates).

| Test | What it checks | Output |
| --- | --- | --- |
| 2.1 Name parsing | extracts replicate information from sample names | structure table |
| 2.2 Clustering | do samples cluster by biological group | dendrogram + assessment |
| 2.3 Correlations | within-group vs between-group correlations | correlation difference |
| 2.4 Technical duplicates | unusually high correlation pairs (e.g., > 0.98) | list of pairs |
| 2.5 PCA diagnostics | do leading PCs correlate with the biological group | plot + correlation |
| 2.6 SVA | detects hidden sources of variation (surrogate variables) | number of SVs |

Hierarchical clustering of samples using distance 1 − Pearson r. UPS1 and UPS2 form two main clusters, supporting replicate consistency and indicating no dominant batch effect.

Sample PCA. PC1 explains 40.2% of variance and shows a moderate correlation with group (r = 0.45); PC2 explains 17.5%. Partial separation is expected in spike-in experiments because most _E. coli_ background proteins do not change between groups, and the group signal comes primarily from UPS proteins.
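The PCA diagnostic can be illustrated on simulated data where, as in a spike-in design, only a minority of proteins carry a group signal (all numbers here are illustrative, not the benchmark values):

```r
# PCA diagnostic sketch: does PC1 correlate with the biological group?
set.seed(2)
mat <- matrix(rnorm(200 * 8), nrow = 200)           # proteins x samples
group <- rep(c(0, 1), each = 4)                     # 0 = UPS1, 1 = UPS2
mat[1:20, group == 1] <- mat[1:20, group == 1] + 3  # signal in 20 proteins

pca <- prcomp(t(mat))                               # samples as observations
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
r_pc1 <- abs(cor(pca$x[, 1], group))                # PC1-group correlation
```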

Differential analysis using limma

Differential testing for UPS2 vs UPS1 was performed with limma. While in classic microarray use cases limma is often paired with normalizeBetweenArrays, in proteomics normalization is typically performed upstream, and limma receives a log2 intensity matrix.

The method consists of: (1) fitting linear models for each protein (lmFit), (2) variance moderation using empirical Bayes (eBayes), and (3) multiple-testing correction (Benjamini–Hochberg FDR). Compared to a standard t-test, limma stabilizes variance estimates under small numbers of replicates via variance “shrinkage” toward a global distribution learned across all proteins.

Design and contrast. A simple group-only model was used:

design <- model.matrix(~ 0 + group)

with contrast:

contrast_matrix <- makeContrasts(
  UPS2_vs_UPS1 = UPS2 - UPS1,
  levels = design
)

Based on the prior diagnostics (group-consistent clustering, no suspicious duplicates, no surrogate variables detected by SVA), there was no need to add batch terms or use duplicateCorrelation().

Significance threshold. Proteins were called differential if they met both criteria: adj.P.Val < 0.05 and |log2FC| > 1:

results_table$Significant <- results_table$adj.P.Val < 0.05 &
  abs(results_table$logFC) > 1
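The dual criterion can be exercised end-to-end with base R's `p.adjust`, which implements the Benjamini-Hochberg procedure (the toy p-values and fold changes below are illustrative):

```r
# BH adjustment plus the combined p-value and fold-change cutoff
set.seed(4)
results_table <- data.frame(
  logFC   = c(rnorm(90, 0, 0.3), rnorm(10, 3, 0.3)),  # 10 true changes
  P.Value = c(runif(90), rbeta(10, 1, 2000))          # small p for those 10
)
results_table$adj.P.Val <- p.adjust(results_table$P.Value, method = "BH")
results_table$Significant <- results_table$adj.P.Val < 0.05 &
  abs(results_table$logFC) > 1
table(results_table$Significant)
```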

Comparison of normalization variants

The limma analysis was run for both preprocessing variants: (1) median normalization (no imputation) and (2) VSN normalization. Both approaches produced similar qualitative conclusions.

Volcano plots (UPS2 vs UPS1) for median and VSN normalization. Thresholds: adj.p < 0.05 and |log2FC| > 1. Both normalizations show a similar result pattern and comparable numbers of significant proteins.

PCA after median and VSN normalization. The variance explained by PC1 is similar in both variants, and the sample layout remains comparable, suggesting normalization does not distort the global structure of the data.

Heatmap of the 50 most differential proteins (z-score) for median and VSN normalization. Both variants show consistent UPS1 vs UPS2 patterns among top-ranked proteins, supporting robustness to the normalization choice.

Benchmark: assessing pipeline quality

Because this dataset is a spike-in experiment (UPS proteins on an E. coli background), pipeline quality can be measured along two independent axes:

  1. False positives on the stable E. coli background
  2. Agreement between observed and expected UPS log2FC values

Expected fold changes for UPS

In UPS1, all UPS proteins are at 5,000 fmol. In UPS2, proteins fall into concentration groups. Expected changes were defined as:

log2FC = log2(UPS2_fmol / UPS1_fmol).

| UPS group | UPS2 concentration (fmol) | UPS1 concentration (fmol) | Ratio (UPS2/UPS1) | Expected log2FC |
| --- | --- | --- | --- | --- |
| A | 50,000 | 5,000 | 10:1 | 3.32 |
| B | 5,000 | 5,000 | 1:1 | 0.00 |
| C | 500 | 5,000 | 1:10 | -3.32 |
| D | 50 | 5,000 | 1:100 | -6.64 |
| E | 5 | 5,000 | 1:1,000 | -9.97 |
| F | 0.5 | 5,000 | 1:10,000 | -13.29 |

Reference table of expected log2FC values for UPS groups (A–F).
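The expected values can be reproduced directly from the design formula above:

```r
# Expected log2FC per UPS group, from the spike-in concentrations
ups1_fmol <- 5000
ups2_fmol <- c(A = 50000, B = 5000, C = 500, D = 50, E = 5, F = 0.5)
expected_log2fc <- log2(ups2_fmol / ups1_fmol)
round(expected_log2fc, 2)
# A: 3.32, B: 0.00, C: -3.32, D: -6.64, E: -9.97, F: -13.29
```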

Empirical FP rate on E. coli (error control)

In this benchmark, the E. coli background is nominally 1:1 between UPS1 and UPS2, so no E. coli proteins should be called differential. Any significant E. coli protein is treated as a false positive. The reported measure is the false-positive rate among all E. coli proteins:

FP_rate = (# E. coli with adj.P.Val < 0.05 and |log2FC| > 1) / (number of E. coli proteins).

Because the expected effect for E. coli is zero, I treat this FP rate as an empirical estimate of FDR in this benchmark.

Both normalization variants produced 0 false positives on E. coli (0% FP rate), supporting correct Benjamini–Hochberg control and absence of inflated test statistics on the background.
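The FP-rate computation itself is a one-liner over the species labels. A minimal sketch on a toy call table (the labels and calls are illustrative):

```r
# Empirical FP rate on the E. coli background
calls <- data.frame(
  species     = rep(c("ECOLI", "UPS"), c(8, 2)),
  Significant = c(rep(FALSE, 8), TRUE, TRUE)
)
is_ecoli <- calls$species == "ECOLI"
fp_rate  <- mean(calls$Significant[is_ecoli])  # fraction of background called
fp_rate                                        # 0 here, as in the benchmark
```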

Accuracy of log2FC estimation for UPS

For UPS proteins, expected and observed values were compared using Pearson correlation and error metrics (RMSE/MAE). Group-wise analysis also highlights the common phenomenon of ratio compression (systematic underestimation of large fold changes relative to ground truth).

Expected vs observed log2FC for UPS proteins. Median normalization: r = 0.986. VSN: r = 0.98. The dashed line shows perfect agreement (y = x). Both variants show very high trend agreement.
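The agreement metrics take only a few lines of base R; the numbers below are illustrative and chosen to mimic mild ratio compression, not the benchmark values:

```r
# Pearson correlation, RMSE, and MAE between expected and observed log2FC
expected <- c(3.32, 0.00, -3.32, -6.64)
observed <- c(3.00, 0.10, -2.90, -5.80)   # slightly compressed ratios
r    <- cor(expected, observed)
rmse <- sqrt(mean((observed - expected)^2))
mae  <- mean(abs(observed - expected))
c(r = r, rmse = rmse, mae = mae)
```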

Separation of UPS vs E. coli (ROC/AUC)

Treating UPS as the positive class and E. coli as the negative class, I evaluated how well the pipeline separates truly changing proteins from stable background proteins. Both variants achieved AUC = 1.0, indicating perfect separation for this analysis configuration.

ROC curves for the normalization variants. Both achieve AUC = 1.0, i.e., ideal separation of UPS vs _E. coli_ for this analysis configuration.
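The analysis used pROC, but the AUC can also be computed in base R via the Mann-Whitney identity: AUC is the probability that a randomly chosen positive outranks a randomly chosen negative. A sketch with illustrative scores (e.g., -log10 adjusted p-values):

```r
# AUC via the rank (Mann-Whitney) identity, base R only
auc_rank <- function(scores, labels) {   # labels: TRUE = positive (UPS)
  r <- rank(scores)
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(5, 4, 3, 0.2, 0.1, 0.05)     # positives clearly outrank negatives
labels <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
auc_rank(scores, labels)                 # 1 for perfect separation
```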

ESM-2

As a final component, I used a small but representative model from the ESM-2 family by Meta, a language model for protein sequences.

It does not analyze quantitative proteomics data directly. Instead, it models the “grammar” of amino acid sequences and produces embeddings that can be used as inputs for downstream tasks. This model is still actively downloaded and used, and I wanted to sanity-check its behavior in this context.

The checkpoint esm2_t6_8M_UR50D is a BERT-style bidirectional Transformer encoder, trained with masked language modeling.

Key configuration:

  • 6 Transformer layers (num_hidden_layers = 6)
  • ~7.84M parameters (hence “8M”)
  • embedding dimension hidden_size = 320
  • 20 attention heads (num_attention_heads = 20), i.e., 320/20 = 16 dimensions per head
  • MLP inner dimension intermediate_size = 1280
  • rotary positional embeddings (position_embedding_type = rotary)
  • maximum sequence length max_position_embeddings = 1026
  • vocabulary size vocab_size = 33 (20 amino acids plus special and rare tokens)

I used the smallest model because I cannot run larger checkpoints conveniently on my local machine. This size is suitable for CPU-based sanity checks and rapid prototyping.
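A back-of-envelope check that the configuration is consistent with the "8M" label. This rough estimate ignores biases, layer norms, and the LM head, so it slightly undercounts the ~7.84M total:

```r
# Rough parameter count for esm2_t6_8M_UR50D from its configuration
h <- 320; i <- 1280; layers <- 6; vocab <- 33
attn_per_layer <- 4 * h^2       # Q, K, V, and output projections
mlp_per_layer  <- 2 * h * i     # up- and down-projection
embedding      <- vocab * h
total <- layers * (attn_per_layer + mlp_per_layer) + embedding
total / 1e6                     # approx 7.4M, consistent with "~8M"
```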

I checked whether the model would reveal any structure related to false positives. However, consistent with the statistical control analysis, it did not reveal anything notable.

The embedding analysis did not indicate any bias toward a particular biochemical class. Points are well mixed, which is expected. ESM-2 cannot “discover” false positives in this pipeline, because false positives are driven by measurement noise, missingness, variance structure, and the statistical model. A sequence-only model (with E. coli amino acid sequences retrieved via the UniProt API) does not have access to that axis of information.

Shot noise as a physical source of variability

The main phenomenon I can highlight is shot noise in mass spectrometry, which arises from the discrete nature of ion counting. The number of ion counts collected in a fixed time follows a Poisson distribution, so the measurement uncertainty (standard deviation) equals the square root of the count: \(\sigma = \sqrt{N}\). Therefore, the relative error (coefficient of variation) is \(\sigma/N = 1/\sqrt{N}\). The smaller the number of counted ions (lower abundance, lower AveExpr), the larger the contribution of random fluctuation to the final measurement. The observed negative correlation:

          AveExpr     Noise
AveExpr  1.000000 -0.272612
Noise   -0.272612  1.000000

indicates that for low-abundance proteins, instrument precision decreases in a predictable statistical way. This generates variability in observed logFC values as a physical detection limitation, not a biological signal or an analysis artifact.
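The \(1/\sqrt{N}\) relation is easy to verify by simulation:

```r
# Shot-noise check: for Poisson counts the coefficient of variation
# of simulated ion counts tracks 1/sqrt(N)
set.seed(5)
mean_counts <- c(10, 100, 1000, 10000)
cv_observed <- sapply(mean_counts, function(n) {
  x <- rpois(1e5, lambda = n)   # simulated ion counts at mean level n
  sd(x) / mean(x)
})
cv_theory <- 1 / sqrt(mean_counts)
round(rbind(cv_observed, cv_theory), 4)
```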

Conclusions

1) Quality assessment of both pipeline variants

| Metric | Median (no imputation) | VSN |
| --- | --- | --- |
| Empirical FP rate on E. coli | 0% (no false positives) | 0% (no false positives) |
| Agreement expected vs observed log2FC (UPS) | high (r ≈ 0.99) | high (r ≈ 0.98) |
| RMSE/MAE for UPS | comparable | comparable |

2) Most reliable default variant

Median normalization (no imputation) is preferred when:

  • missingness is low after filtering
  • the goal is the simplest, stable, and easily auditable procedure
  • there is no strong evidence of heteroscedasticity requiring VSN-type variance stabilization

VSN is reasonable when:

  • variance stabilization at low intensities is important
  • the data show a clear mean–variance dependency
  • an additional technical transformation step is acceptable in exchange for potentially better behavior in the low-abundance range

3) Key observations

  • Both normalizations preserve sample consistency and do not indicate strong batch effects, so the simple ~ 0 + group limma model is sufficient.
  • The E. coli background remains non-differential in both variants (0% FP rate), supporting correct multiple-testing control under Benjamini–Hochberg.
  • Expected vs observed UPS behavior shows ratio compression, especially at large theoretical fold changes. This is typical in proteomics and is not caused solely by normalization choice.
  • ROC/AUC confirms strong separation of UPS vs E. coli under the chosen decision thresholds for both variants.

4) Final recommendation

For this dataset (spike-in benchmark, low missingness after filtering), both variants are valid, and practical differences are small. As the default, I recommend median normalization without imputation, because it is minimally invasive and easiest to interpret and reproduce. The alternative (VSN) is sensible if variance stabilization at low intensities is the priority and heteroscedasticity is visible in the data.

The codebase is available on GitHub.