1 Process

1.1 Global

Protein-Level Label-Free Quantification

1.1.1 Packages

Load required packages into R session.

library(devtools)
library(tidyverse)

1.1.2 Workflow

Load Protein Measurements spreadsheet exported from Progenesis.

protm <- "20180502_WOS52_Cr_UPS_protm.csv" %>%
  read_csv(., skip = 2, col_types = cols())

Separate leading protein accession from group members.

Some protein accessions are grouped together with semicolons (;) in the data. These groups represent homologous proteins with shared peptide identifications. The first accession in each group is generally the most confident inference and is used for analysis; the other accessions are treated as group members.

protm <- protm %>%
  separate(Accession,
           into = c("Accession", "Group members"),
           sep = ";",
           extra = "merge",
           fill = "right")

protm
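As a minimal illustration of how `separate()` behaves with these arguments, the sketch below uses a made-up two-row table (the accessions are placeholders, not values from the export). `extra = "merge"` keeps every accession after the first semicolon together in the second column, and `fill = "right"` leaves `NA` there when a row has no group members.

```r
library(tidyr)

# Toy table; the accessions are hypothetical placeholders.
toy <- data.frame(Accession = c("A1;A2;A3", "B1"),
                  stringsAsFactors = FALSE)

toy_split <- separate(toy, Accession,
                      into = c("Accession", "Group members"),
                      sep = ";",
                      extra = "merge",  # keep "A2;A3" together in the second column
                      fill = "right")   # fill with NA when there is nothing to split

toy_split
```

The first row yields the leading accession `A1` with group members `A2;A3`; the second row keeps `B1` and has no group members.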

Define column indices for the normalized abundance values.

The Protein Measurements export from Progenesis contains the normalized and raw abundances for each raw file in the experiment.

protm %>% names()

##  [1] "Accession"              "Group members"         
##  [3] "Peptide count"          "Unique peptides"       
##  [5] "Confidence score"       "Anova (p)"             
##  [7] "Max fold change"        "Highest mean condition"
##  [9] "Lowest mean condition"  "Description"           
## [11] "20141222_WOS521"        "20141222_WOS526"       
## [13] "20141222_WOS5211"       "20141222_WOS5216"      
## [15] "20141222_WOS522"        "20141222_WOS527"       
## [17] "20141222_WOS5212"       "20141222_WOS5217"      
## [19] "20141222_WOS523"        "20141222_WOS528"       
## [21] "20141222_WOS5213"       "20141222_WOS5218"      
## [23] "20141222_WOS521_1"      "20141222_WOS526_1"     
## [25] "20141222_WOS5211_1"     "20141222_WOS5216_1"    
## [27] "20141222_WOS522_1"      "20141222_WOS527_1"     
## [29] "20141222_WOS5212_1"     "20141222_WOS5217_1"    
## [31] "20141222_WOS523_1"      "20141222_WOS528_1"     
## [33] "20141222_WOS5213_1"     "20141222_WOS5218_1"

samples <- 11:22

samples

##  [1] 11 12 13 14 15 16 17 18 19 20 21 22

Filter to remove proteins from the contaminant database.

Filter to remove proteins if not enough peptide evidence.

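Proteins are identified from unique and shared PSMs, so confidence in a protein identification generally increases with the number of unique peptides. It is common in proteomics to remove “one-hit-wonders” by requiring > 1 unique peptide per protein. Note that the filter used here is more permissive: it keeps proteins with at least 2 peptides in total, at least 1 of which is unique.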

Select the identifier and abundance columns.

data <- protm %>%
  filter(Description != "cRAP") %>%
  filter(`Peptide count` >= 2 & `Unique peptides` >= 1) %>%
  select(Accession, samples) %>%
  data.frame()

data
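To make the effect of the two filters concrete, here is a small sketch on a made-up table (all accessions, descriptions, and counts are invented for illustration). It applies the same conditions as the pipeline above and shows which rows survive.

```r
library(dplyr)

# Toy table; every value here is hypothetical.
toy <- tibble(
  Accession         = c("P1", "P2", "P3", "P4"),
  Description       = c("kinase", "cRAP", "transporter", "enzyme"),
  `Peptide count`   = c(5, 8, 1, 3),
  `Unique peptides` = c(2, 4, 1, 0)
)

kept <- toy %>%
  filter(Description != "cRAP") %>%                        # drop contaminants
  filter(`Peptide count` >= 2 & `Unique peptides` >= 1)    # require peptide evidence

kept$Accession
```

Only `P1` passes: `P2` is a contaminant, `P3` has a single peptide, and `P4` has no unique peptides.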

1.2 PTM

Phosphosite-Level Label-Free Quantification

During raw MS file processing, Progenesis subdivided each LC-MS run into peak features, which are MS1 precursor ions with a defined isotopic cluster with characteristic retention time and monoisotopic mass. The same peak feature coordinates are assigned for every LC-MS run and the summed intensity (abundance) from each is recorded.
MS2 spectra from data-dependent acquisition (DDA) contain the associated MS1 precursor ion mass and retention time, allowing them to be mapped to peak features.
The Peptide Measurements export from Progenesis contains the normalized abundances, raw abundances, and spectral counts for each identified peptide that was mapped to an MS1 peak feature in each raw file.

1.2.1 Packages

Load required packages into R session.

library(Biostrings)
library(devtools)
library(tidyverse)

1.2.2 Functions

url <- "https://raw.githubusercontent.com/hickslab/ProgenesisLFQ/master/"
source_url(paste0(url, "R/ProcessLFQ.R"))

## SHA-1 hash of file is 1aa852d0c0de78e0ae9df5b1d207ba0de50ba350

1.2.3 Workflow

Load Peptide Measurements spreadsheet exported from Progenesis.

pepm <- "20190715_EWM_TOR1_Phospho_pepm.csv" %>%
  read_csv(., skip = 2, col_types = cols())

pepm

Load Protein Measurements spreadsheet exported from Progenesis.

protm <- "20190715_EWM_TOR1_Phospho_protm.csv" %>%
  read_csv(., skip = 2, col_types = cols()) %>%
  select(1:4) %>%
  separate_rows(., Accession, sep = ";")

protm

Load protein sequence database.

database <- "Cr_uniprot_crap_20190130.fasta" %>%
  readAAStringSet(.)

names(database) <- str_split(names(database), " ", simplify = TRUE)[, 1]

database

##   A AAStringSet instance of length 18944
##         width seq                                      names               
##     [1]   751 MTISTPEREAKKVKIAVDR...GGIATTWSFFLARIISVG sp|P12154|PSAA_CHLRE
##     [2]   735 MATKLFPKFSQGLAQDPTT...YIFTYAAFLIASTSGRFG sp|P09144|PSAB_CHLRE
##     [3]    81 MAHIVKIYDTCIGCTQCVR...SVRVYLGSESTRSMGLSY sp|Q00914|PSAC_CHLRE
##     [4]    43 MIFDFNYIHIFMLTITSYV...LVFTLGIYLGLLKVVKLI sp|P50369|PETL_CHLRE
##     [5]   160 MSVTKKPDLSDPVLKAKLA...LGIGSTFPIDISLTLGLF sp|P23230|PETD_CHLRE
##     ...   ... ...
## [18940]   238 MSKGEELFTGVVPILVELD...LEFVTAAGITHGMDELYK sp|GFP_AEQVI
## [18941]   204 MAEEVEEERLKYLDFVRAA...LPLLPTEKITKVFGDEAS sp|SRPP_HEVBR
## [18942]   138 MAEDEDNQQGQGEGLKYLG...SSLPGQTKILAKVFYGEN sp|REF_HEVBR
## [18943]   348 MFSSVMVALVSLAVAVSAN...VMNADNHEYFSENNPAQS sp|PLMP_GRIFR
## [18944]   271 MSHIQRETSCSRPRLNSNL...DNPDMNKLQFHLMLDEFF sp|KKA1_ECOLX
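The renaming step keeps only the text before the first space in each FASTA header. A minimal sketch of that `str_split()` call on two made-up headers (the description text is a placeholder):

```r
library(stringr)

# FASTA-style headers: identifier, then a space, then free text (placeholder text).
headers <- c("sp|P12154|PSAA_CHLRE example description text",
             "sp|GFP_AEQVI")

# simplify = TRUE returns a character matrix; column 1 holds the identifier.
ids <- str_split(headers, " ", simplify = TRUE)[, 1]

ids
```

Headers without a space are returned unchanged, so the contaminant entries keep their short names.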

Define column indices for the normalized abundance values.

pepm %>% names()

##  [1] "#"                      "Retention time (min)"  
##  [3] "Charge"                 "m/z"                   
##  [5] "Measured mass"          "Mass error (u)"        
##  [7] "Mass error (ppm)"       "Score"                 
##  [9] "Sequence"               "Modifications"         
## [11] "Accession"              "Description"           
## [13] "Use in quantitation"    "Max fold change"       
## [15] "Highest mean condition" "Lowest mean condition" 
## [17] "Anova"                  "Maximum CV"            
## [19] "20170312_ctrl_1"        "20170312_ctrl_2_1"     
## [21] "20170312_ctrl_3"        "20170312_ctrl_4"       
## [23] "20170312_ctrl_5_1"      "20170312_azd_1"        
## [25] "20170312_azd_2"         "20170312_azd_3"        
## [27] "20170312_azd_4"         "20170312_azd_5"        
## [29] "20170312_torin_1"       "20170312_torin_2"      
## [31] "20170312_torin_3"       "20170312_torin_4"      
## [33] "20170312_torin_5"       "20170312_rap_1"        
## [35] "20170312_rap_2"         "20170312_rap_3"        
## [37] "20170312_rap_4"         "20170312_rap_5_1"      
## [39] "20170312_ctrl_1_1"      "20170312_ctrl_2_1_1"   
## [41] "20170312_ctrl_3_1"      "20170312_ctrl_4_1"     
## [43] "20170312_ctrl_5_1_1"    "20170312_azd_1_1"      
## [45] "20170312_azd_2_1"       "20170312_azd_3_1"      
## [47] "20170312_azd_4_1"       "20170312_azd_5_1"      
## [49] "20170312_torin_1_1"     "20170312_torin_2_1"    
## [51] "20170312_torin_3_1"     "20170312_torin_4_1"    
## [53] "20170312_torin_5_1"     "20170312_rap_1_1"      
## [55] "20170312_rap_2_1"       "20170312_rap_3_1"      
## [57] "20170312_rap_4_1"       "20170312_rap_5_1_1"    
## [59] "20170312_ctrl_1_2"      "20170312_ctrl_2_1_2"   
## [61] "20170312_ctrl_3_2"      "20170312_ctrl_4_2"     
## [63] "20170312_ctrl_5_1_2"    "20170312_azd_1_2"      
## [65] "20170312_azd_2_2"       "20170312_azd_3_2"      
## [67] "20170312_azd_4_2"       "20170312_azd_5_2"      
## [69] "20170312_torin_1_2"     "20170312_torin_2_2"    
## [71] "20170312_torin_3_2"     "20170312_torin_4_2"    
## [73] "20170312_torin_5_2"     "20170312_rap_1_2"      
## [75] "20170312_rap_2_2"       "20170312_rap_3_2"      
## [77] "20170312_rap_4_2"       "20170312_rap_5_1_2"

samples <- 19:38

samples

##  [1] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38

Filter to keep peptides with Percolator-adjusted Mascot scores > 13.

Filter to remove peptides from proteins in the contaminant database.

Join the protein identification statistics onto the Peptide Measurement data.

Each peptide was assigned to a single protein accession, which will have identification statistics in the Protein Measurements export file.

Summarize duplicate peak features.

Some features were matched with peptides having identical sequence, modifications, and score, but alternate protein accessions. To satisfy the principle of parsimony, each of these groups was reduced to the protein accession with the highest number of unique peptides or, failing that, the protein with the largest confidence score assigned by Progenesis.
Some features were duplicated with differing peptide identifications and were reduced to the single peptide with the highest Mascot ion score.

Filter to keep only modified peptides.

The pattern can be changed depending on which variable modifications were included in the database search.
The default is “Phospho”, which specifies phosphorylation from a Mascot database search. It can be changed to “Phospho (ST)” or “Phospho (Y)” to select specific modified residues.

Build an identifier term.

The protein accession is concatenated with the particular residue and position in the protein sequence of any modifications identified on the peptide.
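The sketch below shows one way such an identifier could be assembled. It is not the implementation of `get_identifier()`: the `[position] name` format of the modification string, the `--` separator, and all sequences and accessions are assumptions made for illustration.

```r
library(stringr)

# Hypothetical inputs: a toy protein sequence and one phosphopeptide.
protein   <- "MAAKSTPEREAKSVK"
peptide   <- "STPER"
accession <- "P00001"            # made-up accession
mods      <- "[1] Phospho (ST)"  # assumed "[position] name" format

# Position of the modification within the peptide.
pep_pos <- as.integer(str_match(mods, "\\[(\\d+)\\] Phospho")[, 2])

# Start of the peptide within the protein (first match only).
pep_start <- unname(str_locate(protein, fixed(peptide))[1, "start"])

# Convert to protein coordinates and look up the modified residue.
prot_pos <- pep_start + pep_pos - 1
residue  <- str_sub(protein, prot_pos, prot_pos)

# Assumed identifier layout: accession, separator, residue, position.
identifier <- paste0(accession, "--", residue, prot_pos)

identifier
```

Here the peptide starts at position 5 of the protein, so the first residue of the peptide maps to serine 5.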

Summarize duplicate identifiers.

The dataset was then reduced to unique identifiers by summing the abundance of all contributing features (i.e., peptide charge states, missed cleavages, and combinations of additional variable modifications).
Each identifier group was represented by the peptide with the highest Mascot score in the final dataset.
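As a rough sketch of this reduction (not the code of `reduce_identifiers()`), the example below sums the abundances of rows that share an identifier and keeps the highest score of each group. The identifiers, scores, and the column names `s1` and `s2` are invented.

```r
library(dplyr)

# Toy table: two features map to the same identifier.
toy <- tibble(
  Identifier = c("P1--S5", "P1--S5", "P2--T9"),
  Score      = c(20, 35, 18),
  s1         = c(100, 50, 10),
  s2         = c(200, 25, 30)
)

reduced <- toy %>%
  group_by(Identifier) %>%
  summarize(Score = max(Score),  # keep the best score in the group
            s1 = sum(s1),        # sum abundances across contributing features
            s2 = sum(s2)) %>%
  ungroup()

reduced
```

The two `P1--S5` rows collapse into one row with abundances 150 and 225 and a score of 35.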

Simplify the data format.

Select the identifier and abundance columns to simplify downstream processing.

data <- pepm %>%
  filter(Score > 13) %>%
  filter(Description != "cRAP") %>%
  left_join(., protm, by = "Accession") %>%
  reduce_features() %>%
  filter(str_detect(Modifications, "Phospho")) %>%
  get_identifier(., database, mod = "Phospho") %>%
  reduce_identifiers(., samples) %>%
  select(Identifier, samples) %>%
  data.frame()

data

2 Analyze

2.1 Packages

Load required packages into R session.

library(imp4p)
library(broom)

2.2 Functions

url <- "https://raw.githubusercontent.com/hickslab/ProgenesisLFQ/master/"
source_url(paste0(url, "R/StatLFQ.R"))

## SHA-1 hash of file is ec4bb0ef89330f09a354f5def7216143ec1de3e0

2.3 Workflow

Define input data.

data <- "20180502_WOS52_Cr_UPS_protm_Process.csv" %>%
  read_csv(., col_types = cols()) %>%
  data.frame()

Define the column indices for replicates in each condition.

This assumes the input data has the simplified format of an identifier column followed by abundance columns for each raw file in the experiment.

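a <- 2:5; b <- 6:9; c <- 10:13 # column indices of the replicates in each condition

group <- list("25" = a, "50" = b, "100" = c) # named list: condition name = replicate columns

group.compare <- list("25-50" = list(a, b), # named list of condition pairs to compare
                      "25-100" = list(a, c),
                      "50-100" = list(b, c))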

a; b; c

## [1] 2 3 4 5

## [1] 6 7 8 9

## [1] 10 11 12 13

group

## $`25`
## [1] 2 3 4 5
## 
## $`50`
## [1] 6 7 8 9
## 
## $`100`
## [1] 10 11 12 13

group.compare

## $`25-50`
## $`25-50`[[1]]
## [1] 2 3 4 5
## 
## $`25-50`[[2]]
## [1] 6 7 8 9
## 
## 
## $`25-100`
## $`25-100`[[1]]
## [1] 2 3 4 5
## 
## $`25-100`[[2]]
## [1] 10 11 12 13
## 
## 
## $`50-100`
## $`50-100`[[1]]
## [1] 6 7 8 9
## 
## $`50-100`[[2]]
## [1] 10 11 12 13

Rename the abundance columns in a simplified “Condition-Replicate” format.

data <- data %>%
  rename_columns(., group)

data %>% names()

##  [1] "Accession" "25-1"      "25-2"      "25-3"      "25-4"     
##  [6] "50-1"      "50-2"      "50-3"      "50-4"      "100-1"    
## [11] "100-2"     "100-3"     "100-4"
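For readers without access to the sourced script, the renaming can be approximated in base R as below. This is a sketch of the idea only, not the code of `rename_columns()`; the toy column names and the two-condition grouping are made up.

```r
# Toy data frame: an identifier column followed by four abundance columns.
toy <- data.frame(Accession = c("P1", "P2"),
                  run_a = 1:2, run_b = 3:4, run_c = 5:6, run_d = 7:8)

# Hypothetical grouping: two conditions with two replicates each.
toy_group <- list("25" = 2:3, "50" = 4:5)

# Rename each condition's columns to "Condition-Replicate".
for (condition in names(toy_group)) {
  idx <- toy_group[[condition]]
  names(toy)[idx] <- paste0(condition, "-", seq_along(idx))
}

names(toy)
```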

Filter to keep identifiers with >50% of replicates having nonzero abundances in any condition.
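A minimal base R sketch of this filter, assuming four replicates per condition and a threshold of three nonzero replicates (the toy values, the grouping, and the name `min_nonzero` are all invented; the internals of `clean_min()` are not shown here):

```r
# Toy data: three identifiers, two conditions (A and B) with four replicates each.
toy <- data.frame(Accession = c("P1", "P2", "P3"),
                  a1 = c(10, 0, 0), a2 = c(12, 0, 5), a3 = c(11, 0, 0), a4 = c(0, 9, 0),
                  b1 = c(0, 7, 0),  b2 = c(0, 8, 0),  b3 = c(3, 6, 0),  b4 = c(0, 0, 4))

toy_group   <- list(A = 2:5, B = 6:9)
min_nonzero <- 3  # 3 of 4 replicates, i.e. more than 50%

# For each condition, test whether a row has enough nonzero replicates.
passes <- sapply(toy_group, function(idx) rowSums(toy[, idx] > 0) >= min_nonzero)

# Keep rows that pass in at least one condition.
kept <- toy[rowSums(passes) > 0, ]

kept$Accession
```

`P1` passes in condition A, `P2` passes in condition B, and `P3` is removed because it has a single nonzero replicate in each condition.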

Perform a log₂-transformation of abundances to stabilize the variance.

The base R function log2() returns “-Inf” for zero values. These instances are changed to “NA” following the transformation.
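A short example of that behavior on a made-up abundance vector:

```r
abundance <- c(1024, 0, 8)

logged <- log2(abundance)          # the zero becomes -Inf
logged[is.infinite(logged)] <- NA  # recode -Inf as a missing value

logged
```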

Impute missing values (“NA”) using a conditional strategy with the imp4p package.

Iterate by condition and check whether each identifier has reliable quantitation.
If at least one replicate has a nonzero abundance, impute the other replicates with values drawn from a normal distribution centered on the mean of the nonzero replicates.
If no replicate has a nonzero abundance, impute with small values drawn from a normal distribution centered on the 25th percentile of abundances.

data2 <- data %>%
  clean_min(., group, nonzero = 3) %>%
  transform_data(., group, method = "log2") %>%
  impute_imp4p(., group)

2.3.1 Pairwise t-test

Perform a t-test on each identifier for defined condition pairs.

By default, a two-sided, equal-variance t-test is run with Benjamini-Hochberg FDR correction.

data3 <- data2 %>%
  calculate_ttest(., group.compare, fdr = TRUE)
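The same kind of test can be sketched with base R alone. The example below is an approximation under stated assumptions, not the code of `calculate_ttest()`: the data are simulated, and the names `toy`, `a_idx`, and `b_idx` are made up.

```r
set.seed(1)

# Simulated log2 abundances: 5 identifiers x 8 samples (4 + 4 replicates).
toy <- matrix(rnorm(40, mean = 20), nrow = 5)
toy[1, 5:8] <- toy[1, 5:8] + 3  # shift one identifier to create a difference

a_idx <- 1:4
b_idx <- 5:8

# Two-sided, equal-variance t-test per identifier (row).
p <- apply(toy, 1, function(x) {
  t.test(x[a_idx], x[b_idx],
         alternative = "two.sided",
         var.equal = TRUE)$p.value
})

# Benjamini-Hochberg correction across identifiers.
fdr <- p.adjust(p, method = "BH")

data.frame(p = p, fdr = fdr)
```

The adjusted values are never smaller than the raw p-values, which is why filtering on FDR is the more conservative choice.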

2.3.2 One-way ANOVA

Perform a one-way analysis of variance (ANOVA) on each identifier across conditions.

By default, a one-way ANOVA is run for all unique conditions with Benjamini-Hochberg FDR correction.

data4 <- data2 %>%
  calculate_1anova()
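A per-identifier one-way ANOVA can be sketched in base R as follows. This is an illustration on simulated data, assuming three conditions with four replicates each; it is not the code of `calculate_1anova()`.

```r
set.seed(2)

# Simulated log2 abundances: 3 identifiers x 12 samples.
toy <- matrix(rnorm(36, mean = 20), nrow = 3)

# Condition label for each sample column.
condition <- factor(rep(c("25", "50", "100"), each = 4))

# One-way ANOVA per identifier (row); extract the p-value of the condition term.
p <- apply(toy, 1, function(x) {
  fit <- aov(x ~ condition)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# Benjamini-Hochberg correction across identifiers.
fdr <- p.adjust(p, method = "BH")

data.frame(p = p, fdr = fdr)
```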

2.3.3 Fold change

Calculate fold change using the mean replicate abundance for each condition.

Subtracting log₂-transformed values is equivalent to dividing the non-transformed values and then transforming: \(\log_2(B) - \log_2(A) = \log_2(B/A)\)
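A quick numeric check of this identity, followed by a fold change computed from replicate means of log₂ values (all numbers are made up):

```r
# Identity check with single values.
A <- 4
B <- 32
lhs <- log2(B) - log2(A)  # 5 - 2
rhs <- log2(B / A)        # log2(8)

# Fold change from replicate means of log2-transformed abundances.
a_log <- log2(c(100, 200, 400, 100))   # condition A
b_log <- log2(c(400, 800, 1600, 400))  # condition B, 4x higher in every replicate
fc <- mean(b_log) - mean(a_log)

c(lhs = lhs, rhs = rhs, fc = fc)
```

Both sides of the identity equal 3, and the log₂ fold change is 2, which corresponds to a 4-fold increase. Note that averaging log₂ values corresponds to the geometric mean of the untransformed abundances.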

data3 <- data3 %>%
  calculate_fc(., group.compare, difference = TRUE) %>%
  add_fc_max()


data4 <- data4 %>%
  calculate_fc(., group.compare, difference = TRUE) %>%
  add_fc_max()

2.3.4 Clustering

Unsupervised hierarchical clustering.

data4 <- data4 %>%
  filter(FDR < 0.05) %>%
  
  calculate_hclust(., group, k = 2) %>%
  left_join(data4, ., by = names(data4)[1])
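For orientation, the sketch below clusters simulated abundance profiles with base R and cuts the tree into two clusters. Euclidean distance and complete linkage are assumptions chosen for the example; the settings used inside `calculate_hclust()` are not shown in this document.

```r
set.seed(3)

# Simulated profiles across 3 conditions: 3 increasing and 3 decreasing identifiers.
up   <- matrix(c(1, 2, 3), nrow = 3, ncol = 3, byrow = TRUE)
down <- matrix(c(3, 2, 1), nrow = 3, ncol = 3, byrow = TRUE)
toy  <- rbind(up, down) + rnorm(18, sd = 0.1)
rownames(toy) <- paste0("P", 1:6)

# Hierarchical clustering on pairwise Euclidean distances.
tree <- hclust(dist(toy), method = "complete")

# Cut the tree into k = 2 clusters.
cluster <- cutree(tree, k = 2)

cluster
```

The increasing and decreasing profiles fall into separate clusters, which is the kind of grouping the trend profile plots later visualize.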

3 Annotate

3.1 Packages

Load required packages into R session.

3.2 Functions

url <- "https://raw.githubusercontent.com/hickslab/ProgenesisLFQ/master/"
source_url(paste0(url, "R/AnnotateLFQ.R"))

## SHA-1 hash of file is 5cec22e5e96332eccf5edb1bba1e74a269bad5cc

3.3 Annotation

Given a list of protein accessions, look up the known UniProt annotation for each.

data5 <- data4 %>%
  add_missingness(., data, group)
  #keep_entry_uniprot() %>%
  #split_identifier() %>%
  #add_uniprot(., paste0(url, "data/Cr_uniprot_20190130_annotation.tsv"))

4 Plot

4.1 Packages

Load required packages into R session.

4.2 Functions

url <- "https://raw.githubusercontent.com/hickslab/ProgenesisLFQ/master/"
source_url(paste0(url, "R/PlotLFQ.R"))

## SHA-1 hash of file is 09985f2a702181cc3013b1f189c9f9ed8ab6ea5f

4.3 Theme

Define custom plotting elements to create a theme.

theme_custom <- function(base_size = 32){
  theme_bw(base_size = base_size) %+replace%
    theme(
      strip.background = element_blank(),
      axis.ticks =  element_line(colour = "black"),
      panel.background = element_blank(),
      panel.border = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(), 
      plot.background = element_blank(),
      plot.margin = unit(c(0.5,  0.5, 0.5, 0.5), "lines"),
      axis.line.x = element_line(color="black", size = 1),
      axis.line.y = element_line(color="black", size = 1)
    )
}

4.4 Example

Custom plot function.

plot_jitter <- function(df, group){
  temp.df <- df %>%
    select(1, group %>% flatten_int()) %>%
    gather(sample, abundance, -1)
  
  temp.df %>%  
    ggplot(., aes(x = sample, y = abundance, color = sample)) +
    geom_jitter(alpha = 0.5) +
    geom_boxplot(color = "black",
                 fill = NA,
                 outlier.shape = NA,
                 size = 1.5) +
    guides(color = FALSE, fill = FALSE) +
    coord_flip()

}

data2 %>%
  plot_jitter(., group) +
  theme_custom()

4.5 PCA

Principal component analysis (PCA).

data2 %>%
  plot_pca(., group) +
  theme_custom()

4.6 Volcano Plot

Plot biological significance (fold change) against statistical significance.

data3 %>%
    plot_volcano(.,
                 group,
                 group.compare,
                 fdr = TRUE,
                 threshold = 2,
                 xlimit = 8,
                 ylimit = 5) +
  theme_custom()

4.7 Trend Profiles

Visualization of results from hierarchical clustering.

data4 %>%
  filter(FDR < 0.05) %>%
  #filter(abs(`0-60_FC`) >= 1) %>%
  #filter(abs(FC_max) >= 1) %>%
  
  plot_hclust(., group, k = 2) +
  theme_custom()

4.8 GO Summary

This assumes that the data contain mapped UniProt annotations.

data5 %>%
  filter(FDR < 0.05) %>%
  #filter(abs(`0-60_FC`) >= 1) %>%
  #filter(cluster == "A") %>%
  plot_GO(top = 5) + theme_custom()


data5 %>%
  filter(FDR < 0.05) %>%
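  # `0-60_FC` does not match the comparisons defined above ("25-50", "25-100", "50-100");
  # substitute a fold-change column that exists in data5.
  filter(`0-60_FC` >= 1) %>%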
  #filter(cluster == "A") %>%
  plot_GO_hclust(., group, column = "Gene ontology (biological process)", threshold = 3) +
  #plot_GO_FC(., group, column = "Gene ontology (biological process)", threshold = 1) +
  theme_custom()

5 Output

5.1 Dataframes

The write_csv() function from the readr package in tidyverse is one option to save a dataframe copy in the working directory.

#data5 %>% write_csv(., "data5.csv")

5.2 Plots

Open a graphics device with explicit dimensions and resolution to keep plot size and resolution consistent.

#pdf("figure.pdf", width = 10, height = 10)

#png("figure.png", width = 12, height = 9, units = "in", res = 600)
# RUN PLOT
#dev.off()

# TODO plot_save()

6 Session

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] broom_0.5.2         imp4p_0.7           norm_1.0-9.5       
##  [4] truncnorm_1.0-8     Iso_0.0-18          Biostrings_2.50.2  
##  [7] XVector_0.22.0      IRanges_2.16.0      S4Vectors_0.20.1   
## [10] BiocGenerics_0.28.0 forcats_0.4.0       stringr_1.4.0      
## [13] dplyr_0.8.3         purrr_0.3.2         readr_1.3.1        
## [16] tidyr_0.8.3         tibble_2.1.3        ggplot2_3.2.0      
## [19] tidyverse_1.2.1     devtools_2.1.0      usethis_1.5.1      
## [22] knitr_1.23         
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.0        pkgload_1.0.2     jsonlite_1.6     
##  [4] modelr_0.1.4      assertthat_0.2.1  cellranger_1.1.0 
##  [7] yaml_2.2.0        remotes_2.1.0     sessioninfo_1.1.1
## [10] pillar_1.4.2      backports_1.1.4   lattice_0.20-35  
## [13] glue_1.3.1        digest_0.6.20     rvest_0.3.4      
## [16] colorspace_1.4-1  plyr_1.8.4        htmltools_0.3.6  
## [19] pkgconfig_2.0.2   haven_2.1.1       zlibbioc_1.28.0  
## [22] scales_1.0.0      processx_3.4.0    generics_0.0.2   
## [25] withr_2.1.2       lazyeval_0.2.2    cli_1.1.0        
## [28] magrittr_1.5      crayon_1.3.4      readxl_1.3.1     
## [31] memoise_1.1.0     evaluate_0.14     ps_1.3.0         
## [34] fs_1.3.1          nlme_3.1-137      xml2_1.2.0       
## [37] pkgbuild_1.0.3    tools_3.5.1       prettyunits_1.0.2
## [40] hms_0.5.0         munsell_0.5.0     callr_3.3.0      
## [43] compiler_3.5.1    rlang_0.4.0       grid_3.5.1       
## [46] rstudioapi_0.10   labeling_0.3      rmarkdown_1.13   
## [49] testthat_2.1.1    gtable_0.3.0      curl_3.3         
## [52] R6_2.4.0          lubridate_1.7.4   zeallot_0.1.0    
## [55] rprojroot_1.3-2   desc_1.2.0        stringi_1.4.3    
## [58] Rcpp_1.0.1        vctrs_0.2.0       tidyselect_0.2.5 
## [61] xfun_0.8