SRM Analysis, revisiting correlations between abundance & environment

Taking a simpler approach to identifying correlations between environmental data and protein abundances. Goals of today was to:

Subselect salinity data, generate summary stats, to be able to include in analysis
Generate correlation plots with R^2 and P-values between differentially abundant proteins and environmental summary stats
From plots, select env. summary stats to use in 1) multiple linear regression model & 2) structural equation model
Generate correlation plots with selected env. variables to identify potential interactions
Run multiple linear regression model
Interpret results
Revise paper to include results from these tasks

1. Subselected salinity data, generate summary stats, include in analysis

The environmental data manipulation referenced in this post was executed in the Stats 2 R script.

In one of our team meetings we made the determination to ignore salinity data for the time being, since several probes malfunctioned. Up until this point I have performed all tasks/stats on salinity data alongside the other parameters, but ignored salinity when analyzing correlations. Today I generated correlation plots and found salinity may play a role. So, I decided to clean up salinity data as much as I could to include in my analysis. To do so, I reviewed raw salinity data via this plotly plot. Determined that the following salinity data needed to be removed due to probe malfunction: All CIE; FBB > 7/3 @ 09:50; WBB > 6/25 @ 05:30. I then removed all outliers from the updated salinity time-series data, and re-plotted. This is the resulting plot

Raw salinity data:

Bad data & outliers removed:

2. Generated correlation plots with R^2 and P-values between differentially abundant proteins and environmental summary stats

The rest of this post refers to actions taken in the Stats 4 R script.

Discovered the excellent ggscatter function in the ggpubr library to generate correlation plots. I generated correlation plots for the 3 differentially abundant proteins (HSP90, Puromycin-sensitive aminopeptidase, Trifunctional Enzyme). At first I was going to just use Pep1 (the most abundant peptide from each protein) to generate these plots. Instead, I z-transformed all peptide abundance data so they were all on the same scale, and thus could regress all peptides from each protein against environmental paramters.
It’s a good idea to quickly summarize how I’ve manipulated peptide abundance data in this project:

lambda-transformed abundance by each protein to create normal distributions, then compared across habitats, sites, and regions using ANOVA
z-transformed by each peptide ((X-mean)/SD), then ran correlation analysis against each environmental parameter

3. From plots, selected env. summary stats to use in models

Selected the following environmental parameters per the respective proteins for model constructions, using an approximate alpha=0.01 as cutoff: % Growth, DO.sd, DO.var, pH.sd.1, salinity.mean, salinity.sd.1. Here are the correlation plots for these parameters, using HSP90 for the purposes of this example:

4. Generate correlation plots with selected env. variables to identify potential interactions

Generated a correlation plot with the pairs() function to identify any environmental parameters that are highly correlated; if identified, I can leave one of the two correlated metrics out of my linear model. From the correlation plot, it appears that:

DO.sd & DO.var highly correlated. Select DO.sd since it’s correlation p=value is slightly smaller
pH.sd & pH.var highly correlated. Select pH.var since it’s correlation p=value is smaller and r value higher

env-parameters-corr

5. Run multiple linear regression model

Ran the following model and viewed summary:

Call:
lm(formula = value ~ DO.mean.loc.Z + DO.sd.loc.Z + pH.var.loc.Z + 
    Salinity.mean.loc.Z + Temperature.mean.loc.Z, data = data.pepsum.Env.Stats.HSP90.Z, 
    singular.ok = TRUE)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7794 -0.4787  0.1177  0.5790  1.2928 

Coefficients: (1 not defined because of singularities)
                       Estimate Std. Error t value Pr(>|t|)   
(Intercept)              0.2450     0.5825   0.421  0.67510   
DO.mean.loc.Z            0.6813     1.1876   0.574  0.56776   
DO.sd.loc.Z             -1.0086     0.9314  -1.083  0.28206   
pH.var.loc.Z             0.3494     0.2268   1.541  0.12723   
Salinity.mean.loc.Z      1.0207     0.3477   2.935  0.00432 **
Temperature.mean.loc.Z       NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7499 on 82 degrees of freedom
  (54 observations deleted due to missingness)
Multiple R-squared:  0.4062,	Adjusted R-squared:  0.3772 
F-statistic: 14.02 on 4 and 82 DF,  p-value: 9.27e-09

Also ran anova() on the model:

> anova(lm.HSP90)
Analysis of Variance Table

Response: value
                    Df Sum Sq Mean Sq F value   Pr(>F)    
DO.mean.loc.Z        1 25.041 25.0415 44.5306 2.74e-09 ***
DO.sd.loc.Z          1  0.575  0.5754  1.0232 0.314744    
pH.var.loc.Z         1  1.077  1.0767  1.9147 0.170197    
Salinity.mean.loc.Z  1  4.846  4.8458  8.6171 0.004319 ** 
Residuals           82 46.112  0.5623                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> 

The coefficient significances are not adequate, and one parameter (T) was left out due to singularity. After messing with models for weeks I have decided to not include them in my paper. I’ve made this decision for several reasons: 1) I don’t believe that I have all influential parameters at hand to construct a representative model, 2) With the exiting environmental data, I believe that correlation analysis is adequate for the purposes of our paper, and perhaps most importantly 3) I don’t know what I’m doing so unable to make informed decisions nor interpret results (I hope to take the regression course next quarter). I also attempted to build a Structural Equation Model, but again, failed.

Written on January 12, 2018