Finalized list of SRM data analysis stats
The Thanksgiving weekend provided time to ponder/reflect on the SRM stats that I’ve run thus far, what else needs to be done, and how to finish up. Received guidance from Dave-o. Note: next time, include a housekeeping protein in list of targets.
This is my “finalized” list of SRM & environmental stats to run. In the last few days I’ve completed much of this. In bold are those remaining. Tomorrow I’ll hopefully be done, and will post my scripts and results.
SRM Protein & Environmental Data Analysis Steps
Each Protein: (assume proteins are independent)
- Test for normality
- Lambda transformation
- Test for normality post transformation
- Assess outliers, remove if necessary
- N-way ANOVA by: a) location b) habitat c) site d) region
- Determine P-adjusted, correct for multiple comparisons (Bonferroni method, P/13)
- Post-hoc test to ID differences (is this really necessary?)
- Ultimate goal: which proteins are different between locations?
- Compare total abundance between sites (sum peptide abundance)
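The Bonferroni step in the list above can be sketched in a few lines. This is only an illustration in plain Python, not the scripts used for this analysis; it assumes the 13 in "P/13" is the number of proteins tested, and the protein names and p-values below are made up.

```python
def bonferroni_adjust(p_values, n_tests):
    """Bonferroni-adjusted p-values: each raw p times n_tests, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


def bonferroni_threshold(alpha, n_tests):
    """Equivalent view: compare raw p-values against alpha / n_tests."""
    return alpha / n_tests


# Hypothetical raw ANOVA p-values for three of the 13 proteins (made-up numbers).
raw_p = {"protein_a": 0.001, "protein_b": 0.01, "protein_c": 0.5}

adjusted = dict(zip(raw_p, bonferroni_adjust(raw_p.values(), n_tests=13)))
significant = [name for name, p in adjusted.items() if p < 0.05]
```

Multiplying each p-value by 13 and comparing to 0.05 gives the same decisions as comparing the raw p-values to 0.05/13; the adjusted form is just easier to report in a table.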
Each environmental variable:
- Download tidal chart data for each site
- Edit pH, DO & Salinity data:
a. Remove data from exposed time points, as determined from tidal charts
b. Identify and remove outliers from pH, DO & Salinity data
c. Recombined outlier-scrubbed data with Temp, Tide data.
- Assess normality of each env. variable (all time points)
- Found to be non-normal (pH is close, but let’s assume not). Dataset is large (>6000 for each parameter), so did not determine lambda via the tukeytransform function. Instead, used Kruskal-Wallis non-parametric analysis in lieu of ANOVA.
- KW test for each env. variable by location, by region
- Dunn Test post-hoc test to ID differences
- Use Bonferroni correction for P-adjusted in tests
- Ultimate goal: which env. variables are different between locations?
a. basically all of them.
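For reference, the Kruskal-Wallis statistic used in the list above can be computed by hand. The sketch below is plain Python for illustration only (a stats package would normally do this); the site names and readings are hypothetical. It ranks the pooled values, averages ranks for ties, and applies the standard tie correction. The p-value would then come from a chi-square distribution with (number of groups - 1) degrees of freedom, which is not computed here.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with midranks and tie correction.

    groups: list of lists of numbers, one list per location.
    Note: raises ZeroDivisionError if every pooled value is identical.
    """
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over tied-value counts t.
    tie_sum = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - tie_sum / (n ** 3 - n))


# Hypothetical pH readings at three sites (made-up numbers).
site_a = [7.1, 7.2, 7.3]
site_b = [7.4, 7.5, 7.6]
site_c = [7.7, 7.8, 7.9]
h_stat = kruskal_wallis_h([site_a, site_b, site_c])
```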
Prep for regression model:
- Calculate summary statistics: mean, variance, sd, min, max, median, %>1 sd from mean, %>2 sd from mean
- Plot() all env. variables: are any linearly related, aka not independent? If so, need to include interaction parameter in regression model.
- Plot() protein peptides against each other to confirm linear correlation; equation should be ~1:1.
- If all correlated, select 1 peptide to use in regression model; highest abundance is best.
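The prep steps above can be sketched roughly as follows. This is an illustrative plain-Python version, not the scripts used for this analysis: the function names are hypothetical, "sd" is taken to be the sample standard deviation, and the peptide check simply fits a least-squares line and reports its slope and Pearson r, so a slope near 1 with r near 1 would match the "~1:1" expectation.

```python
import math
import statistics


def summarize(values):
    """Summary statistics listed above, including % of points beyond 1 and 2 sd."""
    values = list(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)

    def pct_beyond(k):
        return 100 * sum(abs(v - mean) > k * sd for v in values) / len(values)

    return {
        "mean": mean,
        "variance": statistics.variance(values),
        "sd": sd,
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "pct_gt_1sd": pct_beyond(1),
        "pct_gt_2sd": pct_beyond(2),
    }


def slope_and_r(x, y):
    """Least-squares slope of y on x, and the Pearson correlation r."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Hypothetical abundances for two peptides from the same protein (made-up numbers).
peptide_1 = [10.0, 12.0, 15.0, 20.0, 26.0]
peptide_2 = [10.5, 11.8, 15.2, 19.9, 26.3]
slope, r = slope_and_r(peptide_1, peptide_2)
stats = summarize(peptide_1)
```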
Run regression models for each representative peptide:
- Step-wise linear regression models with all env. variables; I would expect that only the variables that were found to be different via the ANOVA would significantly contribute to the model
- General linear model with variables ID’d in step-wise lm
- Figure out when to add a constant, and if I should do that in this scenario
- Run anova on best fit model, find P-value of the env. variables to determine confidence in the influence of each env. variable on proteins.
- Run model on the other peptides in the protein (not used as representative peptides); ID the R^2 and P values
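To make the step-wise idea concrete, here is a rough plain-Python sketch of forward selection scored by AIC. It is one common variant, picked for illustration; the plan above does not say which direction or criterion will be used, and real work would use a stats package rather than this hand-rolled solver. The variable names and data are hypothetical. Every model here includes an intercept (the "constant" mentioned above).

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_rss(y, columns):
    """Ordinary least squares with an intercept; returns the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(bj * xj for bj, xj in zip(beta, X[i]))) ** 2 for i in range(n))


def aic(rss, n, n_params):
    """AIC up to an additive constant; assumes rss > 0 (not a perfect fit)."""
    return n * math.log(rss / n) + 2 * n_params


def forward_stepwise(y, predictors):
    """Add one predictor at a time while doing so lowers AIC. predictors: name -> values."""
    n = len(y)
    selected = []
    best = aic(fit_rss(y, []), n, 1)  # intercept-only model
    while True:
        scores = {}
        for name in predictors:
            if name in selected:
                continue
            cols = [predictors[s] for s in selected + [name]]
            scores[name] = aic(fit_rss(y, cols), n, len(cols) + 1)
        if not scores:
            break
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break
        selected.append(name)
        best = scores[name]
    return selected


# Hypothetical data: abundance depends on temperature, not on the second variable.
env = {
    "temperature": [1, 2, 3, 4, 5, 6, 7, 8],
    "unrelated": [1, -1, 1, -1, 1, -1, 1, -1],
}
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
abundance = [1 + 2 * t + e for t, e in zip(env["temperature"], noise)]
chosen = forward_stepwise(abundance, env)
```

The variables kept by the selection would then go into the follow-up model, and the same fit could be repeated on the non-representative peptides to compare R^2 values.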
Written on November 28, 2017