#' ---
#' title: "Transferring Species Distribution Models"
#' author: "Cory Merow"
#' output: html_document
#' ---
#' #'


#'
#'
#'

#' Download the PDF of the presentation

#' [The R script associated with this page is available here](`r output`). Download this file and open it (or copy-paste it into a new script) with RStudio so you can follow along.
#'
#' # Setup
#'
## ----message=FALSE-------------------------------------------------------
library(spocc)
library(raster)
library(sp)
library(rgdal)
library(ROCR)
library(corrplot)
library(maxnet)
library(spThin)

#'
#' # A simple SDM
#'
#' The goal of this section is to use a simple set of operations to build an SDM. Many packages will perform much more refined versions of these steps, at the expense that decisions are made behind the scenes or may be obscure to the user. So before getting into fancier tools, let's see what the bare minimum looks like.
#'
#' ## Get presence data
#'
#' The `spocc` package allows you to query a number of the larger databases for presence-only data from within R. They provide a number of useful pieces of metadata, if you're into that sort of thing. For this exercise, we're not; we just want latitude and longitude.
#'
#' > Decision: You assume the database of choice has sufficiently checked for errors in biology or typos. You know what happens when you assume...
#'
## ------------------------------------------------------------------------
# get presence data
pres.tmp=spocc::occ('Alliaria petiolata',from='gbif',limit=5000) # this can be slow
pres=pres.tmp$gbif$data[[1]][,c('longitude','latitude')]
# or just read in the result of me running this earlier
#pres=read.csv('https://cmerow.github.io/YaleBGCCourses/101_assets/AP_gbif.csv')[,c('longitude','latitude')]
pres=pres[complete.cases(pres),] # toss records without coords

#'
#' ## Get environmental data
#'
#' The `raster` package has a convenience function to get some types of data. To learn more, see [Worldclim](http://worldclim.org/version2).
#'
#' > Decision: Worldclim data describe the environment well in this region. The 'bioclim' variables are biologically relevant summaries of climate.
#'
## ------------------------------------------------------------------------
# get climate data
# the raster package has a convenience function built in for worldclim
clim=getData('worldclim', var='bio', res=10)

#'
#' The Bioclim variables in `clim` are:
#'
#' Variable   Description
#' --------   -----------
#' BIO1       Annual Mean Temperature
#' BIO2       Mean Diurnal Range (mean of monthly (max temp - min temp))
#' BIO3       Isothermality (BIO2/BIO7) (* 100)
#' BIO4       Temperature Seasonality (standard deviation * 100)
#' BIO5       Max Temperature of Warmest Month
#' BIO6       Min Temperature of Coldest Month
#' BIO7       Temperature Annual Range (BIO5-BIO6)
#' BIO8       Mean Temperature of Wettest Quarter
#' BIO9       Mean Temperature of Driest Quarter
#' BIO10      Mean Temperature of Warmest Quarter
#' BIO11      Mean Temperature of Coldest Quarter
#' BIO12      Annual Precipitation
#' BIO13      Precipitation of Wettest Month
#' BIO14      Precipitation of Driest Month
#' BIO15      Precipitation Seasonality (Coefficient of Variation)
#' BIO16      Precipitation of Wettest Quarter
#' BIO17      Precipitation of Driest Quarter
#' BIO18      Precipitation of Warmest Quarter
#' BIO19      Precipitation of Coldest Quarter
#'
#' ## Choose domain
#'
#' The 'domain' is the region of interest. It can be a political region, a biome, a park, a watershed, etc. It should include locations where the species is present and absent. Choosing relevant locations where the species does not occur is part of the art of *presence-only* modeling.
#'
#' > Decision: We are only asking about invasion in New England, so we constrain the domain to a bounding box around New England.
#'
## ------------------------------------------------------------------------
# choose domain (a box around New England)
clim.us=raster::crop(clim,c(-76,-65,40,50)) # trim to a smaller region
plot(clim.us[[1]]) # plot just the 1st variable to see the domain

#'
#' ## Prep data
#'
#' Many climate variables are highly correlated with one another, which can confound statistical analyses.
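#'
#' If you'd rather not eyeball a correlation plot, variables can also be filtered against a threshold programmatically. Below is a minimal base-R sketch on a toy matrix; the greedy drop rule, the helper name `dropCorrelated`, and the 0.7 cutoff are my choices for illustration, not part of the workflow below (the `caret` package's `findCorrelation` offers a similar utility).

```r
# greedily drop variables until no pairwise |correlation| exceeds a cutoff
# (toy matrix; in practice you'd pass complete rows of values(clim.us))
set.seed(1)
x1=rnorm(100); x2=x1+rnorm(100,sd=.1)  # x2 nearly duplicates x1
x3=rnorm(100); x4=x3+rnorm(100,sd=.5)  # x4 strongly correlated with x3
env=cbind(x1=x1,x2=x2,x3=x3,x4=x4)

dropCorrelated=function(m,cutoff=0.7){ # hypothetical helper, for illustration
  repeat{
    cors=abs(cor(m)); diag(cors)=0                # pairwise |r|, ignore self-correlation
    if(max(cors)<cutoff) return(colnames(m))      # done: all pairs below cutoff
    worst=which.max(apply(cors,2,max))            # a member of the most correlated pair
    m=m[,-worst,drop=FALSE]                       # drop it and re-check
  }
}
dropCorrelated(env,cutoff=0.7) # returns the names of the retained variables
```

The result is one representative from each correlated clump, much like choosing by eye from the corrplot below.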
#'
#' > Decision: Correlated predictors can make it difficult to interpret model coefficients or response curves, so we'll remove the most correlated predictors.
#'
## ------------------------------------------------------------------------
# check for correlated predictors
cors=cor(values(clim.us),use='complete.obs') # evaluate correlations
corrplot(cors,order="AOE",addCoef.col="grey",number.cex=.6) # plot correlations

#'
#' This plot nicely clumps groups of similar variables. Choose a representative variable from each clump.
#'
## ------------------------------------------------------------------------
clim=clim[[c("bio1","bio2","bio13","bio14")]] # keep just reasonably uncorrelated ones
clim.us=clim.us[[c('bio1','bio2','bio13','bio14')]] # keep just reasonably uncorrelated ones
cors=cor(values(clim.us),use='complete.obs') # evaluate correlations
corrplot(cors,order="AOE",addCoef.col="grey",number.cex=.6) # plot correlations

#'
#' Ok, tolerable. Some people advocate that correlations should be <0.7. I prefer lower, like 0.3 or 0.4, because I often forecast (as we'll do below), and one must assume that those correlations hold in new scenarios to make meaningful forecasts.
#'
#' Scaling each predictor to zero mean and unit variance is a common statistical approach to make sure the coefficients you'll estimate are comparable (on the same scale), and it prevents a few other wonky things from possibly happening.
#'
## ------------------------------------------------------------------------
# scale each predictor to mean=0, variance=1
clim.means=apply(values(clim.us),2,mean,na.rm=T) # means
clim.sds=apply(values(clim.us),2,sd,na.rm=T) # standard deviations
name=names(clim.us)
values(clim.us)=sapply(1:nlayers(clim.us),function(x)
  (values(clim.us)[,x]-clim.means[x])/clim.sds[x]) # z-scores
names(clim.us)=name

# get environment at pres points
coordinates(pres)=c('longitude','latitude') # set coords to allow extraction (next line)
pres.data=data.frame(raster::extract(clim.us,pres)) # extract data at pres locations
coordinates(pres.data)=coordinates(pres) # make sure the data have coords associated
pres.data=pres.data[complete.cases(pres.data@data),] # toss points without env data

## ------------------------------------------------------------------------
plot(clim.us) # view

#'
#' ## Sample background
#'
#' In presence-only (PO) modeling, absence data do not exist, so so-called 'background' (== jargon) points are used instead. In PO models, one compares the environmental conditions at occupied locations (presences) to the conditions available in the region of interest. This asks, 'how much does the species use environment x in proportion to its availability?'
#'
#' > Decision: The species is equally likely to be anywhere on the landscape, so we'll compare presences to a random sample of background points.
#'
#' (There are a lot of subtleties about background selection; just go with it for now...)
#'
## ------------------------------------------------------------------------
# sample background (to compare against presences)
all.background=which(complete.cases(values(clim.us))) # find cells on land
bg.index=sample(all.background,min(length(all.background),10000)) # take a random sample of land
bg.data=data.frame(values(clim.us)[bg.index,]) # get the env at these cells
coordinates(bg.data)=coordinates(clim.us)[bg.index,] # define spatial object

#'
#' ## Statistical model
#'
#' > Decision: Linear and quadratic terms are sufficient to describe the species' response to the environment.
#'
#' Next, combine the data into a convenient form and specify a formula for the regression.
#'
## ------------------------------------------------------------------------
# prep data for use in glm()
all.data=rbind(data.frame(pres=1,pres.data@data),data.frame(pres=0,bg.data@data)) # line up pres & bg
# specify formula (programmatically, to avoid writing out every name)
(form=paste('pres/weight~', # lhs of eqn.
  paste(names(all.data)[-1], collapse = " + "),'+', # linear terms
  paste("I(", names(all.data)[-1], "^2)", sep = "", collapse = " + "))) # quadratic terms

#'
#' There are some subtle differences here compared to a regular old GLM. The weights (defined below) allow one to fit a Poisson point process model with the `glm` function. If this sort of thing excites you, [this paper](http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12352/abstract) describes point process models well, and its appendix describes this weighting scheme. If it doesn't, just pretend this is a regular GLM for now.
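#'
#' To see the weighting trick in miniature before applying it to the real data, here is a self-contained toy sketch. The simulated variable `env`, the sample sizes, and the background weight of 1000 are all made up for illustration; only the weighting scheme itself mirrors the workflow.

```r
# toy illustration of the down-weighted Poisson trick
set.seed(1)
n.pres=100; n.bg=1000
env=c(rnorm(n.pres,mean=1),rnorm(n.bg,mean=0)) # presences sit at higher env values
toy=data.frame(pres=c(rep(1,n.pres),rep(0,n.bg)),env=env)
toy$weight=toy$pres+(1-toy$pres)*1000 # weight 1 on presences, large on background
m=glm(pres/weight~env+I(env^2),data=toy,
      family=poisson(link='log'),weights=weight) # the point-process GLM
coef(m) # the linear coefficient on env should come out positive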
#'
## ------------------------------------------------------------------------
all.data$weight = all.data$pres + (1 - all.data$pres) * 10000 # these weights allow you to fit a point process
mod.worst=glm(form,data=all.data,family=poisson(link='log'),weights=weight) # fit the model
summary(mod.worst) # show coefficients

#'
#' ## Inspect response curves
#'
#' Response curves describe how the species' occurrence (y-axis) depends on a single climate variable (x-axis). There's one for each environmental variable in the model. Usually this is done by making predictions with all the other predictors set at their means. Most packages have 1-liners to make these, but doing it by hand builds more character. We'll use these response curves to explore the consequences of model transfer later.
#'
## ----results='hide'------------------------------------------------------
# check response curves
# these marginal response curves are evaluated at the means of the non-focal predictors
clim.ranges=apply(values(clim.us),2,range,na.rm=T) # upper and lower limits for each variable
dummy.mean.matrix=data.frame(matrix(0,ncol=nlayers(clim.us),nrow=100)) # makes the prediction concise below
names(dummy.mean.matrix)=colnames(clim.ranges) # line up names for later reference
response.curves=lapply(1:nlayers(clim.us),function(x){ # loop over each variable
  xs=seq(clim.ranges[1,x],clim.ranges[2,x],length=100) # x values to evaluate the curve
  newdata=dummy.mean.matrix # data frame with the right structure
  newdata[,x]=xs # plug in just the values of the focal variable that differ from the mean
  ys=predict(mod.worst,newdata=newdata) # predictions
  return(data.frame(xs=xs,ys=ys)) # define outputs
}) # ignore warnings

#'
#' Check out the list of lists that stores this.
#'
## ------------------------------------------------------------------------
str(response.curves) # structure of the object used for plotting

## ----width=7-------------------------------------------------------------
# plot the curves
par(mfrow=c(2,2),mar=c(4,5,.5,.5)) # rows and cols for plotting
for(i in 1:nlayers(clim.us)){ # loop over layers
  plot(response.curves[[i]]$xs,response.curves[[i]]$ys, # xs and ys
    type='l', # line plot
    bty='n',las=1, # decorations
    ylim=c(-20,20), # y axis limits
    xlab=colnames(clim.ranges)[i],ylab='occurrence rate') # axis labels
  pres.env.range=range(pres.data[names(clim.us)[i]]@data) # find limits of fitting data
  abline(v=pres.env.range,col='red',lty=2) # plot limits of fitting data
}

#'
#' ## Map predictions
#'
#' > Decision: When predicting, it's ok to extrapolate beyond the range of the fitting data.
#'
## ----width=7-------------------------------------------------------------
# predict to the US
pred.r=raster::predict(clim.us,mod.worst, index=1,type="response")
pred.r=pred.r/sum(values(pred.r),na.rm=T) # normalize prediction (sum to 1)
plot(log(pred.r)) # plot raster
plot(pres,add=T) # plot points

#'
#' ## Transfer to new conditions
#'
#' A common goal of SDMing is to transfer models to new locations or times. Here we'll project the model to the species' native range in Europe. One might do this to determine whether garlic mustard in New England occupies the same climate niche as in Europe. If the range is smaller in the US, perhaps it's still expanding. If the range is larger in the US, perhaps it has evolved.
#'
#' > Decision: The occurrence-environment relationship fit in New England also describes the species' response to environment in Europe.
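#'
#' One detail worth pausing on before the transfer: the new region's predictors must be standardized with the means and SDs from the *fitting* region (here, `clim.means` and `clim.sds`), not re-standardized on their own, or the model's coefficients no longer line up with the data. A minimal base-R sketch, with all numbers made up for illustration:

```r
# why you must reuse the fitting-region means/SDs when transferring
fit.vals=c(10,12,14,16,18)  # e.g. a temperature variable in the fitting region
new.vals=c(20,22,24)        # the same variable in the transfer region (warmer)
fit.mean=mean(fit.vals); fit.sd=sd(fit.vals)

right=(new.vals-fit.mean)/fit.sd              # comparable to the fitting data
wrong=(new.vals-mean(new.vals))/sd(new.vals)  # re-centered on the new region

right # large positive z-scores: genuinely warmer than anything the model saw
wrong # centered on 0: deceptively looks like the fitting conditions
```

The `right` version is exactly what the transfer code below does with `clim.means` and `clim.sds`.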
#'
## ----width=7-------------------------------------------------------------
# transfer to Europe
# choose domain (just Europe)
clim.eu=raster::crop(clim,c(-10,55,30,75))
values(clim.eu)=sapply(1:nlayers(clim.eu),function(x)
  (values(clim.eu)[,x]-clim.means[x])/clim.sds[x]) # z-scores (to make values comparable to the scaled values used for fitting)
names(clim.eu)=names(clim.us)
transfer.r=raster::predict(clim.eu,mod.worst, index=1,type="response")
transfer.r=transfer.r/sum(values(transfer.r),na.rm=T) # normalize prediction (sum to 1)
plot(log(transfer.r)) # plot preds
plot(pres,add=T) # plot presences

#'
#' ## How do predictions differ if we don't extrapolate?
#'
#' A trick to prevent extrapolation is simply to set the predictions to NA beyond the range of the data. Here's a function I wrote to make a mask for the environmental conditions that are beyond the data range. It's not critical that you understand how I made this function at this stage, but rather how to use it. Here's what the parameters mean:
#'
#' * `thisEnv` a `rasterStack` corresponding to the layers you want to mask
#' * `dat` a `data.frame` of the values of each predictor at background points
#' * `val` the number of standard deviations beyond the fitting data where you're willing to predict. Set to 0 for no extrapolation.
#'
## ------------------------------------------------------------------------
extrapolationMask=function(thisEnv, dat=NULL, val=1){
  vars=colnames(dat)
  check=names(thisEnv) %in% vars
  if(any(!check)) warning(paste0('Some variables were not masked because they were not included in dat: ',
    names(thisEnv)[!check]))
  masks=stack(lapply(vars,function(x){
    e.min=min(dat[,x],na.rm=T)
    e.max=max(dat[,x],na.rm=T)
    e.sd=sd(dat[,x],na.rm=T)
    e.min.lim=e.min-val*e.sd
    e.max.lim=e.max+val*e.sd
    mask1=thisEnv[[x]] >= e.min.lim & thisEnv[[x]] <= e.max.lim
    values(mask1)[values(mask1)==0]=NA
    mask1
  }))
  names(masks)=vars
  maskedEnv=stack(lapply(vars,function(x){
    raster::mask(thisEnv[[x]],masks[[x]],maskValue=0)
  }))
  names(maskedEnv)=vars
  # make a report of which layers lead to masking
  nonNAPreMask=apply(values(thisEnv),2,function(x) sum(!is.na(x)))
  nonNAMasked=apply(values(maskedEnv),2,function(x) sum(!is.na(x)))
  print(data.frame(nonNAPreMask=nonNAPreMask,nonNAMasked=nonNAMasked,
    numCellsMasked=nonNAPreMask-nonNAMasked))
  return(list(masks=masks,maskedEnv=maskedEnv))
}

#'
#' Let's use `extrapolationMask` to prevent extrapolation.
#'
## ------------------------------------------------------------------------
noExtrapMask=extrapolationMask(clim.eu, all.data[,c('bio1','bio2','bio13','bio14')], val=0)
plot(noExtrapMask[[2]]) # plot each of the masked env layers

#' Values of NA indicate that you're not allowed to predict there (that would be extrapolating).
#'
## ------------------------------------------------------------------------
transfer.r1=raster::predict(noExtrapMask[[2]],mod.worst, index=1,type="response")
transfer.r1=transfer.r1/sum(values(transfer.r1),na.rm=T) # normalize prediction (sum to 1)
plot(log(transfer.r1)) # plot preds
plot(pres,add=T,cex=.2) # plot presences

#'
#' Hmm, looks like you're extrapolating in most of Europe. Since you built a niche model, I think it's reasonable to extrapolate a little bit.
#' That is, you're assuming that your model describes some sort of general relationship, so why shouldn't that relationship hold if you're at least near your fitting data? Other processes may come into play far from the fitting data, so it's fair not to extrapolate too far. The trouble is, what's too far? There isn't a single answer, but let's try 1 SD beyond the fitting data.
#'
## ------------------------------------------------------------------------
noExtrapMask1=extrapolationMask(clim.eu, all.data[,c('bio1','bio2','bio13','bio14')], val=1)
plot(noExtrapMask1[[2]]) # plot each of the masked env layers

#' Values of NA indicate that you're not allowed to predict there (that would be extrapolating).
#'
## ------------------------------------------------------------------------
transfer.r2=raster::predict(noExtrapMask1[[2]],mod.worst, index=1,type="response")
transfer.r2=transfer.r2/sum(values(transfer.r2),na.rm=T) # normalize prediction (sum to 1)
plot(log(transfer.r2)) # plot preds
plot(pres,add=T,cex=.2) # plot presences

#'
#' How far would you have to extrapolate to make predictions across most of Europe?
#'
#' ## Homework
#'
#' Build an SDM for a species of your choice, following the workflow above. Transfer it from one region to another. It may be easiest to choose an invasive species so you can fit in the native range and project to the invasible range, or vice versa. Explore different extrapolation scenarios.
#' Answer the following questions in fewer than 4 sentences each:
#'
#' 1. Explain why you chose your species.
#'
#' 2. Based on your response curves, how many SDs beyond the fitting data are you willing to extrapolate?
#'
#' 3. Describe the pros and cons of building an SDM from an expert map, as I did in the case study at the beginning of the lecture.
#'