
Adding phylopic.org silhouettes to R plots

Over at phylopic.org there is a large and growing collection of silhouette images of all manner of organisms – everything from emus to Staphylococcus. The images are free (both in cost and to use), are available in vector (svg) and raster (png) formats at a range of resolutions, and can be searched by common name, by scientific name and (perhaps most powerfully) phylogenetically.

[EDIT: as two commenters have pointed out, not all phylopic images are totally free of all restrictions on use or reuse: some require attribution, or are only free for non-commercial use. It’s best to check before using an image, either directly at the phylopic webpage, or by using the phylopic API]

Phylopic images are useful wherever it is necessary to illustrate exactly which taxon a graphical element pertains to, as pictures always speak louder than words.

Below I provide an example of using phylopic images in R graphics. I include some simple code to automatically resize and position a phylopic png within an R plot. The code is designed to preserve the original png’s aspect ratio, and to place the image at a given location within the plot.
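A minimal sketch of this approach uses base graphics and the png package. Here add_phylopic_png() is a hypothetical helper name, and the aspect-ratio correction is one possible implementation under those assumptions, not necessarily the original code:

require(png)

#place a png image at (x, y) with a given width in user coordinates,
#preserving the png's aspect ratio. add_phylopic_png() is a
#hypothetical helper name, not the original function.
add_phylopic_png<-function(img, x, y, width){
  dims<-dim(img)            #height, width, channels
  aspect<-dims[1]/dims[2]   #height/width ratio of the png
  #correct for the differing x and y scales of the plotting
  #region, so the image is not stretched
  usr<-par("usr"); pin<-par("pin")
  u_per_in_x<-(usr[2]-usr[1])/pin[1]
  u_per_in_y<-(usr[4]-usr[3])/pin[2]
  height<-width*aspect*(u_per_in_y/u_per_in_x)
  rasterImage(img, xleft=x-width/2, ybottom=y-height/2,
              xright=x+width/2, ytop=y+height/2, interpolate=TRUE)
}

#usage, assuming a png downloaded from phylopic:
#img<-readPNG("silhouette.png")
#plot(1:10, type="n")
#add_phylopic_png(img, x=5, y=5, width=2)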

[Figure: a plot with phylopic logos]

I should also point readers to Scott Chamberlain‘s R package fylopic, which makes the phylopic API available from within R, including the ability to search for and download silhouettes programmatically.

If you find phylopic useful, I’m sure they would appreciate you providing them with silhouettes of your study species. More information on how to submit your images can be found here.

Applying a circular moving window filter to raster data in R

The raster package for R provides a variety of functions for the analysis of raster GIS data. The focal() function is very useful for applying moving window filters to such data. I wanted to calculate a moving window mean for cells within a specified radius, but focal() did not provide a built-in option for this. The following code generates an appropriate weights matrix for implementing such a filter; the matrix is then supplied as the w argument of focal().

require(raster)
#function to make a circular weights matrix of given radius and resolution
#NB radius must be a whole-number multiple of res!
make_circ_filter<-function(radius, res){
  circ_filter<-matrix(NA, nrow=1+(2*radius/res), ncol=1+(2*radius/res))
  dimnames(circ_filter)[[1]]<-seq(-radius, radius, by=res)
  dimnames(circ_filter)[[2]]<-seq(-radius, radius, by=res)
  #set cells whose centres lie within the radius to 1
  sweeper<-function(mat){
    for(row in 1:nrow(mat)){
      for(col in 1:ncol(mat)){
        dist<-sqrt((as.numeric(dimnames(mat)[[1]])[row])^2 +
          (as.numeric(dimnames(mat)[[2]])[col])^2)
        if(dist<=radius) {mat[row, col]<-1}
      }
    }
    return(mat)
  }
  out<-sweeper(circ_filter)
  return(out)
}

This example uses a weights matrix generated by make_circ_filter() to compute a circular moving average on the Meuse river grid data. For a small raster like this, the function is more than adequate, though it is quite slow for large raster datasets.

#make a circular filter with 120m radius and 40m resolution
cf<-make_circ_filter(120, 40)

#test it on the meuse grid data
f <- system.file("external/test.grd", package="raster")
r <- raster(f)

r_filt<-focal(r, w=cf, fun=mean, na.rm=T)

plot(r, main="Raw data") #original data
plot(r_filt, main="Circular moving window filter, 120m radius") #filtered data


Using R for spatial sampling, with selection probabilities defined in a raster

The raster package for R provides a range of GIS-like functions for analysing spatial grid data. Together with the sp package and several other spatial analysis packages, R provides a comprehensive set of tools for manipulating and analysing spatial data.

I needed to randomly select some locations for field sampling, with inclusion probabilities based on values contained in a raster. The code below did the job very easily.

library(raster)

#an example raster from the raster package
f <- system.file("external/test.grd", package="raster")
r<-raster(f)

plot(r)

#make a raster defining the desired inclusion probabilities 
#for all locations available for sampling
probrast<-raster(r)
#inclusion probability for cells with value >=400 
#will be 10 times that for cells with value <400
probrast[r>=400]<-10 
probrast[r<400]<-1
#normalise the probability raster by dividing 
#by the sum of all inclusion weights:
probrast<-probrast/sum(getValues(probrast), na.rm=T)

#confirm sum of probabilities is one
sum(getValues(probrast), na.rm=T)

#plot the raster of inclusion probabilities
plot(probrast, col=c(gray(0.7), gray(0.3)))

#a function to select N points on a raster, with 
#inclusion probabilities defined by the raster values.
probsel<-function(probrast, N){
  x<-getValues(probrast)
  #set NA cells in raster to zero
  x[is.na(x)]<-0
  samp<-sample(nrow(probrast)*ncol(probrast), size=N, prob=x)
  samprast<-raster(probrast)
  samprast[samp]<-1 #set value of sampled squares to 1
  #convert to SpatialPoints
  points<-rasterToPoints(samprast, fun=function(x){x>0})
  points<-SpatialPoints(points[,1:2]) #keep only the x, y coordinates
  return(points)
}

#select 300 sites using the inclusion probabilities 
#defined in probrast
samppoints<-probsel(probrast, 300)
plot(probrast, col=c(gray(0.7), gray(0.3)), axes=F)
plot(samppoints, add=T, pch=16, cex=0.8, col="red")

Here’s the result. Note the higher density of sampled points (red) within the parts of the raster with higher inclusion probability (dark grey).
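As a quick sanity check (an addition, not part of the original post), you can extract the inclusion probabilities at the sampled points and tabulate them; each probability class should receive points roughly in proportion to its share of the total inclusion weight:

#tabulate inclusion probabilities at the sampled points;
#counts should be roughly proportional to each class's
#share of the total inclusion weight
vals<-extract(probrast, samppoints)
table(vals)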

State-space occupancy model using PyMC

I’ve continued my experimentation with PyMC, using it to fit occupancy models to wildlife survey data with imperfect detectability. Chris Fonnesbeck has provided code for a simple, single-survey occupancy model here, which provides a good starting point for experimentation. I wanted to construct my model using the alternative, state-space parameterisation for occupancy models described by Royle and Kéry (2007). Unlike the multi-season, dynamic occupancy model described by Royle and Kéry, I am only fitting a single-season occupancy model, where the site states (occupied or unoccupied) are assumed to be constant. The model uses a hierarchical approach: sites are occupied with probability \psi, and the true occupancy states of the sites, z, are inferred from repeated surveys at each site via a probabilistic detection model (here a simple Bernoulli model, with conditional probability of detection at each survey p). Fitting this model using MCMC has the advantage that a finite-sample estimate of the occupancy rate among the sampled sites can easily be computed by sampling from \sum z.
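In symbols (a compact restatement of the model just described, with y_i the number of detections from the k surveys at site i):

z_i \sim \mathrm{Bernoulli}(\psi)

y_i \mid z_i \sim \mathrm{Binomial}(k, z_i p)

so unoccupied sites (z_i = 0) produce detections with probability zero, and the finite-sample number of occupied sites is \sum_i z_i.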

from pymc import *
from numpy import *

"""
Alternative implementation of single season occupancy estimation for the Salamander data (from MacKenzie et al. 2006), using a 
state-space approach.

Modified from original example code and data provided by Chris Fonnesbeck at https://github.com/pymc-devs/pymc/wiki/Salamanders
"""


# Occupancy data - rows are sites, with replicate surveys conducted at each site
salamanders = array([[0,0,0,1,1], [0,1,0,0,0], [0,1,0,0,0], [1,1,1,1,0], [0,0,1,0,0], 
                     [0,0,1,0,0], [0,0,1,0,0], [0,0,1,0,0], [0,0,1,0,0], [1,0,0,0,0], 
                     [0,0,1,1,1], [0,0,1,1,1], [1,0,0,1,1], [1,0,1,1,0], [0,0,0,0,0], 
                     [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], 
                     [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], 
                     [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], 
                     [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], 
                     [0,0,0,1,0], [0,0,0,1,0], [0,0,0,0,1], [0,0,0,0,1]])

# Number of replicate surveys at each site
k = 5

#number of detections at each site (row sums of the data)
y=salamanders.sum(axis=1)

#vector of known/unknown occupancies to provide sensible starting values for latent states, z.
#Equal to 1 if at least 1 detection, otherwise zero.
z_start = y>0

# Prior on probability of detection
p = Beta('p', alpha=1, beta=1, value=0.99)

# Prior on probability of occupancy
psi = Beta('psi', alpha=1, beta=1, value=0.01)

#latent states for occupancy
z = Bernoulli('z', p=psi, value=z_start, plot=False)

#Number of truly occupied sites in the sample (finite-sample occupancy)
@deterministic(plot=True)
def Num_occ(z=z):
    out = sum(z)
    return out

#unconditional probabilities of detection at each site (zero for unoccupied sites, p for occupied sites)
@deterministic(plot=False)
def pdet(z=z, p=p):
    out = z*p
    return out

#likelihood
Y = Binomial('Y', n=k, p=pdet, value=y, observed=True)

The model was fitted by running the following code, which constructs the model, collects MCMC samples for the parameters of interest, and generates plots of the results:

from pylab import *
from pymc import *

import model

#get all the variables from the model file
M = MCMC(model)

#draw samples
M.sample(iter=40000, burn=20000, thin=5)

#plot results
Matplot.plot(M)

Here are some summary plots (traces and histograms) for the MCMC samples of the parameters \psi, p and \sum z.

The same model can easily be fitted using OpenBUGS, with comparable results.

model{
  psi~dbeta(1, 1)
  p~dbeta(1, 1)
  for(i in 1:sites){
    z[i]~dbern(psi)
    pdet[i]<-p*z[i]
    for(j in 1:surveys){
      Y[i,j]~dbern(pdet[i])
    }
  }
}

data
list(
surveys=5,
sites=39,
Y=structure(.Data = 
c(0,0,0,1,1, 
0,1,0,0,0, 
0,1,0,0,0, 
1,1,1,1,0, 
0,0,1,0,0, 
0,0,1,0,0, 
0,0,1,0,0, 
0,0,1,0,0, 
0,0,1,0,0, 
1,0,0,0,0, 
0,0,1,1,1, 
0,0,1,1,1, 
1,0,0,1,1, 
1,0,1,1,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,0,0, 
0,0,0,1,0, 
0,0,0,1,0, 
0,0,0,0,1, 
0,0,0,0,1),
 .Dim=c(39,5))
)