Category Archives: Software

Adding silhouettes to R plots

Over at there is a large and growing collection of silhouette images of all manner of organisms – everything from Emus to Staphylococcus. The images are free (both in cost, and to use), are available in vector (svg) and raster (png) formats at a range of resolutions, and can be searched by common name, scientific name and (perhaps most powerfully) phylogenetically.

[EDIT: as two commenters have pointed out, not all phylopic images are totally free of all restrictions on use or reuse: some require attribution, or are only free for non-commercial use. It’s best to check before using an image, either directly at the phylopic webpage, or by using the phylopic API]

Phylopic images are useful wherever it is necessary to illustrate exactly which taxon a graphical element pertains to, as pictures always speak louder than words.

Below I provide an example of using phylopic images in R graphics. I include some simple code to automatically resize and position a phylopic png within an R plot. The code is designed to preserve the original png’s aspect ratio, and to place the image at a given location within the plot.

A plot with phylopic logos

I should also point readers to Scott Chamberlain‘s R package fylopic, which provides the ability to make use of the phylopic API from within R, including the ability to search for and download silhouettes programatically.

If you find phylopic useful, I’m sure they would appreciate you providing them with silhouettes of your study species. More information on how to submit your images can be found here.

Using R for spatial sampling, with selection probabilities defined in a raster

The raster package for R provides a range of GIS-like functions for analysing spatial grid data. Together with package sp, and several other spatial analysis packages, R provide a quite comprehensive set of tools for manipulating and analysing spatial data.

I needed to randomly select some locations for field sampling, with inclusion probabilities based on values contained in a raster. The code below did the job very easily.


#an example raster from the raster package
f <- system.file("external/test.grd", package="raster")


#make a raster defining the desired inclusion probabilities 
#for the all locations available for sampling
#inclusion probability for cells with value >=400 
#will be 10 times that for cells with value <400
#normalise the probability raster by dividing 
#by the sum of all inclusion weights:
probrast<-probrast/sum(getValues(probrast), na.rm=T)

#confirm sum of probabilities is one
sum(getValues(probrast), na.rm=T)

#plot the raster of inclusion probabilities
plot(probrast, col=c(gray(0.7), gray(0.3)))

#a function to select N points on a raster, with 
#inclusion probabilities defined by the raster values.
probsel<-function(probrast, N){
  #set NA cells in raster to zero
  samp<-sample(nrow(probrast)*ncol(probrast), size=N, prob=x)
  samprast[samp]<-1 #set value of sampled squares to 1
  #convert to SpatialPoints
  points<-rasterToPoints(samprast, fun=function(x){x>0}) s

#select 300 sites using the inclusion probabilities 
#defined in probrast
samppoints<-probsel(probrast, 300)
plot(probrast, col=c(gray(0.7), gray(0.3)), axes=F)
plot(samppoints, add=T, pch=16, cex=0.8, col="red")

Here’s the result. Note the higher density of sampled points (red) within the parts of the raster with higher inclusion probability (dark grey).

State-space occupancy model using PyMC

I’ve continued my experimentation with PyMC, using it to fit occupancy models to wildlife survey data with imperfect detectability. Chris Fonnesbeck has provided code for a simple, single-survey occupancy model here, which provides a good starting point for experimentation. I wanted to construct my model using the alternative, state-space parameterisation for occupancy models described by Royle and Kéry(2007). Unlike the multi-season, dynamic occupancy model described by Royle and Kéry, I am only fitting a single-season occupancy model, where the site-states (occupied or unoccupied) are assumed to be constant. The model uses a hierarchical approach, where sites are occupied with probability \psi, and the true occupancy states of the sites, z are inferred from repeated surveys at each site based on a probabilistic detection model (in this case a simple Bernoulli model, with conditional probability of detection at each survey p). Fitting this model using MCMC has the advantage that a finite-sample estimate of occupancy rates among the sampled sites can be easily computed by sampling from \sum z.

from pymc import *
from numpy import *

Alternative implementation of single season occupancy estimation for the Salamander data (from MacKenzie et al. 2006), using a 
state-space approach.

Modified from original example code and data provided by Chris Fonnesbeck at

# Occupancy data - rows are sites, with replicate surveys conducted at each site
salamanders = array([[0,0,0,1,1], [0,1,0,0,0], [0,1,0,0,0], [1,1,1,1,0], [0,0,1,0,0], 
                     [0,0,1,0,0], [0,0,1,0,0], [0,0,1,0,0], [0,0,1,0,0], [1,0,0,0,0], 
                     [0,0,1,1,1], [0,0,1,1,1], [1,0,0,1,1], [1,0,1,1,0], [0,0,0,0,0], 
                     [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], 
                     [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], 
                     [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], 
                     [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0], 
                     [0,0,0,1,0], [0,0,0,1,0], [0,0,0,0,1], [0,0,0,0,1]])

# Number of replicate surveys at each site
k = 5

#number of detections at each site (row sums of the data)

#vector of known/unknown occupancies to provide sensible starting values for latent states, z.
#Equal to 1 if at least 1 detection, otherwise zero.
z_start = y>0

# Prior on probability of detection
p = Beta('p', alpha=1, beta=1, value=0.99)

# Prior on probability of occupancy
psi = Beta('psi', alpha=1, beta=1, value=0.01)

#latent states for occupancy
z = Bernoulli('z', p=psi, value=z_start, plot=False)

#Number of truly occupied sites in the sample (finite-sample occupancy)
def Num_occ(z=z):
   out = sum(z)
   return out

#unconditional probabilities of detection at each site (zero for unoccupied sites, p for occupied sites)
def pdet(z=z, p=p):
    out = z*p
    return out

Y = Binomial('Y', n=k, p=pdet, value=y, observed=True)

Fitting of the model was accomplished by running the following code, which constructs the model, collects some MCMC samples for parameters of interest, and generates plots of the results:

from pylab import *
from pymc import *

import model

#get all the variables from the model file
M = MCMC(model)

#draw samples
M.sample(iter =40000, burn = 20000, thin = 5)

#plot results

Her are some summary plots (traces and histograms) for the MCMC samples of the parameters \psi, p and \sum z

The same model can easily be fitted using OpenBUGS, with comparable results..

psi~dbeta(1, 1)
p~dbeta(1, 1)
for(i in 1:sites){
  for(j in 1:surveys){

Y=structure(.Data = 

PyMC for Bayesian models

BUGS and JAGS have been the main tools I have used for fitting Bayesian statistical models for a long time now. Both have their strengths and weaknesses, but they are extremely useful tools, and I would anticipate that they will continue to develop their capabilities, and remain important components of my statistical toolbox for some time to come.

Recently, I’ve become aware of an alternative platform for Bayesian modelling, that has similar potential to BUGS and it’s dialects – PyMC. PyMC provides a framework for describing and fitting Bayesian statistical models using the Python programming language. Having read the paper describing the software, and consulted the user guide, I decided to have a try at building a simple linear regression model as a test, despite having very limited experience with python. I found that consulting the examples on the PyMC website, as well as the material presented in Abraham Flaxman’s blog very helpful for getting started, and for solving problems along the way.

I started by simulating some data from a very simple Gaussian linear model using R. I’m sure this could be easily done in Python, but for now R will be quicker and easier for me to code:

x<-round(runif(N, -20, 20))
y<-rnorm(N, 2*x+ 3, 10)
cat(x, sep=", ", fill=T)
cat(y, sep=", ", fill=T)

Running this code resulted in two nicely concatenated vectors of random x and y values generated from the (known) regression model y=\alpha+\beta x + \epsilon. These random values were easily transferred to the PyMC code for the Bayesian model using cut-and-paste – clumsy, but it works for me…..

Here is the python code for the ordinary linear model, with the randomly generated data (called YY and XX) pasted in. Vague normal priors were assumed for the slope and intercept parameters (\alpha and \beta), while the standard deviation of the random errors (\sigma) was assigned a Uniform prior:

## Regression
from pymc import *
from numpy import *

YY = array([-19.23776, 1.559197, 27.90364, -14.94222, -41.34614, 5.857922,  -26.24492, -1.670176, -8.349098, -24.91511, 63.86167, 20.87778, 4.176622, -35.65956, 4.482383, 36.20763, 33.60314, 23.25372, -15.52639, -25.59295, 42.48803, -29.46465, 30.25402, -5.66534, -20.92914, 44.87109, 19.07603, 22.19699, 18.89613, 2.835296, 12.68109, -17.19655, 26.60962, -28.74333, -24.69688,  -19.02279, -31.39471, -17.83819, 15.389, 40.41935, 0.972758, -36.49488,  -2.041068, 23.22597, 1.226252, 11.87125, 36.32597, 29.20536, 16.24043, -0.8978296])

XX = array([-14, -6, 19, -12, -16, 1, -15, -13, 0, -6, 15, 8, 1, -16, -5, 19, 8, 7, -11, -13, 13, -18, 10, -1, -13, 13, 13, 17, 13, 11, 4, -6, 14, -14, 3, -3, -18, -11, 6, 13, -10, -12, -2, 9, -7, -1, 14, 15, 6, -2])

sigma = Uniform('sigma', 0.0, 200.0, value=20)
alpha = Normal('alpha', 0.0, 0.001, value=0)
beta = Normal('beta', 0.0, 0.001, value=0)

def modelled_yy(XX=XX, beta=beta, alpha=alpha):
    return beta*XX + alpha

y = Normal('y', mu=modelled_yy, tau=1.0/sigma**2, value=YY, observed=True)

The python code for the model saved to a file named Generating an MCMC sample from the parameters of model was then just a matter of running the following code within a python shell:

from pylab import *
from pymc import *
import regress
M = MCMC(regress)
M.sample(10000, burn=5000)

The code also generates some summary plots (traces and histograms) for each of the parameters. So far so good – it looks like the inferred values for the parameters fairly closely match those that the random data were generated from:

I’ll move onto some more complex models soon, but so far PyMC looks quite promising as a tool for Bayesian modelling. Perhaps a useful strategy for learning will be to construct a variety models of increasing complexity, with a focus on the types of models I use for my research.

Checking river pumping restrictions using python

I have a licence from Melbourne Water to pump water from the Yarra River. Water pumped from the river is stored in a large concrete tank (20,000 liters), and used for toilet flushing, watering the garden and as a source of water for fire fighting.

For environmental reasons, pumping from the river is restricted during periods of low flow. Melbourne water publishes flow and pumping restriction data on their website, so I can check if restrictions apply before starting the pump to top up the tank. I hacked together a quick python script to get the information I need, so I can display the information on my computer’s desktop using conky.

The code makes use of the Beautiful Soup html parser for python to extract the data from Melbourne Water’s web page.

#!/usr/bin/env python
#Python script to get current pumping restrictions for Upper Yarra River.
import mechanize
import HTMLParser
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser()
data =
soup = BeautifulSoup(data)
table=soup.find("table",title="Table showing current waterway diversion status")
for row in table.findAll('tr')[1:]:
    col = row.findAll('td')
    restrict = col[1]
    ban = col[2].string
    flow = col[3].string
    avflow = col[4].string
    date = col[5].string
#Function to strip html tags from table cells (allows for episodic coloring/bolding of cell contents when bans apply!)
def stripper(data):
   data = str(data)
   count = data.count('<')
   while count:
       start = data.find('<')
       end = data.find('>')
       rem = data[start:end+1]
       data = data.replace(rem,'',1)
   out = data
   return out
#apply the stripper function to current restriction and ban statuses
restrict = stripper(restrict)
ban = stripper(ban)
print 'Restricted?  %s   Banned? %s' % (restrict, ban) 
print 'Flow: %s ML/d  AvFlow: %s ML/d' % (flow, avflow)

Bushfire danger forecast in conky

The Australian Bureau of Meteorology publishes daily forecasts of bush fire danger on its web and ftp sites. I live in an area with very high fire danger during the summer months, and I need to be aware of the forecast bushfire risk. I devised this simple bash script (called, which grabs the local weather forecast, strips out the required information and returns it.

#! /bin/bash
#BOM weather forecast for Yarra Glen
#Get the data from the web
  DATA=$(curl -s $URL)
#strip out the fire danger rating line from the raw text
  FIREDANGER=$(echo "$DATA" | grep -A1 'Fire Danger')
#print the fire danger forecast
  echo $(echo "$FIREDANGER")

Displaying this information on my computer’s desktop using conky,  is accomplished by adding the following line to my .conkyrc file:

${execi 4000 /home/michael/Scripts/}