Category Archives: R

Adding silhouettes to R plots

Over at there is a large and growing collection of silhouette images of all manner of organisms – everything from Emus to Staphylococcus. The images are free (both in cost, and to use), are available in vector (svg) and raster (png) formats at a range of resolutions, and can be searched by common name, scientific name and (perhaps most powerfully) phylogenetically.

[EDIT: as two commenters have pointed out, not all phylopic images are totally free of all restrictions on use or reuse: some require attribution, or are only free for non-commercial use. It’s best to check before using an image, either directly at the phylopic webpage, or by using the phylopic API]

Phylopic images are useful wherever it is necessary to illustrate exactly which taxon a graphical element pertains to, as pictures always speak louder than words.

Below I provide an example of using phylopic images in R graphics. I include some simple code to automatically resize and position a phylopic png within an R plot. The code is designed to preserve the original png’s aspect ratio, and to place the image at a given location within the plot.

A plot with phylopic logos

I should also point readers to Scott Chamberlain‘s R package fylopic, which provides the ability to make use of the phylopic API from within R, including the ability to search for and download silhouettes programatically.

If you find phylopic useful, I’m sure they would appreciate you providing them with silhouettes of your study species. More information on how to submit your images can be found here.

Applying a circular moving window filter to raster data in R

The raster package for R provides a variety of functions for the analysis of raster GIS data. The focal() function is very useful for applying moving window filters to such data. I wanted to calculate a moving window mean for cells within a specified radius, but focal() did not provide a built-in option for this. The following code generates an appropriate weights matrix for implementing such a filter, by using the matrix as the w argument of focal().

#function to make a circular weights matrix of given radius and resolution
#NB radius must me an even multiple of res!
make_circ_filter<-function(radius, res){
  circ_filter<-matrix(NA, nrow=1+(2*radius/res), ncol=1+(2*radius/res))
  dimnames(circ_filter)[[1]]<-seq(-radius, radius, by=res)
  dimnames(circ_filter)[[2]]<-seq(-radius, radius, by=res)
    for(row in 1:nrow(mat)){
      for(col in 1:ncol(mat)){
        dist<-sqrt((as.numeric(dimnames(mat)[[1]])[row])^2 +
        if(dist<=radius) {mat[row, col]<-1}

This example uses a weighs matrix generated by make_circ_filter() to compute a circular moving average on the Meuse river grid data. For a small raster like this, the function is more than adequate. For large raster datasets, it’s quite slow though.

#make a  circular filter with 120m radius, and 40m resolution
cf<-make_circ_filter(120, 40)

#test it on the meuse grid data
f <- system.file("external/test.grd", package="raster")
r <- raster(f)

r_filt<-focal(r, w=cf, fun=mean, na.rm=T)

plot(r, main="Raw data") #original data
plot(r_filt, main="Circular moving window filter, 120m radius") #filtered data

Mapping georss data using R and ggmap

Readers might recall my earlier efforts at using R and python for geolocation and mapping of realtime fire and emergency incident data provided as rss feeds by the Victorian Country Fire Authority (CFA). My realisation that the CFA’s rss feeds are actually implemented using georss (i.e. they already contain locational data in the form of latitudes and longitudes for each incident), makes my crude implementation of a geolocation process in my earlier python program redundant, if not an interesting learning experience.

I provide here an quick R program for mapping current CFA fire and emergency incidents from the CFA’s georss, using the excellent ggmap package to render the underlying map, with map data from google maps.

Here’s the code:


#download and parse the georss data to obtain the incident locations:
cfapoints <- sapply(getNodeSet(cfaincidents, "//georss:point"), xmlValue)
cfacoords<-colsplit(cfapoints, " ", names=c("Latitude", "Longitude"))

#map the incidents onto a google map using ggmap

#download and parse the georss data to obtain the incident locations:
cfapoints <- sapply(getNodeSet(cfaincidents, "//georss:point"), xmlValue)
cfacoords<-colsplit(cfapoints, " ", names=c("Latitude", "Longitude"))

#map the incidents onto a google map using ggmap
png("map.png", width=700, height=700)
timestring<-format(Sys.time(), "%d %B %Y, %H:%m" )
titlestring<-paste("Current CFA incidents at", timestring)
map<-get_map(location = "Victoria, Australia", zoom=7, source="google", maptype="terrain")
ggmap(map, extent="device")+ 
  geom_point(data = cfacoords, aes(x = Longitude, y = Latitude), size = 4, pch=17, color="red")+

And here’s the resulting map, showing the locations of tonight’s incidents. Note that this is a snapshot of incidents at the time of writing, and should not be assumed to represent the locations of incidents at other times, or used for anything other than your own amusement or edification. The authoritative source of incident data is always the CFAs own website and rss feeds.

Using R for spatial sampling, with selection probabilities defined in a raster

The raster package for R provides a range of GIS-like functions for analysing spatial grid data. Together with package sp, and several other spatial analysis packages, R provide a quite comprehensive set of tools for manipulating and analysing spatial data.

I needed to randomly select some locations for field sampling, with inclusion probabilities based on values contained in a raster. The code below did the job very easily.


#an example raster from the raster package
f <- system.file("external/test.grd", package="raster")


#make a raster defining the desired inclusion probabilities 
#for the all locations available for sampling
#inclusion probability for cells with value >=400 
#will be 10 times that for cells with value <400
#normalise the probability raster by dividing 
#by the sum of all inclusion weights:
probrast<-probrast/sum(getValues(probrast), na.rm=T)

#confirm sum of probabilities is one
sum(getValues(probrast), na.rm=T)

#plot the raster of inclusion probabilities
plot(probrast, col=c(gray(0.7), gray(0.3)))

#a function to select N points on a raster, with 
#inclusion probabilities defined by the raster values.
probsel<-function(probrast, N){
  #set NA cells in raster to zero
  samp<-sample(nrow(probrast)*ncol(probrast), size=N, prob=x)
  samprast[samp]<-1 #set value of sampled squares to 1
  #convert to SpatialPoints
  points<-rasterToPoints(samprast, fun=function(x){x>0}) s

#select 300 sites using the inclusion probabilities 
#defined in probrast
samppoints<-probsel(probrast, 300)
plot(probrast, col=c(gray(0.7), gray(0.3)), axes=F)
plot(samppoints, add=T, pch=16, cex=0.8, col="red")

Here’s the result. Note the higher density of sampled points (red) within the parts of the raster with higher inclusion probability (dark grey).

Rough-and-ready geolocation using python and R

The good folks at GeoScience Australia provide a comprehensive set of Australian gazetteer data for free download from their website. Using R and python, I constructed a simple geolocation application to make use of this data. I used the data in the gazetteer to determine the geographic locations of incidents reported by the Country Fire Authority in the rss feed of current incidents provided on their website.

First, I used the sqlite database facilities provided in R, to construct a new sqlite database (gazetteer.db) containing the downloaded gazetteer data. This could just as easily have been done in python, but R served my purposes well:

#R code to read in the gazetteer data and build an sqlite database table for it.
gazdata<-read.csv("Gazetteer2010_txt.csv", header=FALSE)
names(gazdata)<-c("ID_num", "ID_code", "Authority_ID", "State_ID", "Name", "Feature_Code", "Status", "Postcode", "Concise_Gazetteer", "Longitude", "LongDeg", "LongMin", "LongSec", "Latitude", "LatDeg", "LatMin", "LatSec", "Map_100K", "CGDN", "Something")
system('sqlite3 gazetteer.db', wait=FALSE)
connect<-dbConnect(driver, dbname="gazetteer.db")
dbWriteTable(connect, "places", gazdata, overwrite=T, row.names=F, eol="\r\n")

Next, I wrote a python script to download the rss feed, extract the incident locations (both using the feedparser module for python), match the locations with the place names listed in the gazetteer database (using the sqlite3 module of python), and plot a map (in png format) of the incident locations, by calling R from python, using the rpy2 module:

#! /usr/bin/env python
import feedparser
import rpy2.robjects as robjects
from sqlite3 import *
from time import strftime

#download incident feed using feedparser module
NumInc=len(feed.entries) #number of incidents
updatetime=strftime("%a, %d %b %Y %H:%M", feed.updated)  #time feed was updated

#step through incidents and extract location
for i in range(NumInc):
	inc =inc.split(',')[0] #strips out just what is before the comma (usually town/locality)
	incidents[i] =inc.title() #make first letter of each word UC.

#connect to sqlite database of Australian place names

#run query and store lats and longs of incident locations...
lat=[""]*NumInc #storage for latitudes
long=[""]*NumInc #storage for longitudes
misses=0 #counter for incident locations not matched in db.
misslist=list() #list to store locations not found in db
#query location of each incident and find latitude and longitude of best-match location
query='select Latitude,Longitude from places where \
Name LIKE ? AND State_ID="VIC" AND \
(Feature_Code="RSTA" OR Feature_Code="POPL" OR Feature_Code="SUB" OR Feature_Code="URBN" OR Feature_Code="PRSH" OR Feature_Code="BLDG")'
for k in range(NumInc):
	t=('%'+incidents[k]+'%',) #match using "like" with wild cards for prefix/suffix of string
	curs.execute(query, t)
	if get is not None: #check if any rows returned (i.e. no matched to locations), only assign result if exists
		lat[k] = get[0]
	if get is None:
missstring='\n'.join(misslist) #convert list of unmatched locations to a string

#use Rpy2 module and R to plot a nice annotated map of locations to a png file
r = robjects.r
r.png("incident_map.png", width=800, height=600)
r.points(y=lat, x=long, col="red", pch=16)
r.text(y=lat, x=long, labels=incidents, adj=1.1, col="red", cex=0.85)
r.axis(2, at=r.seq(-34, -39),labels=r.seq(34, 39), las=1)
r.title(r.paste(NumInc+1, "CFA incidents @",updatetime))
r.text(x=148.5, y=-33.6, labels=r.paste(misses," unmapped incidents:"))
r.text(x=148.5, y=-34,labels=r.paste(missstring))

The script works nicely, although some incident locations aren’t found in the database due to spelling errors, unusual formatting, or omission of locations from the geoscience australia data. I included some code to list the unmatched locations beside the map, for easy reference.

Here’s a map of tonight’s incidents:

PyMC for Bayesian models

BUGS and JAGS have been the main tools I have used for fitting Bayesian statistical models for a long time now. Both have their strengths and weaknesses, but they are extremely useful tools, and I would anticipate that they will continue to develop their capabilities, and remain important components of my statistical toolbox for some time to come.

Recently, I’ve become aware of an alternative platform for Bayesian modelling, that has similar potential to BUGS and it’s dialects – PyMC. PyMC provides a framework for describing and fitting Bayesian statistical models using the Python programming language. Having read the paper describing the software, and consulted the user guide, I decided to have a try at building a simple linear regression model as a test, despite having very limited experience with python. I found that consulting the examples on the PyMC website, as well as the material presented in Abraham Flaxman’s blog very helpful for getting started, and for solving problems along the way.

I started by simulating some data from a very simple Gaussian linear model using R. I’m sure this could be easily done in Python, but for now R will be quicker and easier for me to code:

x<-round(runif(N, -20, 20))
y<-rnorm(N, 2*x+ 3, 10)
cat(x, sep=", ", fill=T)
cat(y, sep=", ", fill=T)

Running this code resulted in two nicely concatenated vectors of random x and y values generated from the (known) regression model y=\alpha+\beta x + \epsilon. These random values were easily transferred to the PyMC code for the Bayesian model using cut-and-paste – clumsy, but it works for me…..

Here is the python code for the ordinary linear model, with the randomly generated data (called YY and XX) pasted in. Vague normal priors were assumed for the slope and intercept parameters (\alpha and \beta), while the standard deviation of the random errors (\sigma) was assigned a Uniform prior:

## Regression
from pymc import *
from numpy import *

YY = array([-19.23776, 1.559197, 27.90364, -14.94222, -41.34614, 5.857922,  -26.24492, -1.670176, -8.349098, -24.91511, 63.86167, 20.87778, 4.176622, -35.65956, 4.482383, 36.20763, 33.60314, 23.25372, -15.52639, -25.59295, 42.48803, -29.46465, 30.25402, -5.66534, -20.92914, 44.87109, 19.07603, 22.19699, 18.89613, 2.835296, 12.68109, -17.19655, 26.60962, -28.74333, -24.69688,  -19.02279, -31.39471, -17.83819, 15.389, 40.41935, 0.972758, -36.49488,  -2.041068, 23.22597, 1.226252, 11.87125, 36.32597, 29.20536, 16.24043, -0.8978296])

XX = array([-14, -6, 19, -12, -16, 1, -15, -13, 0, -6, 15, 8, 1, -16, -5, 19, 8, 7, -11, -13, 13, -18, 10, -1, -13, 13, 13, 17, 13, 11, 4, -6, 14, -14, 3, -3, -18, -11, 6, 13, -10, -12, -2, 9, -7, -1, 14, 15, 6, -2])

sigma = Uniform('sigma', 0.0, 200.0, value=20)
alpha = Normal('alpha', 0.0, 0.001, value=0)
beta = Normal('beta', 0.0, 0.001, value=0)

def modelled_yy(XX=XX, beta=beta, alpha=alpha):
    return beta*XX + alpha

y = Normal('y', mu=modelled_yy, tau=1.0/sigma**2, value=YY, observed=True)

The python code for the model saved to a file named Generating an MCMC sample from the parameters of model was then just a matter of running the following code within a python shell:

from pylab import *
from pymc import *
import regress
M = MCMC(regress)
M.sample(10000, burn=5000)

The code also generates some summary plots (traces and histograms) for each of the parameters. So far so good – it looks like the inferred values for the parameters fairly closely match those that the random data were generated from:

I’ll move onto some more complex models soon, but so far PyMC looks quite promising as a tool for Bayesian modelling. Perhaps a useful strategy for learning will be to construct a variety models of increasing complexity, with a focus on the types of models I use for my research.