Wednesday 6 February 2013

A new paradigm for spatial classes

I have a love-hate relationship with R. So many things annoy me - even things not in the R Inferno. But we can fix some of them.

For example, SpatialPolygonsDataFrames (and the other spatial data frame classes) aren't really data frames. They don't inherit from data.frame, but they almost behave like data frames. And when they don't, that's an annoyance, and you end up having to get the foo@data member which really is a data frame.

So why can't data frames have a spatial component? Partly because the columns of a data frame can only be vectors of atomic objects. You can't have a column of list objects. Or a column of model fits. Or a column of dates.. wait, yes you can...

> ds = seq(as.Date("1910/1/1"), as.Date("1999/1/1"), "years")

> df = data.frame(date=ds,i=1:90)
> head(df)
        date i
1 1910-01-01 1
2 1911-01-01 2
3 1912-01-01 3
4 1913-01-01 4
5 1914-01-01 5
6 1915-01-01 6

Dates like this are atomic. Strip away its class and a date object is just a number:

> unclass(ds[1])
[1] -21915

and its the S3 OO system that prints it nicely. So how can we store geometry in an atomic data item? We can use WKT. This is a text representation of points, polygons and so on.

So here's a function to take a SpatialPolygonsDataFrame and return a data frame with a geometry column. It adds a column called the_geom of class "geom", and attaches an attribute to the data frame so that we know where the geometry column is. Anyone who has used PostGIS will recognise this.


as.newsp.SpatialPolygonsDataFrame = function(spdf){
  require(rgeos)
  spdf$the_geom = writeWKT(spdf,byid=TRUE)
  class(spdf$the_geom) = c("geom")
  df = spdf@data
  attr(df,"the_geom") = "the_geom"
  class(df) = c("newsp","data.frame")
  df
}


Now if you convert a SPDF to a "newsp" object you get a data frame. With geometry. What can you do with it? Well, anything you can do with a data frame, because that's exactly what it is. You want to do spatial things with it? Ah, well, for now here's some code that gets the geometry columns and turns it back into a SPDF:


as.SpatialPolygonsDataFrame.newsp = function(nsp){
  geom_column= attr(nsp,"the_geom")
  geom = nsp[[geom_column]]
  class(nsp) = "data.frame"
  geom = lapply(geom,readWKT)
  glist = lapply(geom,function(p){p@polygons[[1]]})
  for(i in 1:length(glist)){
    glist[[i]]@ID=as.character(i)
  }
  SpatialPolygonsDataFrame(SpatialPolygons(glist),nsp,match.ID=FALSE)
}

This just does the reverse of the previous function. So you can write a plot method for these newsp objects:


plot.newsp = function(x,...){
  d = as.SpatialPolygonsDataFrame.newsp(x)
  plot(d,...)
}

So far, so pointless? Eventually if you developed this you'd write code to interpret the WKT (or better still, use WKB for compactness) and draw from that. This is just proof-of-concept.

These WKT strings can be very long, and that messes up printing of these data frames. In order to neaten this up, let's write some methods for the geom class to truncate the output:


as.character.geom = function(x,...){
  paste0(substr(x,1,10),"...")
}

format.geom = function(x,...){
  as.character.geom(x)
}

But now we hit a problem with subclassing any classed vector. Facepalm. R drops the class. 

> z=1:10
> class(z)="foo"
> z
 [1]  1  2  3  4  5  6  7  8  9 10
attr(,"class")
[1] "foo"
> z[3:10]
[1]  3  4  5  6  7  8  9 10

Annoying. But how does a vector of dates work around this? Well, it defines a "[" method for date vectors. We'll just copy that line for line:


"[.geom" = function (x, ..., drop = TRUE) 
{
    cl = oldClass(x)
    class(x) = NULL
    val = NextMethod("[")
    class(val) = cl
    val
}


and while we're at it, we need a print method for the geom class too:


print.geom = function(x,...){
  print(paste0(substr(x,1,10),"..."))
}

Now look what I can do (using the Columbus data from one of the spdep examples):

> cnew = as.newsp.SpatialPolygonsDataFrame(columbus[,1:4])
> head(cnew)
      AREA PERIMETER COLUMBUS_ COLUMBUS_I      the_geom
0 0.309441  2.440629         2          5 POLYGON ((...
1 0.259329  2.236939         3          1 POLYGON ((...
2 0.192468  2.187547         4          6 POLYGON ((...
3 0.083841  1.427635         5          2 POLYGON ((...
4 0.488888  2.997133         6          7 POLYGON ((...
5 0.283079  2.335634         7          8 POLYGON ((...

This is what kicked this all off. Someone couldn't understand why head(spdf) didn't do what he expected - in some cases:

> pp=data.frame(x=1:10,y=1:10,z=1:10)
> coordinates(pp)=~x+y
> head(pp)
Error in `[.data.frame`(x@data, i, j, ..., drop = FALSE) : 
  undefined columns selected

Try it. Understand it. Then tell me that spatial data frames that inherit from data.frame with the spatial data in a column aren't a better idea. Its the principle of least **BOO!** surprise. Of course it would take a lot of work to re-tool all the R spatial stuff to use this, so it's not going to happen.

Another possibility may be to drop data.frames and use data.tables instead... Anyway, just an afternoon hack when I realised you could put WKT in a column. 



No comments:

Post a Comment