Sunday 17 February 2013

An NHS Twitter Sentiment Map

One of the project proposals at the NHS Hack Day in Oxford was to do some kind of sentiment analysis using Twitter and NHS hashtags and ids. I had a brief word with the proposer, since I'd recently seen something similar done in R, but he was on another project.

But this weekend I thought I'd have a go at doing something. The main idea came from Jeffrey Breen's blog Mining Twitter for Airline Consumer Sentiment - there he writes some simple R functions that look for positive and negative words in tweets to give a score. I used his scoring code pretty much as-is.
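His scoring function boils down to something like this (a condensed sketch rather than his exact code; it assumes you've downloaded the Hu and Liu opinion lexicon files that he uses):

score.sentiment = function(sentences, pos.words, neg.words){
  sapply(sentences, function(sentence){
    # strip punctuation and digits, lower-case, split into words
    sentence = gsub("[[:punct:][:digit:]]", "", tolower(sentence))
    words = strsplit(sentence, "\\s+")[[1]]
    # score = positive word matches minus negative word matches
    sum(words %in% pos.words) - sum(words %in% neg.words)
  })
}

pos.words = scan("positive-words.txt", what="character", comment.char=";")
neg.words = scan("negative-words.txt", what="character", comment.char=";")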

I found a Twitter list of NHS accounts, which I quickly pulled out of Firefox. This could probably be done with the twitteR package in R, but I found it just as quick to save the rendered page from Firebug and parse the HTML with Python and BeautifulSoup. Make sure you've scrolled down to see all the members, then save the list from the HTML tab in Firebug. That gave me a parsable list of accounts with id, description, and image URL for good measure.

Then I could run the sentiment analysis, looking for tweets that mention the NHS accounts and computing the score. These would hopefully be tweets from people mentioning those accounts and expressing an opinion of the service.

There were some problems with escape characters (the RJSONIO package coped better than the rjson package) and other encodings, but I managed to work round them. Soon I had a table where I could order the NHS accounts by sentiment.
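The search-and-score loop was along these lines (a sketch with made-up names, and any API authentication setup left out):

library(twitteR)

scoreAccount = function(id, pos.words, neg.words){
  # recent tweets mentioning this account...
  tweets = searchTwitter(paste0("@", id), n=100)
  texts = sapply(tweets, function(tweet){ tweet$getText() })
  # ...scored and summarised
  scores = score.sentiment(texts, pos.words, neg.words)
  c(n=length(scores), score=mean(scores))
}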

But I'm a map geek. How could I map this? Most of the Twitter accounts were geographic, but some were individual trusts or hospitals, and some were entire PCTs, SHAs, or even the new CCGs. There were also a few nationwide NHS accounts.

So I adopted a manual approach to geocoding. First I generated a shapefile with all the accounts located in the North Sea (something like the sketch below). I loaded this into QGIS with an OpenStreetMap background layer, and then dragged each account around until it was roughly in place. I left the national accounts floating in the sea. 156 drags later, they all had a place.
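For the record, generating the dummy points looked roughly like this (the coordinates and file names are made up; accounts is the parsed list of accounts read into R):

library(sp)
library(rgdal)

# lay the accounts out on a small grid in the North Sea
n = nrow(accounts)
xy = cbind(1.5 + (seq_len(n) %% 13) * 0.05,
           55.0 + (seq_len(n) %/% 13) * 0.05)
pts = SpatialPointsDataFrame(xy, accounts,
        proj4string=CRS("+proj=longlat +datum=WGS84"))
writeOGR(pts, ".", "nhs_accounts", driver="ESRI Shapefile")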

Back in R, I put the sentiment scores and counts of the number of tweets back onto the geodata and saved a shapefile. That step went roughly like this (a sketch; the column and file names are assumptions, and scores is the table from the sentiment run):
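library(rgdal)

# read the dragged-into-place points back in
dragged = readOGR(".", "nhs_accounts_dragged")

# attach tweet counts and mean sentiment by account id
idx = match(dragged$id, scores$id)
dragged$n = scores$n[idx]
dragged$score = scores$score[idx]

writeOGR(dragged, ".", "nhs_sentiment", driver="ESRI Shapefile")

But how to distribute this? I could save it as a KML and people could use Google Earth, but I thought I'd give CartoDB a try. This is a cloud mapping service where you can upload shapefiles and make pretty maps out of them. With a bit of styling, I had it. Here it is, blue for positive sentiments and reds for negative ones (to the nearest whole number):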


Or use the direct link to the map.

Of course you need to take a lot of this with a pinch of salt. The locations are approximate. The sentiment scoring system is quite naive. The scores are based on fairly small numbers of tweets (click on a point to see). And the scores were a snapshot of when I ran the analysis. Nevertheless, it's a start. It would be interesting to automate this and see how sentiment changes over time, but I think it needs a lot more tweets to get something amenable to statistical analysis. Once I figure out a statistical distribution for these scores (the difference of two Poissons, i.e. a Skellam distribution, maybe) I could map the significance of the sentiment scores, which would take the small samples into account. Exeter may look red and angry, but that's from one single tweet!

Wednesday 6 February 2013

A new paradigm for spatial classes

I have a love-hate relationship with R. So many things annoy me - even things not in the R Inferno. But we can fix some of them.

For example, SpatialPolygonsDataFrames (and the other spatial data frame classes) aren't really data frames. They don't inherit from data.frame, but they almost behave like data frames. And when they don't, that's an annoyance, and you end up having to reach for the foo@data slot, which really is a data frame.
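A quick illustration, using the meuse points that ship with sp:

library(sp)
data(meuse)
coordinates(meuse) = ~x+y     # meuse is now a SpatialPointsDataFrame
is(meuse, "data.frame")       # FALSE - no inheritance
summary(meuse@data$zinc)      # so you reach inside the @data slot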

So why can't data frames have a spatial component? Partly because the columns of a data frame can only be vectors of atomic objects. You can't have a column of list objects. Or a column of model fits. Or a column of dates... wait, yes you can:

> ds = seq(as.Date("1910/1/1"), as.Date("1999/1/1"), "years")

> df = data.frame(date=ds,i=1:90)
> head(df)
        date i
1 1910-01-01 1
2 1911-01-01 2
3 1912-01-01 3
4 1913-01-01 4
5 1914-01-01 5
6 1915-01-01 6

Dates like this are atomic. Strip away the class and a date object is just a number:

> unclass(ds[1])
[1] -21915

and it's the S3 OO system that prints it nicely. So how can we store geometry in an atomic data item? We can use WKT (well-known text), a standard text representation of points, polygons and so on.
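If you haven't met WKT before, here's what it looks like for a small triangle (via rgeos):

library(sp)
library(rgeos)
p = SpatialPolygons(list(Polygons(list(Polygon(
      cbind(c(0,1,1,0), c(0,0,1,0)))), ID="1")))
writeWKT(p)
# gives something like "POLYGON ((0 0, 1 0, 1 1, 0 0))"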

So here's a function to take a SpatialPolygonsDataFrame and return a data frame with a geometry column. It adds a column called the_geom of class "geom", and attaches an attribute to the data frame so that we know where the geometry column is. Anyone who has used PostGIS will recognise this.


as.newsp.SpatialPolygonsDataFrame = function(spdf){
  require(rgeos)
  # serialise each polygon to a WKT string and store it in a column
  spdf$the_geom = writeWKT(spdf, byid=TRUE)
  class(spdf$the_geom) = c("geom")
  # keep just the attribute table, with a marker for the geometry column
  df = spdf@data
  attr(df, "the_geom") = "the_geom"
  class(df) = c("newsp", "data.frame")
  df
}


Now if you convert a SPDF to a "newsp" object you get a data frame. With geometry. What can you do with it? Well, anything you can do with a data frame, because that's exactly what it is. You want to do spatial things with it? Ah, well, for now here's some code that gets the geometry column and turns it back into a SPDF:


as.SpatialPolygonsDataFrame.newsp = function(nsp){
  require(rgeos)
  # find and extract the WKT geometry column
  geom_column = attr(nsp, "the_geom")
  geom = nsp[[geom_column]]
  class(nsp) = "data.frame"
  # parse each WKT string back into a SpatialPolygons object,
  # then pull out its Polygons slot and give it a unique ID
  geom = lapply(geom, readWKT)
  glist = lapply(geom, function(p){ p@polygons[[1]] })
  for(i in seq_along(glist)){
    glist[[i]]@ID = as.character(i)
  }
  SpatialPolygonsDataFrame(SpatialPolygons(glist), nsp, match.ID=FALSE)
}

This just does the reverse of the previous function. So you can write a plot method for these newsp objects:


plot.newsp = function(x, ...){
  # round-trip back to sp and use its plot method
  d = as.SpatialPolygonsDataFrame.newsp(x)
  plot(d, ...)
}

So far, so pointless? Eventually if you developed this you'd write code to interpret the WKT (or better still, use WKB for compactness) and draw from that. This is just proof-of-concept.

These WKT strings can be very long, and that messes up printing of these data frames. In order to neaten this up, let's write some methods for the geom class to truncate the output:


as.character.geom = function(x, ...){
  # show just the start of the (possibly very long) WKT string
  paste0(substr(x, 1, 10), "...")
}

format.geom = function(x, ...){
  as.character.geom(x)
}

But now we hit a problem with subsetting any classed vector. Facepalm. R drops the class:

> z=1:10
> class(z)="foo"
> z
 [1]  1  2  3  4  5  6  7  8  9 10
attr(,"class")
[1] "foo"
> z[3:10]
[1]  3  4  5  6  7  8  9 10

Annoying. But how does a vector of dates work around this? Well, it defines a "[" method for date vectors. We'll just copy that line for line:


"[.geom" = function (x, ..., drop = TRUE) 
{
    cl = oldClass(x)
    class(x) = NULL
    val = NextMethod("[")
    class(val) = cl
    val
}


and while we're at it, we need a print method for the geom class too:


print.geom = function(x,...){
  print(paste0(substr(x,1,10),"..."))
}

Now look what I can do (using the Columbus data from one of the spdep examples):

> cnew = as.newsp.SpatialPolygonsDataFrame(columbus[,1:4])
> head(cnew)
      AREA PERIMETER COLUMBUS_ COLUMBUS_I      the_geom
0 0.309441  2.440629         2          5 POLYGON ((...
1 0.259329  2.236939         3          1 POLYGON ((...
2 0.192468  2.187547         4          6 POLYGON ((...
3 0.083841  1.427635         5          2 POLYGON ((...
4 0.488888  2.997133         6          7 POLYGON ((...
5 0.283079  2.335634         7          8 POLYGON ((...

This is what kicked this all off. Someone couldn't understand why head(spdf) didn't do what he expected - in some cases:

> pp=data.frame(x=1:10,y=1:10,z=1:10)
> coordinates(pp)=~x+y
> head(pp)
Error in `[.data.frame`(x@data, i, j, ..., drop = FALSE) : 
  undefined columns selected

Try it. Understand it. (There's no head method for these spatial classes, so head.default calls pp[1:6]; sp treats a single index as a column selection, and pp doesn't have six columns.) Then tell me that spatial data frames that inherit from data.frame, with the spatial data in a column, aren't a better idea. It's the principle of least **BOO!** surprise. Of course it would take a lot of work to re-tool all the R spatial stuff to use this, so it's not going to happen.

Another possibility may be to drop data.frames and use data.tables instead... Anyway, just an afternoon hack when I realised you could put WKT in a column.