Using R on Hadoop with Rhipe

I spent a while this week getting Rhipe, an R package (with a Java backend) that integrates the R environment with Hadoop, to work. Forward are pretty heavy users of Hadoop and its supporting ecosystem, so R will be another way for the devs to interrogate the huge (and rapidly growing!) datasets we have.

Installing R

Adding the repository

Create a new file at /etc/apt/sources.list.d/R.list

#R repository
deb http://rh-mirror.linux.iastate.edu/CRAN/bin/linux/ubuntu hardy/

(we are still using hardy, with the Cloudera packages)
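If you would rather script this step, something like the following should do. It is only a sketch: write_cran_source is a made-up helper name, and it writes to whatever path you hand it, so you can try it on a scratch file before pointing it at the real list file as root.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): write the CRAN apt source
# entry for a given Ubuntu release into a given list file.
write_cran_source() {
    list_file="$1"   # path of the list file to create (the real one needs root)
    release="$2"     # Ubuntu release name, e.g. hardy
    {
        echo "#R repository"
        echo "deb http://rh-mirror.linux.iastate.edu/CRAN/bin/linux/ubuntu ${release}/"
    } > "$list_file"
}

# Try it on a scratch file first and look at the result.
scratch="$(mktemp)"
write_cran_source "$scratch" hardy
cat "$scratch"
rm -f "$scratch"
```

Swap hardy for your own release name if you are on something newer.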

Add the GPG key for the repository

gpg --keyserver pgp.mit.edu --recv-key E2A11821
gpg -a --export E2A11821 | sudo apt-key add -

Install and update R

Easy:

$ sudo apt-get update
$ sudo apt-get install r-base r-base-dev pkg-config littler
$ sudo R
> update.packages()

Set environment variables for Rhipe

Add to bottom of /etc/environment

HADOOP=/usr

Create it for the current session

$ export HADOOP=/usr
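Before going further it is worth checking that the variable points somewhere sensible. This is a rough sketch rather than anything from the Rhipe docs: check_hadoop_home is a made-up helper, and it assumes Rhipe expects the launcher at $HADOOP/bin/hadoop, which is what HADOOP=/usr would imply if the Cloudera packages put hadoop in /usr/bin.

```shell
#!/bin/sh
# Hypothetical helper: succeed if DIR/bin/hadoop exists and is executable.
# Assumption: Rhipe locates Hadoop via $HADOOP/bin/hadoop.
check_hadoop_home() {
    dir="$1"
    if [ -x "${dir}/bin/hadoop" ]; then
        echo "ok: ${dir}/bin/hadoop"
    else
        echo "missing: ${dir}/bin/hadoop" >&2
        return 1
    fi
}

# Report on the current session without aborting the shell.
check_hadoop_home "${HADOOP:-/usr}" || echo "HADOOP does not look right"
```

If the check fails, find out where your packages put the hadoop launcher and set HADOOP to the directory one level above its bin directory.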

Install protobuf

# wget http://protobuf.googlecode.com/files/protobuf-2.3.0.tar.bz2
# tar jxf protobuf-2.3.0.tar.bz2
# cd protobuf-2.3.0
# ./configure
# make
# make install
# ldconfig
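A quick sanity check after the build does no harm. The sketch below assumes the Rhipe build finds protobuf through pkg-config (my guess, since pkg-config was installed earlier); report_protobuf is a made-up name. It only reports what it finds and changes nothing.

```shell
#!/bin/sh
# Hypothetical helper: report whether protoc and the protobuf pkg-config
# entry are visible. Prints one line for each check.
report_protobuf() {
    if command -v protoc >/dev/null 2>&1; then
        protoc --version
    else
        echo "protoc not found on PATH"
    fi
    if pkg-config --exists protobuf 2>/dev/null; then
        echo "pkg-config sees protobuf $(pkg-config --modversion protobuf)"
    else
        echo "pkg-config cannot find protobuf"
    fi
}

report_protobuf
```

If pkg-config cannot find protobuf after a successful make install, the directory holding protobuf.pc is probably missing from the pkg-config search path.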

Install Rhipe

# wget http://www.stat.purdue.edu/~sguha/rhipe/dn/Rhipe_0.64.tar.gz
# R CMD INSTALL Rhipe_0.64.tar.gz

So all is well except that the test code here is a bit off.

For me, today,

> library(Rhipe)

only works as root. It seems that

> rhwrite(list(1,2,3),"/tmp/x")

should be:

> rhwrite(list(1,2,3),"/tmp/x",1)

then

> rhread("/tmp/x")

works properly.

Also in the longer example

map <- expression({
  lapply(seq_along(map.values),function(r){
    x <- runif(map.values[[r]])
    rhcollect(map.keys[[r]],c(n=map.values[[r]],mean=mean(x),sd=sd(x)))
  })
})

## Create a job object
z <- rhmr(map, ofolder="/tmp/test", inout=c('lapply','sequence'),
          N=10,mapred=list(mapred.reduce.tasks=0),jobname='test')

## Submit the job
rhex(z)

## Read the results
res <- rhread('/tmp/test/p*')
colres  <- do.call('rbind', lapply(res,"[[",2))

colres
       n      mean        sd
 [1,]  1 0.4983786        NA
 [2,]  2 0.7683017 0.2937688
 [3,]  3 0.5936899 0.3425441
 [4,]  4 0.3699087 0.2666379
 [5,]  5 0.5179839 0.4060244
 [6,]  6 0.6278925 0.2952608
 [7,]  7 0.4920088 0.2785893
 [8,]  8 0.4592598 0.2674592
 [9,]  9 0.5734197 0.1928496
[10,] 10 0.4942676 0.2989538

where line 16 (the rhread call) has been changed from the original:

res <- rhread('/tmp/test')

Thanks to Saptarshi Guha, the author of Rhipe, for responding so quickly to my query in the group, and also to the authors of this discussion on setting up R in Ubuntu.