Archive for June, 2011

Reith Lecture by Aung San Suu Kyi

June 28th 2011

Today I woke up to Radio4 as usual and was surprised to hear this years Reith Lecture by Aung San Suu Kyi on Securing Freedom. Very interesting, looking forward to hearing the remainder

The archive is available and I thought I would share my highlights:

Posted by tom under random | No Comments »

Walking The Great Glen Way

June 26th 2011

Over Easter, while we had all the extra days off because some chinless wonder married a model in an old church in London I went with two of my best friends and walked the 73 miles from Inverness to Fort William along the Caledonian Canal.


(picture from Wikipedia)

We did it ultralight, using kit I have blogged about before. My mate Ben got well into expedition planning mode and prepared an optimal food mix for the trip and introduced us to SCROGIN (Sultanas Chocolate Raisons Orange Ginger Imagination Nuts) and ANZAC biscuits (his lovely other half is a kiwi).
IMAG0005.jpgIMAG0006.jpg
I was pleased to fit it all in a 30L sack, made the walking much easier than it might have been.

As I had just been in Lisbon for a stag do the weekend before I was not feeling 100% when we got the sleeper to Inverness on the Monday night but we arrived somewhat fresh and started walking immediatly. The sleeper is really nice and I would deffinatly recommend it over flying if you need an early start in Scotland, see ScotRail. By the end of Tuesday we had got most of the way to Invermoriston (nearly 30 miles) but were all exhausted. We wildcamped with some stunning views.
IMAG0007.jpg

The Wednesday we walked to Fort Augustus and decided to take a B&B for the night as non of us had slept well and our legs and feet were killing. We were fortunate enough to stay at Old Pier House which was lovely and we got moving again on the Thursday with much more enthusiasm than we ended the day before.

Thursday night we got past laggan and camped at a campsite on the north of Loch Lochy.
IMAG0008.jpg

Friday was an epic day, taking in the 2 munros ( Meall na Teanga and Sròn a’ Choire Ghairbh and walking about 25 miles then (we thought) finishing the walk.

We had actually just reached Neptune’s Staircase and we wound up bivvying at the start line of Maggies Monster Bike and Hike. We must have looked quite odd…

We spent the first few hours of the Saturday finishing it off and arriving at Fort William where we ate the biggest amount of food we could.

IMAG0009.jpg

A great hike with 2 great guys and as it is a UK long distance path it is another of my 101 goals in 1001 days days ticked off

Posted by tom under 101 & hiking & scotland | No Comments »

Berlin Buzzwords

June 9th 2011

I have just returned from Berlin Buzzwords. It was a great conference and well organised so thanks to the organisers.

As all the talks will be online soon I will just mention a few things that I enjoyed.

The two keynotes were excellent, Doug Cutting on the history of Hadoop and Ted Dunning on the future. Both were very interesting and had a great feel for the community aspect of Open Source software. Ted works for MapR technologies but the talk was not a sales pitch. Ted spoke about how Hadoop fails currently to get the most out of the components and what we might get if we could. MapR are used by EMC for their new Hadoop distro, among other things I think they have reimplemented HDFS. An interesting number of companies had got some pretty big amounts of funding to build front-ends to Hadoop, DataMeer have an excel-like web frontend that looks interesting.

Talks I enjoyed were:

NODE.JS FOR HEAVY I/O
A superb intro to Node.js, with an example small enough to fit on a slide but not completely trivial.

TIME SERIES OR CAUSAL ANALYSIS WITHOUT LIMITS!
Shivek was awesome, engaging and enthusiastic. The topic itself was fascinating, using
Pi Calculus to reason about and design map/reduce algorithms. He made the point that most Hadoop jobs are datacentric but showed how to do some more mathscentric algorithms like FFTs

OH LEONHARD, WHERE ART THOU?
Jim Webber on graph databases in general and Neo4J in particular. Quite a nice reference to Euler in the title. If your data is a graph, why not have a database that is too?

REALTIME BIG DATA AT FACEBOOK WITH HADOOP AND HBASE
From Jonathan Gray, this talk was really interesting – amazing the throughput they are getting from HBase. I think Forward are more like Facebook than Google (more freedom within teams, choice of tech/roll your own vs Google wanting everything on BigTable. I cringed a bit at the thought of loads of servers running random C++ apps all over the place though…)

NEWER DEVELOPMENTS IN LARGE DATA TECHNIQUES
Joseph Turian from MetaOptimise gave a great overview of recent academic work on Machine Learning and Natural LAnguage Processing, buzzwords to look out for are: Deep Learning, Semantic Hashing and Semantic Parsing. Also look at GraphLab, Machine Learning on graph databases

DIGITISED DUTCH CULTURAL HERITAGE, MAHOUT & HADOOP
COMPOSING MAHOUT CLUSTERING JOBS
Two good talks on using Mahout, the first is on a Dutch Gov project, Images for the future to archive and categorise AV heritage resources. The second had a nice demo of categorising stack-overflow.

Lightning Talks:
The Lustre filesystem from Eric Barton of Whamcloud talked about how his company are developing Lustre outside Sun/Oracle and he was trying to see where it could fit in with Hadoop. Luster is the other end of the spectrum from HDFS/Hadoop, really quick but assuming fast, highly available storage behind it. I would love to see some integration with Lustre or Ceph in a Hadoop-like system.

I gave a talk on the Flume Firehose Abs and I made at Forward last week, it was OK (though I still think no-one has done a good job of selling ZeroMQ in 10 minutes!). Slides are here (I’ll do another post about it as well, quite an entertaining fallout from it over twitter.)

Posted by tom under conf & hadoop | 2 Comments »

Compressing Text Tables In Hive

June 1st 2011

At Forward we have been using Hive for a while and started out with the default table type (uncompressed text) and wanted to see if we could save some space and not lose too much performance.

The wiki page HiveCompressedStorage lists the possibilities.

Basically you have 3 decisions:
TextFile or SequenceFile tables
TextFile

  • Can be compressed in place.
  • Can gzip/bzip before you LOAD DATA into your table
  • Only gzip/bzip are supported
  • Gzip is not splitable

SequenceFile

  • Need to create a SequenceFile table and do a SELECT/INSERT into it
  • Can use any supported compression codec
  • All compression codecs are splitable. All the cool kids use LZO or Snappy
  • Does not work- At least for me (help appreciated!)

Which compression algorithm

  • gzip – Quite slow, good compression, not splitable, supported in TextFile table
  • bzip – Slowest, best compression, splitable, supported in TextFile table
  • LZO – Not in standard distro (licensing issues), fast, splitable
  • Snappy – New from google, Not in standard distro (but licence compatable), Very fast

Block or Record compression (for SequenceFile tables)
The docs say

The value for io.seqfile.compression.type determines how the compression is performed. If you set it to RECORD you will get as many output files as the number of map/reduce jobs. If you set it to BLOCK, you will get as many output files as there were input files. There is a tradeoff involved here — large number of output files => more parellel map jobs => lower compression ratio.

But I got the same number of files regardless of what I selected and the total size suggested they were not even compressed so I dont know what is going on.

For simplicity I chose gziped TextFile tables because

  • It worked (always criteria zero)
  • Most of our files were not huge anyway and the technique described below keeps some of the parallelism
  • Can be done on the table in place
  • Each partition can be compressed separately
  • The space is saved incrementally and realised immediately
  • Testing showed for our load it was not much of a performance hit
  • We are feeling more pain on space than query performance at the moment, our hourly runs complete in ~20mins)

require 'rubygems'
require 'date'
require 'rbhive'

countrys = %w[at au br de dk es fr in it jp mx nl no pl pt ru se uk us za]
dates = (Date.parse('2011-01-01')..Date.parse('2011-04-30'))

RBHive.connect('hiveserver') do |con|
  dates.each do |date|
    countrys.each do |country|
      query = "insert overwrite table keywords partition (dated='#{date}', country = '#{country}')
select account,campaign,ad_group,keyword_id,keyword,match_type,status,
first_page_bid,quality_score,distribution,max_cpc,destination_url,ad_group_status,
campaign_status,currency_code,impressions,clicks,ctr,cpc,
cost,avg_position,account_id,campaign_id,adgroup_id
from keywords where dated='#{date}' and country='#{country}'"
      begin
        con.set('mapred.output.compression.codec','org.apache.hadoop.io.compress.GzipCodec')
        con.set('hive.exec.compress.output','true')
        con.set('mapred.output.compress','true')
        con.set('mapred.compress.map.output','true')
        con.set('hive.merge.mapredfiles','true')
        con.set('hive.merge.mapfiles','true')
        con.execute(query)
      rescue => e
        puts "#########################"
        puts e.message
        puts "#########################"
      end
    end
  end
end


This will loop through the partitions (date/country) and do an INSERT OVERWRITE from/to that partition using our rbhive gem. This is good because Hive reads the old data via map/reduce jobs, writes the output to /tmp, deletes the old folder and then imports the new compressed version. You need to select the columns out as the target partition has 2 less fields (date and country are missing) As we had 2 levels of partitioning and lots of big files this ran within a day on a 2Tb table, saving us around 5Tb (replication factor is 3).

You can actually download and compress the data directly to HDFS as Hive does not know what data is inside the folders on HDFS, just their layout but I thought better to do it via hive and let Hadoop parallelise it. I would have carried on doing it this way but with other tables it was too slow (too many partitions, difficult to parallelise hive server). I stopped using rbhive, dropped to using hive -e to execute the querys and used the lovely autopartitioning in later hive versions. Notice you can SELECT * now and it automatically does what it needs to to insert results into the correct partitions.

require 'rubygems'
require 'date'

countrys = %w[at au br de dk es fr in int it jp kr mx nl no pl pt ru se uk us za]

dates = (Date.parse('2010-12-02')..Date.parse('2011-05-01'))

dates.each do |date|
  query = ""
  query += "SET hive.exec.compress.output=true;"
  query += "SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;"
  query += "set mapred.job.priority=VERY_LOW;"
  query += "set hive.exec.dynamic.partition=true;"
  query += "set mapred.output.compress=true;"
  query += "set mapred.compress.map.output=true;"
  query += "set hive.merge.mapredfiles=true;"
  query += "set hive.merge.mapfiles=true;"
  query += "insert overwrite table hourly_clicks
partition (dated='#{date}', country, hour)
select * from hourly_clicks where dated='#{date}'"
  query = "hive -e \"#{query}\""
  puts "running #{query}"
  `#{query}`
end


The key difference is partition (dated=’#{date}’, country, hour) , we have not specified a country or hour partition so hive will do it automatically. This ran loads faster than looping over the partitions, letting hive schedule lots more mapreduce jobs at once. If you set hive.exec.dynamic.partition.mode=nonstrict as well you can not specify any partition information (I did this as a test but kept the WHERE clause, I was scared to do it all at once!)

The reason I am not (very) worried about losing parallelism is that some of our partition contained big .csv’s and the output of INSERT OVERWRITE was multiple .gz files (looked to me like as many as there were mappers, for example a 700M text file became ~10 .gz files) so they will still be read in parallel by mappers as the original CSV was.

Open to suggestions about better ways to achieve this, this does not preclude doing something better later.

Posted by tom under hadoop & hive & Ruby | 3 Comments »