Forward Vegas 2011

January 8th 2012

Once again Forward took all its staff to Las Vegas for the Christmas party, cheers Neil!

We stayed at the Wynn again.
IMAG0171.jpg

Beautiful view from my room
IMAG0163.jpg

On the first day we went karting.
IMAG0162.jpg

The final day I went to do a skyjump off the Stratosphere (have a DVD I’ll upload when I find it)
IMAG0166.jpg
IMAG0170.jpg

In between, some drinking and gambling…

Random head coming out of the lake in the Wynn
IMAG0172.jpg

Then home!

Posted by tom under travel | No Comments »

Day 700 of 101 goals in 1001 days

December 22nd 2011

This update is late as I have been busy.

84 – Revisit Met Museum
Went while I was at Hadoop World (pic is actually the Natural History museum but what the hell)
IMAG0110
During my recent Africa Trip:
24 – Swim with sharks
IMG_7285.JPG
33 – Safari
Visited Addo Elephant Park during my recent South Africa Trip
DSC00594.JPG
34 – Vineyard tour
I visited the Boekenhoutskloof vineyard in Franschhoek, South Africa
DSC00659.JPG
I’ll do a full write-up of the Africa Trip soon, it was amazing.

So now out of the 101 I have done 24 with 18 on track and only 300 days to go! Need to get a shift on and tick off as much as I can.

The dayzeroproject site is back up, I am on there as thattommyhall

Posted by tom under 101 | 2 Comments »

Reith Lecture by Aung San Suu Kyi

June 28th 2011

Today I woke up to Radio 4 as usual and was surprised to hear this year’s Reith Lecture by Aung San Suu Kyi on Securing Freedom. Very interesting, looking forward to hearing the remainder.

The archive is available and I thought I would share my highlights:

Posted by tom under random | No Comments »

Walking The Great Glen Way

June 26th 2011

Over Easter, while we had all the extra days off because some chinless wonder married a model in an old church in London, I went with two of my best friends and walked the 73 miles from Inverness to Fort William along the Caledonian Canal.


(picture from Wikipedia)

We did it ultralight, using kit I have blogged about before. My mate Ben got well into expedition planning mode, prepared an optimal food mix for the trip and introduced us to SCROGIN (Sultanas Chocolate Raisins Orange Ginger Imagination Nuts) and ANZAC biscuits (his lovely other half is a kiwi).
IMAG0005.jpg
IMAG0006.jpg
I was pleased to fit it all in a 30L sack, made the walking much easier than it might have been.

As I had been in Lisbon for a stag do the weekend before, I was not feeling 100% when we got the sleeper to Inverness on the Monday night, but we arrived somewhat fresh and started walking immediately. The sleeper is really nice and I would definitely recommend it over flying if you need an early start in Scotland, see ScotRail. By the end of Tuesday we had got most of the way to Invermoriston (nearly 30 miles) but were all exhausted. We wildcamped with some stunning views.
IMAG0007.jpg

On the Wednesday we walked to Fort Augustus and decided to take a B&B for the night as none of us had slept well and our legs and feet were killing. We were fortunate enough to stay at Old Pier House, which was lovely, and we got moving again on the Thursday with much more enthusiasm than we had ended the day before.

Thursday night we got past Laggan and camped at a campsite on the north of Loch Lochy.
IMAG0008.jpg

Friday was an epic day, taking in the 2 Munros (Meall na Teanga and Sròn a’ Choire Ghairbh), walking about 25 miles and then (we thought) finishing the walk.

We had actually just reached Neptune’s Staircase and we wound up bivvying at the start line of Maggies Monster Bike and Hike. We must have looked quite odd…

We spent the first few hours of the Saturday finishing it off and arriving at Fort William where we ate the biggest amount of food we could.

IMAG0009.jpg

A great hike with 2 great guys and, as it is a UK long distance path, it is another of my 101 goals in 1001 days ticked off.

Posted by tom under 101 & hiking & scotland | No Comments »

Berlin Buzzwords

June 9th 2011

I have just returned from Berlin Buzzwords. It was a great conference and well organised so thanks to the organisers.

As all the talks will be online soon I will just mention a few things that I enjoyed.

The two keynotes were excellent: Doug Cutting on the history of Hadoop and Ted Dunning on the future. Both were very interesting and had a great feel for the community aspect of Open Source software. Ted works for MapR Technologies but the talk was not a sales pitch. Ted spoke about how Hadoop currently fails to get the most out of the components and what we might get if we could. MapR are used by EMC for their new Hadoop distro; among other things I think they have reimplemented HDFS. An interesting number of companies had got some pretty big amounts of funding to build front-ends to Hadoop; DataMeer have an Excel-like web frontend that looks interesting.

Talks I enjoyed were:

NODE.JS FOR HEAVY I/O
A superb intro to Node.js, with an example small enough to fit on a slide but not completely trivial.

TIME SERIES OR CAUSAL ANALYSIS WITHOUT LIMITS!
Shivek was awesome, engaging and enthusiastic. The topic itself was fascinating, using Pi Calculus to reason about and design map/reduce algorithms. He made the point that most Hadoop jobs are data-centric but showed how to do some more maths-centric algorithms like FFTs.

OH LEONHARD, WHERE ART THOU?
Jim Webber on graph databases in general and Neo4J in particular. Quite a nice reference to Euler in the title. If your data is a graph, why not have a database that is too?

REALTIME BIG DATA AT FACEBOOK WITH HADOOP AND HBASE
From Jonathan Gray, this talk was really interesting – amazing the throughput they are getting from HBase. I think Forward are more like Facebook than Google (more freedom within teams, choice of tech/roll your own vs Google wanting everything on BigTable. I cringed a bit at the thought of loads of servers running random C++ apps all over the place though…)

NEWER DEVELOPMENTS IN LARGE DATA TECHNIQUES
Joseph Turian from MetaOptimise gave a great overview of recent academic work on Machine Learning and Natural Language Processing. Buzzwords to look out for are: Deep Learning, Semantic Hashing and Semantic Parsing. Also look at GraphLab, Machine Learning on graph databases.

DIGITISED DUTCH CULTURAL HERITAGE, MAHOUT & HADOOP
COMPOSING MAHOUT CLUSTERING JOBS
Two good talks on using Mahout. The first was on a Dutch government project, Images for the Future, to archive and categorise AV heritage resources. The second had a nice demo of categorising Stack Overflow.

Lightning Talks:
Eric Barton of Whamcloud talked about the Lustre filesystem, how his company is developing Lustre outside Sun/Oracle and where it could fit in with Hadoop. Lustre is the other end of the spectrum from HDFS/Hadoop: really quick but assuming fast, highly available storage behind it. I would love to see some integration with Lustre or Ceph in a Hadoop-like system.

I gave a talk on the Flume Firehose Abs and I made at Forward last week, it was OK (though I still think no-one has done a good job of selling ZeroMQ in 10 minutes!). Slides are here (I’ll do another post about it as well, quite an entertaining fallout from it over twitter.)

Posted by tom under conf & hadoop | 2 Comments »

Compressing Text Tables In Hive

June 1st 2011

At Forward we have been using Hive for a while and started out with the default table type (uncompressed text) and wanted to see if we could save some space and not lose too much performance.

The wiki page HiveCompressedStorage lists the possibilities.

Basically you have 3 decisions:
TextFile or SequenceFile tables
TextFile

  • Can be compressed in place.
  • Can gzip/bzip before you LOAD DATA into your table
  • Only gzip/bzip are supported
  • Gzip is not splittable

SequenceFile

  • Need to create a SequenceFile table and do a SELECT/INSERT into it
  • Can use any supported compression codec
  • All compression codecs are splittable. All the cool kids use LZO or Snappy
  • Does not work, at least for me (help appreciated!)

Which compression algorithm

  • gzip – Quite slow, good compression, not splittable, supported in TextFile table
  • bzip – Slowest, best compression, splittable, supported in TextFile table
  • LZO – Not in standard distro (licensing issues), fast, splittable
  • Snappy – New from Google, not in standard distro (but licence compatible), very fast

Block or Record compression (for SequenceFile tables)
The docs say

The value for io.seqfile.compression.type determines how the compression is performed. If you set it to RECORD you will get as many output files as the number of map/reduce jobs. If you set it to BLOCK, you will get as many output files as there were input files. There is a tradeoff involved here — large number of output files => more parellel map jobs => lower compression ratio.

But I got the same number of files regardless of what I selected and the total size suggested they were not even compressed, so I don’t know what is going on.

For simplicity I chose gzipped TextFile tables because

  • It worked (always criterion zero)
  • Most of our files were not huge anyway and the technique described below keeps some of the parallelism
  • Can be done on the table in place
  • Each partition can be compressed separately
  • The space is saved incrementally and realised immediately
  • Testing showed for our load it was not much of a performance hit
  • We are feeling more pain on space than query performance at the moment (our hourly runs complete in ~20mins)

require 'rubygems'
require 'date'
require 'rbhive'

countrys = %w[at au br de dk es fr in it jp mx nl no pl pt ru se uk us za]
dates = (Date.parse('2011-01-01')..Date.parse('2011-04-30'))

RBHive.connect('hiveserver') do |con|
  dates.each do |date|
    countrys.each do |country|
      query = "insert overwrite table keywords partition (dated='#{date}', country = '#{country}')
select account,campaign,ad_group,keyword_id,keyword,match_type,status,
first_page_bid,quality_score,distribution,max_cpc,destination_url,ad_group_status,
campaign_status,currency_code,impressions,clicks,ctr,cpc,
cost,avg_position,account_id,campaign_id,adgroup_id
from keywords where dated='#{date}' and country='#{country}'"
      begin
        con.set('mapred.output.compression.codec','org.apache.hadoop.io.compress.GzipCodec')
        con.set('hive.exec.compress.output','true')
        con.set('mapred.output.compress','true')
        con.set('mapred.compress.map.output','true')
        con.set('hive.merge.mapredfiles','true')
        con.set('hive.merge.mapfiles','true')
        con.execute(query)
      rescue => e
        puts "#########################"
        puts e.message
        puts "#########################"
      end
    end
  end
end


This will loop through the partitions (date/country) and do an INSERT OVERWRITE from/to that partition using our rbhive gem. This is good because Hive reads the old data via map/reduce jobs, writes the output to /tmp, deletes the old folder and then imports the new compressed version. You need to select the columns out explicitly as the target partition has two fewer fields (the dated and country partition columns are missing). As we had 2 levels of partitioning and lots of big files this ran within a day on a 2TB table, saving us around 5TB (replication factor is 3).
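
The savings figure only adds up once replication is taken into account. Here is a rough sketch of the arithmetic, assuming the 2TB is the logical (pre-replication) table size and the 5TB is raw disk saved across all replicas. Both readings are assumptions for illustration, not something measured separately.

```ruby
# Raw disk used by a table once HDFS replication is applied.
def raw_usage_tb(logical_tb, replication)
  logical_tb * replication
end

replication = 3        # replication factor quoted above
logical_before_tb = 2  # assumed: logical size of the table before compression
raw_saved_tb = 5       # assumed: raw disk saved, counted across all replicas

raw_before_tb = raw_usage_tb(logical_before_tb, replication)
raw_after_tb = raw_before_tb - raw_saved_tb
implied_ratio = raw_before_tb.to_f / raw_after_tb

puts "raw before=#{raw_before_tb}TB raw after=#{raw_after_tb}TB implied ratio=#{implied_ratio.round(1)}:1"
```

Under those assumptions the table went from roughly 6TB to 1TB of raw disk, which is in the range you would expect from gzip on repetitive CSV.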

You can actually download the data, compress it and put it back directly on HDFS, as Hive does not know what data is inside the folders on HDFS, just their layout, but I thought it better to do it via Hive and let Hadoop parallelise it. I would have carried on doing it this way but with other tables it was too slow (too many partitions, difficult to parallelise hive server). I stopped using rbhive, dropped to using hive -e to execute the queries and used the lovely autopartitioning in later Hive versions. Notice you can SELECT * now and it automatically does what it needs to in order to insert results into the correct partitions.

require 'rubygems'
require 'date'

countrys = %w[at au br de dk es fr in int it jp kr mx nl no pl pt ru se uk us za]

dates = (Date.parse('2010-12-02')..Date.parse('2011-05-01'))

dates.each do |date|
  query = ""
  query += "SET hive.exec.compress.output=true;"
  query += "SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;"
  query += "set mapred.job.priority=VERY_LOW;"
  query += "set hive.exec.dynamic.partition=true;"
  query += "set mapred.output.compress=true;"
  query += "set mapred.compress.map.output=true;"
  query += "set hive.merge.mapredfiles=true;"
  query += "set hive.merge.mapfiles=true;"
  query += "insert overwrite table hourly_clicks
partition (dated='#{date}', country, hour)
select * from hourly_clicks where dated='#{date}'"
  query = "hive -e \"#{query}\""
  puts "running #{query}"
  `#{query}`
end


The key difference is partition (dated='#{date}', country, hour): we have not specified a value for the country or hour partitions, so Hive will work them out automatically. This ran loads faster than looping over the partitions, letting Hive schedule lots more mapreduce jobs at once. If you set hive.exec.dynamic.partition.mode=nonstrict as well, you do not need to give a value for any of the partitions (I did this as a test but kept the WHERE clause, I was scared to do it all at once!)
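
To make the shape of the partition clause easier to see, here is a minimal sketch of a helper that builds the same kind of statement. The helper name and its arguments are hypothetical. The settings are a subset of those in the script above, and nothing here talks to Hive: it only builds a string.

```ruby
# Settings copied from the script above; the helper itself is hypothetical.
COMPRESSION_SETTINGS = {
  'hive.exec.compress.output' => 'true',
  'mapred.output.compression.codec' => 'org.apache.hadoop.io.compress.GzipCodec',
  'mapred.output.compress' => 'true',
  'hive.exec.dynamic.partition' => 'true'
}

# Build the statements that rewrite one slice of a table in place.
# static_column gets a fixed value; dynamic_columns are left for Hive to fill in.
def rewrite_partition_query(table, static_column, value, dynamic_columns)
  sets = COMPRESSION_SETTINGS.map { |key, setting| "SET #{key}=#{setting};" }
  partition_spec = (["#{static_column}='#{value}'"] + dynamic_columns).join(', ')
  insert = "INSERT OVERWRITE TABLE #{table} PARTITION (#{partition_spec}) " \
           "SELECT * FROM #{table} WHERE #{static_column}='#{value}'"
  (sets + [insert]).join(' ')
end

puts rewrite_partition_query('hourly_clicks', 'dated', '2011-01-01', %w[country hour])
```

Only the first partition column carries a value. The remaining names are listed bare, which is what marks them as dynamic.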

The reason I am not (very) worried about losing parallelism is that some of our partitions contained big .csv’s and the output of INSERT OVERWRITE was multiple .gz files (it looked to me like as many as there were mappers, for example a 700M text file became ~10 .gz files), so they will still be read in parallel by mappers as the original CSV was.
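
The point about several small .gz files can be illustrated outside Hadoop. This is a sketch using Ruby's standard zlib library with made-up rows. It only shows that each part is an independent gzip stream that can be read without the others, which is why separate mappers can each take one file. It says nothing about how Hive actually chooses the number of files.

```ruby
require 'zlib'
require 'stringio'

# Gzip each chunk of lines separately, the way each mapper writes its own .gz file.
def gzip_parts(lines, parts)
  per_part = (lines.size / parts.to_f).ceil
  lines.each_slice(per_part).map do |chunk|
    io = StringIO.new
    io.set_encoding('BINARY')
    gz = Zlib::GzipWriter.new(io)
    gz.write(chunk.join)
    gz.close
    io.string
  end
end

# Decompress one part on its own.
def gunzip(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

lines = (1..100).map { |i| "row_#{i},value_#{i}\n" } # hypothetical CSV rows
parts = gzip_parts(lines, 10)

# Any part can be decompressed without touching the others.
puts parts.size
puts gunzip(parts[3]).lines.first
```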

Open to suggestions about better ways to achieve this, this does not preclude doing something better later.

Posted by tom under hadoop & hive & Ruby | 3 Comments »

Finding information on Hive tables from HDFS

May 16th 2011

I was curious about our Hive tables’ total usage on HDFS and what the average file size was with the current partitioning scheme, so I wrote this Ruby script to calculate it.

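current = ''
file_count = 0
total_size = 0

output = File.open('output.csv','w')

# Write one CSV row (table,file_count,total_size,average_size) for the table just finished
emit = lambda do
  next if current == ''
  average_size = file_count == 0 ? 0 : total_size/file_count
  result = "#{current},#{file_count},#{total_size},#{average_size}"
  puts result
  output.puts result
end

IO.popen('hadoop fs -lsr /user/hive/warehouse').each_line do |line|
  split = line.split(/\s+/)
  #permissions,replication,user,group,size,mod_date,mod_time,path
  next unless split.size == 8
  path = split[7]
  size = split[4]
  permissions = split[0]
  tablename=path.split('/')[4]
  if tablename != current
    # the listing has moved on to a new table, so report the previous one
    emit.call
    total_size = 0
    current = tablename
    file_count = 0
  end
  # directories are listed too but should not count as files
  file_count += 1 unless permissions.start_with?('d')
  total_size += size.to_i
end
# the last table is never followed by a different one, so report it here
emit.call
output.close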

Lots of our files were small so I am going to experiment with different partitioning and compression schemes.
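
The parsing step can be tried without a cluster. This sketch pulls the line handling into a hypothetical helper and runs it on a made-up sample line. It assumes the eight-column layout the script relies on (permissions, replication, user, group, size, date, time, path), which may differ between Hadoop versions.

```ruby
# Parse one line of `hadoop fs -lsr` output, assuming eight whitespace-separated columns.
# Returns nil for lines that do not fit that layout.
def parse_lsr_line(line)
  fields = line.split(/\s+/)
  return nil unless fields.size == 8
  path = fields[7]
  {
    :table => path.split('/')[4], # /user/hive/warehouse/<table>/...
    :size => fields[4].to_i,
    :directory => fields[0].start_with?('d')
  }
end

# Hypothetical sample line; the path and numbers are made up.
sample = '-rw-r--r--   3 hive supergroup   1048576 2011-05-16 10:00 /user/hive/warehouse/keywords/dated=2011-01-01/part-00000'
p parse_lsr_line(sample)
```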

Posted by tom under hadoop & hive & Ruby | No Comments »

Running --repair on MongoDB via Upstart

May 13th 2011

One of our servers running MongoDB crashed today and we encountered the typical

old lock file: /var/lib/mongodb/mongod.lock. probably means unclean shutdown
recommend removing file and running --repair
see: http://dochub.mongodb.org/core/repair for more information

As the docs do not seem to have much of an alternative to running --repair, I looked for a way to automate it from Upstart. Mongo creates a mongod.lock file in the data directory containing the process’s PID and on a safe shutdown removes the PID, leaving the file there.

This Upstart script includes a pre-start script that checks if the lock file exists, reads it, makes sure there is a PID there, makes sure no mongod process exists with that PID, then performs the repair as the mongodb user.

limit nofile 20000 20000

kill timeout 300

env MONGO_DATA=/var/lib/mongodb/
env MONGO_LOGS=/var/log/mongodb/
env MONGO_EXE=/usr/bin/mongod
env MONGO_CONF=/etc/mongodb.conf

pre-start script
  mkdir -p $MONGO_DATA
  mkdir -p $MONGO_LOGS
  if [ -f $MONGO_DATA/mongod.lock ]; then
    mongo_pid=`cat $MONGO_DATA/mongod.lock`
    if [ -n "$mongo_pid" ]; then
      # grep -x matches the whole line, so PID 123 does not match a running 1234
      if ! pgrep mongo | grep -qx "$mongo_pid"; then
        rm $MONGO_DATA/mongod.lock
        sudo -u mongodb $MONGO_EXE --config $MONGO_CONF --repair
        touch $MONGO_DATA/repaired-`date "+%Y%m%d-%H%M%S"`
      fi
    fi
  fi
end script

start on runlevel [2345]
stop on runlevel [06]

script
  if [ -f /etc/default/mongodb ]; then . /etc/default/mongodb; fi
  exec start-stop-daemon --start --quiet --chuid mongodb --exec $MONGO_EXE -- --config $MONGO_CONF
end script
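
The decision the pre-start script makes can also be written out as a small Ruby sketch, which makes the three cases (no lock file, empty lock file, stale PID) easier to test. The function name is hypothetical, and the liveness check is passed in so the logic can be exercised without a real mongod. Unlike the pgrep pipeline, the example check below tests for any process with that PID, not specifically a mongo one.

```ruby
# Decide whether a repair is needed, mirroring the pre-start logic above.
# process_alive is a callable taking a PID and returning true or false.
def repair_needed?(lock_path, process_alive)
  return false unless File.file?(lock_path)
  pid = File.read(lock_path).strip
  return false if pid.empty?     # clean shutdown leaves an empty lock file
  !process_alive.call(pid.to_i)  # a PID with no live process suggests a crash
end

# One way to check for a live process: signal 0 tests existence without killing.
process_alive = lambda do |pid|
  begin
    Process.kill(0, pid)
    true
  rescue Errno::ESRCH
    false
  rescue Errno::EPERM
    true # the process exists but belongs to another user
  end
end

puts repair_needed?('/var/lib/mongodb/mongod.lock', process_alive)
```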

Posted by tom under devops & linux & mongodb | 1 Comment »

We are all DevOps

April 4th 2011

I gave a talk recently at the Forward Tech away day entitled We Are All DevOps and it went down quite well. Forward is an unusual environment, the devs are trusted to do lots of the typical sysadmin role and the boundary between Dev and Ops is very blurred. During my first few months in the search team I kept mindmapping stuff I wanted to talk about but only got round to making the slides the day before so it was a bit underprepared but I hope useful for people.

I borrowed ideas from John Leach’s excellent Ruby: Reinventing the Wheel talk and this DevOps: The War Is Over presentation, and rambled incoherently about a talk I had just seen at the UKUUG Spring Conference from the author of cfengine; see here for a nice description of the project (you can see how it has influenced Puppet).

Here are the slides (first time I have used Scribd, it is excellent. Much better than slideshare)
DevOps

I like the James White Manifesto, it chimes really strongly with me.

In particular

On Infrastructure
—————–
There is one system, not a collection of systems.
The desired state of the system should be a known quantity.
The “known quantity” must be machine parseable.
The actual state of the system must self-correct to the desired state.
The only authoritative source for the actual state of the system is the system.
The entire system must be deployable using source media and text files.

Soon they will post videos and I will get to see myself give a talk for the first time.

Posted by tom under devops | No Comments »

101 goals in 1001 days – Day 400 Update

February 27th 2011

Well, day 400 of my 101 goals in 1001 days was Feb 5th and I was in the midst of moving house so delayed doing this.

Completed – 16
1, Teetotalitarianism for 3 months
2, Cheeseless for 3 months
9, Read GEB
11, Reread all Dennett books
15, Proofread for Project Gutenberg
48, Create a Backblaze storage pod
53, Make Jam
66, Via Ferrata in Italy
78, Learn to use Emacs
I suppose you can never fully learn it but I do use it for my development now
82, Visit Egypt
83, Re-visit Louvre
85, Visit Pergamon Museum
86, Give Carrie a British Museum Tour
92, Read “The Ode Less Travelled”, do the exercises (but not share them!)
Read it while in Egypt.
97, Be 1/3 through in 2010
100, Set success criteria / progression metrics for each goal

On Track – 16
5, Lose 2 stone
10, Write book reviews for each book I read
Where I haven’t yet I have added a task to rememberthemilk to do so
13, Release 303 books on bookcrossing.com
88 available here, let me know if you like any and I will post them to you.
19, Blog on average once a week
50, Move 10 people to FreeAgent

68, Complete Pimsleur German
Changed from Spanish as I now live with a lovely German lady.
72, Read “Winning Ways”
read 1/2 of part 1 (of 4)
74, Read AI: A Modern Approach
75, Watch SICP, do exercises from book
Started a book club in work, seems to have stalled but I’ll start banging the drum again now I’ve settled in my new house.
76, Do on average 1 Project Euler problem per week
77, Complete “Real World Haskell”
88, Go to the theatre on average once a month
Way ahead on this, started a monthly theatre club but we managed to schedule a dozen things for the first few months of 2011
91, Memorise 10 poems
Not quite settled on the 10 but between listening to Jorge Luis Borges, This Craft Of Verse and The Ode Less Travelled I have quite a list to choose from.
95, Pay off all credit cards
96, Let loans run course and don’t get any more
101, Do 100 day updates

This is one right ;-)

Behind – 4
8, Read all the VSIs
12, Read all PG Wodehouse
81, Watch all TTC Art history DVDs
90, See all world heritage sites in the UK

Changing – 5
Lots of the work-related ones don’t make sense any more now that I have gone full time and moved into development, so I am making the following changes.
43, Visit the Rijksmuseum (was Get CCNP)
44, Visit The Uffizi in Florence (was Get CCEE)
45, Give blood every 20 weeks (was Get MCITP – Enterprise Admin)
46, Listen to Radio 4 / British Museum – A History of the World in 100 Objects and view each of them (was Get VCAP)
The above are all taken from a mate who just did his own 101 list.
47, Make a Munro bagging site in Rails (was Say to a recruiter “I don’t work ” and turn down work)

Planning – 12
60, Hike on average once a month
61, Do a UK long distance path
67, Do another alpine 4000m peak
62, Do a big hike in Europe
64, Climb a continental highest mountain
33, Safari
20, Organise a big bash for my 30th

The fitness aspect of these goals is where I am behind the most (though I am still a stone lighter than when I started) so I am concentrating the next six months on these goals, ending with summiting Kilimanjaro for my 30th then returning to a big party.
35, Visit 5 Michelin 3* restaurants
37, Visit Porto
Will go with Petra in the spring
84, Revisit Met Museum
A good mate has just moved to NYC so this should happen as soon as he is settled.
89, Return to the Theatre by the lake
My first trip with Petra was to here and we loved it. Will be going in the spring.
94, Go to Edinburgh festival
Will go at the beginning of August.

Not Started – 48
3, Do a marathon
4, Do a triathlon
6, Attend martial arts classes for 3 months
7, Write an article for Plus new writers
14, Read a short story for librivox
16, Send Dennett a letter
17, Send Dawkins a letter
18, Read Joyce
21, Read GTD
22, Spend 3 months in another country
23, Organise all my DVDs
24, Swim with sharks
25, Paraglide
26, Learn to play bongos
27, Skydive
28, Drive Offroad
29, Do a banger rally
30, Have a track day
31, Hire the whole of Salvos Salumeria for an evening
32, Bungee Jump
34, Vineyard tour
36, See Northern Lights
38, Take dad to an opera
39, Take Mum, Dad and Carrie to the Welsh Mountain Zoo
40, Do 1000 things in London
41, Do a standup comedy course
42, Visit Japan
49, Work only 100 days in a year
51, Investigate Visa situation for Australia
52, Investigate Visa situation for US
54, Grow mushrooms
55, Paint a water colour
56, Make beer
57, Make wine
58, Cook a 4 course meal for 20 friends
59, Do a photography course
63, Attend NIM
69, Learn to dance
70, Learn to play golf
71, Learn 10 magic tricks
73, Make a Dots and Boxes program
79, Raise £5005 for charity
80, Talk about Free Software at a school
87, Go on wine tasting course
93, Go to Melbourne Comedy Festival
99, Have a completion party
65, Volunteer for the Mountain Bothies Association
98, Have done 2/3 by day 666

I am quite heartened by the progress to be honest, considering that I spent half of last year working outside the UK, now things are settling down I should be able to churn through them faster.

If you want to join in on some, let me know!

Posted by tom under 101 | 3 Comments »
