Everything is a Ghetto

While reading this controversial link bait, consider buying my product/service

Compressing Text Tables in Hive

At Forward we have been using Hive for a while. We started out with the default table type (uncompressed text) and wanted to see if we could save some space without losing too much performance.

The wiki page HiveCompressedStorage lists the possibilities.

Basically you have 3 decisions:

TextFile or SequenceFile tables

TextFile

  • Can be compressed in place.

  • Can gzip/bzip before you LOAD DATA into your table

  • Only gzip/bzip are supported

  • Gzip is not splittable

SequenceFile

  • Need to create a SequenceFile table and do a SELECT/INSERT into it (sketched after this list)

  • Can use any supported compression codec

  • All compression codecs are splittable. All the cool kids use LZO or Snappy

  • Did not work, at least for me (help appreciated!)
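For reference, creating and filling a SequenceFile table goes roughly like this (table and column names are made up, and the settings are the standard ones from the Hive wiki; this is the part that did not work for me):

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE clicks_seq (host STRING, path STRING) STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE clicks_seq SELECT host, path FROM clicks;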

Which compression algorithm

  • gzip - Quite slow, good compression, not splittable, supported in TextFile tables

  • bzip - Slowest, best compression, splittable, supported in TextFile tables

  • LZO - Not in standard distro (licensing issues), fast, splitable

  • Snappy - New from Google, not in the standard distro (but the licence is compatible), very fast

Block or Record compression (for SequenceFile tables)

The docs say:

The value for io.seqfile.compression.type determines how the compression is performed. If you set it to RECORD you will get as many output files as the number of map/reduce jobs. If you set it to BLOCK, you will get as many output files as there were input files. There is a tradeoff involved here – large number of output files => more parallel map jobs => lower compression ratio.

But I got the same number of files regardless of what I selected, and the total size suggested they were not even compressed, so I don't know what is going on.

For simplicity I chose gzipped TextFile tables because:

  • It worked (always criterion zero)

  • Most of our files were not huge anyway and the technique described below keeps some of the parallelism

  • Can be done on the table in place

  • Each partition can be compressed separately

  • The space is saved incrementally and realised immediately

  • Testing showed that for our load it was not much of a performance hit

  • We are feeling more pain on space than on query performance at the moment (our hourly runs complete in ~20 mins)

The Ruby sketch below loops through the partitions (date/country) and does an INSERT OVERWRITE from/to each partition using our rbhive gem. This is good because Hive reads the old data via map/reduce jobs, writes the output to /tmp, deletes the old folder and then imports the new compressed version. You need to select the columns out explicitly, as the target partition has two fewer fields (date and country are missing). As we had two levels of partitioning and lots of big files, this ran within a day on a 2TB table, saving us around 5TB (our replication factor is 3).
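A minimal sketch, with made-up table, column and host names (the SET lines are the standard Hive output-compression options, and the rbhive calls follow its README usage):

require 'rbhive'

dates     = %w[2011-02-01 2011-02-02]  # hypothetical partition values
countries = %w[uk de]
columns   = 'host, path, referrer'     # everything except the partition fields

RBHive.connect('hive-server', 10_000) do |connection|
  connection.execute 'SET hive.exec.compress.output=true'
  connection.execute 'SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec'
  dates.each do |date|
    countries.each do |country|
      # rewrite one partition over itself, gzipping the output
      connection.execute <<-HIVE
        INSERT OVERWRITE TABLE clicks PARTITION (dated='#{date}', country='#{country}')
        SELECT #{columns} FROM clicks
        WHERE dated='#{date}' AND country='#{country}'
      HIVE
    end
  end
end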

You could actually compress the data directly on HDFS, as Hive does not know what data is inside the folders, just their layout, but I thought it better to do it via Hive and let Hadoop parallelise it. I would have carried on doing it this way, but with other tables it was too slow (too many partitions, difficult to parallelise the Hive server). I stopped using rbhive, dropped to using hive -e to execute the queries, and used the lovely dynamic partitioning in later Hive versions. Notice you can SELECT * now and it automatically does what it needs to do to insert results into the correct partitions.
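Sketched roughly (table and column names are made up again; each date's query is shelled out to hive -e):

dates = %w[2011-02-01 2011-02-02]  # hypothetical partition values

dates.each do |date|
  query = <<-HIVE
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    SET hive.exec.dynamic.partition=true;
    -- country and hour are left dynamic; depending on your Hive version you
    -- may need to list the columns instead of SELECT *
    INSERT OVERWRITE TABLE clicks PARTITION (dated='#{date}', country, hour)
    SELECT * FROM clicks WHERE dated='#{date}';
  HIVE
  system('hive', '-e', query)
end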

The key difference is partition (dated='#{date}', country, hour): we have not specified a country or hour partition, so Hive will create them automatically. This ran loads faster than looping over the partitions, letting Hive schedule lots more mapreduce jobs at once. If you set hive.exec.dynamic.partition.mode=nonstrict as well, you need not specify any partition information at all (I did this as a test but kept the WHERE clause, I was scared to do it all at once!)

The reason I am not (very) worried about losing parallelism is that some of our partitions contained big .csv files, and the output of INSERT OVERWRITE was multiple .gz files (it looked to me like as many as there were mappers; for example a 700M text file became ~10 .gz files), so they will still be read in parallel by mappers, as the original CSVs were.

Open to suggestions about better ways to achieve this; it does not preclude doing something better later.

Finding Information on Hive Tables From HDFS

I was curious about our Hive tables' total usage on HDFS and what the average file size was with the current partitioning scheme, so I wrote a Ruby script to calculate it.
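A sketch of the idea (the warehouse path is the Hive default, and the column positions assume the classic hadoop fs -lsr output format):

#!/usr/bin/env ruby
# Walk the Hive warehouse and report per-table file counts and sizes.
WAREHOUSE = '/user/hive/warehouse'

sizes = Hash.new { |h, k| h[k] = [] }
`hadoop fs -lsr #{WAREHOUSE}`.each_line do |line|
  fields = line.split
  next if fields.first.start_with?('d')  # skip directories
  table = fields.last.sub("#{WAREHOUSE}/", '').split('/').first
  sizes[table] << fields[4].to_i         # 5th column of lsr output is bytes
end

sizes.each do |table, files|
  total = files.reduce(:+)
  puts "#{table}: #{files.size} files, #{total} bytes, avg #{total / files.size}"
end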

Lots of our files were small so I am going to experiment with different partitioning and compression schemes.

Running --repair on MongoDB via Upstart

One of our servers running MongoDB crashed today and we encountered the typical

old lock file: /var/lib/mongodb/mongod.lock. probably means unclean shutdown recommend removing file and running --repair see: http://dochub.mongodb.org/core/repair for more information

As the docs do not seem to offer much of an alternative to running --repair, I looked for a way to automate it from upstart. Mongo creates a mongod.lock file in the data directory with the process's PID in it; on a safe shutdown it removes the PID, leaving the (now empty) file there.

The upstart job sketched below includes a pre-start script that checks if the lock file exists, reads it, makes sure there is a PID in it, makes sure no mongod process exists with that PID, and then performs the repair as the mongodb user.
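A minimal sketch of that pre-start stanza (paths assume the default Debian layout; the rest of the job file is the usual exec/respawn boilerplate):

pre-start script
  LOCKFILE=/var/lib/mongodb/mongod.lock
  if [ -s "$LOCKFILE" ]; then
    PID=$(cat "$LOCKFILE")
    # only repair if no mongod is actually running with that PID
    if [ -n "$PID" ] && ! ps -p "$PID" -o comm= | grep -q mongod; then
      rm -f "$LOCKFILE"
      su -s /bin/sh -c '/usr/bin/mongod --dbpath /var/lib/mongodb --repair' mongodb
    fi
  fi
end script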

We Are All DevOps

I gave a talk recently at the Forward Tech away day entitled We Are All DevOps and it went down quite well. Forward is an unusual environment: the devs are trusted to do lots of the typical sysadmin role and the boundary between Dev and Ops is very blurred. During my first few months in the search team I kept mindmapping stuff I wanted to talk about, but only got round to making the slides the day before, so it was a bit underprepared; I hope it was useful for people nonetheless.

I borrowed ideas from John Leach's excellent Ruby: Reinventing the Wheel talk and this DevOps: The War Is Over presentation, and rambled incoherently about a talk I had just seen at the UKUUG Spring Conference from the author of cfengine; see here for a nice description of the project (you can see how it has influenced Puppet).

Here are the slides

I like the James White Manifesto; it chimes really strongly with me.

On Infrastructure

  • There is one system, not a collection of systems.
  • The desired state of the system should be a known quantity.
  • The “known quantity” must be machine parseable.
  • The actual state of the system must self-correct to the desired state.
  • The only authoritative source for the actual state of the system is the system.
  • The entire system must be deployable using source media and text files.

Soon they will post videos and I will get to see myself give a talk for the first time.

101 Goals in 1001 Days - Day 400 Update

Well, day 400 of my 101 goals in 1001 days was Feb 5th, and I was in the midst of moving house, so I delayed doing this.

Completed - 16

1, Teetotalitarianism for 3 months
2, Cheeseless for 3 months
9, Read [GEB](http://en.wikipedia.org/wiki/G%C3%B6del,_Escher,_Bach)
11, Reread all [Dennett](http://en.wikipedia.org/wiki/Daniel_Dennett) books 
15, Proofread for [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page)
48, [Create a Backblaze storage pod](http://ukblazers.com/2010/08/25/test-build-easier-than-i-thought/)
53, [Make Jam](http://www.thattommyhall.com/2010/04/15/making-lemon-curd/)
66, Via Ferrata in Italy
78, Learn to use Emacs 
I suppose you can never fully learn it but I do use it for my development now
82, [Visit Egypt](http://www.thattommyhall.com/2010/12/16/egypt-trip/)
83, Re-visit Louvre
85, Visit Pergamon Museum
86, Give Carrie a British Museum Tour
92, Read "[An Ode Less Travelled](http://www.amazon.co.uk/Ode-Less-Travelled-Unlocking-Within/dp/009179661X)", do the exercises (but not share them!)
Read it while in Egypt.
97, Be 1/3 through in 2010
100, Set success criteria / progression metrics for each goal

On Track - 16

5, Lose 2 stone
10, Write book reviews for each book I read
Where I haven't yet, I have added a task to rememberthemilk to do so
13, Release 303 books on bookcrossing.com
88 available [here](http://www.bookcrossing.com/mybookshelf/thattommyhall/available), let me know if you like any and I will post them to you.
19, Blog on average once a week
50, Move 10 people to FreeAgent
68, Complete Pimsleur German
Changed from Spanish as I now live with a lovely German lady.
72, Read "Winning Ways"
read 1/2 of part 1 (of 4)
74, Read AI: A Modern Approach
75, Watch SICP, do exercises from book
Started a book club in work, seems to have stalled but I'll start banging the drum again now I've settled in my new house.
76, Do on average 1 [Project Euler](http://projecteuler.net/index.php?section=about) problem per week
77, Complete "Real World Haskell"
88, Go to the theatre on average once a month
Way ahead on this: started a monthly theatre club and we managed to schedule a dozen things for the first few months of 2011
91, Memorise 10 poems
Not quite settled on the 10, but between listening to Jorge Luis Borges' This Craft of Verse and The Ode Less Travelled I have quite a list to choose from.
95, Pay off all credit cards
96, Let loans run their course and don't get any more
101, Do 100 day updates
This is one right ;-)

Behind - 4

8, Read all the VSIs
12, Read all PG Wodehouse
81, Watch all TTC Art history DVDs
90, See all world heritage sites in the UK

Changing - 5

Lots of the work-related ones don't make sense any more now that I have gone full time and moved into development, so I am making the following changes.

43, Visit the [rijksmuseum](http://www.rijksmuseum.nl/) (was Get CCNP)
44, Visit The [Uffizi](http://www.uffizi.com/) in Florence (was Get CCEE)
45, Give blood every 20 weeks (was Get MCITP - Enterprise Admin)
46, Listen to Radio 4 / British Museum - [A History of the World in 100 Objects](http://www.bbc.co.uk/ahistoryoftheworld/) and view each of them (was Get VCAP)
The above are all taken from a mate who just did his own 101 list.
47, Make a Munro bagging site in Rails (was Say to a recruiter "I don't work " and turn down work)

Planning - 12

60, Hike on average once a month
61, Do a UK long distance path 
67, Do another alpine 4000m peak
62, Do a big hike in Europe
64, Climb a continental highest mountain
33, Safari
20, Organise a big bash for my 30th
The fitness aspect of these goals is where I am behind the most (though I am still a stone lighter than when I started), so I am concentrating the next six months on them, ending with summiting Kilimanjaro for my 30th and then returning for a big party.
35, Visit 5 Michelin 3* restaurants
37, Visit Porto
Will go with Petra in the spring
84, Revisit Met Museum
A good mate has just moved to NYC so this should happen as soon as he is settled.
89, Return to the Theatre by the lake
My first trip with Petra was to here and we loved it. Will be going in the spring.
94, Go to Edinburgh festival
Will go at the beginning of August.

Not Started - 48

3, Do a marathon 
4, Do a triathlon
6, Attend martial arts classes for 3 months
7, Write an article for Plus new writers
14, Read a short story for librivox
16, Send Dennett a letter
17, Send Dawkins a letter
18, Read Joyce 
21, Read GTD
22, Spend 3 months in another country
23, Organise all my DVDs
24, Swim with sharks
25, Paraglide
26, Learn to play bongos
27, Skydive
28, Drive Offroad
29, Do a banger rally
30, Have a track day
31, Hire the whole of Salvos Salumeria for an evening
32, Bungee Jump
34, Vineyard tour
36, See Northern Lights
38, Take dad to an opera
39, Take Mum, Dad and Carrie to the Welsh Mountain Zoo
40, Do 1000 things in London
41, Do a standup comedy course
42, Visit Japan
49, Work only 100 days in a year
51, Investigate Visa situation for Australia
52, Investigate Visa situation for US
54, Grow mushrooms
55, Paint a water colour 
56, Make beer
57, Make wine
58, Cook a 4 course meal for 20 friends
59, Do a photography course
63, Attend NIM
69, Learn to dance
70, Learn to play golf
71, Learn 10 magic tricks
73, Make a Dots and Boxes program 
79, Raise £5005 for charity  
80, Talk about Free Software at a school
87, Go on wine tasting course
93, Go to Melbourne Comedy Festival
99, Have a completion party
65, Volunteer for the Mountain Bothies Association
98, Have done 2/3 by day 666

I am quite heartened by the progress, to be honest, considering that I spent half of last year working outside the UK. Now things are settling down, I should be able to churn through them faster.

If you want to join in on some, let me know!

Signals in Ruby / “Rescue Exception” Considered Harmful

Yesterday we had an issue with the different behaviour of “kill <PID>” and “kill -9 <PID>”, and in the process I had to refresh my knowledge of Unix signals, learn how to handle them in Ruby, and properly learn Ruby's exception hierarchy.

To -9 or not to -9?

The Unix kill command is perhaps strangely named, as it actually sends signals to processes (see “man signal” for a full list). It defaults to sending SIGTERM to the process, and the application writer can decide how to treat it by “trapping” it, allowing for a safe shutdown or debug dumps etc. “kill -9” sends SIGKILL, and the signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored by your programs. I think in the first instance you should just use “kill”, give the app the chance to do the right thing, then get -9 on its ass if you need to.

Catching signals in Ruby

puts "I have PID #{Process.pid}"

Signal.trap("USR1") {puts "prodded me"}

loop do 
  sleep 5
  puts "doing stuff"
end

This is about the simplest code that will trap the “USR1” signal (which you can send with “kill -USR1 <PID>”). The USR1 and USR2 signals are left free for you to use however you wish in your applications.

If you run it, you can see that it responds to the USR1 signal I send it, and that kill (ie sending SIGTERM) works too.

The following two code snippets are the same, except one takes the default rescue and the other catches Exception (ie any exception):

puts "I have PID #{Process.pid}"

Signal.trap("USR1") {puts "prodded me"}

loop do
  begin
    puts "doing stuff"
    sleep 10
  rescue => e
    puts e.inspect
  end
end

So that still works as before and errors in our “do stuff” loop would get caught.

puts "I have PID #{Process.pid}"

Signal.trap("USR1") {puts "prodded me"}

loop do
  begin
    puts "doing stuff"
    sleep 10
  rescue Exception => e
    puts e.inspect
  end
end

This fails though. You can see that SIGTERM no longer works, and CTRL-C from the terminal does not work either (CTRL-C raises Interrupt, a subclass of SignalException). This is because “rescue Exception” catches the SignalException. kill -9 still worked though, as that signal cannot be caught and will kill any application.

Ruby's Exception Hierarchy

The full exception hierarchy (from the excellent cheat gem) is:

exceptions:
  Exception
   NoMemoryError
   ScriptError
     LoadError
     NotImplementedError
     SyntaxError
   SignalException
     Interrupt
       Timeout::Error    # require 'timeout' for Timeout::Error
   StandardError         # caught by rescue if no type is specified
     ArgumentError
     IOError
       EOFError
     IndexError
     LocalJumpError
     NameError
       NoMethodError
     RangeError
       FloatDomainError
     RegexpError
     RuntimeError
     SecurityError
     SocketError
     SystemCallError
     SystemStackError
     ThreadError
     TypeError
     ZeroDivisionError
   SystemExit
   fatal

I think you should only catch StandardError or its children, possibly some of its siblings, and avoid catching Exception, as you probably don't want to change how the process deals with signals (you can trap them explicitly if you need to).
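If you do need custom behaviour on a signal, trap it rather than rescuing it. A minimal (hypothetical, but runnable) sketch:

puts "I have PID #{Process.pid}"

# Flip a flag on SIGTERM so the loop can finish its current iteration
shutting_down = false
Signal.trap("TERM") { shutting_down = true }

until shutting_down
  puts "doing stuff"
  sleep 5
end
puts "clean shutdown"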

Ruby on Windows - Forking Other Processes

While moving our VM deployment site, written in Sinatra, to a Windows machine with the VMware PowerCLI toolkit installed, the only snag was where we forked a process to do the preparation of the machines. Both Kernel.fork and Process.detach seemed to have issues.

Original MRI on Linux

def build
  # run the slow setup in a child process; detach reaps it so we do not leave a zombie
  pid = fork { run_command }
  Process.detach(pid)
end

def run_command
  `sudo /opt/script/deployserver/setupnewserver.sh -p #{poolserver} -i #{ip} -s #{@size} -v #{@vlan} -a "#{@owner}" -n #{@name} -e "#{@email}"`
end

IronRuby

We tried IronRuby, and the same bit of the script broke just as on win32 MRI (though I was pleased and surprised that Sinatra worked). The fix was to start the process through .NET instead:

def build
  WindowsProcess.start "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe", 
"-PSConsoleFile \"C:\\Program Files (x86)\\VMware\\Infrastructure\\vSphere PowerCLI\\vim.psc1\" \"& C:\\script\\DataStoreUsage.ps1\""
end

Using the following .NET code:

class WindowsProcess
  def self.start(file, arguments)
    process = System::Diagnostics::Process.new
    process.StartInfo.FileName = file
    process.StartInfo.CreateNoWindow = true
    process.StartInfo.Arguments = arguments
    process.Start
  end
end

Workaround using Windows “start” command

I had hoped the module at win32utils would let me just use the original script, but fork still did not work properly.

def build
  commandstr = "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -PSConsoleFile \"C:\\Program Files (x86)\\VMware\\Infrastructure\\vSphere PowerCLI\\vim.psc1\" \"& C:\\Sites\\vmdeploy\\PrepNewMachine.ps1 -type #{@type} -machinename #{@name} -size #{@size} -vlan #{@vlan} -creator #{@owner} -creatoremail #{@email} -ipaddress #{ip}\""
  system("start #{commandstr} > ./log/#{@name}.log 2>&1")
end

This uses the windows “start” command and works pretty well.

Measuring Disk Usage in Linux (%iowait vs IOPS)

This occurred to me when looking at our Hadoop servers today: lots of our devs use IOWait as an indicator of IO performance, but there are better measures. IOWait is a CPU metric, measuring the percentage of time the CPU is idle but waiting for an I/O to complete. Strangely, it is possible to have a healthy system with nearly 100% iowait (an otherwise idle box doing one big copy, say), or a disk bottleneck with 0% iowait (a CPU-bound box that never idles will mask a saturated disk). A much better approach is to look at disk IO directly, and what you want to find is the IOPS (IO Operations Per Second).

Measuring IOPS

In Linux I like the iostat command, though there are a few ways to get at the info. On Debian/Ubuntu it is in the sysstat package (ie sudo apt-get install sysstat):

root@MACHINENAME:/home/deploy# iostat 1
Linux 2.6.24-28-server (MACHINENAME.forward.co.uk) 	18/02/11
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
    45.51    0.00    1.85    0.62       0.00       52.03

Device:        tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cciss/c0d0     4.00       0.00       40.00          0       40
cciss/c0d1     4.00       0.00       64.00          0       64
cciss/c0d2    12.00       0.00      248.00          0       248
cciss/c0d3     0.00       0.00        0.00          0       0
cciss/c0d4    25.00       0.00      320.00          0       320
cciss/c0d5     0.00       0.00        0.00          0       0
cciss/c0d6    30.00       0.00      344.00          0       344
cciss/c0d7    42.00    3144.00        0.00         3144     0

iostat 1 refreshes every second; if you run it over a longer period it will average the results. tps is what you are interested in: Transactions Per Second (ie IOPS). -x will give a more detailed output that separates out reads and writes and lets you know how much data is going in and out per second.

What is a good or bad number, though? As with most metrics, if the first time you look at it is when you are in trouble, it is less helpful. You should have an idea of how much IO you typically do; then, if you experience issues and are doing 10x that, or only getting 1/10 of it from the disks, you have a good candidate explanation for them.

How much can I expect from my storage? It depends how fast the disks are spinning, and how many there are. As a rule of thumb I assume for a single disk:

  • 7.2k RPM -> ~100 IOPS
  • 10k RPM -> ~150 IOPS
  • 15k RPM -> ~200 IOPS

Our Hadoop servers were pushing about 70 IOPS to each disk at peak, and they are 7.2k ones, so that is in line with this estimate.
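You can sanity-check those rules of thumb with a back-of-the-envelope sum (the average seek times here are typical datasheet figures, not measurements):

# Rough IOPS estimate for random IO on a single spinning disk:
# each IO pays an average seek plus half a revolution of rotational latency.
def iops(rpm, avg_seek_ms)
  rotational_ms = (60_000.0 / rpm) / 2  # half a revolution, in ms
  1000.0 / (rotational_ms + avg_seek_ms)
end

puts iops(7_200, 8.0)   # => ~82
puts iops(10_000, 4.5)  # => ~133
puts iops(15_000, 3.5)  # => ~182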

See here for a breakdown of why these are good estimates for random IOs from a single disk. Interestingly, a large amount of the cost comes from the latency of the platter spinning, which is why SSDs do so well for random IO (compared to a 15k disk, ~50x for writes, ~200x for reads). See also a concrete example here of a faster CPU causing higher %iowait while actually doing more transactions.

Extreme Linux Performance Monitoring and Tuning: Part 1 (pdf) and Part 2 (pdf) from ufsdump.org/

Running Any Executable as a Windows Service (Ruby / Sinatra)

While migrating an automated VM deployment page (a combination of Sinatra on Linux and Bash scripts using the Perl toolkit) to a simpler script using the VMware PowerCLI that I love so much, I needed to create a Windows service from the Sinatra app and had to do some googling, so I thought I would share how I did it.

You only need two things: the built-in “sc” command and an executable from the Windows Server 2003 Resource Kit Tools called srvany (it works with 2008 too). Get just that exe here (if you trust me, of course ;-) )

Creating the service
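Something like this (the VMdeploy name matches the registry path below; the path to srvany.exe is wherever you saved it, and note that sc insists on a space after each option=):

sc create VMdeploy binPath= "C:\Tools\srvany.exe" start= auto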

Check it exists
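sc query will confirm it was created:

sc query VMdeploy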

Set Parameters In The Registry

Configure it at HKLM\SYSTEM\CurrentControlSet\Services\APPNAME\Parameters:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\VMdeploy\Parameters]
"Application"="C:\\Ruby192\\bin\\ruby"
"AppParameters"="C:\\Sites\\vmdeploy\\server.rb -p 80"
"AppDirectory"="C:\\Sites\\vmdeploy"
"AppEnvironment"=hex(7):65,00,78,00,61,00,6d,00,70,00,6c,00,65,00,3d,00,32,00,\
  37,00,00,00,62,00,6c,00,61,00,68,00,3d,00,63,00,3a,00,5c,00,74,00,65,00,6d,\
  00,70,00,66,00,69,00,6c,00,65,00,73,00,00,00,00,00

Note that AppEnvironment is a multi-string value (the hex above decodes to two example entries, example=27 and blah=c:\tempfiles); the rest are plain strings. This lets you run any executable file, change the directory you run it from, and pass any arguments or environment variables, so it should cover most use cases. I will be sharing the code for both the Sinatra app and the PowerShell deploy script in later posts.

2010 Retrospective

Well, 2010 was a great year - thanks to all the great people who made it so.

Just a brief outline:

Work: Started the year down in Kent working on a XenApp deployment on VMware, then went over to Dubai to build a VMware View solution, on to Libya to build the corporate infrastructure for their biggest telco, spent 2 months working on ING's next-gen datacentre in the Hague, and ended in London at Forward, working on merging uSwitch's infrastructure after the acquisition. While this was fun, I got fatigued with travelling and living out of a suitcase and was very happy to settle down a bit in London (which I think is the greatest city in the world) and work for a great company, which I am very pleased and proud to say I have joined permanently. It is exciting to work somewhere using so much great tech and with so many sound people (with huge brains).

Jollys: I managed to go to Russia, visiting Moscow and St Petersburg (OK, that was Dec 2009, but what the hell), the Czech Republic and Italy, go to Oklahoma for a friend's wedding followed by a visit to two Ancestral Puebloan sites, and see Berlin, Paris, Vegas and Egypt. I feel privileged to have had the opportunity to do all this and am still a bit amazed at just how much happened.

Life: I have always thought there was an embarrassment of riches when I look at the great people I get to call my friends; I don't get to see any of them enough, and no amount is too much. Thanks to you all. I always say you are what makes the universe great for me; you conspire to make it so even though you don't all know one another. The biggest change this year was finding someone patient enough to pair up with me full time. Thanks Petra - I love you. We are moving in together soon and I look forward to having a permanent home in the greatest city in the world, with a spare room - please come visit! Thanks to my family too; we had some bad news in 2010 but you all dealt with it with the usual aplomb, you are all ace.

Resolutions: There are no new resolutions for 2011; I will be reporting on the 101 goals in 1001 days soon.

2011: Looks set to be the best year ever - thanks in advance for helping make it so.