Archive for the ‘Ruby’ Category

Compressing Text Tables In Hive

June 1st 2011

At Forward we have been using Hive for a while. We started out with the default table type (uncompressed text) and wanted to see if we could save some space without losing too much performance.

The wiki page HiveCompressedStorage lists the possibilities.

Basically you have 3 decisions:
TextFile or SequenceFile tables
TextFile

  • Can be compressed in place.
  • Can gzip/bzip before you LOAD DATA into your table
  • Only gzip/bzip are supported
  • Gzip is not splittable

SequenceFile

  • Need to create a SequenceFile table and do a SELECT/INSERT into it
  • Can use any supported compression codec
  • All compression codecs are splittable. All the cool kids use LZO or Snappy
  • Does not work, at least for me (help appreciated!)

Which compression algorithm

  • gzip – Quite slow, good compression, not splittable, supported in TextFile table
  • bzip – Slowest, best compression, splittable, supported in TextFile table
  • LZO – Not in standard distro (licensing issues), fast, splittable
  • Snappy – New from Google, not in standard distro (but licence compatible), very fast

Block or Record compression (for SequenceFile tables)
The docs say

The value for io.seqfile.compression.type determines how the compression is performed. If you set it to RECORD you will get as many output files as the number of map/reduce jobs. If you set it to BLOCK, you will get as many output files as there were input files. There is a tradeoff involved here -- large number of output files => more parallel map jobs => lower compression ratio.

But I got the same number of files regardless of what I selected, and the total size suggested they were not even compressed, so I don't know what is going on.

For simplicity I chose gzipped TextFile tables because

  • It worked (always criteria zero)
  • Most of our files were not huge anyway and the technique described below keeps some of the parallelism
  • Can be done on the table in place
  • Each partition can be compressed separately
  • The space is saved incrementally and realised immediately
  • Testing showed for our load it was not much of a performance hit
  • We are feeling more pain on space than query performance at the moment (our hourly runs complete in ~20mins)

require 'rubygems'
require 'date'
require 'rbhive'

countrys = %w[at au br de dk es fr in it jp mx nl no pl pt ru se uk us za]
dates = (Date.parse('2011-01-01')..Date.parse('2011-04-30'))

RBHive.connect('hiveserver') do |con|
  dates.each do |date|
    countrys.each do |country|
      query = "insert overwrite table keywords partition (dated='#{date}', country = '#{country}')
              select account,campaign,ad_group,keyword_id,keyword,match_type,status,
              first_page_bid,quality_score,distribution,max_cpc,destination_url,ad_group_status,
              campaign_status,currency_code,impressions,clicks,ctr,cpc,
              cost,avg_position,account_id,campaign_id,adgroup_id 
              from keywords where dated='#{date}' and country='#{country}'"
      begin
        con.set('mapred.output.compression.codec','org.apache.hadoop.io.compress.GzipCodec')
        con.set('hive.exec.compress.output','true')
        con.set('mapred.output.compress','true')
        con.set('mapred.compress.map.output','true')
        con.set('hive.merge.mapredfiles','true')
        con.set('hive.merge.mapfiles','true')
        con.execute(query)
      rescue => e
        puts "#########################"
        puts e.message
        puts "#########################"
      end
    end
  end 
end 

This will loop through the partitions (date/country) and do an INSERT OVERWRITE from/to that partition using our rbhive gem. This is good because Hive reads the old data via map/reduce jobs, writes the output to /tmp, deletes the old folder and then imports the new compressed version. You need to select the columns out explicitly as the target partition has 2 fewer fields (date and country are missing). As we had 2 levels of partitioning and lots of big files, this ran within a day on a 2TB table, saving us around 5TB (replication factor is 3).

You can actually download the data, compress it and write it directly back to HDFS, as Hive does not know what data is inside the folders on HDFS, just their layout, but I thought it better to do it via Hive and let Hadoop parallelise it. I would have carried on doing it this way but with other tables it was too slow (too many partitions, difficult to parallelise the Hive server). I stopped using rbhive, dropped to using hive -e to execute the queries and used the lovely autopartitioning (dynamic partitioning) in later Hive versions. Notice you can SELECT * now and it automatically inserts the results into the correct partitions.

require 'rubygems'
require 'date'

countrys = %w[at au br de dk es fr in int it jp kr mx nl no pl pt ru se uk us za]

dates = (Date.parse('2010-12-02')..Date.parse('2011-05-01'))

dates.each do |date|
  query = ""
  query += "SET hive.exec.compress.output=true;"
  query += "SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;"
  query += "set mapred.job.priority=VERY_LOW;" 
  query += "set hive.exec.dynamic.partition=true;"
  query += "set mapred.output.compress=true;"
  query += "set mapred.compress.map.output=true;"
  query += "set hive.merge.mapredfiles=true;"
  query += "set hive.merge.mapfiles=true;"
  query += "insert overwrite table hourly_clicks 
            partition (dated='#{date}', country, hour) 
            select * from hourly_clicks where dated='#{date}'"
  query = "hive -e \"#{query}\""
  puts "running #{query}"
  `#{query}`
end

The key difference is partition (dated='#{date}', country, hour): we have not specified a value for the country or hour partitions, so Hive works them out automatically from the data. This ran loads faster than looping over the partitions, as it lets Hive schedule lots more map/reduce jobs at once. If you set hive.exec.dynamic.partition.mode=nonstrict as well, you can leave out the partition values entirely (I did this as a test but kept the WHERE clause, I was scared to do it all at once!)
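A minimal sketch of that fully dynamic variant, assuming the same hourly_clicks table and partition columns as the script above. recompress_command is a hypothetical helper name, and it only builds the hive -e command string; nothing is executed.

```ruby
# Build (but do not run) a hive -e command that rewrites one day of a table
# as gzip with every partition column left dynamic.
# recompress_command is a made-up helper name, not part of Hive or rbhive.
def recompress_command(table, date, partition_columns)
  settings = {
    'hive.exec.compress.output'        => 'true',
    'mapred.output.compression.codec'  => 'org.apache.hadoop.io.compress.GzipCodec',
    'hive.exec.dynamic.partition'      => 'true',
    'hive.exec.dynamic.partition.mode' => 'nonstrict'
  }
  statements = settings.map { |name, value| "SET #{name}=#{value};" }
  # No partition values are given, so Hive takes them from the selected rows.
  statements << "INSERT OVERWRITE TABLE #{table} " \
                "PARTITION (#{partition_columns.join(', ')}) " \
                "SELECT * FROM #{table} WHERE dated='#{date}'"
  "hive -e \"#{statements.join(' ')}\""
end

command = recompress_command('hourly_clicks', '2011-01-01', %w[dated country hour])
puts command
```

Keeping the WHERE clause means only one day is rewritten per command, which is the cautious approach described above.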

The reason I am not (very) worried about losing parallelism is that some of our partitions contained big .csv files and the output of INSERT OVERWRITE was multiple .gz files (it looked to me like as many as there were mappers, for example a 700M text file became ~10 .gz files), so they will still be read in parallel by mappers as the original CSV was.

I am open to suggestions about better ways to achieve this; it does not preclude doing something better later.

Posted by tom under hadoop & hive & Ruby | 3 Comments »

Finding information on Hive tables from HDFS

May 16th 2011

I was curious about our Hive tables' total usage on HDFS and what the average file size was with the current partitioning scheme, so I wrote this Ruby script to calculate it.

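current = ''
file_count = 0
total_size = 0

output = File.open('output.csv','w')

# Print and save one CSV row: table,file_count,total_size,average_size
report = lambda do |table, count, size|
  average_size = count == 0 ? 0 : size/count
  result = "#{table},#{count},#{size},#{average_size}"
  puts result
  output.puts result
end

IO.popen('hadoop fs -lsr /user/hive/warehouse').each_line do |line|
  split = line.split(/\s+/)
  #permissions,replication,user,group,size,mod_date,mod_time,path
  next unless split.size == 8
  path = split[7]
  size = split[4]
  permissions = split[0]
  tablename=path.split('/')[4]
  if tablename != current
    report.call(current, file_count, total_size) unless current == ''
    total_size = 0
    current = tablename
    file_count = 0
  end
  # Directories start with 'd' in the permissions column; only count files
  file_count += 1 unless permissions.start_with?('d')
  total_size += size.to_i
end

# A table is only reported when the next one starts, so flush the last one here
report.call(current, file_count, total_size) unless current == ''
output.close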

Lots of our files were small so I am going to experiment with different partitioning and compression schemes.
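To make the column handling in the script easier to follow, here is a small sketch that parses a single listing line. parse_lsr_line is a hypothetical helper, and the sample line is made up for illustration rather than taken from a real cluster.

```ruby
# Split one line of `hadoop fs -lsr` output into the fields the script uses.
# Returns nil for lines that do not have the expected 8 columns.
# parse_lsr_line is a made-up helper name.
def parse_lsr_line(line)
  fields = line.split(/\s+/)
  return nil unless fields.size == 8
  permissions, _replication, _user, _group, size, _date, _time, path = fields
  {
    :table     => path.split('/')[4],        # /user/hive/warehouse/<table>/...
    :size      => size.to_i,
    :directory => permissions.start_with?('d')
  }
end

# Made-up sample line, for illustration only
sample = '-rw-r--r--   3 hive supergroup    1048576 2011-05-16 10:15 ' \
         '/user/hive/warehouse/keywords/dated=2011-01-01/country=uk/data.gz'
p parse_lsr_line(sample)
```

The table name is element 4 because splitting an absolute path on '/' leaves an empty string in position 0.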

Posted by tom under hadoop & hive & Ruby | No Comments »

Signals In Ruby / “rescue Exception” considered harmful

February 24th 2011

Yesterday we had an issue with the different behaviour of “kill” and “kill -9”, and in the process I had to refresh my knowledge of Unix signals, learn how you handle them in Ruby and properly learn Ruby's exception hierarchy.

To -9 or not to -9?
The Unix kill command is perhaps strangely named, as it actually sends signals to processes (see “man signal” for a full list). It defaults to sending SIGTERM to the process, and the application writer can decide how to treat it by “trapping” it, allowing for a safe shutdown or debug dumps etc. “kill -9” sends SIGKILL, and the signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored by your programs.
I think in the first instance you should just use “kill”, give the app the chance to do the right thing, then get -9 on its ass if you need to.

Catching signals in Ruby

puts "I have PID #{Process.pid}"

Signal.trap("USR1") {puts "prodded me"}

loop do 
  sleep 5
  puts "doing stuff"
end

This is about the simplest code that will trap the “USR1” signal (which you can send with “kill -USR1” and the PID the script prints). The USR1 and USR2 signals are left free for you to use however you wish in your applications.

When run, it responds to the USR1 signal I send it, and kill (i.e. sending SIGTERM) also works.
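The same trap mechanism can give you the safe shutdown mentioned earlier. This is a minimal sketch, assuming a simple polling loop: the Worker class and its method names are invented for illustration, and the demo sends TERM to its own process only so that it is self-contained.

```ruby
# Worker is a made-up class: the TERM handler only sets a flag and the
# loop checks the flag between units of work.
class Worker
  attr_reader :iterations

  def initialize
    @stop_requested = false
    @iterations = 0
  end

  # Keep the handler tiny; the real clean-up happens in the main loop.
  def install_trap
    Signal.trap('TERM') { @stop_requested = true }
  end

  def stop_requested?
    @stop_requested
  end

  def run(max_iterations)
    until stop_requested? || @iterations >= max_iterations
      @iterations += 1
      sleep 0.01 # stand-in for one unit of real work
    end
    puts 'shutting down cleanly' if stop_requested?
  end
end

worker = Worker.new
worker.install_trap
Process.kill('TERM', Process.pid) # demo only: normally sent with kill from a shell
worker.run(500)
```

Doing as little as possible inside the handler keeps the shutdown logic in ordinary code, where it is easier to reason about.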

The following two code snippets are the same except that one uses a bare rescue (the default, which catches StandardError) and the other catches Exception (i.e. any exception).

#sig-rescue.rb
puts "I have PID #{Process.pid}"

Signal.trap("USR1") {puts "prodded me"}

loop do
  begin
    puts "doing stuff"
    sleep 10
  rescue => e
    puts e.inspect
  end
end


So that still works as before and errors in our “do stuff” loop would get caught.

#sig-rescue-E.rb
puts "I have PID #{Process.pid}"

Signal.trap("USR1") {puts "prodded me"}

loop do
  begin
    puts "doing stuff"
    sleep 10
  rescue Exception => e
    puts e.inspect
  end
end


This fails though. SIGTERM no longer works and CTRL-C from the terminal does not work either. This is because we are catching the SignalException (and its subclass Interrupt, which is what CTRL-C raises) when we do “rescue Exception”. Kill -9 worked though, as the signal cannot be caught so it will kill any application.

Ruby's Exception Hierarchy
The full exception hierarchy (from the excellent cheat gem) is

Tom-Halls-MacBook-Pro:signal tomh$ cheat exceptions
exceptions:
  Exception
   NoMemoryError
   ScriptError
     LoadError
     NotImplementedError
     SyntaxError
   SignalException
     Interrupt
       Timeout::Error    # require 'timeout' for Timeout::Error
   StandardError         # caught by rescue if no type is specified
     ArgumentError
     IOError
       EOFError
     IndexError
     LocalJumpError
     NameError
       NoMethodError
     RangeError
       FloatDomainError
     RegexpError
     RuntimeError
     SecurityError
     SocketError
     SystemCallError
     SystemStackError
     ThreadError
     TypeError
     ZeroDivisionError
   SystemExit
   fatal

I think you should only catch StandardError or its children, possibly some of its siblings, and avoid catching Exception, as you probably don't want to change how the process deals with signals (you could trap them if you need to).
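A small sketch to check this without sending real signals: it raises the exception objects directly. caught_by_bare_rescue? is a hypothetical helper name; the outer rescue Exception is there only so the demo can report what slipped past the bare rescue.

```ruby
# Returns true if a bare rescue catches the error, false if it slips past.
# caught_by_bare_rescue? is a made-up helper name.
def caught_by_bare_rescue?(error)
  begin
    raise error
  rescue
    # Bare rescue: only StandardError and its descendants land here.
    return true
  end
rescue Exception
  # Used here only to observe what the bare rescue missed.
  false
end

p caught_by_bare_rescue?(RuntimeError.new('oops'))
p caught_by_bare_rescue?(Interrupt.new('ctrl-c'))
p caught_by_bare_rescue?(SignalException.new('TERM'))
p Interrupt.ancestors.include?(SignalException)
p SignalException.ancestors.include?(StandardError)
```

In normal code the bare rescue is the one you want; signals and exit requests then carry on up to Ruby's default handling.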

Posted by tom under Ruby | 2 Comments »

Ruby On Windows – Forking other processes

February 20th 2011

While moving our VM deployment site, written in Sinatra, to a Windows machine with the VMware PowerCLI toolkit installed, the only snag was where we forked a process to do the preparation of the machines. Both Kernel.fork and Process.detach seemed to have issues.

Original MRI on Linux

  def build
    pid = fork { run_command }
    Process.detach(pid)
  end

  def run_command
    `sudo /opt/script/deployserver/setupnewserver.sh -p #{poolserver} -i #{ip} -s #{@size} -v #{@vlan} -a "#{@owner}" -n #{@name} -e "#{@email}"`
  end

IronRuby
We tried IronRuby and the same bit of the script broke as on win32 MRI (though I was pleased and surprised that Sinatra worked)

  def build
    WindowsProcess.start "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe", 
"-PSConsoleFile \"C:\\Program Files (x86)\\VMware\\Infrastructure\\vSphere PowerCLI\\vim.psc1\" \"& C:\\script\\DataStoreUsage.ps1\""
  end

Using the following DotNet code

class WindowsProcess
  def self.start(file, arguments)
    process = System::Diagnostics::Process.new
    process.StartInfo.FileName = file
    process.StartInfo.CreateNoWindow = true
    process.StartInfo.Arguments = arguments
    process.Start
  end
end

Workaround using Windows “start” command
I had hoped the module at win32utils would let me just use the original script, but fork still did not work properly.

  
def build
  commandstr = "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -PSConsoleFile \"C:\\Program Files (x86)\\VMware\\Infrastructure\\vSphere PowerCLI\\vim.psc1\" \"& C:\\Sites\\vmdeploy\\PrepNewMachine.ps1 -type #{@type} -machinename #{@name} -size #{@size} -vlan #{@vlan} -creator #{@owner} -creatoremail #{@email} -ipaddress #{ip}"

  system ("start #{commandstr} > ./log/#{@name}.log 2>&1")
end

This uses the Windows “start” command and works pretty well.
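Another option that may be worth trying is Process.spawn (Ruby 1.9 and later), which starts a child process without fork. This is a different technique from the “start” workaround above and is untested against the Windows/PowerCLI setup described here, so treat it as a sketch. run_in_background is a hypothetical helper, and the demo launches a tiny Ruby child instead of the real PowerShell script.

```ruby
require 'rbconfig'
require 'tmpdir'

# Start a command in the background with stdout and stderr sent to log_path.
# Returns the thread created by Process.detach, which reaps the child later.
# run_in_background is a made-up helper name.
def run_in_background(command, log_path)
  pid = Process.spawn(*command, :out => log_path, :err => [:child, :out])
  Process.detach(pid)
end

# Demo: a tiny Ruby child process stands in for the real deploy script
log_path = File.join(Dir.tmpdir, "background_demo_#{Process.pid}.log")
waiter = run_in_background([RbConfig.ruby, '-e', 'puts "prepared"'], log_path)
waiter.join # only so the demo can read the log; a web app would not wait
puts File.read(log_path)
```

Passing the command as an array avoids a shell, so arguments with spaces do not need extra quoting.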

Posted by tom under Ruby & VMware | No Comments »

Running Any Executable As A Windows Service (Ruby / Sinatra)

February 14th 2011

While migrating an automated VM deployment page from a combination of Sinatra on Linux and Bash scripts using the Perl toolkit to a simpler script using the VMware PowerCLI that I love so much, I needed to create a Windows service from the Sinatra app. I had to do some googling, so I thought I would share how I did it.

You only need two things – the built-in “sc” command and an executable from Windows Server 2003 Resource Kit Tools called srvany (works with 2008 too). Get just that exe here (if you trust me of course ;-) )

Creating the service

Check it exists

Set Parameters In The Registry
Configure it at HKLM/SYSTEM/CurrentControlSet/Services/APPNAME/Parameters

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\VMdeploy\Parameters]
"Application"="C:\\Ruby192\\bin\\ruby"
"AppParameters"="C:\\Sites\\vmdeploy\\server.rb -p 80"
"AppDirectory"="C:\\Sites\\vmdeploy"
"AppEnvironment"=hex(7):65,00,78,00,61,00,6d,00,70,00,6c,00,65,00,3d,00,32,00,\
  37,00,00,00,62,00,6c,00,61,00,68,00,3d,00,63,00,3a,00,5c,00,74,00,65,00,6d,\
  00,70,00,66,00,69,00,6c,00,65,00,73,00,00,00,00,00

Note that AppEnvironment is a multi-string value (REG_MULTI_SZ, hence the hex(7) encoding); the rest are plain strings.

This lets you run any executable file, change the directory you run it from and pass any arguments or environment variables so should cover most use cases.
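The hex(7) data is hard to read by eye. As a sketch, it can be decoded as UTF-16LE text with a NUL after each entry and one extra NUL at the end. decode_multi_sz and encode_multi_sz are hypothetical helper names; the hex string is the AppEnvironment value copied from the registry export above.

```ruby
# Decode a .reg hex(7) value (REG_MULTI_SZ) into its list of strings.
def decode_multi_sz(hex)
  # Drop line-continuation backslashes and whitespace, then read the bytes
  bytes = hex.gsub(/[\\\s]/, '').split(',').map { |pair| pair.to_i(16) }
  text = bytes.pack('C*').force_encoding('UTF-16LE').encode('UTF-8')
  text.split("\0")
end

# Encode a list of strings back into the comma-separated hex form.
def encode_multi_sz(strings)
  text = strings.map { |s| s + "\0" }.join + "\0"
  text.encode('UTF-16LE').bytes.map { |b| format('%02x', b) }.join(',')
end

app_environment = <<'HEX'
65,00,78,00,61,00,6d,00,70,00,6c,00,65,00,3d,00,32,00,\
  37,00,00,00,62,00,6c,00,61,00,68,00,3d,00,63,00,3a,00,5c,00,74,00,65,00,6d,\
  00,70,00,66,00,69,00,6c,00,65,00,73,00,00,00,00,00
HEX

puts decode_multi_sz(app_environment)
```

For the value above this gives two environment variables, example=27 and blah=c:\tempfiles.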

I will be sharing the code for both the Sinatra app and the PowerShell deploy script in later posts.

Posted by tom under Ruby & sinatra & VMware & windows | 2 Comments »

Learning Ruby: methods vs procs (or Ruby vs Python?)

October 4th 2010

I have been meaning to learn Ruby for a while and the place I am working now uses it a lot, so I had another look at it. I read Learn To Program, a simple but good book, found the bit on blocks and procs etc pretty good, and wanted to see if I could do the stuff in Python as well. Python has anonymous “lambda” functions but they are limited to a single expression (one line and a subset of the syntax), which is a bit annoying sometimes. My worry with methods in Ruby is that they are not first class, I think because you can omit parentheses and so you have no way of referring to them by name without invoking them.

I remembered this while reading the SICP book, the question was about the difference between this program in applicative and normal order evaluation

(define (p) (p))

(define (test x y)
  (if (= x 0)
      0
      y))

It rang a bell as (define (p) p) does not go into an infinite loop if you invoke p. In Lisp (p) calls the procedure p with no arguments whereas p is just a reference to the function. In Python someinstance.method refers to the method and someinstance.method() calls it. Ruby seems to need Proc objects to get around this (IMHO as a beginner! See the end for John Leach’s lovely response via email at the time).
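For what it is worth, Ruby can also hand out a reference to a method without calling it: Object#method returns a Method object that responds to call, much like a Proc. A minimal sketch, where wink and twice_do are simplified stand-ins written for this illustration:

```ruby
def wink
  '<wink>'
end

# Accepts anything that responds to call: a Proc, a lambda or a Method
def twice_do(callable)
  [callable.call, callable.call]
end

wink_ref = method(:wink)      # a Method object; nothing is invoked yet
p wink_ref.class
p twice_do(wink_ref)
p twice_do(wink_ref.to_proc)  # Method#to_proc gives a real Proc if one is needed
```

It is more verbose than Python's someinstance.method, but it does let a plain method be passed around.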

I redid all the examples from the book in Python

Eg1
Ruby

def maybe_do some_proc 
  if rand(2) == 0
    some_proc.call
  end 
end

def twice_do some_proc 
  some_proc.call 
  some_proc.call
end

wink = Proc.new do 
  puts '<wink>'
end

glance = Proc.new do 
  puts '<glance>'
end

Python

import random

def maybe_do(some_proc):
    if random.choice(range(2)) == 0:
        some_proc()

def twice_do(some_proc):
    some_proc()
    some_proc()

def wink():
    print 'wink'

def glance():
    print 'glance'

for i in range(5):
    print 'running for i=',i
    maybe_do(wink)

Eg2
Ruby

def do_until_false first_input, some_proc 
  input = first_input 
  output = first_input
  while output 
    input = output 
    output = some_proc.call input
  end
  input
end

build_array_of_squares = Proc.new do |array| 
  last_number = array.last 
  if last_number <= 0
    false 
  else
    # Take off the last number...
    array.pop
    # ...and replace it with its square...
    array.push last_number*last_number
    # ...followed by the next smaller number.
    array.push last_number-1
  end 
end

always_false = Proc.new do |just_ignore_me| 
  false
end

puts do_until_false([5], build_array_of_squares).inspect

yum = 'lemonade with a hint of orange blossom water' 
puts do_until_false(yum, always_false)

Python

def do_until_false(first_input, some_proc):
    input = first_input
    output = first_input
    while output:
        input = output
        output = some_proc(input)
    return input

def build_array_of_squares(array):
    # Peek at the last number first so the array is left untouched
    # when we stop, as in the Ruby version
    last_number = array[-1]
    if last_number <= 0:
        return False
    else:
        array.pop()
        array.append(last_number * last_number)
        array.append(last_number - 1)
        return array

def always_false(just_ignore_me):
    return False

print do_until_false([5], build_array_of_squares)
yum = 'lemonade with a hint of orange blossom water'
print do_until_false(yum, always_false)

Eg3
Ruby

def compose proc1, proc2 
  Proc.new do |x|
    proc2.call(proc1.call(x))
  end 
end

square_it = Proc.new do |x| 
  x*x
end

double_it = Proc.new do |x| 
  x+x
end

double_then_square = compose double_it, square_it 

square_then_double = compose square_it, double_it

puts double_then_square.call(5)
puts square_then_double.call(5)

Python

def compose(proc1,proc2):
    def composed(x):
        return proc2(proc1(x))
    return composed

def square_it(x):
    return x**2

def double_it(x):
    return x*2

double_then_square = compose(double_it,square_it)
square_then_double = compose(square_it,double_it)

print double_then_square(5)
print square_then_double(5)

Eg4
Ruby

class Array
  def each_even(&was_a_block__now_a_proc) 
    # We start with "true" because 
    # arrays start with 0, which is even. 
    is_even = true
    self.each do |object| 
      if is_even
        was_a_block__now_a_proc.call object
      end
      # Toggle from even to odd, or odd to even.
      is_even = !is_even
    end 
  end
end

fruits = ['apple', 'bad apple', 'cherry', 'durian'] 
fruits.each_even do |fruit|
  puts "Yum! I just love #{fruit} pies, don't you?" 
end

[1, 2, 3, 4, 5].each_even do |odd_ball|
  puts "#{odd_ball} is NOT an even number!" 
end

Python

class MyArray(list):
    def each_even(self):
        for i in range(len(self)):
            if i % 2 == 0:
                yield self[i]

fruits = MyArray(['apple', 'bad apple', 'cherry', 'durian'])

for fruit in fruits.each_even():
    print 'yum! I love %s pies, dont you?' % fruit

for odd_ball in MyArray([1,2,3,4,5]).each_even():
    print '%s is NOT an even number' % odd_ball

Eg5
Ruby

def profile block_description, &block 
  start_time = Time.new 
  block.call 
  duration = Time.new - start_time 
  puts "#{block_description}: #{duration} seconds"
end

profile '25000 doublings' do 
  number = 1
  25000.times do 
    number = number + number
  end

  puts "#{number.to_s.length} digits"
  # That's the number of digits in this HUGE number.
end

profile 'count to a million' do 
  number = 0
  1000000.times do
    number = number + 1
  end 
end

Python

def profile(description, function):
    import time
    start_time = time.time()
    function()
    duration = time.time() - start_time
    print '%s: %s seconds' % (description, duration)
    print function.__name__
    print 'see, "function.__name__" can be used in place of description in python'

def count_to_a_million():
    number = 0
    for i in range(1000000):
        number = number+1

profile('count to a million', count_to_a_million)

def profiled(function):
    def new_function(*args, **kwargs):
        import time 
        start_time = time.time()
        result = function(*args, **kwargs)
        print function.__name__, 'took', time.time() - start_time, 'secs'
        return result
    return new_function

@profiled
def count_to_a_million_again():
    number = 0
    for i in range(1000000):
        number = number + 1

count_to_a_million_again()

This uses decorators, a nice Python feature that uses higher order functions (and the fact that functions are first class in Python).

In Conclusion
IMHO, at this point in my experience of Ruby, with all the disclaimers about my non expert status etc.
Like:

  • No restriction on complexity of anonymous functions

Don't Like:

  • Methods being different from Procs/Blocks, non-uniform syntax
  • Leaving out parentheses (though I await DSL goodness later!)
  • “end” everywhere (I know the indentation thing in Python is contentious!)

John Leach’s thought provoking tuppence

Young padawan, you look but you do not see, you will learn

or rather

Yeah, but blocks are closures Tom

Tom goes to google and comes back with http://www.artima.com/intv/closures2.html
Matz

I think it’s not that useful in the daily lives of programmers. It doesn’t matter that much.

Then John came back with

I can think of one example in Rails right away where it's useful, transactions:

r = Record.new params[:record]

Record.transaction do
 r.save
 RecordLog.create(:text => "created a new record")
end


that code takes some input from a browser (in params), instantiates a
new Record object, then writes it and a RecordLog entry to the database
atomically.
All the Record.transaction does is sends a BEGIN to the db server,
executes the block, and sends a COMMIT (or a ROLLBACK if the block
errors for any reason).
The block needs access to the r object. We could have created that
inside the block, but then it'd need access to the params object. So
without real closure support, Record.transaction would have had to
support passing in arbitrary variables.
Remember, that interview with Matz was in 2003 - more people are using
Ruby for more things nowadays, for uses beyond the imagination of its
creator I'm sure :)

Final Thoughts
I am waiting to be blown away by Ruby and Rails

Posted by tom under Python & Ruby | No Comments »

Python talk for WYLUG, Ruby envy, Haskell Joy.

December 27th 2007

I am just getting a talk ready for WYLUG on Python.

I sent Dave the following blurb:

Why I love Python:

A talk on the programming language Python, in 3 parts (feel free to
leave in the interludes if you have had enough)

Part 1: Past, Present, Future.
A bit of history and the design of the language, a look at all the
implementations available today, quick tour of built-in and commonly
used modules and future plans.

Part 2: Language overview
A quick tour of the language: builtin types, control structures, using
modules etc

Part 3: Recent Magic.
Some relatively recent changes that make programming Python even more
pleasurable.
Decorators, Generators, List comprehensions, Iterators, Functools and
anything else I can fit in.
Again a whirlwind tour, but you should be impressed and want to read
up on these some more

I have been revisiting some of the Python talks I have watched over the last few years for ideas and will update my ComSci page with links.

I stumbled across some excellent video from RubyConf, particularly the Rubinius one. Rubinius is a Ruby VM partially written in Ruby, taking some lessons from Python and Smalltalk. Some of the stuff he bigs up (compiling to bytecode automatically comes to mind) Python has had for ages, but the self-hosting aspect is cool (not as cool as PyPy though). Rubinius seems to be doing what Avi Bryant suggested here: learn from the Smalltalk guys and the papers from the Self team that Sun spun off and later bought back to do the HotSpot VM for Java. Interesting times for dynamic languages: target the JVM or CLR, self-host and generate code in other languages, while always writing in the same fun language. I say Ruby envy only because I think the Ruby community does a better job of looking cool and exciting people than the Python one.

Now Haskell joy. After describing working through Yet Another Haskell Tutorial to the 2 friends doing it with me as “not an obviously pleasurable experience” I had a great moment on the train the other day looking at partial application.
(\y -> y*3)
is Haskell for the anonymous function that takes y and multiplies it by 3 (I wish I had LaTeX here to draw the lambda calculus). What I like is that you can also write that as
(*3)
While this example is trivial, what is happening is interesting. The compiler knows * is an infix operator that takes 2 arguments and that it has been supplied one, so it “partially applies” the function, making (*3) a function that takes one argument. One more thing is changing prefix and infix operators around using ( _ ) and ` _ ` , for example:
3 * 5
(*) 3 5

map (*2) [1,2,3]
(*2) `map` [1,2,3]

I hope this second example is clear, map usually is a prefix function that takes a function and a list and returns a list with the result of applying the function to each element (the return value here would be [2,4,6]). This flexibility is neat and is starting to make Haskell a joy to hack in.
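For comparison on the Ruby side (not something from the Haskell tutorial), Proc#curry gives a similar effect, though you have to ask for it explicitly. A minimal sketch, with multiply, triple and double as made-up names:

```ruby
# A two-argument lambda, curried so it can be given one argument at a time
multiply = lambda { |a, b| a * b }.curry

triple = multiply[3]   # supply one argument, get back a one-argument function
p triple[5]

double = multiply[2]
p [1, 2, 3].map(&double)
```

Here triple plays the role of (*3), and the map line mirrors map (*2) [1,2,3].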

Merry Christmas,

Posted by tom under haskell & Python & Ruby | No Comments »