Archive for May, 2011

Finding information on Hive tables from HDFS

May 16th 2011

I was curious about our Hive tables total usage on HDFS and what the average filesize was with the current partitioning scheme so wrote this ruby script to calculate it.

current = ''
file_count = 0
total_size = 0

output = File.open('output.csv','w')

IO.popen('hadoop fs -lsr /user/hive/warehouse').each_line do |line|
  split = line.split(/\s+/)
  #permissions,replication,user,group,size,mod_date,mod_time,path
  next unless split.size == 8
  path = split[7]
  size = split[4]
  permissions = split[0]
  tablename=path.split('/')[4]
  if tablename != current
    average_size = file_count == 0 ? 0 : total_size/file_count
    result = "#{current},#{file_count},#{total_size},#{average_size}"
    unless current==''
      puts result
      output.puts result
    end
    total_size = 0
    current = tablename
    file_count = 0
  end
  file_count += 1 unless permissions[0] == 'd'
  total_size += size.to_i
end

Lots of our files were small so I am going to experiment with different partitioning and compression schemes.

Posted by tom under hadoop & hive & Ruby | No Comments »

Running –repair on MongoDB via Upstart

May 13th 2011

One of our servers running MongoDB crashed today and we encountered the typical

old lock file: /var/lib/mongodb/mongod.lock. probably means unclean shutdown
recommend removing file and running –repair
see: http://dochub.mongodb.org/core/repair for more information

As the docs do not seem to have much of an alternative to running –repair I looked for a way to automate it from upstart. Mongo creates a mongod.lock file in the data directory with the processes PID in and on a safe shutdown removes the PID, leaving the file there.

This upstart scripts includes a pre-start script that checks if the lock file exists, reads it, makes sure there is a PID there, makes sure no mongod processes exist with that PID then performs the repair as the mongodb user.

limit nofile 20000 20000

kill timeout 300

env MONGO_DATA=/var/lib/mongodb/
env MONGO_LOGS=/var/log/mongodb/
env MONGO_EXE=/usr/bin/mongod
env MONGO_CONF=/etc/mongodb.conf

pre-start script
  mkdir -p $MONGO_DATA
  mkdir -p $MONGO_LOGS
  if [ -f $MONGO_DATA/mongod.lock ]; then
    mongo_pid=`cat $MONGO_DATA/mongod.lock`
    if [ ! -z $mongo_pid ]; then
      if [ ! `pgrep mongo | grep "$mongo_pid" | wc -l` -gt 0 ]; then
        rm $MONGO_DATA/mongod.lock
        sudo -u mongodb /usr/bin/mongod --config /etc/mongodb.conf --repair
        touch $MONGO_DATA/repaired-`date "+%Y%m%d-%H%M%S"`
      fi
    fi
  fi  
end script

start on runlevel [2345]
stop on runlevel [06]

script
  if [ -f /etc/default/mongodb ]; then . /etc/default/mongodb; fi
  exec start-stop-daemon --start --quiet --chuid mongodb --exec  $MONGO_EXE -- --config $MONGO_CONF
end script

Posted by tom under devops & linux & mongodb | 1 Comment »