Finding information on Hive tables from HDFS
May 16th 2011
I was curious about our Hive tables' total usage on HDFS and what the average file size was under the current partitioning scheme, so I wrote this Ruby script to calculate it.
current = ''
file_count = 0
total_size = 0

File.open('output.csv', 'w') do |output|
  # Write one CSV row for the current table: name, file count, total bytes, average bytes.
  flush = lambda do
    next if current == ''
    average_size = file_count == 0 ? 0 : total_size / file_count
    result = "#{current},#{file_count},#{total_size},#{average_size}"
    puts result
    output.puts result
  end

  IO.popen('hadoop fs -lsr /user/hive/warehouse').each_line do |line|
    # permissions,replication,user,group,size,mod_date,mod_time,path
    split = line.split(/\s+/)
    next unless split.size == 8
    path = split[7]
    size = split[4]
    permissions = split[0]
    tablename = path.split('/')[4]
    if tablename != current
      # The listing has moved on to a new table, so report the previous one and reset.
      flush.call
      total_size = 0
      current = tablename
      file_count = 0
    end
    # Directories are listed too; count only regular files.
    file_count += 1 unless permissions.start_with?('d')
    total_size += size.to_i
  end

  # The last table never triggers a name change, so write it out explicitly.
  flush.call
end

Lots of our files were small, so I am going to experiment with different partitioning and compression schemes.
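To see which tables are the worst offenders, something like the following could be run against output.csv afterwards. This is only a rough sketch: the small_file_tables helper and the 64 MB threshold are my own assumptions (64 MB is a common HDFS block size, but check what your cluster uses), and it expects the four-column rows the script above writes.

```ruby
# Hypothetical threshold: 64 MB, a common HDFS block size. Adjust for your cluster.
SMALL_FILE_THRESHOLD = 64 * 1024 * 1024

# Takes lines of "table,file_count,total_size,average_size" and returns
# [table, file_count, average_size] for every table that has files and whose
# average file size is below the threshold, smallest average first.
def small_file_tables(csv_lines, threshold = SMALL_FILE_THRESHOLD)
  rows = csv_lines.map do |line|
    table, file_count, _total_size, average_size = line.strip.split(',')
    [table, file_count.to_i, average_size.to_i]
  end
  rows.select { |_, count, avg| count > 0 && avg < threshold }
      .sort_by { |_, _, avg| avg }
end

# Only runs when executed directly and output.csv is present.
if __FILE__ == $0 && File.exist?('output.csv')
  small_file_tables(File.readlines('output.csv')).each do |table, count, avg|
    puts "#{table}: #{count} files, average #{avg} bytes"
  end
end
```

Sorting by average size puts the tables most likely to benefit from coarser partitioning at the top of the list.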