<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>thattommyhall.com &#187; Ruby</title>
	<atom:link href="http://www.thattommyhall.com/category/ruby/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.thattommyhall.com</link>
	<description>A Random Walk Through Idea Space</description>
	<lastBuildDate>Sun, 08 Jan 2012 11:42:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Compressing Text Tables In Hive</title>
		<link>http://www.thattommyhall.com/2011/06/01/compressing-text-tables-in-hive/</link>
		<comments>http://www.thattommyhall.com/2011/06/01/compressing-text-tables-in-hive/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 10:29:36 +0000</pubDate>
		<dc:creator>tom</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.thattommyhall.com/?p=690</guid>
		<description><![CDATA[At Forward we have been using Hive for a while and started out with the default table type (uncompressed text) and wanted to see if we could save some space and not lose too much performance. The wiki page HiveCompressedStorage lists the possibilities. Basically you have 3 decisions: TextFile or SequenceFile tables TextFile Can be [...]]]></description>
			<content:encoded><![CDATA[<p>At Forward we have been using Hive for a while and started out with the default table type (uncompressed text) and wanted to see if we could save some space and not lose too much performance.</p>
<p>The wiki page <a href="http://wiki.apache.org/hadoop/Hive/CompressedStorage">HiveCompressedStorage</a> lists the possibilities. </p>
<p>Basically you have 3 decisions:<br />
<strong>TextFile or SequenceFile tables</strong><br />
TextFile</p>
<ul>
<li>Can be compressed in place. </li>
<li>Can gzip/bzip before you LOAD DATA into your table</li>
<li>Only gzip/bzip are supported</li>
<li>Gzip is not splittable</li>
</ul>
<p>SequenceFile</p>
<ul>
<li>Need to create a SequenceFile table and do a SELECT/INSERT into it</li>
<li>Can use any supported compression codec</li>
<li>All compression codecs are splittable. All the cool kids use <a href="https://github.com/toddlipcon/hadoop-lzo">LZO</a> or <a href="http://code.google.com/p/hadoop-snappy/">Snappy</a></li>
<li><strong>Does not work</strong>- At least <a href="http://mail-archives.apache.org/mod_mbox/hive-user/201105.mbox/%3CBANLkTim_VG92dnG+fxC89NTSKAJBVvKgMw@mail.gmail.com%3E">for me</a> (help appreciated!)</li>
</ul>
<p><strong>Which compression algorithm</strong></p>
<ul>
<li>gzip &#8211; Quite slow, good compression, not splittable, supported in TextFile tables</li>
<li>bzip &#8211; Slowest, best compression, splittable, supported in TextFile tables</li>
<li><a href="https://github.com/toddlipcon/hadoop-lzo">LZO</a> &#8211; Not in the standard distro (licensing issues), fast, splittable</li>
<li><a href="http://code.google.com/p/hadoop-snappy/">Snappy</a> &#8211; New from Google, not in the standard distro (but licence compatible), very fast</li>
</ul>
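<p>To get a rough feel for what gzip does to delimited text before touching a real table, a few lines of Ruby are enough. This is only a sketch: the sample rows are made up, <code>gzip_ratio</code> is a hypothetical helper, and <code>Zlib.gzip</code> needs Ruby 2.4 or later. Real tables will compress differently.</p>

```ruby
require 'zlib'

# Fraction of the original bytes that remain after gzipping the text.
def gzip_ratio(text)
  return 0.0 if text.empty?
  Zlib.gzip(text).bytesize.to_f / text.bytesize
end

# Hypothetical tab-delimited rows, only loosely shaped like a keywords table.
sample = Array.new(1000) do |i|
  "account#{i % 5}\tcampaign#{i % 20}\tkeyword#{i}\t#{i % 7}\n"
end.join

puts format('gzip keeps %.1f%% of the original bytes', gzip_ratio(sample) * 100)
```

<p>Repetitive columns such as account or campaign names are what make text tables compress well, so a sample taken from a real partition is a better guide than synthetic rows like these.</p>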
<p><strong>Block or Record compression (for SequenceFile tables) </strong><br />
The docs say </p>
<blockquote><p>The value for io.seqfile.compression.type determines how the compression is performed. If you set it to RECORD you will get as many output files as the number of map/reduce jobs. If you set it to BLOCK, you will get as many output files as there were input files. There is a tradeoff involved here &#8212; large number of output files => more parallel map jobs => lower compression ratio.</p></blockquote>
<p>But I got the same number of files regardless of what I selected, and the total size suggested they were not even compressed, so I don&#8217;t know what is going on. </p>
<p>For simplicity I chose gzipped TextFile tables because:</p>
<ul>
<li>It worked (always criterion zero)</li>
<li>Most of our files were not huge anyway and the technique described below keeps some of the parallelism</li>
<li>Can be done on the table in place</li>
<li>Each partition can be compressed separately </li>
<li>The space is saved incrementally and realised immediately </li>
<li>Testing showed for our load it was not much of a performance hit</li>
<li>We are feeling more pain on space than query performance at the moment (our hourly runs complete in ~20mins)</li>
</ul>
<p><div id="gist-1000863" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'><span class="nb">require</span> <span class="s1">&#39;rubygems&#39;</span></div><div class='line' id='LC2'><span class="nb">require</span> <span class="s1">&#39;date&#39;</span></div><div class='line' id='LC3'><span class="nb">require</span> <span class="s1">&#39;rbhive&#39;</span></div><div class='line' id='LC4'><br/></div><div class='line' id='LC5'><span class="n">countrys</span> <span class="o">=</span> <span class="sx">%w[at au br de dk es fr in it jp mx nl no pl pt ru se uk us za]</span></div><div class='line' id='LC6'><span class="n">dates</span> <span class="o">=</span> <span class="p">(</span><span class="no">Date</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="s1">&#39;2011-01-01&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">.</span><span class="no">Date</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="s1">&#39;2011-04-30&#39;</span><span class="p">))</span></div><div class='line' id='LC7'><br/></div><div class='line' id='LC8'><span class="no">RBHive</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s1">&#39;hiveserver&#39;</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">con</span><span class="o">|</span></div><div class='line' id='LC9'>&nbsp;&nbsp;<span class="n">dates</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">date</span><span class="o">|</span></div><div class='line' id='LC10'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">countrys</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">country</span><span class="o">|</span></div><div class='line' id='LC11'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">query</span> <span 
class="o">=</span> <span class="s2">&quot;insert overwrite table keywords partition (dated=&#39;</span><span class="si">#{</span><span class="n">date</span><span class="si">}</span><span class="s2">&#39;, country = &#39;</span><span class="si">#{</span><span class="n">country</span><span class="si">}</span><span class="s2">&#39;)</span></div><div class='line' id='LC12'><span class="s2">              select account,campaign,ad_group,keyword_id,keyword,match_type,status,</span></div><div class='line' id='LC13'><span class="s2">              first_page_bid,quality_score,distribution,max_cpc,destination_url,ad_group_status,</span></div><div class='line' id='LC14'><span class="s2">              campaign_status,currency_code,impressions,clicks,ctr,cpc,</span></div><div class='line' id='LC15'><span class="s2">              cost,avg_position,account_id,campaign_id,adgroup_id </span></div><div class='line' id='LC16'><span class="s2">              from keywords where dated=&#39;</span><span class="si">#{</span><span class="n">date</span><span class="si">}</span><span class="s2">&#39; and country=&#39;</span><span class="si">#{</span><span class="n">country</span><span class="si">}</span><span class="s2">&#39;&quot;</span></div><div class='line' id='LC17'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">begin</span></div><div class='line' id='LC18'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">con</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">&#39;mapred.output.compression.codec&#39;</span><span class="p">,</span><span class="s1">&#39;org.apache.hadoop.io.compress.GzipCodec&#39;</span><span class="p">)</span></div><div class='line' id='LC19'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">con</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">&#39;hive.exec.compress.output&#39;</span><span class="p">,</span><span 
class="s1">&#39;true&#39;</span><span class="p">)</span></div><div class='line' id='LC20'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">con</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">&#39;mapred.output.compress&#39;</span><span class="p">,</span><span class="s1">&#39;true&#39;</span><span class="p">)</span></div><div class='line' id='LC21'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">con</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">&#39;mapred.compress.map.output&#39;</span><span class="p">,</span><span class="s1">&#39;true&#39;</span><span class="p">)</span></div><div class='line' id='LC22'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">con</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">&#39;hive.merge.mapredfiles&#39;</span><span class="p">,</span><span class="s1">&#39;true&#39;</span><span class="p">)</span></div><div class='line' id='LC23'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">con</span><span class="o">.</span><span class="n">set</span><span class="p">(</span><span class="s1">&#39;hive.merge.mapfiles&#39;</span><span class="p">,</span><span class="s1">&#39;true&#39;</span><span class="p">)</span></div><div class='line' id='LC24'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">con</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">query</span><span class="p">)</span></div><div class='line' id='LC25'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">rescue</span> <span class="o">=&gt;</span> <span class="n">e</span></div><div class='line' id='LC26'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="nb">puts</span> <span class="s2">&quot;#########################&quot;</span></div><div class='line' 
id='LC27'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="nb">puts</span> <span class="n">e</span><span class="o">.</span><span class="n">message</span></div><div class='line' id='LC28'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="nb">puts</span> <span class="s2">&quot;#########################&quot;</span></div><div class='line' id='LC29'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">end</span></div><div class='line' id='LC30'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">end</span></div><div class='line' id='LC31'>&nbsp;&nbsp;<span class="k">end</span> </div><div class='line' id='LC32'><span class="k">end</span> </div><div class='line' id='LC33'><br/></div></pre></div>
          </div>

          <div class="gist-meta">
            <a href="https://gist.github.com/raw/1000863/6ddbf406275e09a95b30a1203f522e493e0ea9e8/compress_keywords.rb" style="float:right;">view raw</a>
            <a href="https://gist.github.com/1000863#file_compress_keywords.rb" style="float:right;margin-right:10px;color:#666">compress_keywords.rb</a>
            <a href="https://gist.github.com/1000863">This Gist</a> brought to you by <a href="http://github.com">GitHub</a>.
          </div>
        </div>
</div>
<br />
This will loop through the partitions (date/country) and do an INSERT OVERWRITE from/to that partition using our <a href="https://github.com/forward/rbhive">rbhive</a> gem. This is good because Hive reads the old data via map/reduce jobs, writes the output to /tmp, deletes the old folder and then imports the new compressed version. You need to select the columns explicitly because the target partition has 2 fewer fields (dated and country are partition columns, so they are not stored in the data files). As we had 2 levels of partitioning and lots of big files this ran within a day on a 2TB table, saving us around 5TB (replication factor is 3).</p>
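<p>The per-partition rewrite can be sketched without a Hive connection by looking only at the query it generates. The helper name <code>rewrite_partition_query</code> and the shortened column list are assumptions made for illustration; the real script passes the full column list and sends the query through rbhive.</p>

```ruby
require 'date'

# Build the INSERT OVERWRITE that reads one partition and writes it back in place.
# The partition columns (dated, country) are left out of the SELECT list on purpose.
def rewrite_partition_query(table, columns, date, country)
  "insert overwrite table #{table} " \
    "partition (dated='#{date}', country='#{country}') " \
    "select #{columns.join(',')} from #{table} " \
    "where dated='#{date}' and country='#{country}'"
end

# Hypothetical subset of the keywords columns, just to keep the output short.
columns = %w[account campaign keyword clicks cost]
puts rewrite_partition_query('keywords', columns, Date.parse('2011-01-01'), 'uk')
```

<p>Keeping the query construction in one small function makes it easy to print and check a statement before letting it overwrite a partition.</p>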
<p>You could actually download the data, compress it and put it straight back onto HDFS, as Hive only knows the layout of the folders on HDFS and not what is inside them, but I thought it better to do it via Hive and let Hadoop parallelise it. I would have carried on doing it this way but with other tables it was too slow (too many partitions, and the Hive server is difficult to parallelise). I stopped using rbhive, dropped to using hive -e to execute the queries and used the lovely autopartitioning in later Hive versions. Notice you can SELECT * now and it automatically does what it needs to in order to insert results into the correct partitions. </p>
<p><div id="gist-1002077" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'><span class="nb">require</span> <span class="s1">&#39;rubygems&#39;</span></div><div class='line' id='LC2'><span class="nb">require</span> <span class="s1">&#39;date&#39;</span></div><div class='line' id='LC3'><br/></div><div class='line' id='LC4'><span class="n">countrys</span> <span class="o">=</span> <span class="sx">%w[at au br de dk es fr in int it jp kr mx nl no pl pt ru se uk us za]</span></div><div class='line' id='LC5'><br/></div><div class='line' id='LC6'><span class="n">dates</span> <span class="o">=</span> <span class="p">(</span><span class="no">Date</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="s1">&#39;2010-12-02&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">.</span><span class="no">Date</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="s1">&#39;2011-05-01&#39;</span><span class="p">))</span></div><div class='line' id='LC7'><br/></div><div class='line' id='LC8'><span class="n">dates</span><span class="o">.</span><span class="n">each</span> <span class="k">do</span> <span class="o">|</span><span class="n">date</span><span class="o">|</span></div><div class='line' id='LC9'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">=</span> <span class="s2">&quot;&quot;</span></div><div class='line' id='LC10'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">+=</span> <span class="s2">&quot;SET hive.exec.compress.output=true;&quot;</span></div><div class='line' id='LC11'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">+=</span> <span class="s2">&quot;SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;&quot;</span></div><div class='line' id='LC12'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">+=</span> <span class="s2">&quot;set mapred.job.priority=VERY_LOW;&quot;</span> </div><div class='line' 
id='LC13'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">+=</span> <span class="s2">&quot;set hive.exec.dynamic.partition=true;&quot;</span></div><div class='line' id='LC14'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">+=</span> <span class="s2">&quot;set mapred.output.compress=true;&quot;</span></div><div class='line' id='LC15'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">+=</span> <span class="s2">&quot;set mapred.compress.map.output=true;&quot;</span></div><div class='line' id='LC16'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">+=</span> <span class="s2">&quot;set hive.merge.mapredfiles=true;&quot;</span></div><div class='line' id='LC17'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">+=</span> <span class="s2">&quot;set hive.merge.mapfiles=true;&quot;</span></div><div class='line' id='LC18'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">+=</span> <span class="s2">&quot;insert overwrite table hourly_clicks </span></div><div class='line' id='LC19'><span class="s2">            partition (dated=&#39;</span><span class="si">#{</span><span class="n">date</span><span class="si">}</span><span class="s2">&#39;, country, hour) </span></div><div class='line' id='LC20'><span class="s2">            select * from hourly_clicks where dated=&#39;</span><span class="si">#{</span><span class="n">date</span><span class="si">}</span><span class="s2">&#39;&quot;</span></div><div class='line' id='LC21'>&nbsp;&nbsp;<span class="n">query</span> <span class="o">=</span> <span class="s2">&quot;hive -e </span><span class="se">\&quot;</span><span class="si">#{</span><span class="n">query</span><span class="si">}</span><span class="se">\&quot;</span><span class="s2">&quot;</span></div><div class='line' id='LC22'>&nbsp;&nbsp;<span class="nb">puts</span> <span class="s2">&quot;running </span><span class="si">#{</span><span class="n">query</span><span class="si">}</span><span class="s2">&quot;</span></div><div class='line' 
id='LC23'>&nbsp;&nbsp;<span class="sb">`</span><span class="si">#{</span><span class="n">query</span><span class="si">}</span><span class="sb">`</span></div><div class='line' id='LC24'><span class="k">end</span></div><div class='line' id='LC25'><br/></div></pre></div>
          </div>

          <div class="gist-meta">
            <a href="https://gist.github.com/raw/1002077/2a819261336a38866a1a9f505b9adef42ebd64c6/compress_hive_cli.rb" style="float:right;">view raw</a>
            <a href="https://gist.github.com/1002077#file_compress_hive_cli.rb" style="float:right;margin-right:10px;color:#666">compress_hive_cli.rb</a>
            <a href="https://gist.github.com/1002077">This Gist</a> brought to you by <a href="http://github.com">GitHub</a>.
          </div>
        </div>
</div>
<br />
The key difference is partition (dated=&#39;#{date}&#39;, country, hour): we have not specified a country or hour value, so Hive fills them in automatically from the selected rows. This ran loads faster than looping over the partitions, as it lets Hive schedule lots more mapreduce jobs at once. If you set hive.exec.dynamic.partition.mode=nonstrict as well, you can leave out the partition values entirely (I did this as a test but kept the WHERE clause, I was scared to do it all at once!)</p>
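<p>One fragile spot in the hive -e approach is quoting, since the whole script is passed as a single shell argument. A minimal sketch of one way to handle that, using Ruby&#8217;s standard <code>Shellwords</code> module, follows. The helper name <code>hive_cli_command</code> is hypothetical and only two of the settings are shown; nothing here runs Hive, it just builds the command line.</p>

```ruby
require 'shellwords'

# Join the SET statements and the query into one script and quote it
# so the shell hands it to `hive -e` as a single argument.
def hive_cli_command(settings, query)
  statements = settings.map { |key, value| "SET #{key}=#{value};" }
  script = (statements + [query]).join(' ')
  "hive -e #{Shellwords.escape(script)}"
end

settings = {
  'hive.exec.compress.output'   => 'true',
  'hive.exec.dynamic.partition' => 'true'
}
query = "insert overwrite table hourly_clicks partition (dated='2011-01-01', country, hour) " \
        "select * from hourly_clicks where dated='2011-01-01'"

puts hive_cli_command(settings, query)
```

<p>The escaped output is ugly to read but survives quotes inside the query, which the hand-written <code>\"</code> wrapping only does as long as the query itself contains no double quotes.</p>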
<p>The reason I am not (very) worried about losing parallelism is that some of our partitions contained big .csv&#8217;s and the output of INSERT OVERWRITE was multiple .gz files (it looked to me like as many as there were mappers, for example a 700M text file became ~10 .gz files), so they will still be read in parallel by mappers as the original CSV was.</p>
<p>I am open to suggestions about better ways to achieve this; doing it this way does not preclude doing something better later.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thattommyhall.com/2011/06/01/compressing-text-tables-in-hive/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Finding information on Hive tables from HDFS</title>
		<link>http://www.thattommyhall.com/2011/05/16/hive-size-hdfs/</link>
		<comments>http://www.thattommyhall.com/2011/05/16/hive-size-hdfs/#comments</comments>
		<pubDate>Mon, 16 May 2011 16:42:07 +0000</pubDate>
		<dc:creator>tom</dc:creator>
				<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hive]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.thattommyhall.com/?p=686</guid>
		<description><![CDATA[I was curious about our Hive tables&#8217; total usage on HDFS and what the average file size was with the current partitioning scheme, so I wrote this Ruby script to calculate it. Lots of our files were small so I am going to experiment with different partitioning and compression schemes.]]></description>
			<content:encoded><![CDATA[<p>I was curious about our Hive tables&#8217; total usage on HDFS and what the average file size was with the current partitioning scheme, so I wrote this Ruby script to calculate it.</p>
<p><div id="gist-974792" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'><span class="n">current</span> <span class="o">=</span> <span class="s1">&#39;&#39;</span></div><div class='line' id='LC2'><span class="n">file_count</span> <span class="o">=</span> <span class="mi">0</span></div><div class='line' id='LC3'><span class="n">total_size</span> <span class="o">=</span> <span class="mi">0</span></div><div class='line' id='LC4'><br/></div><div class='line' id='LC5'><span class="n">output</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">&#39;output.csv&#39;</span><span class="p">,</span><span class="s1">&#39;w&#39;</span><span class="p">)</span></div><div class='line' id='LC6'><br/></div><div class='line' id='LC7'><span class="no">IO</span><span class="o">.</span><span class="n">popen</span><span class="p">(</span><span class="s1">&#39;hadoop fs -lsr /user/hive/warehouse&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">each_line</span> <span class="k">do</span> <span class="o">|</span><span class="n">line</span><span class="o">|</span></div><div class='line' id='LC8'>&nbsp;&nbsp;<span class="nb">split</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">/\s+/</span><span class="p">)</span></div><div class='line' id='LC9'>&nbsp;&nbsp;<span class="c1">#permissions,replication,user,group,size,mod_date,mod_time,path</span></div><div class='line' id='LC10'>&nbsp;&nbsp;<span class="k">next</span> <span class="k">unless</span> <span class="nb">split</span><span class="o">.</span><span class="n">size</span> <span class="o">==</span> <span class="mi">8</span></div><div class='line' id='LC11'>&nbsp;&nbsp;<span class="n">path</span> <span class="o">=</span> <span class="nb">split</span><span class="o">[</span><span class="mi">7</span><span 
class="o">]</span></div><div class='line' id='LC12'>&nbsp;&nbsp;<span class="n">size</span> <span class="o">=</span> <span class="nb">split</span><span class="o">[</span><span class="mi">4</span><span class="o">]</span></div><div class='line' id='LC13'>&nbsp;&nbsp;<span class="n">permissions</span> <span class="o">=</span> <span class="nb">split</span><span class="o">[</span><span class="mi">0</span><span class="o">]</span></div><div class='line' id='LC14'>&nbsp;&nbsp;<span class="n">tablename</span><span class="o">=</span><span class="n">path</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">&#39;/&#39;</span><span class="p">)</span><span class="o">[</span><span class="mi">4</span><span class="o">]</span></div><div class='line' id='LC15'>&nbsp;&nbsp;<span class="k">if</span> <span class="n">tablename</span> <span class="o">!=</span> <span class="n">current</span></div><div class='line' id='LC16'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">average_size</span> <span class="o">=</span> <span class="n">file_count</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">?</span> <span class="mi">0</span> <span class="p">:</span> <span class="n">total_size</span><span class="o">/</span><span class="n">file_count</span></div><div class='line' id='LC17'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">result</span> <span class="o">=</span> <span class="s2">&quot;</span><span class="si">#{</span><span class="n">current</span><span class="si">}</span><span class="s2">,</span><span class="si">#{</span><span class="n">file_count</span><span class="si">}</span><span class="s2">,</span><span class="si">#{</span><span class="n">total_size</span><span class="si">}</span><span class="s2">,</span><span class="si">#{</span><span class="n">average_size</span><span class="si">}</span><span class="s2">&quot;</span></div><div class='line' id='LC18'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">unless</span> <span 
class="n">current</span><span class="o">==</span><span class="s1">&#39;&#39;</span></div><div class='line' id='LC19'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="nb">puts</span> <span class="n">result</span></div><div class='line' id='LC20'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">output</span><span class="o">.</span><span class="n">puts</span> <span class="n">result</span></div><div class='line' id='LC21'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">end</span></div><div class='line' id='LC22'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">total_size</span> <span class="o">=</span> <span class="mi">0</span></div><div class='line' id='LC23'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">current</span> <span class="o">=</span> <span class="n">tablename</span></div><div class='line' id='LC24'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">file_count</span> <span class="o">=</span> <span class="mi">0</span></div><div class='line' id='LC25'>&nbsp;&nbsp;<span class="k">end</span></div><div class='line' id='LC26'>&nbsp;&nbsp;<span class="n">file_count</span> <span class="o">+=</span> <span class="mi">1</span> <span class="k">unless</span> <span class="n">permissions</span><span class="o">[</span><span class="mi">0</span><span class="o">]</span> <span class="o">==</span> <span class="s1">&#39;d&#39;</span></div><div class='line' id='LC27'>&nbsp;&nbsp;<span class="n">total_size</span> <span class="o">+=</span> <span class="n">size</span><span class="o">.</span><span class="n">to_i</span></div><div class='line' id='LC28'><span class="k">end</span></div></pre></div>
          </div>

          <div class="gist-meta">
            <a href="https://gist.github.com/raw/974792/f07207bc776f669d93e9d669f7fee539057d5613/hive_info.rb" style="float:right;">view raw</a>
            <a href="https://gist.github.com/974792#file_hive_info.rb" style="float:right;margin-right:10px;color:#666">hive_info.rb</a>
            <a href="https://gist.github.com/974792">This Gist</a> brought to you by <a href="http://github.com">GitHub</a>.
          </div>
        </div>
</div>
</p>
<p>Lots of our files were small so I am going to experiment with different partitioning and compression schemes.</p>
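<p>As posted, the script only writes a table&#8217;s totals when the next table name appears, so the last table in the listing looks like it never gets reported. A possible variation that avoids this by collecting everything into a hash first is sketched below. The function name <code>table_stats</code> and the sample listing lines are made up for illustration; the field layout is the one described in the script&#8217;s own comment.</p>

```ruby
# Summarise `hadoop fs -lsr` style lines into per-table file counts and byte totals.
# Fields per line: permissions, replication, user, group, size, mod_date, mod_time, path
def table_stats(lines, depth = 4)
  stats = Hash.new { |hash, table| hash[table] = { files: 0, bytes: 0 } }
  lines.each do |line|
    fields = line.split(/\s+/)
    next unless fields.size == 8
    permissions = fields[0]
    size = fields[4]
    path = fields[7]
    next if permissions.start_with?('d') # directories hold no data themselves
    table = path.split('/')[depth]       # /user/hive/warehouse/TABLE/...
    next if table.nil?
    stats[table][:files] += 1
    stats[table][:bytes] += size.to_i
  end
  stats
end

# Hypothetical listing; in practice pass IO.popen('hadoop fs -lsr /user/hive/warehouse').
listing = [
  'drwxr-xr-x   - tom supergroup          0 2011-05-16 16:00 /user/hive/warehouse/keywords',
  '-rw-r--r--   3 tom supergroup       1024 2011-05-16 16:01 /user/hive/warehouse/keywords/dated=2011-01-01/part-00000',
  '-rw-r--r--   3 tom supergroup       2048 2011-05-16 16:02 /user/hive/warehouse/keywords/dated=2011-01-02/part-00000',
  '-rw-r--r--   3 tom supergroup        500 2011-05-16 16:03 /user/hive/warehouse/clicks/part-00000'
]

table_stats(listing).each do |table, s|
  puts "#{table},#{s[:files]},#{s[:bytes]},#{s[:bytes] / s[:files]}"
end
```

<p>The output columns match the original CSV (table, file count, total size, average size), with the average rounded down by integer division.</p>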
]]></content:encoded>
			<wfw:commentRss>http://www.thattommyhall.com/2011/05/16/hive-size-hdfs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Signals In Ruby / &#8220;rescue Exception&#8221; considered harmful</title>
		<link>http://www.thattommyhall.com/2011/02/24/rescue-exception-harmful-signals-in-ruby/</link>
		<comments>http://www.thattommyhall.com/2011/02/24/rescue-exception-harmful-signals-in-ruby/#comments</comments>
		<pubDate>Thu, 24 Feb 2011 18:35:43 +0000</pubDate>
		<dc:creator>tom</dc:creator>
				<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.thattommyhall.com/?p=611</guid>
		<description><![CDATA[Yesterday we had an issue with the different behaviour of &#8220;kill &lt;OUR_APP&gt;&#8221; and &#8220;kill -9 &lt;OUR_APP&gt;&#8221; and in the process I had to refresh my knowledge of Unix signals, learn how you handle them in Ruby and properly learn Ruby&#8217;s exception hierarchy. To -9 or not to -9? The unix kill command is perhaps strangely [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday we had an issue with the different behaviour of &#8220;kill &lt;OUR_APP&gt;&#8221; and &#8220;kill -9 &lt;OUR_APP&gt;&#8221; and in the process I had to refresh my knowledge of Unix signals, learn how you handle them in Ruby and properly learn Ruby&#8217;s exception hierarchy.</p>
<p><strong>To -9 or not to -9?</strong><br />
The unix kill command is perhaps strangely named, as it actually sends signals to processes (see &#8220;man signal&#8221; for a full list). It defaults to sending SIGTERM, and the application writer can decide how to treat it by &#8220;trapping&#8221; it, allowing for a safe shutdown, debug dumps etc. &#8220;kill -9&#8221; sends SIGKILL; SIGKILL and SIGSTOP cannot be caught, blocked, or ignored by your programs.<br />
I think in the first instance you should just use &#8220;kill&#8221;, give the app the chance to do the right thing then get -9 on its ass if you need to.</p>
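<p>What &#8220;doing the right thing&#8221; on SIGTERM might look like is sketched below: the handler only flips a flag and the main loop finishes its current unit of work before exiting. The <code>PoliteWorker</code> class and its method names are hypothetical, and the limit argument exists only so the example terminates on its own.</p>

```ruby
class PoliteWorker
  attr_reader :completed

  def initialize
    @stop = false
    @completed = 0
  end

  def stop_requested?
    @stop
  end

  # Keep the handler tiny: flip a flag and let the main loop notice it.
  def install_trap(signal = 'TERM')
    Signal.trap(signal) { @stop = true }
  end

  # Run the block repeatedly until a stop is requested or the optional limit is hit.
  def run(limit = nil)
    until @stop || (limit && @completed == limit)
      yield
      @completed += 1
    end
    puts "shutting down cleanly after #{@completed} unit(s) of work"
  end
end

worker = PoliteWorker.new
worker.install_trap
worker.run(3) { sleep 0.01 } # each iteration stands in for real work
```

<p>Because a plain &#8220;kill&#8221; gives the process this chance to stop between units of work, it is worth trying before reaching for -9.</p>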
<p><strong>Catching signals in Ruby</strong></p>
<pre class="brush: ruby; title: ; notranslate">puts &quot;I have PID #{Process.pid}&quot;

Signal.trap(&quot;USR1&quot;) {puts &quot;prodded me&quot;}

loop do
  sleep 5
  puts &quot;doing stuff&quot;
end</pre>
<p>This is about the simplest code that will trap the &#8220;USR1&#8221; signal (which you can send with &#8220;kill -USR1 &lt;PID&gt;&#8221;). The USR1 and USR2 signals are left free for you to use however you wish in your applications.</p>
<p>If you look at the image below you can see that it responds to the USR1 signal I send it, and that kill (i.e. sending SIGTERM) also works.<br />
<a href="http://www.thattommyhall.com/wp-content/uploads/2011/02/1-simple-small.png"><img src="http://www.thattommyhall.com/wp-content/uploads/2011/02/1-simple-small.png" alt="" title="1-simple-small" width="661" height="168" class="alignleft size-full wp-image-618" /></a></p>
<p>The following two code snippets are the same except that the first uses a bare rescue (which defaults to StandardError) and the second rescues Exception (i.e. <strong>any</strong> exception).</p>
<pre class="brush: ruby; title: ; notranslate">#sig-rescue.rb
puts &quot;I have PID #{Process.pid}&quot;

Signal.trap(&quot;USR1&quot;) {puts &quot;prodded me&quot;}

loop do
  begin
    puts &quot;doing stuff&quot;
    sleep 10
  rescue =&gt; e
    puts e.inspect
  end
end</pre>
<p><a href="http://www.thattommyhall.com/wp-content/uploads/2011/02/2-rescue-small.png"><img src="http://www.thattommyhall.com/wp-content/uploads/2011/02/2-rescue-small.png" alt="" title="2-rescue-small" width="662" height="140" class="alignleft size-full wp-image-619" /></a><br />
So that still works as before and errors in our &#8220;do stuff&#8221; loop would get caught.</p>
<pre class="brush: ruby; title: ; notranslate">#sig-rescue-E.rb
puts &quot;I have PID #{Process.pid}&quot;

Signal.trap(&quot;USR1&quot;) {puts &quot;prodded me&quot;}

loop do
  begin
    puts &quot;doing stuff&quot;
    sleep 10
  rescue Exception =&gt; e
    puts e.inspect
  end
end</pre>
<p><a href="http://www.thattommyhall.com/wp-content/uploads/2011/02/3-rescue-E-small.png"><img src="http://www.thattommyhall.com/wp-content/uploads/2011/02/3-rescue-E-small.png" alt="" title="3-rescue-E-small" width="664" height="212" class="alignleft size-full wp-image-620" /></a><br />
This fails though. You can see that SIGTERM no longer works, and CTRL-C from the terminal does not work either. This is because we are catching the SignalException when we do &#8220;rescue Exception&#8221;. kill -9 still worked, since SIGKILL cannot be caught and so will kill any application.</p>
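<p>The difference between the two rescue clauses can be checked without sending any real signals, by raising a SignalException by hand. The two helper methods here are hypothetical names used only for this sketch.</p>

```ruby
# True if a bare `rescue` swallows the error, false if it escapes.
def swallowed_by_bare_rescue?(error)
  begin
    raise error
  rescue            # same as `rescue StandardError`
    true
  end
rescue Exception    # only here so the sketch survives to report a result
  false
end

# True if `rescue Exception` swallows the error (it always will).
def swallowed_by_rescue_exception?(error)
  begin
    raise error
  rescue Exception
    true
  end
end

term = SignalException.new('TERM')
puts swallowed_by_bare_rescue?(term)                      # false
puts swallowed_by_rescue_exception?(term)                 # true
puts swallowed_by_bare_rescue?(RuntimeError.new('boom'))  # true
```

<p>The first result is the important one: with a bare rescue the SignalException escapes the loop, so the process can still be terminated.</p>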
<p><strong>Ruby&#8217;s Exception Hierarchy</strong><br />
The full exception hierarchy (from the excellent <a href="http://blog.nicksieger.com/articles/2006/09/06/rubys-exception-hierarchy">cheat gem</a>) is </p>
<pre class="brush: plain; title: ; notranslate">Tom-Halls-MacBook-Pro:signal tomh$ cheat exceptions
exceptions:
  Exception
   NoMemoryError
   ScriptError
     LoadError
     NotImplementedError
     SyntaxError
   SignalException
     Interrupt
       Timeout::Error    # require 'timeout' for Timeout::Error
   StandardError         # caught by rescue if no type is specified
     ArgumentError
     IOError
       EOFError
     IndexError
     LocalJumpError
     NameError
       NoMethodError
     RangeError
       FloatDomainError
     RegexpError
     RuntimeError
     SecurityError
     SocketError
     SystemCallError
     SystemStackError
     ThreadError
     TypeError
     ZeroDivisionError
   SystemExit
   fatal
</pre>
<p>I think you should only catch StandardError or its children, possibly some of its siblings, and avoid catching Exception, as you probably don&#8217;t want to change how the process deals with signals (you could trap them if you need to).</p>
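<p>To make the difference concrete, here is a minimal sketch that reports which rescue clause ends up catching a given exception. The helper name caught_by is made up for this example; the exceptions are raised by hand rather than delivered as real signals.</p>

```ruby
# Hypothetical helper for illustration: which rescue clause catches what?
def caught_by(exception)
  begin
    raise exception
  rescue                 # bare rescue: StandardError and its children only
    :bare_rescue
  end
rescue Exception         # everything else, including SignalException
  :rescue_exception
end

puts caught_by(RuntimeError.new("oops"))          # bare_rescue
puts caught_by(Interrupt.new("ctrl-c"))           # rescue_exception
puts caught_by(SignalException.new("SIGTERM"))    # rescue_exception
```

<p>Only the RuntimeError is handled by the bare rescue; Interrupt and SignalException pass straight through it, which is why the first script above can still be stopped.</p>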
]]></content:encoded>
			<wfw:commentRss>http://www.thattommyhall.com/2011/02/24/rescue-exception-harmful-signals-in-ruby/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Ruby On Windows &#8211; Forking other processes</title>
		<link>http://www.thattommyhall.com/2011/02/20/ruby-on-windows-running-other-executables/</link>
		<comments>http://www.thattommyhall.com/2011/02/20/ruby-on-windows-running-other-executables/#comments</comments>
		<pubDate>Sun, 20 Feb 2011 23:08:11 +0000</pubDate>
		<dc:creator>tom</dc:creator>
				<category><![CDATA[Ruby]]></category>
		<category><![CDATA[VMware]]></category>

		<guid isPermaLink="false">http://www.thattommyhall.com/?p=540</guid>
		<description><![CDATA[While moving our VM deployment site written in Sinatra to a Windows machine with the VMware PowerCLI toolkit installed the only snag was where we forked a process to do the preparation of the machines. Both Kernel.fork and Process.detach seemed to have issues. Original MRI on Linux IronRuby We tried IronRuby and the same bit [...]]]></description>
			<content:encoded><![CDATA[<p>While moving our VM deployment site, written in Sinatra, to a Windows machine with the VMware PowerCLI toolkit installed, the only snag was where we forked a process to do the preparation of the machines. Both Kernel.fork and Process.detach seemed to have issues.</p>
<p><strong>Original MRI on Linux<br />
</strong></p>
<pre class="brush: ruby; title: ; notranslate">
  def build
    pid = fork { run_command }
    Process.detach(pid)
  end

  def run_command
    `sudo /opt/script/deployserver/setupnewserver.sh -p #{poolserver} -i #{ip} -s #{@size} -v #{@vlan} -a &quot;#{@owner}&quot; -n #{@name} -e &quot;#{@email}&quot;`
  end
</pre>
<p><strong>IronRuby</strong><br />
We tried IronRuby and the same bit of the script broke as on win32 MRI (though I was pleased and surprised that Sinatra worked)</p>
<pre class="brush: ruby; title: ; notranslate">
  def build
    WindowsProcess.start &quot;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe&quot;,
&quot;-PSConsoleFile \&quot;C:\\Program Files (x86)\\VMware\\Infrastructure\\vSphere PowerCLI\\vim.psc1\&quot; \&quot;&amp; C:\\script\\DataStoreUsage.ps1\&quot;&quot;
  end
</pre>
<p>Using the following .NET code</p>
<pre class="brush: ruby; title: ; notranslate">
class WindowsProcess
  def self.start(file, arguments)
    process = System::Diagnostics::Process.new
    process.StartInfo.FileName = file
    process.StartInfo.CreateNoWindow = true
    process.StartInfo.Arguments = arguments
    process.Start
  end
end
</pre>
<p><strong>Workaround using Windows &#8220;start&#8221; command</strong><br />
I had hoped the module at <a href="http://win32utils.rubyforge.org/">win32utils</a> would let me just use the original script, but fork still did not work properly.</p>
<pre class="brush: ruby; title: ; notranslate">
def build
  commandstr = &quot;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -PSConsoleFile \&quot;C:\\Program Files (x86)\\VMware\\Infrastructure\\vSphere PowerCLI\\vim.psc1\&quot; \&quot;&amp; C:\\Sites\\vmdeploy\\PrepNewMachine.ps1 -type #{@type} -machinename #{@name} -size #{@size} -vlan #{@vlan} -creator #{@owner} -creatoremail #{@email} -ipaddress #{ip}&quot;

  system(&quot;start #{commandstr} &gt; ./log/#{@name}.log 2&gt;&amp;1&quot;)
end
</pre>
<p>This uses the Windows &#8220;start&#8221; command and works pretty well.</p>
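<p>On Ruby 1.9 and later, Process.spawn is a fork-free way to start a child process. The sketch below is an assumption on my part and untested against the PowerCLI setup described here; run_in_background is a made-up helper name, and the child is just another Ruby process standing in for the real command.</p>

```ruby
require "rbconfig"

# Hypothetical helper: start a command without waiting for it to finish.
def run_in_background(*command)
  pid = Process.spawn(*command)  # returns immediately with the child's PID
  Process.detach(pid)            # watcher thread reaps the child when it exits
end

# Stand-in command: the current Ruby interpreter running a short sleep.
watcher = run_in_background(RbConfig.ruby, "-e", "sleep 0.1")
puts "started child, carrying on"
watcher.join                     # only here so the demo waits before exiting
```

<p>Process.detach returns the watcher thread, so a caller that does want the exit status later can ask for it with watcher.value.</p>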
]]></content:encoded>
			<wfw:commentRss>http://www.thattommyhall.com/2011/02/20/ruby-on-windows-running-other-executables/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Running Any Executable As A Windows Service (Ruby / Sinatra)</title>
		<link>http://www.thattommyhall.com/2011/02/14/srvany-sinatra-ruby-windows-service/</link>
		<comments>http://www.thattommyhall.com/2011/02/14/srvany-sinatra-ruby-windows-service/#comments</comments>
		<pubDate>Mon, 14 Feb 2011 13:06:52 +0000</pubDate>
		<dc:creator>tom</dc:creator>
				<category><![CDATA[Ruby]]></category>
		<category><![CDATA[sinatra]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://www.thattommyhall.com/?p=541</guid>
		<description><![CDATA[While migrating an automated VM deployment page using a combination of Sinatra on Linux and Bash scripts using the Perl toolkit with a simpler script using the VMWare PowerCLI that I love so much I needed to create a windows service from the Sinatra App and had to do some googleing so I thought I [...]]]></description>
			<content:encoded><![CDATA[<p>While migrating an automated VM deployment page from a combination of <a href="http://www.sinatrarb.com/">Sinatra</a> on Linux and Bash scripts using the Perl toolkit to a simpler script using the VMware PowerCLI that I <a href="http://www.thattommyhall.com/index.php?s=powercli&#038;submit=Search">love so much</a>, I needed to create a Windows service from the Sinatra app. I had to do some googling, so I thought I would share how I did it.</p>
<p>You only need two things &#8211; the built-in &#8220;sc&#8221; command and an executable from <a href="https://www.microsoft.com/downloads/en/details.aspx?FamilyID=9d467a69-57ff-4ae7-96ee-b18c4790cffd&#038;displaylang=en">Windows Server 2003 Resource Kit Tools</a> called srvany (works with 2008 too). Get just that exe <a href="http://dl.dropbox.com/u/2039069/srvany.exe">here</a> (if you trust me of course <img src='http://www.thattommyhall.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />  )</p>
<p><strong>Creating the service</strong><br />
<a href="http://www.thattommyhall.com/wp-content/uploads/2011/02/1-CreateService.png"><img src="http://www.thattommyhall.com/wp-content/uploads/2011/02/1-CreateService.png" alt="" title="1-CreateService" width="669" height="78" class="alignleft size-full wp-image-550" /></a><br />
<strong>Check it exists</strong><br />
<a href="http://www.thattommyhall.com/wp-content/uploads/2011/02/2-service.png"><img src="http://www.thattommyhall.com/wp-content/uploads/2011/02/2-service.png" alt="" title="2-service" width="537" height="25" class="alignleft size-full wp-image-552" /></a><br />
<strong>Set Parameters In The Registry</strong><br />
Configure it at HKLM\SYSTEM\CurrentControlSet\Services\APPNAME\Parameters<br />
<a href="http://www.thattommyhall.com/wp-content/uploads/2011/02/Screen-shot-2011-02-14-at-12.54.15.png"><img src="http://www.thattommyhall.com/wp-content/uploads/2011/02/Screen-shot-2011-02-14-at-12.54.15.png" alt="" title="Screen shot 2011-02-14 at 12.54.15" width="725" height="165" class="alignleft size-full wp-image-561" /></a></p>
<pre class="brush: plain; title: ; notranslate">Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\VMdeploy\Parameters]
&quot;Application&quot;=&quot;C:\\Ruby192\\bin\\ruby&quot;
&quot;AppParameters&quot;=&quot;C:\\Sites\\vmdeploy\\server.rb -p 80&quot;
&quot;AppDirectory&quot;=&quot;C:\\Sites\\vmdeploy&quot;
&quot;AppEnvironment&quot;=hex(7):65,00,78,00,61,00,6d,00,70,00,6c,00,65,00,3d,00,32,00,\
  37,00,00,00,62,00,6c,00,61,00,68,00,3d,00,63,00,3a,00,5c,00,74,00,65,00,6d,\
  00,70,00,66,00,69,00,6c,00,65,00,73,00,00,00,00,00</pre>
<p>Note that AppEnvironment is a multi-string value (REG_MULTI_SZ, written as hex(7) in the .reg file); the rest are plain strings (REG_SZ).</p>
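<p>The hex(7) data is just UTF-16LE text with a NUL after each entry and one more NUL at the end. A minimal Ruby sketch to decode it, assuming the bytes are copied from the .reg export above (APP_ENVIRONMENT_HEX and decode_multi_sz are names made up for this example):</p>

```ruby
# Bytes copied from the AppEnvironment value in the .reg export above.
APP_ENVIRONMENT_HEX =
  "65,00,78,00,61,00,6d,00,70,00,6c,00,65,00,3d,00,32,00," \
  "37,00,00,00,62,00,6c,00,61,00,68,00,3d,00,63,00,3a,00,5c,00,74,00,65,00,6d," \
  "00,70,00,66,00,69,00,6c,00,65,00,73,00,00,00,00,00"

# Hypothetical helper: turn a hex(7) byte list into an array of strings.
def decode_multi_sz(hex)
  bytes = hex.split(",").map { |pair| pair.to_i(16) }
  text = bytes.pack("C*").force_encoding("UTF-16LE").encode("UTF-8")
  text.split("\0")   # split drops the trailing empty entries
end

puts decode_multi_sz(APP_ENVIRONMENT_HEX).inspect
# ["example=27", "blah=c:\\tempfiles"]
```

<p>So the example sets two environment variables for the service: example=27 and blah=c:\tempfiles.</p>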
<p>This lets you run any executable file, change the directory you run it from and pass any arguments or environment variables so should cover most use cases.</p>
<p>I will be sharing the code for both the Sinatra app and the PowerShell deploy script in later posts.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thattommyhall.com/2011/02/14/srvany-sinatra-ruby-windows-service/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Learning Ruby: methods vs procs (or Ruby vs Python?)</title>
		<link>http://www.thattommyhall.com/2010/10/04/learning-ruby-methods-vs-procs-or-ruby-vs-python/</link>
		<comments>http://www.thattommyhall.com/2010/10/04/learning-ruby-methods-vs-procs-or-ruby-vs-python/#comments</comments>
		<pubDate>Mon, 04 Oct 2010 13:03:46 +0000</pubDate>
		<dc:creator>tom</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.thattommyhall.com/?p=292</guid>
		<description><![CDATA[I have been meaning to learn ruby for a while and the place I am working now uses a lot so I had another look at it. I read Learn To Program, a simple but good book and found the bit on blocks and procs etc pretty good and wanted to see if I could [...]]]></description>
			<content:encoded><![CDATA[<p>I have been meaning to learn Ruby for a while, and the place I am working now uses it a lot, so I had another look at it. I read Learn To Program, a simple but good book, found the bit on blocks and procs pretty good, and wanted to see if I could do the same things in Python. Python has anonymous &#8220;lambda&#8221; functions, but they are limited to a single expression (a subset of the syntax), which is a bit annoying sometimes. My worry with methods in Ruby is that they are not first class, I think because you can omit parentheses, so a bare method name invokes the method rather than referring to it. </p>
<p>I remembered this while reading the SICP book, the question was about the difference between this program in applicative and normal order evaluation</p>
<pre class="brush: plain; title: ; notranslate">(define (p) (p))

(define (test x y)
  (if (= x 0)
      0
      y))</pre>
<p>It rang a bell as <strong>(define (p) p)</strong> does not go into an infinite loop if you invoke p. In Lisp <strong>(p)</strong> calls the procedure p with no arguments, whereas <strong>p</strong> is just a reference to the function. In Python <strong>someinstance.method</strong> refers to the method and <strong>someinstance.method()</strong> calls it; Ruby seems to need Proc objects to get around this (IMHO as a beginner! See the end for John Leach&#8217;s lovely response via email at the time)</p>
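<p>For comparison, Ruby can hand a method around without calling it by asking for it by name: Object#method returns a Method object that responds to call, much like a Proc. A minimal sketch (Greeter and twice_do are names made up for this example):</p>

```ruby
# Hypothetical class, for illustration only.
class Greeter
  def wink
    "wink"
  end
end

# Accepts anything that responds to call: a Proc, a lambda or a Method.
def twice_do(callable)
  [callable.call, callable.call]
end

greeter = Greeter.new
wink_ref = greeter.method(:wink)   # a Method object; wink has not run yet

puts wink_ref.class                       # Method
puts twice_do(wink_ref).inspect           # ["wink", "wink"]
puts twice_do(wink_ref.to_proc).inspect   # ["wink", "wink"]
```

<p>It is wordier than Python&#8217;s someinstance.method, but it is the same idea.</p>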
<p>I redid all the examples from the book in Python</p>
<p><strong>Eg 1</strong><br />
Ruby</p>
<pre class="brush: ruby; title: ; notranslate">
def maybe_do some_proc
  if rand(2) == 0
    some_proc.call
  end
end

def twice_do some_proc
  some_proc.call
  some_proc.call
end

wink = Proc.new do
  puts '&lt;wink&gt;'
end

glance = Proc.new do
  puts '&lt;glance&gt;'
end
</pre>
<p>Python</p>
<pre class="brush: python; title: ; notranslate">
import random

def maybe_do(some_proc):
    if random.randrange(2) == 0:
        some_proc()

def twice_do(some_proc):
    some_proc()
    some_proc()

def wink():
    print('wink')

def glance():
    print('glance')

for i in range(5):
    print('running for i=%s' % i)
    maybe_do(wink)
</pre>
<p><strong>Eg2</strong><br />
Ruby</p>
<pre class="brush: ruby; title: ; notranslate">
def do_until_false first_input, some_proc
  input = first_input
  output = first_input
  while output
    input = output
    output = some_proc.call input
  end
  input
end

build_array_of_squares = Proc.new do |array|
  last_number = array.last
  if last_number &lt;= 0
    false
  else
    # Take off the last number...
    array.pop
    # ...and replace it with its square...
    array.push last_number*last_number
    # ...followed by the next smaller number.
    array.push last_number-1
  end
end

always_false = Proc.new do |just_ignore_me|
  false
end

puts do_until_false([5], build_array_of_squares).inspect

yum = 'lemonade with a hint of orange blossom water'
puts do_until_false(yum, always_false)
</pre>
<p>Python</p>
<pre class="brush: python; title: ; notranslate">
def do_until_false(first_input, some_proc):
    input = first_input
    output = first_input
    while output:
        input = output
        output = some_proc(input)
    return input

def build_array_of_squares(array):
    last_number = array[-1]
    if last_number &lt;= 0:
        return False
    else:
        # Take off the last number...
        array.pop()
        # ...replace it with its square, followed by the next smaller number.
        array.append(last_number * last_number)
        array.append(last_number - 1)
        return array

def always_false(just_ignore_me):
    return False

print(do_until_false([5], build_array_of_squares))
yum = 'lemonade with a hint of orange blossom water'
print(do_until_false(yum, always_false))
</pre>
<p><strong>Eg3</strong><br />
Ruby</p>
<pre class="brush: ruby; title: ; notranslate">
def compose proc1, proc2
  Proc.new do |x|
    proc2.call(proc1.call(x))
  end
end

square_it = Proc.new do |x|
  x*x
end

double_it = Proc.new do |x|
  x+x
end

double_then_square = compose double_it, square_it 

square_then_double = compose square_it, double_it

puts double_then_square.call(5)
puts square_then_double.call(5)
</pre>
<p>Python</p>
<pre class="brush: python; title: ; notranslate">
def compose(proc1,proc2):
    def composed(x):
        return proc2(proc1(x))
    return composed

def square_it(x):
    return x**2

def double_it(x):
    return x*2

double_then_square = compose(double_it,square_it)
square_then_double = compose(square_it,double_it)

print(double_then_square(5))
print(square_then_double(5))
</pre>
<p><strong>Eg4</strong><br />
Ruby</p>
<pre class="brush: ruby; title: ; notranslate">
class Array
  def each_even(&amp;was_a_block__now_a_proc)
    # We start with &quot;true&quot; because
    # arrays start with 0, which is even.
    is_even = true
    self.each do |object|
      if is_even
        was_a_block__now_a_proc.call object
      end
      # Toggle from even to odd, or odd to even.
      is_even = !is_even
    end
  end
end

fruits = ['apple', 'bad apple', 'cherry', 'durian']
fruits.each_even do |fruit|
  puts &quot;Yum! I just love #{fruit} pies, don't you?&quot;
end

[1, 2, 3, 4, 5].each_even do |odd_ball|
  puts &quot;#{odd_ball} is NOT an even number!&quot;
end
</pre>
<p>Python</p>
<pre class="brush: python; title: ; notranslate">
class MyArray(list):
    def each_even(self):
        for i in range(len(self)):
            if i % 2 == 0:
                yield self[i]

fruits = MyArray(['apple', 'bad apple', 'cherry', 'durian'])

for fruit in fruits.each_even():
    print('yum! I love %s pies, dont you?' % fruit)

for odd_ball in MyArray([1,2,3,4,5]).each_even():
    print('%s is NOT an even number' % odd_ball)
</pre>
<p><strong>Eg5</strong><br />
Ruby</p>
<pre class="brush: ruby; title: ; notranslate">
def profile block_description, &amp;block
  start_time = Time.new
  block.call
  duration = Time.new - start_time
  puts &quot;#{block_description}: #{duration} seconds&quot;
end

profile '25000 doublings' do
  number = 1
  25000.times do
    number = number + number
  end

  puts &quot;#{number.to_s.length} digits&quot;
  # That's the number of digits in this HUGE number.
end

profile 'count to a million' do
  number = 0
  1000000.times do
    number = number + 1
  end
end
</pre>
<p>Python</p>
<pre class="brush: python; title: ; notranslate">
def profile(description, function):
    import time
    start_time = time.time()
    function()
    duration = time.time() - start_time
    print('%s: %s seconds' % (description, duration))
    print(function.__name__)
    print('see, &quot;function.__name__&quot; can be used in place of description in python')

def count_to_a_million():
    number = 0
    for i in range(1000000):
        number = number+1

profile('count to a million', count_to_a_million)

def profiled(function):
    def new_function(*args, **kwargs):
        import time
        start_time = time.time()
        result = function(*args, **kwargs)
        print('%s took %s secs' % (function.__name__, time.time() - start_time))
        return result
    return new_function

@profiled
def count_to_a_million_again():
    number = 0
    for i in range(1000000):
        number = number + 1

count_to_a_million_again()
</pre>
<p>This uses <a href="http://en.wikipedia.org/wiki/Python_syntax_and_semantics#Decorators">decorators</a>, a nice Python feature that uses higher-order functions (and the fact that functions are first class in Python).</p>
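<p>Ruby has no @decorator syntax, but the same higher-order idea can be sketched with lambdas: take a callable and return a new one that times it. This is my own rough equivalent, not something from the book, and the name profiled is simply borrowed from the Python version above.</p>

```ruby
# Hypothetical helper mirroring the Python "profiled" decorator above.
def profiled(name, callable)
  lambda do |*args|
    start_time = Time.now
    result = callable.call(*args)
    puts "#{name} took #{Time.now - start_time} secs"
    result   # hand back whatever the wrapped callable returned
  end
end

count_to_a_million = lambda do
  number = 0
  1_000_000.times { number += 1 }
  number
end

timed_count = profiled("count_to_a_million", count_to_a_million)
puts timed_count.call
```

<p>Unlike the Python decorator, the name has to be passed in by hand, since a lambda has no __name__ to ask for.</p>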
<p><strong>In Conclusion</strong><br />
IMHO, at this point in my experience of Ruby, with all the disclaimers about my non expert status etc.<br />
Like: </p>
<ul>
<li>No restriction on complexity of anonymous functions</li>
</ul>
<p>Don&#8217;t Like: </p>
<ul>
<li>Methods being different from Procs/Blocks, non-uniform syntax</li>
<li>Leaving out parentheses (though I await DSL goodness later!)</li>
<li>&#8220;end&#8221; everywhere (I know the indentation thing in python is contentious!)</li>
</ul>
<p><strong>John Leach&#8217;s thought-provoking tuppence</strong></p>
<blockquote><p>Young padawan, you look but you do not see, you will learn</p></blockquote>
<p>or rather</p>
<blockquote><p>Yeah, but blocks are closures Tom</p></blockquote>
<p>Tom goes to google and comes back with <a href="http://www.artima.com/intv/closures2.html">http://www.artima.com/intv/closures2.html</a><br />
Matz</p>
<blockquote><p>I think it&#8217;s not that useful in the daily lives of programmers. It doesn&#8217;t matter that much.</p></blockquote>
<p><strong>Then John came back with </strong></p>
<p><code>I can think of one example in Rails right away where it's useful, transactions:</code></p>
<pre class="brush: ruby; title: ; notranslate">
r = Record.new params[:record]

Record.transaction do
 r.save
 RecordLog.create(:text =&gt; &quot;created a new record&quot;)
end
</pre>
<p><code><br />
that code takes some input from a browser (in params), instantiates a<br />
new Record object, then writes it and a RecordLog entry to the database<br />
atomically.<br />
All the Record.transaction does is sends a BEGIN to the db server,<br />
executes the block, and sends a COMMIT (or a ROLLBACK if the block<br />
errors for any reason).<br />
The block needs access to the r object. We could have created that<br />
inside the block, but then it'd need access to the params object.  So<br />
without real closure support, Record.transaction would have had to<br />
support passing in arbitrary variables.<br />
Remember, that interview with Matz was in 2003 - more people are using<br />
Ruby for more things nowadays, for uses beyond the imagination of it's<br />
creator I'm sure <img src='http://www.thattommyhall.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </code></p>
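<p>John&#8217;s point can be tried without Rails or a database. In this sketch fake_transaction is a made-up stand-in for Record.transaction that only records what it would have sent; the block passed to it uses the local variable record without it being handed in as an argument.</p>

```ruby
# Hypothetical stand-in for Record.transaction; no database involved.
def fake_transaction(log)
  log.push("BEGIN")
  yield                 # run the caller's block
  log.push("COMMIT")
rescue StandardError
  log.push("ROLLBACK")  # the block raised, so undo instead of committing
  raise
end

log = []
record = { name: "example" }   # local variable captured by the block below

fake_transaction(log) do
  log.push("save #{record[:name]}")
end

puts log.inspect   # ["BEGIN", "save example", "COMMIT"]
```

<p>fake_transaction never sees record; the block carries it along, which is the closure behaviour John is describing.</p>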
<p><strong>Final Thoughts</strong><br />
I am waiting to be blown away by Ruby and Rails</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thattommyhall.com/2010/10/04/learning-ruby-methods-vs-procs-or-ruby-vs-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Python talk for WYLUG, Ruby envy, Haskell Joy.</title>
		<link>http://www.thattommyhall.com/2007/12/27/python-talk-for-wylug-ruby-envy-haskell-joy/</link>
		<comments>http://www.thattommyhall.com/2007/12/27/python-talk-for-wylug-ruby-envy-haskell-joy/#comments</comments>
		<pubDate>Thu, 27 Dec 2007 12:05:38 +0000</pubDate>
		<dc:creator>tom</dc:creator>
				<category><![CDATA[haskell]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://www.thattommyhall.com/2007/12/27/python-talk-for-wylug-ruby-envy-haskell-joy/</guid>
		<description><![CDATA[I am just getting a talk ready for WYLUG on python. I sent Dave the following blurb: Why I love Python: A talk on the programming language Python, in 3 parts (feel free to leave in the interludes if you have had enough) Part 1: Past, Present, Future. A bit of history and the design [...]]]></description>
			<content:encoded><![CDATA[<p>I am just getting a talk ready for WYLUG on Python.</p>
<p>I sent Dave the following blurb:</p>
<blockquote><p> Why I love Python:</p>
<p>A talk on the programming language Python, in 3 parts (feel free to<br />
leave in the interludes if you have had enough)</p>
<p>Part 1: Past, Present, Future.<br />
A bit of history and the design of the language, a look at all the<br />
implementations available today, quick tour of built-in and commonly<br />
used modules and future plans.</p>
<p>Part 2: Language overview<br />
A quick tour of the language: builtin types, control structures, using<br />
modules etc</p>
<p>Part 3: Recent Magic.<br />
Some relatively recent changes that make programming Python even more<br />
pleasurable.<br />
Decorators, Generators, List comprehensions, Iterators, Functools and<br />
anything else I can fit in.<br />
Again a whirlwind tour, but you should be impressed and want to read<br />
up on these some more</p></blockquote>
<p>I have been revisiting some of the Python talks I have watched over the last few years for ideas and will update my ComSci page with links.</p>
<p>I stumbled across some excellent video from RubyConf, particularly the <a href="http://rubyconf2007.confreaks.com/d2t1p3_rubinius.html" target="_blank">Rubinius</a> one. Rubinius is a Ruby VM partially written in Ruby, taking some lessons from Python and Smalltalk. Some of the stuff he bigs up (compiling to bytecode automatically comes to mind) Python has had for ages, but the self-hosting aspect is cool (not as cool as PyPy though). Rubinius seems to be doing what Avi Bryant suggested <a href="http://itc.conversationsnetwork.org/shows/detail3432.html" target="_blank">here</a>: learn from the Smalltalk guys and the <a href="http://research.sun.com/self/papers/papers.html" target="_blank">papers</a> from the Self team that Sun spun off and later bought back to do the HotSpot VM for Java. Interesting times for dynamic languages: target the JVM or CLR, self-host, and generate code in other languages, while always writing in the same fun language. I say Ruby envy only because I think the Ruby community does a better job of looking cool and exciting people than the Python one.</p>
<p>Now Haskell joy. After describing working through Yet Another Haskell Tutorial to the 2 friends doing it with me as &#8220;not an obviously pleasurable experience&#8221; I had a great moment on the train the other day looking at partial application.<br />
<code>(\y -&gt; y*3)</code><br />
is Haskell for the anonymous function that takes y and multiplies it by 3 (I wish I had LaTeX here to draw the lambda calculus). What I like is that you can also write that as<br />
<code>(*3)</code><br />
While this example is trivial, what is happening is interesting. The compiler knows * is an infix operator that takes 2 arguments and that it has been supplied one, so it &#8220;partially applies&#8221; the function, making (*3) a function that takes one argument. One more thing is changing prefix and infix operators around using ( _ ) and ` _ ` , for example:<br />
<code>3 * 5<br />
(*) 3 5<br />
</code><br />
<code>map (*2) [1,2,3]<br />
(*2) `map` [1,2,3]<br />
</code><br />
I hope this second example is clear, map usually is a prefix function that takes a function and a list and returns a list with the result of applying the function to each element (the return value here would be [2,4,6]). This flexibility is neat and is starting to make Haskell a joy to hack in.</p>
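<p>Since this is filed under Ruby as well, here is a rough Ruby analogue of the partial application above, using Proc#curry (Ruby 1.9 and later). It is only a sketch: MULTIPLY and TIMES_THREE are names made up for this example, and Ruby needs the currying spelled out where Haskell does it for free.</p>

```ruby
# A two-argument function, like Haskell's (*).
MULTIPLY = lambda { |a, b| a * b }

# Supplying one argument gives back a function that waits for the other,
# roughly what (*3) does in Haskell.
TIMES_THREE = MULTIPLY.curry[3]

puts TIMES_THREE.call(5)                                 # 15
puts [1, 2, 3].map { |n| MULTIPLY.curry[2].call(n) }.inspect   # [2, 4, 6]
```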
<p>Merry Christmas,</p>
]]></content:encoded>
			<wfw:commentRss>http://www.thattommyhall.com/2007/12/27/python-talk-for-wylug-ruby-envy-haskell-joy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

