Berlin Buzzwords

I have just returned from Berlin Buzzwords. It was a great, well-organised conference, so thanks to the organisers.

As all the talks will be online soon I will just mention a few things that I enjoyed.

The two keynotes were excellent: Doug Cutting on the history of Hadoop and Ted Dunning on its future. Both were very interesting and had a great feel for the community aspect of Open Source software. Ted works for MapR Technologies, but the talk was not a sales pitch. He spoke about how Hadoop currently fails to get the most out of its components, and what we might get if it could. MapR is used by EMC for their new Hadoop distro; among other things, I think they have reimplemented HDFS. A surprising number of companies have raised some pretty big amounts of funding to build front-ends to Hadoop. Datameer, for example, have an Excel-like web front-end that looks interesting.

Talks I enjoyed were:

NODE.JS FOR HEAVY I/O A superb intro to Node.js, with an example small enough to fit on a slide but not completely trivial.

TIME SERIES OR CAUSAL ANALYSIS WITHOUT LIMITS! Shivek was awesome, engaging and enthusiastic. The topic itself was fascinating: using Pi Calculus to reason about and design map/reduce algorithms. He made the point that most Hadoop jobs are data-centric, but showed how to do some more maths-centric algorithms such as FFTs.

OH LEONHARD, WHERE ART THOU? Jim Webber on graph databases in general and Neo4j in particular. Quite a nice reference to Euler in the title. If your data is a graph, why not have a database that is too?

REALTIME BIG DATA AT FACEBOOK WITH HADOOP AND HBASE From Jonathan Gray, this talk was really interesting. The throughput they are getting from HBase is amazing. I think Forward are more like Facebook than Google: more freedom within teams and a choice of tech or rolling your own, versus Google wanting everything on BigTable. (I cringed a bit at the thought of loads of servers running random C++ apps all over the place, though…)

NEWER DEVELOPMENTS IN LARGE DATA TECHNIQUES Joseph Turian from MetaOptimize gave a great overview of recent academic work on Machine Learning and Natural Language Processing. Buzzwords to look out for are Deep Learning, Semantic Hashing and Semantic Parsing. Also look at GraphLab, for Machine Learning on graph databases.

DIGITISED DUTCH CULTURAL HERITAGE, MAHOUT & HADOOP and COMPOSING MAHOUT CLUSTERING JOBS Two good talks on using Mahout. The first was on a Dutch government project, Images for the Future, to archive and categorise AV heritage resources. The second had a nice demo of categorising Stack Overflow.

Lightning Talks: In a talk on the Lustre filesystem, Eric Barton of Whamcloud described how his company is developing Lustre outside Sun/Oracle, and he was trying to see where it could fit in with Hadoop. Lustre is at the other end of the spectrum from HDFS/Hadoop: really quick, but assuming fast, highly available storage behind it. I would love to see some integration with Lustre or Ceph in a Hadoop-like system.

I gave a talk on the Flume Firehose that Abs and I made at Forward last week. It was OK (though I still think no one has done a good job of selling ZeroMQ in 10 minutes!). Slides are here. (I’ll do another post about it as well, as there was quite an entertaining fallout from it over Twitter.)