Usually I am in favour of testing. When people say “weighing a cow does not make it any heavier”, I reply that if you measure nothing, you will never see any improvements. People can usually come up with cogent arguments against the particular things being measured, but the answer never seems to be to measure nothing; rather, some better metric needs to be created.
I stumbled across a site that reminded me of when I used to read tomshardware to see how fast new PC parts really were (or to look at nice graphs and read a conclusion telling me). The site's benchmarks of Ubuntu vs Fedora and The Last 12 Linux Kernels strike me as pretty pointless: pages of graphs saying things are the same bar noise, as you would probably expect. The conclusion of the distro shootout at least says you should (could?) not base your decision on the results. The conclusion of the kernel one is interesting:
The only benchmark where there was a definitive improvement was with the network performance. Granted, these 12 Linux kernels were only tested on one system and in eight different benchmarks. We will continue benchmarking the Linux kernel in different environments and report back once we have any new findings.
Of all the benchmarks, the network one is the one I would have predicted in advance to be the noisiest. I would suggest that you would get different results on different machines; perhaps they should use different clocks too.
Do these people creating the graphs know about repeating measurements, averaging and statistical significance tests? I think not, so what is the point?
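For what it is worth, here is a minimal sketch of what I mean by repeating measurements, averaging and testing for significance. The throughput numbers are made up, the function names are my own, and a permutation test is just one simple choice of significance test. None of this is what the site actually does.

```python
import random
import statistics


def summarise(name, runs):
    """Report the mean and spread of repeated runs of one benchmark."""
    return (f"{name}: mean={statistics.mean(runs):.2f} "
            f"stdev={statistics.stdev(runs):.2f} n={len(runs)}")


def permutation_p_value(a, b, rounds=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Estimates how often a random relabelling of the pooled runs gives
    a difference at least as large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)


if __name__ == "__main__":
    # Hypothetical repeated runs of one network benchmark on two kernels.
    kernel_a = [941.2, 938.7, 944.0, 940.1, 939.5, 942.8]
    kernel_b = [940.6, 943.1, 937.9, 941.7, 942.2, 939.0]

    print(summarise("kernel A", kernel_a))
    print(summarise("kernel B", kernel_b))
    print(f"p-value: {permutation_p_value(kernel_a, kernel_b):.3f}")
```

A large p-value here would mean the gap between the two bars is the sort of thing you would see from noise alone. Without something like this next to each graph, there is no telling whether a difference is real.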