CSV Reader (and Type Inference and Data Conversion) Benchmarks (Faster, Fasterer, Fastest)

Gerald Bauer

2018-11-22 15:01:33 UTC

Hello,

I've put together some basic CSV reader / parser benchmarks [1].
The "Raw" Read Benchmark returns all strings - no type inference or
data conversion (*) - and the Numerics Benchmark returns all numbers,
using simple type inference or data conversion. It's all numbers - all
the time (except for the header row).
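To make "simple data conversion" concrete, here is a minimal sketch of turning string fields into numbers. The helper name to_numeric is made up for this example and is not taken from the benchmark code in [1]; it only assumes the standard Kernel#Integer and Kernel#Float conversions.

```ruby
# Hypothetical helper (not from the benchmark repo): convert one field
# to an Integer or Float, falling back to the original string when
# the value does not parse as a number.
def to_numeric( value )
  Integer( value, 10 )          # strict base-10 integer parse
rescue ArgumentError, TypeError
  begin
    Float( value )              # strict float parse
  rescue ArgumentError, TypeError
    value                       # not a number - keep as-is
  end
end

row  = ['2018', '11.5', '-3', 'n/a']
nums = row.map { |v| to_numeric( v ) }
p nums
```

The strict Integer() / Float() calls raise on junk input, unlike String#to_i and String#to_f, which silently return 0 and 0.0.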

Here's the result for the numerics benchmark using the weather station
data from the University of Waterloo, Ontario, Canada:

n = 100
                        user     system      total        real
std:               20.781000   0.234000  21.015000 ( 21.039186)
split:              1.531000   0.063000   1.594000 (  1.582496)
split(table):       2.000000   0.015000   2.015000 (  2.016913)
reader:            63.500000   0.203000  63.703000 ( 63.691851)
reader(table):     37.407000   0.188000  37.595000 ( 37.601160)
reader(numeric):   40.421000   0.141000  40.562000 ( 40.595467)
reader(json):       1.125000   0.062000   1.187000 (  1.191145)
reader(yaml):      38.485000  15.672000  54.157000 ( 54.229705)
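The user / system / total / real columns are the layout that Ruby's standard Benchmark.bm prints. A minimal sketch of such a harness follows; the tiny generated sample file and n = 10 are assumptions for illustration only, and the actual scripts live in [1].

```ruby
require 'benchmark'
require 'csv'
require 'tempfile'

# Made-up sample data; the real benchmark reads the weather station file.
sample = Tempfile.new( ['sample', '.csv'] )
sample.write( "a,b,c\n" + "1,2,3\n" * 1_000 )
sample.close

n = 10   # repetitions (the post uses n = 100)

# Benchmark.bm prints one timing row per report and returns the timings.
results = Benchmark.bm( 8 ) do |x|
  x.report( 'std:' ) do
    n.times { CSV.read( sample.path ) }
  end
  x.report( 'split:' ) do
    n.times do
      File.foreach( sample.path ).map { |line| line.chomp.split( ',' ) }
    end
  end
end
```

The argument to Benchmark.bm is only the label column width; the timings are reported in seconds.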

And the winner is...

Of course - nothing is faster than "plain" String#split (with "simple
csv", that is, no escape rules and edge cases):

def read_faster_csv( path, sep: ',' )
  recs = []
  File.open( path, 'r:utf-8' ) do |f|
    f.each_line do |line|
      # chomp('') strips all trailing newlines (\n or \r\n)
      line   = line.chomp( '' )
      # plain split - no quotes, no escapes
      values = line.split( sep )
      recs << values
    end
  end
  recs
end
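To show what "no escape rules and edge cases" means in practice, here is a small comparison of plain String#split with the standard library's CSV.parse_line. The sample lines are made up for this example.

```ruby
require 'csv'

# A quoted field containing the separator.
line = %q{1,"Hello, World",3}

naive  = line.split( ',' )        # splits inside the quotes
parsed = CSV.parse_line( line )   # honors the quoting rules

p naive    # => ["1", "\"Hello", " World\"", "3"]
p parsed   # => ["1", "Hello, World", "3"]

# Another edge case: split drops trailing empty fields
# unless a negative limit is passed.
p '1,2,,'.split( ',' )       # => ["1", "2"]
p '1,2,,'.split( ',', -1 )   # => ["1", "2", "", ""]
```

So the split reader is only a fair choice when the data is known to have no quoted values and no trailing empty fields.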

(*) Note: YAML and JSON - of course - always use YAML and JSON
encoding (and data conversion) rules :-).

Happy data wrangling with ruby. Cheers. Prost.

[1] https://github.com/csvreader/benchmarks
