By taking a fairly lightweight approach and avoiding object creation during parsing unless it is necessary, the parser achieves very good performance, even compared to the Univocity and Jackson parsers. The API exposes the raw buffer and byte offsets of the parsed values, which gives a significant performance boost (see the "no String" result below).
Performance results - fastest kid on the block
Now for the performance results: first the CSV-Game results, then the Quick-CSV-Streamer results.
If you are already using a Streams-based API, I would definitely recommend QuickCSV; it will be easier to adapt and is probably "good enough".
The results below don't really highlight the end-to-end benefits you get from using Zero through your entire processing chain, but they are nonetheless interesting. Zero easily beats most of the general-purpose parsers; QuickCSV, however, pulls ahead with parallel processing. On sequential processing, Zero beats Quick when reading from a byte[] but not from an InputStream, and using the 'no String' API keeps Zero ahead of Quick.
CSV-Game result      CSV      Count
------------------------------------
CSVeed               4.871    -
BeanIO               1.126    -
Commons              0.848    -
Java                 0.328    -
OpenCSV              0.390    0.429
Univocity            0.272    0.313
ZeroCsv              0.141    0.161
And now using the Quick-CSV test set. What's interesting in this set is that quite a bit of the total time goes to String/Double parsing rather than to actual CSV work. Quick and Zero are both very quick.
Quick-CSV Test Set        Parse time
----------------------------------------------
OpenCSV                   1870 +- 86 ms/op
ZeroCsv (I'Stream)         786 +- 29 ms/op
ZeroCsv (byte[])           512 +- 32 ms/op
ZeroCsv (no String)        371 +- 33 ms/op
QuickCSV Parallel          232 +- 11 ms/op
import java.io.FileInputStream;

// provide a filename, the temporary buffer size,
// and the number of columns we should anticipate
String filename = "hello.csv";
int BUFFER_SIZE = 512 * 1024;
int MAX_COLS = 15;

// create the tokenizer
CsvTokenizer csv = new CsvTokenizer(new FileInputStream(filename), BUFFER_SIZE, MAX_COLS);

// start parsing
while (csv.parseLine()) {
    // fetch column count
    System.out.println("column count = " + csv.getCount());

    // fetch a value as a string (creates garbage)
    System.out.println("3rd column value = " + csv.getAsString(2));

    // or better, write directly to System.out (without creating a String)
    System.out.write(csv.getLineBuffer(), csv.getOffset(2), csv.getLength(2));
}
Working with Disruptor
// in the main loop: publish each parsed line into the ring buffer;
// the translator copies the tokenizer's current line into the event slot
while (csv.parseLine()) {
    ringBuffer.publishEvent((event, seq, tokenizer) -> CityZ.copyTo(tokenizer, event), csv);
}
// see the CityZ code for an example
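For context, here is a minimal sketch of the surrounding Disruptor wiring, assuming Disruptor 3.x; CityEvent, its fields, and the handler body are illustrative stand-ins for whatever CityZ copies into, not the actual ZeroCsv code:

import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

// hypothetical mutable event holding one parsed row
class CityEvent {
    String name;
    double latitude;
    double longitude;
}

// ring buffer size must be a power of two
Disruptor<CityEvent> disruptor =
        new Disruptor<>(CityEvent::new, 1024, DaemonThreadFactory.INSTANCE);

// the consumer runs on its own thread and receives events in sequence
disruptor.handleEventsWith((event, sequence, endOfBatch) ->
        System.out.println(event.name));

disruptor.start();
RingBuffer<CityEvent> ringBuffer = disruptor.getRingBuffer();

The key design point is that the event objects are preallocated by the ring buffer, so publishing a line reuses an existing slot rather than allocating a new object per row.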
Other helpers are available, such as getAsLong, getAsInt, and getAsDouble. This makes for a nice API that works with the existing Java APIs taking a byte array, offset, and length. If the API gets cumbersome, it is simple to switch back to using Strings. It also allows you to skip deserialisation of columns that are not required.
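As a minimal sketch, here is a loop that deserialises only the columns it actually needs; the column indexes and the process() handler are made up for illustration:

while (csv.parseLine()) {
    // only columns 2 and 3 are deserialised; the rest stay as raw bytes
    long id = csv.getAsLong(2);
    double price = csv.getAsDouble(3);
    process(id, price); // hypothetical downstream handler
}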
If you are using an InputStream to process huge files, the Zero parser will allocate smaller buffers and rotate through them for you. As long as you eventually complete processing through the Disruptor, the buffers will be released. (This does generate some garbage.)
What's special
A significant part of the difference is in a few very small areas:
- Custom double and int parsing implementations (a sketch of the int case follows this list)
- Branching shortcuts in surprising places
- Avoiding buffer copies and object instantiation where possible
- Taking some shortcuts with error checking
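To illustrate the first and last points, here is a rough sketch of the kind of custom int parsing this implies; it is not the actual ZeroCsv code, just the shape of the idea: parse straight from the byte buffer, with no String and no format or overflow checks:

// parse a non-negative decimal int straight from the byte buffer;
// no String is created, and (deliberately) no format or overflow checks
static int parseInt(byte[] buf, int offset, int length) {
    int result = 0;
    for (int i = offset; i < offset + length; i++) {
        result = result * 10 + (buf[i] - '0');
    }
    return result;
}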
Further investigation
QuickCSV also uses a custom Double parsing implementation.
In addition, QuickCSV seems to perform much faster than ZeroCSV on InputStream-based input; I'd like to investigate why that is.
The code?
Code is on github at https://github.com/jasonk000/zerocsv