I am going to play the contrarian here.

Parquet is not *always* faster than JSON.

The (almost unique) case where it is better to leave data as JSON (or
whatever) is when the average number of times that a file is read is equal
to or less than roughly 1.

The point is that to read the files n times in Parquet format, you
have to read the JSON once, write the Parquet and then read the Parquet n
times. The cost of reading the JSON n times is simply n times the cost of
reading the JSON (neglecting caches and such). As such, if n <= 1+epsilon,
JSON wins.
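
To put rough numbers on that argument, here is a minimal sketch of the
cost model. The function names and the example timings are made up for
illustration (they are not measurements), and caches are ignored as above.

```python
def cost_json(n, read_json):
    """Total cost of reading the raw JSON n times."""
    return n * read_json


def cost_parquet(n, read_json, write_parquet, read_parquet):
    """Total cost of converting once, then reading the Parquet n times."""
    return read_json + write_parquet + n * read_parquet


def break_even_reads(read_json, write_parquet, read_parquet):
    """Number of reads at which both strategies cost the same.

    Solves n * read_json == read_json + write_parquet + n * read_parquet.
    If Parquet reads are not cheaper than JSON reads, converting never
    pays off, so the break-even point is infinite.
    """
    if read_parquet >= read_json:
        return float("inf")
    return (read_json + write_parquet) / (read_json - read_parquet)


# Hypothetical per-file costs in seconds, chosen only for illustration.
READ_JSON, WRITE_PARQUET, READ_PARQUET = 10.0, 2.0, 1.0

n_star = break_even_reads(READ_JSON, WRITE_PARQUET, READ_PARQUET)
for n in (0, 1, 2, 5):
    j = cost_json(n, READ_JSON)
    p = cost_parquet(n, READ_JSON, WRITE_PARQUET, READ_PARQUET)
    print(f"n={n}: json={j:.1f}s parquet={p:.1f}s")
print(f"break-even at n = {n_star:.2f}")
```

Since the break-even point is (read_json + write_parquet) divided by
(read_json - read_parquet), it can never drop below 1, and it approaches 1
only as the Parquet write and read costs go to zero. That is the
"1+epsilon" above.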

This isn't as strange a case as it might seem. For security logs, it is
common that the files are never read until you need them. That means that n
is nearly zero on average and n << 1 in any case. For incoming data, it is
common that there is an immediate transformation into an alternative form.
That might be pruning data or elaborating or aggregating. The point is that
the original data need not ever be re-written into Parquet format since it
is only ever read once. Transforming the format would waste time and space.

The other case of importance is where the read time is near zero for JSON.
Transforming to any other format will take near zero time and reading from
any other format will also be near zero. The win for transforming will be
near zero as well.


Having said all that, I agree that reading from Parquet will almost
certainly be faster and combining a bunch of small JSON files together into
a larger Parquet file will be a real boon for frequently read data. It is
just that faster isn't always better if there is a fixed cost.

On Mon, Jun 11, 2018 at 6:42 AM Padma Penumarthy <[EMAIL PROTECTED]>