Currently I am working on the implementation of the Parquet page index for
(the design doc is here if you are interested: https://docs.google.com/document/d/1D-el8njq_I-JKd3NDcW1mRXID_n0dBDKIkjWxwULVus/edit?usp=sharing).
During our discussions, it came up that DataPageHeaderV2 states that page
boundaries are also record boundaries: https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L532
DataPageHeader (V1) has no such statement, which means that in theory
it allows records to span multiple pages. Is that really the case, or
is it something that is missing from the specification?
I ask because filtering pages based on the page index is much
simpler if page boundaries are record boundaries as well.
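
As a rough illustration of why this matters, here is a minimal Python sketch
(not taken from any implementation; PageInfo, matching_row_ranges and their
fields are hypothetical names). It assumes every page starts on a record
boundary, so each page maps to a whole range of rows and a page can be kept
or skipped by its min/max statistics alone:

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    # Hypothetical per-page summary; assumes the page starts on a record boundary.
    first_row_index: int  # index of the first row stored in this page
    min_value: int        # smallest column value in the page
    max_value: int        # largest column value in the page


def matching_row_ranges(pages, num_rows, lo, hi):
    """Return [start, end) row ranges of pages whose [min, max] overlaps [lo, hi]."""
    ranges = []
    for i, page in enumerate(pages):
        # A page ends where the next one begins, or at the end of the row group.
        end = pages[i + 1].first_row_index if i + 1 < len(pages) else num_rows
        if page.max_value >= lo and page.min_value <= hi:
            ranges.append((page.first_row_index, end))
    return ranges


pages = [PageInfo(0, 1, 10), PageInfo(100, 11, 20), PageInfo(250, 21, 30)]
selected = matching_row_ranges(pages, num_rows=400, lo=15, hi=25)
```

If a record could continue into the next page, the first row of a selected
page might begin in a page that was skipped, so a simple mapping like this
would no longer be enough.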