Just following up on the message below to see if anyone has any suggestions. I appreciate your help in advance.
On Mon, Jun 1, 2020 at 9:33 AM Rishi Shah <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I use the following to read a set of parquet file paths when the files are
> scattered across many, many partitions:
>
>     paths = ['p1', 'p2', ... 'p10000']
>     df = spark.read.parquet(*paths)
>
> The above method feels like it is reading those files sequentially rather
> than parallelizing the read operation. Is that correct?
>
> If I put all these files under a single path and read as below, it works
> faster:
>
>     path = 'consolidated_path'
>     df = spark.read.parquet(path)
>
> Is my observation correct? If so, is there a way to optimize reads from
> multiple/specific paths?
>
> --
> Regards,
>
> Rishi Shah
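
For anyone who wants to reproduce what I'm seeing, here is a minimal self-contained sketch of the two read patterns I'm comparing. The paths ('p1', 'p2', 'p10000', 'consolidated_path') are placeholders standing in for the real locations, and the count() calls are only there as a rough way to force the reads and time them.

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-read-comparison").getOrCreate()

    # Pattern 1: pass many individual parquet paths to the reader.
    # Placeholders for the ~10,000 real paths in my job.
    paths = ['p1', 'p2', 'p10000']
    df_many = spark.read.parquet(*paths)

    # Pattern 2: point the reader at a single consolidated directory
    # that contains the same files.
    df_single = spark.read.parquet('consolidated_path')

    # Rough wall-clock comparison by forcing an action on each DataFrame.
    start = time.time()
    df_many.count()
    print("many paths:", time.time() - start)

    start = time.time()
    df_single.count()
    print("single path:", time.time() - start)
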