HDFS is simply a better place for performant reads, and on top of that the
data is local to your Spark job. The Databricks link above shows where they
found a 6x read-throughput difference between the two.
If your HDFS is part of the same Spark cluster, then reads should be
incredibly fast compared to reaching out to S3 for the data.
They are different types of storage solving different problems.
Something I have seen in workflows, which others have suggested above, is a
stage where you load data from S3 into HDFS, do the rest of your work with it
there, and maybe finally persist the results outside of HDFS.
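That staging step could be sketched roughly like this. This is a minimal sketch, not anyone's actual pipeline: the bucket/HDFS paths and the helper names are assumptions, and the heavy lifting is delegated to `hadoop distcp`, a standard Hadoop tool for bulk copies between filesystems.

```python
# Hypothetical sketch of the "stage S3 data into HDFS first" step.
# Paths and function names are illustrative assumptions, not a real setup.
import subprocess


def distcp_command(src: str, dst: str) -> list:
    """Build the `hadoop distcp` argv for copying data between filesystems."""
    return ["hadoop", "distcp", src, dst]


def stage_to_hdfs(s3_path: str, hdfs_path: str) -> None:
    """Copy a dataset from S3 into HDFS before the Spark job reads it.

    Example (hypothetical paths):
        stage_to_hdfs("s3a://my-bucket/events/", "hdfs:///staging/events/")
    """
    subprocess.run(distcp_command(s3_path, hdfs_path), check=True)
```

After staging, the Spark job reads from the `hdfs://` path instead of `s3a://`, and any final results can be written back out to S3 at the end.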
On Fri, May 29, 2020 at 2:09 PM Bin Fan <[EMAIL PROTECTED]> wrote:
I appreciate your time,