Subject: Spark dataframe hdfs vs s3


You don't have much room to maneuver if it is a streaming job. For batch jobs, though, teams will sometimes copy their S3 data to HDFS in preparation for the next run :D

From: randy clinton <[EMAIL PROTECTED]>
Date: Thursday, May 28, 2020 at 5:50 AM
To: Dark Crusader <[EMAIL PROTECTED]>
Cc: Jörn Franke <[EMAIL PROTECTED]>, user <[EMAIL PROTECTED]>
Subject: Re: Spark dataframe hdfs vs s3

See if this helps

"That is to say, on a per node basis, HDFS can yield 6X higher read throughput than S3. Thus, given that the S3 is 10x cheaper than HDFS, we find that S3 is almost 2x better compared to HDFS on performance per dollar."

https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
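
As a rough check on the quoted figures, the arithmetic can be sketched as below. The 6x and 10x numbers come from the quote; treating performance per dollar as a simple ratio of the two is my assumption, not necessarily how the blog post computed it.

```python
# Figures from the quoted sentence; the plain-ratio model is an assumption.
HDFS_READ_THROUGHPUT_VS_S3 = 6.0   # HDFS reads ~6x faster per node
S3_COST_ADVANTAGE = 10.0           # S3 is ~10x cheaper


def perf_per_dollar_ratio(throughput_ratio: float, cost_ratio: float) -> float:
    """How much better S3 is than HDFS on performance per dollar."""
    # S3 delivers 1/throughput_ratio of the performance at 1/cost_ratio of the cost.
    return cost_ratio / throughput_ratio


ratio = perf_per_dollar_ratio(HDFS_READ_THROUGHPUT_VS_S3, S3_COST_ADVANTAGE)
print(f"{ratio:.2f}")  # prints 1.67
```

That is roughly 1.7x, which fits the "almost 2x" in the quote.
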
On Wed, May 27, 2020, 9:51 PM Dark Crusader <[EMAIL PROTECTED]> wrote:
Hi Randy,

Yes, I'm using Parquet on both S3 and HDFS.

On Thu, 28 May, 2020, 2:38 am randy clinton, <[EMAIL PROTECTED]> wrote:
Is the file Parquet on S3 or is it some other file format?

In general I would assume that HDFS reads and writes are faster for Spark jobs.

For instance, consider how well partitioned your HDFS file is vs the S3 file.

On Wed, May 27, 2020 at 1:51 PM Dark Crusader <[EMAIL PROTECTED]> wrote:
Hi Jörn,

Thanks for the reply. I will try to create an easier example to reproduce the issue.

I will also try your suggestion to look into the UI. Can you guide me on what I should be looking for?

I was already using the s3a protocol to compare the times.

My hunch is that multiple reads from S3 are required because of improper caching of intermediate data, and maybe HDFS is doing a better job at this. Does this make sense?

I would also like to add that we built an extra layer on top of S3, which might be making the times even slower.

Thanks for your help.

On Wed, 27 May, 2020, 11:03 pm Jörn Franke, <[EMAIL PROTECTED]> wrote:
Have you looked in the Spark UI to see why this is the case?
Reading from S3 can take more time; it also depends on which S3 URL scheme you are using: s3a vs s3n vs s3.

After some computation, it could help to persist the result in memory or on HDFS. You can also load from S3 once at the start, store the data on HDFS, and work from there.
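
A minimal PySpark sketch of both suggestions, assuming Parquet input and an existing SparkSession passed in as `spark`. The helper names and any paths are hypothetical; the calls used (`spark.read.parquet`, `DataFrame.write.mode(...).parquet`, `persist`, `count`) are standard PySpark API.

```python
def stage_on_hdfs(spark, s3_path, hdfs_path):
    """Read Parquet from S3 once, copy it to HDFS, and return a DataFrame
    backed by the HDFS copy (hypothetical helper)."""
    spark.read.parquet(s3_path).write.mode("overwrite").parquet(hdfs_path)
    return spark.read.parquet(hdfs_path)


def cache_intermediate(df):
    """Persist an intermediate DataFrame at the default storage level and
    run an action so later stages reuse it instead of re-reading the source."""
    df = df.persist()
    df.count()  # persist() is lazy; an action is needed to materialize it
    return df
```

Usage would look like `df = stage_on_hdfs(spark, "s3a://<bucket>/<prefix>", "hdfs:///<staging-dir>")`, with the placeholders replaced by your own locations.
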

HDFS offers data locality for the tasks, i.e. the tasks start on the nodes where the data is. Depending on which S3 "protocol" you are using, you might also take a bigger performance hit.

Try s3a as a protocol (replace all s3n with s3a).
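
If the paths are built in code, the swap can be sketched as a small helper. The function name and the example bucket are hypothetical; it only rewrites the URI scheme and leaves everything else alone.

```python
def to_s3a(uri: str) -> str:
    """Rewrite an s3n:// URI to s3a://; return any other URI unchanged."""
    prefix = "s3n://"
    if uri.startswith(prefix):
        return "s3a://" + uri[len(prefix):]
    return uri


print(to_s3a("s3n://example-bucket/data/events.parquet"))
# prints s3a://example-bucket/data/events.parquet
```
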

You can also use the s3 URL, but this requires a special bucket configuration, a dedicated empty bucket, and it lacks some interoperability with other AWS services.

Nevertheless, it could also be something else in the code. Can you post an example reproducing the issue?
--
I appreciate your time,

~Randy