Thanks for the quick reply, Zhang.
I don't think we have much data skew, and if we do, there isn't much of a way around it - we need to group by this specific column in order to run our aggregates.
I'm running this with PySpark, and the DataFrame groupBy() function doesn't appear to take a numPartitions parameter. What other options can I explore?
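For context, the pattern I'm using looks roughly like this (column and function names below are placeholders, not our real schema). The UDF body itself is plain pandas, so it can be sanity-checked locally without a Spark session:

```python
import pandas as pd

# Hypothetical stand-in for our GROUPED_AGG UDF body: Spark hands it one
# pandas Series per group and expects a single scalar back.
def mean_udf(values: pd.Series) -> float:
    return float(values.mean())

# The body is plain pandas, so it can be exercised on a local Series:
print(mean_udf(pd.Series([1.0, 2.0, 3.0])))  # 2.0

# In Spark this would be wrapped and applied roughly as:
#   agg_udf = pandas_udf(mean_udf, "double", PandasUDFType.GROUPED_AGG)
#   df.groupBy("some_column").agg(agg_udf(df["value"]))
```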
From: ZHANG Wei <[EMAIL PROTECTED]>
Sent: Thursday, May 7, 2020 1:34 AM
To: Gautham Acharya <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: PyArrow Exception in Pandas UDF GROUPEDAGG()
AFAICT, there might be data skew: some partitions got too many rows, which hit the memory limit. Trying .groupBy().count() or .aggregateByKey().count() may help check the data size of each group.
If there is no data skew, increasing the .groupBy() parameter `numPartitions` is worth a try.
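A minimal local sketch of the skew check, using pandas value_counts as a stand-in for Spark's df.groupBy(col).count() (the data and column name here are made up for illustration):

```python
import pandas as pd

# Made-up data: key "a" holds 98% of the rows, i.e. heavy skew.
df = pd.DataFrame({"key": ["a"] * 98 + ["b", "c"], "value": range(100)})

# Rows per key; the Spark equivalent would be roughly
#   df.groupBy("key").count().orderBy("count", ascending=False)
counts = df["key"].value_counts()
print(counts.head())

# A dominant key means one grouped-agg task receives most of the data,
# which is the usual cause of a single-task out-of-memory failure.
```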
On Wed, 6 May 2020 00:07:58 +0000
Gautham Acharya <[EMAIL PROTECTED]> wrote:
To unsubscribe e-mail: [EMAIL PROTECTED]