I'm not sure this fully answers your question, but the number of
partitions is supposed to be at least roughly double your number of
cores (I'm surprised not to see this point in your list), and it can
easily grow up to 10x, until you notice too much overhead. That's not
programmatic, though, as you asked ;-).
We ran lots of experiments, and I'd say the sweet spot is reached when
"number of input records * maximum record size / number of
partitions" is far smaller than the available memory per core (to give
Spark enough headroom to work properly, for serialization among other
things, as you mentioned in #6).
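
To make the two rules of thumb above concrete, here is a rough sketch in
plain Python (no Spark needed). The function name, its parameters, and the
0.1 headroom fraction are my own assumptions: "far smaller" is not a precise
threshold, so tune that fraction to your workload.

```python
import math


def suggest_partitions(num_records, max_record_bytes, total_cores,
                       memory_per_core_bytes, headroom_fraction=0.1):
    """Hypothetical helper: estimate a partition count from the heuristics.

    headroom_fraction is an assumed stand-in for "far smaller than the
    available memory per core"; 0.1 is an arbitrary starting point.
    """
    # Worst-case size of the whole input if every record were maximal.
    worst_case_bytes = num_records * max_record_bytes
    # Memory budget we allow a single partition to occupy on one core.
    budget_bytes = memory_per_core_bytes * headroom_fraction
    # Fewest partitions that keep each partition within the budget.
    by_memory = math.ceil(worst_case_bytes / budget_bytes)
    # Floor from the "at least roughly double the cores" rule.
    by_cores = 2 * total_cores
    return max(by_memory, by_cores)


# Example: 100M records of at most 1 KiB, 16 cores, 2 GiB per core.
n = suggest_partitions(100_000_000, 1024, 16, 2 * 1024**3)
print(n)
```

The result is only a starting value, for example to pass to repartition();
watch the overhead as you raise it towards 10x the core count.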
Since most of the time we'll have many cores sharing too little memory
(there is never enough), it looks worthwhile to partition aggressively.
Shuffle overhead also looks easier to observe and debug than memory
efficiency.
Interested in more accurate answers as well :-)
On 4 September 2015 at 05:20, Isabelle Phan <[EMAIL PROTECTED]> wrote: