Subject: Spark streaming when a node or nodes go down


Hi,

I know this is a basic question but someone enquired about it and I just
wanted to fill my knowledge gap so to speak.

Within the context of Spark streaming, the RDD is created from the incoming
topic and RDD is partitioned and each node of Spark is operating on a
partition at that time. OK This series of operations are merged together
and create a DAG. That means DAG keeps track of operations performed?

If a node goes down, the driver (application master) knows about it. Then,
it tries to assign another node to continue the processing at the same
place on that partition of RDD. This works but with reduced performance.
However, to be able to handle the lost partition,* the data has to be
available to all nodes from the beginning.* So we are talking about spark
streaming not spark reading from HDFS, Hive table etc. The assumption is
that the streaming data is cached in every single node? Is that correct!

Thanks

Dr Mich Talebzadeh

LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

http://talebzadehmich.wordpress.com
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.