Subject: Spark Structured Streaming for Twitter Streaming data


The code uses the format "socket", which is only for text sent over a simple
socket. That is completely different from how the Twitter APIs work, so this
won't work at all.
Fundamentally, for Structured Streaming, we have focused only on those
streaming sources that support record-level offset tracking (e.g. Kafka
offsets) and replayability, in order to give strong exactly-once
fault-tolerance guarantees. Hence we have focused on files, Kafka, Kinesis
(socket is just for testing, as is documented). The Twitter APIs as a source
do not provide those, hence we have not focused on building one. In
general, for such sources (ones that are not perfectly replayable), there
are two possible solutions.

1. Build your own source: A quick Google search shows that others in the
community have attempted to build Structured Streaming sources for Twitter.
Such a source won't provide the same fault-tolerance guarantees as Kafka,
etc. However, I don't recommend this now because the DataSource APIs to
build streaming sources are not public yet, and are in flux.

2. Use Kafka/Kinesis as an intermediate system: Write something simple that
uses the Twitter APIs directly to read tweets and writes them into
Kafka/Kinesis. Then just read from Kafka/Kinesis in Structured Streaming.
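
A rough sketch of option 2 with Kafka, in Python. This is only an
illustration under assumptions, not a tested pipeline: the topic name
"tweets", the helper names forward_tweets and read_tweets_from_kafka, and
the assumption that each tweet is a dict with an "id" field are all made up
for this example. The `send` callable stands in for whatever Kafka producer
client you use, and `spark` is an existing SparkSession with the Kafka
source package (spark-sql-kafka) available.

```python
import json
from typing import Callable, Iterable, Mapping

# Hypothetical topic name, chosen for this sketch only.
TOPIC = "tweets"


def forward_tweets(
    tweets: Iterable[Mapping],
    send: Callable[[str, bytes, bytes], object],
    topic: str = TOPIC,
) -> int:
    """Serialize each tweet to JSON and pass it to send(topic, key, value).

    Assumes every tweet is a dict-like object with an "id" field, which is
    used as the record key. Returns the number of tweets forwarded.
    """
    count = 0
    for tweet in tweets:
        key = str(tweet["id"]).encode("utf-8")
        value = json.dumps(tweet).encode("utf-8")
        send(topic, key, value)
        count += 1
    return count


def read_tweets_from_kafka(spark, bootstrap_servers: str, topic: str = TOPIC):
    """Return a streaming DataFrame with one JSON string per Kafka record.

    `spark` is an existing SparkSession. Offsets are tracked by the Kafka
    source, so the tweets become replayable once they are in the topic.
    """
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(value AS STRING) AS json")
    )
```

With the kafka-python client, for instance, `send` could be something like
`lambda topic, key, value: producer.send(topic, key=key, value=value)` for a
KafkaProducer instance named producer. The part that actually pulls tweets
from the Twitter APIs is left out on purpose, since it depends on the client
library you choose.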

Hope this helps.

TD

On Wed, Jan 31, 2018 at 7:18 PM, Divya Gehlot <[EMAIL PROTECTED]>
wrote: