Subject: Tasks run only on one machine


Sure

    var columns = mc.textFile(source).map { line => line.split(delimiter) }

Here “source” is a comma-delimited list of files or directories. Both the textFile and the .map tasks run only on the machine they were launched from.

Other distributed operations happen later, but I suspect that if I can figure out why the first line runs only on the client machine, the rest will clear up too. Here are some subsequent lines.

    if(filterColumn != -1) {
      columns = columns.filter { tokens => tokens(filterColumn) == filterBy }
    }

    val interactions = columns.map { tokens =>
      tokens(rowIDColumn) -> tokens(columnIDPosition)
    }

    interactions.cache()
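
For anyone who wants to reproduce the symptom, here is a minimal diagnostic sketch. It is not part of the job above. It assumes a plain SparkContext rather than the “mc” wrapper, a hypothetical PartitionCheck object, and that the input list and delimiter are passed as arguments. It only reports how many partitions textFile produced and which hosts processed them, then tries a repartition to see whether the count changes. It does not claim to identify the cause.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Diagnostic sketch only: PartitionCheck is a hypothetical name, not code from the job.
object PartitionCheck {
  def main(args: Array[String]): Unit = {
    require(args.length >= 1, "usage: PartitionCheck <comma-delimited paths> [delimiter]")
    val source = args(0)
    val delimiter = if (args.length > 1) args(1) else "," // assumed default

    // The master is left unset so spark-submit can supply it (e.g. yarn-client).
    val sc = new SparkContext(new SparkConf().setAppName("PartitionCheck"))
    try {
      val columns = sc.textFile(source).map(line => line.split(delimiter))

      // How many partitions did the read produce, and what does Spark
      // consider the default parallelism for this application?
      println(s"partitions after textFile: ${columns.partitions.length}")
      println(s"default parallelism: ${sc.defaultParallelism}")

      // Emit one (hostname, record count) pair per partition, evaluated on
      // whichever executor runs that partition, then total per host on the driver.
      val perHost = columns
        .mapPartitions { rows =>
          Iterator((java.net.InetAddress.getLocalHost.getHostName, rows.size.toLong))
        }
        .collect()
        .groupBy(_._1)
        .map { case (host, counts) => host -> counts.map(_._2).sum }

      perHost.foreach { case (host, n) => println(s"$host: $n records") }

      // Experiment, not a confirmed fix: force a shuffle into more partitions.
      val spread = columns.repartition(sc.defaultParallelism * 2)
      println(s"partitions after repartition: ${spread.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

If every record is reported under a single hostname, the tasks really are running on one machine. If several hostnames appear, the work is being distributed and the problem is elsewhere.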

On Apr 23, 2015, at 10:14 AM, Jeetendra Gangele <[EMAIL PROTECTED]> wrote:

Will you be able to paste code here?

On 23 April 2015 at 22:21, Pat Ferrel <[EMAIL PROTECTED]> wrote:
I’m using Spark Streaming to create a large volume of small nano-batch input files, ~4k per file, thousands of ‘part-xxxxx’ files. When I read the nano-batch files and do a distributed calculation, my tasks run only on the machine where the job was launched. I’m launching in “yarn-client” mode. The RDD is created using sc.textFile(“list of thousand files”).

What would cause the read to occur only on the machine that launched the driver?

Do I need to do something to the RDD after reading? Has some partition factor been applied to all derived RDDs?
---------------------------------------------------------------------