Subject: Processing .wav files in PySpark

I need to process .wav files in PySpark. When the files are on the local file system, I can process them. Once I store them on HDFS, I run into problems. For example:

I run sox on a .wav file like this:

sox ext2187854_03_27_2014.wav -n stats  <-- works fine

sox hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav -n stats   <-- does not work, because sox cannot read from HDFS

So I pipe the file into sox instead:

hadoop fs -cat hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav | sox -t wav - -n stats  <-- This works fine

But I am not able to do the same thing in PySpark:

wavfile = sc.textFile('hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
wavfile.pipe(['sox', '-t', 'wav', '-', '-n', 'stats'])

I also tried other options such as sc.binaryFiles and sc.pickleFile, without success.
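
For reference, here is a minimal sketch of how the sc.binaryFiles route could look. sc.textFile decodes its input as text and splits it into lines, so the bytes that reach sox through pipe() are no longer a valid .wav stream. sc.binaryFiles instead yields (path, raw bytes) pairs, which can be passed to sox on stdin much like the hadoop fs -cat pipeline. The helper names run_on_bytes and wav_stats are hypothetical, and the sketch assumes that sox is installed on every worker node and that each file fits in executor memory. It has not been tested on a cluster.

```python
import subprocess

# Same sox invocation as in the shell pipeline: read a wav stream from stdin.
SOX_CMD = ["sox", "-t", "wav", "-", "-n", "stats"]


def run_on_bytes(cmd, data):
    """Feed raw bytes to cmd on stdin; return (exit code, combined output text).

    stderr is merged into stdout because sox prints the result of the
    stats effect on stderr.
    """
    proc = subprocess.run(
        cmd,
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    return proc.returncode, proc.stdout.decode("utf-8", errors="replace")


def wav_stats(sc, path):
    """Hypothetical helper: run sox on every file under path.

    sc is an existing SparkContext. binaryFiles yields one
    (file path, file content as bytes) pair per file, so each file is
    handed to sox whole rather than line by line.
    """
    return (
        sc.binaryFiles(path)
        .mapValues(lambda content: run_on_bytes(SOX_CMD, bytes(content)))
        .collect()
    )
```

Usage would be something like wav_stats(sc, 'hdfs://xxxxxxx:8020/user/ab00855/'), which returns a list of (path, (exit code, sox output)) tuples. A non-zero exit code means sox rejected that file.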

Any thoughts?

Venkat Ankam

