Subject: Processing .wav files in PySpark

I need to process .wav files in PySpark.  When the files are on the local file system, I can process them.  Once I store them on HDFS, I run into problems.  For example:

Running sox on a local .wav file works:

sox ext2187854_03_27_2014.wav -n stats  <-- works fine

sox hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav -n stats   <-- Does not work as sox cannot read HDFS file.

So I stream the file from HDFS into sox instead:

hadoop fs -cat hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav | sox -t wav - -n stats  <-- This works fine

However, I have not been able to do the same thing in PySpark:

wavfile = sc.textFile('hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
wavfile.pipe(['sox', '-t', 'wav', '-', '-n', 'stats'])

I also tried other options such as sc.binaryFiles and sc.pickleFile, without success.
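
For reference, here is a minimal sketch of how the sc.binaryFiles route might mirror the "hadoop fs -cat ... | sox -t wav - -n stats" pipeline, by feeding each file's raw bytes to sox on standard input. It assumes sox is installed on every worker node and that each .wav file fits in executor memory. The helper names pipe_bytes, sox_stats and stats_for are hypothetical and only for illustration. Both stdout and stderr are captured, on the assumption that sox may write its stats report to stderr.

```python
import subprocess


def pipe_bytes(cmd, data):
    """Feed raw bytes to cmd's stdin; return (returncode, stdout, stderr)."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate(data)
    return proc.returncode, out, err


def sox_stats(path_and_bytes):
    """Run 'sox -t wav - -n stats' on one (path, bytes) record."""
    path, data = path_and_bytes
    cmd = ['sox', '-t', 'wav', '-', '-n', 'stats']
    code, out, err = pipe_bytes(cmd, bytes(data))
    # Keep both streams, since the report may arrive on stderr (assumption).
    return path, code, (out + err).decode('utf-8', 'replace')


def stats_for(sc, hdfs_path):
    """Hypothetical driver: sc is an existing SparkContext."""
    # binaryFiles yields one (path, whole-file bytes) pair per file,
    # so the binary content is not split on newlines as with textFile.
    return sc.binaryFiles(hdfs_path).map(sox_stats).collect()
```

With an existing SparkContext, this would be called as stats_for(sc, 'hdfs://xxxxxxx:8020/user/ab00855/ext2187854_03_27_2014.wav'). I have not verified this on a cluster.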

Any thoughts?

Venkat Ankam

  Davies Liu 2015-01-16, 23:49