Hi Everyone,
I'm a bit new to the Avro format, and I'm trying to process a fairly large Avro file (300 columns, 200K rows) with Python 3.

However, it's a bit slow, and I would like to try processing individual parts of the file with 5 processes.

I wonder if there is an easy way to seek within an Avro file without ending up in the middle of a record and reading garbage, rather than looping through each record sequentially?

I believe it is a splittable format, since it can be processed in parallel via MapReduce/Spark, but I'm not sure whether the Python avro module supports jumping around in a file to find a safe position to start reading from.
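
From what I've read, Avro files are splittable because every data block ends with the file's 16-byte sync marker, so I was imagining something like the sketch below to find safe start positions, the way MapReduce computes input splits. This is untested: it assumes the avro package exposes the marker as reader.sync_marker (which I haven't verified), and safe_offsets is just a name I made up:

from avro.datafile import DataFileReader
from avro.io import DatumReader

def safe_offsets(path, nsplits):
    # Untested: find "safe" block boundaries by scanning for the sync
    # marker. The first match ends the file header; each later match
    # ends a data block, so a block starts right after every marker.
    with open(path, "rb") as f:
        reader = DataFileReader(f, DatumReader())  # reads the header
        marker = reader.sync_marker                # assumed attribute, 16 bytes
        f.seek(0)
        data = f.read()                            # fine at this file size
    offsets = set()
    for k in range(nsplits):
        pos = data.find(marker, k * len(data) // nsplits)
        if pos != -1:
            offsets.add(pos + len(marker))
    return sorted(offsets)

print(safe_offsets("users.avro", 5))

Even with those offsets, though, I'd still have to decode the block framing (record count, byte length, codec) and reuse the schema from the header myself, which is where I get stuck.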

Currently all I can do is process it row by row, which doesn't help with parallelisation:

from avro.datafile import DataFileReader
from avro.io import DatumReader

reader = DataFileReader(open("users.avro", "rb"), DatumReader())
i = 0
for user in reader:      # records are decoded one at a time, sequentially
    i += 1
    if i > 10000:        # stop after the first 10,000 records
        break
reader.close()
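
I also came across fastavro's block_reader, and if I understand the docs correctly, the records in a block are only decoded when you iterate over that block. So maybe every process could scan all the block headers (cheap) but only deserialize its own share of the blocks, something like this rough, untested sketch (worker and NPROC are my own names):

import multiprocessing as mp
from fastavro import block_reader

PATH = "users.avro"
NPROC = 5

def worker(worker_id):
    # Each process walks the whole file but only decodes every
    # NPROC-th block; decoding happens when the block is iterated.
    count = 0
    with open(PATH, "rb") as fo:
        for i, block in enumerate(block_reader(fo)):
            if i % NPROC != worker_id:
                continue              # this block belongs to another worker
            for record in block:      # records are deserialized here
                count += 1            # ...real per-record work would go here
    return count

if __name__ == "__main__":
    with mp.Pool(NPROC) as pool:
        print("total records:", sum(pool.map(worker, range(NPROC))))

But I'm not sure whether this is the intended way to use block_reader.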

Or should I switch to C or Java to process bigger files, if Spark/MapReduce isn't an option?

Thanks,