Since I know many of you don't read, or are not part of, the user list, I'll
make a summary of what happened at the summit:
We discussed what we would need in order to start serving our predictions
with Spark. We mostly talked about alternatives to this work and what we
could expect in these areas.
I'm going to share mine here, hoping it will trigger further discussion. We
currently:
- use Spark as an ETL tool, followed by
- a Python (numpy/pandas-based) pipeline to preprocess the data, and
- TensorFlow for training our neural networks.
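The three-stage shape above can be sketched in a few lines. This is a toy
illustration only: the feature names and the standard-scaling step are
hypothetical stand-ins (in plain Python, where the real pipeline would use
Spark, numpy/pandas and TensorFlow):

```python
# Stage 1 (Spark ETL, stubbed): rows as they might come out of the warehouse.
raw_rows = [
    {"clicks": 10, "price": 2.5},
    {"clicks": 3,  "price": 4.0},
    {"clicks": 7,  "price": 1.0},
]

def preprocess(rows):
    """Stage 2: the numpy/pandas-style step, here in plain Python.
    Standard-scales each feature so the network trains on centred data."""
    features = ["clicks", "price"]
    means = {f: sum(r[f] for r in rows) / len(rows) for f in features}
    stds = {
        f: (sum((r[f] - means[f]) ** 2 for r in rows) / len(rows)) ** 0.5
        for f in features
    }
    return [[(r[f] - means[f]) / stds[f] for f in features] for r in rows]

# Stage 3 (TensorFlow training, stubbed): the scaled matrix is what a
# model.fit(...) call would consume.
X = preprocess(raw_rows)
```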
What we'd love to do, and why we don't:
- Start using Spark for our full preprocessing pipeline. Because type
safety. And distributed computation. And Catalyst. But mainly because of
our main issue:
- We want to use the same code for online serving. We're not willing
to duplicate the preprocessing operations, and Spark is not built for
low-latency online serving:
- If we want to preprocess online, we need to copy/paste our
custom transformations into MLeap.
- It's also an issue to hand the preprocessed data over to a
TensorFlow API for serving.
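The duplication problem is easiest to see with a single transformation. In
this sketch (a hypothetical `scale_price` with made-up fitted statistics),
the same function conceptually serves both paths, but today the batch
version lives as a custom Spark transformation and the online version has
to be ported by hand (e.g. to MLeap):

```python
def scale_price(price, mean=2.5, std=1.2):
    """One transformation that currently must exist twice: once as a
    Spark/pandas step for training, once reimplemented for serving.
    The mean/std constants are hypothetical fitted statistics."""
    return (price - mean) / std

# Batch (training) path: applied over a whole extracted column.
train_column = [1.3, 2.5, 3.7]
train_scaled = [scale_price(p) for p in train_column]

# Online (serving) path: the very same function on a single request value,
# which is exactly what is hard when the batch version is a custom Spark
# transformation rather than shared code.
request_scaled = scale_price(1.3)
```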
- Use Spark to do hyperparameter tuning.
- GPU integration with Spark, letting us achieve finer tuning.
- Better TensorFlow integration
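The hyperparameter-tuning wish has a natural Spark shape, since each grid
point is an independent training run: `sc.parallelize(grid).map(evaluate)`.
A minimal sketch, with the training run replaced by a stub loss so the
example runs anywhere (the grid values and loss are purely illustrative):

```python
import itertools

def evaluate(params):
    """Stand-in for one TensorFlow training run; returns (params, loss).
    On Spark this is the function you would ship to the executors."""
    lr, layers = params
    loss = abs(lr - 0.01) + abs(layers - 3) * 0.1  # stub loss surface
    return params, loss

# Grid of (learning rate, hidden layers) candidates.
grid = list(itertools.product([0.001, 0.01, 0.1], [2, 3, 4]))

# Local stand-in for: sc.parallelize(grid).map(evaluate).collect()
results = list(map(evaluate, grid))
best_params, best_loss = min(results, key=lambda r: r[1])
```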
Now that I'm on the @dev list, do you think that any of these issues could be
addressed? We talked at the summit about PFA (Portable Format for
Analytics) and how we would expect it to cover some issues. Another
discussion I remember was about encoding operations (functions/lambdas) in
PFA itself. And I don't remember having smoked anything at that point,
although we could as well have.
Oh, and @Holden Karau <[EMAIL PROTECTED]> insisted that she would be
much happier with us if we started helping with code reviews. I'm willing
to make some time for that.
Sorry again for the delay in replying to this email (and now sorry for the
length); looking forward to following up on this topic.
On Tue, Jul 3, 2018 at 15:37, Saikat Kanjilal (<[EMAIL PROTECTED]>)