This is my first email to this mailing list, so I apologize if I make any errors.
My team is going to be building an application, and I'm investigating options for distributed compute systems. We need to run computations on large matrices.
The requirements are as follows:
1. The matrices can be expected to be up to 50,000 columns x 3 million rows. The values are all integers (except for the row/column headers).
2. The application needs to select a specific row and calculate the correlation coefficient (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) against every other row. This means up to 3 million different calculations.
3. A sorted list of the correlation coefficients and their corresponding row keys needs to be returned in under 5 seconds.
4. Users will eventually request random row/column subsets to run calculations on, so precomputing our coefficients is not an option. This needs to be done on request.
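To make requirements 2 to 4 concrete, here is a minimal single-machine NumPy sketch of the calculation on a small in-memory matrix. It assumes Pearson correlation (the default method in the pandas function linked above) and that higher coefficients should come first. The function names and the `column_indices` parameter are hypothetical, chosen only for illustration. This is not a distributed implementation.

```python
import numpy as np

def correlate_row_against_all(matrix, row_index, column_indices=None):
    """Pearson correlation of one row against every row of `matrix`.

    Assumption: `matrix` fits in memory as a 2-D array (rows x columns).
    """
    data = np.asarray(matrix, dtype=np.float64)
    if column_indices is not None:
        # Requirement 4: restrict the calculation to a column subset.
        data = data[:, column_indices]
    # Center each row, then correlation = dot product / product of norms.
    centered = data - data.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    dots = centered @ centered[row_index]
    denom = norms * norms[row_index]
    with np.errstate(divide="ignore", invalid="ignore"):
        # Constant rows have zero variance, so their correlation is undefined.
        return np.where(denom > 0, dots / denom, np.nan)

def ranked_correlations(matrix, row_keys, row_index, column_indices=None):
    """Return (row_key, coefficient) pairs sorted from highest to lowest."""
    coeffs = correlate_row_against_all(matrix, row_index, column_indices)
    order = np.argsort(-coeffs, kind="stable")  # NaN values sort last
    order = order[order != row_index]           # drop the selected row itself
    return [(row_keys[i], float(coeffs[i])) for i in order]

if __name__ == "__main__":
    sample = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, 3, 2, 4]]
    for key, coeff in ranked_correlations(sample, ["a", "b", "c", "d"], 0):
        print(key, round(coeff, 3))
```

The whole computation reduces to one matrix-vector product over the centered rows, so the work per request is a single pass over the selected data.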
I've been looking at many compute solutions, but I'd consider Spark first due to its widespread use and community. I currently have my data loaded into Apache HBase for a different scenario (random access of rows/columns). I naively tried loading a DataFrame from the CSV using a Spark instance hosted on AWS EMR, but getting the results for even a single correlation takes over 20 seconds.