This is with regard to the Kubernetes Scheduler Backend and scaling the
process to accept contributions. Given we're moving past upstreaming
changes from our fork, and into getting *new* patches, I wanted to start
this discussion sooner rather than later. This is more of a post-2.3
question - not something we're looking to solve right away.

While unit tests are handy, they're not nearly as good at giving us
confidence as a successful run of our integration tests against
single/multi-node k8s clusters. Currently, we have integration testing
set up at https://github.com/apache-spark-on-k8s/spark-integration,
running continuously against apache/spark:master in pepperdata-jenkins
(on minikube) and k8s-testgrid (on GKE clusters). Now, the question is:
how do we make integration tests part
of the PR author's workflow?

1. Keep the integration tests in the separate repo and require, as a
policy, that contributors run them and add new tests before their PRs are
accepted. Given minikube <https://github.com/kubernetes/minikube> is easy
to set up and can run on a single node, it would certainly be possible.
Friction, however, stems from contributors potentially having to modify the
integration test code hosted in that separate repository when
adding/changing functionality in the scheduler backend. Also, it's
certainly going to lead to at least brief inconsistencies between the two
repositories.

2. Alternatively, we check in the integration tests alongside the actual
scheduler backend code. This would work really well and is what we did in
our fork. It would have to be a separate package which would take certain
parameters (like cluster endpoint) and run integration test code against a
local or remote cluster. It would include at least some code dealing with
accessing the cluster, reading results from K8s containers, and test
fixtures.
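
To make the "takes certain parameters" part of (2) concrete, here is a
rough sketch of how such a package might accept a cluster endpoint and
turn it into a Spark master URL. It is written in Python only for brevity;
the names ClusterConfig, parse_args, --cluster-endpoint and --namespace are
all hypothetical and are not taken from the existing integration test
code. The only thing assumed about Spark itself is the k8s://<api-server>
master URL form.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Where the integration tests should run (hypothetical structure)."""
    endpoint: str   # Kubernetes API server URL, local (minikube) or remote
    namespace: str  # namespace in which the test pods would be created

    @property
    def spark_master(self) -> str:
        # Spark addresses a Kubernetes cluster as k8s://<api-server-url>.
        return "k8s://" + self.endpoint


def parse_args(argv):
    """Parse the (hypothetical) command-line parameters of the test package."""
    parser = argparse.ArgumentParser(
        description="Run integration tests against a local or remote cluster")
    parser.add_argument("--cluster-endpoint", required=True,
                        help="URL of the Kubernetes API server")
    parser.add_argument("--namespace", default="default",
                        help="namespace to run the test pods in")
    args = parser.parse_args(argv)
    # Drop a trailing slash so the master URL is formed consistently.
    return ClusterConfig(endpoint=args.cluster_endpoint.rstrip("/"),
                         namespace=args.namespace)
```

A contributor would then point the same tests at minikube or at a remote
cluster just by changing the endpoint argument, which is the main appeal
of keeping the tests parameterized rather than tied to one environment.
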
I see value in adopting (2), given it's a clearer path for contributors and
lets us keep the two pieces consistent, but it seems uncommon elsewhere.
How do the other backends, i.e. YARN, Mesos and Standalone, deal with
accepting patches and ensuring that they do not break existing clusters? Is
there automation employed for this thus far? Would love to get opinions on
(1) vs. (2).