Subject: [DISCUSS] Lucene-Solr split (Solr promoted to TLP)


Dear Lucene and Solr developers!

A few days ago, I initiated a discussion among PMC members about
potential pros and cons of splitting the project into separate Lucene
and Solr entities by promoting Solr to its own top-level Apache
project (TLP). Let me share with you the motivation for such an action
and some follow-up thoughts I heard from other PMC members so far.

Please read this e-mail carefully. Both the PMC and I look forward to
hearing your opinion. This is a DISCUSS thread and it will be followed
next week by a VOTE thread. This is our shared project and we should
all shape its future responsibly.

The big question is this: “Is this the right time to split Solr and
Lucene into two independent projects?”

Here are several technical considerations that drove me to ask the
question above (in no particular order):

1) Precommit/test times. These are crazy high. If we split into two
projects we can pretty much cut all of Lucene testing out of Solr (and
vice versa), making development a bit more fun again.

2) Build system itself and source release packaging. The current
combined codebase is a *beast* to maintain. Working with Gradle on
both projects at once made me realise how little the two have in
common. The code layout, the dependencies, even the workflow of people
working on these projects... The build (both Ant and Gradle) is full
of Solr- and Lucene-specific exceptions and hooks that could be more
elegantly solved if moved to each project independently.

3) Packaging. There is no single source distribution package for
Solr+Lucene. They are already "independent" there. Why should Lucene
and Solr always be released at the same pace? Does it always make
sense?

4) Solr is essentially taking in Lucene and its dependencies as a
whole (as do Elasticsearch and many other projects). In my opinion
this makes Lucene eligible for refactoring and maintenance as a
separate component. The learning curve for people coming to each
project separately is going to be gentler than trying to dive into
the combined codebase.

5) Mailing lists, build servers. Mailing lists for users are already
separated. I think this is yet another indication that Solr is
something more than a component within Lucene. It is perceived as an
independent entity and used as an independent product. I would really
like to have separate mailing lists for these two projects (this
includes build and test results) as it would make life easier: if your
focus is more on Lucene (or Solr), you would only need to track half
of the current traffic.

As I already mentioned, the discussion among PMC members highlighted
some initial concerns and reasons why the project should perhaps
remain glued together. These are outlined below, with some of the
counter-arguments (copied from the private PMC discussion list)
presented under each concern to avoid repeating the same discussion
here.

1) Both projects may gradually go their separate ways after the
separation and even develop “against” each other, as was the case
before the merge.

Whether this is a legitimate concern is hard to tell. If Solr goes TLP
then all existing Lucene committers will automatically become Solr
committers (unless they opt not to), so there will be both procedural
ways to prevent this from happening (vetoes) and common-sense reasons
to just cooperate.

2) Some people like parallel version numbering (concurrent Solr and
Lucene releases) as it gives instant clarity which Solr version uses
which version of Lucene.

This can still be done on the Solr side (it is Solr’s decision to
adopt any versioning scheme the project feels comfortable with). I
personally (DW) think this kind of versioning is actually more
confusing than helpful; Solr should have its own cadence of releases
driven by features, not sub-component changes. If backwards
compatibility is a factor then a solution might be to sync on major
version releases only (this is how Elasticsearch handles it).

3) Solr tests are the first “battlefield” test zone for Lucene
changes; if Solr becomes a TLP, this part will be gone.

Yes, true. But realistically Solr will have to adopt some kind of
snapshot-based dependency on Lucene anyway (whether as a git submodule
or a Maven snapshot dependency). So if there are bugs in Lucene they
will still be detected by Solr tests (and fairly early).

4) Why split now if we merged in the first place?

Some of you may wonder why split the project that was initially
*merged* from two independent codebases (around 10 years ago). In
short, there was a lot of code duplication and interaction between
Solr and Lucene back then, with patches flying back and forth.
Integration into a single codebase seemed like a great idea to clean
things up and make things easier. In many ways this is exactly what
did happen: we have cleaned up code dependencies and reusable
components (on the Lucene side) consumed by not just Solr but also other
projects (downstream from Lucene).

The situation we find ourselves in now is different from what it was
before: recent and ongoing development for the most part falls within
Solr or Lucene exclusively.

This e-mail is for discussing the idea and presenting arguments/
counter-arguments for or against the split. It will be followed by a
separate VOTE thread e-mail next Monday. If the vote passes then there
are many questions about how this process should be arranged and
orchestrated. There are past examples even within Lucene [1] that we
can learn from, and there are people who know how to do it. The
actual process is of lesser concern at the moment; what we mostly want
to do is reach out to you, signal the idea and ask for your
opinion. Let us know what you think.

[1] https://lists.apache.org/thread.html/15bf2dc6d6ccd25459f8a43f0122751eedd3834caa31705f790844d7%401270142638%40%3Cuser.nutch.apache.org%3E