There are steps defined in the README and INSTALL for setting up Accumulo from the binary distribution. It sounds reasonable to add additional steps to these documents for downloading dependencies into the lib/ folder, as long as the instructions are clearly explained. Bonus points for including a helper script.
Organizations that manage dependencies in an offline environment will still have guidance on which artifacts and versions to obtain, and where to put them.
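To make the "helper script" idea concrete, here is a minimal sketch of what such a downloader could look like. The coordinates listed are illustrative examples, not Accumulo's actual dependency list, and the URL layout assumes the standard Maven Central repository structure:

```shell
#!/usr/bin/env sh
# Hypothetical helper for fetching runtime dependencies into lib/.
# The dependency coordinates below are examples only.

# Convert a group:artifact:version coordinate into a Maven Central URL.
central_url() {
  coord="$1"
  group=$(printf '%s' "$coord" | cut -d: -f1 | tr '.' '/')
  artifact=$(printf '%s' "$coord" | cut -d: -f2)
  version=$(printf '%s' "$coord" | cut -d: -f3)
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
    "$group" "$artifact" "$version" "$artifact" "$version"
}

# Dry run: print the URLs that would be fetched. To actually download,
# uncomment the curl line.
for dep in commons-configuration:commons-configuration:1.10 \
           org.apache.commons:commons-math3:3.6.1; do
  central_url "$dep"
  # curl -fLo "lib/$(basename "$(central_url "$dep")")" "$(central_url "$dep")"
done
```

An offline organization could run the dry-run form to get the exact artifact list, mirror those URLs internally, and drop the jars into lib/ by hand.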
I suggest delaying the change to 2.0.0, since some users would be surprised by the switch.
We should also remember the aphorism "if it ain't broke, don't fix it." Are users running into compatibility problems with the classpath or class loaders? If yes, then this is important. If not, it could still be a good thing, and I would support it if someone is energized to implement the change. It sounds like this would make the release process and future maintenance easier, especially on the legal end.
Projects like accumulo-quickstart(1) are critical for allowing users to play around with Accumulo with a low barrier to entry. Let's make sure that (or a similar) project still works.
On Thu, Jun 30, 2016 at 8:13 PM Dylan Hutchison <[EMAIL PROTECTED]> wrote:
Yep. README will be key. Agreed. I preemptively labeled the issue for 2.0.0, because I figured traction any sooner would be impossible. :) The impetus for this was that I recently bumped our commons-math dependency to commons-math3, and it was such a time sink to track down the LICENSE/NOTICE modifications for even just that one bundled dependency. I seriously doubt our LICENSE/NOTICE files are fully up-to-date and in sync with the other bundled deps which have been updated over time.
But, to the question of whether it's broke... I've seen several cases where a version in our lib directory caused a problem with a version of the same classes elsewhere in the user's system. The user thought they could just avoid any dependency convergence/reconciliation on their part, because they thought Accumulo would just work... and when it didn't, they blamed Accumulo when it was their specific environment which was the problem. If we communicate that responsibility up front, perhaps we wouldn't get blamed when users fail to do their due diligence to converge their dependencies or when they use wildcards excessively in their classpath configs. Agreed.
Targeting for 2.0, including updates in the README, and having a means of helping the downstream user find the appropriate licensing information makes me much more comfortable with this.
I have to ask, though: why not just do source-only releases? Or source plus publishing to Maven Central the binary jars needed for the public API? On Thu, Jun 30, 2016 at 8:03 PM, Christopher <[EMAIL PROTECTED]> wrote:
This reasoning seems like avoiding the real problem, which seems distinct from not bundling 3rd party works. It's our job as a community to keep accurate track of our dependency licensing, even if we don't need to make a document about it, because we have to ensure that cat-x is kept out*.
Changes needed in our LICENSE/NOTICE for a bundled dependency change should be getting handled by whomever does each dependency change. Folks who review changes (even in our commit-then-review process) should be pointing out where due diligence hasn't been done. We spent a ton of time getting our LICENSE/NOTICE files correct back in September. It'd be super disappointing if the impact of that effort atrophied.
If the downstream users are going to be fulfilling dependencies themselves, should we try to provide an accurate range of versions that we properly work with? * barring maneuvers related to "optional" deployment dependencies, natch.
On Fri, Jul 1, 2016 at 10:44 AM, Sean Busbey <[EMAIL PROTECTED]> wrote: Not having the binary release would suck. It's nice to be able to easily test the latest version of Accumulo on a cluster. We would not be able to easily run our own cluster test suites against release candidates.
On Fri, Jul 1, 2016 at 10:37 AM, Keith Turner <[EMAIL PROTECTED]> wrote:
We don't need to have artifacts in the release to do this though. We could have a nightly build job (for use on dev@accumulo) that makes the binary artifacts needed. That job can take a git ref and default to HEAD. If we want to grab, e.g., release candidates to deploy, we could then use it.
If these test clusters are going to have to run some script to pull down 3rd party jars, what's the difference in having that script either build the accumulo jars or download them from a jenkins job?
It would suck to not have the binary artifact and I wouldn't be surprised if, by changing this, we break downstream projects (just as an observation).
However, with additional tooling/infrastructure, we could probably get back to a reasonable position for ease of use in what we have now.
This leads me to wonder: what problem are we trying to solve? By avoiding the binary release, we're making it easier on ourselves to release code (avoiding the continual L&N work). The build becomes a bit simpler with only a source release.
If this is *really* about ease-of-use for downstream packagers (which seemed to be your original intent, Christopher, assuming you're trying to make life easier as a package maintainer for Fedora), is there a different way we could solve this problem that would meet your needs without completely removing the binary tarball?
On Fri, Jul 1, 2016 at 12:07 PM, Sean Busbey <[EMAIL PROTECTED]> wrote:
To be clear, I would not want to just drop the binary tarball w/o a suitable replacement. If all we had was the source tarball, I would have to write my own scripts to create something for testing a release on a cluster.
On Fri, Jul 1, 2016 at 10:44 AM Sean Busbey <[EMAIL PROTECTED]> wrote:
I'd actually prefer source-only + jars in Maven... but I don't think that could reach consensus. I figured a more limited approach, still doing a binary tarball but with less bundling, had a better chance at getting buy-in. As I see it, the problem is an artificial one. Tracking these additional things is the result of ensuring we document and communicate our rights and our users' rights to redistribute binary artifacts produced by other entities. It's a problem we create by a choice to bundle. If we're not redistributing these other artifacts because we're not bundling them, then it's not a problem.
That said, it's still nice to try to communicate the redistribution rights our users will have with our dependencies, so they don't have to track them down individually. But, this isn't ultimately our responsibility. It's just a nice thing to do for our users' convenience.
I agree... and it's not that hard either, but it's a huge time sink when a dep version is bumped from 1.0.1 to 1.0.2 for a quick bug fix: check whether it's one of the jars we're bundling, download it from Maven Central (because that's the one we're going to bundle), unpack it, extract the essential docs, determine what's changed, correct errors, determine what doesn't need to be copied, and figure out how to copy/paste the required updates into the structure of our LICENSE/NOTICE files' sections. I agree. I don't want this to atrophy... but given my experience updating just commons-math, I find it hard to imagine that it won't. Either that, or we'll just avoid updating bugfixes, security fixes, and adding new features... and we suffer from that angle instead.
It's hard to know enough about downstream environments to communicate this accurately. I think it'd be better to establish a baseline, saying "we've tested with X" (and personally, I think X should be relatively modern/recent), and then if users diverge to be compatible with older/newer software, they'll know that doing so comes with a risk that they may need to patch for that updated or previous library. This is extremely common in downstream packaging/integration. Look at all the CentOS-specific, etc. patches which exist solely for dependency convergence (For fun, look at the Fedora-specific Hadoop patches: https://apps.fedoraproject.org/packages/hadoop/sources/). Trying to test, document, and communicate a range is much harder, and it conflates upstream development with integration tasks a bit (IMO).
Not every user gets their software from an intermediate like CentOS, RHEL, Fedora, Ubuntu, Apache BigTop, Cloudera, or Hortonworks distros. Some users prefer to get their stuff direct... but these users are typically more advanced, and should understand that doing so means they take on some integration responsibilities for their custom environment. The intermediate packagers/integrators are in a better position to drive widespread adoption, though, and they are already performing these tasks regardless of what we're bundling in our binary tarball.
On Fri, Jul 1, 2016 at 12:34 PM Josh Elser <[EMAIL PROTECTED]> wrote: I'm finding it difficult to express clearly the one or two top problems I'm trying to solve. I think this is one of those things that addresses several smaller problems, each of which on their own aren't that important, but add up. Some of those are:
* Reduce developer workload so that we can more easily bump dependencies when needed for features, bugfixes, and security fixes.
* Reduce the technical and licensing debt on our part (current and future), because we're taking on unnecessary bundling tasks which are prone to faulty assumptions.
* Better communicate downstream responsibilities for integration so upstream Accumulo is not harmed by negative perceptions when it's not our fault (we made faulty assumptions and the user didn't reconcile them).
* Refocus/narrow our responsibilities to the upstream project, and draw a distinction with additional integration responsibilities we might voluntarily take on, so that we can provide a better experience for integrators and ease/encourage greater adoption.
* In general, encourage making fewer upstream assumptions about downstream use cases, so we can better support a wider audience of users.
* Prefer extensible tools for users to customize their integration experience, rather than hard-code decisions for them.
FWIW, it was reported to me today that a user ran into an issue where my recent update of commons-configuration caused an integration problem, because our scripts/packaging do not bundle commons-configuration and we just assume it will work with the version provided in Hadoop's lib directory. That's the kind of thing I'd like to avoid... users should understand that assumptions in our packaging may not work for them, and we're creating work for ourselves while failing to communicate that when we try to bundle everything for them.
If we were a self-contained application, we could even go the opposite way, and bundle everything. But, we're not. We're picking and choosing what to bundle, and our choices might not be right. We should make it easier for the users to choose, instead.
On Fri, Jul 1, 2016 at 3:07 PM William Slacum <[EMAIL PROTECTED]> wrote:
That could be a potentially huge number of profiles, and it would add a lot of complexity which is certainly going to suffer from lack of maintenance over time. I really think this kind of thing (integration) is a distinct responsibility better suited to external/downstream tasks than internal/upstream. Yes. That's my opinion.
Perhaps tangential (but maybe not?): I would love to get to a point where we prevent users from depending on Accumulo for their dependencies. While there are security reasons we would want to fully sandbox iterators, it would be good to encourage a model of iterator deployment which doesn't push users toward a dependence on the JARs that we bundle.
IMO, the jars we bundle should only be used by us. Users shouldn't know about them, and they should bundle the dependencies their own code requires.
On Thu, Jun 30, 2016 at 5:43 PM Christopher <[EMAIL PROTECTED]> wrote: Reading back through all the replies, I don't see a *strong* consensus, but it does seem like there's some acceptance of my proposal (perhaps with some reservations).
It seems people are mostly "okay" with this, so long as it's pushed off to the future (2.0+), and is accompanied by some automated way of downloading dependency jars, and collating their LICENSE/NOTICE files.
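On the "collating their LICENSE/NOTICE files" point, here is a rough sketch of what an automated collation step could look like. The directory layout (unpacked dependencies under one directory, each carrying a NOTICE file) and file names are assumptions for illustration, not a description of any actual tooling:

```shell
#!/usr/bin/env sh
# Hypothetical sketch: collate the NOTICE files found under a directory
# of unpacked dependencies into one combined NOTICE document.

collate_notices() {
  deps_dir="$1"   # directory containing unpacked dependencies
  out="$2"        # combined output file
  : > "$out"      # truncate/create the output
  # Find each dependency's NOTICE file and append it with a header
  # identifying where it came from.
  find "$deps_dir" \( -name NOTICE -o -name NOTICE.txt \) | sort |
  while read -r f; do
    printf '==== %s ====\n' "$f" >> "$out"
    cat "$f" >> "$out"
    printf '\n' >> "$out"
  done
}
```

A downloader script could call something like this after fetching and unpacking the jars, giving the downstream user one place to look for the licensing information.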
So, unless there's more discussion here, my intention is to proceed to create a pull request against the 2.0 branch (currently: master) which replaces our assembly bundling with a downloader script. That way, if there's any additional feedback on the specific implementation, folks will be able to comment directly on that.
I've had quite the foray into ASF release policies over the past two days which brings me back to this.
I really don't believe that the amount of effort you claim we will save will actually be beneficial overall. Our dependencies do not frequently change which means that our L&N also do not frequently change.
Even if I do concede that it will make things simpler for Accumulo in the short term, you're forcing change on N organizations which already integrate Accumulo in its current state (you would force all downstream consumers to change). I would rather solve this once in Accumulo.
If you want to create such a script to help users build their own artifact for their specific installation: great. I believe that the argument that such a script would save time in Accumulo in managing our L&N is false.
On Mon, Jul 18, 2016 at 2:24 PM Josh Elser <[EMAIL PROTECTED]> wrote: I know it would have saved me a ton of time (and sanity) moving to commons-math3. How often it saves us time is debatable, agreed. But, that's not a primary motivation. It's just a slight benefit, which might reduce the burden of bumping dep versions.
I have a PR ready to push... not sure I'm 100% happy with it, because of the way it downloads deps one at a time (it might be easier to download them all at once using Maven, but with some complication), and some of the changes need to be pushed as a separate commit anyway. So perhaps you'll be able to see better what I'm thinking when you can see the changeset.
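For the "all at once using Maven" alternative, one existing mechanism is the maven-dependency-plugin's copy-dependencies goal, which resolves everything from the project's pom in a single invocation. This is a sketch, not what the PR does; the output directory and scope filter are assumptions:

```shell
# Run from a checkout containing the project's pom.xml.
# Copies all resolved runtime dependencies into lib/ in one shot,
# instead of downloading jars individually.
mvn dependency:copy-dependencies \
    -DincludeScope=runtime \
    -DoutputDirectory=lib
```

The trade-off is that this requires a source checkout (or at least the pom), whereas a standalone downloader script can ship inside the binary tarball itself.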
As I said before, this isn't really about a single (or a few) big benefit(s). It's about numerous tiny ones, which are admittedly hard to measure. Whether it pays off in the long-run is hard to tell, but that's what I'm targeting... the long-term, though there may be some road bumps in the short-term. I'm convinced this is the right thing to do, but I can understand the reluctance to accept my conclusion, when I've not done a good job of articulating dramatic, easy-to-see benefits. :(
Would it be better for me to wait for your push before continuing discussion? I feel like it's hard to talk over hypotheticals and might just be distracting :). With changes, we can outline positives/negatives rather than feelings.
No hurry. This is only for the 2.0 branch (master) for which there are no immediate or near-term release plans. In addition, Josh has expressed some concerns which make it clear there isn't consensus here (at least, not among the discussion participants). So, take your time. I won't press until we're nearer an actual release schedule for 2.0.
On Mon, Jul 25, 2016 at 11:51 AM Sean Busbey <[EMAIL PROTECTED]> wrote: