So I've been looking into options for providing encryption at rest, and it seems like what Accumulo has is abandonware from a project perspective. There is no official documentation on how to perform encryption at rest, and the best information on its status comes from ticket comments, a year or more old, saying that the feature is still experimental. Recently there was a talk that described using HDFS encryption zones as an alternative.
From my perspective, this is what I see as the current situation:
1- Encryption at rest in Accumulo isn't actively being worked on
2- Encryption at rest in Accumulo isn't part of the public API or marketed capabilities
3- Documentation for what does exist is scattered throughout Jira comments or presentations
4- A viable alternative exists that appears to have feature parity in HDFS encryption
5- HBase has finer grained encryption capabilities that extend beyond what HDFS provides
Moving forward, what's the consensus for supporting this feature? Personally, I see two options:
1- Start going down a path to bring the feature into the forefront and start providing feature parity with HBase
2- Remove the feature and place emphasis on upstream encryption offerings
I'm only smart enough to know that I'm not smart enough to build a distributed database *and* encrypt it securely. I'd much prefer to defer to the people up the stack.
The one thing we'd miss out on is things like column-family-level encryption control (which I think HBase has), but I'd much rather have a complete encryption story before worrying about the fine-grained support.
Some colleagues have expressed interest in examining the current state of our rfile encryption, with the expectation of suggesting improvements or contributions to close the gap. I don't know a timeline for any of that, or whether that interest will even bear out in terms of concrete action, so I don't know how much that should count right now. However, I wouldn't write off our encryption at rest just yet.
That said, I agree it's currently in a dead state, and if it continues that way, without progress to close the gaps in its implementation, it might be best to scrap it for now until such time as we can refactor the RFile API to ease improvements in that area of the code all around. As is, its API is way more complex than it needs to be, and I'd bet it's scaring off potential contributions.
On Fri, Oct 30, 2015, 16:37 Josh Elser <[EMAIL PROTECTED]> wrote:
if anyone wants to pick it back up later, we can always pull it back out of the git history. how would implementation work? I know it's not in the public API, but if there are folks relying on it we'd essentially be locking them out of upgrades. would we provide migration tools? On Fri, Oct 30, 2015 at 3:22 PM, William Slacum <[EMAIL PROTECTED]> wrote:
There's another way to look at the state of Accumulo's encryption at rest:
1. Encryption at rest works great for what it does, and the code being "at rest" isn't necessarily a problem
2. Several organizations are using Accumulo's encryption at rest effectively in operations
3. Encryption at rest has been a supported configuration option for over two years with established plugin interfaces, and therefore it should be considered part of the public API
4. Upstream alternatives (to my knowledge) have not been analyzed for performance or security
The given option #2 would at least require an analysis of alternatives, and we would have to decide what to do about backwards compatibility for users using custom key stores and encryption strategies that may or may not be supported by upstream alternatives.
As far as option #1 goes, I can get behind encouraging people to take up projects to improve Accumulo's encryption. I think we're already going down this path, but without having identified resources to do the improvements. Any volunteers?
Adam
On Fri, Oct 30, 2015 at 4:22 PM, William Slacum <[EMAIL PROTECTED]> wrote:
1. I'm not sure I'd call an incomplete solution 'great'. What it does is provide partial encryption-at-rest protection (unless you're running without walogs, and have good integration with some external secure key management facility, and then it's probably fine).
2. I'm concerned that anybody using Accumulo's E-A-R doesn't necessarily realize its current shortcomings, or its lack of upstream maintenance support (which it has not been receiving). It may be the case that these users have support from an intermediary, and do understand the shortcomings... I don't know, but it's a concern.
3. Correction: it has been an explicitly experimental feature and an incomplete one, which hasn't really been touched in two years, and has been explicitly excluded by the community from being public API because of its incompleteness. Age doesn't determine public API status. The community does.
4. Has Accumulo's been evaluated for security and performance? By whom? Is it published?
On Sun, Nov 1, 2015, 08:55 Adam Fuchs <[EMAIL PROTECTED]> wrote:
On Nov 1, 2015 9:58 AM, "Christopher" <[EMAIL PROTECTED]> wrote:
The only thing that doesn't get encrypted is a temporary WAL recovery file. That is a project we should take on, but it does not imply that the existing features are not valuable. With HDFS encryption options this would now be a much easier project to take on. Also, the users I know that use encryption at rest do so with a more secure key store than the default. Anybody that creates a secure system has to analyze the security of the system as a whole. Accumulo's encryption at rest is one part of the solution. Taking away the tool without providing an alternative does nothing to improve the security of systems built on Accumulo.
People are using it, so we have to consider the implications of whatever changes we make and weigh against the benefits. I believe the last bug fix was done this year, so I would argue it is being maintained. Changes to our encryption at rest implementation will have consequences for those users. There had better be a clear benefit if we break their systems. Yes, there have been several talks at meetups and conferences that discuss the security and performance of the current solution.
Is "the code being 'at rest'" you making a funny about active development? Making sure I haven't lost my ability to get jokes :)
I see two reasons why the code would be inactive: the feature is good enough as is or it's not interesting enough to attract attention. Considering it's not public API, there are no discussions to bring into the public API, and there's no effort to document how to use it, my intuition tells me that there isn't enough interest in it from a project perspective.
From a user perspective, I've been getting asked about it when I work with Accumulo users. My recommendation, exclusively, is to use HDFS encryption because I can go to Hadoop's website and find documentation on it. When I go to find documentation on Accumulo's offerings, any usability information comes from vendor SlideShares. Most mentions of the feature on official Apache Accumulo channels echo Christopher's sentiments on the feature being experimental and not being officially recommended for use.
I wouldn't want to rip out the feature first and then figure things out later. Sean already alluded to it, but a roadmap should contain something (tool or documentation) to help users migrate if we go down that route.
What I'm trying to figure out is, when the question of "How do I do encryption at rest in Accumulo?" comes up, what is our community's answer?
If we went down the route of using HDFS encryption zones, can we offer the same features? At the very least, we'd be offering the same database-level encryption scheme. I don't know the details of "more advanced key stores", but it seems like we could potentially take any custom implementation and map it to a KeyProvider. I could also envision table level encryption being implementable via zones, but probably not down to the column family level.
On Mon, Nov 2, 2015 at 12:27 PM, William Slacum <[EMAIL PROTECTED]> wrote: Where does the decryption happen with DFS, is it in the DFS client? If so, using HDFS level encryption seems to offer the same functionality???
Has anyone written a tool that takes an Accumulo-encrypted-HDFS-unencrypted-RFile and rewrites it as an Accumulo-unencrypted-HDFS-encrypted-RFile? Wondering if there are any unexpected gotchas w/ this.
On Mon, Nov 2, 2015 at 12:27 PM, William Slacum <[EMAIL PROTECTED]> wrote:
Glad somebody got it! :)
From my perspective it's working fine and hasn't needed attention. We do a lot of bulk loading, so we haven't been pushing for the additional work on the WAL recovery channel. We use this encryption by default, and haven't had problems with it.
I think it's entirely plausible that HDFS encryption works well. I would want to see recommended configuration settings, some numbers on performance impact, and some indications of long-term reliability before making the switch. Totally agree. Thanks for making that point clear. What do we expect to gain from column-level encryption over instance-level encryption?
On Mon, Nov 2, 2015 at 1:37 PM, Keith Turner <[EMAIL PROTECTED]> wrote: I was discussing my questions w/ Christopher today and he mentioned an experiment that I thought was interesting. What is the random seek performance of Accumulo-encrypted-HDFS-unencrypted-RFile vs Accumulo-unencrypted-HDFS-encrypted-RFile?
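A minimal sketch of how the random-seek comparison above might be timed. This is only a local-file stand-in: the function names are hypothetical, it does not use any Accumulo or HDFS API, and it assumes the same data has been written once under each configuration and is reachable as an ordinary file path.

```python
import os
import random
import tempfile
import time


def time_random_seeks(path, n_seeks=100, read_size=4096, seed=0):
    """Return per-seek latencies (in seconds) for random reads from a file."""
    rng = random.Random(seed)  # fixed seed so both runs hit the same offsets
    size = os.path.getsize(path)
    if size <= read_size:
        raise ValueError("file must be larger than read_size")
    latencies = []
    with open(path, "rb") as f:
        for _ in range(n_seeks):
            offset = rng.randrange(size - read_size)
            start = time.perf_counter()
            f.seek(offset)
            f.read(read_size)
            latencies.append(time.perf_counter() - start)
    return latencies


def make_sample_file(size_bytes):
    """Write random bytes to a temporary file and return its path (test data only)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path
```

Running it with the same seed against each copy of the data keeps the offsets identical, so any difference in the latency distributions comes from the storage and encryption path rather than from the access pattern.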
@Adam, column family level encryption can be useful for multi-tenant environments, and I think it maps pretty well to the document partitioning/sharding/wikisearch style tables. Things are trickier in Accumulo than in HBase since there isn't a 1:1 mapping between column families and files. The built in RFile encryption scheme seems better suited to this.
@Christopher & Keith, it's something we can evaluate. Is there a good test harness for just writing an RFile, opening a reader to it, and just poking around? I was looking at the constructors and they didn't seem straightforward enough for me to comprehend them within a few seconds.
Yup, #2. I also don't know if it's worth the effort for that specific feature. It might be easier to add something like per-namespace and/or per-table encryption, then define common access patterns for applications that want to use multiple keys for encryption.
On Wed, Nov 4, 2015 at 8:10 PM, Adam Fuchs <[EMAIL PROTECTED]> wrote:
+1 I think this is the right step. My hunch is that, given the common data access patterns we have in Accumulo (compared to HBase), per-colfam encryption isn't quite as common a design pattern as it is for HBase (please tell me I'm wrong if anyone disagrees -- this is mostly a gut reaction). I think our users would likely benefit more from a per-namespace/table encryption control like you suggest.
Implementing RFile encryption at HDFS level (e.g. tie a specific zone/key for a table) is probably straightforward. Changing the TServer's WAL use would likely be trickier to get right (a tserver would have multiple WALs, one for each unique zone/key from the Tablets it happens to host). Maybe worrying about that is getting ahead of things -- just thought about it and figured I'd mention it :)
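To make the WAL concern concrete, here is a small sketch of the bookkeeping it implies: a tablet server groups the tablets it hosts by encryption key and would need one WAL per group. The tablet identifiers and key names are made up for illustration, and nothing here reflects how the TServer is actually implemented.

```python
from collections import defaultdict


def group_tablets_by_key(tablet_to_key):
    """Map each encryption key name to the sorted list of tablets that use it."""
    groups = defaultdict(list)
    for tablet, key in tablet_to_key.items():
        groups[key].append(tablet)
    return {key: sorted(tablets) for key, tablets in groups.items()}


# Hypothetical tablets hosted by one tablet server, with their per-table keys.
hosted = {
    "tableA;row_m": "key-a",
    "tableA;row_z": "key-a",
    "tableB;row_k": "key-b",
}

# One WAL would be needed per distinct key, so two WALs for this server.
wals = group_tablets_by_key(hosted)
```

The number of WALs grows with the number of distinct keys among hosted tablets, not with the number of tablets, which is why per-table or per-namespace keys look more tractable than finer-grained ones.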
My main concern using HDFS encryption vs. built-in Accumulo implementation is possibly performance with respect to seeks. If we encrypt our indexed blocks independently (as we do now), I suspect our seeks would be more performant than relying on HDFS encryption, whose encrypted blocks may not fall on our index boundaries. If this is a small difference, it might still be worth it for convenience and simpler maintenance, but I suspect the difference will be somewhat substantial.
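A rough sketch of where decryption has to start for a seek under each scheme. It assumes the HDFS side uses a counter-mode cipher (AES-CTR, which is what HDFS transparent encryption uses as far as I know) and that the RFile side encrypts each block independently, so decryption starts at the enclosing block's first byte. The function names and block offsets are hypothetical, and the sketch only does offset arithmetic; it says nothing about measured throughput.

```python
import bisect

AES_BLOCK_BYTES = 16  # AES cipher block size


def ctr_decrypt_start(offset):
    """For a counter-mode stream, return (counter_increment, bytes_to_skip).

    The counter is advanced by offset // 16 and the first offset % 16 bytes
    of the decrypted cipher block are discarded.
    """
    return divmod(offset, AES_BLOCK_BYTES)


def block_decrypt_start(offset, block_starts):
    """Return the start of the independently encrypted block containing offset.

    block_starts must be sorted ascending and begin at or before offset.
    """
    if not block_starts or offset < block_starts[0]:
        raise ValueError("offset precedes the first block")
    index = bisect.bisect_right(block_starts, offset) - 1
    return block_starts[index]


def bytes_decrypted_before_target(offset, block_starts):
    """Return (ctr_overhead, per_block_overhead) in bytes for one seek."""
    ctr_overhead = offset % AES_BLOCK_BYTES
    per_block_overhead = offset - block_decrypt_start(offset, block_starts)
    return ctr_overhead, per_block_overhead
```

Since an RFile block is also the unit of decompression, decrypting from the block start may not add any work in practice, which is part of why an actual measurement is still needed.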
On Thu, Nov 5, 2015 at 12:11 PM Josh Elser <[EMAIL PROTECTED]> wrote:
JIRAs are fine, but I thought this thread was mostly addressing the fact that there doesn't seem to be a sustained interest in actually working on any of the JIRAs addressing that area of code. Am I wrong? Is there willingness from anybody to expend effort on this code? Even if not, we can still make JIRAs, but they'll probably just be ignored. So, the question for me is: which JIRAs should we make? Are we going to pursue phasing out the code, or pursue improving it? Those are very different JIRA text.
On Thu, Nov 5, 2015 at 12:22 PM Mike Drob <[EMAIL PROTECTED]> wrote:
I think you have misidentified the two camps. There is a camp that believes we should phase out the code in favour of the HDFS encryption, and a camp that believes the code is sufficiently mature. I don't think there is a group that is interested in improving the state of things.
On Thu, Nov 5, 2015 at 12:02 PM, Christopher <[EMAIL PROTECTED]> wrote:
Perhaps. I had interpreted some of Adam's comments ("The only thing that doesn't get encrypted is a temporary WAL recovery file. That is a project we should take on..."), as favoring improvements to the current state of things. As that has also been the focus of previous conversations about the state of Accumulo's encryption-at-rest, I assumed that third camp also existed. Perhaps I was wrong.
On Thu, Nov 5, 2015 at 1:11 PM Mike Drob <[EMAIL PROTECTED]> wrote:
On Thu, Nov 5, 2015 at 12:17 PM, Christopher <[EMAIL PROTECTED]> wrote: Very good point, Chris. This is especially important if we allow users to pick their own encryption algorithms. As I understand it, cipher block chaining (CBC) is important to keep most crypto algorithms secure, and it has a big effect on where you need to start decrypting. There are ways of doing CBC that let you seek pretty close to any point in a file and decrypt from there, and there are other ways that require you to start from the beginning. The current RFile implementation ensures that you can start decrypting at the beginning of an RFile block, which matches where we start decompressing and where we currently seek in HDFS. The performance difference is likely to be much more pronounced for certain crypto settings.
Does anybody have a good diagram showing the architecture of HDFS encryption?
Related: Can we collect the diagrams and design docs from the various implementation JIRAs and put them up on the Accumulo website? Every time that I've needed to reference them it's been a giant pain to go find them. Maybe brush up the contents if there happen to be differences between design and implementation.
On Thu, Nov 5, 2015 at 12:21 PM, Adam Fuchs <[EMAIL PROTECTED]> wrote:
Camps two and three are the same camp, really. If we can identify a clear roadmap (eventually via the right set of tickets), then it comes down to whether people have energy and inclination to do the work. I don't think the roadmap ends here.
On Thu, Nov 5, 2015 at 1:18 PM, Christopher <[EMAIL PROTECTED]> wrote:
Just to moonwalk back a bit, I see a few things happening concurrently now. First is trying to get a consensus on where we want to go with the encryption at rest story in Accumulo.
I see us having established that what we have is scoped down to working for WALs and RFiles, and if you happen to have written it, you are satisfied. However, as a project, we haven't pulled it into the public API and haven't provided documentation, so if you haven't written it, the process of finding out how to configure and use the feature is indirect.
There is some consensus about moving to using HDFS encryption to achieve the same features, but we want to test and see if the performance is comparable between it and Accumulo's RFile encryption capability. There may be caveats based on how you encrypt the data. We want to explore this space. Mike would like a Jira ticket to outline this.
For adding features to Accumulo, we could potentially add encryption at the column level. The open question is the level of effort: dynamic locality groups make this a more difficult task in Accumulo than in products with a 1:1 mapping between locality groups and column families (as well as an extra mapping to files).
Did I miss anything?
On Thu, Nov 5, 2015 at 1:27 PM, Adam Fuchs <[EMAIL PROTECTED]> wrote:
With regards to adding features, it probably makes sense to talk about adding table/namespace crypto configuration separately from column-level encryption. Column-level encryption would require big changes to how we partition data, how we organize configuration information, and how we handle crypto-related errors. Table/namespace configuration would be much more readily achievable, and should be considered separately.
Adam
On Thu, Nov 5, 2015 at 2:11 PM, William Slacum <[EMAIL PROTECTED]> wrote: