Completely agree, awesome job Nick.
I will definitely try it this week as well.
On 3/18/18, 2:47 PM, "David Meikle" <[EMAIL PROTECTED]> wrote:
Nice one Nick! Will take a look this week.
On 14 March 2018 at 17:38, Nick Burch <[EMAIL PROTECTED]> wrote:
> Hi All
> As promised, I've finally had a go at implementing my ideas for
> TIKA-1509 / https://wiki.apache.org/tika/CompositeParserDiscussion,
> including a proposed breaking 2.x parser change.
> My work so far is in this github branch, and is ready for review!
> It seems to work fine for the Fallback case, and for the Supplemental
> case. You can set a policy that controls how clashing metadata is handled,
> currently "first one to set a key wins", "last one to set a key wins",
> "ignore previous parsers", and "keep old and new unique values"
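
To make the four policies above concrete, here is a minimal sketch of how such a merge could behave. It does not come from Nick's branch: the names `ClashPolicy` and `MetadataMerge` are hypothetical, metadata is modelled as a plain `Map<String, List<String>>` rather than Tika's `Metadata` class, and "ignore previous parsers" is assumed to mean that only the latest parser's metadata survives.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is an illustration, not Tika's API.
enum ClashPolicy { FIRST_WINS, LAST_WINS, IGNORE_PREVIOUS, KEEP_UNIQUE }

final class MetadataMerge {
    private MetadataMerge() {}

    /** Merges a later parser's metadata into what earlier parsers produced. */
    static Map<String, List<String>> merge(ClashPolicy policy,
                                           Map<String, List<String>> previous,
                                           Map<String, List<String>> latest) {
        // Assumption: "ignore previous parsers" keeps only the latest metadata.
        if (policy == ClashPolicy.IGNORE_PREVIOUS) {
            return copy(latest);
        }
        Map<String, List<String>> merged = copy(previous);
        for (Map.Entry<String, List<String>> entry : latest.entrySet()) {
            List<String> existing = merged.get(entry.getKey());
            if (existing == null) {
                // No clash: the key is new, so every policy simply adds it.
                merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
                continue;
            }
            switch (policy) {
                case FIRST_WINS:
                    break; // keep the value the earlier parser set
                case LAST_WINS:
                    merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
                    break;
                case KEEP_UNIQUE:
                    // Append only values not already present, preserving order.
                    for (String value : entry.getValue()) {
                        if (!existing.contains(value)) {
                            existing.add(value);
                        }
                    }
                    break;
                default:
                    throw new IllegalStateException("Unhandled policy: " + policy);
            }
        }
        return merged;
    }

    /** Deep-copies the map so that neither input is modified. */
    private static Map<String, List<String>> copy(Map<String, List<String>> source) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : source.entrySet()) {
            out.put(entry.getKey(), new ArrayList<>(entry.getValue()));
        }
        return out;
    }
}
```

In this sketch a clash only arises when both parsers set the same key; keys set by just one parser pass through unchanged under every policy except the assumed "ignore previous parsers" behaviour.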
> I've also done a proof of concept for the "pick best" case: it runs the
> text parser with a specified set of different charsets, captures the text
> from each, "picks the best" (currently hard coded to the 1st...), then
> runs for real with that charset.
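
As a rough illustration of the "pick best" step, the sketch below decodes the same bytes with each candidate charset and keeps the one producing the fewest U+FFFD replacement characters. This scoring heuristic is an assumption made for the example (the proof of concept described above simply takes the first candidate), and `CharsetPicker` is a hypothetical name, not a class from the branch.

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper; the scoring rule is an assumption for illustration.
final class CharsetPicker {
    private CharsetPicker() {}

    /** Counts U+FFFD characters, which String decoding emits for malformed input. */
    static int badness(String decoded) {
        int count = 0;
        for (int i = 0; i < decoded.length(); i++) {
            if (decoded.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    /** Returns the candidate whose decoding has the fewest replacements; ties go to the earliest. */
    static Charset pickBest(byte[] data, List<Charset> candidates) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate charsets");
        }
        Charset best = null;
        int bestScore = Integer.MAX_VALUE;
        for (Charset candidate : candidates) {
            int score = badness(new String(data, candidate));
            if (score < bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }
}
```

A single-byte charset such as ISO-8859-1 accepts every byte sequence and so always scores zero here, which is one reason a real implementation would likely need a better measure of "best" than this.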
> Key TODOs - Support InputStreamFactory, properly work out what mimetypes
> to claim to support, Tika Config XML friendly helper for the metadata clash
> policy, review ContentHandlerFactory signature and tweak if needed.
> Proposed breaking 2.x change - add a second parse method that takes a
> ContentHandlerFactory instead of a ContentHandler, with most parsers that
> get one just grabbing a single handler from it and using that as before
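
One way to picture the proposed change is a default method that adapts the factory-based call to the existing handler-based one. This is only a sketch under assumptions: `SketchParser` and `HandlerFactory` are hypothetical stand-ins, not Tika's `Parser` or `ContentHandlerFactory`, and the `Metadata` and `ParseContext` arguments of the real parse method are left out for brevity.

```java
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical stand-in for a factory that hands out fresh handlers.
interface HandlerFactory {
    ContentHandler newHandler();
}

// Hypothetical stand-in for a parser interface; not Tika's Parser.
interface SketchParser {
    /** Existing style: the caller supplies one handler. */
    void parse(InputStream stream, ContentHandler handler)
            throws IOException, SAXException;

    /**
     * Proposed style: the caller supplies a factory. Most parsers need only
     * one handler, so the default takes a single one and delegates as before.
     */
    default void parse(InputStream stream, HandlerFactory factory)
            throws IOException, SAXException {
        parse(stream, factory.newHandler());
    }
}
```

With this shape, only parsers that run several sub-parses (the "pick best" case, say) would override the factory-based method to request a fresh handler per attempt; every other parser would inherit the default.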
> Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
> I stop? Carry on? Modify it? Other?