Completely agree, awesome job Nick.

I will definitely try it this week as well.

Thank you!


On 3/18/18, 2:47 PM, "David Meikle" <[EMAIL PROTECTED]> wrote:

    Nice one Nick!  Will take a look this week.

    On 14 March 2018 at 17:38, Nick Burch <[EMAIL PROTECTED]> wrote:
    > Hi All
    > As promised, I've finally had a go at implementing my ideas for
    > TIKA-1509 / /
    > breaking 2.x parser change
    > My work so far is in this GitHub branch, and is ready for review!
    > It seems to work fine for the Fallback case, and for the Supplemental
    > case. You can set a policy that controls how clashing metadata is handled;
    > the options are currently "first one to set a key wins", "last one to set
    > a key wins", "ignore previous parsers", and "keep old and new unique
    > values".
    > I've also done a proof of concept for the "pick best" case: run the text
    > parser with a specified set of different charsets, capture the text from
    > each, "pick the best" (currently hard-coded to the 1st...), then run for
    > real with that one.
    > Key TODOs - Support InputStreamFactory, properly work out what mimetypes
    > to claim to support, Tika Config XML friendly helper for the metadata clash
    > policy, review ContentHandlerFactory signature and tweak if needed.
    > Proposed breaking 2.x change - add a second parse method that takes a
    > ContentHandlerFactory instead of a ContentHandler, with most parsers that
    > get a factory just grabbing a single ContentHandler from it and using that
    > as before.
    > Before I do any more though... Thoughts? Comments? Ideas? Changes? Should
    > I stop? Carry on? Modify it? Other?
    > Nick
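
To make the four metadata clash policies Nick lists a bit more concrete, here is a minimal sketch in plain Java. It is not the code from the branch and does not use Tika's Metadata class: metadata is modelled as a simple map of key to list of values, and the names ClashPolicy, FIRST_WINS, LAST_WINS, DISCARD_ALL, KEEP_ALL and merge are all hypothetical. Reading "ignore previous parsers" as "drop everything earlier parsers set" is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ClashPolicyDemo {
    // Hypothetical names for the four policies described in the thread.
    enum ClashPolicy { FIRST_WINS, LAST_WINS, DISCARD_ALL, KEEP_ALL }

    // Combine metadata from earlier parsers ("previous") with metadata
    // from the parser that has just run ("current").
    static Map<String, List<String>> merge(ClashPolicy policy,
                                           Map<String, List<String>> previous,
                                           Map<String, List<String>> current) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        if (policy == ClashPolicy.DISCARD_ALL) {
            // Assumed meaning of "ignore previous parsers": keep only the new values.
            current.forEach((k, v) -> out.put(k, new ArrayList<>(v)));
            return out;
        }
        previous.forEach((k, v) -> out.put(k, new ArrayList<>(v)));
        for (Map.Entry<String, List<String>> e : current.entrySet()) {
            List<String> existing = out.get(e.getKey());
            if (existing == null || policy == ClashPolicy.LAST_WINS) {
                // No clash, or the latest parser overrides the earlier values.
                out.put(e.getKey(), new ArrayList<>(e.getValue()));
            } else if (policy == ClashPolicy.KEEP_ALL) {
                // Keep old values and append only the new values not already present.
                for (String v : e.getValue()) {
                    if (!existing.contains(v)) {
                        existing.add(v);
                    }
                }
            }
            // FIRST_WINS on a clash: leave the earlier values untouched.
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> previous = new LinkedHashMap<>();
        previous.put("title", List.of("A"));
        Map<String, List<String>> current = new LinkedHashMap<>();
        current.put("title", List.of("B", "A"));
        for (ClashPolicy p : ClashPolicy.values()) {
            System.out.println(p + " -> " + merge(p, previous, current));
        }
    }
}
```

The policy is applied once per parser, after it has run, so a chain of Supplemental parsers would call merge repeatedly with the accumulated result as "previous".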