[TIKA-738] Tika fails to extract text from PDF annotations - Tika - [issue]
...Spinoff from TIKA-717....    Author: Michael McCandless , 2011-11-26, 19:54
[TIKA-742] PDF2XHTML fails to insert <p> nor space around page marker - Tika - [issue]
...I have a test document (unfortunately not committable) whose page numbers are rendered with no separator (<p> nor space) before the next word.  So I have words like: 1Massachusetts ...    Author: Michael McCandless , 2011-10-05, 10:43
[TIKA-751] Small improvements to how embedded docs are parsed in AbstractPOIFSExtractor.handleEmbeddedOfficeDoc - Tika - [issue]
...I noticed some minor things in this method: It does too much work (writes the tmpFile out) if the EmbeddedDocumentExtractor didn't want to actually parse the file....    Author: Michael McCandless , 2011-10-12, 19:19
[TIKA-753] Improve performance when parsing embedded Office docs - Tika - [issue]    Author: Michael McCandless , 2011-10-20, 12:37
[TIKA-757] Address TODOs when we upgrade to next POI release (3.8 beta 5) - Tika - [issue]
...I'm opening a blanket issue to remind us all to address the TODOs in the sources for when we upgrade to the next POI. I think this (a single blanket issue) is better than keeping separate iss...    Author: Michael McCandless , 2012-07-01, 21:23
[TIKA-758] Address TODOs when we upgrade to next PDFBox release - Tika - [issue]
...Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in the code when we next upgrade PDFBox....    Author: Michael McCandless , 2015-03-02, 20:51
[TIKA-767] Enable controlling of PDFBOX's setSuppressDuplicateOverlappingText from PDFParser - Tika - [issue]
...Given that there are some problems with how overlapping text is removed (slow performance: PDFBOX-956; some chars incorrectly skipped: PDFBOX-1155), I think we should make this controllable fr...    Author: Michael McCandless , 2011-11-04, 16:28
[TIKA-1010] Embedded documents in RTF are not extracted - Tika - [issue]
...When an RTF doc embeds a doc it looks like this: {\object\objemb\objw628\objh765{\*\objclass Package}{\*\objdata 0105000002000000080000005061636b61676500000000000000000066000000020048772e7478...    Author: Michael McCandless , 2016-02-01, 16:32
[TIKA-1011] Exception (Null charset name) processing .mhtml file - Tika - [issue]
...This small test.mhtml file: From: <Saved by Windows Internet Explorer 8> Subject: Index Pages Date: Tue, 28 Aug 2012 09:53:28 +0300 MIME-Version: 1.0 Content-Type: multipart/related; type="...    Author: Michael McCandless , 2012-10-26, 21:07
[TIKA-1015] Word (.doc) embedded files don't set relationship ID in the Metadata - Tika - [issue]    Author: Michael McCandless , 2012-10-31, 15:07