Sorry for the delay. As a first pass at an answer: we have roughly four mechanisms for file identification:
1. mime patterns (magic mime)
2. package detection
3. parse-time sub-type detection
4. file name extension (completely useless for your purposes)
1. You should be able to use the mime patterns in a buffered single read. Buffer the first 1024 bytes or so and run our mime detection.
2. We currently open the zip/package file and look for particular entries inside it (e.g. for docx, xlsx, etc.). That requires the whole file and cannot be done in a streaming fashion with our current methods. I don’t see a way around parsing the package/container file.
3. IIRC, some of our parsers update the mime type based on knowledge of that particular format’s subtypes, i.e. by actually parsing the file (doc, ppt and …?), so these would be a non-starter for streaming detection.
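For what it's worth, the buffered read in 1. could look something like the sketch below. It uses only the JDK: it marks the stream, copies up to 1024 bytes, and resets, so the rest of your pipeline still sees the stream from the start. `PrefixPeeker` and `peek` are names I made up for illustration; they are not Tika API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class PrefixPeeker {
    /** Number of leading bytes to inspect; 1024 follows the rough estimate above. */
    static final int PREFIX_LENGTH = 1024;

    /**
     * Copies up to {@code limit} bytes from the start of the stream, then rewinds it
     * so the caller can still consume the stream from the beginning.
     * The stream must support mark/reset (e.g. a BufferedInputStream).
     */
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buffer = new byte[limit];
        int total = 0;
        try {
            // A single read() may return fewer bytes than requested, so loop.
            while (total < limit) {
                int n = in.read(buffer, total, limit - total);
                if (n < 0) {
                    break; // stream is shorter than the requested prefix
                }
                total += n;
            }
        } finally {
            in.reset(); // rewind to the marked position
        }
        return Arrays.copyOf(buffer, total);
    }
}
```

The returned array could then be handed to a magic-based detector. As far as I know the `Tika` facade's `detect(byte[])` accepts such a prefix, but please check that against the version you run.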
Regrettably, AFAIK, at least from a Tika perspective, there is no silver bullet.
Rather than spooling every file to memory (or disk) and then running detection (or having Tika do that), I wonder if you could run 1) (mime magic detection) on the stream and, if that returns something obvious, go with that; otherwise spool to disk and run regular Tika on just that subset of files.
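A rough sketch of that two-stage idea follows, again JDK-only. The two detector interfaces are hypothetical stand-ins for whatever you plug in (for instance Tika's magic detection on the prefix, and full Tika detection on the spooled file). The set of types treated as "not obvious" is my assumption and would need tuning for your data.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class TwoStageDetector {
    /** Hypothetical hook for magic-byte detection on a prefix. */
    interface PrefixDetector { String detect(byte[] prefix); }

    /** Hypothetical hook for full detection on a spooled file. */
    interface FileDetector { String detect(Path file) throws IOException; }

    static final int PREFIX_LENGTH = 1024;

    /** Assumption: results that only say "generic binary" or "some container". */
    static final Set<String> NEEDS_FULL_FILE = new HashSet<>(Arrays.asList(
            "application/octet-stream", "application/zip", "application/x-tika-msoffice"));

    /**
     * Fast path: returns the magic-based type and leaves the stream rewound.
     * Slow path: consumes the whole stream into a temp file and runs full detection.
     */
    static String detect(BufferedInputStream in, PrefixDetector magic, FileDetector full)
            throws IOException {
        String type = magic.detect(readPrefix(in, PREFIX_LENGTH));
        if (!NEEDS_FULL_FILE.contains(type)) {
            return type;
        }
        Path spool = Files.createTempFile("detect-", ".bin");
        try {
            // The stream was rewound by readPrefix, so the copy starts at byte 0.
            Files.copy(in, spool, StandardCopyOption.REPLACE_EXISTING);
            return full.detect(spool);
        } finally {
            Files.deleteIfExists(spool);
        }
    }

    /** Reads up to {@code limit} bytes and resets the stream to where it started. */
    private static byte[] readPrefix(InputStream in, int limit) throws IOException {
        in.mark(limit);
        byte[] buffer = new byte[limit];
        int total = 0;
        try {
            while (total < limit) {
                int n = in.read(buffer, total, limit - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        } finally {
            in.reset();
        }
        return Arrays.copyOf(buffer, total);
    }
}
```

In a real application you would probably keep the spooled file as the stored artifact instead of deleting it, so the slow path costs one extra disk write only for the ambiguous files.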
Nick Burch will probably have better insight on this than my ramblings above.
From: Martin Todorov [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 4, 2018 8:48 PM
To: [EMAIL PROTECTED]
Subject: How to implement an InputStream that dynamically guesses the extension of a file that is streamed using Apache Tika?
I have asked this on Stack Overflow<https://stackoverflow.com/questions/48102004/how-to-implement-an-inputstream-that-dynamically-guesses-the-extension-of-a-file> and was pointed here, with the hope that more people would be able to help.
We have a custom implementation of an InputStream that can currently update multiple MessageDigest-s while reading the data. This allows for a single reading and processing of the data and avoids having to re-read files in order to calculate their checksums. This is quite efficient and saves time (and is implemented in here<https://github.com/strongbox/strongbox/blob/9dcb13255512cd396e63f712bb5ce82bb632726c/strongbox-storage/strongbox-storage-core/src/main/java/org/carlspring/strongbox/io/ArtifactInputStream.java>).
As a follow-up step, we'd like to use Apache Tika to guess the file extension from the stream, which is sent over HTTP. I know some of you will suggest simply setting the Content-Type header and requiring that it's set, but, unfortunately, for various reasons, we cannot rely on this, or enforce it. Hence, I'm looking for a way to guess the extension based on the InputStream, while it's being sent.
We also need to be able to guess complex extension types (such as tar.gz, tar.bz2 and other similar ones that aren't easy to guess by just doing a substring from the last index of the dot until the end of the string).
What is the most efficient way to do this? We cannot afford to read whole files into memory, as the application will have to be able to handle a large number of concurrent requests. Could somebody please provide an example of how this could be done?
We have an open issue<https://github.com/strongbox/strongbox/issues/370> and a pull request here<https://github.com/strongbox/strongbox/pull/468/files#diff-8024b836036b6f5fb567a3ce48c2a4d6R221>, if anyone would like to have a closer look and help out.
Looking forward to your suggestions and replies!