Make PDF files searchable
The subject of this ticket is making PDF files searchable by their text content.
- Indexing:
  - Both kinds of PDF files need to be indexed: local ones (single-user installation) and uploaded ones (client/server installation).
  - Indexing takes place automatically whenever a File topic is created.
  - This ticket addresses only PDFs that contain extractable text. PDFs containing only (scanned) images are not part of this ticket.
- Frontend:
  - The existing Webclient search dialog transparently also searches PDF content. Results are File topics. No UI change is needed.
  - Visualization of the actual match positions within a PDF file is not part of this ticket.
- Implementation:
  - Extend Core's DMXStorage interface by a dedicated indexing method.
  - Provide the indexing functionality as a separate module, dmx-pdf-search.
  - For text extraction, Apache PDFBox could be used. Apache Tika, the indexing facade for various file types, is possibly not needed at the moment (for PDF it relies on PDFBox anyway).
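
A minimal sketch of how the indexing flow described above might fit together. It is not the actual DMX API: the names TextExtractor, PdfSearchSketch, onFileTopicCreated, and search are hypothetical, and the in-memory map only stands in for whatever index DMXStorage would provide.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the text extraction step (e.g. backed by PDFBox). */
interface TextExtractor {
    /** Returns the extractable text of the file at the given path, or an empty string. */
    String extract(String path);
}

/** Hypothetical in-memory index; a real index would live behind DMXStorage. */
class PdfSearchSketch {

    // word -> IDs of the File topics whose PDF content contains that word
    private final Map<String, Set<Long>> index = new HashMap<>();
    private final TextExtractor extractor;

    PdfSearchSketch(TextExtractor extractor) {
        this.extractor = extractor;
    }

    /** Assumed hook: would be invoked whenever a File topic is created. */
    void onFileTopicCreated(long topicId, String path) {
        if (!path.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            return; // only PDF files are in scope
        }
        String text = extractor.extract(path);
        if (text == null || text.trim().isEmpty()) {
            return; // no extractable text (e.g. scanned images): out of scope
        }
        // Split on anything that is neither a letter nor a digit.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(topicId);
            }
        }
    }

    /** Returns the IDs of the File topics matching a single search word. */
    Set<Long> search(String word) {
        return index.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }
}
```

With PDFBox, the extract step could presumably wrap PDFTextStripper's getText(PDDocument) and handle its IOException. Keeping extraction behind a small interface would also leave room to swap in Tika later if other file types need indexing.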
Needed for dmx-projects/lqdn#2
@jpn FYI