new pdf search module

myntara

Joined: 2006-03-06
Posts: 1
Posted: Mon, 2006-03-06 23:39

Hi all,

I'm developing a PDF full-text search and display module for 2.1. I'm just starting out and was hoping that you could point me in the right direction... The module will index the full text of PDF files and make them available for search. Furthermore, it should also create a thumbnail of every page in the PDF file and show all these separate page thumbnails in the detail view for that PDF.

Here's what I've got so far:

- class PdfModule extends GalleryModule [...] $ret = GalleryCoreApi::registerFactoryImplementation('GalleryItem', 'GalleryPdfItem', 'GalleryPdfItem', 'modules/pdf/classes/GalleryPdfItem.class', 'pdf', array('application/pdf'), 0);
- class GalleryPdfItem extends GalleryDataItem {...}

so far so good. For the thumbnails, I basically want a table that stores the following:

object_id(of the pdf), page_number(page number of the PDF this row is about), full text (of that page), thumbnail (of that page).

I'd like to create those rows when the PDF is added and be able to search them. I'd also like to be able to display all the thumbnails associated with a PDF on the PDF's detail page. I was looking at the thumbnail module, but I'm a little confused as to why that extends GalleryEntity and not GalleryDerivative...

Any pointers in the right directions for these issues would be appreciated.

Thanks!

Felix

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Sat, 2006-03-11 08:14

the thumbnail module isn't used in g2 to generate thumbnails. this module does some additional, optional thumbnail stuff like the option to specify a custom thumbnail for each item.

also note that g2 already can generate thumbnails for pdf files with the imagemagick module. but only for one page per document.
with the thumbpage module, you can specify which page.

also, what's your plan for the GalleryPdfItem?
i guess you want to change the "render" method of this item to display multiple thumbnails.
also, maybe have some summary info as a class member, e.g. "numberOfPages"

what you should do:
- find a binary or php extension to extract the text from pdf files
- decide what kind of things individual pages should be in G2 nomenclature. normal GalleryDerivativeItems? or photos? or add another entity type, e.g. GalleryPdfPage? or no entity at all, some bad hack?
- create a db table (a Map, see the exif module for an example) to have a documentId (entityId of the pdf document), pageId, text
- register a toolkit implementation to handle pdfs and this toolkit then has to extract the text from it and create the pages
- tweak GalleryPdfItem function render to show multiple (but hopefully not all) pages at once.
- implement the search interface, see the comment module as an example
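
A rough sketch of the text extraction and page table steps above, in plain PHP. It is only an illustration under some assumptions: it uses none of the Gallery 2 APIs, the function names (buildPageTextCommand, addPageRow, searchPages) are made up for this example, the "table" is just an in-memory array, and it assumes the pdftotext binary (Xpdf/Poppler) is what ends up extracting the text. A real module would store the rows in a Map table and hook into the G2 search interface instead.

```php
<?php
// Hypothetical helper names; none of these are Gallery 2 APIs.

/**
 * Build a shell command that extracts the text of a single page with
 * pdftotext (assumed to be installed). "-f" and "-l" select the first and
 * last page, and "-" as the output file sends the text to stdout.
 */
function buildPageTextCommand($pdfPath, $pageNumber) {
    $page = (int)$pageNumber;
    return sprintf('pdftotext -f %d -l %d %s -', $page, $page, escapeshellarg($pdfPath));
}

/**
 * Add one row to the in-memory page map. In a real module this would be a
 * database table with documentId, pageNumber and text columns.
 */
function addPageRow(array $pageMap, $documentId, $pageNumber, $text) {
    $pageMap[] = array(
        'documentId' => (int)$documentId,
        'pageNumber' => (int)$pageNumber,
        'text' => $text,
    );
    return $pageMap;
}

/**
 * Case-insensitive substring search over the page map. Returns the document
 * id and page number of every matching row so the caller can link to them.
 */
function searchPages(array $pageMap, $needle) {
    $hits = array();
    foreach ($pageMap as $row) {
        if ($needle !== '' && stripos($row['text'], $needle) !== false) {
            $hits[] = array(
                'documentId' => $row['documentId'],
                'pageNumber' => $row['pageNumber'],
            );
        }
    }
    return $hits;
}

// Usage with made-up data (no PDF is read here).
$map = array();
$map = addPageRow($map, 42, 1, 'Town council approves new bridge');
$map = addPageRow($map, 42, 2, 'Sports: local team wins the cup');
$hits = searchPages($map, 'BRIDGE');
```

Keeping one row per page (rather than one per document) means a search hit can point straight at the page, which is also what you would need to show the matching page thumbnail.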

 
ruigato

Joined: 2006-07-26
Posts: 27
Posted: Wed, 2007-01-31 14:47

any news on this? i would like to make an online archive of my local newspaper. maybe gallery can handle it