I’ve scanned all my documents (incomming invoices) in TIFF but also have a lot of documents in PDF and DOC(X) format. Now when something gets broke and I need to return it to the shop I’m often looking for that invoice, sometimes I don’t even know at which shop I bought it. Imagine I could just type: “HP 8000” or “light bulb” and it shows all documents with those words on it, I find the document within a second and print it out.
I was trying to get this to work the past years but never succeeded so I gave it one last try and now it works!

Minimum OS should be Windows Server 2008 R2 or higher (I used 2012R2).
Minimum SQL Server version should be SQL 2008R (I used SQL2014).

1. Install IFilters

  • For PDF: Download and install the “Adobe PDF IFilter 11.0.01”
  • For DOC/XLS/etc.: Download and install “Office 2013 Filter Pack SP2”
  • For TIF/TIFF: Enable the “Windows TIFF IFILTER” in Windows, follow this guide, don’t forget to make the Policy setting to force indexing for every page in a TIFF document

2. Extra FIX
I had to add the path: C:\Program Files\Adobe\Adobe PDF iFilter 11 for 64-bit platforms\bin\ to my Registry!:
HKEY_LOCAL_MACHINE -> SYSTEM -> CurrentControlSet -> Control -> Session Manager -> Environment -> PATH
(also see this forum topic)

3. Enable filters and configute SQL Server
Execute the following commands on SQL Server:

EXEC sp_fulltext_service @action='load_os_resources', @value=1; 
EXEC sp_fulltext_service 'verify_signature', 0; 
EXEC sp_fulltext_service 'update_languages'; 
Exec sp_fulltext_service 'ft_timeout', 600000;
Exec sp_fulltext_service 'ism_size',@value=16;
EXEC sp_fulltext_service 'restart_all_fdhosts'; 
EXEC sp_help_fulltext_system_components 'filter'; 
reconfigure with override

(This sets the ism_size to the MAX and the ft_timeout to 10 minutes, that is maximum the time the fdhosts.exe process will spend on a single document)

4. Create Filestream/Full text Search Catalog and upload some documents
Follow this guide to install a Full Text Search and the Filestream.
Then upload a document using a query like:

INSERT INTO tbl_supplier_documents (document_filename, document_extension, document_file )
	SELECT 'test' AS document_filename, '.pdf' AS document_extension, BulkColumn FROM OPENROWSET (BULK 'c:\test.pdf', SINGLE_BLOB) AS document_file

Asume this table for the example above:

tbl_documents
document_id		INT, auto_increment
document_filename	VARCHAR(255)
document_extension	VARCHAR(10)
document_file		VARBINARY(MAX)

5. See the index getting filled
Check the Indexing LOGS. In my case there were in: E:\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Log\SQLFT0001400007.LOG
Also click properties of the Full Text Catalog in the SQL Management Studio under -> Storage -> Full Text Catalog ->

6. Test/Search
SELECT * FROM tbl_documents WHERE CONTAINS(document_file, ‘XXXX’)

7. The .doc issue
Some people are having an issue with .doc documents no getting indexed, the soluton was to open the document and save it again. Microsoft knows this is a problem but doesn’t have a hotfix for it. The technical problem is that there is a header inside the file which is not readable (not really corrupt), how it got it there… maybe an old Office version..?
A better solution might be to convert the .doc. documents to .docx, there are many tools to do that for whole folders or in batches.

References