
Monday, 5 May 2014



PDF Search Through VBA


Introduction



The motivation for this post came from an email question that I received from a blog reader last weekend. Jason wrote: “I am trying to create a link from Excel to search in a PDF file”. So, in this post I will try to answer this question. In general, there are two possible solutions to this problem (OK, maybe there are others that I am not aware of), each with its own advantages and disadvantages.

The FindText method

Syntax: object.FindText(text to find, case sensitive, whole words only, beginning)

Description: The FindText method returns true if the text was found or false if it was not. When the text is found, the method scrolls to the first instance so that it is visible and highlights it.

Here are the 4 arguments of this method:
Text to find: The text that is to be found.
Case sensitive: If true, the search is case-sensitive. If false, it is case-insensitive.
Whole words only: If true, the search matches only whole words. If false, it matches partial words.
Beginning: If true, the search begins on the first page of the document. If false, it begins on the current page.

Pros: Useful when searching for a text phrase (more than one word) in the PDF document.
Cons: In some cases it doesn't work (it doesn't highlight the text). Although it is an easy and fast method, it is unfortunately not 100% reliable.

The “JSO approach”

Unlike FindText, the JSO approach doesn’t use a “native method”; in reality, it is two nested loops. The idea is to loop through all the pages of the PDF document and, on each page, through all of its words, comparing each word with the text we are searching for. If they match, the word is highlighted and the search stops; otherwise, the loop proceeds to the next word. The name of this solution comes from the JavaScript Object (JSO) that performs all the hard work.

Pros: Useful when searching for a SINGLE WORD (not a phrase) in the PDF document. It’s quite a reliable method.
Cons: If you search for two or more words, it doesn't find anything. In large PDFs it might be considerably slow.
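
The single-word limitation comes from comparing one word at a time. As a rough sketch only (it is not part of the macros in this post), the same comparison can be extended to a phrase by checking consecutive words. The helper name FindPhraseInWords and the idea of first collecting the words of a page into an array are my own assumptions for illustration. In a real macro the array would be filled from JSO.getPageNthWord, which, as far as I know, strips punctuation by default, so phrases containing punctuation may still not match.

```vba
Option Explicit

'Hypothetical helper (not part of the original macros): returns the zero-based
'position of the first word where Phrase starts inside PageWords, or -1 if the
'phrase does not occur. PageWords is assumed to be a one-dimensional array of
'strings, e.g. the words of a single PDF page.
Function FindPhraseInWords(PageWords As Variant, ByVal Phrase As String) As Long

    Dim PhraseWords() As String
    Dim i As Long
    Dim k As Long
    Dim IsMatch As Boolean

    FindPhraseInWords = -1

    'An empty phrase can never match.
    Phrase = Trim$(Phrase)
    If Len(Phrase) = 0 Then Exit Function

    'Split the phrase on single spaces (repeated spaces are not handled here).
    PhraseWords = Split(Phrase, " ")

    'Slide a window of as many words as the phrase has over the page words.
    For i = LBound(PageWords) To UBound(PageWords) - UBound(PhraseWords)
        IsMatch = True
        For k = 0 To UBound(PhraseWords)
            'Case-insensitive comparison, like the single-word macro.
            If StrComp(PageWords(i + k), PhraseWords(k), vbTextCompare) <> 0 Then
                IsMatch = False
                Exit For
            End If
        Next k
        If IsMatch Then
            FindPhraseInWords = i - LBound(PageWords)
            Exit Function
        End If
    Next i

End Function
```

For example, FindPhraseInWords(Array("How", "Software", "Companies", "Die"), "software companies") returns 1. Note that selectPageNthWord selects a single word, so highlighting the whole phrase would need additional work that this sketch does not cover.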

Unfortunately, there is no straightforward solution to this problem, only a compromise of sorts. You will either go with the unreliable FindText method or with the slow JSO approach (only if you are searching for a single word). Needless to say, the VBA code that you will find below works ONLY with Adobe Acrobat Professional. If you try to use it with Adobe Reader you will get an error.



VBA code



The FindTextInPDF macro uses the FindText method to find a text phrase inside a PDF document.

Option Explicit

Sub FindTextInPDF()
      
    '----------------------------------------------------------------------------------------
    'This macro can be used to find a specific TEXT (more than one word) in a PDF document.
    'The macro opens the PDF, finds the specified text (the first instance), scrolls so
    'that it is visible and highlights it.
    'The macro uses the FindText method (see the code below for more info).
    
    'Note that in some cases it doesn't work (doesn't highlight the text), so in those
    'cases prefer the SearchWordInPDF macro, if you have only ONE WORD to find!

    'The code uses late binding, so no reference to external library is required.
    'However, the code works ONLY with Adobe Acrobat Professional, so don't try to use it
    'with Adobe Reader because you will get an "ActiveX component can't create object" error.
    
    'Written by:    Christos Samaras
    'Date:          04/05/2014
    'e-mail:        xristos.samaras@gmail.com
    'site:          http://www.myengineeringworld.net
    '----------------------------------------------------------------------------------------

    'Declaring the necessary variables.
    Dim TextToFind  As String
    Dim PDFPath     As String
    Dim App         As Object
    Dim AVDoc       As Object
               
    'Specify the text you want to search.
    'TextToFind = "Christos Samaras"
    'Using a range:
    TextToFind = ThisWorkbook.Sheets("PDF Search").Range("C5").Value
           
    'Specify the path of the sample PDF form.
    'Full path example:
    'PDFPath = "C:\Users\Christos\Desktop\How Software Companies Die.pdf"
    'Using workbook path:
    'PDFPath = ThisWorkbook.Path & "\" & "How Software Companies Die.pdf"
    'Using a range:
    PDFPath = ThisWorkbook.Sheets("PDF Search").Range("C7").Value
   
    'Check if the file exists.
    If Dir(PDFPath) = "" Then
        MsgBox "Cannot find the PDF file!" & vbCrLf & "Check the PDF path and retry.", _
                vbCritical, "File Path Error"
        Exit Sub
    End If
   
    'Check if the input file is a PDF file (the extension must be ".pdf", dot included).
    If LCase(Right(PDFPath, 4)) <> ".pdf" Then
        MsgBox "The input file is not a PDF file!", vbCritical, "File Type Error"
        Exit Sub
    End If
    
    On Error Resume Next
    
    'Initialize Acrobat by creating the App object.
    Set App = CreateObject("AcroExch.App")
    
    'Check if the object was created. In case of error release the object and exit.
    If Err.Number <> 0 Then
        MsgBox "Could not create the Adobe Application object!", vbCritical, "Object Error"
        Set App = Nothing
        Exit Sub
    End If
    
    'Create the AVDoc object.
    Set AVDoc = CreateObject("AcroExch.AVDoc")
    
    'Check if the object was created. In case of error release the objects and exit.
    If Err.Number <> 0 Then
        MsgBox "Could not create the AVDoc object!", vbCritical, "Object Error"
        Set AVDoc = Nothing
        Set App = Nothing
        Exit Sub
    End If
    
    On Error GoTo 0
    
    'Open the PDF file.
    If AVDoc.Open(PDFPath, "") = True Then
        
        'Open successful, bring the PDF document to the front.
        AVDoc.BringToFront
        
        'Use the FindText method in order to find and highlight the desired text.
        'The FindText method returns true if the text was found or false if it was not.
        'Here are the 4 arguments of the FindText method:
        'Text to find:          The text that is to be found (in this example the TextToFind variable).
        'Case sensitive:        If true, the search is case-sensitive. If false, it is case-insensitive (in this example is True).
        'Whole words only:      If true, the search matches only whole words. If false, it matches partial words (in this example is True).
        'Search from 1st page:  If true, the search begins on the first page of the document. If false, it begins on the current page (in this example is False).
        If AVDoc.FindText(TextToFind, True, True, False) = False Then

            'Text was not found, close the PDF file without saving the changes.
            AVDoc.Close True
            
            'Close the Acrobat application.
            App.Exit
               
            'Release the objects.
            Set AVDoc = Nothing
            Set App = Nothing
            
            'Inform the user.
            MsgBox "The text '" & TextToFind & "' could not be found in the PDF file!", vbInformation, "Search Error"
            
        End If
        
    Else
        
        'Unable to open the PDF file, close the Acrobat application.
        App.Exit

        'Release the objects.
        Set AVDoc = Nothing
        Set App = Nothing
        
        'Inform the user.
        MsgBox "Could not open the PDF file!", vbCritical, "File error"
        
    End If
    
End Sub

And here is the code for the second macro, SearchWordInPDF, which uses the JSO approach.

Option Explicit

Sub SearchWordInPDF()
      
    '----------------------------------------------------------------------------------------
    'This macro can be used to find a specific WORD in a PDF document (one word ONLY -> in
    'case you search two words for example it doesn't find anything, just opens the file).
    'The macro opens the PDF, finds the first appearance of the specified word, scrolls
    'so that it is visible and highlights it.

    'The code uses late binding, so no reference to external library is required.
    'However, the code works ONLY with Adobe Acrobat Professional, so don't try to use it
    'with Adobe Reader because you will get an "ActiveX component can't create object" error.
    
    'Written by:    Christos Samaras
    'Date:          04/05/2014
    'e-mail:        xristos.samaras@gmail.com
    'site:          http://www.myengineeringworld.net
    '--------------------------------------------------------------------------------------

    'Declaring the necessary variables.
    Dim WordToFind  As String
    Dim PDFPath     As String
    Dim App         As Object
    Dim AVDoc       As Object
    Dim PDDoc       As Object
    Dim JSO         As Object
    Dim i           As Long
    Dim j           As Long
    Dim Word        As Variant
    Dim Result      As Integer

    'Specify the text you want to search.
    'WordToFind = "Engineering"
    'Using a range:
    WordToFind = ThisWorkbook.Sheets("PDF Search").Range("C12").Value
    
    'Specify the path of the sample PDF form.
    'Full path example:
    'PDFPath = "C:\Users\Christos\Desktop\How Software Companies Die.pdf"
    'Using workbook path:
    'PDFPath = ThisWorkbook.Path & "\" & "How Software Companies Die.pdf"
    'Using a range:
    PDFPath = ThisWorkbook.Sheets("PDF Search").Range("C14").Value
    
    'Check if the file exists.
    If Dir(PDFPath) = "" Then
        MsgBox "Cannot find the PDF file!" & vbCrLf & "Check the PDF path and retry.", _
                vbCritical, "File Path Error"
        Exit Sub
    End If
   
    'Check if the input file is a PDF file (the extension must be ".pdf", dot included).
    If LCase(Right(PDFPath, 4)) <> ".pdf" Then
        MsgBox "The input file is not a PDF file!", vbCritical, "File Type Error"
        Exit Sub
    End If
    
    On Error Resume Next
    
    'Initialize Acrobat by creating the App object.
    Set App = CreateObject("AcroExch.App")
    
    'Check if the object was created. In case of error release the objects and exit.
    If Err.Number <> 0 Then
        MsgBox "Could not create the Adobe Application object!", vbCritical, "Object Error"
        Set App = Nothing
        Exit Sub
    End If
    
    'Create the AVDoc object.
    Set AVDoc = CreateObject("AcroExch.AVDoc")
    
    'Check if the object was created. In case of error release the objects and exit.
    If Err.Number <> 0 Then
        MsgBox "Could not create the AVDoc object!", vbCritical, "Object Error"
        Set AVDoc = Nothing
        Set App = Nothing
        Exit Sub
    End If
    
    On Error GoTo 0
    
    'Open the PDF file.
    If AVDoc.Open(PDFPath, "") = True Then
        
        'Open successful, bring the PDF document to the front.
        AVDoc.BringToFront
        
        'Set the PDDoc object.
        Set PDDoc = AVDoc.GetPDDoc
        
        'Set the JS Object - JavaScript Object.
        Set JSO = PDDoc.GetJSObject
        
        'Search for the word.
        If Not JSO Is Nothing Then
        
            'Loop through all the pages of the PDF.
            For i = 0 To JSO.numPages - 1
            
                'Loop through all the words of each page.
                For j = 0 To JSO.getPageNumWords(i) - 1
                    
                    'Get a single word.
                    Word = JSO.getPageNthWord(i, j)
                    
                    'If the word is a string...
                    If VarType(Word) = vbString Then
                        
                        'Compare the word with the text to be found.
                        Result = StrComp(Word, WordToFind, vbTextCompare)
                        
                        'If both strings are the same.
                        If Result = 0 Then
                            'Select the word and exit.
                            Call JSO.selectPageNthWord(i, j)
                            Exit Sub
                        End If
                        
                    End If
                    
                Next j
                
            Next i
            
            'Word was not found, close the PDF file without saving the changes.
            AVDoc.Close True
            
            'Close the Acrobat application.
            App.Exit
               
            'Release the objects.
            Set JSO = Nothing
            Set PDDoc = Nothing
            Set AVDoc = Nothing
            Set App = Nothing
            
            'Inform the user.
            MsgBox "The word '" & WordToFind & "' could not be found in the PDF file!", vbInformation, "Search Error"
            
        End If
        
    Else
                
        'Unable to open the PDF file, close the Acrobat application.
        App.Exit

        'Release the objects.
        Set AVDoc = Nothing
        Set App = Nothing
        
        'Inform the user.
        MsgBox "Could not open the PDF file!", vbCritical, "File error"
        
    End If
    
End Sub

Both macros were tested using a PDF file that was created based on this article.



Downloads



The zip file contains an Excel file and a sample PDF file. The Excel file can be opened with Excel 2007 or newer. Please enable macros before using it.






Mechanical Engineer (Ph.D. cand.), M.Sc. Cranfield University, Dipl.-Ing. Aristotle University, Thessaloniki - Greece.