Read text from a PDF with PowerShell

The other day I helped a coworker with a script he was working on. He needed to read text from a PDF with PowerShell. I had done this in the past with AutoIt, but that wasn’t going to be an option this time. There are a lot of posts about this online, but almost all of them lead to iText 7. I don’t know if my coworker and I are just dumb, but we could not get their module installed. I did end up finding a different way to get this done: all you really need is a single DLL, itextsharp.dll, which contains the library for working with PDF files. I can’t upload it here, but it is easy to get.

Get DLL

You can download the DLL from this site. When you get to the site, click the “Download Archive” button. This will give you a zip file. Extract it, then inside the extracted folder open sourceCode, then Main, then Libraries. There you will find itextsharp.dll. Copy this file to C:\PS\ (this is where our script will look for it).
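Before moving on, it can help to confirm that the DLL is where the script expects it and that PowerShell can load it. The sketch below is one way to do that, not part of the original script. Import-PdfLibrary is a made-up helper name, and the Unblock-File step assumes the problem (if any) is that Windows flagged the file as downloaded from the internet, which can stop .NET from loading it.

```powershell
# Hypothetical helper: checks for the DLL, unblocks it, and tries to load it.
# Returns $true when the library loaded, $false otherwise.
function Import-PdfLibrary {
    param(
        [string]$Path = 'C:\PS\itextsharp.dll'
    )
    if (-not (Test-Path -LiteralPath $Path -PathType Leaf)) {
        Write-Warning "Library not found at $Path"
        return $false
    }
    # Files extracted from a downloaded zip can be marked as blocked by Windows.
    # Unblock-File clears that mark (assumption: this is the cause of load errors).
    Unblock-File -LiteralPath $Path
    try {
        Add-Type -LiteralPath $Path -ErrorAction Stop
        return $true
    }
    catch {
        Write-Warning "Could not load ${Path}: $($_.Exception.Message)"
        return $false
    }
}
```

Running Import-PdfLibrary with no arguments checks the default C:\PS\itextsharp.dll location. If it returns $false, the warning message should give a hint about whether the file is missing or simply refuses to load.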

Read text from PDF file

I made this into a function so it is easy to use in a larger script. Here it is:

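function convert-PDFtoText {
	param(
		[Parameter(Mandatory=$true)][string]$file
	)
	# Load the iTextSharp library from the folder we copied it to
	Add-Type -Path "C:\ps\itextsharp.dll"
	$pdf = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $file
	try {
		# Pages are numbered from 1; output the text of each page as one string
		for ($page = 1; $page -le $pdf.NumberOfPages; $page++) {
			$text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf, $page)
			Write-Output $text
		}
	}
	finally {
		# Close the reader even if extraction fails part way through
		$pdf.Close()
	}
}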

This is an example of how to run it and display the results on the screen.

$file = "C:\Path\To\PDF.pdf"

convert-PDFtoText $file

In this example we store the text in a variable for later use.

$file = "C:\Path\To\PDF.pdf"
$text = convert-PDFtoText $file
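Since the function writes one string per page, the variable ends up holding one entry per page when the PDF has more than one page. Here is a rough sketch of what you might do with it afterwards. The sample strings stand in for real PDF output, and $keyword, $matchingPages and $allText are names I made up for the illustration.

```powershell
# Stand-in for the output of convert-PDFtoText: one string per page (sample data).
$text = @(
    "Quarterly report`nPrepared by accounting",
    "Invoice number: 12345`nTotal due: 99.50"
)

# Find which pages mention a keyword (page numbers start at 1).
$keyword = 'Invoice'
$matchingPages = for ($i = 0; $i -lt $text.Count; $i++) {
    if ($text[$i] -match [regex]::Escape($keyword)) { $i + 1 }
}

# Join all pages into a single string, for example to write it to a .txt file.
$allText = $text -join [Environment]::NewLine
```

One thing to watch for: with a single-page PDF the function returns a plain string rather than an array, so indexing into it gives you characters instead of pages. Wrapping the call as @(convert-PDFtoText $file) keeps it an array either way.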

9 thoughts on “Read text from a PDF with PowerShell”

  1. Really nice, simple solution for PDF text ingest!

    There is a small typo in your example code:
    $tex=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)

    Should read:
    $text=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)

    Thanks again 🙂

  2. Dear Chris,
    I have been looking for something like that but for some reason I am stuck 🙁
    If I understood correctly, I have to create the PS folder and copy itextsharp.dll to it. Doing some research, I unblocked the file through its properties, but I am stuck with the command.

    When I try to run:

    $text = convert-PDFtoText $file

    I end up with an error:

    Add-Type : Could not load file or assembly ‘file:///C:\ps\itextsharp.dll’ or one of its dependencies. Operation is not
    supported. (Exception from HRESULT: 0x80131515)
    At line:5 char:2
    + Add-Type -Path “C:\ps\itextsharp.dll”
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : NotSpecified: (:) [Add-Type], FileLoadException
    + FullyQualifiedErrorId : System.IO.FileLoadException,Microsoft.PowerShell.Commands.AddTypeCommand

    New-Object : Cannot find type [iTextSharp.text.pdf.pdfreader]: verify that the assembly containing this type is loaded.
    At line:6 char:9
    + $pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $fi …
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo : InvalidType: (:) [New-Object], PSArgumentException
    + FullyQualifiedErrorId : TypeNotFound,Microsoft.PowerShell.Commands.NewObjectCommand

    You cannot call a method on a null-valued expression.
    At line:11 char:2
    + $pdf.Close()
    + ~~~~~~~~~~~~
    + CategoryInfo : InvalidOperation: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvokeMethodOnNull

    Do I need to copy other files into the PS folder or what am I doing wrong?
    Thanx for the help,
    Mike

    1. Hmm, I just followed this guide on another PC and it just worked… Are you running PowerShell as admin? This is the line in the function that is failing for you; if you can get it to run, the rest of the script should work: Add-Type -Path "C:\ps\itextsharp.dll"

      The only thing I put in the C:\ps folder was that itextsharp.dll, so maybe the DLL you have is bad? Here is the one I just tested: https://1drv.ms/u/s!Al3V0Ewdxn5Kk9t0b2xpNFlSHCwF5w?e=iYTy7Z

      1. He never replied back to say if what I sent fixed the problem, but I’m pretty sure he just has a bad copy of itextsharp.dll. Did you try the one I linked for him? The easiest way to test if the dll is working is to run this command:

        Add-Type -Path "C:\ps\itextsharp.dll"

        If you don’t get an error, the rest of the script should work.
