Read text from a PDF with PowerShell

The other day I helped a coworker with a script he was working on. He needed to read text from a PDF with PowerShell. I had done this in the past with AutoIt, but that wasn't going to be an option this time. There are a lot of posts about this online, but they almost all lead to iText 7. I don't know if my coworker and I are just dumb, but we could not get their module installed. I did end up finding a different way to get this done. You really just need a DLL that contains the library for working with PDF files. I can't upload it here, but it is easy to get.

Get DLL

You can download the DLL file from this site. When you get to the site, click the “Download Archive” button. This will give you a zip file. Extract it, then inside the extracted folder open sourceCode, then Main, then Libraries. There you will find itextsharp.dll. Copy this file to C:\PS\ (this is where our script will look for it).
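Before moving on, it can save some head scratching to confirm the DLL really is where the script expects it. Here is a rough sketch of a check, not something from the original script. The function name Test-PdfLibrary is made up for this example, and the default path is just the C:\PS\ location used in this post. It also runs Unblock-File, because Windows can mark downloaded files as blocked and that sometimes stops Add-Type from loading them.

```powershell
# Hypothetical helper (not part of the original script).
# Assumption: the DLL was copied to C:\PS\itextsharp.dll as described above.
function Test-PdfLibrary {
    param(
        [string]$Path = "C:\PS\itextsharp.dll"
    )
    # Make sure the path exists and is a file, not a folder
    if (-not (Test-Path -LiteralPath $Path -PathType Leaf)) {
        Write-Warning "Library not found at $Path"
        return $false
    }
    # Clear the "downloaded from the internet" mark if Windows added one
    Unblock-File -LiteralPath $Path
    return $true
}

Test-PdfLibrary
```

If this returns False, double check the folder you copied the file into before trying the function further down.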

Read text from PDF file

I made this into a function so it is easy to use in a larger script. Here it is:

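function Convert-PDFtoText {
	param(
		[Parameter(Mandatory=$true)][string]$file
	)
	# Load the iTextSharp library (change the path if you saved the DLL somewhere else)
	Add-Type -Path "C:\PS\itextsharp.dll"
	$pdf = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $file
	try {
		# Page numbers start at 1, one string is written out per page
		for ($page = 1; $page -le $pdf.NumberOfPages; $page++) {
			$text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf, $page)
			Write-Output $text
		}
	}
	finally {
		# Always release the file, even if a page fails to extract
		$pdf.Close()
	}
}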

This is an example of how to run it and display the results on the screen.

$file = "C:\Path\To\PDF.pdf"

convert-PDFtoText $file

In this example we store the text in a variable for later use. For a PDF with more than one page, the variable will hold one string per page.

$file = "C:\Path\To\PDF.pdf"
$text = convert-PDFtoText $file
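Once the text is in a variable, you will probably want to look for something in it. Since the function writes out one string per page, here is a small sketch of how you could search the pages and report which page each hit is on. Find-PdfText is a name I made up for this example, and the sample pages are fake data so the snippet runs without a PDF. In a real script you would pass in $text instead.

```powershell
# Hypothetical helper (not part of the original script).
# Assumption: $Pages holds one string per page, in page order.
function Find-PdfText {
    param(
        [Parameter(Mandatory=$true)][AllowEmptyString()][string[]]$Pages,
        [Parameter(Mandatory=$true)][string]$Pattern
    )
    for ($i = 0; $i -lt $Pages.Count; $i++) {
        # Split the page into lines (handles Windows and Unix line endings)
        foreach ($line in ($Pages[$i] -split "\r?\n")) {
            # -match treats the pattern as a regex and ignores case
            if ($line -match $Pattern) {
                [pscustomobject]@{ Page = $i + 1; Line = $line.Trim() }
            }
        }
    }
}

# Made-up sample data standing in for the output of the PDF function
$samplePages = @("Invoice 1001`nTotal: 50.00", "Notes`nNothing else here")
Find-PdfText -Pages $samplePages -Pattern "Total"
```

The AllowEmptyString attribute is there because a page with no extractable text (a scanned image, for example) comes back as an empty string, and a mandatory parameter would otherwise reject it.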