Read text from a PDF with Powershell

The other day I helped a co worker with a script he was working on. He needed to read text from a PDF with Powershell. I had done this in the past with autoit but that wasn’t going to be an option this time. There are a lot of posts about this online but they almost all lead to itext7 I don’t if my co worker and I are just dumb but we just could not get their module installed. I did end up finding a different way to get this done. You really just need this DLL that has the library to deal with PDF files. I cant upload it here but you can get it easily.

Get DLL

You can download the .DLL file from this site (UPDATE 10-19-21: the original .dll is no longer at the original link. Another commenter pointed out its on github here also I created a share link for the .dll I use for this and I know this one works. That link is here) . When you get to the site click the “Download Archive” button. This will give you a zip file. Extract it, inside the folder open sourceCode, Main, Libraries. There you will find itextsharp.dll. Copy this file to C:\PS\ (this is where our script will look).

Read text from PDF file

I made this into a function so it is easy to use in a larger script. here it is:

function convert-PDFtoText {
	param(
		[Parameter(Mandatory=$true)][string]$file
	)	
	Add-Type -Path "C:\ps\itextsharp.dll"
	$pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
	for ($page = 1; $page -le $pdf.NumberOfPages; $page++){
		$text=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
		Write-Output $text
	}	
	$pdf.Close()
}

This is is an example of how to run it and display the results to the screen.

$file = "C:\Path\To\PDF.pdf"

convert-PDFtoText $file

With this example we set the text into a variable for later use.

$file = "C:\Path\To\PDF.pdf"
$text = convert-PDFtoText $file
Tagged :