The other day I helped a co worker with a script he was working on. He needed to read text from a PDF with Powershell. I had done this in the past with autoit but that wasn’t going to be an option this time. There are a lot of posts about this online but they almost all lead to itext7 I don’t if my co worker and I are just dumb but we just could not get their module installed. I did end up finding a different way to get this done. You really just need this DLL that has the library to deal with PDF files. I cant upload it here but you can get it easily.
Get DLL
You can download the .DLL file from this site (UPDATE 10-19-21: the original .dll is no longer at the original link. Another commenter pointed out its on github here also I created a share link for the .dll I use for this and I know this one works. That link is here) . When you get to the site click the “Download Archive” button. This will give you a zip file. Extract it, inside the folder open sourceCode, Main, Libraries. There you will find itextsharp.dll. Copy this file to C:\PS\ (this is where our script will look).
Read text from PDF file
I made this into a function so it is easy to use in a larger script. here it is:
function convert-PDFtoText {
param(
[Parameter(Mandatory=$true)][string]$file
)
Add-Type -Path "C:\ps\itextsharp.dll"
$pdf = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $file
for ($page = 1; $page -le $pdf.NumberOfPages; $page++){
$text=[iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($pdf,$page)
Write-Output $text
}
$pdf.Close()
}
This is is an example of how to run it and display the results to the screen.
$file = "C:\Path\To\PDF.pdf"
convert-PDFtoText $file
With this example we set the text into a variable for later use.
$file = "C:\Path\To\PDF.pdf"
$text = convert-PDFtoText $file