Searching for text in PDFs at increasing scale
I recently had the interesting challenge of searching for text within a large number of PDFs. This was to help a finance team automate the organising and categorising of some of their existing documents. When I say a large number, it was around 350,000 PDF documents, so quite a few! I iterated through a few different solutions, trying to focus on delivering something efficient. I tested each on a smaller scenario to benchmark how they might perform at increasing scale - the results can be found at the end of the article.
Getting started with PyPDF2
With Python being my usual go-to Swiss Army knife for many things, I first installed this very useful package to give it a go:
pip install PyPDF2
I had read about PyPDF2 in Automate the Boring Stuff with Python, so at least I had a starting point. PyPDF2 has also had some changes in the latest version, 3.0.1, which you can read about in the documentation and migration guide, so some of the function names have changed. I put together the following CLI tool using the PyPDF2 package:
import PyPDF2
import re
import time
import sys


def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python pdf_searcher.py filename.pdf")

    filename = sys.argv[1]
    file = open(filename, "rb")
    pdf_reader = PyPDF2.PdfReader(file)      # Formerly PyPDF2.PdfFileReader(file)
    number_of_pages = len(pdf_reader.pages)  # Formerly pdf_reader.getNumPages()
    start = time.time()

    print("Type your search term and hit enter")
    print("You can add as many search terms as you like")
    print("Once you're done, hit enter to continue...")
    search_terms = get_search_terms_from_user(search_terms=[])

    for i in range(0, number_of_pages):
        page = pdf_reader.pages[i]           # Formerly pdf_reader.getPage(i)
        page_content = page.extract_text()   # Formerly page.extractText()
        for search_term in search_terms:
            if re.search(search_term, page_content):
                print(f"Matched '{search_term}' on page {i}")

    print(f"Program took {time.time() - start} seconds")


def get_search_terms_from_user(search_terms: list) -> list:
    search_term = str(input("Search term: "))
    if search_term != "":
        search_terms.append(search_term)
        return get_search_terms_from_user(search_terms)
    else:
        return search_terms


if __name__ == "__main__":
    main()
This accepted a filename as a command line argument, followed by a prompt to enter search terms.
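For illustration, a run might look something like this - the filename and the terms entered here are just examples, not files from the real data set:

    python pdf_searcher.py annual-report.pdf
    Type your search term and hit enter
    You can add as many search terms as you like
    Once you're done, hit enter to continue...
    Search term: disney
    Search term: mercedes
    Search term:

Hitting enter on an empty prompt ends the term entry, and the script then reports any page numbers where a term matched, followed by the elapsed time.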
Optimising the PyPDF2 script
So this was a good start and a fun program for searching a single PDF, but some optimisations were needed. The program also needed to search an entire directory of files, so it needed extending. It didn't need to find every occurrence of a search term in a given document, just confirm that the term occurs at least once. So to optimise for that use case, once a search term has been found in a document, the program doesn't look for that term again, saving time.
import PyPDF2
import re
import time
import sys
import os
import glob


def main():
    directory = os.path.dirname(os.path.abspath(__file__))
    pdf_filepaths = glob.glob("**/*.pdf", recursive=True)
    start = time.time()
    results = {}

    for filepath in pdf_filepaths:
        print(f"Searching document {filepath}")
        search_terms = ["hurricanes", "walt", "avenue", "disney", "mercedes"]
        filename = os.path.basename(filepath)
        found_terms = {}
        file = open(filepath, "rb")
        pdf_reader = PyPDF2.PdfReader(file)      # Formerly PyPDF2.PdfFileReader(file)
        number_of_pages = len(pdf_reader.pages)  # Formerly pdf_reader.getNumPages()

        for i in range(0, number_of_pages):
            page = pdf_reader.pages[i]           # Formerly pdf_reader.getPage(i)
            page_content = page.extract_text()   # Formerly page.extractText()
            for term in search_terms:
                if term in found_terms.keys():
                    continue
                if re.search(term.lower(), page_content.lower()):
                    print(f"Found '{term}' in document '{filename}'")
                    found_terms[term] = 1
                    if filename in results.keys():
                        results[filename].append(term)
                    else:
                        results[filename] = [term]

    print(f"Program took {time.time() - start} seconds")
    print(results)


if __name__ == "__main__":
    main()
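One further optimisation along the same lines, which isn't in the version above, would be to stop reading pages altogether once every search term has been found in the current document. A minimal sketch of that early exit, reusing the same variables as the script above:

    for i in range(0, number_of_pages):
        page_content = pdf_reader.pages[i].extract_text()
        for term in search_terms:
            if term in found_terms:
                continue
            if re.search(term.lower(), page_content.lower()):
                found_terms[term] = 1
                results.setdefault(filename, []).append(term)
        # No point reading further pages once every term has been found in this document
        if len(found_terms) == len(search_terms):
            break

For documents where all terms appear early on, this avoids extracting text from the remaining pages, which is where most of the time goes.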
Alternative approach with pdftotext subprocess
The second solution ran the pdftotext program in a Python subprocess and received the extracted text as the subprocess output. It did exactly the same thing as the previous script but might be faster - we'll compare the speed of each approach later.
import os
import subprocess
import re
import time
import glob


def main():
    directory = os.path.dirname(os.path.abspath(__file__))
    pdf_filepaths = glob.glob("**/*.pdf", recursive=True)
    start = time.time()
    results = {}

    for filepath in pdf_filepaths:
        print(f"Searching document {filepath}")
        search_terms = ["hurricanes", "epcot", "daimler", "disney", "mercedes"]
        filename = os.path.basename(filepath)
        found_terms = {}
        args = ["pdftotext",
                '-enc',
                'UTF-8',
                filepath,  # Example: "pdfs/United-Kingdom-Strategic-Export-Controls-Annual-Report-2021.pdf"
                '-']
        res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output = res.stdout.decode('utf-8')

        for term in search_terms:
            if term in found_terms.keys():
                continue
            if re.search(term.lower(), output.lower()):
                print(f"Found '{term}' in document '{filename}'")
                found_terms[term] = 1
                if filename in results.keys():
                    results[filename].append(term)
                else:
                    results[filename] = [term]

    print(f"Program took {time.time() - start} seconds")
    print(results)


if __name__ == "__main__":
    main()
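One thing to note with this approach is that it relies on pdftotext being on the PATH (or sitting next to the script), and a corrupt or unreadable PDF will cause the subprocess to return a non-zero exit code. A small sketch of how that could be guarded against - this wasn't part of the benchmarked script, and the `continue` assumes it sits inside the per-file loop above:

    try:
        res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except FileNotFoundError:
        # pdftotext isn't installed or isn't on the PATH
        raise SystemExit("pdftotext executable not found")

    if res.returncode != 0:
        # pdftotext couldn't read this document - report it and move on to the next file
        print(f"pdftotext failed for {filepath}: {res.stderr.decode('utf-8', errors='replace')}")
        continue

    output = res.stdout.decode("utf-8", errors="replace")

With 350,000 documents of varying quality, skipping and logging failures rather than stopping the whole run is probably the safer behaviour.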
Trying out C# and iTextSharp
I thought I'd switch to C# and investigate the iTextSharp NuGet package for reading and searching PDFs. I was pleasantly surprised at how well this package worked. It was also quick to install and get started with. Here is the program I put together using it:
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFSearcherSharp
{
    class Program
    {
        static void Main(string[] args)
        {
            var stopwatch = new System.Diagnostics.Stopwatch();
            stopwatch.Start();

            string directory = @"C:/Users/shedloadofcode/source/repos/PDFSearcherSharp/pdfs/";
            string[] files = Directory.GetFiles(directory, "*.pdf");
            List<string> searchTerms = new List<string>() { "hurricanes", "epcot", "daimler", "disney", "mercedes" };

            foreach (var filename in files)
            {
                Console.WriteLine($"Searching document {filename}");
                StringBuilder stringBuilder = new StringBuilder();
                string filePath = System.IO.Path.Combine(directory, filename);

                using (PdfReader reader = new PdfReader(filePath))
                {
                    List<string> foundTerms = new List<string>();
                    for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
                    {
                        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                        string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
                        text = Encoding.UTF8.GetString(
                            ASCIIEncoding.Convert(
                                Encoding.Default,
                                Encoding.UTF8,
                                Encoding.Default.GetBytes(text)
                            )
                        );
                        stringBuilder.Append(text);

                        foreach (string term in searchTerms)
                        {
                            if (foundTerms.Contains(term))
                            {
                                continue;
                            }

                            if (text.ToLower().Contains(term.ToLower()))
                            {
                                Console.WriteLine($"Found '{term}' in document '{filename}'");
                                foundTerms.Add(term);
                            }
                        }
                    }
                }

                // Console.WriteLine(stringBuilder.ToString());
            }

            stopwatch.Stop();
            Console.WriteLine($"Program took {stopwatch.ElapsedMilliseconds / 1000} seconds");
        }
    }
}
A last approach with C++ and pdftotext
The fourth and final approach involved calling the pdftotext executable again, but this time with the main script written in C++. More than anything, I was curious to see how a solution might be put together in C++. I couldn't figure out a way to capture the output of the pdftotext executable in-process, so I resorted to converting the PDFs to text files first, searching the text files, and finally deleting them - the extra file I/O adds overhead, which will likely slow this approach down.
#include <Windows.h>
#include <fstream>
#include <iostream>
#include <string>
#include <regex>
#include <map>
#include <filesystem>
#include <vector>
#include <thread>
#include <algorithm>  // std::for_each
#include <cctype>     // tolower
#include <cstdio>     // remove
#include <ctime>      // clock

using namespace std;
using std::filesystem::directory_iterator;

void DeleteTextFile(string filePath)
{
    string fileName = filePath;
    fileName = fileName.substr(0, fileName.size() - 4);
    fileName = fileName + ".txt";
    const char* file = fileName.c_str();

    if (remove(file) != 0)
        cout << "Error deleting file " << fileName << endl;
    else
        cout << "File " << fileName << " successfully deleted" << endl;
}

string TransformLineToLowercase(string line)
{
    std::for_each(line.begin(), line.end(), [](char& c)
    {
        c = ::tolower(c);
    });

    return line;
}
void SearchTextFile(string fileName, string searchTerms[], int searchTermsLength)
{
    map<string, bool> foundSearchTerms;
    for (int i = 0; i < searchTermsLength; i++)
    {
        foundSearchTerms[searchTerms[i]] = false;
    }

    fstream textFile;
    textFile.open(fileName, ios::in);

    if (textFile.is_open())
    {
        string line;
        while (getline(textFile, line))
        {
            string lowercaseLine = TransformLineToLowercase(line);
            for (int i = 0; i < searchTermsLength; i++)
            {
                bool searchTermAlreadyFound = foundSearchTerms[searchTerms[i]];
                if (searchTermAlreadyFound)
                {
                    continue;
                }

                size_t indexOfMatch = lowercaseLine.find(searchTerms[i]);
                if (indexOfMatch != string::npos)
                {
                    cout << "Found search term " << searchTerms[i] << " in " << fileName << " at ";
                    cout << "position " << indexOfMatch << " in line " << lowercaseLine << endl;
                    foundSearchTerms[searchTerms[i]] = true;
                }
            }
        }

        textFile.close();
    }
}
vector<std::filesystem::path> GetAllFileNamesInDirectory()
{
    string path = "pdfs/";
    vector<std::filesystem::path> filePaths;

    for (const auto& file : directory_iterator(path))
    {
        filePaths.push_back(file.path());
    }

    return filePaths;
}

void GenerateTextFile(string filePath)
{
    STARTUPINFO startupInfo;
    PROCESS_INFORMATION processInformation;
    ZeroMemory(&startupInfo, sizeof(startupInfo));
    ZeroMemory(&processInformation, sizeof(processInformation));
    startupInfo.cb = sizeof(startupInfo);

    wstring filePathWs = wstring(filePath.begin(), filePath.end());
    wstring commandLineArgs = L"pdftotext.exe -enc UTF-8 \"" + filePathWs + L"\"";
    std::wstring commandLineInput(commandLineArgs);

    // This was the first attempt
    // wchar_t commandLineInput[] = TEXT("pdftotext.exe -enc UTF-8 \"pdfs/United-Kingdom-Strategic-Export-Controls-Annual-Report-2021 - Copy - Copy (7).pdf\"");

    bool output = CreateProcess(
        NULL,                  // Application name
        &commandLineInput[0],  // Command line arguments
        NULL,                  // Process attributes
        NULL,                  // Thread attributes
        TRUE,                  // Inherit handles
        0,                     // No creation flags
        NULL,                  // Environment
        NULL,                  // Current directory
        &startupInfo,          // Startup information
        &processInformation    // Process information
    );

    if (output == FALSE)
    {
        cout << "Generating text file for PDF " << filePath << " failed" << endl;
    }
    else
    {
        cout << "Generating text file for PDF " << filePath << endl;
        // cout << "Process ID: " << processInformation.dwProcessId << endl;
    }

    WaitForSingleObject(processInformation.hProcess, INFINITE);
    CloseHandle(processInformation.hProcess);
    CloseHandle(processInformation.hThread);
}
int main()
{
    clock_t start = clock();
    vector<std::filesystem::path> filePaths = GetAllFileNamesInDirectory();

    // First pass: convert every PDF to a text file with pdftotext
    for (int i = 0; i < filePaths.size(); i++)
    {
        string filePath = filePaths[i].string();
        GenerateTextFile(filePath);
    }

    // Second pass: search the generated text files
    for (int i = 0; i < filePaths.size(); i++)
    {
        string filePath = filePaths[i].string();
        string fileName = filePath;
        fileName = fileName.replace(0, 5, "");
        fileName = fileName.substr(0, fileName.size() - 4);

        string searchTerms[5] = { "hurricanes", "epcot", "daimler", "disney", "mercedes" };
        string textFilePath = "pdfs/" + fileName + ".txt";
        SearchTextFile(textFilePath, searchTerms, (sizeof(searchTerms) / sizeof(*searchTerms)));
    }

    // Final pass: clean up the intermediate text files
    for (int i = 0; i < filePaths.size(); i++)
    {
        string filePath = filePaths[i].string();
        DeleteTextFile(filePath);
    }

    double duration = (clock() - start) / (double)CLOCKS_PER_SEC;
    cout << "Program took " << duration << " seconds" << endl;

    system("pause > 0");
    return 0;
}
Test exercise and speed benchmarks
So we now have four (almost) equivalent programs in terms of logic and desired output. It was time to run all of the solutions above through a scenario to see how they perform. The scenario was a directory /pdfs containing around 200 PDF documents. Each program needed to search all of the PDF documents and return the names of the files containing the search terms. I placed a few PDFs that I knew contained the search terms, giving them distinctive file names, so I could check the matching worked. Most documents were around 71 - 150 pages, with the largest at 432 pages, so I was testing with quite large files. If this ever went into production the files would likely be much smaller. Ok, here we go!
Inputs
- 200 PDF documents
- Each PDF between 71 and 432 pages
- Average PDF file size was 5MB
- Number of search terms was 5 ["hurricanes", "epcot", "daimler", "disney", "mercedes"]
- My two target files were a Disney financial report and a Daimler financial report as I knew these actually contained the search terms (no particular reason I chose these, they were just the first I could find 😆)
Results
| Approach | Found all search terms | Time in seconds |
| --- | --- | --- |
| Python and PyPDF2 | Yes | 306 |
| Python running pdftotext.exe | Yes | 66 |
| C# and iTextSharp | Yes | 66 |
| C++ running pdftotext.exe | Yes | 72 |
As I predicted, the C++ program was slowed down by having to convert the PDFs to text files before searching. The most performant approaches, and my preferred ones, are Python running pdftotext.exe (which makes it straightforward to receive the stdout of the child process) and C# with the iTextSharp NuGet package. Both of these solutions completed the test scenario in 66 seconds.
Folder structure for Python project (containing both versions)
/pdfs
pdftotext.exe
pdftotextsearcher.py
pypdfsearcher.py
Folder structure for C# Visual Studio project
/bin
/obj
/pdfs
/PDFSearcherSharp
PDFSearcher.csproj
PDFSearcherSharp.sln
Program.cs
Folder structure for C++ Visual Studio project
/pdfs
PdfSearcher.cpp
PdfSearcher.sln
PdfSearcher.vcxproj
PdfSearcher.vcxproj.filters
PdfSearcher.vcxproj.user
pdftotext.exe
Reflections
So I learned quite a bit from this exercise, and it provides a good starting point to develop a solution further. It certainly needs more testing and refining for the specific use case. If searching these 200 or so fairly large files took 66 seconds, then in the worst case 350,000 / 200 is 1,750 batches, and 1,750 x 66 gives 115,500 seconds. Dividing that by 60 gives 1,925 minutes. Dividing that by 60 gives roughly 32 hours. Finally, dividing that by 24 gives about 1.3 days 😄. Moving one of these scripts onto a virtual machine and letting it run until done might be the best solution, depending on where the files are stored.
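That back-of-the-envelope estimate is easy to sanity check in a few lines, using the numbers from the test scenario above:

    total_documents = 350_000
    benchmark_documents = 200
    benchmark_seconds = 66

    batches = total_documents / benchmark_documents   # 1,750 batches of 200 documents
    total_seconds = batches * benchmark_seconds       # 115,500 seconds
    print(total_seconds / 60)                         # ~1,925 minutes
    print(total_seconds / 60 / 60)                    # ~32 hours
    print(total_seconds / 60 / 60 / 24)               # ~1.3 days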
A caveat to note is that the PDFs I used had searchable text, so if you had scanned PDF documents you might need to go down the avenue of OCR (optical character recognition). I hear pytesseract is useful for this, as it acts as a wrapper for Google's Tesseract-OCR engine. I might venture into this area next if the need arises 😄. Altogether I hope I've shown that reading and searching many PDFs at increasing scale is possible with several different approaches, even if they can be a little temperamental.
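I haven't tried the OCR route myself yet, but a minimal sketch might look something like the following - assuming the pytesseract and pdf2image packages are installed, along with the Tesseract engine and Poppler utilities they depend on (the filename and search term are just examples):

    import pytesseract
    from pdf2image import convert_from_path

    # Render each page of a scanned PDF to an image, then OCR the image to get text
    pages = convert_from_path("scanned-document.pdf")
    for page_number, page_image in enumerate(pages, start=1):
        text = pytesseract.image_to_string(page_image)
        if "disney" in text.lower():
            print(f"Matched on page {page_number}")

OCR is considerably slower than extracting embedded text, so at 350,000 documents the time estimates above would grow substantially.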
Resources
- Searching text in a PDF using Python
- Using pdftotext on AWS Lambda
- Extract text from PDF in C#
- Extract text from PDF using iTextSharp
- Searching strings in C#
- Child Process in Windows System Programming
- Creating a Child Process with Redirected Input and Output
- PDF parsing in C++
- C++ regex
- C++ list files in a directory
- C++ Wide Char Array Strings