We are starting with a wholly unusable PDF file (https://docs.house.gov/billsthisweek/20240318/WDI39597.PDF). First, we need to parse it into usable data.
Goal 1: Read and parse the document
Goal 2: Extract sections
Read and parse the document
What are some readily available open source projects that I can use to parse PDFs into text?
- Poppler Utils: Poppler is a PDF rendering library that includes a variety of utilities for extracting text, images, and other content from PDFs. The pdftotext command is part of this suite and is widely used for converting PDF documents into plain text.
- Xpdf: Similar to Poppler, Xpdf is a toolset that includes the pdftotext command. It’s an older project but still effective for extracting text from PDF files.
- PDFMiner: PDFMiner is a tool written in Python specifically for extracting text, images, and metadata from PDF files. It’s more flexible than pdftotext and allows for more detailed analysis of the PDF structure, making it suitable for more complex extraction tasks.
- mutool: Part of the MuPDF suite, mutool can extract text and images from PDF files. MuPDF is known for its speed and the quality of its rendering.
- Apache PDFBox: Although primarily a Java library for working with PDF documents, PDFBox comes with a command-line utility that can be used to extract text from PDFs. It’s useful for those who prefer a Java-based solution.
- Tesseract OCR: For PDFs that contain mostly images of text (like scanned documents), Tesseract OCR can be a powerful tool. It’s an optical character recognition (OCR) engine that can convert images into text, and with the right preprocessing, it can be used to extract text from image-based PDFs.
Let’s try some of these out and see how the results vary. I’m most interested in #1 (Poppler) and #3 (PDFMiner).
Using the Poppler option, I found the command below 👇 provides a good starting point for text cleanup.
pdftotext -layout -enc UTF-8 WDI39597.pdf poppler.txt
PDFMiner has more options in terms of output formats (text | HTML | XML). The first thing I noticed, though, is that it’s significantly slower to execute. Annnd… the output is far less usable. I was hopeful for the HTML or XML output. The most ridiculous output was the XML: there were literally tags around every letter.
pdf2txt.py -o pdfminer.txt -t text -A WDI39597.pdf
pdf2txt.py -o pdfminer.html -t html -A WDI39597.pdf
pdf2txt.py -o pdfminer.xml -t xml -A WDI39597.pdf
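To give a sense of what “tags around every letter” means, the XML output looks roughly like this, one element per character (the attributes and values here are illustrative and vary by PDFMiner version):

<text font="TimesNewRoman" bbox="108.000,696.218,114.002,708.218" size="12.000">S</text>
<text font="TimesNewRoman" bbox="114.002,696.218,120.004,708.218" size="12.000">E</text>
<text font="TimesNewRoman" bbox="120.004,696.218,126.006,708.218" size="12.000">C</text>
…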
POPPLER WINS! It creates usable output and is WAY faster in terms of execution, not that speed is a huge factor here.
Now we have something that looks like this: https://snovak.com/wp-content/uploads/2024/03/poppler.txt
Now, let’s strip out some garbage and format this a bit more.
I’m using a Python script to do this part. A minimal sketch of it follows the steps below.
First, it detects the page number and formats it appropriately.
Then it strips leading whitespace.
Then it ditches the date line, and the line under that, which has some crazy special characters…
Then it ditches any lines that end in ‘SEN. APPRO’.
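Here’s that sketch. The regexes are assumptions about how the page numbers and the date header actually appear in poppler.txt, so treat them as placeholders to tune against the real output:

import re
import sys

def clean(lines):
    out = []
    skip_next = False
    for line in lines:
        if skip_next:                       # the special-character line under the date
            skip_next = False
            continue
        text = line.lstrip().rstrip("\n")   # strip leading whitespace
        # Ditch the date header and the line beneath it (date format assumed).
        if re.match(r"^[A-Z][a-z]+ \d{1,2}, \d{4}", text):
            skip_next = True
            continue
        # Ditch any header line ending in 'SEN. APPRO'.
        if text.endswith("SEN. APPRO"):
            continue
        # A line that is only a number is (assumed to be) a page number;
        # rewrite it as an explicit marker so citations survive the cleanup.
        if re.fullmatch(r"\d+", text):
            out.append(f"[Page {text}]")
            continue
        out.append(text)
    return out

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as src:
        cleaned = clean(src.readlines())
    with open(sys.argv[2], "w", encoding="utf-8") as dst:
        dst.write("\n".join(cleaned) + "\n")

Run it as python clean.py poppler.txt WDI39597.txt. The bill’s own line numbers pass through untouched.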
Now we have something that looks like this… https://snovak.com/wp-content/uploads/2024/03/WDI39597.txt
I’ve preserved the page numbers and line numbers for citation purposes. So, if I want to recall where appropriations were made in the bill, I can cite “Page 36 Line 22” for example.
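To show the payoff: assuming the [Page N] markers and the bill’s leading line numbers from the sketch above, a few lines of Python turn a keyword search into exactly those citations (again, the patterns are placeholders to match against the real file):

import re

page = None
with open("WDI39597.txt", encoding="utf-8") as f:
    for row in f:
        marker = re.match(r"\[Page (\d+)\]", row)
        if marker:
            page = marker.group(1)          # track the current page
            continue
        m = re.match(r"(\d{1,2})\s+(.*)", row)
        if m and page and "appropriated" in m.group(2):
            print(f"Page {page} Line {m.group(1)}: {m.group(2).strip()}")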
I’ll have to get to extracting the sections tomorrow…