Month: March 2024

  • Day One: Text Wrangling

    We are starting with a wholly unusable PDF file (https://docs.house.gov/billsthisweek/20240318/WDI39597.PDF). First, we need to parse it into usable data.

    Goal 1: Read and parse the document

    Goal 2: Extract sections.

    Read and parse the document

    What are some readily available open source projects that I can use to parse PDFs into text?

    1. Poppler Utils: Poppler is a PDF rendering library that includes a variety of utilities for extracting text, images, and other content from PDFs. The pdftotext command is part of this suite and is widely used for converting PDF documents into plain text.
    2. Xpdf: Similar to Poppler, Xpdf is a toolset that includes the pdftotext command. It’s an older project but still effective for extracting text from PDF files.
    3. PDFMiner: PDFMiner is a tool written in Python specifically for extracting text, images, and metadata from PDF files. It’s more flexible than pdftotext and allows for more detailed analysis of the PDF structure, making it suitable for more complex extraction tasks.
    4. mutool: Part of the MuPDF suite, mutool can extract text and images from PDF files. MuPDF is known for its speed and the quality of its rendering.
    5. Apache PDFBox: Although primarily a Java library for working with PDF documents, PDFBox comes with a command-line utility that can be used to extract text from PDFs. It’s useful for those who prefer a Java-based solution.
    6. Tesseract OCR: For PDFs that contain mostly images of text (like scanned documents), Tesseract OCR can be a powerful tool. It’s an optical character recognition (OCR) engine that can convert images into text, and with the right preprocessing, it can be used to extract text from image-based PDFs.

    Let’s try some of these out and see how the results vary. I’m most interested in #1 (Poppler) and #3 (PDFMiner).

    With Poppler, I found the output provides a good starting point for text cleanup.

    pdftotext -layout -enc UTF-8 WDI39597.pdf poppler.txt

    PDFMiner has more options in terms of formats (text | XML | HTML). The first thing I noticed, though, is that it’s significantly slower to execute. And the output is far less usable. I was hopeful for the HTML or XML output, but the most ridiculous output was the XML: there were literally tags around every letter.

    pdf2txt.py -o pdfminer.txt -t text -A WDI39597.pdf 
    pdf2txt.py -o pdfminer.html -t html -A WDI39597.pdf 
    pdf2txt.py -o pdfminer.xml -t xml -A WDI39597.pdf 

    POPPLER WINS! It creates a usable output and is WAY faster in terms of execution, not that that is a huge factor.

    Now we have something that looks like this: https://playwell.studio/wp-content/uploads/2024/03/poppler.txt

    Now, let’s strip out some garbage and format this a bit more.

    I’m using a Python script to do this part.
    First, it detects page numbers and formats them appropriately.
    Then, it gets rid of leading whitespace.
    Then, it ditches the date and the line under it, which has some crazy special characters.
    Then, it ditches any lines that end in ‘SEN. APPRO’.

    Now we have something that looks like this… https://playwell.studio/wp-content/uploads/2024/03/WDI39597.txt

    I’ve preserved the page numbers and line numbers for citation purposes. So, if I want to recall where appropriations were made in the bill, I can cite “Page 36 Line 22” for example.
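
    To show how a citation like that could be looked up, here is a rough sketch. It assumes, hypothetically, that the cleaned text marks pages with a line like `[Page 36]` and that each bill line starts with its printed line number; the real file may be laid out differently. The function names and the sample text are invented for the example.

```python
import re

PAGE_MARKER = re.compile(r"^\[Page (\d+)\]$")  # hypothetical page marker format
NUMBERED_LINE = re.compile(r"^(\d+)\s+(.*)$")  # e.g. "22 text of the line"


def build_citation_index(cleaned: str) -> dict:
    """Map (page, line) pairs to the text found on that line."""
    index = {}
    page = None
    for line in cleaned.splitlines():
        marker = PAGE_MARKER.match(line)
        if marker:
            page = int(marker.group(1))
            continue
        numbered = NUMBERED_LINE.match(line)
        if numbered and page is not None:
            index[(page, int(numbered.group(1)))] = numbered.group(2)
    return index


def cite(page: int, line: int) -> str:
    """Format a citation the way it is quoted in the text."""
    return f"Page {page} Line {line}"


# Made-up sample text, not taken from the bill.
sample = (
    "[Page 36]\n"
    "21 For necessary expenses,\n"
    "22 $1,000,000, to remain available\n"
    "[Page 37]\n"
    "1 until expended."
)
index = build_citation_index(sample)
print(cite(36, 22), "->", index[(36, 22)])
```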

    I’ll have to get to extracting the sections tomorrow…

  • US Spending Visualizations

    This week another Uniparty Omnibus spending bill was passed without much of a fuss. I was thinking Speaker Johnson was going to be a force to stand up to the machine and reduce spending. I thought he was going to change things. I may have been mistaken. We need to get inflation under control; it’s like a brush fire that could consume the country. Meanwhile the money printing machine is in overdrive. Instead of whining about it on X, why not do something that’ll bring some visibility and comprehensibility to these massive bills?

    Many years back, I registered the domain politipal.com, which I had grandiose plans for. Naturally, I’ve done nothing with it. It’s time to change that too.

    If you haven’t seen one before, these bills are published in the most unusable format possible: a super lengthy document that no one can easily read or understand.

    No way to compare to previous years, no way to visualize using common graph paradigms. Hopefully, this project will fix that.

    How does a project like this make money? I have no f’ing clue, but I’m tired of doing nothing and watching the shit show carry on uninterrupted.

    The first step is a POC. Can I parse this bill text into usable data with readily available open source scripts, programs, etc?

    Automated Workflow:

    1. Read and parse the document, extracting sections.
    2. For each section, extract relevant details.
    3. Format those details into a JSON object.
    4. Insert the JSON object into Database.
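
    The four steps above could be sketched roughly like this. Everything specific here is an assumption: that sections begin with a "SEC. N." heading at the start of a line, that the "relevant details" are dollar amounts, and that SQLite stands in for the database. The function and table names are hypothetical.

```python
import json
import re
import sqlite3

SECTION_HEADING = re.compile(r"^SEC\. (\d+)\.", re.MULTILINE)  # assumed heading format
DOLLAR_AMOUNT = re.compile(r"\$\d+(?:,\d{3})*")


def split_sections(text):
    """Step 1: split the text at each assumed 'SEC. N.' heading."""
    matches = list(SECTION_HEADING.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((int(m.group(1)), text[m.start():end].strip()))
    return sections


def extract_details(number, body):
    """Step 2: pull out illustrative details (here, only dollar amounts)."""
    amounts = [int(a[1:].replace(",", "")) for a in DOLLAR_AMOUNT.findall(body)]
    return {"section": number, "amounts": amounts, "text": body}


def load_bill(text, conn):
    """Steps 3 and 4: serialize each section to JSON and insert it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sections (number INTEGER, details TEXT)"
    )
    count = 0
    for number, body in split_sections(text):
        details = extract_details(number, body)
        conn.execute(
            "INSERT INTO sections VALUES (?, ?)", (number, json.dumps(details))
        )
        count += 1
    conn.commit()
    return count


# Made-up sample text, not taken from the bill.
sample = (
    "SEC. 101. For salaries, $1,000,000.\n"
    "SEC. 102. For travel, $250,000 and $5,000.\n"
)
conn = sqlite3.connect(":memory:")
inserted = load_bill(sample, conn)
```

    Storing the details as a JSON string in one column keeps the sketch short; a real schema would probably break the amounts out into their own columns so they can be compared across years.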