Day One: Text Wrangling

We are starting with a wholly unusable PDF file (https://docs.house.gov/billsthisweek/20240318/WDI39597.PDF). First we need to parse it into usable data.

Goal 1: Read and parse the document

Goal 2: Extract sections.

Read and parse the document

What are some readily available open source projects that I can use to parse PDFs into text?

  1. Poppler Utils: Poppler is a PDF rendering library that includes a variety of utilities for extracting text, images, and other content from PDFs. The pdftotext command is part of this suite and is widely used for converting PDF documents into plain text.
  2. Xpdf: Similar to Poppler, Xpdf is a toolset that includes the pdftotext command. It’s an older project but still effective for extracting text from PDF files.
  3. PDFMiner: PDFMiner is a tool written in Python specifically for extracting text, images, and metadata from PDF files. It’s more flexible than pdftotext and allows for more detailed analysis of the PDF structure, making it suitable for more complex extraction tasks.
  4. mutool: Part of the MuPDF suite, mutool can extract text and images from PDF files. MuPDF is known for its speed and the quality of its rendering.
  5. Apache PDFBox: Although primarily a Java library for working with PDF documents, PDFBox comes with a command-line utility that can be used to extract text from PDFs. It’s useful for those who prefer a Java-based solution.
  6. Tesseract OCR: For PDFs that contain mostly images of text (like scanned documents), Tesseract OCR can be a powerful tool. It’s an optical character recognition (OCR) engine that can convert images into text, and with the right preprocessing, it can be used to extract text from image-based PDFs.

Let’s try some of these out and see how the results vary. I’m most interested in poppler and pdfminer.

Using the poppler option I found 👇 provides a good starting point for text cleanup.

pdftotext -layout -enc UTF-8 WDI39597.pdf poppler.txt

PDFMiner has more options in terms of output formats (text | XML | HTML). The first thing I noticed, though, is that it’s significantly slower to execute. Annnd the output is far less usable. I was hopeful for the HTML or XML output. The most ridiculous output was XML: there were literally tags around every letter.

pdf2txt.py -o pdfminer.txt -t text -A WDI39597.pdf 
pdf2txt.py -o pdfminer.html -t html -A WDI39597.pdf 
pdf2txt.py -o pdfminer.xml -t xml -A WDI39597.pdf 

POPPLER WINS! It creates a usable output and is WAY faster in terms of execution, not that that is a huge factor.

Now we have something that looks like this: https://snovak.com/wp-content/uploads/2024/03/poppler.txt

Now, let’s strip out some garbage and format this a bit more.

I’m using a Python script to do this part. The script:

  1. Detects page numbers and formats them appropriately.
  2. Strips leading whitespace.
  3. Ditches the date, and the line under it, which has some crazy special characters….
  4. Ditches any lines that end in ‘SEN. APPRO’.
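The cleanup steps above can be sketched roughly like this. The page-number and footer patterns are my assumptions from eyeballing the poppler output; the date line is matched on a generic month/day/year pattern, and the special-character line under it would need its own pattern matched to the actual file:

```python
import re

def clean_poppler_text(raw: str) -> str:
    """Rough sketch: clean pdftotext -layout output line by line."""
    months = ("January|February|March|April|May|June|"
              "July|August|September|October|November|December")
    out = []
    for line in raw.splitlines():
        line = line.lstrip()                      # strip leading whitespace
        if re.fullmatch(r"\d+", line):            # bare page number
            out.append(f"[Page {line}]")
            continue
        if re.match(rf"({months})\s+\d{{1,2}},\s+\d{{4}}", line):
            continue                              # ditch the date line
        if line.endswith("SEN. APPRO"):           # running footer
            continue
        out.append(line)
    return "\n".join(out)
```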

Now we have something that looks like this… https://snovak.com/wp-content/uploads/2024/03/WDI39597.txt

I’ve preserved the page numbers and line numbers for citation purposes. So, if I want to recall where appropriations were made in the bill, I can cite “Page 36 Line 22” for example.

I’ll have to get to extracting the sections tomorrow…

US Spending Visualizations

This week another Uniparty Omnibus spending bill was passed without much of a fuss. I was thinking Speaker Johnson was going to be a force to stand up to the machine and reduce spending. I thought he was going to change things. I may have been mistaken. 😞 We need to get inflation under control; it’s like a brush fire that could consume the country. Meanwhile the money printing machine is in overdrive. Instead of whining about it on X, why not do something that’ll bring some visibility and comprehensibility to these massive bills?

Many years back, I registered a domain, politipal.com, which I had grandiose plans for. Naturally, I’ve done nothing with it. It’s time to change that too.

If you haven’t seen one before, these bills are published in the most unusable format possible: a super lengthy document that no one can easily read and/or understand. Example 👇🏻

No way to compare to previous years, no way to visualize using common graph paradigms. Hopefully, this project will fix that.

How does a project like this make money? I have no f’ing clue, but I’m tired of doing nothing and watching the shit show carry on uninterrupted.

The first step is a POC. Can I parse this bill text into usable data with readily available open source scripts, programs, etc?

Automated Workflow:

  1. Read and parse the document, extracting sections.
  2. For each section, extract relevant details.
  3. Format those details into a JSON object.
  4. Insert the JSON object into Database.
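As a Python sketch, the workflow above might look something like this. The “SEC. n.” header pattern, the JSON fields, and the SQLite schema are all illustrative assumptions, not the final design:

```python
import json
import re
import sqlite3

def extract_sections(text: str):
    """Split bill text on section headers (assumed 'SEC. n.' pattern)."""
    parts = re.split(r"(?m)^(SEC\. \d+\.)", text)
    for header, body in zip(parts[1::2], parts[2::2]):
        yield {"section": header.strip(), "body": body.strip()}

def load_sections(text: str, db_path: str = ":memory:") -> int:
    """Format each section as JSON and insert it into a database."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sections (section TEXT, data TEXT)")
    count = 0
    for sec in extract_sections(text):
        conn.execute("INSERT INTO sections VALUES (?, ?)",
                     (sec["section"], json.dumps(sec)))
        count += 1
    conn.commit()
    conn.close()
    return count
```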

Resisting The Machine

Thoughts from: I Can’t Overstate How Dire This Is | Bret Weinstein

I recently watched “Leave the World Behind“. It’s a message, a clear and terrifying message from our adversary. It’s a message about what happens when we resist they/them. When I refer to “they/them”, I’m not referring to the confused millennial non-binary they/them sorts. I’m talking about “The Machine”…. you know, the one that “Rage Against the Machine” raged about before the band by that name was corrupted, consumed, and assimilated into the very machine they raged against. I’m talking about The Machine that has largely had a monopoly on influence and power for the last century or so. I’m supposing the 1913 creation of the Federal Reserve is a good marker for that level of influence and power, and for the global elite class that wields it.

I internalized the message they wanted to deliver in the movie. They want people to duck and cover, to hide in the basement, with a cache of food, and a box set of “Friends” DVDs to keep our little minds occupied while the world tears itself apart. They want us out of the way, while their carefully choreographed chaos unravels the fabric of society.

After watching the video below, I’m thinking THAT would be taking their bait. THAT is playing into their hands. THAT is exactly what they want. Instead, in the video below, Bret Weinstein, an extremely brilliant man/scientist/educator, advocates for forming coalitions: getting people together to share ideas and combine the power of the multitudes who stand against The Machine. He suggests that Goliath (as Weinstein refers to The Machine) has lost the first onslaught in their war for power. He speculates that the heroes who have emerged through the first wave largely fit the description of lone wolves. The Machine is learning and leveling up. So we, “The Resistance”, need to learn and level up as well. Those lone wolves need a pack.

With the first wave behind us, the confusion persists. Goliath is looking for a rematch. Some of they/them have been exposed. Presidents of the last decades are all on the list; mix in a bit of Jeffrey Epstein and Hunter Biden, and you have a hot bowl of corruption soup. It seems their exposure reveals more questions than answers. These revelations now float on the surface, but this bowl runs deep.

I listened to Weinstein’s Dark Horse Podcast throughout the COVID crisis. He and Dr. Robert Malone were a bastion of common sense and inquisitive curiosity amid the confusion of the time. His insight earned my respect and trust. During the interview below, I can see he has a sort of existential alarm about him. He has left the rage behind and gone to “war with the machine”. I will pray for him, as I will pray for us too.

Homeschool Web App: Work in Progress

Lately I’ve been building a topics hierarchy. It could otherwise be called categories, or taxonomy, or whatever else, but for some reason “topics” seems to fit the bill.

This is a first run at the UI, basically I need it to add, edit, and remove a nested hierarchy of topics. Only admin users will see this, so it doesn’t have to be pretty, just functional.

Each topic has a color, which I suppose should trickle down to its child topics. This way there will be some visual separation among the different lessons.
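That trickle-down idea could be sketched like this; the class and field names here are hypothetical, not from the actual app:

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

@dataclass
class Topic:
    """One node in the nested topics hierarchy."""
    name: str
    color: Optional[str] = None              # None means inherit from parent
    children: List["Topic"] = field(default_factory=list)

    def flatten(self, inherited: str = "#888888") -> Iterator[Tuple[str, str]]:
        # A topic's own color wins; otherwise the parent's trickles down.
        color = self.color or inherited
        yield self.name, color
        for child in self.children:
            yield from child.flatten(color)
```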

Here I have some initial topics, generated by ChatGPT; these will certainly change. I may use Khan Academy as a template. They have certainly put careful consideration into their taxonomy.

Also, pay no attention to the branding. I’m still undecided between “Homeschool Link (homeschool.ink)” and “Learnalot.net”.

I would love your opinion @ x.com

Upgrading Nextcloud

I use Nextcloud, running on a little box in my closet, as an alternative to iCloud or Google Cloud. It’s amazing, really. I’m very grateful that this open source software is available to people who have the will and wherewithal to buck the big personal data miner mafia corps. When it became obvious what these worms intend to do with our data, I started looking for a way to keep my personal data personal.

I’ve been running Nextcloud version 22 for the last couple years. As you can see from https://nextcloud.com/changelog/ , there have been many updates and upgrades since my original installation and I’ve been quite negligent with my sys admin duties. Today, I’m trying to remedy that situation.

I have Nextcloud running in Docker. I use docker-compose to set up the environment, so I need to upgrade through each major version of Nextcloud using docker-compose, one version at a time.

I use this app DAILY, so I don’t want any surprises, which often happen during upgrades. So first, I’ll replicate all the data from my server to my PC. This way I have a sandbox where I can make all my changes while my production environment remains untouched. If something goes wrong… no problemo.

The PC is a Windows machine, so I’ll spin up an Ubuntu image to do all the transfers.

docker run -it -v "$(pwd):/volume" ubuntu /bin/bash

Next I’ll equip the image with the tools that I need to rsync my way to a mirrored environment.

cd /volume && \
apt update  && \
apt install ssh rsync && \
rsync -rav --stats --progress admin@sourceIP:/path/to/nextcloud /volume -e "ssh -o StrictHostKeyChecking=no"

Good. The transfer is ~80 GB in my case, so that took a minute.

This is my existing docker-compose.yml. You’ll also notice that I’ve specified mariadb:10.7, as that is what is currently running in the production env. I’ll upgrade that as needed.

version: '3'
services:
  nextcloud:
    container_name: nextcloud
    image: "nextcloud:22"
    ports:
      - 8000:80
    restart: always
    volumes:
      - ./html:/var/www/html
      - ./logs:/var/log/apache2
    env_file:
      - ./db.env
    networks:
      - proxy
      - internal_network

  mariadb:
    container_name: mariadb
    image: "mariadb:10.7"
    command: "--transaction-isolation=READ-COMMITTED --binlog-format=ROW --innodb-file-per-table=1 --skip-innodb-read-only-compressed"
    restart: always
    volumes:
      - ./db:/var/lib/mysql
    env_file:
      - ./db.env
    networks:
      - internal_network

  phpmyadmin:
    container_name: phpmyadmin
    image: phpmyadmin/phpmyadmin
    links:
    - mariadb:mysql
    ports:
      - 8001:80
    env_file:
      - ./db.env
    environment:
      PMA_HOST: mariadb
      UPLOAD_LIMIT: 300M
    networks:
      - proxy
      - internal_network

networks:
  internal_network:
    internal: true
  proxy:  
    external: true

Now that I have a sandbox to start running these upgrades, let’s just run everything once through to make sure the app is running “as-is”.

docker-compose up -d

The logs reported a minor upgrade, but other than that, we’re up and running.

Let’s upgrade to the next major version now. To do that, I just increment the number in docker-compose.yml from image: "nextcloud:22" to image: "nextcloud:23" then run:

docker-compose down
docker-compose up --force-recreate --build -d

Then I’ll watch my logs docker logs nextcloud to see when everything is done upgrading. You should see something like

docker logs --tail 1000 -f daecd812fefe464712b9b6717cb6e2a3d842260e0c64c63ec88ea22e2edb9623 

Initializing nextcloud 25.0.3.2 ...
Upgrading nextcloud from 24.0.9.2 ...

… but with the versions you’re currently upgrading between. The upgrade from 22 to 23 just worked.

Be sure to update all the apps to the new version in between each upgrade with php ./occ app:update --all or through the web UI.

It was between 23 and 24 that I needed to upgrade mariadb as well. In this case, I’m now using mariadb:latest. Then attach a shell into that container and run mysql_upgrade --user=root --password=rootpassword

If you catch a snag at any point, your best bet is to attach a shell into the nextcloud container and run php ./occ upgrade. If you are dealing with file permission issues, try attaching as the web server user with docker exec -it -u 33 nextcloud bash, where 33 is the www-data user ID.

ChatGPT – Scaffolding a Nextcloud Plugin

🤯

I’m continually impressed by ChatGPT. This morning I thought it would be really nice to be able to track my health statistics on Nextcloud, my private cloud that I have running just behind me in my closet. What a cool little project to give to ChatGPT and see how quickly we can get something up and running. It’s 8am on a Tuesday morning, I’m back to work on my day job, but I have about an hour to fiddle around with it. Let’s see how quickly ChatGPT can get this started….

Continue reading “ChatGPT – Scaffolding a Nextcloud Plugin”

Vosk on-device Speech-to-text

Since I started using GrapheneOS, a de-Googled Android build, I’ve missed several services you typically get from Apple or Google on a device. One of those core services is speech-to-text, which helps a lot to speed up note taking, writing text messages, etc.

I’ve been using a very crude Vosk keyboard on Android to fill the gap. I’d love to improve upon this project, but for today I’m interested in getting this functionality in GNOME, my desktop of choice on Ubuntu Linux. This is not meant to be a tutorial, but more of a journal entry.

Continue reading “Vosk on-device Speech-to-text”

ChatGPT for Homeschool Planning.

I’ve been absolutely blown away by ChatGPT. I’m still reeling from the implications of its abilities. I’m hoping it only improves from here, though I suspect it will be dumbed down as it starts to impact some high-level professional job positions.

There are quite a few ways that we might use ChatGPT to plan for homeschool. It’s GREAT at generating lists. Now, our primary task is simply to ask good questions.

Continue reading “ChatGPT for Homeschool Planning.”

Lessons Learned, First Year of Homeschool in Review

Ahh Summer! We are currently on our Summer migration North to Maine, from Florida. Now is a good time to revisit the past year in review. Spoiler alert, it was a huge success. In this short article I’ll cover where we started, what worked, and how we adapted through our first year of homeschooling our daughter, Eva.

Continue reading “Lessons Learned, First Year of Homeschool in Review”

One small bet, Learnalot.net

My wife and I started homeschooling our daughters last year. Personally, I’d been pushing for it long before COVID. But since that became such a concern, we decided to give it a shot and see how it goes. One year in, we’ve all fallen in love with it. I can’t imagine that we’ll ever go back to public or private school, for a NUMBER of reasons, most of which I won’t get into in this post.

Goal Number ONE of homeschooling is to help my daughter find a “Love for Learning”. With that in mind, we’ve taken an eclectic approach to homeschool. Initially, we started schooling the way we were all used to: several classes, in well-accepted common areas of study, consecutively spaced throughout the day. But what we’ve found is what I think everyone already knows: that’s a super tedious slog for ANYONE. Why do most kids dislike school? Disinterest in the content, probably in large part. “Eclectic Homeschooling”, as well as “Unschooling”, are good remedies to keep the attention of even the most overstimulated ADHD children.

Continue reading “One small bet, Learnalot.net”