Day Two: More Text Wrangling Sun, 21 Apr 2024 14:42:37 +0000

Goal today: Separate the text doc from yesterday into sections.

Running into issues with it though. I was hoping I could depend on ALL CAPS and other identifiers to separate sections out. In a few cases, it’s difficult to distinguish section titles from body text. In the PDF, I might be able to use italics and other variations in the font to figure out what’s what. This increases the complexity of sorting out the different sections of the document into something with a usable structure.

Day One: Text Wrangling Tue, 26 Mar 2024 10:58:40 +0000

We are starting with a wholly unusable PDF file. First we need to parse it into usable data.

Goal 1: Read and parse the document

Goal 2: Extract sections.

Read and parse the document

What are some readily available open source projects that I can use to parse PDFs into text?

  1. Poppler Utils: Poppler is a PDF rendering library that includes a variety of utilities for extracting text, images, and other content from PDFs. The pdftotext command is part of this suite and is widely used for converting PDF documents into plain text.
  2. Xpdf: Similar to Poppler, Xpdf is a toolset that includes the pdftotext command. It’s an older project but still effective for extracting text from PDF files.
  3. PDFMiner: PDFMiner is a tool written in Python specifically for extracting text, images, and metadata from PDF files. It’s more flexible than pdftotext and allows for more detailed analysis of the PDF structure, making it suitable for more complex extraction tasks.
  4. mutool: Part of the MuPDF suite, mutool can extract text and images from PDF files. MuPDF is known for its speed and the quality of its rendering.
  5. Apache PDFBox: Although primarily a Java library for working with PDF documents, PDFBox comes with a command-line utility that can be used to extract text from PDFs. It’s useful for those who prefer a Java-based solution.
  6. Tesseract OCR: For PDFs that contain mostly images of text (like scanned documents), Tesseract OCR can be a powerful tool. It’s an optical character recognition (OCR) engine that can convert images into text, and with the right preprocessing, it can be used to extract text from image-based PDFs.

Let’s try some of these out and see how the results vary. I’m most interested in Poppler and PDFMiner.

Using the poppler option I found 👇 provides a good starting point for text cleanup.

pdftotext -layout -enc UTF-8 WDI39597.pdf poppler.txt

PDFMiner has more options in terms of formats (text | XML | HTML). The first thing I noticed, though, is that it’s significantly slower to execute. Annnd, the output is far less usable. I was hopeful for the HTML or XML output. The most ridiculous output was XML. There were literally tags around every letter.

-o pdfminer.txt -t text -A WDI39597.pdf
-o pdfminer.html -t html -A WDI39597.pdf
-o pdfminer.xml -t xml -A WDI39597.pdf

POPPLER WINS! It creates a usable output and is WAY faster in terms of execution, not that that is a huge factor.

Now we have something that looks like this:

Now, let’s strip out some garbage and format this a bit more.

I’m using a Python script to do this part.
First it detects page number and formats that appropriately.
Then, it gets rid of leading whitespace.
Then, ditch the date, and the line under that, which has some crazy special characters….
Then, ditch any lines that end in ‘SEN. APPRO’
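
For the curious, here is a minimal sketch of what a cleanup pass like the one described above could look like. It is not the actual script. It assumes pages are separated by form feed characters (which is how pdftotext marks page breaks), and the date pattern, the clean_text name, and the "Page N" marker format are all placeholders for illustration.

```python
import re

# Hypothetical pattern: the real date line in the bill may look different.
DATE_LINE = re.compile(
    r"^(January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}"
)
FOOTER_SUFFIX = "SEN. APPRO"


def clean_text(raw: str) -> list[str]:
    """Return cleaned lines, with a 'Page N' marker at the start of each page."""
    cleaned = []
    # pdftotext separates pages with a form feed character.
    for page_number, page in enumerate(raw.split("\f"), start=1):
        lines = page.splitlines()
        if not any(line.strip() for line in lines):
            continue  # skip the empty chunk after the last form feed
        cleaned.append(f"Page {page_number}")
        skip_next = False
        for line in lines:
            stripped = line.strip()  # drops the leading whitespace
            if skip_next:
                skip_next = False  # this is the junk line under the date
                continue
            if DATE_LINE.match(stripped):
                skip_next = True
                continue
            if stripped.endswith(FOOTER_SUFFIX):
                continue
            if stripped:
                cleaned.append(stripped)
    return cleaned
```

Because the bill's own line numbers stay at the front of each kept line, a page marker plus a line number is enough to build a citation later.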

Now we have something that looks like this…

I’ve preserved the page numbers and line numbers for citation purposes. So, if I want to recall where appropriations were made in the bill, I can cite “Page 36 Line 22” for example.

I’ll have to get to extracting the sections tomorrow…

US Spending Visualizations Mon, 25 Mar 2024 11:58:46 +0000


This week another Uniparty Omnibus spending bill was passed without much of a fuss. I was thinking Speaker Johnson was going to be a force to stand up to the machine and reduce spending. I thought he was going to change things. I may have been mistaken. 😞 We need to get inflation under control; it’s like a brush fire that could consume the country. Meanwhile the money printing machine is in overdrive. Instead of whining about it on X, why not do something that’ll bring some visibility and comprehensibility to these massive bills?

Many years back, I registered a domain that I had grandiose plans for. Naturally, I’ve done nothing with it. It’s time to change that too.

If you haven’t seen one before, these bills are published in the most unusable format possible. A super lengthy document, that no one can easily read and/or understand. Example 👇🏻

No way to compare to previous years, no way to visualize using common graph paradigms. Hopefully, this project will fix that.

How does a project like this make money? I have no f’ing clue, but I’m tired of doing nothing and watching the shit show carry on uninterrupted.

The first step is a POC. Can I parse this bill text into usable data with readily available open source scripts, programs, etc?

Automated Workflow:

  1. Read and parse the document, extracting sections.
  2. For each section, extract relevant details.
  3. Format those details into a JSON object.
  4. Insert the JSON object into Database.
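
The four steps above could be sketched roughly like this. Everything specific here is an assumption for the POC: the "SEC. 101." heading pattern, pulling dollar amounts as the "relevant details", SQLite as the database, and all of the function and table names.

```python
import json
import re
import sqlite3

SECTION_HEADING = re.compile(r"^SEC\. (\d+)\.")  # assumed heading format
DOLLAR_AMOUNT = re.compile(r"\$(\d[\d,]*)")


def extract_sections(lines):
    """Step 1: group lines under the most recent section heading."""
    sections, current = [], None
    for line in lines:
        match = SECTION_HEADING.match(line)
        if match:
            current = {"section": match.group(1), "lines": []}
            sections.append(current)
        if current is not None:
            current["lines"].append(line)
    return sections


def extract_details(section):
    """Steps 2 and 3: pull dollar amounts and build a JSON-ready dict."""
    text = " ".join(section["lines"])
    amounts = [int(a.replace(",", "")) for a in DOLLAR_AMOUNT.findall(text)]
    return {"section": section["section"], "amounts": amounts, "text": text}


def store(details, connection):
    """Step 4: insert each record as a JSON string."""
    connection.execute("CREATE TABLE IF NOT EXISTS sections (id TEXT, body TEXT)")
    connection.executemany(
        "INSERT INTO sections VALUES (?, ?)",
        [(d["section"], json.dumps(d)) for d in details],
    )
    connection.commit()
```

Lines that appear before the first heading are dropped in this sketch, which may or may not be what the real parser should do.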

Resisting The Machine Sat, 23 Dec 2023 01:41:02 +0000

Thoughts from: I Can’t Overstate How Dire This Is | Bret Weinstein

I recently watched “Leave the World Behind”. It’s a message, a clear and terrifying message from our adversary. It’s a message about what happens when we resist they/them. When I refer to “they/them”, I’m not referring to the confused millennial non-binary they/them sorts. I’m talking about “The Machine”…. you know, the one that “Rage Against the Machine” raged about before the band by that name was corrupted, consumed, and assimilated into the very machine they raged against. I’m talking about The Machine that has largely had a monopoly on influence and power for the last century or so. I’m supposing the 1913 creation of the Federal Reserve is a good marker for that level of influence and power, and the global elite class that wields it.

I internalized the message they wanted to deliver in the movie. They want people to duck and cover, to hide in the basement, with a cache of food, and a box set of “Friends” DVDs to keep our little minds occupied while the world tears itself apart. They want us out of the way, while their carefully choreographed chaos unravels the fabric of society.

After watching the video below, I’m thinking THAT would be taking their bait. THAT is playing into their hands. THAT is exactly what they want. Instead, in the video below, Bret Weinstein, an extremely brilliant man/scientist/educator, advocates for forming coalitions. Getting people together to share ideas and combining the power of the multitudes who stand against The Machine. He suggests that Goliath (as Weinstein refers to The Machine) has lost the first onslaught in their war for power. He speculates that the heroes who have emerged through the first wave largely fit the description of lone wolves. The Machine is learning and leveling up. So we, “The Resistance”, need to learn and level up as well. Those lone wolves need a pack.

With the first wave behind us, the confusion persists. Goliath is looking for a rematch. Some of they/them have been exposed. Presidents of the last decades are all on the list, mix in a bit of Jeffrey Epstein and Hunter Biden, and you have a hot bowl of corruption soup. It seems their exposure only raises more questions than answers. These revelations now float on the surface, but this bowl runs deep.

I listened to Weinstein’s Dark Horse Podcast throughout the COVID crisis. He and Dr Robert Malone were a bastion of common sense and inquisitive curiosity about the confusion that didn’t fit then. His insight earned my respect and trust. During the below interview, I can see he has a sort of existential alarm about him. He has left the rage behind, and gone to “war with the machine”. I will pray for him, as I will pray for us too.

Homeschool web app Work in progress. Mon, 27 Nov 2023 14:33:15 +0000

Lately I’m building a topics hierarchy. It could otherwise be called categories, or taxonomy, or whatever else, but for some reason, “topics” seems to fit the bill.

This is a first run at the UI. Basically, I need it to add, edit, and remove a nested hierarchy of topics. Only admin users will see this, so it doesn’t have to be pretty, just functional.

Each topic has a color, which I suppose should trickle down to its child topics. This way there will be some visual separation among the different lessons.
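
One way the color "trickle down" could work is for a topic without its own color to walk up to the nearest ancestor that has one. This is only a sketch of the idea, not the app's actual model: the Topic class, its field names, and the fallback gray are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Topic:
    name: str
    color: Optional[str] = None  # None means "inherit from parent"
    # Keep the parent link out of repr/eq so nested topics compare cheaply.
    parent: Optional["Topic"] = field(default=None, repr=False, compare=False)
    children: list["Topic"] = field(default_factory=list)

    def add_child(self, name, color=None):
        """Create a child topic, attach it, and return it."""
        child = Topic(name, color, parent=self)
        self.children.append(child)
        return child

    def effective_color(self, default="#888888"):
        """Walk up the tree until some ancestor defines a color."""
        node = self
        while node is not None:
            if node.color is not None:
                return node.color
            node = node.parent
        return default
```

With this shape, changing the color on a top-level topic changes every descendant that has not been given a color of its own.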

Here I have some initial topics, generated by ChatGPT; these will certainly change. I may use Khan Academy as a template. They have certainly put careful consideration into their taxonomy.

Also, pay no attention to the branding. I’m still undecided on the name, with “Homeschool Link” as one of the candidates.

I would love your opinion.

Upgrading Nextcloud Tue, 31 Jan 2023 15:50:03 +0000

I use Nextcloud, running on a little box in my closet, as an alternative to iCloud or Google Cloud. It’s amazing, really. I’m very grateful that this open source software is available to people who have the will and wherewithal to buck the big personal data miner mafia corps. When it became obvious what these worms intend to do with our data, I started looking for a way to keep my personal data personal.

I’ve been running Nextcloud version 22 for the last couple of years. There have been many updates and upgrades since my original installation, and I’ve been quite negligent with my sys admin duties. Today, I’m trying to remedy that situation.

I have Nextcloud running in docker. I use docker-compose to set up the environment, so I need to also upgrade through each version of Nextcloud using docker-compose, one major version at a time.

I use this app DAILY so, I don’t want any surprises, which often happen during upgrades. So first, I’ll replicate all the data from my server to my PC. This way I have a sandbox where I can make all my changes while my production environment remains untouched. If something goes wrong.. No problemo.

The PC is a Windows machine, so I’ll spin up an Ubuntu image to do all the transfers.

docker run -it -v "$(pwd):/volume" ubuntu /bin/bash

Next I’ll get the image equipped with the tools that I need to rsync my way to a mirrored environment.

cd /volume && \
apt update  && \
apt install ssh rsync && \
rsync -rav --stats --progress admin@sourceIP:/path/to/nextcloud /volume -e "ssh -o StrictHostKeyChecking=no"

Good, the transfer is ~80 GB in my case, so that took a min.

This is my existing docker-compose.yml. You’ll also notice that I’ve specified mariadb:10.7, as that is what is currently running in the production env. I’ll upgrade that as needed.

version: '3'
services:
  nextcloud:
    container_name: nextcloud
    image: "nextcloud:22"
    ports:
      - 8000:80
    restart: always
    volumes:
      - ./html:/var/www/html
      - ./logs:/var/log/apache2
    env_file:
      - ./db.env
    networks:
      - proxy
      - internal_network

  mariadb:
    container_name: mariadb
    image: "mariadb:10.7"
    command: "--transaction-isolation=READ-COMMITTED --binlog-format=ROW --innodb-file-per-table=1 --skip-innodb-read-only-compressed"
    restart: always
    volumes:
      - ./db:/var/lib/mysql
    env_file:
      - ./db.env
    networks:
      - internal_network

  phpmyadmin:
    container_name: phpmyadmin
    image: phpmyadmin/phpmyadmin
    links:
      - mariadb:mysql
    ports:
      - 8001:80
    env_file:
      - ./db.env
    environment:
      PMA_HOST: mariadb
      UPLOAD_LIMIT: 300M
    networks:
      - proxy
      - internal_network

networks:
  internal_network:
    internal: true
  proxy:
    external: true

Now that I have a sandbox to start running these upgrades, let’s just run everything once through to make sure the app is running “as-is”.

docker-compose up -d

The logs reported a minor upgrade, but other than that, we’re up and running.

Let’s upgrade to the next major version now. To do that, I just increment the number in docker-compose.yml from image: "nextcloud:22" to image: "nextcloud:23" then run:

docker-compose down
docker-compose up --force-recreate --build -d

Then I’ll watch my logs with docker logs nextcloud to see when everything is done upgrading. You should see something like:

docker logs --tail 1000 -f daecd812fefe464712b9b6717cb6e2a3d842260e0c64c63ec88ea22e2edb9623 

Initializing nextcloud ...
Upgrading nextcloud from ...

… but with the versions you’re currently updating. The update between 22 and 23 just worked.

Be sure to update all the apps to the new version in between each upgrade with php ./occ app:update --all or through the web UI.

It was between 23 and 24 where I needed to upgrade mariadb as well. In this case, I’m now using mariadb:latest. After switching the image, I attached a shell to that container and ran mysql_upgrade --user=root --password=rootpassword

If you catch a snag at any point, your best bet is to attach a shell into the nextcloud container and run php ./occ upgrade. If you are dealing with file permission issues, try attaching to the shell as the owner with: docker exec -it -u 33 nextcloud bash where 33 is the user #.
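
Since the upgrade has to go one major version at a time, a tiny helper can list the image tags to step through and bump the tag in the compose file text. This is just a sketch: upgrade_path and bump_image are hypothetical names, and the regex assumes the image line is written exactly as in my docker-compose.yml above.

```python
import re


def upgrade_path(current: int, target: int) -> list[str]:
    """Image tags to step through, one major version at a time."""
    if target < current:
        raise ValueError("downgrades are not supported")
    return [f"nextcloud:{major}" for major in range(current + 1, target + 1)]


def bump_image(compose_text: str, new_tag: str) -> str:
    """Swap the nextcloud image tag in docker-compose.yml text."""
    return re.sub(r'image: "nextcloud:\d+"', f'image: "{new_tag}"', compose_text)
```

For each tag in the list, you would write the bumped file, run the down / up commands from earlier, wait for the logs to settle, and update the apps before moving on to the next one.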

ChatGPT – Scaffolding a Nextcloud Plugin Tue, 03 Jan 2023 13:00:00 +0000


I’m continually impressed by ChatGPT. This morning I thought it would be really nice to be able to track my health statistics on Nextcloud, my private cloud that I have running just behind me in my closet. What a cool little project to give to ChatGPT and see how quickly we can get something up and running. It’s 8am on a Tuesday morning, I’m back to work on my day job, but I have about an hour to fiddle around with it. Let’s see how quickly ChatGPT can get this started….

A little background, I’ve been tracking some health parameters for a while with iHealth, mainly because they’ve made it easy to do so. I have a bluetooth blood pressure cuff, and every time I take my BP, it’s logged to the cloud. It has a nice UI. But, I’m not very happy with giving my health information away anymore. So I’ve been looking for a new home for my health data. Lately, I’ve been using “Waistline”, an open source app found on F-Droid. It works, but not nearly as nicely as iHealth. The data is siloed, and I’m not really sure how to get it out of the app. So, passively I’m still looking. That’s where we pick up the story for this idea.

Here is the chat in its entirety. I basically walk the bot through the process of coding the entire plugin for me.

At this point, I have yet to test it out, but as you can see, it’s an amazing start. I’ve got a plugin templated out, an API, and directions to get the frontend started as well. It’s 9am now, so I need to get to my day job. But, wow. Just WOW.

More to come as time permits.

Vosk on-device Speech-to-text Wed, 28 Dec 2022 13:37:55 +0000

Since I started using GrapheneOS, a deGoogled Android build, I’ve missed several services you typically get from Apple or Google on your device. One of those core services is Speech-to-Text. It helps a lot to speed up note taking, writing text messages, etc.

I’ve been using a very crude Vosk keyboard on Android to fill the gap. I’d love to try to improve upon this project, but for today I’m interested in getting this functionality in Gnome, my Desktop of choice on Ubuntu Linux. This is not meant to be a tutorial, but more of a journal entry.

Documentation for Gnome extensions is scant. Here is what I could find:

Here is a great playlist on YouTube to get more familiar with creating gnome extensions.

Damn, don’t you hate when you don’t save your work? I just lost a bunch of work. DOH!

Let’s see, I was astonished to see that, in general, the Gnome extensions area is not super active.

Development is a little rough. I have to switch from Wayland to X11, which makes reloading extensions a little easier. In Wayland, you have to log out and back in for extensions to refresh. Yikes.

Here’s a directory of existing extensions:

I like to learn from other code. So I installed this extension, which allows you to manage your system clipboard:

I haven’t found anything preinstalled to manage extensions. Seems like something that would be readily available in “Settings”. 😮‍💨

Anyways, I started this at 9am, I hope to have something working by noon, but time is dwindling. I just spent some time on creating an icon in figma. No matter what I do, it’s still hard to see the “TXT” in the icon. I may just ditch it and just use the mic, but I’ll leave it in for now. Anyway, I hope Gnome supports SVG, which might render a little nicer. Let’s move on. We have some functionality to create.

Golly, documentation is THIN for gnome extensions.

I’m simply trying to get a button in the tray, when clicked it will change color. Also, reloading extensions is still a CHORE. I have to log out, then log back into gnome each time. Tedious.

I found a solution to that here:

I’m using a script to load up another session of gnome, which naturally reloads all the extensions.

dbus-run-session -- gnome-shell --nested --wayland

My SVG isn’t looking great in there though. I may have to use a ready-made system icon.

As you can see, the icon is squished, and also doesn’t change color when clicked.

I’ve got the icon working now, but there is still a styling issue, where the icon seems a little small.

I’ve messed with getting Vosk working appropriately. I’ve tried a few of the suggested methods, but I’m having a lot of issues making my microphone accessible in Node.js with the ‘mic’ library.

I’m currently leaning towards running vosk as a docker service with the following docker-compose.yml:

version: '3'

services:
  vosk:  # service name assumed
    image: alphacep/kaldi-en
    ports:
      - "2700:2700"

So far, only one test script that I’ve tried actually worked.

#!/usr/bin/env python3

import asyncio
import websockets
import sys
import wave

async def run_test(uri):
    async with websockets.connect(uri) as websocket:

        # The .wav file to transcribe is passed as the first argument.
        wf = wave.open(sys.argv[1], "rb")
        await websocket.send('{ "config" : { "sample_rate" : %d } }' % (wf.getframerate()))
        buffer_size = int(wf.getframerate() * 0.2) # 0.2 seconds of audio
        while True:
            data = wf.readframes(buffer_size)

            # Stop once the file has been fully read.
            if len(data) == 0:
                break

            await websocket.send(data)
            print(await websocket.recv())

        # Tell the server the stream is over and print the final result.
        await websocket.send('{"eof" : 1}')
        print(await websocket.recv())

asyncio.run(run_test('ws://localhost:2700'))

The problem here is that it’s sending a .wav file, not opening the microphone and transcribing the output.
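
One possible way to get from the .wav test to a live microphone later is to separate the message sequence from where the audio comes from. The sketch below only mirrors what the test script above sends (a config message, raw audio chunks, then an eof message). vosk_messages is a hypothetical helper name, and the microphone capture itself is not shown; it would need a separate audio library feeding the chunks.

```python
import json


def vosk_messages(sample_rate, chunks):
    """Yield the messages the test script sends, for any source of audio chunks."""
    yield json.dumps({"config": {"sample_rate": sample_rate}})
    for chunk in chunks:
        if not chunk:
            break  # an empty chunk marks the end of the audio
        yield chunk
    yield json.dumps({"eof": 1})
```

The sending loop would then just iterate over this generator and pass each message to websocket.send, whether the chunks come from a file or from a mic.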

That’s enough for today. I’ll pick this project back up at some point.

ChatGPT for Homeschool Planning. Wed, 21 Dec 2022 21:10:36 +0000

I’ve been absolutely blown away by ChatGPT. I’m still reeling from the implications of its abilities. I’m hoping it only improves from here, though I suspect it will be dumbed down as it starts to impact some high-level professional job positions.

There are quite a few ways that we might use ChatGPT to plan for homeschool. It’s GREAT at generating lists. Now, our primary task is simply to ask good questions.

For instance:

I feel like some of the suggestions here are great, but they definitely need expanding on. So, let’s get an idea of what that looks like. Let’s drill down on combining reading and vocabulary.

Well, that’s plenty to choose from. I love the “Chronicles of Narnia”, it’s Christian based, great story, a good selection. Also, I think I could sell that to Eva. Let’s keep going. How about a list of vocab words from the first chapter of the first book?

Yes, there was an error in the process. I wasn’t completely happy with all of the words in the first go-round. So, I asked for more, and ChatGPT picked up from where it left off.

I think I can select 10 good vocab words from this list of 19 words.

What’s more important is that I could basically do this same sequence with all of the books that my daughter selects to read. She could literally select any book and we could turn that reading into a decent reading and vocab list.

Lessons Learned, First Year of Homeschool in Review Mon, 04 Jul 2022 14:21:43 +0000

Ahh Summer! We are currently on our Summer migration North to Maine, from Florida. Now is a good time to revisit the past year in review. Spoiler alert, it was a huge success. In this short article I’ll cover where we started, what worked, and how we adapted through our first year of homeschooling our daughter, Eva.

I started this article by answering a pretty simple question on Twitter…

In the beginning we needed a place to start. We had heard about Abeka as a great Christian based curriculum. It’s a bit pricey, but I figured at the beginning, I’d spare no expense to get this right.

I remember when the Abeka books came in the mail, it was like Christmas for Eva. She was SO EXCITED to bust open the boxes, open all of the brand new books and get an idea of what we would learn this year. We flipped through pages for a couple hours. I got my bearings. It was a great start and it gave us a guiding light when we had nothing else to go on.

Homeschool Styles: Classical, Eclectic, and Unschool

Technically, there are a bunch more approaches to Homeschool, but the above three make the most sense to me and are sufficient enough categories to contain the minute differences of all the rest. Abeka IS a great, classical curriculum. That is, it runs much like public school, but at home. For those, like us, who are transitioning from public or private school, Classical probably feels the most familiar, and is easiest to understand at first for us parents and the student.

While the classical approach was a great place to start, we knew we didn’t want to be quite so regimented. Once we got started, it was easy to identify what was working and what wasn’t. Eva naturally gravitated towards certain subjects: Language, Science, History, and Bible Study.

The beautiful thing about Homeschooling, it’s a conversation and collaboration with your child. Together, you get to decide what works and where there is resistance. We have a credo and a goal, “…to foster a love of learning”. That’s it. Learning doesn’t have to be work, as in homeWORK or schoolWORK. Learning CAN be met with enthusiasm, interest, and love. That’s our job as teaching parents, to help find what interests our children, to find their spark. Once we find a spark, help feed that spark and fan it into a flame. Then into a fire. The next thing you notice, the “AHA!” moment, is when you catch your kid seeking out new information on their own and self-educating.

Spark to Flame, Flame to Fire

Eva’s spark, right now, is animals. She wants to be a Veterinarian. This actually led us to getting a dog. The dog, all by itself, is a treasure trove of educational activities. It’s Eva’s job to train the dog, where she learns about behavior and reinforcement training. FYI, this works for dogs and humans alike. She takes him out to pee, cleans up his messes, etc. It’s a responsibility, and life has lots of those. Furthermore, we help to feed that spark with niche classes from Outschool. The internet can bring you together with educators that interest your child. We found Ms Marcy, who Eva just loves, an author of children’s books whose husband is a Vet. So much of her writing is related to the one subject Eva is bananas about. We do a couple classes with Ms Marcy each week, Creative Writing and a class where they look at animal X-rays and diagnose injuries or sickness, and much more. It’s great!

Unlearning Failure

There is one subject that Eva has a lot of resistance to, Math. It’s also a subject that I would be remiss to substitute away. So, Math is the subject that I have to work the hardest on to make palatable for Eva. Something that Eva has taken away from traditional school, unfortunately, is a deep-seated fear of failure. It makes her not want to try, especially in Math. She’s afraid of it. She cries if/when she gets a bad score on a test or if she’s struggling with learning a concept. It’s taken some time for her to unlearn this. I regularly have to coach her through lessons, reminding her that it’s OK that this is difficult for you to understand. It is supposed to be difficult, life is this way quite often. The important part is to persist, to exert effort to stay curious and focused on the problem, and most importantly, to reframe “failure” as “learning”. Getting problems wrong is just a sign that you need to keep trying at it to master that concept. We expect mastery before moving on. Even if Eva has taken a test and done poorly, we’ll go back and figure out where the confusion exists. Usually, she’s 95% of the way there. However long that takes doesn’t matter, but don’t fret, relax, and try to make it fun. If you can make it fun, that removes a TON of the apprehension.

For Math we use Beast Academy, it helps a lot to make Math fun. We started with Abeka, but found that Beast Academy has put a lot more effort into providing supporting content. They have a comic book, which introduces new concepts in story form. There’s a workbook to practice a bit, and an online app with videos and further practice to master the concept before an online unit test. Reporting is great! There’s an online dashboard and regular emails as your child progresses through the lessons, which keeps you informed of progress and flags any trouble areas.


Lately, we’ve also started replacing Abeka’s History lessons. Eva mentioned that she’s not super hot on reading about old wars. While these wars are important to understanding where we come from, and the unique privilege of being an American, I figured out another angle to deliver similar content and a similar understanding of our past using Law instead.

So, on Tuesdays and Thursdays at 5pm, we do a college level course on the Constitution. Again, this is another high effort subject, from my perspective, but I love it because I’m learning too and it’s QUALITY time with my daughter. We set the environment, close the shades in the room, fire up our big touch screen tv, grab a snack, and watch one or two of these excellently produced videos from Hillsdale College. This has provided TONS of opportunity, I’ve helped expose Eva to note taking during lectures. If something isn’t connecting, either she or I will pause the video to discuss and dive deeper into certain words, subject matter, or to help build context to a story. And, she gets it! It creates conversation and discussion. She asks provocative questions. She’ll stump me! She’s a super sharp kid, I’m wicked proud of her. And, I’m honored to be a part of the process of building her up.

It’s an Honor

The final big takeaway of the last year of homeschool is that it’s such an honor to have the opportunity to impact our children’s lives in such a profound way. We get to shatter the status quo, “one-size-fits-all” approach to education. We get to help our children find a love of learning. As parent teachers, we get to be closer to them. We get a front row seat as they grow up. We get to give them the tools to use in life. And, we have a hand in their development into intelligent, wonderful, and beautiful people.
