How I Built a Personal Weekly Digest Tool for Newsletter Readers

A walkthrough of the full pipeline: RSS feeds, Python, a simple feeds file, and one interesting failure along the way.

May 23, 2026

I have been building data pipelines for over a decade. Most of them were for clients, internal tools, or research projects. This one started from a completely different place: I was looking for good data science writing to read on Substack and other platforms, and I wanted it all in one place.

Many blogs write about data, but they are scattered across platforms (Substack, WordPress, Ghost, Medium). Keeping up meant checking multiple places, forgetting half of them, and missing posts I would have genuinely enjoyed. I wanted a single weekly reading list, curated by me, delivered to me. So I built one.

This post walks through what I built, how it works, and one failure that taught me something useful about cloud infrastructure and platform policies.

The Idea

I started simple. I wanted a script that reads a list of newsletter URLs I chose, fetches whatever was published in the last seven days, and puts it all in one place. No algorithms deciding what I see, no platform telling me what is popular. Just the writers I chose, their latest work, organized the way I want.

The input is a plain Excel file with three columns: the newsletter name, the URL, and a label I assign to group things by topic. The output is a markdown file I can read in VS Code. One command, once a week.
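The seven-day cutoff can be sketched roughly like this. It is a minimal sketch, not the script's actual code: is_recent is a name assumed here for illustration, and it expects the UTC time tuple that feedparser exposes on an entry as published_parsed.

```python
import calendar
from datetime import datetime, timedelta, timezone


def is_recent(published_parsed, now=None, days=7):
    """Return True if a UTC time tuple falls within the last `days` days."""
    if published_parsed is None:
        # Some feeds omit the publication date; treat those as not recent.
        return False
    now = now or datetime.now(timezone.utc)
    # calendar.timegm interprets the tuple as UTC and returns a Unix timestamp.
    published = datetime.fromtimestamp(
        calendar.timegm(published_parsed), tz=timezone.utc
    )
    return now - timedelta(days=days) <= published <= now
```

Passing now explicitly makes the window easy to test; left as None, it defaults to the current UTC time.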

Building for Multiple Platforms

The first real challenge was RSS. Every platform handles it slightly differently.

Substack feeds live at publication.substack.com/feed. WordPress is usually yoursite.com/feed. Ghost uses yoursite.com/rss. Medium uses medium.com/feed/@username, a different structure than you might expect. Beehiiv generates a unique URL per publication that cannot be derived from the main URL at all.

Rather than asking people to find their own RSS URL, the script tries a series of known paths in order until one returns entries:

RSS_PATHS = [
    "",
    "/feed",
    "/rss",
    "/?feed=rss2",
    "/rss.xml",
    "/atom.xml",
]
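The probing loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual code: discover_feed and its parse parameter are hypothetical names. In practice parse would be feedparser.parse, whose result exposes an entries list.

```python
from typing import Callable, Optional

RSS_PATHS = ["", "/feed", "/rss", "/?feed=rss2", "/rss.xml", "/atom.xml"]


def discover_feed(base_url: str, parse: Callable[[str], object]) -> Optional[str]:
    """Return the first candidate URL whose parsed feed has entries, else None."""
    base = base_url.rstrip("/")
    for path in RSS_PATHS:
        candidate = base + path
        try:
            feed = parse(candidate)
        except Exception:
            # A failing candidate just means: try the next path.
            continue
        if getattr(feed, "entries", None):
            return candidate
    return None


# Assumed usage with feedparser installed:
#   import feedparser
#   discover_feed("https://example.substack.com", feedparser.parse)
```

Injecting the parser keeps the loop testable without network access.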

Beehiiv is the only platform that still requires a manual step. The script detects it and prints a specific message explaining exactly where to find the RSS URL in the page source.

One thing I added after the first few test runs was HTML stripping on the excerpt field. Medium in particular embeds raw HTML in its RSS summary field. Feedparser returns it as-is, which means your digest ends up with <img> tags and <a href> strings in the excerpt if you do not strip them. A quick regex pass plus html.unescape() handles it cleanly:

import html
import re

def strip_html(text):
    text = re.sub(r"<[^>]+>", "", text)   # drop HTML tags
    text = html.unescape(text)            # decode entities such as &amp;
    text = re.sub(r"\s+", " ", text)      # collapse whitespace runs
    return text.strip()

Parallel Fetching

The first version fetched feeds sequentially, one at a time. At ten newsletters that was fine. At fifty it would have taken several minutes. The fix was straightforward. Python’s concurrent.futures.ThreadPoolExecutor lets you fetch multiple feeds simultaneously. With max_workers=10 the script fetches up to ten feeds at once. At a hundred newsletters the fetch step now takes roughly the same time as ten did before.
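A minimal sketch of the parallel fetch, assuming a per-feed function is passed in. Both fetch_all and fetch_one are names invented for this example; only ThreadPoolExecutor and max_workers=10 come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch_one, max_workers=10):
    """Run fetch_one(url) for every URL, up to max_workers at a time.

    Returns a dict mapping each URL to its result, or to the exception
    it raised, so one broken feed does not abort the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Threads suit this job because each fetch spends most of its time waiting on the network rather than using the CPU.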

The Feeds File

The input is intentionally simple: an Excel, CSV, or plain text file with your newsletter list. Three columns: name, url, label. The script detects the file type automatically. No config needed beyond pointing it at your file. Labels become the section headers in your digest, so you can group things however makes sense for you: by topic, by reading priority, by platform, anything.
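File-type detection can be as simple as dispatching on the extension. The sketch below is an assumption-heavy illustration, not the script's real loader: load_feeds is a hypothetical name, the CSV is assumed to have a name,url,label header row, and the plain text layout (one comma-separated entry per line) is a guess. Excel support needs a third-party reader and is left out here.

```python
import csv
from pathlib import Path

FIELDS = ("name", "url", "label")


def load_feeds(path):
    """Load feed rows from a .csv or .txt file, dispatching on the extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # Assumes a header row with name, url, label columns.
        with path.open(newline="", encoding="utf-8") as handle:
            rows = [
                {field: (row.get(field) or "").strip() for field in FIELDS}
                for row in csv.DictReader(handle)
            ]
    elif suffix == ".txt":
        # Assumed layout: one "name, url, label" entry per line, no header.
        rows = []
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            parts = [part.strip() for part in line.split(",")]
            parts += [""] * (len(FIELDS) - len(parts))  # pad missing columns
            rows.append(dict(zip(FIELDS, parts)))
    else:
        raise ValueError(f"Unsupported feeds file type: {suffix}")
    # Skip rows without a URL; there is nothing to fetch for them.
    return [row for row in rows if row["url"]]
```

Newsletter names containing commas would break the plain text branch; the CSV branch handles them through normal CSV quoting.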

The GitHub Actions Experiment

Once the script was working locally I wanted to automate it fully. GitHub Actions seemed like the obvious choice: free, version controlled, runs on a schedule. I set it up to run every Sunday morning, stored nothing sensitive, and triggered a test run. It failed immediately. Every RSS fetch returned empty. No entries, no errors, just silence. Adding debug output revealed the issue:

HTTP 403 for https://example.substack.com/feed

Substack was blocking every request. Adding browser-like User-Agent strings and Accept headers made no difference. The problem was the IP range. GitHub Actions runners use well-known cloud IP ranges that Substack’s CDN recognizes and blocks. Residential IPs go through fine. Cloud provider IPs do not.
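Debug output like the 403 line above can be produced with a small status check. This is a sketch using only the standard library, not the script's actual debugging code: feed_status, report, and the User-Agent string are all hypothetical.

```python
import urllib.error
import urllib.request


def feed_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for a feed URL, including error codes."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "weekly-digest-debug/0.1"},  # hypothetical UA
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is on the exception.
        return err.code


def report(url, **kwargs):
    """Print a line for any non-200 response and return the status code."""
    status = feed_status(url, **kwargs)
    if status != 200:
        print(f"HTTP {status} for {url}")
    return status
```

Connection-level failures such as DNS errors or timeouts are not caught here and will propagate as exceptions, which is usually what you want while debugging.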

The workarounds I found, residential proxies and third-party RSS caching services, all introduced dependencies I did not want. Some would work today and break in six months. Others added ongoing costs. So I made a pragmatic decision. The script runs locally on Sunday mornings. Total runtime is under a minute. The automation saves hours of manual work even without the cron job.

The Output

The script generates a markdown file named by ISO week number. Each week gets its own file, nothing is ever overwritten, and they accumulate as a personal archive. Posts are grouped by whatever labels you assigned in your feeds file. Each entry shows the post title as a clickable link, a short excerpt, the newsletter name, and the publication date.
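A rough sketch of the naming and grouping logic, under assumptions: the digest-YYYY-Www.md filename pattern, the function names, and the post dictionary keys are invented for this example and may differ from the real script.

```python
from collections import defaultdict
from datetime import date


def digest_filename(day):
    """Build a filename from the ISO year and week of the given date."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"digest-{iso_year}-W{iso_week:02d}.md"


def render_digest(posts):
    """Group posts by label and render them as a markdown document."""
    by_label = defaultdict(list)
    for post in posts:
        by_label[post["label"]].append(post)

    lines = []
    for label in sorted(by_label):
        lines.append(f"## {label}")
        lines.append("")
        for post in by_label[label]:
            lines.append(
                f"- [{post['title']}]({post['link']}) "
                f"({post['source']}, {post['published']})"
            )
            lines.append(f"  {post['excerpt']}")
            lines.append("")
    return "\n".join(lines)


print(digest_filename(date(2026, 5, 23)))  # digest-2026-W21.md
```

Using the ISO year from isocalendar() rather than the calendar year matters around New Year, when the first days of January can belong to the last ISO week of the previous year.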

Opening the file in VS Code and pressing Ctrl+Shift+V renders the preview with clickable links. Copy from the preview panel and paste directly into Substack, Notion, or anywhere else you want to read it.

What I Would Do Differently

The Beehiiv situation remains unsolved. Their unique RSS URLs cannot be derived from the publication URL, so the script cannot auto-detect them. The manual fallback works, but it puts a step on the user that every other platform avoids.

For larger feeds lists, parallel fetching handles the speed problem well. But Google Sheets as a backend for a multi-user version would hit limits. A proper database becomes necessary at scale.

Where This Is Going

The personal digest tool works well enough that I packaged it for others to use. If you want your own weekly reading list across any combination of Substack, WordPress, Ghost, Medium, and more, the tool is available with a full setup guide.

But the more interesting direction is community. A niche hobby community I am part of wanted a way to surface new writing from its members across different platforms every week. So once the personal version was working, I built a second, community version for exactly that. It has been running for a month now, and so far it works as intended.

Two tools, one pipeline, built entirely on free tiers. Sometimes the best projects are the ones you build for yourself first.
