Building an RSS Bot - Practical Lessons
Back in 2021 I spent some time thinking about how news about target shooting could be distributed better. As a sport we’re not always good at distributing news… or communication… or any of that stuff.
I decided that the first thing might be to try and build a bot for reddit, which would consume the RSS feeds from various sources and republish stories. This was a learning project on consuming and posting to APIs, wrangling RSS/XML and generally getting data from A to B, which could then be redeployed elsewhere if I could see a use.
I strung together some Python using Requests and the feedparser library to consume the RSS feed, which is just XML. Strictly speaking you don’t need a dedicated library - the XML library in the stdlib would serve. But feedparser does some nice things like handling certain malformed elements, parsing the timestamp from a string into a proper date object, and generally making it a bit easier to walk the tree. I used the praw library to post to reddit, which is somewhat overkill (since I wasn’t trying to monitor streams for posts or keywords, just make occasional link submissions). But it seemed easier to just go with the library to start with.
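To give a flavour of that, here’s roughly what the fetch-and-parse step looks like - a minimal sketch with a placeholder feed URL, not the bot’s actual code:

```python
import requests
import feedparser

FEED_URL = "https://example.org/news/feed/"  # placeholder, not a real feed

# Fetch the raw XML with Requests, then hand it to feedparser to walk
response = requests.get(FEED_URL, timeout=30)
feed = feedparser.parse(response.content)

for entry in feed.entries:
    # feedparser has already normalised the published date into a parsed
    # structure (entry.published_parsed), so there's no string-wrangling here
    print(entry.title, entry.link, entry.published_parsed)
```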
On the first run, the script fetches the RSS feed and looks for a text file related to that feed. If it doesn’t find one, it writes all the URLs in the feed into a new text file. On subsequent runs it compares that text file to the fresh feed and posts any new URLs to the target subreddit, before overwriting the file with the contents of the fresh feed. It also submits each new URL to the Internet Archive’s Wayback Machine. Too many times I’ve been trying to write something up for Wikipedia and the news from the NSRA or NRA site wasn’t archived. Occasionally the news index page is and I can see the article I need - but the article itself never got snagged. No more!
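In sketch form, the core loop is something like the following. The feed URL, file name and praw.ini section name are placeholders, and the Wayback submission just hits the standard https://web.archive.org/save/ endpoint:

```python
import pathlib

import requests
import feedparser
import praw

FEED_URL = "https://example.org/news/feed/"        # placeholder feed
SEEN_FILE = pathlib.Path("example_feed_seen.txt")  # one "seen URLs" file per feed
SUBREDDIT = "ukguns"

reddit = praw.Reddit("newsbot")  # credentials come from a praw.ini section
feed = feedparser.parse(requests.get(FEED_URL, timeout=30).content)
current_urls = [entry.link for entry in feed.entries]

if SEEN_FILE.exists():
    seen = set(SEEN_FILE.read_text().splitlines())
    for entry in feed.entries:
        if entry.link not in seen:
            # Post the new article as a link submission
            reddit.subreddit(SUBREDDIT).submit(title=entry.title, url=entry.link)
            # Ask the Wayback Machine to snapshot it too
            requests.get("https://web.archive.org/save/" + entry.link, timeout=60)

# Overwrite (not append) with the current feed - on the first run this just
# seeds the file without posting anything
SEEN_FILE.write_text("\n".join(current_urls))
```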
I keyed on the URL since that seemed least likely to change on me - titles can be edited after publication, but the URL should remain constant.
I set up the script on my Synology NAS, created some jobs in the Task Scheduler (one per RSS feed/subreddit combination) and let her rip. UKShootingNewsBot was live. Sorry everyone. Yeah, that’s me.
As it turns out, the code was the easy bit!
Granularity of RSS Feeds and post frequency
I didn’t want to spam any particular subreddit with everything from an organisation. Whilst people might be interested in competition results or calls to action for public consultations, they don’t want half a dozen administrative articles reminding them to renew their club affiliation - club secretaries already get those emails, and smaller subreddits like /r/ukguns would be overwhelmed. Finding a feed for NSRA Competitions was easy. The NRA was more difficult, and I ended up with their main feed. I wanted a hunting angle, so grabbed the BASC feed too.
This was too much! Immediately, the sub got deluged with 3-4 posts per day, some as irrelevant as recipes for pheasant - marking the start of the game season. This in a sub which gets about 4 posts per week.
On closer inspection, I found that I could get just legal affairs from BASC, and a Comps feed from the NRA. But this in itself is a bit flaky. Most CMS platforms generate RSS feeds automatically and will generate one for each post category that exists. But it’s not always easy to find, and users are not always consistent or disciplined in applying categories - since most of them don’t know about RSS and just treat it as a bit of nice-to-have metadata.
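If you want to see how consistently a feed’s categories are actually applied, feedparser exposes them per entry, so a quick dump like this (placeholder URL again) tells you what you’re working with:

```python
import feedparser

feed = feedparser.parse("https://example.org/feed/")
for entry in feed.entries:
    # Each category the CMS attached to the post shows up as a tag with a .term
    categories = [tag.term for tag in entry.get("tags", [])]
    print(entry.title, "->", categories or "(no categories)")
```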
Nonetheless, with a bit of fine tuning I got a sensible post rate and people mostly stopped complaining.
I decided I did want to get everything out there, if only to help with search indexing (reddit can be a better search engine than google these days), so I created a new subreddit - /r/ukshootingnews - which gets the full firehose: the mainline RSS feed for every major shooting body in the UK - the NRA, NSRA, BASC, CPSA, WTSF, STS, etc (but not British Shooting… more on that in a future article).
RSS is unreliable
The bot then ran pretty much okay for a year. There was the odd irrelevant article I manually deleted which had been inappropriately categorised. But in the back end of 2023 I had two big problems in the space of a month.
The first was the NSRA. For reasons best known to them, 6-12 month old articles kept cropping up in the feed. Presumably someone had edited something minor and the new “Date Modified” had bounced it up to the feed. It would be nothing for a week, then four in a day. People started complaining. Then something broke on the NSRA website and nsra.co.uk redirected to a development subdomain. And so did the RSS. And every URL in the RSS feed. So when the script checked which URLs were “new”… it was all of them. And when the NSRA fixed their site a day later, all of those were new as well (because the “old urls” file is overwritten, not appended). Many posts were manually scrubbed. If this annoyed you… sorry. I know, I cringed too.
Having got past that, BBC Sport did something weird with their site as well, changing the URL format - year-old articles got spammed onto /r/ukguns. Again. Sorry about that.
So how to fix this? Well, I could limit the damage by throttling the script to posting only one article per run. It might be an old and irrelevant article, but at least it wouldn’t run away with itself. That felt a bit hacky though, because an organisation might post two relevant articles in a day and one would not be posted. Would that be so bad? Maybe not. But it didn’t feel optimal.
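For completeness, that throttle would look something like this - sketched as a standalone helper with hypothetical arguments, not the bot’s real code:

```python
def post_at_most_one(reddit, subreddit_name, feed, seen_urls):
    """Submit the first unseen entry, then stop - at most one post per run."""
    for entry in feed.entries:
        if entry.link not in seen_urls:
            reddit.subreddit(subreddit_name).submit(title=entry.title, url=entry.link)
            return entry.link
    return None  # nothing new this run
```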
I could also append to the list of articles instead of overwriting it. This would catch 12-month-old posts which had dropped off and then popped back up for some reason. I didn’t like this, since the overwrite method keeps the file naturally small and means I never need to worry about file sizes (albeit not likely to be an issue at this rate of posting). But it wouldn’t catch changes to URL schema (where the URL is genuinely new). In writing this, I’ve realised I could identify the article by the path rather than the whole URL, which would have avoided the issue with the NSRA site (where the domain jumped to a dev domain, but the path stayed constant) - though it wouldn’t have caught the BBC issue.
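The path idea is easy to sketch with the standard library. The dev-domain URL below is made up, but it shows why keying on the path would have shrugged off that kind of hop:

```python
from urllib.parse import urlsplit

old_url = "https://www.nsra.co.uk/news/some-match-report"  # made-up article path
new_url = "https://dev.nsra.co.uk/news/some-match-report"  # hypothetical dev domain

# Both reduce to the same key, so the domain swap wouldn't look like a new article
assert urlsplit(old_url).path == urlsplit(new_url).path == "/news/some-match-report"
```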
What I did instead was get my hands dirty with timestamps, and it is to my eternal gratitude that feedparser parses the dates for me so I didn’t have to. The script checks the date and only posts an article whose publication date is today or yesterday. Even if the publisher is doing shenanigans with their feed, it’s only going to repost very recent news - not the last 12 months. The important consideration here is that I am only handling feeds with one to two new posts per day. If you were consuming the Ars Technica or BBC News feed, there would be dozens of articles per day and you could end up spamming out many articles even with that limit on. This is kind of a kludge in that it doesn’t actually stop the bad behaviour - it just severely limits the blast radius. Defence in depth is important, and this guards against edge cases I haven’t thought of (or been bitten by!) in addition to the ones I have.
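The recency guard itself is short. A minimal sketch, assuming each entry carries a parsed publication date (feedparser normalises these to UTC):

```python
from datetime import datetime, timezone

def is_recent(entry) -> bool:
    """True if the entry was published today or yesterday (UTC)."""
    published = datetime(*entry.published_parsed[:6], tzinfo=timezone.utc).date()
    today = datetime.now(timezone.utc).date()
    return (today - published).days <= 1

# Only entries passing this check get submitted, so even a misbehaving feed can
# only replay a day or two of news, not the last twelve months.
```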
Takeaways
So what have I learnt?
- RSS feeds will change in unpredictable ways.
- Depending on the config, old articles may be promoted if their dateModified is updated.
- Minor changes to a site’s URL schema will be reflected in the feed - which could impact your list of “previously seen” articles.
- When reposting articles from an RSS feed, throttling to extremely recent articles is a good idea and reduces the blast radius when something unexpected happens.
- Identifying an article by the path (or even a subset of the path) could help avoid issues where the domain changes.