27.8.2021

Website Categorisation

Tags: blocked.org, censorship, overblocking, categorisation, duality, malgorithms, algorithms

2021-08-27 11:44 +0100

I wrote in February about the overblocking of websites by imperfect network filtering packages - often drawing on website category data as a source.

Target sport clubs and even governing bodies like British Shooting often end up on the wrong side of this equation thanks to the duality of the shooting sports, being classified as “weapons” or even “violence” due to the military and hunting applications of firearms.

Having a list like Clubfinder with domains for UK rifle clubs made me wonder if I could quantify and maybe even address this.

Who are you?

There are a wide variety of categorisation and filtering platforms, some run by appliance vendors and MSPs in the education sector. Other more general platforms are provided by the likes of OpenDNS and CleanBrowsing which include consumer services. Although the platforms are generally proprietary, most have a public-facing interface for checking individual domains (and sometimes submitting requests for re-categorisation). The initial pass turned up Categorify, Palo Alto Networks, zvelo, Cyren, WhoisXML, Webshrinker, Cloudflare and Fortinet.

All of these let you check domains and submit re-categorisation requests, but mostly via Captcha-protected web forms, which doesn’t scale.

A couple however offer public APIs for checking, and a basic HTML form for requesting recategorisation which is easily replicated as a POST request in software. These were the obvious places to start. I’m just going to refer to “Provider A” in this post since I want them to have an opportunity to defend themselves when I’m done with this project.
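As a rough illustration of how a plain HTML form can be replicated as a POST request, here is a minimal sketch using only the standard library. The endpoint URL and the form field names (`domain`, `category`, `email`) are hypothetical placeholders, since Provider A is not named here and its real form will differ. The function only builds the request and does not send anything.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: a placeholder, not Provider A's real URL.
RECAT_URL = "https://provider-a.invalid/recategorise"


def build_recat_request(domain, suggested_category, contact_email):
    """Build (but do not send) a form-encoded POST asking for re-categorisation.

    The field names are assumptions for illustration; a real form's
    field names would need to be read from its HTML.
    """
    form = urlencode(
        {
            "domain": domain,
            "category": suggested_category,
            "email": contact_email,
        }
    ).encode("ascii")
    return Request(
        RECAT_URL,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the returned object to `urllib.request.urlopen()` would submit it. Anything run in bulk should be rate-limited out of courtesy to the provider.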

Target domains

Of the 851 clubs in Clubfinder, ~578 actually have websites. Some of those are little more than webpages on a County or local authority website (all the Jersey clubs were listed on the single “www.jerseylovessport.com” domain). De-duplicating those records and adding county and national associations left 539 domains to check.
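The de-duplication step can be sketched as below. This is a minimal version, assuming the club records hold free-form URLs (some with a scheme, some without) and that two clubs count as duplicates when they share a hostname. The helper names are mine, for illustration only.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the lower-cased hostname of a URL, tolerating a missing scheme."""
    if "//" not in url:
        # Without "//" urlsplit would treat the host as a path.
        url = "//" + url
    return (urlsplit(url).hostname or "").lower()


def unique_domains(urls):
    """De-duplicate URLs by hostname, preserving first-seen order."""
    seen = {}
    for url in urls:
        host = domain_of(url)
        if host:
            seen.setdefault(host, url)
    return list(seen)
```

With this approach, several clubs that are just pages on one shared site (as with the Jersey clubs) collapse to a single domain to check.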

Whipping up a bit of Python to query Provider A’s API was straightforward and I ended up with a new SQLite database of domains, categories and ratings. At some point I’ll post it to GitHub but it’s currently an ugly mess of scripts, half-finished modules, hard-coded variables and flags that I simply comment out for different runs. It needs some tidying up to avoid outraging Pythonic decency.
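Until the real scripts are tidied up, here is a minimal sketch of what a results store along these lines could look like. The table layout and function names are assumptions for illustration, not the actual schema. The API lookup itself is left out and the functions simply record whatever category and rating a lookup returned.

```python
import sqlite3

# Assumed layout: one row per domain, overwritten on each re-check.
SCHEMA = """
CREATE TABLE IF NOT EXISTS domains (
    domain   TEXT PRIMARY KEY,
    category TEXT,
    rating   TEXT,
    checked  TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def open_db(path=":memory:"):
    """Open (or create) the results database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record(conn, domain, category, rating):
    """Store one lookup result, replacing any earlier result for the domain."""
    conn.execute(
        "INSERT OR REPLACE INTO domains (domain, category, rating) "
        "VALUES (?, ?, ?)",
        (domain, category, rating),
    )
    conn.commit()


def category_counts(conn):
    """Return {category: number of domains} for a quick summary."""
    rows = conn.execute(
        "SELECT category, COUNT(*) FROM domains GROUP BY category"
    )
    return dict(rows)
```

Keying on the domain means a re-scan after a re-categorisation request simply overwrites the old row, so `category_counts()` always reflects the latest state.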

Of 539 domains, 356 (66%) were classed as “Weapons” and “PG-13”. Of the remaining 183, 44 were offline when they were scanned, whilst most of the others had innocuous (albeit often wrong) categories such as “Financial”, “Engineering” or “Online Shopping”. Four domains were classified as “Adult/Pornography” and “R & NSFW”. They were <essexsra.co.uk>, <sussex-county.co.uk>, <www.5thsussexhgrc.org.uk> and <www.wessexbiathlon.org>. Notice a pattern in there? A classic case of the “Scunthorpe problem”. Whilst 4 domains is not a meaningful sample size, a 100% hit rate suggests that the categorisation platform is still quite crude in places, with every domain containing ‘sex’ being marked as adult content. Definitely worthy of further investigation.
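To make the Scunthorpe problem concrete, the sketch below shows how a naive substring match would produce exactly this pattern. It is purely illustrative: I have no knowledge of how Provider A actually classifies domains, and the keyword list and the place-name mask are my own assumptions.

```python
# Assumed keyword list, for illustration only.
FLAGGED_SUBSTRINGS = ("sex",)

# Innocent place names that happen to contain a flagged substring.
PLACE_NAMES = ("essex", "sussex", "wessex", "middlesex")


def naive_flag(domain: str) -> bool:
    """Flag a domain if a keyword appears anywhere in it."""
    name = domain.lower()
    return any(word in name for word in FLAGGED_SUBSTRINGS)


def masked_flag(domain: str) -> bool:
    """Same check, after removing known-innocent place names first."""
    name = domain.lower()
    for place in PLACE_NAMES:
        name = name.replace(place, "")
    return any(word in name for word in FLAGGED_SUBSTRINGS)


CLUB_DOMAINS = [
    "essexsra.co.uk",
    "sussex-county.co.uk",
    "www.5thsussexhgrc.org.uk",
    "www.wessexbiathlon.org",
]
```

`naive_flag` marks all four club domains, while `masked_flag` marks none of them. Masking place names is only a stop-gap, not a suggestion for how a real classifier should work, but it shows how little context is needed to avoid these particular false positives.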

So that’s not great. Roughly two-thirds of the domains are classed wrongly, and are potentially inaccessible from many institutional and council/library networks. Obviously we’re not looking to commit malfeasance here. Network administrators have the right to set standards for what passes their gateway, and it’s not for any site operator to try and bypass that by seeking to categorise their domain as something else. But to filter effectively they need to be able to trust that the categorisations are accurate, and it is clear that even in this small sample of sites we have a non-trivial error rate.

Examining how sites are treated

Cleaning this up is undoubtedly a good thing, but does it actually matter? To understand how sites with different ratings are treated, it’s useful to see what conditions result in their blocking. Happily, Provider A offers three public filtering services alongside their commercial offering:

“Security” blocks access to phishing, spam, malware and malicious domains.
“Adult” blocks everything in “Security” plus all adult, pornographic and explicit sites, and sets Google/Bing to force SafeSearch on.
“Family” blocks everything above, plus proxy sites and VPN endpoints that might be used to bypass the filters. It also blocks mixed-content sites like Reddit.

Checking a sample of sites classified as “Weapons” and “Adult/Porn” showed that even the most restrictive “Family” filter was only blocking sites with an “Adult/Pornography” category. “Weapons” did not in itself cause sites to be blocked. This makes sense, since the provider wants to leave people with a reason to subscribe and give them money! “Weapons” is treated in the same way as gambling or social media: it will only be filtered if a customer (such as an institutional network administrator) checks the “Weapons” tick-box. This is somewhat reassuring, but appropriate categorisation is nonetheless desirable. I submitted all the incorrect “Weapons” and “Adult/Porn” domains to Provider A and am pleased that they were all corrected to “Sport”.

Broadening the net

After examining the domains from Clubfinder, I thought it would be interesting to check international websites as well: the overseas equivalents of British Shooting, and continental bodies like the European Shooting Confederation. I lifted a list of affiliated bodies from the ISSF website, and this run turned out to be a decidedly mixed bag. Whilst some were classed as “Weapons” or “Sports”, there were also categories of “Financial”, “Literature”, “Software Development”, “Health” and even “Dating”. Accuracy appears to tail off badly for non-English-language sites. This is perhaps not surprising, but it is a key limitation that network administrators should be aware of.

Next steps

There are two main routes of inquiry to follow from here.

The first is to continue my investigation of shooting-sport sites with other filter providers. This is liable to run into diminishing returns, as many providers only allow API access for paying customers. Moreover, most do not have a mechanism for requesting re-categorisation on a bulk basis, so I would be relying on them agreeing to offer me access on a research basis. Even so, simply developing a list of erroneously classified sites could be useful for distributing to the community, to raise awareness and encourage clubs to make those requests on an individual basis.

The second is a broader assessment of how bad categorisation is. Having found that all four domains containing ‘sex’ in their name were incorrectly marked as “Adult/Porn”, I manually tried a handful of ‘Essex’ domains including <essexstudent.com>, <essexoutdoors.com>, <essexsports.net> and <westessexbowmen.co.uk>. All were classified as “Adult/Porn”. The easiest approach here would be to get complete TLD zone files from organisations like Verisign and Nominet and examine every domain containing the string “sex”, paying particular attention to place-name strings like “essex”, “sussex” and “middlesex”.
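A zone-file scan along those lines might look like the sketch below. It assumes the files are in the usual DNS master-file text format, with the owner name as the first whitespace-separated field on each record line. The files actually supplied by a registry may be laid out differently, and the function names are mine.

```python
COUNTY_STRINGS = ("essex", "sussex", "middlesex")


def owners_containing(lines, needle="sex"):
    """Yield each distinct owner name containing `needle`, in first-seen order."""
    seen = set()
    for line in lines:
        line = line.strip()
        # Skip blanks, comments and directives such as $ORIGIN or $TTL.
        if not line or line[0] in ";$":
            continue
        owner = line.split()[0].rstrip(".").lower()
        if needle in owner and owner not in seen:
            seen.add(owner)
            yield owner


def split_by_county(domains):
    """Partition domains into (contains a county name, everything else)."""
    county, other = [], []
    for domain in domains:
        is_county = any(name in domain for name in COUNTY_STRINGS)
        (county if is_county else other).append(domain)
    return county, other
```

Because `owners_containing` is a generator over lines, it can be fed an open file handle and will stream a multi-gigabyte zone file, holding only the matching names in memory. The county list from `split_by_county` is then the set of likely false positives to run past each provider.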

Over-blocking of this kind is the sort of complaint the Open Rights Group have made with their Blocked project, which is aimed at ISP-level blocking. On the surface, over-blocking in education and corporate environments might seem prudent - “playing it safe”. In reality it can mean blocking sites relating to sexual health or organisations supporting domestic abuse victims. This can impose real-world harms on vulnerable individuals.