Build Your Own Search
search, SEO, google, search engine, build your own search engine, systems
2024-06-03 12:11 +0100
Is the time right to take a leaf from the Fediverse and move towards smaller, more localised/focussed (and opinionated?) search?
A treasure trove of internal documentation relating to Google Search was recently leaked, setting the SEO world alight because it shows Google collecting various metrics and data points they had previously denied collecting. In fairness, the docs don’t show weightings, or whether those metrics are even used in Search - they might feed Knowledge Graph or other services like ad placement. Nonetheless, it’s an interesting backdrop to the ongoing AI hype, with Google and Bing replacing authoritative search results with hallucinated AI “summaries”.
It has been suggested for a while that Google’s results have become less useful, but the only major alternative is Bing - metasearch engines like DuckDuckGo simply rehash Bing results (with privacy protections added) rather than running their own index. Reddit is often cited as the ideal search result for technical problems, where organic results are polluted with AI-generated spam. Mojeek do run their own index, though their results may not suit everyone.
How hard could it be to do better? What if we decided we weren’t going to try to index everything? What if we added some human curation to the mix? Keep the service tight and focussed (e.g. hyper-localised on UK users), with an opinionated basis for determining reliability, rather than indexing everything purely algorithmically while claiming total neutrality.
At this point I realised I didn’t know how the hard part of search engines (indexing) actually works. Scraping is easy - a teenager with Python and BeautifulSoup could write a web scraper. Indexing is the hard part: how do you determine what is most useful for a given search term? Here, the seminal paper describing Google from Brin and Page’s time at Stanford is instructive. Happily, issues like seek times on spinning disks and the 4GB maximum file size on 32-bit filesystems are no longer a problem!
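To show just how easy the scraping half is, here’s a minimal sketch using requests and BeautifulSoup. The seed URL is a placeholder, and a real crawler would also need robots.txt handling, politeness delays, deduplication and a frontier queue - none of which is shown here.

```python
# Minimal fetch-and-extract: the "easy" half of a search engine.
# Requires `requests` and `beautifulsoup4`; the URL below is just a placeholder.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def scrape(url: str) -> tuple[str, list[str]]:
    """Fetch a page, returning its visible text and the absolute links it contains."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return text, links


if __name__ == "__main__":
    page_text, outlinks = scrape("https://example.com/")
    print(f"{len(page_text)} characters, {len(outlinks)} outbound links")
```

Everything difficult - deciding what that text is actually worth for a given query - happens after this point.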
The problem
Google has long been in a whack-a-mole battle with SEO practitioners, who try to reverse-engineer Google’s indexing mechanism for the benefit of their clients. Even as early as 1997, Brin and Page noted that meta tags in page headers could not be trusted, as they were modified to make a page look highly relevant when it wasn’t. This led to the “citation”-based foundation of PageRank, where a page’s value is derived largely from how many other pages link to it. You can stuff as many keywords as you like into your page - it won’t help if nobody links to your site.
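A toy power-iteration version of that idea is below: rank flows along inbound links, so a page nobody links to stays at the baseline no matter what it says about itself. The tiny graph and the 0.85 damping factor are purely illustrative, not how Google computes anything today.

```python
# Toy PageRank via power iteration: each page's rank is fed by the pages that
# link to it, so keyword-stuffing a page that nobody links to achieves nothing.
# The graph and the damping factor (0.85, as in the original paper) are illustrative.

def pagerank(links: dict[str, list[str]], damping: float = 0.85, iters: int = 50) -> dict[str, float]:
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                       # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank


graph = {
    "keyword-stuffed-spam": ["well-cited-guide"],  # links out, but nothing links in
    "well-cited-guide": ["official-source"],
    "official-source": ["well-cited-guide"],
}
print(pagerank(graph))   # the spam page never climbs above the (1 - damping) baseline
```

The real thing runs over billions of nodes with sparse-matrix machinery, but the principle is the same: reputation comes from being linked to, not from what a page claims about itself.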
Since then, levels of manual intervention have been layered on - for instance, the leaked documents show that flags such as “isCovidLocalAuthority” and “isElectionAuthority” have been used to allow-list and promote official sources of information over unofficial ones, as well as to bump outright conspiracy and disinformation schemes down the rankings. Such flags are inevitable for fine-grained control - even within the UK, COVID regulations differed between England, Scotland, Wales and NI. There will be situations where an “after-the-fact” flag promoting specific sources is necessary, even if official NHS or Government sources are generally well-ranked, because you need the correct NHS or Government source for that region or sub-topic!
One of the key points of the Google paper is scalability. In 1997, search engines were indexing up to 100 million documents (individual pages, not sites). Their research effort went into improving search rather than indexing, but they expected to be indexing a billion documents by 2000. PageRank serves to bypass bad behaviour (such as keyword stuffing) by leaning on the graph of links between sites to find respected and cited information - a web of trust. But they probably didn’t expect to have to sift through LLM-generated cruft that looks good but actually tells you to put glue on your pizza. Nor the triviality of spinning up websites and generating this shit on an industrial scale.
Now, I would question whether anyone would ever need a search engine with more than a billion indexed documents. Sure, my interests are not the same as your interests. But there’s a vast amount of cruft and spam on the internet. So as we drown under the weight of AI-generated content would it be ridiculous to try and improve search not through smarter algorithms, but simply by being more selective about what we allow in?
This is not a technical problem - it’s a human and social problem. Brin and Page were looking at a rapidly growing web and developed a system that could scale from a technical perspective. But not from a human perspective. And they could not have predicted the boom of LLM-generated nonsense.
A solution?
To my mind, this is a classic problem of a service which started out on a small web and has had to apply safeguarding and fudges retrospectively, much as Facebook and social media have struggled to manage moderation at scale (either missing rampant disinformation and child abuse material, or going overboard and banning social media groups for geographic places like Plymouth Hoe!).
There would be value to an AI-free search platform which bakes in safeguarding from the outset.
For instance, nhs.uk domains would be marked up at the start of the indexing process with a multiplier, to ensure relevant pages from the NHS always outperform overseas sites like WebMD, or fringe sources. Low-grade generic TLDs simply wouldn’t be indexed at all.
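As a sketch of what “baking it in” might look like at scoring time - the suffixes, weights and the zero-weight gTLD list below are invented placeholders, not a real policy:

```python
# Applying an opinionated trust multiplier when a document is scored for the index,
# rather than as a post-hoc ranking flag. All weights and lists here are invented
# placeholders for illustration.

TRUST_MULTIPLIERS = {
    ".nhs.uk": 2.0,    # official health information
    ".gov.uk": 2.0,    # government sources
    ".ac.uk": 1.5,     # universities
    ".sch.uk": 1.5,    # schools
}
SKIP_TLDS = (".xyz", ".top", ".click")   # example low-grade gTLDs: never indexed


def indexed_score(domain: str, base_score: float) -> float | None:
    """Return the score to store in the index, or None if the domain is skipped entirely."""
    if domain.endswith(SKIP_TLDS):
        return None
    for suffix, multiplier in TRUST_MULTIPLIERS.items():
        if domain == suffix.lstrip(".") or domain.endswith(suffix):
            return base_score * multiplier
    return base_score


print(indexed_score("www.nhs.uk", 0.4))         # boosted: 0.8
print(indexed_score("miracle-cures.xyz", 0.9))  # not indexed at all: None
```

Because the boost happens before anything is stored, a fringe source can’t out-optimise an official page later - the index never holds the data that would let it.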
Baked-in axioms for a UK-localised index might include:
Social Good
- Bake in priority for official domains such as .nhs.uk, .gov.uk, .sch.uk, .ac.uk, etc., along with high-value foreign equivalents. I suspect Google already bumps these domains. In my model, the interesting move is to bake that into the actual page rank rather than adding flags to boost the organic ranking of pages.
- Bake in priority for high-quality local institutions - e.g. BBC Bitesize for education (BBC News would be graded the same as other news sources).
- Collaborate with social organisations such as The Samaritans and Women’s Aid to ensure relevant sites for mental health, domestic abuse, sexual health and other often-miscategorised services are indexed, protected and highly ranked.
- Collaborate with social organisations to build reliable lists of things like sports clubs and arts groups, so that they can be marked as “Yup, a real organisation by and for humans”.
Reduce cruft
- Don’t index social media posts - imagine the reduction in search clutter if every Google search automatically appended “-site:facebook.com -site:x.com -site:pinterest.com -site:tiktok.com”. This leaves space for diverse web results to be displayed. This is not to say the sites themselves would be absent - if you searched for Facebook, you would get facebook.com. But if you searched for Taylor Swift, it wouldn’t bring back her latest Facebook post - you’d get her official website, as well as news and articles about Tay.
- Exceptions for “opted-in” sources such as specific technical subreddits which may have better fault-finding or documentation than official sources. For celebrities and public figures, social profiles and data from Wikidata would be used to create a Knowledge Panel, but would not be included in search results.
- Don’t index news from low-quality sources. Leverage human-curated fact checkers and lists such as Wikipedia’s Reliable Sources. Major news outlets would be graded for reliability. Political bias is fine - this is not a censoring exercise - but sources with a reputation for poor fact checking, sensationalism or flat-out fabrication would be marked down like social media - not indexed past the home page.
- Don’t index sites with AI-generated text content. (Handling the likes of image sites such as DeviantArt with declared AI-generated imagery would need further thought, but Image Search has wider scope for filters and toggles anyway to show/hide categories of content).
- Don’t index the majority of gTLDs.
- Leverage Spamhaus to downgrade domains associated with email spam.
- Don’t index domains less than a month old. This improves stability at the cost of slower indexing. This is fine - it’s okay if your new site takes a month to show up; you probably weren’t registering it the day before go-live anyway (cf. spammers who churn domains on a daily basis). Declining to index social media and low-grade journalism would significantly shrink the index, lowering the cost of entry. In principle, it would also be possible to index those services but place them on a block-list, preventing them from being shown on the SERP unless a specific operator was used (e.g. “site:dailymail.co.uk” or “site:dailystar.co.uk”). This means the engine is still useful for researchers trying to measure coverage of an issue, but only quality sources would be shown by default - a rough gate combining these rules is sketched below.
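A minimal sketch of that gate, assuming we can look up a domain’s registration date (e.g. via WHOIS). The lists, the 30-day threshold and the decision labels are all placeholders, and crawl-time and query-time checks are compressed into one function for brevity:

```python
# Rough gate over the "reduce cruft" rules: skip excluded social platforms and
# too-new domains, and keep block-listed outlets out of the SERP unless the user
# explicitly scopes to them with a site: operator. Lists and thresholds are placeholders.
from datetime import datetime, timedelta, timezone

EXCLUDED_PLATFORMS = {"facebook.com", "x.com", "pinterest.com", "tiktok.com"}
SERP_BLOCKLIST = {"dailymail.co.uk", "dailystar.co.uk"}   # indexed, but hidden by default
MIN_DOMAIN_AGE = timedelta(days=30)


def gate(domain: str, registered: datetime, site_operators: set[str]) -> str:
    """Return 'skip', 'hidden' or 'show' for a candidate domain."""
    if domain in EXCLUDED_PLATFORMS:
        return "skip"                                     # never indexed past the homepage
    if datetime.now(timezone.utc) - registered < MIN_DOMAIN_AGE:
        return "skip"                                     # too new; revisit on a later crawl
    if domain in SERP_BLOCKLIST and domain not in site_operators:
        return "hidden"                                   # in the index, off the results page
    return "show"


# A researcher explicitly scoping to a block-listed outlet still gets results:
print(gate("dailymail.co.uk",
           datetime(2002, 1, 1, tzinfo=timezone.utc),
           site_operators={"dailymail.co.uk"}))           # -> "show"
```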
We also have to be careful of gatekeeping too heavily and locking out new entrants. The Huffington Post would have struggled to get off the ground if Google had refused to index it until it became notable. But it’s also reasonable to ask sites to demonstrate their trustworthiness - nobody owes them a platform. And the aim here is to surface small but authoritative sites, reversing the trend of burying them under mountains of social media posts and spam farms.
To be clear, it’s impossible to provide a fully curated search experience unless we’re offering an extremely restricted Search For Toddlers; we have to use automation somewhere along the line. But by making more opinionated decisions up front, we could make big strides towards reliable and safe search by saying:
“x is vital, y is objectively garbage. If you want to search z then by all means do so - using their internal search. We don’t feel the need to index that.”