
Google Search Has a Content Moderation Problem


    [Image: an AI-generated picture of a confused-looking man dressed and ready for a day on the beach, but standing in a concrete city landscape. Caption: The irony of using this AI-generated image is not lost on us.]

    What if you asked someone for a travel recommendation and they sent you off to the shittiest resort in some polluted area in a country you didn’t want to visit?

    Sometimes this is what searching on Google can feel like. Thankfully not all the time, but there certainly are plenty of incidents where things just go wrong and you’re shown parts of the web that you’d rather avoid.

    We’d argue that, in this sense, Google is failing to filter and moderate the content it presents.

    Google’s challenge: content overload

    Google’s search index is more than 100 petabytes in size and includes hundreds of billions of web pages, along with other internet-accessible content like images and documents.

    While that may sound like an insanely big number, it’s far from the entire web. To build its index, Google crawls and discovers trillions of pages but discards most of them; in 2020, the index contained about 400 billion pages.

    We’re saying all this to make one point: Google, like any search engine that tries to crawl the entire web, faces a huge challenge in continually sifting through these massive amounts of content, including or discarding pages based on the criteria of its various algorithms.

    At this scale, any issue with Google’s website or content classification is magnified a thousandfold.

    Examples of spam in search results

    Anyone who has operated a website knows that Google can be a fickle master when it comes to what content it decides to index and rank.

    There are people who take advantage of Google’s flaws, bugs and inconsistencies, even if just for temporary gains, but there are also plenty of self-inflicted wounds. Google has shot itself in the foot many, many times.

    Here is a mix of recent incidents with less-than-ideal search results being shown:

    • Lately, Google has been boosting the rankings of user-generated content, so forum sites like Reddit and Quora have seen big increases in organic traffic. The irony is that many of these pages include prominent spam comments, as revealed in a recent study by Glen Allsop.
    • Hijacked hotel listings sending searchers to the wrong sites.
    • This spring, Google’s own AI-generated responses (AI Overviews, formerly SGE) were occasionally giving misleading and sometimes dangerous answers.
    • In recent years, something called parasite SEO has been a widespread strategy for getting third-party pages to rank highly by piggybacking on the stronger reputation of another site. Google finally seems to be shutting this abuse down, but here’s how it works.
    • Lots of issues related to Google Maps and local listings, such as review spam on Google Business Profiles.

    Google usually manages to correct most issues, but it can take a long time. Generative AI will only make this harder to keep under control, as the threshold for creating new content keeps getting lower.

    Direct and indirect spam

    Search engines can promote spam directly, in their results, but also indirectly, by sending searchers to pages that themselves contain spam.

    This means that other sites’ user-generated content is a problem for Google, too. The evil twin of “mi casa es su casa.”

    This problem is not entirely unique to search engines; it’s something to keep in mind for any company working with data from external sources. Every input is a potential source of spam or abuse if you don’t filter it carefully.

    For Google, every site out there that isn’t on top of moderating its user-generated content is a potential attack vector. Like those spam-filled Reddit and Quora results we mentioned earlier, for example.
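    To make the point about filtering external input a bit more concrete, here’s a minimal, illustrative sketch of a first-pass filter for user-generated content. The keyword list, link threshold, and repetition check are assumptions made up for this example; this is not how Google or any particular platform actually moderates content.

```python
# A minimal sketch: treat every external input as untrusted and run it through
# a moderation step before indexing or displaying it. The specific rules below
# (keyword list, link count, repetition check) are illustrative assumptions only.

import re
from dataclasses import dataclass


@dataclass
class ModerationResult:
    allowed: bool
    reasons: list[str]


# Hypothetical thresholds and keywords, chosen purely for demonstration.
SPAM_KEYWORDS = {"free followers", "crypto giveaway", "work from home riches"}
MAX_LINKS = 2


def moderate_user_content(text: str) -> ModerationResult:
    """Flag obviously spammy user-generated content before it reaches an index."""
    reasons = []
    lowered = text.lower()

    # Rule 1: known spam phrases.
    for keyword in SPAM_KEYWORDS:
        if keyword in lowered:
            reasons.append(f"contains spam keyword: {keyword!r}")

    # Rule 2: comments stuffed with links are a classic spam signal.
    links = re.findall(r"https?://\S+", text)
    if len(links) > MAX_LINKS:
        reasons.append(f"too many links ({len(links)} > {MAX_LINKS})")

    # Rule 3: crude repetition check — the same word repeated many times.
    words = lowered.split()
    if words and max(words.count(w) for w in set(words)) > 10:
        reasons.append("excessive word repetition")

    return ModerationResult(allowed=not reasons, reasons=reasons)


if __name__ == "__main__":
    comment = (
        "Great hotel!! Claim your crypto giveaway at "
        "http://a.example http://b.example http://c.example"
    )
    print(moderate_user_content(comment))
```

    In practice you’d layer far more sophisticated signals (and human moderators) on top of simple rules like these, but the principle stands: nothing from an external source should reach your index, or your users, unchecked.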

    Curation implies quality

    If you think of Google’s search results as a curated list of what’s relevant to your query, you’d expect to be able to trust them.

    After all, this is Google’s stated intent: to surface the most relevant, high-quality content for you, the end user.

    Which is why it becomes so obvious when this fails.

    Ahem… tap, tap… is this thing on? 🎙️

    We’re Besedo and we provide content moderation tools and services to companies all over the world. Often behind the scenes.

    Want to learn more? Check out our homepage and use cases.

    And above all, don’t hesitate to contact us if you have questions or want a demo.
