Generative AI Bots Are Harvesting Your Online Community’s Data—Here’s What You Can Do About It.

One of the biggest concerns for website operators today is the unauthorized collection of data by generative AI bots to train AI models. Unlike search engines, which index web content to make it more accessible, these bots amass vast amounts of data which, once absorbed into AI models, loses its attribution. This sparks serious debates about ethics, privacy, and ownership rights.

Both businesses and individuals are wrestling with these challenges. For example, the ongoing legal battles between major news outlets like The New York Times and AI powerhouses such as OpenAI underscore the seismic shifts AI could trigger in copyright norms and data usage. By the end of 2023, 48% of the most widely used news websites across ten countries were blocking OpenAI’s crawlers, according to research from the Reuters Institute.

This brings up an important consideration: Should you implement similar protective measures for your online community? As with most decisions, the answer is “it depends.” Let’s explore the trade-offs to help you decide the best course of action.

The Impact of Generative AI Crawlers on Online Communities

Generative AI crawlers, which automatically extract content from websites to train models or power AI-driven applications, present several risks to the health of online communities.

IP and privacy issues

AI bots can repurpose scraped content in new contexts without the original creators’ permission. In the process, they may gather personal information, such as names and preferences, from interactions in online communities, typically without users’ knowledge or consent. AI companies can then use this data to train AI models or to serve targeted ads.

In many online communities, members share their thoughts and work expecting them to be seen and used only within specific limits. However, unauthorized scraping and redistribution by AI can spread content beyond its intended audience, which can make it feel less exclusive and lower its perceived value.

If community members discover that AI bots use their contributions in ways they didn’t agree to, they may feel exploited and lose trust. This could result in their withdrawal or departure from the community. 

User engagement and traffic

AI-driven technologies are transforming online search behavior and affecting user participation in online knowledge communities. Research reported in Communications of the ACM (CACM), a leading computer science journal, shows that after ChatGPT launched, Stack Overflow, a popular resource for software developers, saw roughly a 12% fall in daily visits and more than a 10% decrease in weekly questions posted. This decline raises concerns about the long-term quality and viability of discussions within these knowledge-sharing platforms.

Content integrity

Generative AI bots sometimes extract text without understanding its full context or nuances. For example, they may lift a comment from a discussion thread without considering the surrounding conversation, tone, or intent. Training AI models on decontextualized content can distort the original message and lead to misinformation.

A notable example involved Air Canada’s customer service chatbot, which “hallucinated” a bereavement discount policy based on similar policies at other airlines. Air Canada was ultimately required to honor the nonexistent discount because the error appeared on its official website.

Furthermore, when AI bots indiscriminately gather data, they risk propagating outdated, inaccurate, or misleading content, which can further spread misinformation. 

Security concerns

Intense activity by AI bots, such as frequent crawling and interactions, can strain website resources, leading to increased server loads and operational costs. In severe cases, it might result in diminished website performance or even outages.  

Additionally, hackers can design AI bots to flood a website with excessive traffic in a brief period, overwhelming the server. This surge in traffic can prevent legitimate user requests from being processed, leading to what is known as a Denial of Service (DoS) attack. 
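
A common first line of defense against this kind of traffic, whether from aggressive crawlers or outright floods, is request rate limiting at the web server. As a generic sketch (not a technique the article attributes to any particular product), an nginx configuration that caps each client IP at a sustained request rate might look like this; the zone name, rates, and backend address are illustrative:

    # In the http block: track clients by IP address, allowing a
    # sustained 10 requests/second per address.
    limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

    server {
        listen 80;

        location / {
            # Permit short bursts of up to 20 extra requests,
            # then reject further requests with HTTP 429.
            limit_req zone=perip burst=20 nodelay;
            limit_req_status 429;
            proxy_pass http://127.0.0.1:8080;  # your community application
        }
    }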

How to Block Generative AI Bots from Crawling Your Online Community

To prevent AI bots from accessing your website, start by editing your robots.txt file, which tells web crawlers which areas of your site they are allowed to access.
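
For example, a robots.txt that opts out of several widely documented AI training crawlers might look like the following. These user-agent tokens are published by the respective vendors and can change over time, so verify them against each vendor's current documentation:

    # OpenAI's training crawler
    User-agent: GPTBot
    Disallow: /

    # Common Crawl, a major source of AI training datasets
    User-agent: CCBot
    Disallow: /

    # Google's AI training opt-out token (does not affect Google Search indexing)
    User-agent: Google-Extended
    Disallow: /

    # Anthropic's crawler
    User-agent: ClaudeBot
    Disallow: /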

However, it’s important to note that adherence to robots.txt is voluntary. While reputable search engines generally follow these rules, some AI bots, especially those designed for data scraping, may ignore them. 

For more robust protection, consider implementing a specialized AI bot blocker, such as the AI Bot Shield built into Higher Logic Vanilla’s community platform. This tool uses digital signatures to detect and block invasive AI bots, offering protection that surpasses what’s possible with robots.txt alone. 

Because AI technology has made bot management within web applications more complex, we also employ advanced strategies at the internet proxy layer. This approach blocks malicious traffic before it ever reaches our customers’ communities, improving both security and performance.
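
Higher Logic doesn’t publish the internals of this proxy-layer filtering, but as a rough sketch of the general technique (illustrative only, not the actual implementation), a reverse proxy such as nginx can reject requests whose User-Agent matches known AI crawlers before they reach the application. The hostname and backend address below are hypothetical:

    # In the http block: flag requests whose User-Agent matches
    # known AI crawlers (case-insensitive regex matches).
    map $http_user_agent $is_ai_bot {
        default      0;
        ~*gptbot     1;
        ~*ccbot      1;
        ~*claudebot  1;
    }

    server {
        listen 80;
        server_name community.example.com;  # hypothetical hostname

        if ($is_ai_bot) {
            return 403;  # reject flagged bots at the proxy
        }

        location / {
            proxy_pass http://127.0.0.1:8080;  # the community application
        }
    }

Note that matching the User-Agent header only stops bots that identify themselves honestly; bots that spoof their User-Agent call for the kind of signature-based detection described above.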

Will this affect your community’s search engine rankings? 

AI Bot Shield selectively targets AI bots for scrutiny without blocking SEO-related bots. It differentiates between benign search engine crawlers, which are allowed through, and potentially harmful generative AI bots. Activating AI Bot Shield therefore will not affect your website’s visibility or search engine ranking.

Some companies, such as Google, operate both AI crawlers and search crawlers. In such cases, AI Bot Shield blocks only the AI crawlers, so bots known to collect data for AI training are stopped while search indexing continues unaffected.
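
The article doesn’t detail how AI Bot Shield tells these crawlers apart, but one widely used building block for distinguishing a genuine search crawler from an AI bot spoofing its User-Agent is reverse-then-forward DNS verification, which Google documents for Googlebot. A minimal Python sketch, assuming you already have the client’s IP address:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        # Reverse-resolve the IP to a hostname, require a Google-owned
        # crawler domain, then forward-resolve that hostname and confirm
        # it maps back to the original IP.
        try:
            host = socket.gethostbyaddr(ip)[0]
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]
        except OSError:  # covers DNS lookup failures
            return False

A request that claims to be Googlebot but fails this check can be treated as a masquerading bot and handled accordingly.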

Should You Block Generative AI Bots from Crawling Your Online Community?

Not all communities view AI bot crawlers in the same light. The decision to block these bots or allow them access to your site’s content should be informed by your specific goals and the nature of your community.

For some, the visibility offered by generative AI platforms like Bing, ChatGPT, or Gemini could be desirable. AI bots can be a powerful tool for disseminating knowledge to the public or increasing awareness around a brand, new product category, or point of view.  

Depending on your community’s needs, you can configure AI Bot Shield to block or allow specific bots, providing flexibility in how you engage with different AI technologies.

To learn more about how Higher Logic Vanilla blocks generative AI bots in online communities, reach out to me on LinkedIn or book a demo with our team.

Matt Crouse

Matt is the director of product engineering at Higher Logic Vanilla.