To expand the dataset used to train its next generation of AI systems, OpenAI has introduced a new web crawler called GPTBot. According to OpenAI, the crawler will gather information from publicly accessible websites while avoiding content that is paywalled, sensitive, or illegal.
However, the system is opt-out. Like the crawlers run by search engines such as Google, Bing, and Yandex, GPTBot presumes by default that publicly available content is open for use. To stop the OpenAI crawler from ingesting a site, the site's owner must add a "disallow" rule for GPTBot to the site's robots.txt file.
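OpenAI's documentation identifies the crawler's user-agent token as GPTBot, so the opt-out is the standard robots exclusion mechanism. As a minimal sketch, the example rule below and the Python check that follows are illustrative (the domain and page URL are placeholders, not part of OpenAI's documentation):

```python
from urllib import robotparser

# Illustrative robots.txt entry a site owner could publish to block
# OpenAI's crawler by its documented user-agent token, "GPTBot":
#
#   User-agent: GPTBot
#   Disallow: /
#
# The standard-library parser below reads a site's live robots.txt and
# reports whether GPTBot is permitted to fetch a given page.
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder domain
parser.read()

allowed = parser.can_fetch("GPTBot", "https://example.com/some-article")
print("GPTBot allowed:", allowed)
```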
Additionally, according to OpenAI, GPTBot will screen scraped material in advance to weed out personally identifiable information (PII) and anything that violates its policies. Some technology ethicists, however, argue that the opt-out approach still raises consent concerns.
Some commenters on Hacker News defended OpenAI's approach, arguing that the company needs to amass as much data as possible if people want powerful generative AI tools in the future. Another, more privacy-minded commenter complained that "OpenAI isn't even quoting in moderation. It obscures the original by creating a derivative work without citing it."
The launch of GPTBot comes amid recent criticism that OpenAI previously collected data without permission to train Large Language Models (LLMs) such as the ones behind ChatGPT. The company updated its privacy policy in April to address these concerns.
Meanwhile, a recent GPT-5 trademark filing hints that OpenAI may be working on the next version of its GPT model. Such a system would likely rely on large-scale web scraping to refresh and expand its training data. As of yet, however, there has been no official announcement concerning GPT-5.