To expand the dataset used to train its next generation of AI systems, OpenAI has introduced a new web crawler called GPTBot. According to OpenAI, the crawler will gather information from publicly accessible websites while avoiding content that is paywalled, sensitive, or illegal.
However, the system is opt-out. Like the crawlers run by search engines such as Google, Bing, and Yandex, GPTBot presumes by default that publicly available content is open for use. To stop the OpenAI crawler from ingesting a site, the site's owner must add a "disallow" rule for GPTBot to the site's robots.txt file.
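OpenAI's documentation identifies the crawler's user-agent token as GPTBot, so the opt-out is the standard robots exclusion mechanism. As a minimal sketch, the example rule below and the Python check that follows are illustrative (the domain and page URL are placeholders, not part of OpenAI's documentation):

```python
from urllib import robotparser

# Illustrative robots.txt entry a site owner could publish to block
# OpenAI's crawler by its documented user-agent token, "GPTBot":
#
#   User-agent: GPTBot
#   Disallow: /
#
# The standard-library parser below reads a site's live robots.txt and
# reports whether GPTBot is permitted to fetch a given page.
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder domain
parser.read()

allowed = parser.can_fetch("GPTBot", "https://example.com/some-article")
print("GPTBot allowed:", allowed)
```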
Additionally, according to OpenAI, GPTBot will screen scraped material in advance to weed out personally identifiable information (PII) and anything that violates its policies. Some technology ethicists, however, argue that the opt-out approach still raises consent concerns.
Some commenters on Hacker News defended OpenAI's approach, arguing that the company needs to amass as much data as possible if people want powerful generative AI tools in the future. Another, more privacy-minded commenter complained that "OpenAI isn't even quoting in moderation. It obscures the original by creating a derivative work without citing it."
The launch of GPTBot comes amid recent criticism that OpenAI previously collected data without permission to train Large Language Models (LLMs) such as the ones behind ChatGPT. The company updated its privacy policy in April to address these concerns.
Meanwhile, a recent GPT-5 trademark filing hints that OpenAI may be working on the next version of its GPT model. Such a system would likely rely on large-scale web scraping to refresh and expand its training data. As of yet, however, there has been no official announcement concerning GPT-5.