Recently, the New York Times (NYT) took a preemptive measure to keep its content from being used to create and train artificial intelligence models. The New York Times banned the use of its content, which includes text, images, audio and video clips, metadata, and other forms of content, in the development of any software program. The new Terms of Service, which were updated on August 3, specifically forbid training a machine learning or artificial intelligence system on its data. It is not surprising that NYT’s decision gained attention after OpenAI released a new web crawling bot called GPTBot to expand the dataset for training its forthcoming generation of AI systems. According to OpenAI, the web crawler would collect data from publicly accessible websites while avoiding paywalled, sensitive, or illegal content.
However, the system is opt-out, which means the crawler is allowed by default and website owners have to take action to block it from accessing their content. Like the crawlers used by search engines such as Google, Bing, and Yandex, GPTBot assumes that any publicly available information is, by default, available for use. So the question remains: why has NYT banned OpenAI’s GPTBot web crawler despite the growing popularity of its AI chatbot ChatGPT? There might be several reasons behind this crackdown. Let’s take a quick look at what they might be.
OpenAI’s artificial intelligence chatbot ChatGPT is based on GPT large language models (LLMs), which are trained on vast datasets gathered from the internet. This means a web crawler used for data collection copies content, as is, from websites to feed into the LLMs for training. The responses generated by ChatGPT are based on this training data. The opt-out option for OpenAI’s web crawler came only after a series of lawsuits against the company for copyright infringement, as the training data gathered for earlier versions of GPT was used without the consent of the respective owners. The content created by New York Times authors and writers is protected by copyright. Since ChatGPT generates its responses without any attribution or credit to the original source of the information, it clearly violates copyright laws. Nor is there any compensation for the original content creators for this unconsented use. Considering this, it seems only fair on the part of NYT to prevent OpenAI from using its copyrighted content.
Apart from the copyright infringement issue posed by OpenAI’s web crawler, NYT might be concerned about ChatGPT stealing its thunder. The AI chatbot has the remarkable ability to produce all sorts of textual content based on detailed prompts provided by users. This is because of the vast amount of eclectic content it has been trained on. NYT is known for the quality of its written content, which is thoroughly researched and exhibits a unique writing style. Once the GPT models are trained on NYT’s content, it is not much of a stretch to say that ChatGPT may be able to imitate that style. Malicious actors could use this to create content under the name of the prestigious news organization, seriously affecting its reputation and credibility. There have been several similar instances since the advent of ChatGPT. Recently, author Jane Friedman protested that five books listed on Amazon under her name were not actually written by her. According to the author, the books are poorly written and were probably created using ChatGPT. Amazon later pulled the titles from sale.
$100 million Google Partnership
In May, The New York Times signed a deal with Google that will enable Alphabet to feature NYT content on several of its platforms, including the Google News Showcase, a product that pays publishers to feature their content on Google News and some other Google platforms. Google will pay the New York Times about $100 million over the course of three years as part of the deal. ChatGPT is now seen as a potential rival to Google, threatening to become the future of search engines. Keeping this in mind, NYT’s decision to ban the OpenAI web crawler may be a calculated move on Google’s part, made as part of the deal with the sole purpose of putting OpenAI at a disadvantage. This assumption is supported by recent reports that NYT is considering legal action against OpenAI over copyright infringement, which could easily turn into a high-profile legal tussle over intellectual property rights. There is speculation that if this lawsuit goes ahead and NYT is successful, OpenAI could be forced to completely erase ChatGPT’s dataset and start again using only authorized content, which would serve Google very well.
Repercussions for NYT
Despite the several valid reasons that support NYT’s decision, there may be some consequences for the news organization in the future. The advent of LLMs and their subsequent applications, such as ChatGPT and Bing Chat, is changing the way people search for information. Instead of visiting links on the internet, people now want a prompt response to their search queries, which AI chatbots are remarkably good at providing. Bing Chat is already able to access the internet and provide up-to-date information such as current events and news. It is only a matter of time before ChatGPT joins the race, too, considering OpenAI’s conscious efforts to partner with news organizations such as the Associated Press and the American Journalism Project for their training data. It can easily be said that AI chatbots such as ChatGPT could become the future of search engines.
Websites such as NYT that deny web crawlers access to their content might be sabotaging their own future. Naturally, Bing Chat and ChatGPT, both of which are based on GPT large language models, will only show content that they have been trained on or have access to. If these chatbots do become the future of search engines and NYT continues to prohibit the use of its content for AI training, the news organization might eventually lose its domain authority, directly impacting its readership. This may even affect the perceived credibility of the organization’s content, since chatbots whose training data lacks that content would have no basis for prioritizing it. Moreover, NYT’s competitors that decide to allow their content to be used for AI training are bound to have an edge over the news organization. Many companies use datasets like Common Crawl to create lists of websites to target with advertising, and since NYT won’t be in those datasets, its ad revenue may also suffer.
Considering all the points mentioned in the article, it is evident that there could be several reasons why NYT has banned OpenAI from using its content for training its AI models. While some of the reasons might seem pretty valid, there might also be serious repercussions to the NYT’s bold decision as the AI chatbots gain more traction every day.
Now, the question remains: should you allow GPTBot to crawl your websites? The answer depends on several factors. If you want to maintain or increase your website traffic, protect your copyrighted content, or avoid having your content taken out of context, or if you have any other valid reason, then you may consider blocking the web crawler for your own good. However, if these reasons are the least of your concerns and your sole purpose is to stay at the top of the rapidly changing search landscape, then allowing your data to be used can be seen as a wise decision.
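For site owners who do decide to block the crawler, the opt-out is typically expressed as a rule in the site’s robots.txt file. The sketch below is illustrative rather than an official recipe. It assumes that the crawler identifies itself with the user-agent token GPTBot and honors robots.txt, it uses a placeholder example.com URL, and the helper name is_allowed is hypothetical. It relies on Python’s standard urllib.robotparser module to check what a given rule set would permit.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption for illustration):
# block the crawler that announces itself as "GPTBot",
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/2023/08/some-article.html"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

With these rules, the check reports that GPTBot is blocked while a search crawler such as Googlebot is still allowed. Keep in mind that robots.txt is a request that well-behaved crawlers follow voluntarily, not an enforcement mechanism, so it complements rather than replaces terms-of-service restrictions like the ones NYT adopted.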