
[This post is authored by Bharathwaj Ramakrishnan. Bharathwaj is a 3rd year LLB Student at RGSOIPL, IIT Kharagpur, and loves books and IP. His previous posts can be accessed here.]
As perhaps all readers are aware by now, the GenAI copyright litigations have made their presence known in India. As discussed earlier, ANI has filed a copyright lawsuit against OpenAI in the Delhi High Court. Subsequently, Reuters reported (and others picked up the report; see here and here) that a new copyright suit had been filed against OpenAI by the Federation of Indian Publishers. However, the information available on the Delhi HC website on the case (pdf) makes it clear that the Federation of Indian Publishers has instead filed an intervention application in the existing suit. As per the Court hearing yesterday, DNPA and other copyright owners also wish to intervene in the suit (pdf) (also reported here).
These developments have to be seen in light of the numerous suits already filed against OpenAI and other GenAI companies for copyright infringement around the globe. There are very few details on the intervention application. Based on the hearing that happened on the 28th of Jan, it seems that the Court has not admitted the intervenors yet but has clarified that, even if they are allowed to be made party to the suit, the issues in the suit would be limited to the ones already framed by the Court (pdf). Likewise, both amici, Dr. Arul George Scaria and Adarsh Ramanujan, have filed their amicus briefs before the Court. The Court also decided that the crucial issue of jurisdiction would be heard holistically, even though the defendants had sought to have it settled as a preliminary matter.
The case raises important questions about copyright law and how generative AI needs to be treated under the existing copyright doctrines. I will use the issues that the Court framed in the first hearing to serve as signposts to ground the discussion in this post.

In addition to these issues framed by the Court, I would like to highlight two additional contentions that may play out in this litigation. First, there might be differing contentions regarding how a GenAI model learns from its training data and produces outputs. Second, there will be differing arguments on how much a model memorizes its training data and under what circumstances the Model spits out memorized training data as outputs. The post is divided into two parts: the first part deals with the background and the first two issues, while the second part will deal with the final two issues and the two additional contentions mentioned above.
With that long, winding introduction done, we can finally move forward!
Talkin’ ‘Bout the AI Supply Chain:
To better understand the issues framed by the Court, it is helpful to understand the ‘supply chain’ of these AI models. Lee, Cooper and Grimmelmann, in a very detailed paper, have explained this as involving eight stages.

The supply chain starts from the creation of expressive works, which includes all creative works that have been produced throughout human history (even the rock painting in Bhimbetka is included!) and ends in Generation (8th stage). As can be seen though, these stages are not necessarily linear, and there can be various interconnections between these stages. For example, a model post-deployment (6th stage) can also be fine-tuned again for improved performance in a specific narrow domain (5th stage). Likewise, images, text, or any other data generated at the 8th stage can form part of the dataset (3rd stage); in other words, synthetic data can be used to train GenAI models. It must also be noted that, as per the supply chain model, “training” is not a singular activity done pre-deployment. It is an activity that can happen both pre- and post-deployment. This gains significance because the first issue also deals with the question of whether training of the software ChatGPT is copyright infringement. With that said, as Lee et al. point out, copyright issues can arise at every stage of the supply chain. Now, with this background, we can attempt to locate the stages at which copyright concerns have been raised in the context of the OpenAI suit.
Locating the Copyright Concerns of ANI within the Supply Chain:
The first two issues reflect the two-pronged approach by ANI. The first issue deals with the question of whether “storage” of the plaintiff’s data, “training” of the defendant’s software (ChatGPT) and “use” of the plaintiff’s data to produce outputs lead to copyright infringement. Thus, one can conclude that copyright concerns are being raised because ANI’s copyrighted content is part of the dataset collection (Stage 3) and is being further used in Model training (Stage 4). It has to be noted here that even at the stage of generation (Stage 8), in response to user prompts, the Model itself might refer to online sources and respond (ChatGPT search refers to online sources while responding to prompts). The online sources might have included ANI, and as ANI alleges, the Model itself might be producing potentially infringing outputs. The blacklisting of the ANI website can be seen as a step to ensure that ChatGPT does not refer to ANI content online when producing generations and that ANI content is not used as part of the training dataset for future models. However, it has to be noted that this blacklisting does not address or answer the question of whether previous ChatGPT models were trained on ANI’s copyrighted content.
Understanding the Contentions Involved in the First Issue:
As per the news reports, ANI has alleged that the intermediate copying, or creation of training copies, carried out in service of training the Model (as seen in Stage 3), along with the training itself under Stage 4, constitutes infringement. The framing of the first issue also reflects this contention. Thus, the argument presumably could be that the defendant is engaging in the act of reproduction, a specific right vested with the copyright owner under Section 14 of the Act.
Yet, there is a counter-argument in this paper by Akshat Agrawal and Sneha Jain, that reproduction for the sake of reproduction, which in the end does not lead to consumption or enjoyment of the work, is not to be treated as reproduction. Thus, as Agrawal and Jain put it, can the reproduction of a book for use as a doorknob be infringement merely because a copy of the physical book was made?
Another interesting thing that needs to be noted in issue 1 is that, in bracketed text, the Court notes that ANI’s copyrighted content is in “the nature of news”. This specific observation indicates that the Court may look into the possible applicability of the idea-expression dichotomy and/or the applicability of fair dealing regarding the reporting of current events under S. 52(1)(a)(iii). The fair dealing argument will be dealt with in the second part.
On the idea-expression dichotomy, there are two levels. First, what is the scope of copyright protection in the copyrighted content that ANI is asserting against OpenAI? The Court has to filter out unprotectable ideas from the protectable expression. Second, does training a GenAI model on copyrighted content constitute infringement? If the Model merely learns patterns and ideas from its training datasets, it takes ideas, not expressions. As the Supreme Court in RG Anand observed, “There can be no copyright in an idea, subject matter, themes, plots or historical or legendary facts”, and any violation of copyright would be limited to expression. Thus, if the GenAI model is merely learning patterns or meta-information from its training data, then there is an argument to be made that the Model is merely learning the ideas, not the expression. It must be noted at this juncture that how a model learns from its training data is a contested issue, and I discuss that question in the second part of the post. Returning to the first level, the scope of copyright protection in the content asserted by ANI will gain relevance at the output stage, since ANI has also accused OpenAI of producing infringing outputs.
Making Sense of the Second Issue:
The second issue framed by the Court is interestingly worded. It concerns the “use by the defendants of plaintiff’s copyrighted data in order to generate responses for its users”, and it is worded widely enough that even the first issue might fall under it, since the use of ANI’s copyrighted content to train ChatGPT can itself be a use of that content for the purpose of generating responses for its users. So, it is slightly unclear what specific analysis or legal issue is sought to be resolved under the second issue. Is it the question of using ANI content as a reference to generate output, as in the case of ChatGPT search, or is it related to the question of whether the Model is regurgitating? In other words, is the Model spitting out memorized training data when prompted by an innocent user with no intention to make the Model generate infringing output? It is also possible that the second issue might give rise to differing contentions surrounding the adversarial user, but I will touch on this in the second part of this post.
I will continue the discussion on the last two issues and the two contentions in the second part of the post.