[This Post has been authored by our former blogger Varsha Jhavar. Varsha is a lawyer based in Delhi and is a graduate of Hidayatullah National Law University, Raipur. Her previous posts on the blog can be viewed here, here, here, here.]
Considering the widespread use of AI today, there appears to be a need to regulate its development and ensure it is actually benefitting humanity. Countries have started taking steps in that direction, such as the USA’s Algorithmic Accountability Act of 2022, the European Commission’s proposed Artificial Intelligence Act, Canada’s proposed Artificial Intelligence and Data Act and the Beijing AI Principles. Italy’s Data Protection Authority went a step further and ordered ChatGPT to stop processing people’s data with immediate effect, on the basis that no information is provided to the users whose data is collected by OpenAI. It noted that there does not appear to be any legal basis for such data collection and its processing for training the chatbot’s algorithm. However, the ban was lifted after ChatGPT’s developers made data privacy improvements.
In Part I of this post, I contend that there is a need to regulate the development and use of AI, specifically from an IP-centric perspective. In Part II, I attempt to explore certain aspects that could be regulated in order to ensure that AI is responsibly and ethically developed. To clarify, I have not argued for the introduction of laws granting authorship/inventorship and/or ownership to AI: not only would that require time and resources, but the need for it may not be as urgent as the need to regulate the use of materials/content protected under copyright and trademark law for the training of AI.
AI training data sets & Copyright Infringement
Generative AI systems, i.e. the type of AI that can generate text, code, art, music, etc., are trained on data that has been scraped from websites and could be the subject matter of copyright protection. In such circumstances, would the AI (or its developer) be liable for copyright infringement? If the user provides an input that is unlawful or would result in copyright infringement, should the AI be held liable? It could potentially be considered an intermediary, as it is generally used in a manner akin to a search engine. However, it does not host links to third party websites. Unlike search engines, which reflect the information/content available online, ChatGPT and AI art generators do not provide links to the websites where the relevant information is available; rather, they provide an answer without attribution.
The suits that have been filed in the US provide some insight into the potential issues that could arise on a large scale when the data that AI has been trained on is subject to copyright and trademark protection or when AI ends up replicating such data upon being provided with an input. In this post, the aim is to explore ways to deal with such issues. The copyright implications of ChatGPT are also relevant for this post and have been previously covered on the blog.
Getty Images Suit
In a complaint filed before the US District Court for the District of Delaware, Getty Images has accused Stability AI, creator of the (free) art generator Stable Diffusion, of copyright infringement, trademark infringement, unfair competition and providing false/altered copyright management information (CMI). It has supported its claim by providing 7,216 examples of text-and-image pairings claimed to have been copied without authorization or compensation. As stated in the complaint, Stable Diffusion seems to have been trained on a dataset created by a German entity, which used content from Getty without any authorisation. Getty also alleges that Stability AI’s actions are contrary to its Terms of Use, which prohibit the “(a) downloading, copying or re-transmitting any or all of the Site or the Getty Images Content without, or in violation of, a written licence or agreement with Getty Images; (b) using any data mining, robots or similar data gathering or extraction methods”. Interestingly, Getty claims that the text-and-image pairings on its website have been critical to the successful training of Stable Diffusion to provide appropriate output and that the unauthorised copies of its copyrighted content are not transitory in nature.
However, the claim that the copies are not transitory appears to be incorrect, as the images/dataset that the AI has been trained on are not stored or reproduced within Stable Diffusion; instead, the model analyses the similarities/patterns between the images and stores such information. This information is used by the AI to generate new images. This is a factor that could affect Getty Images’ chances of success in court, but the fact that Getty Images has in the past licensed content from its platform to an AI art generator might help tilt the scales in its favour. This suit thus illustrates the potential copyright and trademark implications of widespread copying by generative AI.
GitHub Copilot Suit
GitHub Copilot (Copilot) is a subscription-based tool that helps coders by filling in blocks of code using AI. Copyrighted materials which have been made publicly available on GitHub, a hosting service for storing and managing code, are subject to various open-source licenses prescribing the terms of use of the works. The common terms/conditions in most of these licenses are: attribution of the author, a notice of copyright and a copy of the license. The complaint states that Codex (used by Copilot to suggest code) and Copilot have been trained on open-source code from GitHub, and that the output often simply reproduces code which is near-identical to the training data (albeit without attribution and copyright notice) and can be traced back to the source, i.e. someone else’s work. Interestingly, the complaint alleges that the Defendants have claimed that Codex and Copilot do not retain copies of the materials they have been trained on.
Potential aspects for regulation
The long-term monopoly provided by copyright law is what supposedly motivates authors to create works of literature, art, and music. How does this incentive mechanism work with AI-generated creations? Recently, an AI-generated song featuring AI-replicated vocals in the style of Drake duetting with The Weeknd was uploaded on platforms such as Apple and Spotify. In the wake of AI-generated songs becoming available on streaming services, Universal Music Group has asked streaming platforms to block AI from scraping lyrics and musical compositions from its copyrighted songs. In fact, Google has refrained from launching its text-to-music AI, called MusicLM, as about 1 percent of the music generated by it was found to be a direct reproduction of copyrighted works.
A Disney illustrator, Hollie Mengert, found that 32 of her pieces were downloaded by a student and used to train Stable Diffusion to recreate her art style. In scenarios like this, the courts might come to the conclusion that the student’s use amounts to copyright infringement, but another factor that would need to be taken into account is that the use was non-commercial in nature. What if someone feeds all of Dan Brown’s novels into an AI and asks it to produce a novel in his style? Or if someone asks ChatGPT to show important extracts from a certain chapter of a copyrighted book? In India, the use of copyrighted works in a training dataset without permission (and for paid versions of generative AI) is likely to be held to be copyright infringement (discussed here previously). Thus, there seems to be a need to regulate certain aspects of AI, and Part II of this post shall explore some of these aspects.