
Part II – Applying Natural Intelligence (NI) to Artificial Intelligence (AI): Understanding ‘Why’ Training ChatGPT Transcends the Contours of Copyright


Continuing from his earlier post, where he explained the technical workings of Large Language Models and where the different copyright questions arise, in this post Shivam Kaushik argues that LLMs are in effect interacting with the non-expressive parts of the works in question. He further questions whether a Text and Data Mining (TDM) exception is even required in the Indian Copyright Act. Part I of this two-part series can be accessed here. Shivam is an LLM candidate at NUS Law specializing in IP and Tech Laws and a Research Assistant at the Lumens Machine Learning Project. He is interested in exploring the legal issues posed by emerging technologies. His previous posts can be accessed here.


Shivam Kaushik

When I say copyright, it means just what I choose it to mean – nothing more nor less.

In a nutshell, during training the LLM decomposes, abstracts, and constructs not text, but representations of the relationships among the tokens it generates from that text. Now, one obvious question arises from the copyright infringement perspective: once the text is converted into tokens, each assigned a token ID and abstracted into numeric representations (vectors and word embeddings), is any ‘expression’, which copyright ostensibly seeks to protect, left in the work? There seems to be little doubt that ChatGPT has ‘used’ the copyrighted text. But is all use of the text protected by copyright? For instance, any text embodies the following:

[Image: a gradation from ‘idea’ to ‘semantics’ to ‘syntax’ to ‘expression’, with expressiveness increasing and linguistic functionality decreasing along the way; only ‘expression’ is protectable under copyright.]

It is well established that copyright over a work does not give exclusive rights over the idea imbued in the text. Similarly, the meaning of the words used in the text (semantics) and the grammatical arrangement of words (syntax) fall beyond the ambit of copyright protection. The only thing copyright protects is the ‘expression’ (I don’t think a source is needed for this). Now, when the LLM devours the text’s semantics, syntax, conceptual relationships and other underlying features, doesn’t it seem too far-fetched for any author to argue that she has ‘exclusive rights’ over them? Aren’t these elements, ideally speaking, the “non-expressive” parts of the work?

A language model merely compresses and cross-references linguistic information to identify predictable patterns and reduce redundancies by representing meaning probabilistically. During pre-training, the ‘identity’ and ‘wholeness’ of copyrighted works are lost, and they are stripped of everything but their raw linguistic essence, functionality and utility. The compression captures only relational meaning in mathematical form. The aspects of the copyrighted work ‘used’ by the model during pre-training are the mathematical representations of word relationships. Thus, pre-training ‘transcends’ the limits of copyright, as it abstracts text into multi-dimensional numeric representations and patterns. Copyright can only protect the original expression, not the statistical relationships between words, irrespective of whether the source is copyright-protected. In an article published back in 2019 in the Journal of the Copyright Society, Prof. Matthew Sag cited the use of software to identify patterns of speech, relationships, or the frequency of particular words as instances of non-expressive use (pp. 301-302).
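To make the technical point concrete, here is a minimal, purely illustrative Python sketch. The vocabulary, token IDs and vectors are invented for this example and are not taken from any actual model’s pipeline; it only shows, mechanically, what “abstracting text into numeric representations” means: once a sentence is reduced to token IDs and embedding vectors, what gets ‘used’ downstream are numbers and the relationships between them.

```python
# Toy sketch (not any real LLM pipeline): text -> token IDs -> vectors -> relationships.
# The vocabulary, IDs and embedding values below are all made up for illustration.

import math

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# Toy "embeddings": each token ID maps to a small vector of numbers.
embeddings = {
    0: [0.1, 0.0, 0.2],
    1: [0.9, 0.8, 0.1],
    2: [0.2, 0.7, 0.6],
    3: [0.1, 0.1, 0.3],
    4: [0.8, 0.7, 0.2],
}

def tokenize(text):
    """Turn text into a list of integer token IDs."""
    return [vocab[w] for w in text.lower().split()]

def cosine(u, v):
    """Similarity between two vectors: a word-to-word relationship as a single number."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

ids = tokenize("the cat sat on the mat")
print(ids)                                   # [0, 1, 2, 3, 0, 4]
print(cosine(embeddings[1], embeddings[4]))  # 'cat' vs 'mat': just a number
```

The sketch deliberately uses only the standard library, so the mechanics stay visible: the cosine similarity at the end is the kind of “relationship expressed as a number” that pre-training traffics in.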

The US Court of Appeals for the Second Circuit reached a similar finding, though from a different perspective, in Authors Guild, Inc. v. HathiTrust (2014). The Court held that the copyright owner cannot assert copyright against a text-searchable database holding the copyrighted work without authorisation, as “the result of a word search is different in purpose, character, expression, meaning, and message from the page (and the book) from which it is drawn.” Pertinently, the Second Circuit made a finding on fair use in the case, holding the use to be transformative, without examining whether there was copyright infringement in the first place. Calling an act fair use when its purpose is changed is one thing, but my point is that when the ‘expression’ and ‘meaning’ of the copyrighted work are changed, such use ‘transcends’ the ambit of the right called copyright.

Certain jurisdictions, such as Singapore, have a Computational Data Analysis (CDA) exception (s.243) covering the identification, extraction and analysis of information or data from a work, including using the work to improve the functioning of a computer program. For this particular ‘use’, the statute even allows making a copy of the work in question (s.244). India has no Text and Data Mining (TDM) or CDA exception. However, the use of the non-expressive elements of a copyrighted work is already accounted for within the concept of copyright itself, and there is no per se need for a statutory exception. Prof. Tim Dornis echoes the view that TDM or CDA does not require an exception in his recent paper. He further adds that the issue of infringement crops up only because non-protectable, non-expressive information is embedded in a copyright ‘container’ or ‘shell’ (p.7). He also says that the underlying aim of such exceptions is to legalise the copies and reproductions that precede TDM. However, Prof. Dornis has a more fundamental beef with the proposition being canvassed here (p.11). He argues that since AI does not differentiate between semantic (non-expressive) and syntactic (apparently expressive) information during training, it infringes copyright. However, “somewhat surprisingly” (to quote Mikolov et al.), in his entire paper he does not explain the basis for calling syntax a subject matter of copyright. It is inconceivable that anyone could monopolize the grammatical structure and arrangement of words in a language.
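The Mikolov et al. reference above is to the word2vec line of work, which found that learned word vectors encode semantic and syntactic relationships as geometric offsets. The following is a toy sketch of that idea, with two-dimensional vectors invented purely for illustration (real models learn such vectors from co-occurrence statistics during training), showing how a grammatical relationship can live in arithmetic on vectors rather than in any stored passage of text.

```python
# Toy illustration of the word2vec-style "relationship as vector offset" idea.
# All vectors are invented for the example; real models learn them from data.

vectors = {
    "walk":   (1.0, 0.2),
    "walked": (1.0, 1.2),   # past tense ~ same word shifted by roughly (0, 1)
    "swim":   (3.0, 0.3),
    "swam":   (3.0, 1.3),
}

def add(u, v):  return (u[0] + v[0], u[1] + v[1])
def sub(u, v):  return (u[0] - v[0], u[1] - v[1])
def dist(u, v): return ((u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2) ** 0.5

# "walk is to walked as swim is to ?" becomes pure vector arithmetic:
query = add(vectors["swim"], sub(vectors["walked"], vectors["walk"]))
closest = min(vectors, key=lambda w: dist(vectors[w], query))
print(closest)  # 'swam': a grammatical regularity captured as geometry, not text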

Cautions and Disclaimer

There will be no conclusion to this post. The final word on LLMs is yet to be spoken. However, it is important to put out a few words of caution. The word ‘AI’ is an attention grabber (that is why it was used in the title of the post) but has little substance to offer (that is why it has not been used in the body of the post). Instead of making broad-stroke arguments about AI, it would be more instructive for legal academics to deal with specifics. For example, what I say here is very specific to text and language and would certainly not be applicable to Midjourney, Stable Diffusion and DALL·E. Judges, lawyers and policymakers will have to appreciate the nitty-gritty and not generalize. Justice cannot be based on assumptions. Generalizations create prejudice, not fairness. The discussion on LLMs and copyright cannot and should not be resolved with a superficial understanding of AI. Saying it is a black box is just not enough.

Also, a point worth reiterating is that this discussion has been limited to the pre-training stage. It does not delve into fine-tuning, RLHF (reinforcement learning from human feedback), storage and, most importantly, the output stage. Subsequent posts ‘might’ follow, dealing with these topics and building upon the ideas discussed here.

