It is best if Copilot copies everything

Jul 13, 2021

Last week, GitHub released its new AI-powered programming assistant, Copilot, sending shockwaves through the community. It is a surprising technology, though not exactly new, as similar assistants like Tabnine already existed. Copilot takes it one step further: it does not look like mere autocompletion but like a full suggestion of what you might be trying to achieve. Still, most of the noise has come from the potential legal implications of both using and training Copilot.

Time for the “I Am Not A Lawyer” disclaimer. I do not intend to offer any solution to the legal mess this will create, but just to reflect on how bad these new problems really are in the light of what we have seen in the past.

So, what is all the fuss about?

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;  // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );  // what the fuck?

Copilot is suggesting almost literal chunks of code like the one above, a famous snippet from Quake III, funky comments and all. People who have access to the closed beta are producing tons of these examples: some are useless and funny, some hilarious, and others raise privacy concerns. This has generated controversy around whether these examples are copyright violations. But is any of this illegal?

GitHub trained Copilot using a lot of code under copyleft licenses such as the GPL, which require any derivative work to be distributed under the same terms. Even if Copilot modifies the original source, the resulting work could still be considered a derivative work. However, copying copyrighted material without permission is not always an infringement. To balance public benefit against incentives for creation, a copy can fall under “fair use”, so we cannot assume that merely copying some lines of code will result in a copyright violation.

Where do we draw the line?

I was curious about where exactly the lines lie for computer programs.

“…computer programs to the extent that they incorporate authorship in the programmer’s expression of original ideas, as distinguished from the ideas themselves.”
17 USC 102: Subject matter of copyright: In general

Copyright law was not originally designed for software; it has evolved from ideas originally meant to protect books to include maps, music, videos, and computer programs. An interesting problem arises when the propagation of ideas influences the accumulation of knowledge and future creativity, which may be the case for science and computer programs. Technological advances change how creative work is produced, and copyright law needs to evolve to keep up with them. Access to information and sources of inspiration helps us produce works that go beyond what was previously achievable. A work that builds on top of other people's work and knowledge can still be original, transformative, and add new expression.

Copyright is not intended to protect ideas but original work; it is about the author's expression. Computer programs are included in copyright law as a specific case of literary work. However, their functional nature makes them subject to a much lower degree of protection than highly expressive works like music, paintings, and novels. The same applies to more factual works, like biographies and historical works.

In the US, copyright law allows copying and replicating protected work without the author's permission under fair use. Even so, establishing fair use is far from straightforward. It is done on a case-by-case basis considering four factors: the purpose and character of the use, the nature of the work, the amount copied, and the effect on the market.

Next, I will explore some past cases and the theory behind the boundaries of fair use.

The purpose and character of the use

The first factor analyses whether the new work adds new expression, meaning, or value of any sort. In 2008, the artist Richard Prince created a series of paintings using 35 images from a photographer's book to generate around 30 appropriations. 25 of them were deemed fair use, while the other 5 were sent back to a lower court for further review.

On the left Cariou’s photograph, on the right Prince’s painting.

The Prince vs Cariou case exemplifies how counterintuitive it can be to qualify a copy as a violation of copyright. For software, it gets even harder.

There are many ways in which the first factor can tip a decision in favor of fair use. In the three cases presented below, the use of copyrighted works was considered beneficial for the public because it added value or character and was transformative:

Google reused parts of the Java API definitions in Android, around 11,000 verbatim lines of code, allowing developers to work in a new environment without discarding a familiar language.
Connectix reverse engineered the PlayStation BIOS to extract its APIs and implement a desktop emulator, allowing owners of games to use them on a different platform.
Accolade disassembled Sega Genesis games and copied the Trademark Security System code to release its own unlicensed games, circumventing the protections and allowing it to compete with Sega-licensed games.

The nature of the work

The nature of the program does matter. Is the source published or unpublished? Does it contain only definitions, or implementations too? What is its purpose?

In the Connectix vs Sony case, Connectix needed to copy Sony’s BIOS to be able to examine its unprotected functionality. Therefore the BIOS had a lower level of protection from copyright law. Allowing public access to the ideas and functional elements behind the copyrighted software was more important than protecting the BIOS code.
In the Google vs Oracle case, the Java API definitions used in Android were bound up with ideas that are not copyrightable, such as the organization of an API. Their nature helped the case for fair use.
Constraints introduced by hardware, software, industry practices, or standards play a role in determining fair use, for example when building a different expression is impractical. Even if a piece of code can be written in multiple ways, copyright might not protect it from being copied verbatim. This is what happened when the company Static Control Components (SCC) copied the whole of the code included in Lexmark toner cartridges to circumvent the mechanism that prevented cartridges from being refilled for use in Lexmark printers. While possible, it was impractical for SCC to try to write a different program.

The amount

The amount of the original work that was copied also plays a role. In Google vs Oracle, it was pointed out that Google had copied only 0.4% of the Java source code, and thus the amount was minimal. Similarly, the amount of code that Accolade copied to run its unlicensed games on the Sega Genesis was tens of thousands of times smaller than a complete game, making each game overwhelmingly original content.

It must be said, however, that amount should not be interpreted literally as a relative size in bytes or lines of code. The relative importance or quality of the copied part also matters. The code in question might be small, yet still be the core or heart of the expression. A magazine lost its fair use case after publishing only 300 words of Gerald Ford's unpublished memoirs of 500,000 words, because those 300 words were among the most moving parts of the book.
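To make the point about ratios concrete, here is a minimal sketch in C that compares the copied share in two of the cases above. It uses only the figures quoted in this article, which I have not independently verified, and the function names are hypothetical helpers made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Share of the original work that was copied, as a percentage.
   Hypothetical helper; units can be words, lines, or bytes. */
double copied_share_percent(double copied_units, double total_units)
{
    assert(total_units > 0.0);
    return 100.0 * copied_units / total_units;
}

/* Prints one comparison row. The outcome is what the court decided;
   it is passed in by the caller, not derived from the ratio. */
void print_case(const char *name, double share_percent, const char *outcome)
{
    printf("%-18s %6.2f%% copied -> %s\n", name, share_percent, outcome);
}
```

Calling copied_share_percent(300.0, 500000.0) for the Ford memoirs gives 0.06%, well below the 0.4% quoted for Google vs Oracle, yet the magazine lost and Google won. The ratio alone clearly does not decide the third factor.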

The effect

The damage to the potential market or value of the copyrighted material is the last factor. In Google vs Oracle, it was determined that Android was not a substitute for Java SE. Similarly, Accolade did not diminish the market for Sega's games.

In the other two cases mentioned earlier, the reasoning is different. While it is possible that Connectix's PlayStation emulator caused some economic loss to Sony, it is not the role of copyright law to protect a monopoly on the devices that can play the games. In the SCC vs Lexmark case, it is plausible that SCC's chip caused losses to Lexmark in the toner cartridge market. However, copyright law focuses on the market for the work that is being protected, the protection code, and in this case no such market existed. As in the Sony vs Connectix case, copyright law was not there to protect Lexmark's toner loading program.

Conclusions

Cases involving source code copyright violations are harder to assess because they are neither as common nor as old as the cases we can find in other fields like art, so there are fewer examples that can be used as guidelines. However, while protectable by copyright, software has a broader scope of fair use than highly expressive works such as novels, music, and paintings.

Copilot might cause developers to unknowingly copy parts of programs protected by copyright law. For now, based on what I have seen of Copilot, the copies will likely be limited to short statements and short complete functions. I consider it a stretch to call a function an original work with expression. It is the entire program that achieves that, and using that piece of code in an entirely different context seems quite transformative. The types of functions that Copilot seems to output in full also seem simple enough to constrain the number of truly different ways in which they can be written. The amount of code that Copilot suggests each time is limited. It is unlikely that it will steal the heart of a computer program, and it is hard to envision a situation where the particular code had a market that needs to be protected.

I find the impact of these technologies much more concerning in disciplines where there is far more precedent and the original work is better protected, such as game art. We already have similar tools that help artists build incredible animations, recreate models, and produce AI-generated concept art, images, and videos. Soon we will have one of these tools embedded in Photoshop, Maya, and other design software. As Copilot does, these tools will suggest tons of full models, concepts, textures, and compositions based on short descriptions or unfinished work. They might modify the results slightly using style transfer techniques to make everything match the lore and visuals of your game universe, and they will turn existing UIs into complete, newly styled screens. At first, some results will surely be rip-offs of publicly available but protected sources. We are about to get an avalanche of complicated copyright cases.

People have always used sources and art to inspire and ground their work. They learn and train from others' work, and many times things end up a little too inspired. As we gain more sources of information and accumulated knowledge, the creation process will be assisted more and more, so that artists and developers can focus on the one thing that the machine cannot achieve, at least not soon: being creative.

The problem will become ubiquitous, and the law will need to evolve to handle it. We must remember that the goal is to strike a balance between the greater good and the incentives for creation. If these advancements lead us to small teams building games like GTA or The Witcher III in months, I would say we have achieved both.

Thanks to Duncan Mac-Vicar P. for his views and suggestions as an open-source expert.

References

Fair Use - Copyright Overview by Rich Stim - Stanford Copyright and Fair Use Center

fairuse.stanford.edu

Supreme Court — Lexmark vs SCC

Supreme Court — Google vs Oracle

Harvard School of Law — Sega vs Accolade

Harvard School of Law — Sony vs Connectix

Good examples to see the potential of AI generating art:

www.artbreeder.com

Deep Dream Generator

deepdreamgenerator.com

DALL·E: Creating Images from Text

openai.com

Image GPT