
Most of GitHub is duplicated code

24 November 2017

Researchers shocked

About 70 percent of the code stored on GitHub is duplicated, a study has found.

An international team of eight researchers working for the University of California were surprised when they began studying how much files changed between different clones. What they discovered was a “staggering rate of file-level duplication”.

Presented at this year's SPLASH conference in Vancouver, the research found that out of 428 million files on GitHub, only 85 million are unique.

The paper covered 4.5 million non-fork projects hosted on GitHub, which between them hold the 428 million-plus files, written in Java, C++, Python, and JavaScript.

In other words, 70 percent of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication: only six percent of its files are distinct. Java, on the other hand, has the least duplication, with 60 percent of its files distinct.
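
For readers who want a feel for what "distinct files" means in practice, here is a minimal sketch. It assumes that two files count as duplicates when their contents are byte-for-byte identical, and it uses a SHA-256 hash to compare them. The article does not describe the researchers' actual matching method, so the function name `distinct_file_rate` and the sample data are purely illustrative.

```python
import hashlib


def distinct_file_rate(files):
    """Return the fraction of files with distinct content.

    `files` is a list of (path, content_bytes) pairs. Two files are
    treated as duplicates only if their bytes are identical (an
    assumption made for this sketch, not the study's stated method).
    """
    digests = [hashlib.sha256(content).hexdigest() for _, content in files]
    if not digests:
        return 0.0
    return len(set(digests)) / len(digests)


# Hypothetical sample: one library copied into three projects,
# plus one file that exists only once.
sample = [
    ("a/jquery.js", b"function $(){}"),
    ("b/vendor/jquery.js", b"function $(){}"),
    ("c/lib/jquery.js", b"function $(){}"),
    ("a/main.js", b"console.log('hi')"),
]

print(f"{distinct_file_rate(sample):.0%} of files are distinct")
```

Exact hashing only catches verbatim copies. A file with a changed comment or different whitespace would count as distinct here.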

A project-level analysis shows that between nine percent and 31 percent of the projects contain at least 80 percent of files that can be found elsewhere.
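
A project-level measure of this kind could be computed roughly as follows. This is a sketch under the same assumption of exact, byte-identical matching. The function `heavily_cloned_projects`, its `threshold` parameter and the toy projects are hypothetical and not taken from the study.

```python
import hashlib
from collections import defaultdict


def heavily_cloned_projects(projects, threshold=0.8):
    """List projects where at least `threshold` of files also appear elsewhere.

    `projects` maps a project name to a list of file contents (bytes).
    A file "appears elsewhere" if another project holds a file with an
    identical SHA-256 digest (an assumption made for this sketch).
    """
    owners = defaultdict(set)  # digest -> names of projects containing it
    hashed = {}                # project name -> list of digests
    for name, contents in projects.items():
        hashed[name] = [hashlib.sha256(c).hexdigest() for c in contents]
        for digest in hashed[name]:
            owners[digest].add(name)

    result = []
    for name, digests in hashed.items():
        if not digests:
            continue  # skip empty projects
        shared = sum(1 for d in digests if len(owners[d]) > 1)
        if shared / len(digests) >= threshold:
            result.append(name)
    return sorted(result)


# Hypothetical toy data: app1 and app2 share four files, app3 is original.
projects = {
    "app1": [b"x", b"y", b"z", b"w", b"own code"],
    "app2": [b"x", b"y", b"z", b"w"],
    "app3": [b"unique"],
}

print(heavily_cloned_projects(projects))
```

In this toy data, four of app1's five files and all of app2's files exist in another project, so both reach the 80 percent mark, while app3 does not.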

These rates of duplication have implications for systems built on open source software as well as for researchers interested in analysing large code bases.

The team has created DéjàVu, a publicly available map of code duplicates in GitHub repositories.

“There is a lot more duplication of code that happens in GitHub that does not go through the fork mechanism, and instead, goes in via copy and paste of files and even entire libraries”, the study noted.

Last modified on 24 November 2017