An international team of eight researchers at the University of California was surprised when a study of how much files change between different clones revealed a “staggering rate of file-level duplication”.
Presented at this year's SPLASH conference in Vancouver, the research found that out of 428 million files on GitHub, only 85 million are unique.
A project-level analysis shows that in between 9 and 31 percent of projects, at least 80 percent of the files can also be found elsewhere.
These rates of duplication have implications for systems built on open source software as well as for researchers interested in analysing large code bases.
The team has created DéjàVu, a publicly available map of code duplicates in GitHub repositories.
“There is a lot more duplication of code that happens in GitHub that does not go through the fork mechanism, and instead, goes in via copy and paste of files and even entire libraries”, the study noted.
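The file-level duplication the study measures can be illustrated with a simple content-hashing approach: two files are exact duplicates when their bytes hash to the same digest. The sketch below is a minimal illustration of that idea, not the authors' actual DéjàVu pipeline (which also analyses near-duplicates at the token level); all names in it are hypothetical.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def duplicate_groups(paths):
    """Group file paths by the SHA-256 of their contents.

    Groups containing more than one path are file-level
    exact duplicates, the simplest kind the study counts.
    """
    by_hash = defaultdict(list)
    for p in paths:
        with open(p, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        by_hash[digest].append(p)
    return [group for group in by_hash.values() if len(group) > 1]

# Demo: three small files, two of which are byte-identical.
with tempfile.TemporaryDirectory() as d:
    paths = []
    for name, text in [("a.py", "print('hi')"),
                       ("b.py", "print('hi')"),
                       ("c.py", "print('bye')")]:
        p = os.path.join(d, name)
        with open(p, "w") as f:
            f.write(text)
        paths.append(p)
    groups = duplicate_groups(paths)
    print(len(groups))       # one duplicate group
    print(len(groups[0]))    # containing two files
```

Scaled to hundreds of millions of files, this kind of exact matching is what lets duplication be counted at repository scale without comparing every pair of files directly.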