The project explores a theory of open source software (OSS) evolution based on statistical natural language processing techniques. Based on the emerging recognition that software code is, in many ways, as "natural" as natural language (e.g., English), there is a trend to apply statistical models for software development tasks such as code analysis, comprehension, and programmer support. This grant extends the "naturalness of code" theory by studying how the code lexicon evolves in open source software as different developers work on a software project and features are added, modified, deleted. The goal is to learn the extent to which the evolution of a developer's lexicon follows the laws of natural language evolution.
To create the needed demonstration, large datasets of code lexicons are being collected from a large number of OSS projects and their revisions (on GitHub and SourceForge). The main constructs of the frequency model of natural language evolution will be applied to track and identify the main patterns of language changes (e.g., birth, propagation, death of terms in the lexicon) throughout OSS projects life cycle. Part of the challenge is to better understand how events that instigate code evolution, such as maintenance activities and team formation, are fundamentally different from the events that instigate change in natural language, such as war and migration. The research should lead to new ways to predict software project outcomes and to improve software productivity and quality. The project will make available the data, tools, and algorithms that will be produced by the project, which will support future work to understand the dynamics of code evolution in open source software ecosystems.