Natural Language Processing is a challenging field because of how unstructured the data within it is. Finding and analyzing hidden patterns among the noise is where a data scientist earns the big bucks.
Recent progress in this field has included identifying the author of a given piece of literature. This has been automated to quite an extent, but how can we apply it to a programmer’s code? Code is also a collection of text and numbers, albeit organized in a very different manner.
A couple of researchers from Drexel University and the George Washington University have revealed (in this brilliant WIRED article) that code, just like literature, can also be analyzed to identify and pinpoint the author. They will be presenting their work at the DefCon hacking conference later this week.
So how did these researchers design their system? First, the algorithm identifies the features present in samples of code. The two researchers then narrowed these down to only the features that helped them distinguish individual developers. This cut down the number of features significantly.
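To make the feature-narrowing step concrete, here is a minimal sketch in Python. The article does not say which selection criterion the researchers used, so information gain is my assumption here, and the authors, feature names, and the 0.5 threshold are all made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of author labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(samples, feature):
    """How much knowing whether `feature` is present reduces uncertainty about the author.

    `samples` is a list of (author, set_of_feature_names) pairs.
    """
    labels = [author for author, _ in samples]
    present = [author for author, feats in samples if feature in feats]
    absent = [author for author, feats in samples if feature not in feats]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder


# Hypothetical toy data: two authors, two code samples each.
samples = [
    ("alice", {"uses_semicolons", "for_loop", "tabs"}),
    ("alice", {"uses_semicolons", "for_loop", "tabs"}),
    ("bob", {"uses_semicolons", "while_loop", "spaces"}),
    ("bob", {"uses_semicolons", "for_loop", "spaces"}),
]

all_features = set().union(*(feats for _, feats in samples))
THRESHOLD = 0.5  # arbitrary cut-off, chosen only for this example

# Keep only the features that actually help tell the authors apart.
selected = sorted(f for f in all_features if information_gain(samples, f) >= THRESHOLD)
print(selected)
```

A feature that every author uses (like `uses_semicolons` above) carries no information about who wrote the code, so it gets dropped, while a feature that splits the authors cleanly is kept.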
The researchers created an “abstract syntax tree” for the code, which was used to recognize its underlying structure. As you can imagine, the algorithm requires a few samples from each programmer to train. In this research paper, the researchers and their co-authors showed that it’s even possible to identify the programmer using just their compiled binary code.
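If you have never looked at an abstract syntax tree, the sketch below shows the general idea using Python’s built-in `ast` module. This is only a stand-in: the article does not say which parser or which tree features the researchers used, and the `ast_features` helper and the two sample snippets are my own inventions.

```python
import ast
from collections import Counter


def ast_features(source: str) -> Counter:
    """Count AST node types and parent>child node-type pairs in Python source code."""
    tree = ast.parse(source)
    counts = Counter()
    for node in ast.walk(tree):
        parent = type(node).__name__
        counts[parent] += 1
        for child in ast.iter_child_nodes(node):
            counts[f"{parent}>{type(child).__name__}"] += 1
    return counts


# Two ways of writing the same computation, in two different "styles".
sample_a = "total = 0\nfor i in range(10):\n    total += i\n"
sample_b = "total = sum(i for i in range(10))\n"

features_a = ast_features(sample_a)
features_b = ast_features(sample_b)

print(features_a["For"], features_b["For"])
print(features_a["GeneratorExp"], features_b["GeneratorExp"])
```

Both snippets compute the same sum, but their trees look different: one contains an explicit loop with an augmented assignment, the other a generator expression. Structural habits like these survive renaming variables or reformatting whitespace.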
The researchers used code samples from Google’s annual Code Jam competition to test their algorithm. It achieved an impressive 96% accuracy when distinguishing among 100 individual coders (each with eight code samples). But the accuracy dropped to 83% when the number of programmers was increased to 600.
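For a feel of how such an accuracy figure is measured, here is a toy evaluation in Python. The article does not name the classifier the researchers used, so I am assuming a simple nearest-neighbour match on cosine similarity, scored with leave-one-out testing. The authors and feature counts are made up.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def predict_author(query: Counter, labelled):
    """Return the author of the labelled sample most similar to `query`."""
    return max(labelled, key=lambda pair: cosine(query, pair[1]))[0]


def leave_one_out_accuracy(labelled) -> float:
    """Hold out each sample in turn and try to attribute it using the rest."""
    correct = 0
    for i, (author, feats) in enumerate(labelled):
        rest = labelled[:i] + labelled[i + 1:]
        correct += predict_author(feats, rest) == author
    return correct / len(labelled)


# Hypothetical feature counts for two authors, two samples each.
labelled = [
    ("alice", Counter({"For": 4, "AugAssign": 3})),
    ("alice", Counter({"For": 5, "AugAssign": 4, "While": 1})),
    ("bob", Counter({"While": 4, "ListComp": 2})),
    ("bob", Counter({"While": 3, "ListComp": 3, "For": 1})),
]

print(leave_one_out_accuracy(labelled))
```

With only two very different authors the toy data is trivially separable. The task gets harder as the pool of candidates grows, which is consistent with the drop the researchers saw going from 100 to 600 programmers.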
The curious (though not altogether surprising) finding was that it was far easier to recognize experienced programmers from their code than newcomers. I imagine this must be because of the number of samples available, plus the fact that each programmer embeds their own unique style in each piece of their code.
I previously covered DeepCode’s efforts to clean up a programmer’s code, but this latest project is a whole different beast. It tackles a variety of problems, like plagiarism and identity theft. It could also help in cyber security by identifying who created a specific piece of malware.
I believe we are still a fair bit away from seeing this algorithm being used in practical scenarios given how complex the problem is. Let’s wait and watch where this study leads us in the near future.