
GitHub and the Flight to Good Coding Practices

By Raluca Blujdea

Have you ever been reading a research paper, looked at the methods or data analysis sections, and felt they were rather incomplete? Has that made you feel like you cannot reproduce their conclusions without emailing the authors back and forth?

Unfortunately, this seems to be the norm rather than the exception. As I begin my journey through the field of neuroscience, I am already drowning in a flood of incomplete data. I wonder whether it would even be possible, if I wanted, to extend the authors’ analysis without all the information. Would I be able to pick up where they left off, but with my own data? There has to be an easier way.

Research today has become a community of scientists who collaborate to provide knowledge and expertise. Most scientists are no longer solitary figures in a laboratory uncovering the secrets of life. Science has become a social undertaking that we, and the rest of society, rely on. However, the number of published papers has grown enormously and keeps growing, helped by the internet and better communication technology.

Why is this a problem, I hear you ask? Well, imagine this:

Between the 1950s and 80s, Professor William Thorpe, a pioneering songbird researcher at the University of Cambridge, was examining the sound of birds singing. He adopted a group of finches and used sonograms (visual representations of bird vocalisations). When he started analysing these sonograms, there were probably only a handful of published papers relevant to his project.

Between his time and today, more than 250,000 papers were published in bird research (PubMed, see figure 1). Over all subjects, more than 1 million papers were published in the Netherlands alone (VSNU). Just imagine the output from the rest of the world. With this overload of papers, it becomes crucial but almost impossible to reproduce results and ensure high-quality research.

In 2018, one of our VU Master of Neuroscience students, Eduarda Centeno, started her own journey in songbird research. Her internship at the University of Bordeaux aimed to analyse how birds sing in order to investigate the neural mechanisms involved in speech. With the thousands of papers published to this day, it would take more than a lifetime for Eduarda to catch up on developments in the field, let alone reproduce the conclusions she bases her study on and confirm that they are correct. To keep this from hindering her study, better tools for sharing articles and research become crucial.

Figure 1: Number of papers published between 1950 and today on PubMed after a search for “bird$”. Each dot represents a year.

What can scientists do about this?

This enormous amount of work has been made accessible across the globe through online platforms such as PubMed (US) or NARCIS (NL). As long as you have access or a subscription to the scientific journal, you can likely read the papers with ease. However, until the last decade it was not commonplace, especially for bioscientists, to also share their analyses, workflow or code as open source, that is, freely available for others to share and modify.

Sharing is a substantial and very important part of this social community of scientists. To improve our sharing practices, researchers could post, store or “reposit” the procedures and code they used on an online platform that is easily accessible to others. One candidate repository for this is GitHub, a code hosting platform where researchers can contribute to each other’s work and homogenise the analyses and workflows used in data processing. GitHub is not only a coding platform: it can also be used to store text, and it works as a version control platform. It is free to use and has a tutorial to start you off (more examples in table 1 below).

Want to see how Eduarda got on with her birdsong analysis?

Eduarda’s project involved using songbird data to develop an open-source computational tool in Python for its analysis. On Monday the 20th, she introduced how she used GitHub and Doxygen (a tool that generates documentation from source code, such as code hosted on GitHub) to perform this delicate and complicated task. She used these to store scripts, documentation and example data for tutorials. The aim was to open the project up so that people in her group, and in other groups, can collaborate on it or join it.

More details on Eduarda’s project were outlined in her presentation. She gave some tips on how to use GitHub to improve the quality of research, on good practices in coding, and on more efficient and effective documentation of your code. Links for all of these are provided in table 1 below.
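To make “documenting your code” concrete, here is a minimal sketch of a small, documented Python function. It is not taken from Eduarda’s tool: the function name `find_syllables`, its parameters and the toy data are all hypothetical, chosen only to illustrate the practice. The docstring states what the function does, what goes in and what comes out, which is the kind of information a documentation generator such as Doxygen can collect from the source.

```python
from typing import List, Sequence, Tuple


def find_syllables(
    envelope: Sequence[float],
    threshold: float,
    min_samples: int = 1,
) -> List[Tuple[int, int]]:
    """Find candidate syllables in an amplitude envelope.

    Hypothetical example: a syllable is taken to be a run of
    consecutive samples whose amplitude is strictly above `threshold`.

    Args:
        envelope: Amplitude values, one per sample.
        threshold: Amplitude above which a sample counts as sound.
        min_samples: Shortest run (in samples) that is kept.

    Returns:
        A list of (start, end) index pairs; `end` is exclusive.
    """
    syllables = []
    start = None
    for i, value in enumerate(envelope):
        if value > threshold and start is None:
            start = i  # a run of loud samples begins
        elif value <= threshold and start is not None:
            if i - start >= min_samples:
                syllables.append((start, i))
            start = None  # the run has ended
    # Close a run that lasts until the final sample.
    if start is not None and len(envelope) - start >= min_samples:
        syllables.append((start, len(envelope)))
    return syllables


if __name__ == "__main__":
    # Toy envelope, invented for illustration only.
    toy = [0.0, 0.1, 0.8, 0.9, 0.7, 0.1, 0.0, 0.6, 0.1]
    print(find_syllables(toy, threshold=0.5, min_samples=2))
```

Someone who finds a function like this in a shared repository can reuse it on their own recordings without emailing the author, which is exactly the point of documenting before sharing.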

Sharing code is very useful in a world where research is faced with a heavy volume of data. But it requires some basic knowledge of both coding and GitHub. For that reason, we have tried to provide some initial resources in table 1. Coding, combined with an online platform such as GitHub, can allow scientists to collaborate more easily.

TABLE 1
Coding for beginners: Codecademy; DataCamp; Coursera; LinkedIn Learning; Pizza4Python (VU Amsterdam)
GitHub knowhows: GitHub; to get you started: Hello World guide and ISBE Symposium guide
Songbird research: Eduarda’s Song Bird Data Analysis example; Songbird Science (still under construction); More on Zebra Finches (atlas)
Version Control: Version control (Git) on DataCamp and YouTube; G-node GIN (based on Git); Git tutorial; Better code citing on Zenodo and its GitHub tutorial
Sharing for better reproducibility: Reproducibility Guide from rOpenSci; Munafò, et al. (2017): A manifesto for reproducible science
Open science: Open Access Netherlands; rOpenSci; Crüwell, et al. (2018): 7 Easy Steps to Open Science: An Annotated Reading List
R knowhows: R Markdown for writing reproducible manuscripts; R for reproducible scientific analysis; R for data science (free book)
Table 1: Resources to start your journey through the world of better sharing practices

About the writer

Raluca is a second-year VU Master of Neurosciences student who is particularly interested in neurological development & neurodegenerative disorders.
