“I’m a data scientist. I don’t need to write clean code because most of my code is throwaway anyway.” “Clean code and agile are good for developing software. They don’t make sense in my work.” The number of times I have heard these statements, and the reluctance to even try some of the suggestions on clean code, baffles me.
Well, let me tell you: you don’t need to write clean code for software development either. You don’t need to practice agile for software development either. One can build perfectly working software without either of them (maintaining, modifying, and scaling it will get difficult, but that’s not the focus of this article). When you do need to follow clean code practices is when you are working in a TEAM, irrespective of whether you are developing software, developing an algorithm, or trying out multiple algorithms.
The basic idea of clean code is that your fellow team members should be able to understand what you have written. This is especially important in data science. As a scientist, your experiments must be reproducible and verifiable. That means others on your team should be able to understand and reproduce your results.
We exist in a team. It is impossible to be a data scientist by yourself. Most of the time in industry you will be working in applied science. This means you have to understand someone else’s problem, and they (the team, the business folks) need to understand your solution.
As a data scientist you will interact frequently with the following members of the team (among others):
· Other data scientists/data analysts (I use the terms interchangeably here): you collaborate with them on algorithms, models, interpreting results, feature engineering, etc.
· Data engineers: you don’t want to end up processing all the data yourself. Data engineers can optimise queries so that you get your consolidated results in minimal time. They need to understand what data you need.
· Business analysts (BAs)/domain experts: what do your results mean in business terms? You will need to work with the BAs to understand what your input variables mean and what impact your results will have on the business. Essentially, you need them to understand the domain.
Now that we have established that collaboration is important for a data scientist, let’s talk about the ways code can make it effective. These are the low-hanging fruit:
1. Meaningful names: Your team members shouldn’t have to ask you what a variable means. Use names that make the intention very clear. It is OK to have a long name. The way to cope with long names is a better editor (one where you don’t have to type the whole variable name every time), not shortening the name until it becomes cryptic. Also, no magic numbers (unexplained literal values in the code) please.
2. Avoid mental mapping: You know in your head which parameters in your code need to be changed to obtain the desired result. Others don’t. If people have to remember all the variables that need to be changed together, you are wasting a lot of the team’s time. Someone (including you) is going to run the code having forgotten to change one of them. That run is wasted and the code has to be re-run. Considering how long some algorithms take, this should be a NO-NO.
3. Don’t have multiple names for the same thing. Once you decide on a name as a team, stick to it.
4. Use business domain names. Save your creativity for engineering features, analyzing data, and creating models. Don’t invent names.
5. Comments become obsolete very soon, while code always tells the truth. Hence, comments are normally discouraged, and the same applies to data science code. However, I would make an exception here: add comments that reveal your reasons for using or not using certain models or features. It is OK to comment “This is going to take 5 hours” or “Using SVM with Radial basis kernel for faster convergence”.
6. Modularize your code. It is not easy to understand one big blob, no matter how well it is mapped in your head right now. You will take time to understand it at a later point in time. Others will take time to understand it. Why? Because one has to go through each and every line of the code to understand what it is doing. When you create functions that each do exactly what their names say, the reader need not go through every line inside them and will still understand what is being done.
7. Don’t repeat yourself! Do not copy and paste lines of code because you need the exact same thing somewhere else too. Extract them into a function and reuse that function. This ensures consistency.
8. Unit tests: Yes, they are very much relevant to data science code. Data science code is still code! Testing becomes all the more important when your functions spit out numbers on which business decisions will be based. Why would you not test functions that can have so many repercussions? All your functions should be tested to make sure they are doing what you expect them to do. Writing a function that identifies a stationary signal? Test it with a signal that is stationary and make sure your function identifies it as such.
9. Formatting: Decide as a team what formatting you want to follow. This includes spaces vs. tabs for indents, directory structure, file naming conventions, the format of output results, etc.
I am all for trying out various things, approaches, and models in a quick and dirty manner. This is a very effective way to find out how much time to invest and in which direction. But once you finalize your approach and incorporate it into your working code, do clean it up. All the above concepts help people work as a team. I find them useful even when I am working solo. They help me understand my work at a later point in time, position me to explain my results, and allow me to cross-examine surprising (or shocking) results quickly.
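To make a few of the points above concrete, here is a minimal Python sketch built around the stationary-signal example from point 8. Everything in it is an assumption made for illustration: the names (StationarityConfig, is_stationary, and the helpers), the threshold values, and the naive split-half check itself, which only compares the mean and variance of the two halves of a signal and is not a proper statistical test. What it tries to show is descriptive names, named thresholds instead of magic numbers, all tunable parameters in one place, small functions that do what their names say, and a unit test.

```python
"""Illustrative sketch only: a naive split-half stationarity check (hypothetical names)."""
import random
from dataclasses import dataclass
from statistics import fmean, pvariance


@dataclass(frozen=True)
class StationarityConfig:
    """Every tunable parameter lives here, so nobody has to remember
    which values must be changed together (point 2)."""
    # Named thresholds instead of magic numbers (point 1); the values are assumptions.
    max_mean_shift_in_std_units: float = 0.5
    max_variance_ratio: float = 2.0
    min_samples: int = 4


def split_in_halves(signal):
    """Return the first and second half of the signal."""
    midpoint = len(signal) // 2
    return signal[:midpoint], signal[midpoint:]


def mean_shift_in_std_units(first_half, second_half):
    """Absolute difference of the half means, scaled by the overall standard deviation."""
    overall_std = pvariance(first_half + second_half) ** 0.5
    if overall_std == 0:
        return 0.0  # a constant signal has no mean shift
    return abs(fmean(first_half) - fmean(second_half)) / overall_std


def variance_ratio(first_half, second_half):
    """Ratio of the larger half variance to the smaller one (always >= 1)."""
    smaller, larger = sorted([pvariance(first_half), pvariance(second_half)])
    if smaller == 0:
        return 1.0 if larger == 0 else float("inf")
    return larger / smaller


def is_stationary(signal, config=StationarityConfig()):
    """Naive check: mean and variance must be similar in both halves."""
    if len(signal) < config.min_samples:
        raise ValueError(f"need at least {config.min_samples} samples")
    first_half, second_half = split_in_halves(signal)
    mean_is_stable = (
        mean_shift_in_std_units(first_half, second_half)
        <= config.max_mean_shift_in_std_units
    )
    variance_is_stable = (
        variance_ratio(first_half, second_half) <= config.max_variance_ratio
    )
    return mean_is_stable and variance_is_stable


def test_is_stationary():
    """Point 8: feed in a signal known to be stationary and one known not to be."""
    rng = random.Random(42)  # fixed seed so the test is reproducible
    white_noise = [rng.gauss(0, 1) for _ in range(1000)]
    upward_trend = [0.01 * step + noise for step, noise in enumerate(white_noise)]
    assert is_stationary(white_noise)
    assert not is_stationary(upward_trend)
```

If a threshold needs to change, it changes in one place, for example is_stationary(signal, StationarityConfig(max_variance_ratio=1.5)). For real work you would likely replace the naive check with an established statistical test. The structure and the unit test can stay the same.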
The Clean Code book does a fantastic job of explaining all of the points mentioned above, and more. So, if you are a data scientist and are going to work in a team, please read and implement clean code :) Every scientist has their own rules. But when working in a team, the team has rules.