Stop using notebooks (or why you should start making data science like a developer)

Johan Jublanc
9 min read · Jun 11, 2022


https://burst.shopify.com/photos/ocean-waves-meet-red-sands-aerial?q=wave

Most data scientists I know use or have used notebooks, myself included. And that’s fine. However, the more I think about it, the more I’m convinced that data scientists should be considered, and should think of themselves, as developers, and should therefore replace notebooks with scripts developed in an IDE.

Why is that? There are three main reasons:

  • it’s more effective and efficient;
  • it leads to a better implementation of the scientific methodology;
  • it shortens the development of your solution.

It’s not easy, but you’ll save 50% of your time and considerably reduce the risk of never going into production. Of course, coding in notebooks is not a bad thing in itself, but in my opinion, we can do better in 2022.

In the following, I’ll explain why I think using notebooks can waste time and money and increase risk. Then, I will discuss the benefits of notebooks and the solutions that can be used to keep those benefits when working in IDEs (integrated development environments).

If you have a different opinion on using notebooks, feel free to share it in the comments or contact me. I would be happy to hear about your use of this tool and your best practices.

What are “notebooks” used for, and why are they not effective?

Why do we like notebooks as data scientists?

Most of the time, notebooks are used at the early stage of a data science project, to explore solutions, build a proof of concept and validate technical feasibility. Several main features are often brought to the fore to justify their usage:

  • velocity: data scientists can launch a new notebook in a few minutes;
  • compute capacity: notebooks can be launched on virtual machines in cloud environments (such as SageMaker, AI Platform or Azure ML Studio);
  • ease of use: no advanced development skills are needed to start testing, and code can be executed line by line;
  • a ready-made presentation: plots and markdown text displayed within your document can be shared with non-technical teams.
screenshot of a notebook cell

Even if those features have been of great interest, for me as well as for many data scientists, using notebooks prevents or hinders the implementation of good software development practices.
Most of the time (not to say always) it means recoding everything when going into production. This is double work! Wouldn’t it be better to have a good basis written by data scientists that is then improved, instead of recoding everything?

Furthermore, notebooks have five major disadvantages:

  • Lack of traceability;
  • Lack of reproducibility;
  • Lack of tools to develop the code (docstring helpers, lint checks, debugging, autocompletion and many others);
  • A focus of PoCs on the machine learning part instead of testing all technical parts of the solution;
  • Time-consuming debugging, in particular when setting up the kernel and environment.

Not to mention that notebooks are not conducive to the development of complex systems or to systematic analysis.

Not a rigorous scientific framework

https://www.flaticon.com/free-icons/lab by Prosymbols

Developing in notebooks does not provide a rigorous framework for a scientific approach. Even if a rigorous method is put in place by the team (which is often the case), notebooks do not make it easy to follow.

Lack of traceability. The first reason is that there is no easy way to follow the evolution of experiments, because notebooks are not easy to version. Since notebook files mix code, metadata and outputs in a single JSON document, merging them produces many conflicts, so there is often no option but to work alone on your notebook.

As a result, the team often loses track of the evolution of experiments. I have often seen teams use an alternative manual versioning system to track notebooks, using file names as a source of information, which is not an optimal solution.

Lack of reproducibility. The second reason is that the results displayed may have nothing to do with the actual code in the notebook. The engineer has navigated through the code, executed a cell here, made a change there, and the information used to interpret the exploration results becomes confusing (Which code did I use to get this result? Which data?). This is not a lack of rigor on the part of the engineer (no one thinks linearly) but a problem with the framework of the experiment.

As a consequence, experiments are not always reproducible.

Not a software product development framework

https://www.flaticon.com/free-icons/agile by Flat Icons

Lack of development tools. In a notebook, you don’t have access to the wide range of tools available in IDEs, such as autocompletion, code navigation, code formatting, versioning assistance, refactoring, etc. Working in a notebook is therefore more time-consuming and not conducive to the implementation of good development practices, which exist to:

  • facilitate teamwork and ownership by other team members;
  • ensure unambiguous software behavior, i.e. each function has a single, well-defined responsibility;
  • ensure the modularity of the code.

Focusing PoCs on only one part of the problem. Since it is the product that brings the value, you need to make a PoC of a product, not only of the modeling part. What if the main problem is not the learning task, but the model service? What if it works locally, but not in your application? If you only test the learning task, you are not validating the UX, deployment, security and all the other technical aspects of your software. By starting with a simple end-to-end test, you can shorten the development cycle, test the value of the product earlier, and save time while improving design quality. To do this, it is best to develop your solution directly as software.
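To make this concrete, here is a minimal sketch of what testing the serving path early can look like, using FastAPI with a pickled model. The model file, feature names and route are placeholders I chose for illustration, not a prescribed design:

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:  # hypothetical serialized model
    model = pickle.load(f)

class Features(BaseModel):
    feature_a: float  # hypothetical feature names
    feature_b: float

@app.post("/predict")
def predict(x: Features):
    # A deployable prediction path you can exercise end to end from day one
    return {"prediction": float(model.predict([[x.feature_a, x.feature_b]])[0])}
```

Even this tiny endpoint forces you to confront serialization, input validation and deployment questions that a notebook-only PoC never touches.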

Doing data science like a developer is not so easy, but solutions exist!

A new approach: think of exploration as part of the software development life cycle

Today more than ever, data science is much more about working on the use case, modeling the problem and working on the data than about spending time on the model itself, as there are more and more existing models and turnkey solutions.

Do data science like a developer

Use scripts and an IDE to explore. By doing this, you can run richer experiments, manipulating multiple modules and orchestrating them in sometimes complex learning cycles. For example, reinforcement learning or image generation problems require designing several parts (an environment and one or more agents) or several models (a generator and a discriminator) and animating them in a learning cycle whose design will be crucial for learning (see the sketch after the list below). Being able to simulate and solve particularly complex problems will be even more crucial for data scientists, as simple machine learning problems can increasingly be solved without them. Indeed, many solutions can be used by people who are not experts in mathematics or statistics, such as:

  • no-code solutions (such as Dataiku);
  • automated machine learning (such as auto-sklearn);
  • pre-trained models, models and examples from open source (e.g. provided by platforms like Vertex AI, companies like Hugging Face and the open-source community).
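As an illustration of such a learning cycle, here is a minimal, self-contained sketch: a toy bandit problem where an Environment and an Agent are plain Python classes animated in a training loop. The class names and the epsilon-greedy strategy are my own illustrative choices, not a prescription:

```python
import random

class Environment:
    """Toy environment: reward is 1 when the agent picks the hidden best arm."""
    def __init__(self, n_actions=3):
        self.best = random.randrange(n_actions)

    def step(self, action):
        return 1.0 if action == self.best else 0.0

class Agent:
    """Toy epsilon-greedy agent over a discrete action space."""
    def __init__(self, n_actions=3, eps=0.1):
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.eps = eps

    def act(self):
        if random.random() < self.eps:
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def learn(self, action, reward):
        # Incremental mean update of the action-value estimate
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]

# The learning cycle itself is ordinary code: easy to version, test and refactor
env, agent = Environment(), Agent()
for _ in range(1000):
    action = agent.act()
    agent.learn(action, env.step(action))
print(agent.values)
```

Structuring the environment, the agent and the loop as separate modules is exactly what a script plus an IDE makes natural, and what a single notebook makes painful.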

Versioning code and experiments as developers do. With a strong versioning methodology, one can:

  • understand each experiment by difference with the others;
  • make experiments truly reproducible;
  • increase traceability and reduce the risk of exploring the same thing multiple times.

However, versioning code and versioning experiments are two different things. On the one hand, one can use git and good practices (one of them being the Karma commit convention) and thus keep a trace of the code’s evolution accessible to all team members. On the other hand, data science experiments can be tracked with tools like MLflow Tracking, which records the metrics, configuration and artifacts associated with each of them.
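As a minimal sketch of the experiment-tracking side with MLflow Tracking (the experiment name, parameters and metric values below are placeholders):

```python
import mlflow

mlflow.set_experiment("churn-exploration")  # hypothetical experiment name

with mlflow.start_run(run_name="baseline-logreg"):
    # Log the configuration of the experiment...
    mlflow.log_param("C", 0.1)
    mlflow.log_param("penalty", "l2")

    # ... train the model and compute metrics here ...

    # ...then log the results and any artifact produced by the run
    mlflow.log_metric("val_auc", 0.87)  # placeholder value
    mlflow.log_artifact("reports/profile.html")  # hypothetical file
```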

Ideally, one should link the two sides of versioning to be able to map the code to the results of an experiment. This can be done using GitPython, which makes it possible to create a new commit when running the code (i.e. when starting a new experiment).
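A minimal sketch of that link, using GitPython and an MLflow tag (the commit message and tag name are my own choices, not a standard):

```python
import git
import mlflow

# Find the repository containing the current working directory
repo = git.Repo(search_parent_directories=True)

# Snapshot the working tree so the experiment maps to an exact commit
repo.git.add(all=True)
commit = repo.index.commit("exp: baseline logistic regression")

with mlflow.start_run():
    # Store the commit hash with the run so results stay traceable to code
    mlflow.set_tag("git_commit", commit.hexsha)
    # ... run the experiment ...
```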

Keep the most interesting advantages of notebooks

Run the code line by line and plot graphics. When exploring, and since Python (the main language used by data scientists) is an interpreted language, it can be very convenient to execute code line by line or block by block. This is perfectly feasible if your IDE has an associated Python console; in PyCharm, for instance, you can execute the selected lines with a shortcut.
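Both PyCharm (in scientific mode) and VS Code also recognize # %% cell markers inside plain .py files, so you can keep cell-by-cell execution in an ordinary, versionable script. A tiny sketch, with a hypothetical file path:

```python
# %% Load the data
import pandas as pd

df = pd.read_csv("data/train.csv")  # hypothetical path

# %% Quick look at distributions, executable as a single cell
print(df.describe())
```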

In notebooks you can see your graphical results directly under each cell, which is less convenient in an IDE. It is not impossible, though: in PyCharm you get a pop-up window with your plot when using matplotlib. However, you can also, and maybe even should, save your plots as HTML pages (or another convenient format) in dedicated folders so you can keep track of them. This is easy to do with tools such as plotly, for any type of chart. One can also use pandas-profiling to make a quick and exhaustive first exploration of the data.
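A minimal sketch of both ideas, assuming a CSV dataset and hypothetical column names:

```python
import pandas as pd
import plotly.express as px
from pandas_profiling import ProfileReport

df = pd.read_csv("data/train.csv")  # hypothetical dataset

# Save an interactive chart as a standalone HTML page you can keep and share
fig = px.scatter(df, x="feature_a", y="target")  # hypothetical columns
fig.write_html("reports/scatter_feature_a.html")

# One-shot exploratory report of the whole dataframe
ProfileReport(df, title="First exploration").to_file("reports/profile.html")
```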

screenshot of pandas-profiling html report

In addition, there are tools that track experiments in a more comprehensive way, such as TensorBoard or MLflow, so you don’t need to plot many graphs to monitor your exploration.

https://stackoverflow.com/questions/53529632/how-to-display-the-accuracy-graph-on-tensorboard

Run your code on large machines. The power of a local machine is sometimes limited. Unlike notebooks, which can be run remotely on very large machines (notebook instances provided by all major cloud providers, Google Colab, etc.), IDEs are often used locally.

A first solution is to run tasks on remote machines. One can launch computational tasks on a remote machine, such as a DGX Station with GPUs, through an SSH connection or through managed services from cloud providers (Azure batch jobs can be launched with the Python API, for example).
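A minimal sketch of the SSH approach, with the host name and remote paths as placeholders for your own setup:

```python
import subprocess

# Launch a training script on a remote GPU machine over SSH;
# "user@dgx-station" and the remote paths are placeholders.
subprocess.run(
    ["ssh", "user@dgx-station", "cd ~/project && python train.py --epochs 10"],
    check=True,  # raise if the remote command fails
)
```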

https://apy-groupe.com/fr/fr/nvidia-dgx-station-deep-learning-solution/421-nvidia-dgx-station-a100-160go.html

A second possibility is to use an IDE on a VM. This type of service is provided by JetBrains Gateway, with which you can launch an IDE on your own VM over an SSH connection, or on Space, another JetBrains service.

https://blog.jetbrains.com/fr/space/2020/12/09/space-est-disponible-pour-tous/

All things considered, most of the time it is worth asking whether you really need that much computing power. It is often overkill, as there are strategies to avoid using huge amounts of computing capacity:

  • explore on samples;
  • favour the most direct approaches (transfer learning, pre-existing models available on platforms or as a service, business rules).

These strategies have another advantage: you first try a method that is faster than a large training run on huge machines.
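Exploring on a sample, for instance, is a one-liner with pandas; the fraction is arbitrary and the path hypothetical:

```python
import pandas as pd

df = pd.read_csv("data/train.csv")  # hypothetical dataset

# Iterate on a small, fixed sample first; scale up once the pipeline works
sample = df.sample(frac=0.05, random_state=42)
```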

An effort worth making

Sure, doing data science as a developer has a cost of entry for most data scientists, but the effort is worth it:

  • It’s a one-time cost of entry
  • A team can capitalize on what’s being done by others.

Not so long to set up. In my experience, most data scientists are able to adopt the main best practices within a few weeks if they are not already familiar with them:

  • IDE;
  • Versioning;
  • Testing (see the sketch after this list);
  • Serving.
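As an example of the testing habit, a minimal pytest sketch; clean_amounts and its module are hypothetical stand-ins for your own code:

```python
# tests/test_preprocessing.py
import pandas as pd

from my_project.preprocessing import clean_amounts  # hypothetical module

def test_clean_amounts_drops_negative_values():
    df = pd.DataFrame({"amount": [10.0, -3.0, 42.0]})
    out = clean_amounts(df)
    assert (out["amount"] >= 0).all()
```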

In particular, I have set up a three-day onboarding bootcamp for data scientists, during which we cover good development practices among many other topics. I was happy to see that they not only understood the concepts quickly, but also tried to put them into practice directly in their projects. After some time, they were able to start a data science project directly in their IDE and apply good development practices.

Conclusion

Stopping using notebooks is not easy: you have to change your habits, learn new tools and find a workflow that suits your own context. However, it can be done in a short time, and it saves a lot of time in the short term while reducing the risks associated with exploration.

In addition, it brings data scientists closer to developers, data engineers and the other members of a project team. It leads to better alignment with standards and best practices, facilitates work and understanding within the team, and increases overall velocity.

Getting rid of notebooks is the first step toward what should be your goal: integrating experimentation into a complete MLOps workflow.
