ML experiments: keeping track of data and hyper-parameters is not enough!

Johan Jublanc
5 min read · Jan 3, 2022


https://www.pexels.com/fr-fr/photo/photo-de-l-ile-pendant-l-heure-doree-1119973/

In my experience, we build better machine learning solutions when we can collaborate by sharing what each of us has experimented with. In many cases, versioning parameters alone is not enough to explain what an experiment is about, and is often not sufficient to reproduce it. Your work as a data scientist or ML engineer is much more than setting a bunch of parameters: you conceive, along with your team, original solutions to particular and sometimes very complex problems. So if you do not keep track of everything you have done, you won’t be able to roll back when performance drops, have your results checked by others, and so on.

When it comes to versioning your experiments, you may be, like me, a little lost: what information should I save to be sure I can reproduce an experiment? Can I just save everything?

Another issue that arises when trying to reproduce a full experiment is the way the data was preprocessed. There are several ways to version your data, but in my experience I have not found a good tool to track the entire sequence of preprocessing operations.

In the following, I will explain how I use GitPython, MLflow, and a cloud provider to resolve these issues. This method attempts to implement MLOps concepts such as model versioning in a practical and automated way.

Commit your data process at each experiment

When developing ML models, I was often overwhelmed by all the information I wanted to record. So I decided to save everything! And we already have the right collaborative tool to do it: git. The only thing you need to do is automate versioning every time you start a new experiment.

To do this I use GitPython. You first create a repository object to interact with your version control tool.

import git

repo = git.Repo(gitwd)  # gitwd is the path to your repository's working directory

Then you just have to commit your changes when executing your code.

commit_code(repo, f"exp(preprocess): timestamp={ts}")
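The commit_code helper itself is not shown here. As a rough, stdlib-only sketch of what it needs to do (this version drives the git CLI through subprocess, whereas the original works on a GitPython Repo object), it just stages everything and commits with the experiment message:

```python
import subprocess

def commit_code(repo_dir, message):
    """Stage all current changes and commit them with the experiment message.

    Hypothetical stand-in for the article's helper: it shells out to the
    git CLI instead of using a GitPython Repo object.
    """
    subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
    subprocess.run(["git", "-C", repo_dir, "commit", "-m", message], check=True)
```

With GitPython, the same two steps would be `repo.git.add(A=True)` followed by `repo.index.commit(message)`.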

I suggest choosing a commit message containing a unique ID to track your data. Here I have chosen a simple timestamp (ts). The data is also saved to a storage bucket or database using the same identifier, so that when you want to replicate your experiment you can either preprocess your data again or use the preprocessed data, knowing exactly which transformations you performed.

local_file_name = f"{ts}_{key}.csv"
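Putting the pieces together, a minimal sketch of this naming scheme could look like the following (the save_with_id helper is my own name, and the upload-to-cloud step is deliberately left out):

```python
import csv

def save_with_id(rows, key, ts, out_dir="."):
    # Name the file with the same timestamp ID used in the commit message,
    # so the committed code and the stored data stay linked.
    local_file_name = f"{out_dir}/{ts}_{key}.csv"
    with open(local_file_name, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    # In the article's setup the file would then be uploaded to shared
    # cloud storage or a database under the same name (omitted here).
    return local_file_name
```

For example, `save_with_id(rows, key="train", ts=ts)` produces a file like `1641168000_train.csv`, which any teammate can fetch from the shared storage using only the ID found in the commit message.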

Automate the commit of your whole training process

When you train your model, you simply do the same as before to track your experiment. You do not have to think about which data you used, because the data ID is saved directly in your code.

You do not have to overthink what to save, since you saved everything. To get back to your experiment easily, you can use your run ID as the commit message.

Use notebooks only to explore data and make graphs, not to run experiments

Many people, myself first among them, like to explore solutions in notebooks. They are a great tool to explore data and better understand your model. But in my opinion, you should not use them to train models or preprocess data. A notebook is a poor artifact to version: much of its content is not needed to reproduce the experiment (cell execution counts, and outputs such as graphs or tables), it easily creates versioning conflicts, and therefore it should not be committed.

However, exploration and visualization are very important, particularly in the early stage of an ML project, and can be critical when debugging your model. In those cases it can be really time saving to have an interactive way to explore data, intermediate outputs and model characteristics.

This is why I suggest not getting rid of your notebooks, but using them at the exploration and/or model-debugging stages. You can use Jupytext, which converts your notebook into a script so that it is easily versioned.

Work in a collaborative way using a leaderboard and branches for each task

You can use MLflow tracking with a cloud provider such as Azure to record your metrics on a leaderboard shared with your team. When you want to get the code corresponding to the best model, you just explore the experiment in the MLflow UI, get the run ID, and look for that ID in your commit messages. With this tool you can share results with the rest of the team and collaborate more efficiently.
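The "look for that ID in your commit messages" step is easy to automate. Here is a sketch (the helper name is my own, and it uses the git CLI's --grep option rather than GitPython):

```python
import subprocess

def find_commits_for_run(repo_dir, run_id):
    # Return "hash subject" lines for every commit whose message mentions run_id.
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", "--all", f"--grep={run_id}", "--format=%H %s"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip().splitlines()
```

Given a run ID copied from the MLflow UI, this returns the matching commit, and a `git checkout` of that hash restores the exact code that produced the model.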

To improve collaboration a bit more, team members can separate preprocessing and training into different branches. Let’s assume a data scientist works on data preprocessing. ML engineers can experiment with the latest data version on their own branch, which is possible since the data is available in shared storage or a database. When a new version of the data is available, the ML engineers do not have to switch branches; they just change the data ID in their code.

Conclusion

There are many cases where backing up settings and data is not enough. For example, you might want to record reinforcement learning experiments that cannot be summarized only by the model’s hyper-parameters. In this case, you may want to record how you schedule your training in addition to the hyper-parameters. You will probably also want to save your reward function (which should guide learning and allow your agent to achieve the end goal of the problem).

A second example, out of a thousand, that may come to mind is when you are dealing with deep learning models with a complex structure. Again, a bunch of settings will not be enough to replicate the entire experiment. This is especially true when using a custom training loop, as for GANs, or a special module such as an STN (Spatial Transformer Network).

That’s why I developed a basic version of a wrapper that performs automatic, collaborative versioning as described above. Using it is as easy as adding a decorator to the main preprocessing and training functions.
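To give an idea of what such a decorator can look like, here is a minimal sketch under my own assumptions (the real wrapper in the repository also performs the git commit, which is stubbed out here):

```python
import functools
import time

def track_experiment(func):
    """Generate a unique ID, pass it to the wrapped step, and build the
    commit message for that step. The actual commit is stubbed out."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        ts = int(time.time())  # unique experiment ID, as in the article
        result = func(*args, ts=ts, **kwargs)
        message = f"exp({func.__name__}): timestamp={ts}"
        # The real wrapper would now commit the code, e.g. with GitPython:
        # commit_code(repo, message)
        return result, message
    return wrapper

@track_experiment
def preprocess(data, ts=None):
    # The step receives the ID so it can tag its outputs (e.g. f"{ts}_{key}.csv").
    return [x * 2 for x in data]
```

Every call to `preprocess` then yields both its result and a ready-made commit message containing the ID under which the data was saved.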

You can find the code on my GitHub: https://github.com/JJublanc/mlops_versioning

Thank you for reading this article! I hope it is helpful. Please do not hesitate to contact me if you have any questions, suggestions or comments; I always try to challenge my ideas and my tools to continuously improve them.

Illustrations

Photo : https://www.pexels.com/fr-fr/photo/photo-de-l-ile-pendant-l-heure-doree-1119973/

Icons :

  • Code made by IconKarma from flaticon.com
  • Folder made by freepik from flaticon.com
  • Csv made by iconixar from flaticon.com
