Speed up your data exploration with EDA tools

Johan Jublanc
Jun 27, 2022 · 7 min read

Photo by Lukas: https://www.pexels.com/fr-fr/photo/illustration-de-graphique-a-secteurs-669621/

I always do the same thing when starting a new data science project… explore the data. I am quite sure you do the same. While it is a crucial step, EDA (Exploratory Data Analysis) can be time-consuming if you do not have a clear purpose and good tools.

This is why I suggest planning your exploration and using proper packages that can speed up data exploration and save you precious minutes or even hours of work.

If you are a candidate for a data scientist or analyst position, I also suggest using this kind of tool when carrying out a technical test. You will speed up your work and show that your general data culture will be useful to the team you are joining.

What do I look for when exploring my data?

Before starting any exploration task, I always ask myself what information I need to begin my project. The answer can vary from one case to another: it depends on whether you are planning to train a machine learning model, use the data to make decisions, or transform it into an intermediary source of information.

However, there are elements that we want to check in most projects. Let me recall them here:

  • data quality: are there missing values, outliers, unexpected values?
  • data distributions: how are the data distributed? Should I go for min-max scaling, standardization, or something else?
  • data stability: are there major differences between my train and test sets? Are there trends through time that I need to take into account?
  • data interaction: are there biases? Are there subgroups with particularities? Is there redundant information?

I would be careful not to over-explore the data at this step. I prefer to rely on modeling to determine how individuals are structured into groups or to capture complex interactions between variables. I find that one of the common mistakes made by junior data scientists is to over-interpret this basic analysis of the data.

For instance, in machine learning projects, I have seen many data scientists drop columns that have no Pearson correlation with the target. But it is useful to remember that when we develop a model, we are looking for more complex relationships than linear correlations between two variables. So why drop this potentially useful data?

What tool do I need?

What is a good EDA tool?

First of all, as I’m looking to save time, I need a tool with little configuration and a quick start. I also need very clear visualization and a handy dashboard to instantly get the information I need.

In addition, the more general the tool, the better it will adapt to my needs, since I may work on several types of data and several use cases with different quality requirements.

Finally, a good tool would help me get answers to most of the questions I presented earlier (about quality, distribution, stability, interaction, etc.).

I tested several tools; some are open source, others are custom, but in all cases you can get started in a few minutes.

Pandas-profiling

This one is a great tool. It is very practical: a report can be created in a few lines of code, and several of its features are interesting. After loading my Pokemon dataset, I can run the following code and save an HTML report of my data.
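A minimal sketch of that code, assuming the Pokemon data sits in a pokemon.csv file (the file name and report title are placeholders):

```python
# Minimal sketch: build a pandas-profiling report and export it as HTML.
# "pokemon.csv" and the title are placeholder names, not the original ones.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("pokemon.csv")
profile = ProfileReport(df, title="Pokemon dataset")
profile.to_file("pokemon_report.html")
```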

First of all, I get an overview of my data, which is useful to check its general quality. Here, my dataset is small, but there are not many missing values and no duplicates.

Then, I have access to univariate descriptions of the data. I can dive deeper and get details about value distributions, missing data, common and extreme values, etc. This is very practical for deciding on additional checks, required data cleaning, imputation and drop strategies, etc.

If I want more information about interactions, I can visualize bivariate correlations. You will look at complex interactions after the exploration phase, so I suggest not drawing big conclusions from such heat maps, all the more so as you only get bivariate correlations of a particular type.

In addition, there are integrations with other tools. For instance, in PyCharm, you can launch profiling on any dataset in your project just by right-clicking it and launching pandas-profiling in External Tools (for more information, refer to the documentation). This can save you additional time if you have a lot of datasets to explore.

Finally, what seems very promising is the integration with popmon and Great Expectations.

https://github.com/ing-bank/popmon

The first one allows you to check the stability of a dataset through time (cf. the documentation) and works with both pandas and Spark datasets.
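As a rough idea of how it is used, here is a sketch based on the popmon documentation (the file name, the date column and the time window are assumptions):

```python
# Sketch: popmon stability report on a pandas DataFrame with a date column.
# "my_dataset.csv" and the "date" column are assumed names.
import pandas as pd
import popmon  # noqa: F401 - registers the .pm_stability_report accessor

df = pd.read_csv("my_dataset.csv", parse_dates=["date"])
report = df.pm_stability_report(time_axis="date", time_width="1w")
report.to_file("stability_report.html")
```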

https://greatexpectations.io/

The second one is a shared, open standard for data quality (according to their site).

SweetViz

Another EDA tool for making quick reports. This one allows you to carry out the same kinds of analysis, also very quickly, but with less detail. For instance, the bivariate plots are less rich.
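For reference, generating a basic report is just as short. A minimal sketch, with the dataset name as an assumption:

```python
# Sketch: single-dataset Sweetviz report, saved as HTML.
import pandas as pd
import sweetviz as sv

df = pd.read_csv("pokemon.csv")  # placeholder dataset
report = sv.analyze(df)
report.show_html("sweetviz_report.html")
```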

However, the Sweetviz reports are quite nice to navigate. For instance, I like the side panel that lets you focus on details. This makes it easy to manage several levels of information at a glance.

One of the great advantages of Sweetviz is that it makes it possible to compare two datasets, for example a train and a test set.
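A train/test comparison could look like this (a sketch; the split and file names are assumptions):

```python
# Sketch: compare two datasets (e.g. a train and a test split) with Sweetviz.
import pandas as pd
import sweetviz as sv
from sklearn.model_selection import train_test_split

df = pd.read_csv("pokemon.csv")  # placeholder dataset
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

comparison = sv.compare([train_df, "Train"], [test_df, "Test"])
comparison.show_html("sweetviz_comparison.html")
```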

Here again, the way plots and information are presented seems to me to be straightforward and clear.

Custom EDA for Images

So far we have explored classic tabular data, but what if we want to explore images, for instance? In that case, one can use a short snippet of code like the one below to get a first idea of the kind of images we have (make sure you have .jpeg images or adjust the code).
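A minimal sketch along those lines (the image folder and the number of examples shown are assumptions):

```python
# Sketch: show a few .jpeg images with their size and the per-channel
# distribution of pixel values. The "images" folder is a placeholder path.
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

image_dir = Path("images")                    # placeholder folder
paths = sorted(image_dir.glob("*.jpeg"))[:4]  # look at a few examples

fig, axes = plt.subplots(len(paths), 2, figsize=(10, 4 * len(paths)), squeeze=False)
for (ax_img, ax_hist), path in zip(axes, paths):
    img = np.array(Image.open(path).convert("RGB"))
    ax_img.imshow(img)
    ax_img.set_title(f"{path.name} - {img.shape[1]}x{img.shape[0]} px")
    ax_img.axis("off")
    # Pixel value distribution for each colour channel
    for i, colour in enumerate(["red", "green", "blue"]):
        ax_hist.hist(img[..., i].ravel(), bins=50, color=colour, alpha=0.4, label=colour)
    ax_hist.legend()
plt.tight_layout()
plt.show()
```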

It simply plots some example images, adds a title with the size of each image, and plots the distribution of pixel values for the red, green and blue channels.

This can be complemented with many more treatments and analyses. However, it is a fast way to get the main information about the images and choose the priority tasks to plan. In this case, one could consider that the images need to be resized; the target size should then be chosen carefully, since dimensions appear to vary a lot from one image to another. We could also make the hypothesis that the values of the blue channel are often lower than the others, which could be used to create a filter or design a treatment.

Custom EDA for Text

Many tools exist, but one might want something simple and direct to get an overview of textual data. A first step would be to understand the number of words per text (here, tweets):

First of all, you'll probably want to extract the words from your texts.
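A minimal sketch, assuming the tweets live in a pandas DataFrame with a text column (the file and column names are placeholders):

```python
# Sketch: simple tokenization of a text column.
# "tweets.csv" and the "text" column are placeholder names.
import re

import pandas as pd

tweets = pd.read_csv("tweets.csv")

def tokenize(text: str) -> list[str]:
    """Lower-case the text and keep simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

tweets["words"] = tweets["text"].astype(str).apply(tokenize)
```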

Then you can compute counts and plot distributions.
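Continuing the sketch above:

```python
# Sketch: words-per-tweet distribution and most common words overall.
from collections import Counter

import matplotlib.pyplot as plt

tweets["n_words"] = tweets["words"].apply(len)
tweets["n_words"].hist(bins=30)
plt.xlabel("words per tweet")
plt.ylabel("number of tweets")
plt.show()

# Most common words across the whole corpus
counter = Counter(word for words in tweets["words"] for word in words)
print(counter.most_common(20))
```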

Finally, you can even carry out more advanced analysis in a few lines of code, using a Latent Dirichlet Allocation (LDA) model.
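A hedged sketch with scikit-learn's LDA implementation (the number of topics and the vectorizer settings are arbitrary choices):

```python
# Sketch: topic modelling with LDA on the same "tweets" DataFrame as above.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
doc_term = vectorizer.fit_transform(tweets["text"].astype(str))

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(doc_term)

# Print the top words of each topic
terms = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-10:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```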

https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models

I would recommend reading this article for quick ways to explore text data: https://neptune.ai/blog/exploratory-data-analysis-natural-language-processing-tools

Conclusion

EDA is one of the steps to go through before transforming your data project into a valuable product. Tools exist to make it easy and fast, and it would be a shame not to take advantage of them. The key idea is to avoid rebuilding packages that already work fine and to concentrate your efforts on what does not exist yet.

References

https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/
