July 29, 2024
Recently I made the switch to actively tracking my machine learning experiments, first with MLflow and then with Comet.
Tracking experiments is becoming more common in MLOps. Over time, data scientists and machine learning engineers have realized the importance of model versioning, which goes beyond saving metrics: modern experiment tracking includes saving metrics, logging charts, and even saving artifacts.
In this article, we’ll focus on Comet’s experiment tracking within an Azure Databricks environment. I’ll cover four things: installing Comet in Databricks, setting up an experiment, logging EDA figures, and logging and comparing model metrics.
Databricks has built-in ML experiment tracking using MLflow, which I find has a high learning curve and is not very beginner friendly. Comet is easy to set up and easy to use for data science professionals of all skill levels.
The goal of this project is to demonstrate for beginners the advantages of Comet in an Azure Databricks environment.
This article assumes you know how to navigate Databricks and upload/mount the datasets that will be used.
Dependencies: comet_ml and comet_automl (installation is covered below).
Other libraries such as pandas, NumPy, scikit-learn, Seaborn, and Matplotlib are already pre-installed in Databricks.
Installing Comet into Azure Databricks is straightforward. If you already use AutoML (which tracks runs with MLflow under the hood), the comet-for-mlflow package is also really useful: Comet will log your existing MLflow work and save it to a Comet experiment.
The first step is installing it in the cluster. Once you have started your cluster, select the Install New button, highlighted below.
In the Install Library window, select PyPI. Then type the name of the package. I would recommend at this point installing both comet_ml and comet_automl.
There is an option to install on all your current clusters; this is a personal choice. Generally, I only install the Comet libraries on the clusters I use for development or production.
Since a Databricks notebook is neither a Jupyter notebook nor a plain Python file, there will be instances where certain Comet features do not work. You will often have to cover those gaps with AutoML, depending on what you want to save to your experiments.
Now you’re ready to set up an experiment in a Databricks notebook.
Before we create the experiment, let’s first look at the dataset we will be using: the Diamonds dataset from Kaggle.
The diamonds dataset has 53,940 rows and 10 features: carat, cut, color, clarity, depth, table, price, and the three dimensions length (x), width (y), and depth (z).
Further information on carat, cut, color, clarity, and other terms can be found here.
Before creating your experiment, select the compute cluster where you installed comet_ml and comet_automl. Once you have selected the cluster, create a new Databricks notebook. We will load the following libraries:
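Here is a minimal set based on the libraries used throughout this article (note that Comet recommends importing comet_ml before any other ML libraries):

```python
from comet_ml import Experiment  # import before other ML libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression, DecisionTreeRegressor, GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
```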
Once you have loaded the libraries, load your dataset into Databricks. You can do this by uploading your CSV directly, or, if you have Azure blob storage, by mounting the storage container.
Let’s create an experiment. When you first create your experiment, make sure you add the API key from your Comet account, along with your workspace name (this is the same as your Comet account name).
Once you have added those, add your experiment name. This will be the main repository for all the runs in your experiment. If the name does not exist in your Comet workspace, a new project will be created with that name.
If you want your code to show in a run, set the log_code parameter to True. If you are concerned about privacy or revealing your IP address, set log_env_host to False. Note that Azure will log different environment settings than a run in a Jupyter notebook would.
For the sake of clarity, I always add an experiment tag to note the experiment environment. I also add tags for experiment models that are in development, staging, and production.
If you are working in a team, you may not be working solely in Databricks, so it helps to add multiple tags.
```python
experiment = Experiment(
    api_key="[API KEY]",  # your Comet API key
    project_name="experiment_name",
    workspace="your_workspace",
    log_code=True,       # add this if you want to save your code to an experiment
    log_env_host=False,  # add if you want to hide your IP or system settings
)

# Add a tag to distinguish this from other experiments
experiment.add_tag("Azure Databricks")

# Log a profile of the dataframe
experiment.log_dataframe_profile(diamond_pd, "Diamond Pandas Dataframe")
```
Exploratory data analysis using Comet in Azure Databricks works very similarly to using it in a Jupyter notebook.
Logging figures is important: data may change between runs of an experiment for various reasons, so it’s worth saving the charts from your EDA to each experiment.
To be able to create the figures, your dataframe will need to be in a pandas format rather than in a Spark dataframe.
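If you loaded the data as a Spark dataframe, the conversion is a one-liner; the variable names diamond_df and diamond_pd are assumptions carried through the rest of the examples:

```python
# Convert the Spark dataframe to pandas for plotting and profiling
diamond_pd = diamond_df.toPandas()
```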
In Comet there are two ways to log figures to an experiment: through matplotlib or through Seaborn. For this project, we used both.
Matplotlib is the simplest way to log a graph: simply call the experiment.log_figure method after your completed figure. Do not add plt.show(); you will get an error and the figure will not be saved. An example of logging a matplotlib figure is below.
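A minimal sketch, assuming the pandas dataframe is named diamond_pd and using an illustrative figure name:

```python
import matplotlib.pyplot as plt

# Build a simple matplotlib figure
fig, ax = plt.subplots()
ax.hist(diamond_pd["price"], bins=50)
ax.set_xlabel("Price (USD)")
ax.set_ylabel("Count")

# Log it to the experiment; do not call plt.show() first
experiment.log_figure(figure_name="Price Histogram", figure=fig)
```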
Unlike matplotlib, Seaborn has some trouble saving charts in Comet. There’s a quick workaround: first save the Seaborn plot to a variable, then convert it using its .fig or .figure attribute (some chart types expose only one of the two).
I used the following method to save the charts:
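A sketch of such a helper, consistent with the description above (the function name is an assumption):

```python
def save_seaborn_figure(chart, name):
    # Figure-level Seaborn plots (e.g., FacetGrid) expose .fig,
    # while axes-level plots expose .figure
    fig = chart.fig if hasattr(chart, "fig") else chart.figure
    experiment.log_figure(figure_name=name, figure=fig)

# Usage:
# save_seaborn_figure(sns.violinplot(data=diamond_pd, x="cut", y="price"), "Violin Plot")
```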
Alternatively, you can use exception handling without using the method:
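For example, with an illustrative scatter plot:

```python
import seaborn as sns

scatter = sns.scatterplot(data=diamond_pd, x="carat", y="price")
try:
    experiment.log_figure(figure_name="Carat vs. Price", figure=scatter.figure)
except AttributeError:
    # Figure-level plots expose .fig instead of .figure
    experiment.log_figure(figure_name="Carat vs. Price", figure=scatter.fig)
```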
For this article we logged histograms, a correlation matrix, a scatter plot, and a violin plot. Let’s take a look at the figures we logged in our test EDA.
In the dataset, most diamonds have a small carat size and a low price. The length, width, and depth (x, y, z) fall within very narrow ranges.
Carat, length, width, and depth are highly correlated with one another, and price is highly correlated with carat size. Given their high mutual correlation, the dimension variables may be multicollinear, so we will be dropping them later down the line.
As we can see from the chart, most diamonds range from below 1 carat to 2.5 carats, regardless of cut quality. This suggests that carat size is being standardized.
The median price for every cut is below $5,000 USD, with ideal cuts showing the lowest median price. Premium, good, and fair cuts have higher median prices than very good and ideal cuts.
To prep for modeling, we have three steps: transforming the data, converting the dataframe to Spark, and vectorizing.
The dataset still contains three categorical features: cut, color, and clarity. In addition, it contains the three dimension variables: length (x), width (y), and depth (z).
We will first encode the categorical variables, since string data types cannot be used when vectorizing features. We will also drop the x, y, and z columns. An example of this is below:
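A minimal sketch using scikit-learn’s LabelEncoder (the choice of encoder here is an assumption):

```python
from sklearn.preprocessing import LabelEncoder

# Encode the string categories as integers
for col in ["cut", "color", "clarity"]:
    diamond_pd[col] = LabelEncoder().fit_transform(diamond_pd[col])

# Drop the dimension columns flagged as multicollinear during EDA
diamond_pd = diamond_pd.drop(columns=["x", "y", "z"])
```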
Now that that’s done, let’s convert the pandas dataframe to a Spark dataframe.
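In Databricks a SparkSession is already available as spark, so this is a one-liner (the name diamond_sdf is an assumption):

```python
# Create a Spark dataframe from the encoded pandas dataframe
diamond_sdf = spark.createDataFrame(diamond_pd)
```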
Vectorizing
The next step is vectorizing, which we do using VectorAssembler. This transformer combines a given list of columns into a single vector column, which is added to the dataframe.
Vectorization is useful for combining raw features and features generated by different feature transformers into a single feature vector in order to train ML models like logistic regression and decision trees.
The code example is below:
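A sketch of the assembly step, using the variable names introduced above:

```python
from pyspark.ml.feature import VectorAssembler

feature_cols = ["carat", "cut", "color", "clarity", "depth", "table"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Keep only the assembled features and the target variable
diamond_vec = assembler.transform(diamond_sdf).select("features", "price")
```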
Once we have vectorized the features carat, cut, color, clarity, depth, and table, we select only the price (our target variable) and the features column created by VectorAssembler into a new dataset.
The dataset should look like this after selecting only the features and price columns:
Now that we have created the dataframe, we will split it randomly into training and testing datasets. In PySpark, this is done using the randomSplit method:
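For example (the 70/30 ratio and the seed are assumptions):

```python
# Random 70/30 train-test split with a fixed seed for reproducibility
train_df, test_df = diamond_vec.randomSplit([0.7, 0.3], seed=42)
```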
Now that we have split the data into train_df and test_df, it’s time to build the models.
To keep it simple, I will be focusing on regression. The three models we will be creating are ordinary least squares, decision tree regression, and gradient-boosted regression. Since Databricks is a Spark environment, we will be using PySpark to create these models.
PySpark has its own built-in machine learning module, pyspark.ml.regression. From this module we imported the LinearRegression, DecisionTreeRegressor, and GBTRegressor classes.
For each regression, we created the model, pointed it at the features and label columns, logged the metrics to variables, and uploaded them to our Comet experiment.
The methods for obtaining metrics from the decision tree and gradient-boosted regressors differ from linear regression, so we created a method called get_SparkMetric to obtain the metrics for each:
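A sketch consistent with how the method is used below; the column names are assumptions:

```python
from pyspark.ml.evaluation import RegressionEvaluator

def get_SparkMetric(model, df, metric_name):
    # Score the model on df and return the requested metric
    # (metric_name is one of "r2", "mse", "mae", "rmse")
    evaluator = RegressionEvaluator(
        labelCol="price", predictionCol="prediction", metricName=metric_name
    )
    return evaluator.evaluate(model.transform(df))
```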
The metrics we will be logging are R² (r2), mean squared error (mse), mean absolute error (mae), and root mean squared error (rmse).
For each of these, we will use the experiment.log_metric method, which logs a value for a metric to a given experiment run and lets you set the name under which the metric appears in the experiment store.
Now, let’s look at the regression types we will be using.
To get metrics for linear regression, we fit the model and read the metrics directly from its training summary. An example is below:
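A minimal sketch; the metric names passed to Comet are assumptions:

```python
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(train_df)

# Linear regression exposes its metrics through a training summary
summary = lr_model.summary
experiment.log_metric("lr_r2", summary.r2)
experiment.log_metric("lr_mse", summary.meanSquaredError)
experiment.log_metric("lr_mae", summary.meanAbsoluteError)
experiment.log_metric("lr_rmse", summary.rootMeanSquaredError)
```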
The model logs the following metrics:
With decision tree regression, we need to import both DecisionTreeRegressor and RegressionEvaluator. We will use the latter to get the metrics, since the .summary method is available only for linear regression.
To simplify the code, we used the get_SparkMetric method detailed earlier in the article to gather each metric from the model, which is then logged to the experiment run.
An example of this is shown below:
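A sketch; looping over the metric names keeps the example short:

```python
from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor(featuresCol="features", labelCol="price")
dt_model = dt.fit(train_df)

# Evaluate on the test set and log each metric to the run
for metric in ["r2", "mse", "mae", "rmse"]:
    experiment.log_metric(f"dt_{metric}", get_SparkMetric(dt_model, test_df, metric))
```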
The model returns the following metrics:
The gradient-boosted regressor’s metrics are recorded in the same way as the decision tree model’s, using get_SparkMetric.
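The same pattern with GBTRegressor:

```python
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(featuresCol="features", labelCol="price")
gbt_model = gbt.fit(train_df)

for metric in ["r2", "mse", "mae", "rmse"]:
    experiment.log_metric(f"gbt_{metric}", get_SparkMetric(gbt_model, test_df, metric))
```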
The model returns the following metrics:
Once you have logged the metrics to the run, end the experiment by creating a cell with experiment.end(). The end result should look like the picture below:
After you have done that, let’s check out the results in Comet. The URL displayed at the end of the run is the location of your experiment in your Comet workspace.
Let’s check how the experiment logged the data and how the metrics compared to each other. The data saved in each experiment can be used to check for subtle changes in the model, metrics, and even the dataset.
Let’s now look inside the Panels tab, Comet’s built-in space for data visualization.
We separated the metrics into four distinct graphs comparing R², mean squared error, mean absolute error, and root mean squared error. Gradient boosting outperformed both simple linear regression and the regression tree.
You can also view the logged metrics in the metrics tab. The metrics tab also includes features that allow you to search by metric name and change the decimal precision.
Both the charts and the metric data confirm that the gradient boosting algorithm performed best of the three.
If we want to look at the charts we created from the EDA we need to click the graphics tab of the experiment. The Seaborn and matplotlib figures will be saved in the experiment. We can search for these figures by the name we assigned when we logged the figures.
Saving these charts is really valuable, especially if you need to create reports or presentations for stakeholders. It’s also quite useful if another teammate is attempting to replicate the results of your EDA on their own computer.
Now, let’s check out my favorite Comet feature: Code Logging.
The Azure Databricks environment doesn’t integrate with GitHub like a Jupyter Notebook would (sad, I know). However, it does log the code from your Databricks notebook if you set it up in the initial experiment.
The code inside the experiments can be accessed from the code tab in the Comet UI.
All the code between the Experiment() constructor and the experiment.end() call will be logged; any code outside these blocks will not be. The tab also allows you to export your code via the download button in the upper right corner.
Well, that’s it! You’ve successfully logged your first experiment in Comet using Azure Databricks. Combined with Comet, Databricks becomes an environment that can really speed up your experiment tracking.
Links to my GitHub repository and Comet experiment are below:
Thank you for reading! Connect with me on LinkedIn for more on data science topics.