August 30, 2024
A guest post from Fabrício Ceolin, DevOps Engineer at Comet. Inspired by the growing demand…
The open-source movement is responsible for most of the technological innovation we see today, and machine learning is no exception. This movement has birthed many new libraries, fueled projects, enabled rapid growth, and increased the reproducibility of experimental results and innovative applications. In addition, these libraries have made it more feasible to design large-scale real-world systems and adopt models.
Let’s get started and see what we’ve got on our hands! Because there are a lot of open source machine learning libraries out there, we are only going to look at a few. Below is a list of various machine learning libraries and how they’ve changed the machine learning landscape.
NumPy is one of the first libraries you’ll encounter when getting started with machine learning. NumPy, short for Numerical Python, was first developed by Travis Oliphant. It is the go-to Python library for scientific computation, handling multi-dimensional array and matrix operations.
NumPy provides its own array type, the ndarray, and brings the computational power of languages like C and Fortran to Python. As a result, it can run linear algebra routines, Fourier transforms, and random number generation over large arrays far faster than equivalent pure-Python loops.
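To make those three capabilities concrete, here is a minimal sketch. The matrix, the 5 Hz test signal, and the sample rate are illustrative values chosen for this example, not anything from a real dataset.

```python
import numpy as np

# Linear algebra: solve the system A @ x = b
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# Fourier transform: find the dominant frequency of a sampled sine wave
sample_rate = 100  # samples per second (illustrative value)
t = np.arange(sample_rate) / sample_rate  # one second of samples
signal = np.sin(2 * np.pi * 5 * t)  # a 5 Hz sine wave
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / sample_rate)
dominant_freq = freqs[np.argmax(spectrum)]

# Random numbers: a seeded generator gives reproducible draws
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

print(x, dominant_freq, samples.shape)
```

Each step operates on whole arrays at once, which is where NumPy's speed advantage over Python loops comes from.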
NumPy also forms the basis of many data science libraries: visualization libraries like Matplotlib, Seaborn, Plotly, Altair, and Bokeh; machine learning libraries like Scikit-learn and SciPy; and array libraries like Dask, PyTorch, TensorFlow, MXNet, and even Pandas.
Strengths:
Weaknesses:
import numpy as np

# create a 3x3 array with np.array
new_matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(new_matrix)
Visit Numpy.org to learn more about this library and submit an open-source contribution today! I contribute to this project so you can reach out to me if you need help.
As a data scientist, you have almost certainly heard of Pandas. Pandas is a fast and flexible data analysis and manipulation library developed by Wes McKinney. Unlike the other libraries on this list, Pandas isn’t technically a machine learning library. However, it is essential for handling tabular data, transforming data, and performing exploratory data analysis (EDA).
This library provides high-level data structures (Series and DataFrame) and many built-in data cleaning and analysis methods. Pandas methods reduce complex Python calculations to a few lines of code. You can pretty much call it the Microsoft Excel of Python. Besides data manipulation, Pandas is also handy for quick data visualization.
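As a rough illustration of the two data structures and the built-in cleaning methods, consider the sketch below. The fruit names and unit counts are made-up data used only for this example.

```python
import numpy as np
import pandas as pd

# A Series is a labelled one-dimensional array
prices = pd.Series([1.5, 2.0, np.nan], index=["apple", "banana", "cherry"])

# A DataFrame is a table of labelled columns (hypothetical sales data)
sales = pd.DataFrame({
    "fruit": ["apple", "banana", "apple", "cherry", "banana"],
    "units": [10, 5, np.nan, 8, 7],
})

# Data cleaning: fill the missing unit count with 0, then cast to int
sales["units"] = sales["units"].fillna(0).astype(int)

# Analysis: total units per fruit in a single line
totals = sales.groupby("fruit")["units"].sum()

print(prices.isna().sum())  # number of missing prices
print(totals)
```

The grouping step is the kind of operation that would take an explicit loop and a dictionary in plain Python.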
To get a feel for Pandas in action, check out feature engineering for categorical data, build an article recommendation system using Python, and data analytics with Pandas 🐼.
Strengths:
Weaknesses:
import pandas as pd

df = pd.DataFrame({"A": [1, 10, 100, 1000, 10000],
                   "B": [2, 20, 200, 2000, 20000],
                   "C": [3, 30, 300, 3000, 30000],
                   "D": [4, 40, 400, 4000, 40000]})

# mean absolute deviation of each column; DataFrame.mad() was removed in pandas 2.0,
# so compute it from its definition instead
print((df - df.mean()).abs().mean())
Visit Pandas.pydata.org to learn more about this library and submit an open-source contribution today!
SciPy, short for Scientific Python, is used for scientific analysis and technical computing on large data sets. SciPy was developed by Travis Oliphant, Pearu Peterson, and Eric Jones. This NumPy-based library includes modules for optimization, linear algebra, integration, interpolation, special functions, ODE solvers, FFT, and signal and image processing.
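A short sketch of three of those modules follows. The functions being integrated, minimized, and interpolated are toy examples picked because their exact answers are easy to check by hand.

```python
import numpy as np
from scipy import integrate, interpolate, optimize

# Integration: integrate sin(x) from 0 to pi (the exact answer is 2)
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Optimization: minimize (x - 3)^2, whose minimum is at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)

# Interpolation: fit a cubic spline through a few samples of x^2
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
spline = interpolate.CubicSpline(xs, xs ** 2)

print(area, result.x, float(spline(2.5)))
```

Each submodule works directly on NumPy arrays and plain Python callables, so no extra data structures are needed.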
Strengths:
Weaknesses:
from scipy import special

# base-10 exponential: exp10(x) computes 10**x
x = special.exp10(1)
print(x)
Visit Scipy.org to learn more about this library and submit an open-source contribution today!
Scikit-learn is home to many machine learning algorithms, along with model selection and preprocessing features. Scikit-learn was written in C and Python and built on top of NumPy and SciPy. It was developed by David Cournapeau as a Google Summer of Code project. It is presently one of the most widely used libraries for building machine learning models.
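The sketch below shows how those three pieces (an algorithm, model selection, and preprocessing) can fit together. The choice of logistic regression, the 25% test split, and the random seed are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Model selection: hold out 25% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and algorithm chained into a single estimator
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only, which avoids leaking information from the test set.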
The library is simple, robust, intuitive, and user-friendly. It’s also useful for building and evaluating machine learning models, data modeling, and statistical modeling. It can also vectorize text using bag-of-words (BOW) and hashing vectorization, among other things.
To get a sense of how Scikit-learn works in practice, check out fake news detection with Python, the best ways of splitting data for machine learning, and how do I detect anomalies and why is it necessary?
Strengths:
Weaknesses:
from sklearn import cluster, datasets

# load data
iris = datasets.load_iris()

# K-means clustering: create clusters for k=3
k = 3
k_means = cluster.KMeans(n_clusters=k, n_init=10, random_state=0)

# fit data
k_means.fit(iris.data)

# print every tenth cluster label next to the true class
print(k_means.labels_[::10])
print(iris.target[::10])
Visit Scikit-learn.org to learn more about this library and submit an open-source contribution today!
TensorFlow is a library for building, training, and running deep learning models and neural networks. This library was developed by the Google Brain team. Its architecture is flexible, so it can run across various computational platforms such as CPUs, GPUs, and TPUs.
TensorFlow also offers a web-based visualization tool called TensorBoard, as well as frameworks like TensorFlow Lite that make it easy to deploy machine learning models. With TensorBoard, you can visualize model parameters, gradients, and performance.
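As a rough sketch of how training and TensorBoard logging connect, the example below fits a single variable with gradient descent and writes the loss at each step. The toy loss function, learning rate, and temporary log directory are assumptions made for illustration.

```python
import tempfile

import tensorflow as tf

# Temporary directory for the event files (any writable path would do)
log_dir = tempfile.mkdtemp()
writer = tf.summary.create_file_writer(log_dir)

# Toy problem: find w that minimizes (w - 3)^2
w = tf.Variable(0.0)

with writer.as_default():
    for step in range(20):
        with tf.GradientTape() as tape:
            loss = (w - 3.0) ** 2
        grad = tape.gradient(loss, w)
        w.assign_sub(0.1 * grad)  # plain gradient descent update
        tf.summary.scalar("loss", loss, step=step)  # logged for TensorBoard

writer.flush()
print("Logs written to", log_dir, "final w:", w.numpy())
```

Running `tensorboard --logdir` with the printed directory should then display the loss curve in the browser.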
Read Dropout Regularization with Tensorflow Keras to see TensorFlow in action.
Strengths:
Weaknesses:
import tensorflow as tf

# create a constant string tensor
hello = tf.constant("hello world")
print(hello)
Visit Tensorflow.org to learn more about this library and submit an open-source contribution today!
Keras, often described as the Python deep learning library, is used for developing and evaluating neural networks. Keras was built by François Chollet. It supports multiple backends: TensorFlow, Theano, and CNTK. Its high-level API makes training neural networks easier, with less code and configuration.
Keras has built-in layers such as Conv2D and max-pooling, along with data preprocessing utilities. It also comes with a variety of pre-trained models, including image classification models.
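To show how those built-in layers compose, here is a small sketch of a convolutional classifier. The layer sizes are arbitrary, and the input is random noise standing in for 28x28 grayscale images, so the predictions are meaningless until the model is trained.

```python
import numpy as np
from tensorflow import keras

# A tiny convolutional network built from Keras's built-in layers
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(8, kernel_size=3, activation="relu"),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Forward pass on a batch of 4 random "images" (placeholder data)
batch = np.random.rand(4, 28, 28, 1).astype("float32")
probs = model.predict(batch, verbose=0)
print(probs.shape)
```

Calling `model.fit` with real images and labels would be the next step.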
Interested in how Keras works? Check out Trump’s Twitter insults and build your first convolutional neural network to classify cats and dogs.
Strengths:
Weaknesses:
import tensorflow as tf

# reference the MNIST dataset module (call mnist.load_data() to download the data)
mnist = tf.keras.datasets.mnist

# build a machine learning model
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10)
])
Visit Keras.io to learn more about this library and submit an open-source contribution today!
PyTorch is a deep learning library developed by Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. It offers various tools for machine learning, computer vision, and natural language processing (NLP). PyTorch can also build computational graphs dynamically and execute tensor computations using GPU acceleration.
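The sketch below illustrates both ideas: the computational graph is recorded as operations run, and the same code uses a GPU when one is available. The input values are arbitrary and chosen so the gradient is easy to verify by hand.

```python
import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# requires_grad=True asks PyTorch to record operations on x in a graph
x = torch.tensor([1.0, 2.0, 3.0], device=device, requires_grad=True)
y = (x ** 2).sum()

# Walk the recorded graph backwards to compute dy/dx = 2x
y.backward()

print(y.item())
print(x.grad)
```

Because the graph is rebuilt on every forward pass, ordinary Python control flow such as loops and conditionals can change the computation from one call to the next.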
PyTorch originated at Meta AI, formerly called FAIR (Facebook AI Research lab), and has been governed by the PyTorch Foundation under the Linux Foundation since 2022.
Strengths:
Weaknesses:
import torch

p_Tensor = torch.ones((2, 2))

# size of the tensor
print(p_Tensor.size())

# reshape the 2x2 tensor into a 1-D tensor with 4 elements
p_Tensor = p_Tensor.view(4)
print(p_Tensor)
Visit PyTorch.org to learn more about this library and submit an open-source contribution today!
NLTK is a Python library for performing natural language processing (NLP) tasks. The library was developed by Steven Bird, Edward Loper, and Ewan Klein. One thing to keep in mind is that NLTK is a collection of sub-packages and modules rather than a single ML library. These modules enable you to perform a range of tasks, such as sentence segmentation, stopword removal, word tokenization, named entity recognition (NER), dependency parsing, sentiment analysis, and text classification.
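As a small taste of how those modules combine, the sketch below tokenizes a sentence, counts word frequencies, and builds bigrams. It deliberately sticks to helpers that need no extra corpus downloads; tasks such as stopword removal or sentence segmentation require fetching data with `nltk.download` first. The sample text is made up for this example.

```python
from nltk.probability import FreqDist
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams

text = "Open source libraries power machine learning. Open source matters."

# Word tokenization with a regex-based tokenizer, keeping alphabetic tokens only
tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]

# Frequency distribution over the tokens
freq = FreqDist(tokens)

# Adjacent word pairs (bigrams)
bigrams = list(ngrams(tokens, 2))

print(tokens)
print(freq.most_common(2))
print(bigrams[:3])
```

Each step comes from a different sub-package, which reflects how NLTK is organized as a toolkit rather than a single model.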
Check out keyword extraction with Python to get started with NLTK.
Strengths:
Weaknesses:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
text_words = ["tech", "technology", "technologized", "techy", "Technologization"]

# print the stem of each word
for x in text_words:
    print(ps.stem(x))
Visit NLTK.org to learn more about this library and submit an open-source contribution today!
Danfo.js is a JavaScript library for manipulating and processing structured data. It provides high-performance, intuitive, and simple-to-use data structures. The library is based on TensorFlow.js and is greatly inspired by Pandas. This library was developed by Rising Odegua and Stephen Oni.
This library is handy especially because it enables developers to create JavaScript applications for machine learning and deep learning.
Strengths:
Weaknesses:
// install the package first (shell command): npm install danfojs-node
import * as dfd from "danfojs-node"

// create a Series
const s = new dfd.Series([1, 3, 5, undefined, 6, 8])
s.print()
Visit Danfo.jsdata.org to learn more about this library and submit an open-source contribution today!
Microsoft CNTK is a toolkit for training deep learning models and algorithms. It can be used as a standalone machine-learning tool through its own model description language, BrainScript, or as a library in Python and C++ projects. CNTK makes it simple to employ standard models, including feed-forward DNNs, convolutional neural networks (CNNs), and recurrent neural networks (RNNs/LSTMs), for speech recognition, handwriting recognition, and image recognition projects.
Strengths:
Weaknesses:
Visit docs.microsoft.com/cognitive-toolkit to learn more about this library and submit an open-source contribution today!
Open-source libraries are essential, and they have revolutionized machine learning research. They are used in various machine learning stacks, have helped solve real-world problems, and have simplified the development of real-world projects.
If you enjoyed this post, you should try your hand at applying any of the libraries mentioned while using Comet’s machine learning platform to track, compare, and reproduce your machine learning experiments.
Thanks for reading!