MLOps
August 22, 2022
At the time of this writing, organizations are still putting notebooks into production! Fortunately, the machine learning space is slowly beginning to adopt software engineering best practices. Among…
Written by Cory Maklin
Genius is making complex ideas simple, not making simple ideas complex. - Albert Einstein
I’ve heard a lot about data governance, but I still didn’t understand what it meant to implement it on a practical level. According to Google, data governance is defined as: I found that the best way…
Since their introduction in 2017, transformers have revolutionized the world of natural language processing. Prior to Transformers, LSTMs and RNNs were the state of the art. The reason Transformers…
Long gone are the days in which data practitioners trained machine learning models from scratch themselves. Unless you have a very specific use case, you’re better off leveraging the pre-trained…
Transformers have revolutionized the way data practitioners build models for natural language processing. In a similar vein, the advent of transfer learning has changed the game. Rather than training…
With a few exceptions, machine learning models do not accept raw text as input. The sequences of words must first be encoded in some fashion. We could represent each sentence as a Bag of Words (BOW)…
In the early 90s, recommendation systems, particularly automated collaborative filtering, started seeing more widespread use. Fast forward to today, recommendation systems are at the core of the…
Back in 2006, Netflix announced the Netflix Prize, a machine learning competition for predicting movie ratings. They offered a one million dollar prize to whoever improved the accuracy of their…
In recent years, there have been multiple scandals involving a machine learning model that made an unjust decision on the basis of gender or race. The EU is seeking to pass legislation requiring AI…
Data Vault modelling is used to build data warehouses while addressing the drawbacks of 3NF (Bill Inmon) and dimensional (Ralph Kimball) modelling. Data Vault, originally conceived by Daniel…
You can bet that you will be asked what kind of data issues you might encounter in your day job during one of your data engineer or data scientist interviews. Data quality will do more for model…
Latent Dirichlet Allocation, or LDA for short, is an unsupervised machine learning algorithm. Similar to the clustering algorithm K-means, LDA will attempt to group words and documents into a…
I’ve always wondered whether it is possible to be a senior engineer with a junior engineer title. If so, what’s the difference between the two? There have been instances in my career where, although I…
The data mesh architecture is all the rage nowadays, and for good reason. The data mesh brings to the data lakehouse what microservices brought to monolithic applications: that is, decoupling. Allow me…
In my previous role, we had written transformations using Spark Structured Streaming in notebooks and scheduled them in Airflow using the Papermill operator. We lacked the internal expertise and…
Isolation Forest is an unsupervised machine learning algorithm for anomaly detection. As the name implies, Isolation Forest is an ensemble method (similar to random forest). In other words, it use…
To this day, forecasting remains one of the most valuable applications of machine learning. For instance, we could use a model to predict the demand of a product. This information could then be used…
If you’re like me, then, whenever you hear talk of artificial intelligence ethics, you can’t help but think of a professor in a philosophy department contemplating whether robots should be given the…
Synthetic Minority Over-sampling TEchnique, or SMOTE for short, is a preprocessing technique used to address a class imbalance in a dataset. In the real world, oftentimes we end up trying to train a…
In the previous article, we discussed why the data warehouse architecture came to prominence. We also saw how it was unsuited for unstructured data and the volumes of data inherent in Big Data. We…
The term Data Warehouse was first coined in the 1970s. In essence, a data warehouse is a database management system (DBMS) that houses all of the enterprise’s data. The data warehouse serves as a…
It’s Tuesday afternoon, you’re sitting at your cubicle, and you’re typing away at your keyboard. Earlier in the day, you volunteered to pick up the ticket to modify the ingestion pipeline, but now…
Let’s say you decide to build a Facebook clone. You and your roommate grind away for a few weeks to get the application up and running. Everything looks great, you’ve got over 100 users (including…
Breadth First Search (or BFS for short) is a graph traversal algorithm. In BFS, we visit all of the neighboring nodes at the present depth prior to moving on to the nodes at the next depth. Breadth…
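The level-by-level visiting order described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the article's own implementation: the adjacency-list `graph` and the `bfs` helper are hypothetical names chosen for the example.

```python
from collections import deque


def bfs(graph, start):
    """Return nodes in the order BFS visits them, starting from `start`.

    `graph` is assumed to be a dict mapping each node to a list of neighbors.
    """
    order = [start]
    seen = {start}
    queue = deque([start])  # FIFO queue: nodes at the current depth leave first
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical example graph: A is at depth 0, B and C at depth 1, D at depth 2.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
order = bfs(graph, "A")
```

Because the queue is first-in, first-out, both depth-1 nodes are visited before the depth-2 node, regardless of which parent discovers it.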
Quicksort In Python. We’ve all been guilty of it. Whenever we come across a problem that requires us to sort an array, we default to implementing bubble sort. I….
Oftentimes, we can’t solve integrals analytically and must resort to numerical methods. Among these is Monte Carlo integration. As you may remember, the integral of a function can be…
Like other MCMC methods, the Gibbs sampler constructs a Markov Chain whose values converge towards a target distribution. Gibbs Sampling is in fact a specific case of the Metropolis-Hastings…
A Markov chain is a model describing a sequence of possible events where the probability of each event depends only on the state attained in the previous event. Markov Chain Monte Carlo (MCMC) methods have a wide…
AES (Advanced Encryption Standard) is the most widely used symmetric encryption algorithm. AES is used in a wide array of applications that include the encryption of data at rest, and secure file…
In short, Diffie-Hellman is a widely used technique for securely establishing a shared symmetric encryption key with another party. Before proceeding, let’s discuss why we’d want to use something like the…
XGBoost is short for Extreme Gradient Boosting (I wrote an article that provides the gist of gradient boost here). Unlike Gradient Boost, XGBoost makes use of regularization parameters that help…
Generative Adversarial Networks or GANs for short are a type of neural network that can be used to generate data rather than attempt to classify it. Although slightly disturbing, the following site…
If you have a background in electrical engineering, you will, in all probability, have heard of the Fourier Transform. In layman's terms, the Fourier Transform is a mathematical operation that…
Suppose that you’re at a house party and you’re talking to some cute girl. As you listen, your ears are being bombarded by the sound coming from the conversations going on between different groups…
Random forest is one of the most popular machine learning algorithms out there. Like decision trees, random forest can be applied to both…
We can think of the KL divergence as a distance-like measure (although it isn’t symmetric, so it is not a true metric) that quantifies the difference between two probability distributions.
A tutorial on how to implement Ridge Regression from scratch in Python using NumPy.
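To make the asymmetry concrete, here is a minimal sketch for discrete distributions. The `kl_divergence` helper and the two example distributions are assumptions made for illustration, and the code assumes `q` is nonzero wherever `p` is nonzero.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two hypothetical distributions over the same two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)  # roughly 0.51 nats
kl_qp = kl_divergence(q, p)  # roughly 0.37 nats
```

Swapping the arguments changes the result, which is why the KL divergence is usually called a divergence rather than a distance.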
As the name implies, the method of Least Squares minimizes the sum of the squares of the residuals between the observed targets in the…
Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing classification, regression and even outlier detection. The linear SVM classifier works by drawing a…
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to represent a high-dimensional dataset in a low-dimensional space of two or three dimensions so that…
Singular Value Decomposition, or SVD, has a wide array of applications. These include dimensionality reduction, image compression, and denoising data. In essence, SVD states that a matrix can be…
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique which minimizes the variance and maximizes the distance between…
An explanation of the Logistic Regression algorithm with an example of how to implement it in Python.
A tutorial on how to implement the random forest algorithm in R.
An example of how to implement a decision tree classifier in Python.
An example of how to implement linear regression in Python.
A tutorial on how to perform image classification on the MNIST dataset using convolutional neural networks (CNN) and Python.
A tutorial on how to use the k-nearest neighbors algorithm to classify data in Python.
Gaussian mixture models can be used to cluster unlabeled data in much the same way as k-means. There are, however, a couple of advantages to using Gaussian mixture models over k-means. First and…
Spectral clustering is a popular unsupervised machine learning algorithm which often outperforms other approaches. In addition, spectral clustering is very simple to implement and can be solved…
Affinity Propagation was first published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you to…
The Affinity Propagation algorithm was published in 2007 by Brendan Frey and Delbert Dueck in Science. In contrast to other traditional clustering methods, Affinity Propagation does not require you…
Existing data clustering methods do not adequately address the problem of processing large datasets with a limited amount of resources (i.e. memory and cpu cycles). In consequence, as the dataset…
DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to classify unlabeled data…
Density-Based Spatial Clustering of Applications with Noise, or DBSCAN for short, is an unsupervised machine learning algorithm. Unsupervised machine learning algorithms are used to classify…
For most of their history, computer processors became faster every year. Unfortunately, this trend in hardware stopped around 2005. Due to limits in heat dissipation, hardware developers stopped…
An example of how to train a logistic regression model at scale using Apache Spark MLlib and Python.
A step-by-step tutorial on how to perform sentiment analysis using an LSTM recurrent neural network implemented with Keras.
Word embeddings are used to reduce the number of features in NLP (Natural Language Processing) based problems such as sentiment analysis.
An example of how to implement batch normalization using TensorFlow Keras in order to prevent overfitting.
A tutorial on the various methods for evaluating a classification model's performance.
An example of how to perform time series forecasting by building an ARIMA model in Python.
A tutorial on how to perform image classification using a conv net and TensorFlow Keras.
An in-depth explanation of the gradient boosting decision tree algorithm.
An example of how to publish data to a Kafka Docker container using a NiFi processor.
The most common ways to store data are CSV, XML and JSON. JSON is less verbose than XML, but both still use a lot of space compared to binary formats. In JSON, you repeat every field name with every…
An explanation of the Naive Bayes classifier algorithm and its underlying assumption.
An example of how to implement TF-IDF from scratch with Python.
Principal Component Analysis or PCA is used to reduce the number of features without the loss of too much information. The problem with having too many dimensions is that it makes it difficult to…
An example of how to use k-fold cross-validation with sklearn to tune hyperparameters.
How to calculate and interpret R Squared. An example which covers the meaning of the R Squared score in relation to linear regression.
A brief explanation of Apache Hadoop. A high-level overview of YARN, HDFS and MapReduce.
A high-level overview of stream processing and how it relates to Apache Kafka.
People rightly believed (although maybe a bit too optimistically at first) that the advent of the Internet would bring about a revolution in the realm of commerce. Following the dotcom boom, a new…
The Elastic Stack has recently risen to fame in the realm of Big Data analytics and machine learning. The Elastic Stack is a suite of tools (i.e. Elasticsearch, Logstash, Kibana and Beats) for…
For a GUI-less server, the shared clipboard functionality of VirtualBox Guest Additions does not work, as a text-based server does not have a clipboard. Therefore, if you want to use copy and paste…
A brief overview of the meaning behind CPU specs.
A tutorial on how to access data from Local Storage, Session Storage and IndexedDB using the JavaScript API.
An explanation of one of the reasons why compiled languages like C are more efficient than interpreted ones such as Python.
An explanation of the differences between public and private keys. A comparison of asymmetric and symmetric encryption.
Hierarchical clustering algorithms group similar objects into groups called clusters. Learn how to implement hierarchical clustering in Python.
Mean Shift is a centroid-based clustering algorithm. In contrast to supervised machine learning algorithms, clustering attempts to group data without having first been trained on labeled data. Clustering…
Supervised learning problems can be further grouped into Classification and Regression problems. As opposed to classification problems, regression has the task of predicting a continuous quantity…
Logistic Regression is a supervised machine learning algorithm used in the classification of data. For example, suppose that given their income, we wanted to predict whether a customer would buy a…
K-Means Clustering is an unsupervised machine learning algorithm. In contrast to traditional supervised machine learning algorithms, K-Means attempts to classify data without having first been…
Linear Support Vector Machine (or LSVM) is a supervised learning method that looks at data and sorts it into one of two categories. LSVM works by drawing a line between two classes. All the data…
K-Nearest Neighbors (or KNN) is one of the simplest machine learning algorithms and is used in a wide array of institutions. KNN is a non-parametric, lazy learning algorithm. When we say a technique…
The random forest algorithm makes use of multiple decision trees. It can solve both regression and classification problems. With random forest, however, learning may be slow (depending on the…
A tutorial on how to create a sign up form using AWS Amplify, Cognito and React.
A tutorial on how to build a full-stack application that leverages AWS Lambda, DynamoDB, API Gateway and S3.
A tutorial on how to host a static website from an S3 bucket using the AWS CLI.
A tutorial on how to create and query a DynamoDB table using the AWS CLI.
A tutorial on how to create Lambda functions using the AWS CLI.
A tutorial on how to create an NGINX server in the cloud using the Azure CLI and cloud-init.
A tutorial on how to create EC2 instances using the AWS CLI.
A tutorial on how to create virtual machines in the cloud using the Azure CLI.
It’s only a matter of time before self-driving cars become widespread. This tremendous feat of engineering wouldn’t be possible without convolutional neural networks. The algorithm used by…
Linear regression is the most basic form of machine learning. In linear regression, we attempt to determine the best-fitting line for our data. In the following article, we’ll go through a simple…
In the following article, we’ll delve into how to train our machine learning models or in other words how to minimize loss. In the context of machine learning, when people are speaking about a…
The MNIST dataset is often referred to as the “Hello World” of machine learning programs for computer vision. The MNIST dataset is composed of 28x28-pixel images of handwritten digits (0, 1, 2…