Explanation of Analysis of Variance (ANOVA)


Experimentation is widely used at tech startups to make decisions on whether to roll out new product features, UI design changes, marketing campaigns and more, usually with the goal of improving conversion rate, revenue and/or sales volumes. Oftentimes, we want to test the effect of one change (treatment group) against the status quo (control group), but what if we are considering several options and want to conduct an experiment with more than 2 groups?

In this article, I will walk through the intuition behind one-way ANOVA and how to use it to analyze your results from an experiment with multiple…
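As a rough illustration of the idea, here is a minimal sketch of the one-way ANOVA F-statistic in plain Python. The function name `one_way_anova_f` and the three sample groups are hypothetical and made up for this example; they are not data from the article. The F-statistic compares the variation between group means with the variation within groups, so a large value suggests that at least one group mean differs from the others.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical metric values for a control group and two treatment groups.
control = [10.0, 12.0, 11.0, 13.0]
variant_a = [14.0, 15.0, 13.0, 16.0]
variant_b = [10.0, 11.0, 12.0, 11.0]

f_stat, df_between, df_within = one_way_anova_f([control, variant_a, variant_b])
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

To turn the F-statistic into a p-value you would compare it against the F distribution with the two degrees of freedom returned above. A statistics library is the usual route for that step.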


Understand the intuition behind bagging with examples in Python

In this article, I will go over a popular homogeneous model ensemble method: bagging. Homogeneous ensembles combine a large number of base estimators, or weak learners, of the same algorithm.

The principle behind homogeneous ensembles is the “wisdom of the crowd”: the collective predictions of many diverse models are better than the predictions of any single model. There are three requirements to achieve this:

  1. The models must be independent;
  2. Each model must perform slightly better than random guessing;
  3. All individual models must have similar performance on their own.

When these three requirements are satisfied, adding…
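To see why the three requirements above matter, here is a small sketch that computes the accuracy of a majority vote. It is not bagging itself. It assumes binary classification, fully independent models, and the same accuracy for every model, which is an idealized version of the three requirements. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_models, p_correct):
    """Probability that a strict majority of n independent binary
    classifiers, each correct with probability p_correct, is right."""
    majority = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )


# Each model is only slightly better than a coin flip (55% accurate).
# Odd ensemble sizes are used so that ties cannot happen.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.55), 4))
```

Under these assumptions the ensemble accuracy rises as more models are added. If each model were no better than random guessing (`p_correct = 0.5`), adding models would not help.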

Five Steps to building a bot that scrapes news websites and tweets out the top headlines

Note: This article is purely for educational purposes. We do not encourage you to scrape websites, especially those that may have terms and conditions against such actions.

Nowadays, it is not uncommon for people to get news stories from social media platforms where recommended articles tend to be about similar topics. In the weeks leading up to the 2020 U.S. Presidential Election, I scrolled through many similar recommended headlines on my social media feed, and soon came to the realization that I needed to broaden my perspectives and diversify my news sources to escape from the echo chamber reinforced by…

A Hands-on Modeling Guide using a Kaggle Dataset

With the surge in e-commerce and digital transactions, identity fraud has also risen, affecting millions of people every year. In 2019, fraud losses in the US alone were estimated at around US$16.9 billion, a substantial portion of which came from credit card fraud¹.

In addition to strengthening cybersecurity measures, financial institutions are increasingly turning to machine learning to identify and reject fraudulent transactions when they happen, so as to limit losses.

I came across a credit card fraud dataset on Kaggle and built a classification model to predict fraudulent transactions. In this article, I will…

How to Code and Deploy a Python Web App using Plotly Dash and Heroku

I find that simulations are a useful way to understand mathematical concepts, so I recently coded one to illustrate the gambler’s ruin problem. I made a web app to simulate a series of games and their outcomes in Python. In this web app, users define a set of parameters (probability of success per round (p), initial amount (i), goal amount (N) and number of games), and it will return the probability of winning as well as the balance at the end of every round for each game, as shown below.

Result of a Simulation with p=0.5, i=10, N=50, n_games=30

In this article, I will walk through how to code…
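For readers who want a feel for the simulation before the web app, here is a minimal standard-library sketch of the gambler's ruin problem. It is not the app's code. It assumes the gambler wins or loses exactly one unit per round and that a game ends when the balance reaches 0 or the goal N. The function names and the `seed` parameter are hypothetical.

```python
import random


def simulate_game(p, i, N, rng):
    """Play one game; return the balance after every round."""
    balance = i
    history = [balance]
    while 0 < balance < N:
        # Win one unit with probability p, otherwise lose one unit.
        balance += 1 if rng.random() < p else -1
        history.append(balance)
    return history


def simulate(p, i, N, n_games, seed=0):
    """Return (share of games won, list of per-game balance histories)."""
    rng = random.Random(seed)
    games = [simulate_game(p, i, N, rng) for _ in range(n_games)]
    wins = sum(1 for g in games if g[-1] == N)
    return wins / n_games, games


def theoretical_win_probability(p, i, N):
    """Closed-form probability of reaching N before 0 with unit bets."""
    if p == 0.5:
        return i / N
    r = (1 - p) / p
    return (1 - r**i) / (1 - r**N)


estimate, games = simulate(p=0.5, i=10, N=50, n_games=30)
print("simulated:", estimate)
print("theoretical:", theoretical_win_probability(0.5, 10, 50))
```

With only 30 games the simulated share can sit some distance from the theoretical value. Increasing `n_games` should bring the two closer.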

How to plot geolocation coordinates and cluster centers using geopandas and matplotlib

When working with geospatial data, it is often useful to find clusters of latitude and longitude coordinates either as a data preprocessing step for your machine learning model or as part of segmentation analysis. However, some frequently asked questions related to finding geospatial clusters include:

  • Which clustering algorithm works best for your dataset?
  • Which coordinates belong to which clusters?
  • Where are the boundaries for each cluster, and how are the coordinates being separated?

I recently worked on a Kaggle competition from 2017 to predict taxi trip durations, mainly from geospatial and temporal features (see post here). One of the preprocessing steps I…

Mobility data has surged in popularity recently due to COVID-19, so I wanted to work on a prediction problem involving geospatial data. I decided to tackle the NYC Cab Trip Duration Kaggle competition, where the objective is to predict trip duration of NYC cab rides given primarily geospatial and temporal features.

Using a LightGBM model, I achieved an RMSLE score of 0.38109, which would put me at position #177 of 1254 entries on the public leaderboard, or roughly the top 14% (Kaggle doesn’t publish late submission scores on the leaderboard).

RMSLE score of predictions scored by Kaggle

In this article, I will outline…
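For context on the metric, here is a minimal sketch of RMSLE (root mean squared logarithmic error), written with the standard definition based on log(1 + x). The function name `rmsle` and the sample durations are hypothetical; this is not the article's modeling code.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical trip durations in seconds: actual versus predicted.
actual = [600.0, 1200.0, 300.0]
predicted = [660.0, 1000.0, 330.0]
print(rmsle(actual, predicted))
```

Because the error is measured on a log scale, RMSLE penalizes relative differences rather than absolute ones. Being 60 seconds off on a 10-minute trip counts for more than being 60 seconds off on an hour-long trip.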

Visualization with geospatial data can be a very powerful tool for storytelling. Using a dataset of cab rides from 2016, I recently made a GIF animation illustrating a time-lapsed heatmap of cab pickup locations in New York City at each hour interval from Mondays to Sundays.

In less than a minute and a half, this animation conveys information across multiple dimensions: geospatial data (latitude and longitude coordinates), temporal data (time of day and day of week), and descriptive statistics on the number of rides per time period, listed in the title at the top. …

Claudia Ng

Data Scientist | FinTech | Harvard MPP | Language Enthusiast
