A constantly updated python machine learning cheatsheet

ShuaiW | Last update: Aug 21, 2022

ML Cheatsheet

Assuming we have the dataset in a loadable format (e.g., csv), here are the steps we follow to complete a machine learning project.

  1. Exploratory data analysis
  2. Preprocessing
  3. Feature engineering
  4. Machine learning

A couple of notes before we go on.

First of all, machine learning is a highly iterative field. This entails looping over the above steps, where each cycle builds on feedback from the previous one, with the goal of improving model performance. For example, when we engineer new features we need to refit the models and test whether those features are actually predictive.

Second, while in Kaggle competitions one can create a monster ensemble of models, in production systems such ensembles are often not useful. They are high maintenance, hard to interpret, and too complex to deploy. This is why in practice it's often a simpler model plus a huge amount of data that wins.

Third, while some code snippets are reusable, each dataset has its own uniqueness. Dataset-specific efforts are needed to build better models.

Bearing these points in mind, let's get our hands dirty.

Exploratory data analysis

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with plots. The goal of EDA is to get a deeper understanding of the dataset, and to preprocess data and engineer features more effectively. Here are some generic code snippets that can be applied to any structured dataset.

import libraries

import os
import fnmatch
import glob

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data I/O

df = pd.read_csv(file_path) # read in csv file as a DataFrame
df.to_csv(file_path, index=False) # save a DataFrame as csv file

# read all csv under a folder and concatenate them into a big dataframe
path = r'path'
# flat
all_files = glob.glob(os.path.join(path, "*.csv"))
# or recursively
all_files = [os.path.join(root, filename)
             for root, dirnames, filenames in os.walk(path)
             for filename in fnmatch.filter(filenames, '*.csv')]
df = pd.concat((pd.read_csv(f) for f in all_files))

data I/O zipped

import pandas as pd
import zipfile

zf_path = 'file.zip'
zf = zipfile.ZipFile(zf_path) # zipfile.ZipFile object
all_files = zf.namelist() # list all zipped files
all_files = [f for f in all_files if f.endswith('.csv')] # e.g., keep only csv
df = pd.concat((pd.read_csv(zf.open(f)) for f in all_files)) # concat all zipped csv into one dataframe

To a table in sqlite3 DB (then you can use DB Browser for SQLite to view and query the table)

import sqlite3
import pandas as pd

df = pd.read_csv(csv_file) # read csv file
sqlite_file = 'my_db.sqlite3'
conn = sqlite3.connect(sqlite_file) # establish a sqlite3 connection
# if the table already exists, append the csv to it
df.to_sql(tablename, conn, if_exists='append', index=False)
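
The table can then be read back into a DataFrame with a plain SQL query; a minimal sketch (tablename is the same placeholder as above):

import sqlite3
import pandas as pd

conn = sqlite3.connect('my_db.sqlite3')
df_from_db = pd.read_sql('SELECT * FROM tablename', conn) # replace tablename with your table
conn.close()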

data summary

df.head() # return the first 5 rows
df.describe() # summary statistics, excluding NaN values
df.info(verbose=True, show_counts=True) # concise summary of the table
df.shape # shape of dataset
df.skew(numeric_only=True) # skewness for numeric columns
df.kurt(numeric_only=True) # unbiased kurtosis for numeric columns
df.dtypes.value_counts() # counts of dtypes

display missing value proportion for each col

for c in df.columns:
    num_na = df[c].isnull().sum()
    if num_na > 0:
        print(round(num_na / len(df), 3), '|', c)

pairwise correlation of columns

df.corr()
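
If the correlation matrix is large, it can help to list the most strongly correlated pairs directly instead of eyeballing the heatmap below. A small follow-up sketch (the 0.8 cutoff is an arbitrary choice):

corr = df.corr().abs()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr = upper.stack().sort_values(ascending=False)
print(high_corr[high_corr > 0.8]) # pairs above the arbitrary 0.8 cutoff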

plotting

plot heatmap of correlation matrix (of all numeric columns)

cm = np.corrcoef(df.T)
sns.heatmap(cm, annot=True, yticklabels=df.columns, xticklabels=df.columns)


plot univariate distributions

# single column
sns.histplot(df['col1'].dropna(), kde=True)

# all numeric columns
for c in df.columns:
    if df[c].dtype in ['int64', 'float64']:
        sns.histplot(df[c].dropna())
        plt.show()


plot kernel density estimation (KDE)

# all continuous variables
for c in df.columns:
    if df[c].dtype in ['float64']:
        sns.kdeplot(df[c].dropna(), fill=True)
        plt.show()


plot pairwise relationships

sns.pairplot(df.dropna())


hypertools is a python toolbox for visualizing and manipulating high-dimensional data. This is desirable for the EDA phase.

visually explore relationship between features and target (in 3D space)

import hypertools as hyp
import seaborn as sns
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
hyp.plot(X, 'o', group=y, legend=list(set(y)), normalize='across')


linear regression analysis using each PC

from sklearn import linear_model

sns.set(style="darkgrid")
sns.set_palette(palette='Set2')
data = pd.DataFrame(data=X, columns=iris.feature_names)
reduced_data = hyp.reduce(hyp.tools.df2mat(data), ndims=3)
linreg = linear_model.LinearRegression()
linreg.fit(reduced_data, y)
sns.regplot(x=reduced_data[:, 0], y=linreg.predict(reduced_data), label='PC1', x_bins=10)
sns.regplot(x=reduced_data[:, 1], y=linreg.predict(reduced_data), label='PC2', x_bins=10)
sns.regplot(x=reduced_data[:, 2], y=linreg.predict(reduced_data), label='PC3', x_bins=10)
plt.title('Correlation between PC and Regression Output')
plt.xlabel('PC Value')
plt.ylabel('Regression Output')
plt.legend()
plt.show()


break down by labels

sns.set(style="darkgrid")
sns.swarmplot(x=y, y=reduced_data[:, 0], order=[0, 1, 2])
plt.title('Correlation between PC1 and target')
plt.xlabel('Target')
plt.ylabel('PC1 Value')
plt.show()


For more use cases of hypertools, check the notebooks and examples.

Preprocessing

drop columns

df.drop([col1, col2, ...], axis=1, inplace=True) # in place
new_df = df.drop([col1, col2, ...], axis=1) # create new df (overhead created)

handle missing values

# fill with mode, mean, or median
df_mode, df_mean, df_median = df.mode().iloc[0], df.mean(), df.median()
df_fill_mode = df.fillna(df_mode)
df_fill_mean = df.fillna(df_mean)
df_fill_median = df.fillna(df_median)

# drop col with any missing values
df_drop_na_col = df.dropna(axis=1)

encode categorical features

from sklearn.preprocessing import LabelEncoder

df_col = df.columns
col_non_num = [c for c in df_col if df[c].dtype == 'object']
for c in col_non_num:
    df[c] = LabelEncoder().fit_transform(df[c])

join two tables/dataframes

df1.join(df2, on=col)

handle outliers (outliers can either be clipped or removed. WARNING: outliers are not always meant to be removed)

In the following examples we assume df is all numeric and has no missing values

clipping

# clip outliers to 3 standard deviations from the mean
lower = df.mean() - df.std() * 3
upper = df.mean() + df.std() * 3
clipped_df = df.clip(lower, upper, axis=1)

removal

from scipy import stats

# remove rows that have outliers in at least one column
new_df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

filter

# filter by one value
new_df = df[df.col == val]

# filter by multiple values
new_df = df[df.col.isin(val_list)]

Feature engineering

Transformation

one-hot encode categorical features; not necessary for tree-based algorithms

# for a couple of columns
one_hot_df = pd.get_dummies(df[[col1, col2, ...]])

# for the whole dataframe
new_df = pd.get_dummies(df)

normalize numeric features (to range [0, 1])

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
normalized_df = scaler.fit_transform(df) # returns a numpy array

log transformation: for columns with highly skewed distribution, we can apply the log transformation

from scipy.special import log1p

transformed_col = df[col].apply(log1p)
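
To apply this more broadly, one option is to log-transform every numeric column whose skewness exceeds some cutoff. A small sketch, assuming non-negative values and an arbitrary skewness cutoff of 1:

import numpy as np

skew_cutoff = 1 # arbitrary; tune per dataset
numeric_cols = df.select_dtypes(include=[np.number]).columns
skewed_cols = [c for c in numeric_cols if abs(df[c].skew()) > skew_cutoff]
df[skewed_cols] = np.log1p(df[skewed_cols]) # assumes non-negative values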


Creation

Feature creation takes both domain knowledge and engineering effort. With help from domain experts we can craft more predictive features, but here are some generic feature creation methods worth trying on any structured dataset

add feature: number of missing values

df['num_null'] = df.isnull().sum(axis=1)

add feature: number of zeros

df['num_zero'] = (df == 0).sum(axis=1)

add feature: binary value for each feature indicating whether a data point is null

for c in df:
    if pd.isnull(df[c]).any():
        df[c + '-ISNULL'] = pd.isnull(df[c])

add feature interactions

from sklearn.preprocessing import PolynomialFeatures

# e.g., 2nd order interaction
poly = PolynomialFeatures(degree=2)
# numpy array of transformed df
arr = poly.fit_transform(df)
# build feature names such as 'col1^1xcol2^1' for the transformed columns
target_feature_names = ['x'.join(
    ['{}^{}'.format(pair[0], pair[1]) for pair in pairs if pair[1] != 0])
    for pairs in [zip(df.columns, p) for p in poly.powers_]]
new_df = pd.DataFrame(arr, columns=target_feature_names)

Selection

There are various ways to select features, and an effective one is recursive feature elimination (RFE).

select feature using RFE

from sklearn.feature_selection import RFE

model = ... # an sklearn estimator that has either a 'coef_' or 'feature_importances_' attribute
num_feature = 10 # say we want the top 10 features
selector = RFE(model, n_features_to_select=num_feature, step=1)
selector.fit(X_train, y_train) # select features
feature_selected = list(X_train.columns[selector.support_])
model.fit(X_train[feature_selected], y_train) # re-train a model using only selected features

For more feature engineering methods please refer to this blogpost.

Machine learning

Cross validation (CV) strategy

Theory first (some of it adapted from Andrew Ng). In machine learning we usually have the following subsets of data:

  • training set is used to run the learning algorithm on
  • dev set (or hold-out cross validation set) is used to tune parameters, select features, and make other decisions regarding the learning algorithm
  • test set is used to evaluate the performance of the algorithms, but NOT to make any decisions about what algorithms or parameters to use

Ideally, those 3 sets should come from the same distribution, and reflect what data you expect to get in the future and want to do well on.

If we have a real-world application from which we continuously collect new data, then we can train on historical data and split the incoming data into dev and test sets. This is out of the scope of this cheatsheet. The following examples assume we have a csv file and want to train the best model on this snapshot.

How should we split the three sets? Here is one good CV strategy:

  • training set: the larger the merrier, of course :)
  • dev set should be large enough to detect differences between algorithms (e.g., if classifier A has 90% accuracy and classifier B has 90.1%, then a dev set of 100 examples would not be able to detect this 0.1% difference; something in the range of 1,000 to 10,000 examples will do)
  • test set should be large enough to give high confidence in the overall performance of the system (do not naively use 30% of the data)

Sometimes we can be pretty data-strapped (e.g., 1,000 data points), and a compromise strategy is a 70%/15%/15% split for train/dev/test sets, as follows:

from sklearn.model_selection import train_test_split

# set seed for reproducibility & comparability
seed = 2017
X_train, X_other, y_train, y_other = train_test_split(
    X, y, test_size=0.3, random_state=seed)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_other, y_other, test_size=0.5, random_state=seed)

As noted we need to seed the split.

If we have class imbalance issue, we should split the data in a stratified way(using the label array):

X_train, X_other, y_train, y_other = train_test_split(
    X, y, test_size=0.3, random_state=seed, stratify=y)

Model training

If we've gotten this far, training is actually the easier part. We just initialize a classifier and train it!

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)

Evaluation

Having a single-number evaluation metric allows us to sort all models according to their performance on this metric and quickly decide which is working best. In production systems, if we have multiple (N) evaluation metrics, we can set N-1 of them as 'satisficing' metrics, i.e., we simply require that they meet a certain value, and then define the final one as the 'optimizing' metric, which we directly optimize.
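
As a toy illustration of this selection rule (the candidate models and the numbers below are made up), we filter by the satisficing metric first, then pick the best model on the optimizing metric:

# hypothetical candidates: (name, auc, latency_ms)
candidates = [
    ('logreg', 0.86, 5),
    ('random_forest', 0.90, 40),
    ('big_ensemble', 0.91, 900),
]
max_latency_ms = 100 # satisficing: only has to be met
feasible = [c for c in candidates if c[2] <= max_latency_ms]
best = max(feasible, key=lambda c: c[1]) # optimizing: maximize AUC
print('selected model: {}'.format(best[0])) # -> random_forest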

Here is an example of evaluating a model with Area Under the Curve (AUC)

from sklearn.metrics import roc_auc_score

# AUC is computed from predicted scores/probabilities rather than hard class labels
y_score = clf.predict_proba(X_test)[:, 1]
print('ROC AUC score: {}'.format(roc_auc_score(y_test, y_score)))

Hyperparameter tuning

example of nested cross-validation

import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X_train = ... # your training features
y_train = ... # your training labels
# inner CV: grid search over hyperparameters
gs = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={
        'n_estimators': [100, 200, 400, 600, 800],
        # other params to tune
    },
    scoring='roc_auc',
    cv=5)
# outer CV: estimate generalization performance of the tuned model
scores = cross_val_score(gs, X_train, y_train, scoring='roc_auc', cv=2)
print('CV roc_auc: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

Ensemble

Please refer to the last section of this blogpost.
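
For a quick hands-on example without leaving scikit-learn (not from the blogpost; the choice of base models is arbitrary), here is a minimal sketch of a soft-voting ensemble:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression()),
        ('rf', RandomForestClassifier(random_state=0)),
    ],
    voting='soft') # average the predicted class probabilities
ensemble.fit(X_train, y_train)
y_pred_ensemble = ensemble.predict(X_test)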
