10 things I wish I knew about Machine Learning Competitions

1 10 things I wish I knew about Machine Learning Competitions

2 Introduction
- Theoretical competition run-down
- The list of things I wish I knew
- Code samples for a running competition

3 Kaggle the platform

4 Reasons to compete
- Money
- Fame
- Learning experience
- Tough challenge
- Fun

5 Competition run-down
- Head over to kaggle.com
- Read the competition description
- Download the train/test set

6 Preparations
- Plot the data
- Look at the distributions
- Start simple (all-zeroes benchmark)
- Make sure to optimize the correct metric
- Read up on the specific properties of that metric
  → e.g. Logarithmic Loss, extreme predictions
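To illustrate the point about Logarithmic Loss and extreme predictions, here is a minimal sketch of a binary log loss. The function name `log_loss`, the clipping constant `eps`, and the toy predictions are assumptions made for this example, not part of the original slides.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary logarithmic loss; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clipping avoids log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 1]
cautious = [0.9, 0.1, 0.9, 0.1]  # last prediction is wrong, but hedged
extreme = [1.0, 0.0, 1.0, 0.0]   # last prediction is wrong and fully confident

loss_cautious = log_loss(y_true, cautious)
loss_extreme = log_loss(y_true, extreme)
```

Both prediction sets get the same three rows right and the same row wrong, yet the fully confident set is penalized far more heavily for its single mistake. That is why it pays to know how the competition metric treats extreme values before submitting.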

7 Preprocessing
- Replace missing values
- Remove duplicates from the training set
- One-hot encode categorical features
- Decide what to do with outliers
- Scaling/standardizing
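The preprocessing steps above can be sketched with the standard library alone. The toy rows, the column names `age` and `color`, and the choice of median imputation are assumptions for illustration. Outlier handling is left out because it depends on the dataset.

```python
from statistics import mean, median, pstdev

rows = [
    {"age": 22.0, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 35.0, "color": "red"},
    {"age": 22.0, "color": "red"},  # exact duplicate of the first row
]

# 1) Remove duplicates while keeping the original order.
seen, unique_rows = set(), []
for row in rows:
    key = (row["age"], row["color"])
    if key not in seen:
        seen.add(key)
        unique_rows.append(dict(row))

# 2) Replace missing values with the median of the observed values.
age_median = median(r["age"] for r in unique_rows if r["age"] is not None)
for r in unique_rows:
    if r["age"] is None:
        r["age"] = age_median

# 3) One-hot encode the categorical column.
categories = sorted({r["color"] for r in unique_rows})
for r in unique_rows:
    for c in categories:
        r["color_" + c] = 1 if r["color"] == c else 0

# 4) Standardize the numeric column to zero mean and unit variance.
ages = [r["age"] for r in unique_rows]
mu, sigma = mean(ages), pstdev(ages)
for r in unique_rows:
    r["age_scaled"] = (r["age"] - mu) / sigma
```

In a real pipeline the median, the category list, and the scaling statistics would be computed on the training data only and then reused on the test data.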

8 Building the model
- Start with a baseline or simple model
  → Random predictions
  → Logistic regression
  → Decision trees
  → K-nearest neighbours
- Establish a cross-validation scheme
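As a rough sketch of "start with a baseline and establish a cross-validation scheme", the snippet below scores a majority-class predictor with a hand-rolled k-fold split. The helper names `k_fold_indices` and `majority_class_cv_accuracy` and the 70/30 toy labels are hypothetical. In practice a library such as scikit-learn would provide the splitting.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=42):
    """Shuffle indices once and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class_cv_accuracy(labels, k=5, seed=42):
    """Cross-validated accuracy of always predicting the training majority class."""
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        majority = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
        hits = sum(labels[j] == majority for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

labels = [0] * 70 + [1] * 30  # toy, imbalanced labels
baseline_accuracy = majority_class_cv_accuracy(labels)
```

Any real model should beat this number under the same folds. If it does not, the problem is more likely in the pipeline than in the choice of algorithm.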

9 Submit
- Leaderboard score vs. local score
- Mismatch?
  → Check your scoring function
  → Check the sample size of the public LB
  → Ignore the LB

10 Kaggle isn't real-world ML
- Trade-off: accuracy vs. interpretability vs. speed
- Interpretability/speed is often more important than accuracy
- "Arrow splitting"
- "Netflix Problem"

11 1) Timing
- Don't start too early
  - «Beat the benchmark», sharing, motivation
- Don't start too late
  - You'll certainly run out of time
- ~30 days before the deadline

12 2) Learn a tool, stick with it
- Python
- R
- Matlab/Octave
- The grass is always greener on the other side

13 3) Make sure your results are reproducible
- Fix the seeds for algorithms that involve randomization
- Automate your pipeline
  - Preferably one script from input to output

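14 4) Make sure your results are reproducible
Examples:
- Weight initialization (neural networks)
- Data subsampling (e.g. Random Forest)

    import numpy as np
    from sklearn.model_selection import train_test_split

    # scikit-learn: fix the seed of the train/test split
    train_test_split(x, y, random_state=42)
    # numpy: fix the seed of the global random generator
    np.random.seed(42)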
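The same idea can be shown with the standard library only. This is a small sketch in which the helper `sample_rows` and the row counts are made up for the example: a dedicated, seeded generator makes a random subsample repeatable from run to run.

```python
import random

def sample_rows(n_rows, n_sample, seed):
    """Draw a random subsample of row indices using a dedicated, seeded generator."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), n_sample))

run_a = sample_rows(1000, 10, seed=42)
run_b = sample_rows(1000, 10, seed=42)  # same seed, same subsample
run_c = sample_rows(1000, 10, seed=7)   # a different seed gives a different draw in general
```

Passing the generator or seed explicitly, rather than relying on global state, keeps the result stable even when other parts of the pipeline also draw random numbers.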

15 5) Don't trust the leaderboard
- Danger of overfitting when tuning your models according to feedback from the public leaderboard
- Use cross-validation to estimate the performance of your model
- Don't, if it is computationally too expensive
  → A train/test split might cut it too
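A small simulation shows why tuning against the public leaderboard is risky. Everything here is an assumption made for illustration: 200 "submissions" that are pure coin flips, a public leaderboard of 100 rows, and a private set of 2,000 rows.

```python
import random

rng = random.Random(0)
n_public, n_private, n_submissions = 100, 2000, 200

public_truth = [rng.randint(0, 1) for _ in range(n_public)]
private_truth = [rng.randint(0, 1) for _ in range(n_private)]

def accuracy(truth, preds):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(truth, preds)) / len(truth)

best_public, private_of_best = -1.0, None
for _ in range(n_submissions):
    # Every "model" is a coin flip, so its true accuracy is 50%.
    public_preds = [rng.randint(0, 1) for _ in range(n_public)]
    private_preds = [rng.randint(0, 1) for _ in range(n_private)]
    score = accuracy(public_truth, public_preds)
    if score > best_public:
        best_public = score
        private_of_best = accuracy(private_truth, private_preds)
```

The submission selected for its public score looks clearly better than chance on the public rows, while its private score stays close to 50%. Picking models by leaderboard feedback selects for luck on a small sample.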

16 6) Avoid leakage
- Common sources
  - PCA
  - TfIdf
  - Imputation (mean/median)
- Duplicate rows in the training set
- Inappropriate cross-validation scheme
  - Row, person, time, location
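Mean imputation is one of the leakage sources listed above. The sketch below, with toy values and a hypothetical `impute` helper, contrasts a fill value computed on train and test together with one learned from the training rows only.

```python
from statistics import mean

train = [1.0, 2.0, None, 3.0]
test = [100.0, None]

def impute(values, fill):
    """Replace missing entries (None) with the given fill value."""
    return [fill if v is None else v for v in values]

# Leaky: the fill value is computed from train and test together,
# so information about the test rows flows into the training data.
leaky_fill = mean(v for v in train + test if v is not None)

# Safe: the fill value is learned from the training rows only,
# then reused unchanged on the test rows.
safe_fill = mean(v for v in train if v is not None)

train_imputed = impute(train, safe_fill)
test_imputed = impute(test, safe_fill)
```

The same rule applies to PCA and TfIdf: fit the transformation inside each cross-validation fold on the training part, then apply it to the held-out part.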

17 7) Bias/variance trade-off
- High variance (overfitting)
- High bias (underfitting)

18 8) Think outside the box
- «Don't get stuck in local minima»
- Stop doing what you're doing if you're not making significant progress
- Read up on relevant papers on the problem
- Explore a different model
- Try more feature engineering

19 9) Spend your time wisely
- Feature engineering vs. hyper-parameter tuning
- Read up on error analysis
- Read up on learning curves

20 9) Improving a learning algorithm
(V) = helps against high variance, (B) = helps against high bias
- Get more training examples (V)
- Try smaller sets of features (V)
- Try getting additional features (B)
- Try adding polynomial features (B)
- Increase regularization (V)
- Decrease regularization (B)
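One way to turn the list above into a habit is to compare training and validation error before deciding what to try next. The thresholds `target_error` and `gap_tolerance` below are arbitrary assumptions for this sketch and would need to be set per problem.

```python
REMEDIES = {
    "high variance": [
        "get more training examples",
        "try smaller sets of features",
        "increase regularization",
    ],
    "high bias": [
        "try getting additional features",
        "try adding polynomial features",
        "decrease regularization",
    ],
    "ok": [],
}

def diagnose(train_error, validation_error, target_error=0.10, gap_tolerance=0.05):
    """Rough rule of thumb: a large train/validation gap suggests variance,
    a high training error suggests bias."""
    if validation_error - train_error > gap_tolerance:
        return "high variance"
    if train_error > target_error:
        return "high bias"
    return "ok"

overfit_case = diagnose(train_error=0.02, validation_error=0.30)
underfit_case = diagnose(train_error=0.28, validation_error=0.30)
```

Looking up `REMEDIES[overfit_case]` then gives the (V) items and `REMEDIES[underfit_case]` the (B) items from the slide.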

21 10) Make use of ensembling
- Six bad models are usually better than one really good model [1, 2]
  → KNN, SVM, NeuralNet, RF, LogisticRegression, Ridge
  → Neural nets (varied structurally or by seed)
- Make yourself familiar with: bagging, boosting, blending, stacking
[1] [2]

22 An example would be handy right about now.

23 Make use of ensembling (cont.)
[Figure: true signal, training data, linear model, non-linear model, and averaged prediction]
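The effect of averaging can be sketched numerically. This is not the linear/non-linear pair from the figure: as a stand-in, two imperfect "models" are simulated as the true signal plus independent noise, and all names and constants are assumptions for this example.

```python
import math
import random

rng = random.Random(1)
xs = [i / 50 for i in range(200)]
true_signal = [math.sin(x) for x in xs]

# Stand-ins for two imperfect models: the true signal plus independent noise.
model_a = [y + rng.gauss(0, 0.3) for y in true_signal]
model_b = [y + rng.gauss(0, 0.3) for y in true_signal]
averaged = [(a + b) / 2 for a, b in zip(model_a, model_b)]

def mse(pred, truth):
    """Mean squared error between predictions and the true values."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

mse_a = mse(model_a, true_signal)
mse_b = mse(model_b, true_signal)
mse_avg = mse(averaged, true_signal)
```

Because squared error is convex, the averaged prediction can never be worse than the mean of the two individual errors. The gain is largest when the models make different mistakes, which is the argument for combining structurally different models.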

24 Working with features
- Feature selection
- Feature engineering
  → categorical
  → numerical
  → textual

25 Examples of feature selection/engineering
- Remove correlated features
- Remove features using statistical tests
- Try pair-wise feature interactions
  - a*b, a-b, a+b, a/b
- Try feature transformations
  - sqrt(a), log(a), abs(a)
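The pair-wise interactions and transformations listed above could be generated along these lines. The function names and the handling of undefined cases (division by zero, roots and logarithms of non-positive values) are assumptions for this sketch.

```python
import math

def pairwise_features(a, b):
    """Pair-wise interactions of two numeric features."""
    feats = {"a*b": a * b, "a-b": a - b, "a+b": a + b}
    # Assumption: use NaN when the ratio is undefined.
    feats["a/b"] = a / b if b != 0 else math.nan
    return feats

def transformed_features(a):
    """Single-feature transformations; undefined cases become NaN."""
    return {
        "sqrt(a)": math.sqrt(a) if a >= 0 else math.nan,
        "log(a)": math.log(a) if a > 0 else math.nan,
        "abs(a)": abs(a),
    }

interactions = pairwise_features(6.0, 3.0)
transforms = transformed_features(4.0)
```

Whether NaN, zero, or a shifted transform such as log(1 + a) is the better choice depends on the model that consumes the features.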

26 Feature engineering (categorical)
- CabinID into deck and room number
  → A25 → ('A', 25)
  → B16 → ('B', 16)
- Recode number of siblings to binary (family)
- Decompose dates
  - Year, month, day
  - Day of the week
  - Day of the month
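A minimal sketch of these three recodings follows. The helper names and the assumed cabin format (one letter followed by digits) are illustrative, and real cabin fields may need more cases.

```python
import re
from datetime import date

def split_cabin(cabin_id):
    """Split a cabin ID such as 'A25' into (deck, room number)."""
    match = re.fullmatch(r"([A-Za-z])(\d+)", cabin_id)
    if match is None:
        return (None, None)  # assumption: unknown formats become missing values
    return (match.group(1), int(match.group(2)))

def has_family(n_siblings):
    """Recode a sibling count into a binary 'travels with family' flag."""
    return 1 if n_siblings > 0 else 0

def decompose_date(d):
    """Break a date into separate numeric features (Monday is day_of_week 0)."""
    return {"year": d.year, "month": d.month, "day": d.day, "day_of_week": d.weekday()}

cabin = split_cabin("A25")
parts = decompose_date(date(2000, 1, 1))
```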

27 Feature engineering (textual)
- Lowercase
- Stemming ('rainy' → 'rain')
- Spelling correction
  - «I wsa hungray» → «I was hungry»
  - «It's hotttt outside» → «It's hot outside»
- Remove stopwords
- N-grams
- TfIdf, Count, Hashing
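Some of these text steps fit in a few lines of standard-library Python. The sketch covers lowercasing, collapsing stretched letters, stopword removal, and n-grams. The tiny `STOPWORDS` set is an assumption, and stemming and true spelling correction are left to dedicated libraries.

```python
import re

STOPWORDS = {"i", "it", "s", "the", "a", "is", "was"}  # tiny illustrative list

def normalize(text):
    """Lowercase, collapse runs of 3+ identical characters, and tokenize."""
    text = text.lower()
    text = re.sub(r"(.)\1{2,}", r"\1", text)  # "hotttt" -> "hot"
    return re.findall(r"[a-z]+", text)

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """All contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = normalize("It's hotttt outside")
content_tokens = remove_stopwords(tokens)
bigrams = ngrams(content_tokens, 2)
```

The resulting tokens or n-grams are what a TfIdf, count, or hashing vectorizer would then turn into numeric features.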

28 What usually doesn't work (for me)
- Dimensionality reduction (information loss)
- Feature elimination (information loss)
- Tree-based methods on high-dimensional/sparse data (by design)

29 There is always a twist
- Feature engineering
  → a.k.a. Golden Features
- How exciting is this project?
  → linear decay towards the end
- Removing useless/noisy features

30 Dataset trends
- Datasets become larger (millions of samples, thousands of features)
- Datasets are anonymized
  → Black-box machine learning

31 Interesting stuff to keep an eye on
- Caffe, cuDNN
- Vowpal Wabbit (Wee-Dub)
- H2O from 0xdata
- Regularized Greedy Forests
- Factorization models

35 55 features, 15k training samples, ~500k Test samples

36 Random predictions

37 Start simple: Decision tree

38 A little more complex

39 Let's see what the model thinks

40 Next: SVM! What?!

41 Feature scaling!

42 Enough playing, let's get real.

43 73.5% accuracy?

44 Class distribution

45 Scale it up!

46 75.489% accuracy

47 Even more? Nope, no more progress! Time to switch tactics.

48 Feature Engineering

49 78.212% accuracy

50 One more round

51 Mail LinkedIn: ch.linkedin.com/in/mattvonrohr/ Kaggle: kaggle.com/users/8376/matt
