BUILDING A TRAINING SET FOR AN AUTOMATIC (LSST) LIGHT CURVE CLASSIFIER

1 RAFAEL MARTÍNEZ-GALARZA BUILDING A TRAINING SET FOR AN AUTOMATIC (LSST) LIGHT CURVE CLASSIFIER WITH: JAMES LONG, VIRISHA TIMMARAJU, JACKELINE MORENO, ASHISH MAHABAL, VIVEK KOVAR AND THE SAMSI WG2

2 THE MOTIVATION: LSST IS COMING
The Large Synoptic Survey Telescope is an 8.4 m reflector currently under construction in Chile (first light expected in 2021).
Design concept: a survey that will take an image of every part of the entire visible sky every few nights, in six bands, for 10 years.
Transients and variable stars: periodic and non-periodic variable sources will be studied in detail, and new types are expected at very short and very long timescales.
[Figure: SN Ia; Dado & Dar, 2013]

3 CHALLENGE: VARIABILITY IS DIVERSE
Periodic (RR Lyrae stars, Cepheids): consistent in their periods and amplitudes.
Quasi-periodic (Mira stars): dominant frequencies, but no consistency in phase or amplitude.
Stochastic (AGNs, QSOs): variability without any obvious pattern.
Transient (supernovae, stellar flares, GRBs): short-lived changes in flux, non-periodic.

4 THE CLASSIFICATION CHALLENGE
LSST will deliver the time histories of ~10^9 sources. Classifying that many sources by their light curves is not feasible for humans in a reasonable time.
Light curves will be both sparse and non-simultaneous across filters.
Machine-learning algorithms can, in principle, greatly help with this classification task. ML algorithms can:
Learn functions that map light-curve (LC) features into class probabilities.
Detect outliers whose features stand out with respect to the full population.
[Figure credit: Joachim et al.]

5 WHERE CAN THINGS GO WRONG?
1. Training set: training set bias. Only the brightest or nearest sources have robust labels. Rare classes are underrepresented.
[Diagram: TRAINING LIGHT CURVES → FEATURE VECTORS + LABELS → MACHINE LEARNING ALGORITHM → PREDICTIVE MODEL; NEW LIGHT CURVE → FEATURE VECTOR → PREDICTIVE MODEL → POSTERIOR FOR LABEL]

7 WHERE CAN THINGS GO WRONG?
2. Feature extraction: uneven time sampling. Noise. Features can be class-specific. Computationally expensive.

8 WHERE CAN THINGS GO WRONG?
3. Features: which is your ground truth? Labels might come from a different domain (e.g., a different survey). Need to map the discriminators to the new target space.

9 WHERE CAN THINGS GO WRONG?
4. Training: accuracy depends on the method used. Multi-dimensional feature spaces where classes are not linearly separable. RFs? SVMs? ANNs?

10 TRAINING SET BIAS (Richards et al.)
Discrepancies in the period-amplitude plane: ASAS data has high density in the short-period, high-amplitude region. Testing data also has smaller values of the QSO-like variability metric.
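
One simple way to quantify a discrepancy like the one described on this slide is to compare the distribution of a single feature in the training and target samples. The sketch below computes a two-sample Kolmogorov-Smirnov statistic in plain Python. The function name `ks_statistic` and the log-period values are hypothetical and invented for illustration; this is not necessarily the diagnostic used by Richards et al.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The ECDF difference only changes at observed values, so check those.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical log10(period / day) values, invented for illustration only.
train_log_period = [-0.6, -0.5, -0.4, -0.3, -0.3, -0.2, 0.0, 0.4]
target_log_period = [-0.3, 0.0, 0.3, 0.6, 0.9, 1.2, 1.8, 2.4]

shift = ks_statistic(train_log_period, target_log_period)
print(f"KS statistic between training and target periods: {shift:.3f}")
```

A value near 0 means the two samples cover the feature similarly; a value near 1 means the training set barely overlaps the target population in that feature.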

11 POSSIBLE APPROACHES TO TACKLE BIAS:
Active learning
1) Choose the unlabelled sources in the test set that would most improve the classifier's performance if their labels were known.
2) Follow them up.
3) Add them to the training set.
Domain adaptation
Transformation: $T: D_{\mathrm{source}}(\theta) \rightarrow D_{\mathrm{target}}(\theta)$
Needs only a fraction of labeled sources in the target dataset.
But we need to get started somewhere.
[Figure credit: Stanford]
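
The selection step of the active-learning loop can be sketched as follows. The slide's criterion is to pick the sources whose labels would most improve performance. The code below swaps in a simpler, commonly used stand-in, entropy-based uncertainty sampling, which ranks unlabelled sources by how undecided the current classifier is. The source IDs, class probabilities, and function names are all hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def select_for_followup(class_probs, n_select):
    """Return the IDs of the n_select sources with the most uncertain predictions."""
    ranked = sorted(class_probs,
                    key=lambda source_id: entropy(class_probs[source_id]),
                    reverse=True)
    return ranked[:n_select]


# Hypothetical classifier outputs: P(RR Lyrae), P(eclipsing binary), P(QSO).
class_probs = {
    "src_a": [0.98, 0.01, 0.01],
    "src_b": [0.40, 0.35, 0.25],
    "src_c": [0.70, 0.20, 0.10],
    "src_d": [0.34, 0.33, 0.33],
}

followup = select_for_followup(class_probs, 2)
print(followup)
```

After follow-up, the newly labelled sources would be appended to the training set and the classifier retrained, which closes the loop in steps 2) and 3).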

12 THE SDSS STRIPE 82

13 WE ARE BUILDING A TRAINING/TEST SET USING STRIPE 82 SOURCES
The catalog has ~60K light curves in bands u, g, r, i, z, with about 50 observations per LC.
We have a GitHub repository with code to download the dataset, gather existing literature labels, merge the classifications, and split the dataset into training and testing sets.
We have also tested code to:
Inspect the variability of sources, and make a census of the different source classes (QSOs, RR Lyrae, Delta Scuti, eclipsing binaries, etc.).
Perform feature extraction.
Test supervised and unsupervised classification methods (random forests, K-means, clustering). Next talk by Virisha.
Identify outliers, and discover the weirdest objects.
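
The merge-and-split step could look roughly like the sketch below. It is not the repository's code: the catalog contents, source IDs, and function names are hypothetical. The policy of dropping sources with conflicting labels, and the per-class (stratified) split that sends at least one source of each class to the test set, are assumptions made for illustration.

```python
import random
from collections import defaultdict


def merge_labels(catalogs):
    """Merge {source_id: label} dicts, keeping only sources whose labels agree."""
    votes = defaultdict(set)
    for catalog in catalogs:
        for source_id, label in catalog.items():
            votes[source_id].add(label)
    return {sid: next(iter(labels))
            for sid, labels in votes.items() if len(labels) == 1}


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split source IDs into train/test lists, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for source_id, label in sorted(labels.items()):
        by_class[label].append(source_id)
    train, test = [], []
    for label in sorted(by_class):
        ids = by_class[label]
        rng.shuffle(ids)
        # At least one source per class goes to the test set (an assumption).
        n_test = max(1, round(test_fraction * len(ids)))
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test


# Hypothetical literature catalogs; "s4" carries conflicting labels.
catalog_a = {"s1": "RRLyr", "s2": "RRLyr", "s3": "QSO", "s4": "EB"}
catalog_b = {"s3": "QSO", "s4": "RRLyr", "s5": "QSO", "s6": "QSO",
             "s7": "RRLyr", "s8": "RRLyr"}

merged = merge_labels([catalog_a, catalog_b])
train_ids, test_ids = stratified_split(merged)
print(len(merged), len(train_ids), len(test_ids))
```

Splitting per class keeps rare classes represented on both sides of the split, which matters given the concern about underrepresented classes raised earlier.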

14 BASIC FACTS
5 bands: ugriz.
~60,000 variable sources.
~50 observations per band (but with significant variance).
Photometry is roughly simultaneous across bands.
Survey is deep: 2 mag deeper than regular SDSS observations.
Cadence: on average, sources are re-observed every two days, followed by 5-day, 10-day, and yearly observations.
[Figure: example light curves]

15 MULTI-BAND PERIOD DETERMINATION
We use the multi-band periodogram (VanderPlas & Ivezić, 2015) to estimate periods.
It outperforms existing methods, especially for non-simultaneous, sparsely sampled multi-band LCs.
[Equations: single-band (Lomb-Scargle) model; multi-band model]
The method is linear in the θ parameters, and thus it is fast.
Regularization is the key to enabling multi-band analysis and to avoiding overfitting.
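
To illustrate why a model that is linear in its parameters is fast, here is a deliberately simplified sketch. It is not the VanderPlas & Ivezić implementation: each band is centred on its own mean, and a single sinusoid shared by all bands is fit by linear least squares at each trial frequency. The band-specific sinusoidal terms of the full multi-band model are omitted, so no regularization is needed here. The simulated light curve and all names are hypothetical.

```python
import math
import random


def shared_sinusoid_power(times, mags, bands, frequency):
    """Fraction of the band-centred variance explained by one shared sinusoid."""
    # Centre each band on its own mean so constant band offsets drop out.
    totals, counts = {}, {}
    for mag, band in zip(mags, bands):
        totals[band] = totals.get(band, 0.0) + mag
        counts[band] = counts.get(band, 0) + 1
    resid = [mag - totals[band] / counts[band] for mag, band in zip(mags, bands)]

    # The model a*sin(wt) + b*cos(wt) is linear in (a, b), so the best fit
    # follows from 2x2 normal equations that are solved in closed form.
    omega = 2.0 * math.pi * frequency
    ss = sc = cc = ys = yc = 0.0
    for t, y in zip(times, resid):
        s, c = math.sin(omega * t), math.cos(omega * t)
        ss += s * s
        sc += s * c
        cc += c * c
        ys += y * s
        yc += y * c
    det = ss * cc - sc * sc
    a = (ys * cc - yc * sc) / det
    b = (yc * ss - ys * sc) / det
    return (a * ys + b * yc) / sum(y * y for y in resid)


# Simulated sparse, non-simultaneous two-band light curve (invented numbers).
rng = random.Random(42)
true_frequency = 1.0 / 0.6  # cycles per day
times, mags, bands = [], [], []
for band, mean_mag in (("g", 17.0), ("r", 16.7)):
    for _ in range(25):
        t = rng.uniform(0.0, 100.0)
        times.append(t)
        bands.append(band)
        mags.append(mean_mag
                    + 0.3 * math.sin(2.0 * math.pi * true_frequency * t)
                    + rng.gauss(0.0, 0.02))

frequencies = [0.5 + 0.002 * k for k in range(1251)]
best_frequency = max(
    frequencies, key=lambda f: shared_sinusoid_power(times, mags, bands, f))
print(f"best period: {1.0 / best_frequency:.4f} d")
```

Because both bands contribute to one fit, 25 points per band behave like a single 50-point light curve. Adding per-band sinusoidal terms would make the model more flexible, and that is where the regularization mentioned on the slide would come in.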

16 EXTRACTING FEATURES FROM IRREGULAR TIME SERIES TYPE EXAMPLES MORPHOLOGY PERIODICITY REGRESSION CAR(1) MODELS MULTIBAND VARIABILITY COLOR
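
As a rough illustration of the morphology and color feature types, the sketch below computes a few summary statistics that do not depend on the time stamps, so uneven sampling does not affect them. The feature names, the choice of statistics, and the example magnitudes are hypothetical. Periodicity and CAR(1) features are not covered.

```python
from statistics import mean, median, pstdev


def band_features(mags):
    """Time-independent shape statistics for the magnitudes of one band."""
    mu = mean(mags)
    sigma = pstdev(mags)
    med = median(mags)
    skew = 0.0 if sigma == 0 else mean(((m - mu) / sigma) ** 3 for m in mags)
    return {
        "amplitude": (max(mags) - min(mags)) / 2.0,  # half the full range
        "std": sigma,
        "skew": skew,
        "mad": median(abs(m - med) for m in mags),   # median absolute deviation
    }


def light_curve_features(mags_by_band):
    """Flatten per-band features into one dict and add a mean g - r color."""
    features = {}
    for band, mags in mags_by_band.items():
        for name, value in band_features(mags).items():
            features[f"{band}_{name}"] = value
    if "g" in mags_by_band and "r" in mags_by_band:
        features["g_minus_r"] = mean(mags_by_band["g"]) - mean(mags_by_band["r"])
    return features


# Hypothetical magnitudes for one source, invented for illustration.
example = {
    "g": [17.0, 17.4, 17.2, 16.8, 17.6],
    "r": [16.7, 17.0, 16.9, 16.6, 17.3],
}
features = light_curve_features(example)
for name in sorted(features):
    print(f"{name}: {features[name]:.3f}")
```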

17 A TOOL FOR FEATURE EXTRACTION We want to improve this: See: FeaturesDocumentation.html

18 FEATURE EXTRACTION

19 CORRELATIONS BETWEEN FEATURES
We need to be careful about which features are actually useful. Some of them can be correlated, anti-correlated, or otherwise interdependent. Which features should we give more importance to (and which should we bother calculating at all)?
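
One possible way to act on this, sketched under simple assumptions, is to compute pairwise Pearson correlations and greedily drop any feature that is strongly correlated or anti-correlated with one already kept. The 0.9 threshold, the feature values, and the function names are hypothetical. Non-linear interdependence would not be caught by this check.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prune_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns (one value per source), invented for illustration.
columns = {
    "amplitude": [0.2, 0.5, 0.9, 0.3, 0.7],
    "std": [0.10, 0.26, 0.44, 0.16, 0.35],  # nearly proportional to amplitude
    "g_minus_r": [0.3, 0.1, 0.6, 0.2, 0.25],
}

kept = prune_correlated(columns)
print(kept)
```

The result depends on the order in which features are visited, so in practice the cheaper or more interpretable features would be listed first.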

20 FINDING THE WEIRDEST OF THE WEIRDEST
Outlier detection with unsupervised random forest (Baron et al.)
1. Sample from the marginal distributions (synthetic data).
2. Train an RF classifier with real and synthetic as classes.
3. Find weird objects with respect to the metric.
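
Step 1, and the labelling needed for step 2, can be sketched as below. Each feature column is permuted independently, which draws from the empirical marginal distributions while destroying the correlations between features. Using a permutation rather than sampling with replacement is an assumption, as are all names and numbers. The random forest and the weirdness metric of step 3 are not implemented here.

```python
import random


def synthetic_from_marginals(rows, seed=0):
    """Shuffle every feature column independently to break feature covariance."""
    rng = random.Random(seed)
    columns = [list(column) for column in zip(*rows)]
    for column in columns:
        rng.shuffle(column)
    return [list(row) for row in zip(*columns)]


# Hypothetical feature vectors in which the second feature is twice the first.
real = [[0.2, 0.4], [0.5, 1.0], [0.9, 1.8], [0.3, 0.6], [0.7, 1.4]]
synthetic = synthetic_from_marginals(real)

# Two-class training set for the classifier of step 2 (not implemented here).
training_rows = real + synthetic
training_labels = ["real"] * len(real) + ["synthetic"] * len(synthetic)
print(synthetic)
```

A classifier that can tell the two classes apart has learned the covariance structure of the real data, which is what lets a tree-based similarity measure flag objects that sit far from everything else.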

21 WEIRDNESS IN OUR SAMPLE
1. Weirdness scores are generally high, i.e., objects are fairly far from each other.
2. Color, CAR parameters, and harmonic amplitudes rank high among features in classification importance.

22 WEIRDEST OBJECTS: PERIODIC QSOS? LEAST WEIRD OBJECTS: RR LYRAE?

23 CHALLENGING QUESTIONS AHEAD
Which features are most important for classification and outlier detection? What makes weird sources weird? We need to include the full dataset and all possible features.
Why are some of the weirdest objects we are finding periodic QSOs? Their periods (~1 d) are suspicious.
A literature search (or observational follow-up) for the weirdest objects should be the next step. What is special about them?
Next steps: domain adaptation? Leverage results from, e.g., the Catalina survey.
Unsupervised classification of sources (next talk).
Data challenge: upcoming discussions. Should the training set be based on observations? Simulations?

24 FINDING THE WEIRDEST TRANSITS IN KEPLER LIGHT CURVES

25 MERGING CLASSIFICATIONS
Stripe 82 sources have been independently labeled in previous studies. Many sources are still unlabeled. Domain adaptation?
RR Lyrae: color cuts, template fitting.
Eclipsing binaries: spectra and LC shape.
HADS: visual inspection + PCA for colors + random forest classifier.
QSOs: spectra.
References: Sesar et al. 2010; Becker et al.; Süveges et al. 2012.

26 RESULTS ON STRIPE 82 SOURCES
