Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting Yaguang Li Joint work with Rose Yu, Cyrus Shahabi, Yan Liu Page 1
Introduction Traffic congestion wastes time, money, and energy: congestion cost Americans $124 billion in direct and indirect losses in 2013.+ Accurate traffic forecasting could substantially improve route planning and mitigate traffic congestion. + https://www.forbes.com/sites/federicoguerrini/2014/10/14/traffic-congestion-costs-americans-124-billion-a-year-report-says/ * Image from La La Land: http://variety.com/2017/film/awards/la-la-land-hollywood-musical-trend-2017-1201981880/ Page 2
The Problem Existing Solutions Introduction Route 1: best route according to current traffic conditions. Route 2: Let's see. Route 1 Route 2 Destination 7:00 AM Page 3
The Problem Existing Solutions Introduction Predictive vs. Real-Time Path-Planning Route 1 Route 2 Destination Evolution of traffic over time 7:15AM Page 4
The Problem Existing Solutions Introduction Traffic forecasting enables better route planning. Route 1 Route 2 Destination Stuck in traffic 7:30AM Page 5
Traffic Prediction Input: the road network and the past T′ traffic speed observations at the sensors. Output: traffic speeds for the next T steps. Input: observations at 7:00 AM, …, 8:00 AM. Output: predictions for 8:10 AM, 8:20 AM, …, 9:00 AM. Page 6
Challenges for Traffic Prediction Complex spatial dependency. Non-linear, non-stationary temporal dynamics. [Figure: speed (mile/h) over time for Sensors 1–3] Page 7
Related Work Traffic Prediction without spatial dependency modeling Simulation and queuing theory [Drew 1968] Kalman filter: [Okutani et al. TRB 83] [Wang et al. TRB 05] ARIMA: [Williams et al. TRB 98] [Pan et al. ICDM 12] Support Vector Regression (SVR): [Muller et al. ICANN 97] [Wu et al. ITS 04] Gaussian process: [Xie et al. TRB 10] [Zhou et al. SIGMOD 15] Recurrent neural networks and deep learning: [Lv et al. ITS 15] [Ma et al. TRC 15] [Li et al. SDM 17] These methods model each sensor independently and fail to capture spatial correlations. Page 8
Related Work Traffic Prediction with spatial dependency modeling Vector ARIMA [Williams and Hoel JTE 03] [Chandra et al. ITS 09] Spatiotemporal ARIMA [Kamarianakis et al. TRB 03] [Min and Wynter TRC 11] k-nearest Neighbor [Li et al. ITS 12] [Rice et al. ITS 13] Latent Space Model [Deng et al. KDD 17] Convolutional Neural Network [Ma et al. ITS 17] These methods either assume linear temporal dependency or fail to capture the non-Euclidean spatial dependency. Page 9
Big picture Model spatial dependency with proposed diffusion convolution. Model temporal dependency with augmented recurrent neural network * Li, Yaguang et al. Diffusion Convolutional Recurrent Neural Network: Data-driven Traffic Forecasting, ICLR 2018. Page 10
Spatial Dependency Modeling Model spatial dependency with Convolutional Neural Networks (CNNs): CNNs extract meaningful spatial patterns using filters and achieve state-of-the-art results on image-related tasks. However, CNNs are only applicable to Euclidean grids such as images. [Figure: image with a convolutional filter] * Y. LeCun et al. Gradient-based learning applied to document recognition. Proc. IEEE 1998 + Image from: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/ Page 11
Spatial Dependency in Traffic Prediction Spatial dependency among traffic flows is non-Euclidean and directed: sensors that are close in Euclidean space do not necessarily have similar traffic speeds, and dist_net(v_i, v_j) ≠ dist_net(v_j, v_i). [Figure: Sensors 1–3 on a road network] Page 12
Spatial Dependency Modeling Model the network of traffic sensors, i.e., loop detectors, as a directed graph G = (V, A). Vertices V: the sensors. Adjacency matrix A: edge weights between vertices, A_ij = exp(−dist_net(v_i, v_j)² / σ²) if dist_net(v_i, v_j) ≤ κ, else 0. dist_net(v_i, v_j): road network distance from v_i to v_j; κ: threshold to ensure sparsity; σ²: variance of all pairwise road network distances. Page 13
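The thresholded Gaussian kernel above can be sketched in a few lines of numpy. This is a minimal illustration, not the released DCRNN code; the function name `build_adjacency` and the `np.inf` convention for unreachable sensor pairs are my own assumptions.

```python
import numpy as np

def build_adjacency(dist, kappa):
    """Thresholded Gaussian kernel adjacency matrix, as on this slide.

    dist[i, j]: road-network distance from sensor i to sensor j
    (np.inf where no path exists, an assumption for this sketch).
    kappa: threshold to ensure sparsity.
    A_ij = exp(-dist_ij**2 / sigma**2) if dist_ij <= kappa, else 0,
    where sigma**2 is the variance of all finite pairwise distances.
    """
    finite = dist[np.isfinite(dist)]
    sigma2 = finite.var()                 # variance of pairwise distances
    A = np.exp(-dist ** 2 / sigma2)       # Gaussian kernel weight
    A[~np.isfinite(dist)] = 0.0           # no edge where no path exists
    A[dist > kappa] = 0.0                 # sparsify via the threshold
    return A
```

Because road-network distances are directed, the resulting A is in general asymmetric, which is exactly what the diffusion convolution later exploits.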
Problem Statement Graph signal: X_t ∈ R^{|V|×P}, the observation on G at time t. |V|: number of vertices; P: feature dimension of each vertex. Problem Statement: learn a function g(·) that maps T′ historical graph signals to the next T graph signals: [X_{t−T′+1}, …, X_t] → g(·) → [X_{t+1}, …, X_{t+T}]. Page 14
Generalize Convolution to Graph Convolution as a weighted combination of neighborhood vertices: h_i^{l+1} = Σ_{j∈N(i)} w_ij^l h_j^l, where N(i) is the neighborhood of vertex i, w_ij^l is the filter weight of vertex j for a filter centered at vertex i in layer l, h_j^l is the feature of vertex j in layer l, and h_j^1 = X_{j,:}, i.e., the input. Learning complexity is too high: O(|V| · N). [Figure: convolutional filter on a graph centered at v_6] Page 15
Generalize Convolution to Graph Diffusion convolution filter: a combination of diffusion processes with different numbers of steps on the graph: X_{:,p} ⋆_G f_θ = Σ_{k=0}^{K−1} θ_k (D_O^{−1} A)^k X_{:,p}, where (D_O^{−1} A)^k are the transition matrices of the diffusion process, ⋆_G denotes diffusion convolution, and D_O is the diagonal out-degree matrix. Learning complexity: O(K). [Figure: example diffusion filter centered at one vertex, shown as the weighted sum θ_0 · (0-step diffusion) + θ_1 · (1-step diffusion) + θ_2 · (2-step diffusion) + … + θ_K · (K-step diffusion)] Page 16
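The one-directional formula above amounts to repeatedly applying the random-walk transition matrix and taking a learned weighted sum. A minimal numpy sketch (function and argument names are illustrative, not from the slide):

```python
import numpy as np

def diffusion_conv(X, A, theta):
    """One-directional diffusion convolution, sketching this slide's formula.

    X:     (|V|,) graph signal, one feature column X_{:,p}.
    A:     (|V|, |V|) weighted adjacency matrix (positive row sums assumed).
    theta: (K,) filter weights; theta[k] scales the k-step diffusion.
    Returns sum_k theta[k] * (D_O^{-1} A)^k X.
    """
    P = A / A.sum(axis=1, keepdims=True)  # random-walk matrix D_O^{-1} A
    out = np.zeros_like(X)
    Xk = X.copy()                         # (D_O^{-1} A)^0 X = X
    for k in range(len(theta)):
        out += theta[k] * Xk
        Xk = P @ Xk                       # advance the diffusion one step
    return out
```

Note that only K matrix-vector products are needed, which is where the O(K) learning complexity and the sparse O(K|E|) time complexity come from.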
Generalize Convolution to Graph Diffusion convolution filter: a combination of diffusion processes with different numbers of steps on the graph, with dual-directional diffusion to model upstream and downstream influence separately: X_{:,p} ⋆_G f_θ = Σ_{k=0}^{K−1} (θ_{k,1} (D_O^{−1} A)^k + θ_{k,2} (D_I^{−1} Aᵀ)^k) X_{:,p}, where ⋆_G denotes diffusion convolution, D_O is the diagonal out-degree matrix, and D_I is the diagonal in-degree matrix. [Figure: example diffusion filter centered at one vertex: 0-step, 1-step, 2-step, …, K-step diffusion] Page 17
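The dual-directional variant simply runs the same recursion along both the forward random walk and its reverse. A sketch under the same assumptions as before (illustrative names; positive in- and out-degrees assumed):

```python
import numpy as np

def dual_diffusion_conv(X, A, theta_fwd, theta_bwd):
    """Dual-directional diffusion convolution, sketching this slide's formula.

    Combines K-step diffusion along the forward walk D_O^{-1} A
    (downstream traffic) and the reverse walk D_I^{-1} A^T (upstream).
    theta_fwd[k] and theta_bwd[k] are the weights theta_{k,1}, theta_{k,2}.
    """
    P_fwd = A / A.sum(axis=1, keepdims=True)       # D_O^{-1} A
    P_bwd = A.T / A.sum(axis=0, keepdims=True).T   # D_I^{-1} A^T
    out = np.zeros_like(X)
    Xf, Xb = X.copy(), X.copy()
    for k in range(len(theta_fwd)):
        out += theta_fwd[k] * Xf + theta_bwd[k] * Xb
        Xf, Xb = P_fwd @ Xf, P_bwd @ Xb            # advance both diffusions
    return out
```

On a symmetric graph the two directions coincide; on the directed sensor graph they let the filter weight upstream and downstream neighbors differently.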
Advantage of Diffusion Convolution Efficient: X_{:,p} ⋆_G f_θ = Σ_{k=0}^{K−1} (θ_{k,1} (D_O^{−1} A)^k + θ_{k,2} (D_I^{−1} Aᵀ)^k) X_{:,p}. Learning complexity: O(K); time complexity: O(K|E|), where |E| is the number of edges. Expressive: many popular convolution operations, including ChebNet [Defferrard et al., NIPS 16], can be represented using the diffusion convolution [Li et al. ICLR 18]. ⋆_G: diffusion convolution; D_O: diagonal out-degree matrix; D_I: diagonal in-degree matrix. + Defferrard, M., Bresson, X., Vandergheynst, P., Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering, NIPS, 2016 * Li, Yaguang et al. Diffusion Convolutional Recurrent Neural Network: Data-driven Traffic Forecasting, ICLR, 2018 Page 18
Big picture Model spatial dependency with proposed diffusion convolution Model temporal dependency with augmented recurrent neural network * Li, Yaguang et al. Diffusion Convolutional Recurrent Neural Network: Data-driven Traffic Forecasting, ICLR 2018. Page 19
Model Temporal Dynamics using Recurrent Neural Networks Recurrent Neural Networks (RNNs): non-linear, non-stationary auto-regression with state-of-the-art performance in sequence modeling. Popular RNN variants: the Long Short-Term Memory unit (LSTM) and the Gated Recurrent Unit (GRU). Here the matrix multiplications inside the GRU are replaced with the diffusion convolution. [Figure: RNN unrolled over time steps x_1, x_2, x_3 producing hidden states h_1, h_2, h_3; GRU cell with diffusion convolution] Page 20
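The idea of swapping the GRU's matrix multiplications for graph diffusion can be sketched as a single toy cell step. This is not the paper's implementation: scalar weights stand in for the learned diffusion-filter parameters θ, a single 1-step diffusion P replaces the full K-step filter, and each node carries one feature.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dcgru_cell(x, h, P, W):
    """One step of a DCGRU-style cell (toy sketch).

    x, h: (|V|,) input signal and hidden state, one feature per node.
    P:    (|V|, |V|) diffusion/transition matrix, e.g. D_O^{-1} A.
    W:    dict of scalar weights {'r_x','r_h','u_x','u_h','c_x','c_h'},
          a hypothetical stand-in for learned diffusion-filter parameters.
    """
    r = sigmoid(W['r_x'] * (P @ x) + W['r_h'] * (P @ h))        # reset gate
    u = sigmoid(W['u_x'] * (P @ x) + W['u_h'] * (P @ h))        # update gate
    c = np.tanh(W['c_x'] * (P @ x) + W['c_h'] * (P @ (r * h)))  # candidate
    return u * h + (1.0 - u) * c                                # new state
```

The gating structure is the standard GRU; the only change is that every dense multiplication has become a graph operation, so the hidden state mixes information across neighboring sensors at each time step.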
Model Temporal Dynamics using Recurrent Neural Networks Multi-step ahead prediction with RNN: the previous model output is fed back into the network, so errors propagate and accumulate across steps. We need to teach the model to deal with its own errors. [Figure: predictions x̂_4, x̂_5, x̂_6 generated from inputs x_1, x_2, x_3; x̂: model prediction, x: observation or ground truth] Page 21
Improve Multi-step ahead Forecasting Traffic prediction as a sequence-to-sequence learning problem: instead of (x_1, x_2, x_3) → x_4, learn (x_1, x_2, x_3) → (x_4, x_5, x_6). Encoder-decoder framework*: the encoder reads the input sequence; the decoder is trained with the ground truth as input (<GO>, x_4, x_5) and backpropagates errors δ_4, δ_5, δ_6 from multiple steps. Problem: the ground truth becomes unavailable in testing. [Figure: encoder-decoder; x̂: model prediction, x: observation or ground truth] * Sutskever et al. Sequence to sequence learning with neural networks, NIPS 2014 Page 22
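The test-time encoder-decoder loop above can be sketched generically. Everything here is illustrative: `cell` is any recurrent step (e.g., the DCGRU), the readout is the identity for brevity, and the <GO> token is simplified to the last observed input.

```python
def seq2seq_forecast(history, horizon, cell, h0):
    """Sketch of the encoder-decoder forecasting loop on this slide.

    history: list of past inputs x_1 ... x_T'.
    horizon: number of future steps T to predict.
    cell(x, h) -> h_next: any recurrent cell (hypothetical interface).
    h0: initial hidden state.
    """
    h = h0
    for x in history:            # encoder: read the history
        h = cell(x, h)
    preds, x = [], history[-1]   # simplification: last input stands in for <GO>
    for _ in range(horizon):     # decoder: predict x_{T'+1} ... x_{T'+T}
        h = cell(x, h)
        x = h                    # toy readout: prediction = hidden state
        preds.append(x)          # feed the prediction back in next step
    return preds
```

The key point is the feedback in the decoder loop: at test time each prediction becomes the next input, which is exactly why training only on ground-truth inputs mismatches inference and motivates scheduled sampling.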
Improve Multi-step ahead Forecasting Improve multi-step ahead forecasting with scheduled sampling*: at each decoding step, choose between the previous ground truth and the previous model prediction by flipping a (biased) coin. [Figure: encoder-decoder where each decoder input is sampled from {x_4, x̂_4}, {x_5, x̂_5}] * Bengio, Samy, et al. Scheduled sampling for sequence prediction with recurrent neural networks. NIPS 2015 Page 23
Improve Multi-step ahead Forecasting Scheduled sampling as curriculum learning: gradually enable the model to deal with its own errors. Easy: only feed the ground truth. Hard: only feed the model prediction. [Figure: probability of feeding the ground truth decays over training iterations] * Bengio, Samy, et al. Scheduled sampling for sequence prediction with recurrent neural networks. NIPS 2015 Page 24
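The curriculum above can be sketched with the inverse-sigmoid decay suggested by Bengio et al.: the probability ε_i = τ/(τ + exp(i/τ)) of feeding the ground truth starts near 1 and decays toward 0. The function names and the value of τ are illustrative, not from the slides.

```python
import math
import random

def ground_truth_prob(i, tau=2000.0):
    """Inverse-sigmoid decay for scheduled sampling:
    epsilon_i = tau / (tau + exp(i / tau)).
    Starts near 1 (always feed the ground truth) and decays toward 0
    (always feed the model's own prediction). tau controls decay speed."""
    return tau / (tau + math.exp(i / tau))

def decoder_input(ground_truth, prediction, i, rng=random.random):
    """Flip a biased coin: feed the ground truth with probability
    epsilon_i, otherwise feed the previous model prediction."""
    return ground_truth if rng() < ground_truth_prob(i) else prediction
```

Early in training the decoder sees mostly ground truth (the "easy" regime); late in training it sees mostly its own predictions, matching the test-time setting.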
Diffusion Convolutional Recurrent Neural Network Diffusion Convolutional Recurrent Neural Network (DCRNN) Model spatial dependency with diffusion convolution Sequence to sequence learning with encoder-decoder framework Improve multi-step ahead forecasting with scheduled sampling * Li, Yaguang et al. Diffusion Convolutional Recurrent Neural Network: Data-driven Traffic Forecasting, ICLR 2018 Page 25
Experiments Datasets METR-LA: 207 traffic sensors in Los Angeles 4 months in 2012 6.5M observations PEMS-BAY: 345 traffic sensors in Bay Area 6 months in 2017 17M observations Page 26
Experiments Baselines Historical Average (HA) Autoregressive Integrated Moving Average (ARIMA) Support Vector Regression (SVR) Vector Auto-Regression (VAR) Feed forward Neural network (FNN) Fully connected LSTM with Sequence to Sequence framework (FC-LSTM) Task Multi-step ahead traffic speed forecasting Page 27
Experimental Results DCRNN achieves the best performance for all forecasting horizons on both datasets. [Figure: Mean Absolute Error (MAE) at the 15 min, 30 min, and 1 hour horizons on METR-LA and PEMS-BAY for HA, ARIMA, VAR, SVR, FNN, FC-LSTM, and DCRNN] Page 28
Effects of Spatiotemporal Dependency Modeling w/o temporal: remove sequence-to-sequence learning; w/o spatial: remove the diffusion convolution. Removing either the spatial or the temporal modeling results in significantly worse results. [Figure: MAE at the 15 min, 30 min, and 1 hour horizons on METR-LA for DCRNN w/o temporal, DCRNN w/o spatial, and DCRNN] Page 29
Example: Prediction Results DCRNN is more likely than the best baseline method to accurately predict abrupt changes in traffic speed. [Figure: example prediction results (mile/h) on METR-LA] Page 30
Example: Filter Visualization Visualization of learned filters: filters are localized around the center, and weights diffuse along the road network. [Figure: learned filters on METR-LA centered at different vertices] Page 31
Summary Propose diffusion convolution to model the spatial dependency of traffic flow. Propose the Diffusion Convolutional Recurrent Neural Network (DCRNN), which captures both spatial and temporal dependencies. DCRNN obtains consistent improvement over state-of-the-art baseline methods. ICLR 2018 https://github.com/liyaguang/dcrnn https://arxiv.org/abs/1707.01926 * Li, Yaguang et al. Diffusion Convolutional Recurrent Neural Network: Data-driven Traffic Forecasting, ICLR 2018. Page 32
Thank You! Q & A Page 33