DriveFaster: Optimizing a Traffic Light Grid System
CS221 Fall 2016: Final Report
Team Members: Xiaofan Li, Ahmed Jaffery

Abstract
Traffic lights are the central point of control for city traffic and can be an effective tool for keeping traffic moving. The cheapest way for a city to speed up traffic flow is to optimize its traffic light system. This project aims to optimize traffic flow through a grid of street intersections by controlling only the traffic lights. The goal is to minimize the travel time from a random start point to a random end point for every car. Using an MDP model with Q-Learning allows the system to adapt to a variety of situations and traffic patterns. This paper details the MDP definition and the effect of various features on the Q-Learning simulation.

1. Introduction
This paper investigates the use of a Markov Decision Process (MDP) model to learn traffic patterns and apply Q-Learning to improve traffic flow. The model controls only the state of each traffic light (green/red) and bases its decisions on traffic information: the number of cars in the system and the wait time at each traffic light. To simplify the system and reduce the number of variables, all cars have a constant speed of 1 and the intersection length is 1. By keeping these values constant, we can focus on features for managing an increasing number of cars. We first summarize a paper that gave us background information and define our problem in detail. We then discuss our model and approach, and analyze the results of Q-Learning as more features are added.

2. Related Works
Traffic has been increasing significantly over the past few decades as people continue to migrate to a few key cities. Minimizing traffic delays and improving traffic flow is a problem that many municipal governments have worked on with varying results.
A paper from Deakin University in Australia conducted a study on traffic simulation comparing the use of Q-Learning and Neural Networks. The goal was to reduce the average delay time at an intersection. In the MDP model, they chose actions that maximize short-term reward rather than distant future rewards. They had a minimum time that a light
would be in the green state, with a maximum of 2 extensions of 10 seconds each. The Neural Network they designed used a Genetic Algorithm with Simulated Annealing, which improves an initial solution by selecting a neighboring solution, making small changes, and comparing the candidates against a fitness function. There is no minimum fixed green time or fixed extension time. The advantage of this method is that it has fewer constraints than Q-Learning and can determine exact times for green/red light states. The downside is a much larger state space due to the added variability.

3. Content & Simulation

a. Problem Definition
This project aims to optimize traffic flow through a grid of intersections. The goal is to minimize the travel time from a random start point to a random end point for every car. To limit the scope of this project, the model has 2 major constraints that reduce the number of variables. Firstly, the road map is an M*N grid with fixed distances between intersections. Secondly, all cars travel at the same speed and take a time score of 1 to cross a road (from one intersection to the next). Cars are randomly spawned at an entrance position and make their way to a randomly chosen exit position. All edge positions are considered spawn points, and every traffic light has only a green and a red state (go/stop). The primary variable that is adjusted is the duration of the green or red state of each traffic light. Initially the reward was calculated by summing the total wait time each car experienced. As the project progressed, the reward was changed to the total time it took for the system to solve (the time for all cars to reach their destinations).

b. Car and Traffic Light Generation
The MDP is initialized with a set number of cars, a set number of additional cars that will be periodically added, and a grid size of M*N.
The number of traffic lights is M times N, and each traffic light is 1 unit away from its neighbors. The number of initial cars was arbitrarily set to half the number of traffic lights. After each action cycle, additional cars are generated at random start positions. Adding cars over time simulates varying traffic trends throughout the day and helps create a more realistic traffic model. These two functions comprise the setup of our simulation.
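As a rough illustration, the setup described above might look like the following sketch. The class and attribute names here are our own for illustration, not the project's actual code:

```python
import random

class TrafficGrid:
    """Minimal sketch of the simulation setup: an M*N grid of lights,
    an initial car count of half the light count, and periodic spawning."""

    def __init__(self, m, n, cars_per_cycle=2):
        self.m, self.n = m, n
        # One traffic light per intersection, each 1 unit apart.
        self.num_lights = m * n
        self.cars_per_cycle = cars_per_cycle
        # Initial car count is arbitrarily half the number of lights.
        self.cars = [self.spawn_car() for _ in range(self.num_lights // 2)]

    def edge_positions(self):
        # All border cells of the grid are valid spawn/exit points.
        return [(r, c) for r in range(self.m) for c in range(self.n)
                if r in (0, self.m - 1) or c in (0, self.n - 1)]

    def spawn_car(self):
        # Random entrance position and a distinct random exit position.
        edges = self.edge_positions()
        start = random.choice(edges)
        end = random.choice([p for p in edges if p != start])
        return {"pos": start, "goal": end}

    def tick(self):
        # After each action cycle, add cars to mimic varying traffic trends.
        self.cars.extend(self.spawn_car() for _ in range(self.cars_per_cycle))
```

For example, a 2x2 grid yields 4 lights and 2 initial cars, matching the ratio described above.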
c. Approaches

i. Oracle
The oracle was given a pre-determined traffic pattern and ran a search algorithm for the optimal action at every state. A state takes into account traffic at all intersections and determines how to proceed based on the successor-state function. We used a 2x2 grid with 4 cars. The oracle achieved a reward of 0 because the system knew the position of each car and enabled the green light before the car arrived at the intersection.

ii. Baseline
The baseline algorithm takes a greedy approach: it simply activates the green light for the direction with the maximum number of cars at that one intersection. This is the approach used in most intelligent intersections today. There is also an upper bound on the maximum time a traffic light can stay green in one direction per cycle. We simulated a 2x2 grid with 4 cars. As expected, the baseline performed much worse than the oracle: its reward was 6, with each car waiting 1.5 time units on average.

iii. MDP

a. State
The MDP state is a tuple ([Car States], [Light States]). The list [Car States] contains one state per car, with that car's position, direction, and place in line (if waiting at a light). The list [Light States] contains one state per light, recording the current light color for the up/down direction and the number of cars waiting in each direction.

b. Actions
The actions available from a state are defined as an M*N vector of traffic light colors for the next iteration. This controls the next state of each traffic light, which moves the cars through the streets.

c. Randomness
To introduce randomness in car actions and simulate different driving patterns, cars choose to make a turn with the following probabilities:
P(turning) = dist(current_pos, end_pos)^0.5 · 1[current direction is blocked by light]
P(go straight) = 1 − P(turning)

d. Random Driver Decisions
The decisions made by each car also add to the randomness of the environment. The drivers follow the rules below:
1. Each driver always makes progress towards the goal by always turning towards the goal position.
2. If the goal is to the left of and in front of the current position, there is a 50% chance that the car will turn left and a 50% chance that it will go straight. In this case, the driver's decision is independent of the current signal color.
3. If the goal is to the right of and in front of the current position:
3.1. If the current light is red, the car has an 80% chance of turning right and a 20% chance of waiting for the light to turn green.
3.2. If the current light is green, the car has a 20% chance of turning right and an 80% chance of going straight.
4. It is impossible for the goal to be behind the current position, given the rules above and the fact that starting positions are always initialized on the borders of the grid.
The intuition behind these rules is to simulate realistic decision making by the drivers.

e. State Transition Complexity
At each time step, we update the state based on the current action. The light states are updated first because they are most directly impacted by the current action vector; the car states are then updated based on direction, position, and the updated light states. Light states and car states can each be updated in parallel because the lights and cars are independent of one another; the only dependency is that car states must be updated after light states. If there are L lights and N cars, then at each time step:

Sequential update complexity: O(L + N)
Parallel update complexity: O(1)

iv. Learning
We used a standard Q-learning algorithm to minimize global wait time, which is the sum of the wait times of all spawned cars:

Q̂_opt(s, a) ← (1 − η) Q̂_opt(s, a) + η (r + γ V̂_opt(s'))
V̂_opt(s') = max_{a' ∈ Actions(s')} Q̂_opt(s', a')

Besides the standard Q-learning algorithm, we also explored various feature extractors:

1. Traffic Light Information Features
- total_action: the number of possible actions at a given state.
- total_red: the total number of lights that are red.
- total_green: the total number of lights that are green.
- total_cars: the total number of cars currently on the grid.
- cars_per_light_per_direction: the number of cars at each traffic light per direction, stored in a hash. This helps the model learn that long queues are undesirable.
- wait_time_per_light_per_direction: the total wait time currently in each traffic light's queue. This helps keep track of cars that have been on the grid for a long time and prioritize them.

2. Spatial Features
- 2D_Neighborhood: extracts features based on the 2D neighborhood of a given light. When all the lights surrounding a particular intersection are green/red, we can assign more/less weight accordingly. This spatial feature extracts the geometric structure of the intersection grid and uses it as a heavily weighted feature, which helps optimize sub-sections of the grid rather than only the grid as a whole.

4. Results

a. Evaluation Metric
To evaluate the effectiveness of our MDP model, we ran Q-Learning for various grid sizes with 30 cars and 1000 iterations. We also investigate the effect of each feature individually and plot how features affect the overall reward of the system. The graph compares the number of features against the total reward, displaying the effectiveness of our feature extractor and comparing the model with the optimal result as features are added.

b. Analysis
To better analyze the results, we introduce the concept of Accuracy, defined as:

Accuracy = −100 / score

With this, we can directly compare the results of the learning algorithm with different features. We also compute the standard deviation over every 100 iterations to show how stable the predictions are. In the diagram, the three sets of features compared are the following:
1. Feature Set 1: all basic features, such as the number of cars at each light, the number of current cars, and current turning directions.
2. Feature Set 2: all temporal features, such as the cumulative wait time for each car and ticketing number.
3. Feature Set 3: all spatial features, such as neighborhood features and light states in certain positions of the grid.
In total, we have 7 features across the three feature groups. The results are as follows:
In the above diagram, we compare the prediction accuracy across feature sets for each grid size. As we can see, the spatial features significantly increased the accuracy of the prediction, especially for small grids. This result makes sense because in smaller grids, spatial features such as the neighborhood feature essentially compute the whole state. It is also worth noting that with only basic and temporal features (feature sets 1+2), we see degradation on larger grids (16 and 25 lights). This is probably because the temporal features focus on localized state such as car wait time, which hurts prediction accuracy on larger grids. The spatial extractor, however, shows higher deviations on larger grids. We think this is because with larger grids, more information is incorporated into the neighborhood feature, creating more randomness and making the results less stable. Despite the stability issue, the results with the spatial feature extractor still consistently outperform the other feature sets. Overall, we cannot see a clear trend in the standard deviations across grid sizes. In general, the deviation increases as the grid size increases, but it is not clear whether this observation would continue for larger sizes.
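To make the learning step from Section 3 concrete, the Q-learning update combined with a linear weighting of extracted features can be sketched as below. The function and argument names are hypothetical; with linear approximation, the tabular update Q ← (1 − η)Q + η·target becomes a per-weight gradient step w ← w − η(Q − target)·φ:

```python
def q_learning_update(weights, features, reward, next_q_values,
                      eta=0.1, gamma=0.9):
    """One Q-learning step with linear function approximation.

    weights:       dict mapping feature name -> weight
    features:      dict mapping feature name -> value for the pair (s, a)
    next_q_values: estimated Q-values for all actions available from s'
    Implements the update Q(s,a) <- (1 - eta)*Q(s,a) + eta*(r + gamma*V(s')),
    where V(s') = max over a' of Q(s', a').
    """
    # Q-hat(s, a) as a dot product of weights and feature values.
    q_sa = sum(weights.get(f, 0.0) * v for f, v in features.items())
    # V-hat(s') is the best Q-value over actions from the next state.
    v_next = max(next_q_values) if next_q_values else 0.0
    target = reward + gamma * v_next
    # Gradient step on the squared error between Q-hat(s, a) and the target.
    for f, v in features.items():
        weights[f] = weights.get(f, 0.0) - eta * (q_sa - target) * v
    return weights
```

In this sketch, a feature dict might contain entries such as total_cars or total_red from the extractors described earlier, and the reward would be the (negative) system solve time.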
c. Challenges
The primary challenges of this approach were keeping track of car wait time, taking into account neighboring local traffic lights, and managing the state size. Initially the car wait time was included in the state; however, this hurt Q-learning because each additional second a car waited was considered a new state, even though the system state is effectively the same regardless of individual car wait times. This issue was solved by replacing car wait time with overall simulation time, which also allowed the model to optimize for a global solution rather than for individual cars. For example, if all cars except 1 reach their goals in 5 time units and the last car takes 100 time units, the average time per car is low, but the full system time is 100. Optimizing for the global solution minimizes the full system time and ensures that all cars reach their destinations faster. This also reduced the state size considerably. One of the primary features we decided upon considers the local grid around each traffic light and the total number of cars in each direction at neighboring lights, rather than looking at each traffic light separately. Dividing the grid, however, prioritized moving cars through certain sectors in certain directions: if more cars were traveling in the North/South direction, cars going West/East would wait until the North/South direction was cleared. This was solved by adding a maximum time that a traffic light could stay green in one direction; after this maximum was hit, the light would switch directions regardless of the number of cars.

5. Conclusion
This paper studied the effectiveness of using an MDP model to improve traffic flow through traffic light control.
The three main feature groups investigated consist of basic traffic light information, cumulative traffic wait time, and local neighborhood features. As seen in the graphs above, the local neighborhood feature was the most crucial, as it alone improved the average score by about 24%. The primary challenge that remained unsolved was the large state size and scaling the simulation up. In the future, work will have to be done to decrease the state complexity, for example by using a sampling method to look at every Nth traffic light rather
than every single one. In conclusion, applying Q-Learning with a few critical features on an MDP model can be a viable way to approach the traffic light control problem.

6. References
1. http://swarmlab.unimaas.nl/ala2013/papers/tuesession1paper2.pdf
2. http://cs229.stanford.edu/proj2015/369_report.pdf
3. http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6722370