A REINFORCEMENT LEARNING APPROACH TO SCHEDULING DUAL-ARMED CLUSTER TOOLS WITH TIME VARIATIONS

Ji-Eun Roh (a), Tae-Eog Lee (b)
(a),(b) Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST), 305-701 Daejeon, Republic of Korea
(a) rje0531@kaist.ac.kr, (b) telee@kaist.ac.kr

ABSTRACT
Cluster tools, which consist of multiple process chambers, are widely used in the semiconductor industry. As wafer processes become more sophisticated, the operation of cluster tools is also being improved. To operate cluster tools effectively, several rule-based schedules, such as the swap sequence, have been developed. However, scheduling in a time-variation environment has not been fully considered. In this paper, we propose a cluster tool modeling method that can handle time variations in a dual-armed cluster tool. We then present a reinforcement learning process based on the proposed cluster tool model to find new operational schedules for specific configurations. To measure the performance of the newly obtained schedule, the makespan under the new policy is compared with that under the swap policy. The makespan is reduced compared to the conventional swap policy, which implies that reinforcement learning learned the operation schedule well in the time-variation environment.

Keywords: cluster tool, Markov decision process, scheduling, reinforcement learning

1. INTRODUCTION
Along with the innovative development of the semiconductor manufacturing industry, quality issues in the wafer manufacturing process have been discussed. The technologies of each process have developed rapidly and improved the production quality of wafers. To avoid quality issues due to batch production, cluster tools that process one wafer at a time are now widely used in the semiconductor industry (Lee 2008). Chambers in cluster tools do not process wafers in batches but process them individually, so they can meet the quality standard.

A cluster tool usually consists of four to six processing chambers and one transport robot. In each chamber, only one wafer is processed at a time, and one of the process steps specified before the start of production is performed. The transport robot moves the wafers inside the cluster tool. To start the processes, a wafer enters the chamber that is responsible for the first process step. After the process is completed in that chamber, the wafer is transported to the next chamber by the transport robot. The transport robot repeats the process of unloading a processed wafer from a chamber and loading it into the next appropriate chamber, which is in charge of the next process step. Once all the required processes are completed in the proper chamber sequence, the wafer's processing in the cluster tool is finally completed.

Such a configuration gives cluster tools several operational issues. Since cluster tools consist of only processing chambers and a single transport robot, only one robot operation is possible at a time. Hence, the wafers cannot be moved at arbitrary times, and can only be moved when the transport robot moves them. Even if multiple chambers are ready for their next wafer, multiple wafers cannot be delivered at the same time. In addition, cluster tools do not contain any buffer space in their interior, so the only way to store a wafer outside a chamber within the cluster tool is to hold it with the transport robot. When the robot loads a wafer into a chamber, the chamber starts the process, and the process ends naturally after the process time; the chamber runs for that period of time without any decision.
Therefore, we can say that the overall cluster tool schedule depends on the order decisions of the transport robot. As the transport robot unloads or loads the chambers, the status of the cluster tool system changes. The tool can be seen as a discrete event system because the state of the system changes as the robot takes an action. Petri nets, finite state machines (FSMs), Markov decision processes (MDPs), and similar formalisms are used to model discrete event systems (Murata 1989; Choi and Kang 2013; Puterman 2014). Several studies on cluster tool scheduling have used timed Petri net (TPN) models, in which the places or transitions have time delays. A TPN is classified as deterministic or stochastic depending on whether the time delays are deterministic or stochastic (Murata 1989); each variant is a modeling technique that deals with a different kind of time delay. Many studies have been conducted on finding policies for the deterministic environment. For example, Lee, Lee and Shin (2004) and Zuberek (2004) modeled and analyzed cluster tools with deterministic TPNs.

Jung and Lee (2012) proposed a mixed integer programming (MIP) model to find optimal policies in deterministic TPNs. There have also been several studies for stochastic environments. To deal with time variation in a cluster tool, Kim and Lee (2008) suggested an extended Petri net and developed a graph-based procedure to verify the schedulability condition. Qiao, Wu, and Zhou (2012) introduced a two-level architecture to deal with time variation and proposed heuristic algorithms to find a schedule for one of the architectures. Molloy (1982) showed that stochastic TPNs are isomorphic to finite Markov processes under certain conditions. New methods for finding good policies in specific systems are still being studied. Reinforcement learning (RL) is widely used for finding policies in many fields such as manufacturing systems, autonomous vehicle control, finance, and games. Reinforcement learning is one of the solutions for finding optimal policies in MDP models (Sutton and Barto 1998). Hence, modeling the system behavior with MDPs and then finding policies for the system by RL is a widely used approach (Moody et al. 1998; Mnih et al. 2015).

In this paper, we propose a model using MDPs that can handle time variations in cluster tools. We then report how we learn a new robot sequence using the MDP model to minimize the average makespan in a time-variation environment, and measure the performance of the obtained sequence. This study is an attempt to schedule the cluster tool using reinforcement learning. Since the model is designed to represent the stochastic process times of the cluster tool, the schedules obtained from reinforcement learning can be expected to apply well in a stochastic environment.

2. MODELING DUAL-ARMED CLUSTER TOOLS WITH TIME VARIATIONS
To model dual-armed cluster tools with time variations, we use an MDP model. After introducing the difficulties of configuring such an MDP model, we report the model we propose.

2.1. Markov Decision Process
A Markov decision process (MDP) is a tuple <S, A, P, r>, where S denotes a set of states, A a set of actions, P a state transition probability distribution, and r a reward function. The process follows the Markov property, which means that the transition probability and reward functions depend only on the current state and the action, not on the past history. In this paper, we address only stationary environments, in which the system properties do not change as time goes by. According to Puterman (2014), at every decision epoch a state s ∈ S, which is a representation of the system, is observed, and an action a ∈ A_s has to be chosen by the decision maker from the set of allowable actions in state s. As a result of the action, the system state at the next decision epoch changes according to a transition probability P : S × A × S → [0, 1], where P^a_{s,s'} is determined by the current state s and the chosen action a. The decision maker receives a reward r : S × A × S → R, where r^a_{s,s'} is determined by the current state s, the action a, and the next state s'. Here, we have the concept of a decision rule, which describes a procedure for action selection in each state; such a rule is called a stationary policy, d : S → A_s. The decision maker passes through a sequence of states s_t, determined by the transition probabilities P^a_{s,s'} and the actions a_t = d(s_t) the decision maker chooses, and the sequence of rewards r^a_{s,s'} is obtained. A popular performance metric is the discounted reward, which is the sum of the discounted rewards gained over the entire time horizon when a specific policy is followed. The discounted reward of a policy d starting at state i is defined as J_d(i) = lim_{k→∞} E[ Σ_{j=1}^{k} γ^{j-1} r^{d(s_j)}_{s_j,s_{j+1}} | s_1 = i ], where γ ∈ [0,1] is a discount factor.
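As a small illustration (ours, not from the paper), the discounted return of a finite reward sequence can be computed directly from this definition; the reward values and the discount factor below are arbitrary examples.

```python
# Minimal sketch (not from the paper): discounted return of a finite reward
# sequence, as in the definition of J_d(i). gamma and the rewards are arbitrary.
def discounted_return(rewards, gamma=0.95):
    return sum(gamma ** j * r for j, r in enumerate(rewards))

print(discounted_return([0.1, 0.1, 1.0]))  # rewards received later count for less
```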
By Bertsekas (1995), it is proved that J_d(s) = Σ_{s'∈S} P^{d(s)}_{s,s'} ( r^{d(s)}_{s,s'} + γ J_d(s') ). The function J_d is called the value function for policy d. There are many variants of the MDP model, and one of them is the semi-Markov decision process (semi-MDP). In a semi-MDP, temporal factors are included in the modeling: the original MDP assumes that each transition takes the same time across all states, whereas a semi-MDP allows the transition time to follow an arbitrary probability distribution.

2.2. Cluster tool modeling with MDPs
If we simply insert the states of the chambers and the actions the robot takes into an MDP, as previous studies have done with deterministic TPN models, the MDP does not properly represent the behavior of cluster tools. This is due to a characteristic of cluster tools: they do not change their state only through discrete events, but also through the passage of time. The general MDP assumes that every transition takes the same time, but the transitions in a cluster tool cannot be assumed to do so. Considering the different transition times between states, we may think of two simple ways to model the cluster tool system:

1. Model with an MDP that has a constant transition time of one second, and adjust the transition probabilities to represent the time element.
2. Model with a semi-MDP, so that time information can be inserted into the transition distributions.

To examine the two methods, we first consider the example case shown in Figure 1. The example represents a status change in a single chamber, which has two states A and B and two actions 1 and 2. Consider the case where the environment state changes from A to B after 50 seconds. Using the first modeling method, we can set the transition probability to 1/50; the state change from A to B then occurs after 50 seconds on average. However, the average cannot express the actual individual transition times explicitly.
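A small numerical sketch (ours, not from the paper) of why the first method is unsatisfactory: with a per-second transition probability of 1/50, the time until the change is geometrically distributed, so its mean is 50 seconds while individual realizations spread widely.

```python
# Minimal sketch (not from the paper): with a per-second transition probability
# of p = 1/50, the time until the A-to-B change is geometrically distributed.
# Its mean is 50 s, but individual transitions vary widely, which is why the
# first modeling method cannot express the actual transition time explicitly.
import random

def sample_transition_time(p=1.0 / 50.0):
    t = 1
    while random.random() >= p:  # stay in state A with probability 1 - p each second
        t += 1
    return t

samples = [sample_transition_time() for _ in range(10000)]
print(sum(samples) / len(samples))  # close to 50 on average
print(min(samples), max(samples))   # but single transitions spread very widely
```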

Figure 1: Simple Example of a Cluster Tool System

Using the second modeling method, we can express that the transition in Figure 1 occurs after 50 seconds by setting the transition time distribution to a constant of 50 seconds. However, a cluster tool has multiple chambers, so the model should also reflect state changes in the other chambers. If 20 seconds have elapsed while the status changes in another chamber, so that the chamber of the example now needs to change its state after 30 seconds, the transition time distribution would have to be changed to 30 seconds. Every time a transition occurs in another chamber, we would have to recalculate and change the distribution; the environment of the system changes, which does not satisfy the Markov property.

The two MDP modeling methods above are therefore not appropriate for modeling cluster tool operations under time variations, since they either do not represent the transition time explicitly or do not satisfy the Markov property. To satisfy the Markov property, a state should contain sufficient information so that the past history is not required to obtain the next state or the reward. Therefore, the state is designed to contain information about chamber and robot usage and the remaining process times. In this study, we propose to represent the state as (W, C_1, C_2, ..., C_n, R, S_1, S_2, Z_1, Z_2, ..., Z_n), where n is the number of processing chambers in the cluster tool. W is the number of wafers remaining in the loadlock. C_i is the state of the i-th chamber, which represents whether the chamber is empty (C_i = 0), full (C_i = 1), or has completed processing (C_i = 2), for i ∈ {1, 2, ..., n}. R is the number of wafers held by the transport robot; since we are dealing with a dual-armed cluster tool, R can have a value of 0, 1, or 2. S_1 and S_2 are the next process steps of the wafers held by the robot; they represent the process step that each wafer on the robot should visit next. Z_i is the expected remaining process time of the i-th chamber, for i ∈ {1, 2, ..., n}; the value is calculated by subtracting the elapsed time from the initially set process time, because the actual process time is not known in a time-variation environment. In addition, we define S to be the state space of the proposed states. Hence, the proposed state structure contains the information about the chambers and the wafers inside them, indicates to which chambers the wafers on the robot should be sent, and roughly indicates when the chambers are going to finish. With this design, the model satisfies the Markov property.

To better understand the structure of the states, we show the state representation of Figure 2 using the proposed structure. Suppose that the black and white wafers indicate that they are in the first and second process steps, respectively, and that a hatched wafer indicates that its process is in progress. According to the proposed structure, the state is (W = 3, C_1 = 2, C_2 = 1, C_3 = 1, C_4 = 2, R = 1, S_1 = 2, S_2 = 0, Z_1 = 0, Z_2 = 4, Z_3 = 4, Z_4 = 0). This means that three wafers remain in the loadlock, processes are in progress in the second and third chambers, whereas the first and fourth chambers have completed their processes. The robot holds a single wafer waiting to enter the second process step, and the remaining process times of the second and third chambers are four.

Figure 2: A Dual-Armed Cluster Tool with Four Chambers

A is the set of actions that the transport robot can perform, which is {Wait, U_j, L_j, SW_i}, where j ∈ {0, 1, 2, ..., n} and i ∈ {1, 2, ..., n}.
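A minimal data-structure sketch (ours, not code from the paper) of the proposed state tuple and action set; the class and field names are our own, and the example values follow Figure 2.

```python
# Minimal sketch (not from the paper) of the proposed state tuple and action set.
# The class and field names are our own; the example values follow Figure 2.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ToolState:
    W: int                 # wafers remaining in the loadlock
    C: Tuple[int, ...]     # chamber status: 0 empty, 1 full, 2 processing completed
    R: int                 # wafers held by the dual-armed robot (0, 1, or 2)
    S: Tuple[int, int]     # next process step of each wafer on the robot (0 = none)
    Z: Tuple[int, ...]     # expected remaining process time of each chamber

def actions(n):
    """The 3n + 3 robot actions {Wait, U_j, L_j, SW_i} for an n-chamber tool."""
    acts = ["Wait"]
    acts += [f"U{j}" for j in range(n + 1)]      # unload actions, j in {0, ..., n}
    acts += [f"L{j}" for j in range(n + 1)]      # load actions, j in {0, ..., n}
    acts += [f"SW{i}" for i in range(1, n + 1)]  # swap actions, i in {1, ..., n}
    return acts

# State of Figure 2: three wafers in the loadlock, chambers 2 and 3 processing,
# chambers 1 and 4 finished, and the robot holding one wafer bound for step 2.
s = ToolState(W=3, C=(2, 1, 1, 2), R=1, S=(2, 0), Z=(0, 4, 4, 0))
print(len(actions(4)))  # 15 actions when n = 4
```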
These actions represent the robot tasks: U_j, L_j, and SW_i indicate unload, load, and swap operations on the j-th or i-th chamber. The unload, load, and swap operations are the common robot tasks in other modeling methods; in this study, however, a Wait action is added. Wait represents the robot waiting until the state changes, for example until a process finishes in some chamber. To make the model transitions independent of the remaining process times, we introduce Wait as an action of the robot; without it, the state of the cluster tool would change when a process completes even though no action had been selected. Furthermore, actions from {Wait, U_j, L_j, SW_i} can always be selected in any state; therefore, 3n + 3 actions are available to be chosen in every state.

The transition probability is denoted by P^a_{s,s'}, where s is the current state, a is the current action, and s' is the next state. It stands for the probability of a transition to state s' when action a is chosen in state s. In this study, we do not consider unexpected events such as robot failures, so the system always moves exactly to the specified state. Hence, when the action is one of U_j, L_j, and SW_i, P^a_{s,s'} has a value of 1 for the appropriate next state s'.

Here, the appropriate state s' is the state that can follow s under the chosen action; for any other s', P^a_{s,s'} = 0. However, when the action is Wait, the transition probability P^a_{s,s'} does not take only the values 0 and 1. Assume that the action is Wait. If at least one chamber is full (C_i = 1), the chamber that changes its state to completed processing must be selected; in other words, we choose in which chamber the process ends. Choosing the chamber with the shortest remaining process time is the simplest and most intuitive choice. However, since we are dealing with a time-variation environment, we cannot say that the process finishes after exactly Z_i, the expected remaining process time; the chamber selection should reflect the fact that an unexpected chamber process may end sooner. In addition to selecting the chamber, the transition time needs to be generated stochastically to reflect the time-variation environment. In general, the process of a chamber is completed within a bounded range of times around the average time; therefore, we generate the transition time using a beta distribution, which produces values around the average within a bounded range. The reward function is set to give 1 if a wafer completes all processes; otherwise, the reward is 0.1 for all transitions except the Wait actions. When the Wait action is selected, the reward is set to 0.1 × (transition time). Detailed transition rules and reward definitions are introduced in Sections 3.2 and 3.3.

3. LEARNING POLICIES FOR ROBOT MOVES IN CLUSTER TOOLS BY REINFORCEMENT LEARNING
There are various ways to solve an MDP. When we know the model, we can define the total return and then find the optimal value function of the Bellman optimality equation by dynamic programming (Bellman 1954). However, since the proposed model has time elements in the state, it is not possible to solve the MDP through dynamic programming due to the curse of dimensionality. Therefore, we applied reinforcement learning (RL) to find a policy for the proposed MDP model. RL requires an environment to interact with, so we built an interactive environment based on the proposed MDP model. We then set up the learning task and learned the robot policy using Q-learning. After learning, we performed another experiment to measure the performance of the policy.

3.1. Reinforcement learning
Reinforcement learning (RL) is useful when classical dynamic programming is not enough to solve an MDP problem. Dynamic programming is not suitable when it is difficult to build a perfect model of the environment because the transition probabilities are unknown (the curse of modeling), or when the environment has too many states (the curse of dimensionality). RL adds the concepts of stochastic approximation, temporal differences, and function approximation to classical dynamic programming, so that it can solve MDP problems even when the transition probability is not explicitly represented and the state dimension is very large (Gosavi 2014).

3.2. The environment
Since we find the robot policy by RL, we need an environment that returns the next state and reward for an action taken in a state, so that learning can progress. The environment required for RL is either one consisting of real equipment or one built by computer simulation. The best way to interact with the environment would be to build a real-world environment and conduct the experiments with it; however, using a real-world cluster tool is nearly impossible because of its cost.
A single tool is very expensive and needs a large area to install; therefore, even a semiconductor company cannot set aside enough cluster tools from its manufacturing facilities for experiments. To replace the real-world environment, we built a virtual environment that responds to stimuli. The simulation environment mimics the real world, though only partially; to make the partially reflected simulation environment include the core elements we care about, we had to build it carefully.

We made some assumptions in building the environment. First, we do not consider machine failure; process times can vary depending on the machine and the time, but machine failures are not handled in this study. Second, the environment is stationary, which means the system dynamics do not change as time goes by: no matter how much time has passed, they follow the same distributions. Third, the time dynamics of the system are governed by the rules we set; hence, they may not fully reflect real-world cluster tool behavior.

The simulation environment shows the change of the environment depending on the agent's action: it returns the appropriate reward and the next state according to the selected action. Some actions are valid only in particular states. For example, the robot can take the Wait action only when at least one chamber is filled with a wafer whose process has not yet been completed, that is, when a chamber i with C_i = 1 exists. Detailed rules for the robot actions are specified in Table 1.

Table 1: Possible Situations for the Robot Actions

  Wait:   At least one chamber is filled with a wafer whose process has not been completed.
  Unload: A chamber has a processed wafer that is still inside it.
  Load:   The robot holds a wafer, and one of the chambers corresponding to the step that the wafer is going to enter is empty.
  Swap:   A chamber has a processed wafer that is still inside it, and at the same time the robot holds exactly one wafer that is to enter that chamber.
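A minimal sketch (ours, not code from the paper) of how the conditions of Table 1 could be checked against the state tuple sketched in Section 2.2; the helper names and the step-to-chamber mapping are our own assumptions.

```python
# Minimal sketch (not from the paper) of the Table 1 validity rules, using the
# state tuple sketched in Section 2.2. `chambers_of_step` maps a process step to
# the chambers in charge of it; it and all names here are our own assumptions.
def wait_valid(s):
    return any(c == 1 for c in s.C)        # some chamber is still processing a wafer

def unload_valid(s, i):
    # chamber i has finished its wafer; we also assume a free robot arm is needed
    return s.C[i - 1] == 2 and s.R < 2

def load_valid(s, step, chambers_of_step):
    # the robot holds a wafer bound for this step and a chamber of the step is empty
    return (s.R > 0 and step in s.S
            and any(s.C[j - 1] == 0 for j in chambers_of_step[step]))

def swap_valid(s, i, chambers_of_step):
    # chamber i holds a finished wafer and the robot holds exactly one wafer
    # whose next step is handled by chamber i
    return (s.C[i - 1] == 2 and s.R == 1
            and any(i in chambers_of_step[step] for step in s.S if step > 0))
```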

As Table 1 shows, an action is valid only when the corresponding condition is satisfied. If an action outside these possible situations is selected, the state does not change; we set the environment to prevent state transitions when inappropriate actions are taken. We call an action taken in an impossible situation an invalid action, and an action taken in a possible situation a valid action. For example, if a chamber has a processed wafer that is still inside it and, at the same time, the robot has only one wafer on its arms, a swap action on that chamber is a valid action, while the other actions on it are not.

All transitions are treated as deterministic moves except for the Wait action, as mentioned in Section 2.2. Other actions cause the environment to give the same response each time a state is entered. However, when the Wait action is chosen, the state transition is performed stochastically, and the transition probability is defined differently depending on the state. When only one chamber is in the full state, that chamber is selected to complete its process. When more than one chamber is in the full state, the change in the environment is determined by the following rule (a small sketch of this rule is given below):

- With probability 0.9, the chamber with the minimum remaining process time completes its process.
- With probability 0.1, the completing chamber is selected with probability inversely proportional to its remaining process time.

Every time the Wait action is chosen, the transition time is selected stochastically according to a beta(50,50) distribution.
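A minimal sketch (ours, not the authors' code) of this Wait-transition rule; how the beta(50,50) sample is scaled to an actual duration is not spelled out in the paper, so the scaling below is an assumption.

```python
# Minimal sketch (not from the paper) of the Wait transition: pick which full
# chamber completes its process, then draw a transition time. Scaling the
# beta(50,50) sample to an actual duration is our assumption.
import random

def choose_completing_chamber(Z, full):
    """full: indices of chambers with C_i = 1; Z: remaining process times."""
    if random.random() < 0.9:
        return min(full, key=lambda i: Z[i])          # shortest remaining time completes
    weights = [1.0 / max(Z[i], 1e-9) for i in full]   # inversely proportional to Z_i
    return random.choices(full, weights=weights)[0]

def wait_transition_time(mean_time):
    # beta(50,50) is symmetric around 0.5; scaled so the mean equals mean_time (assumed)
    return 2.0 * mean_time * random.betavariate(50, 50)
```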
3.3. Task description
In general, a set of 25 wafers forms a "wafer cassette", which is the unit of production for cluster tools. Therefore, the goal is to quickly finish a single cassette process rather than a single wafer process. In fact, real companies usually produce 100 to 200 wafers at once, that is, 4 to 8 wafer cassettes, without stopping the operation of the cluster tool. The problem of minimizing the total makespan can be approached either as reducing the total makespan itself or as reducing the makespan of each subtask. When the whole makespan is measured and returned only at the end, the reward structure is too sparse to learn the policy. Therefore, in this study, we follow the approach of minimizing the makespan of each subtask, which is completing a single wafer, and we use the structure of receiving a reward at the end of each wafer process. In order for the subtasks to ultimately reduce the total makespan, the total return is defined as the discounted reward, so learning is driven toward receiving rewards as soon as possible. Furthermore, after 100 to 200 wafers are produced in the real world, the equipment configuration may be changed. In the case of an episodic task that ends after the production of a certain number of wafers, the whole process can be divided into start-up, steady, and close-down cycles (Kim, Lee and Kim 2016). In this study, however, we examine whether the policy changes when the environment changes from deterministic to stochastic in the steady cycle; therefore, we consider the learning task as a continuing task that learns the steady cycle. The most important goal of this problem is to learn to choose actions that maximize the total return. In addition, the agent should learn to avoid behavior that drives the system into deadlock and to avoid invalid actions.

3.4. Learning procedure
The cluster tool behavior was learned by computer-simulation-based reinforcement learning. Learning is achieved through the interaction of the agent and the environment: the agent repeats the process of selecting an action, receiving the corresponding response, and updating the Q value for the state and the action. The action selection and Q-value update follow the traditional Q-learning method (Sutton and Barto 1998); the algorithm for Q-learning is illustrated in Figure 3. We used an ε-greedy policy to select an action from the current state, where ε decreased over the time steps. Even with the same learning method, the selectable action set and the response to an action change depending on how the environment is set. In this study, we set the environment to have two chambers and wafer flow pattern (1,1), with all process times set to 4 seconds; here, the wafer flow pattern refers to the number of parallel chambers in each process step. The initial state is set to be fully loaded, so all the chambers are filled with wafers that have not yet begun to be processed. Since we only deal with the steady state, we only used (C_1, C_2, R, S_1, S_2, Z_1, Z_2) as the state. With these environment settings, the agent continued learning using Q-learning in the direction of maximizing the total return; a minimal sketch of the update loop is given below.

Figure 3: Q-Learning Algorithm
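A minimal tabular Q-learning sketch (ours, not the authors' implementation or Figure 3 itself); `env.reset`, `env.step`, and `valid_actions` stand for the simulation environment and the action-filtering rule described in this section, and the hyperparameter values are arbitrary examples.

```python
# Minimal tabular Q-learning sketch (not the authors' code or Figure 3 itself).
# `env` is assumed to expose reset()/step(), and valid_actions() the non-invalid,
# non-deadlock filtering of Section 3.4; alpha, gamma, and the epsilon schedule
# are arbitrary example values.
import random
from collections import defaultdict

def q_learning(env, valid_actions, steps=100000, alpha=0.1, gamma=0.95):
    Q = defaultdict(float)                           # Q[(state, action)], initialized to 0
    s = env.reset()
    for t in range(steps):
        eps = max(0.01, 1.0 - t / steps)             # epsilon decreases over the time steps
        acts = valid_actions(s)
        if random.random() < eps:
            a = random.choice(acts)                  # explore
        else:
            a = max(acts, key=lambda x: Q[(s, x)])   # exploit greedily
        s_next, r = env.step(a)                      # simulated response: next state, reward
        nxt = valid_actions(s_next)
        best_next = max(Q[(s_next, x)] for x in nxt) if nxt else 0.0
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # Q-learning update
        s = s_next
    return Q
```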

However, there are other issues that the agent needs to take care of: deadlock actions and invalid actions. We want to ensure that when the agent chooses an action in the current state, it chooses neither an invalid action nor a deadlock action, that is, an action that causes the next state to become a deadlock. The agent defines the possible action set for the current state by a two-step look-ahead method. To remove the invalid actions for the current state, the agent requests the environment's response for all the actions and keeps only the actions that are valid (one-step look-ahead). For an action that is valid in the current state, a second look-ahead is needed to check whether a deadlock occurs. It is time-consuming to carry out a two-step look-ahead in every learning step, so to speed up the procedure we applied deadlock prevention rules to the second look-ahead. With these rules, deadlock checking is needed only when all of the loading actions for the current state are invalid and the number of held wafers (R) is 1; moreover, only the unloading actions among the valid actions need to be checked. If any loading action is valid, the agent does not need to do a deadlock check at all. A way to check whether a given unloading action leads to deadlock is to request the responses for the appropriate loading actions: if the unloaded wafer has process step k to visit next, check whether it is possible to load it into one of the chambers in charge of the k-th process step. If any of those loading actions is valid, the given unloading action is not a deadlock action; otherwise, it is a deadlock action.

For example, consider the case illustrated in Figure 2, where the number of chambers (n) is 4, the current state is (3, 2, 1, 1, 2, 1, 2, 0, 0, 4, 4, 0), and the wafer flow pattern is (2,2). Since n is 4, the cardinality of the available action set for every state is 15, which is the value of 3n + 3. Among these 15 actions, the agent has to avoid invalid actions and deadlock actions. The agent first removes the invalid actions from the action set and then removes the deadlock actions from the remaining set. Through interaction with the environment, the agent identifies the invalid actions and stores only the valid actions, which are {Wait, U_0, U_1, U_4, SW_4}. Since there is no loading action among the valid actions, the actions {U_0, U_1, U_4} become the candidates for deadlock actions, by the deadlock checking rule mentioned above. Therefore, the agent needs to check the validity of the appropriate loading actions for states (b), (c), and (d) in Figure 4. In state (b), the unloaded wafer has to visit the 1st process step; hence, responses are requested for the actions {L_0, L_1}, which are in charge of the 1st process step. In the same way, responses for the actions {L_2, L_3} and {L_4} are requested to check the deadness of states (c) and (d), respectively. The agent then finds that no loading action is valid for states (b) and (c), while {L_4} is valid for state (d). Therefore, the unloading actions corresponding to states (b) and (c) are deadlock actions, so the actions {U_0, U_1} are removed from the remaining valid action set {Wait, U_0, U_1, U_4, SW_4}. Finally, the actions {Wait, U_4, SW_4} can be chosen from the current state S_0.

Figure 4: Checking Deadlock Action Candidates

We conducted Q-learning based on the above action selection rule. The agent always chooses a non-invalid and non-deadlock action, regardless of the exploration rate ε. The reason for placing this additional action selection stage before the action selection step of Q-learning is the learning time: in the cluster tool, too many of the actions in the full action set are invalid or deadlock actions.
If we do not forbid those inappropriate actions, computational effort is spent exploring unnecessary actions and learning time is wasted. Therefore, by adding these hand-crafted adjustments specific to this system, the learning time has been shortened; a minimal sketch of the deadlock check is given below.
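The sketch below (ours, not the authors' code) illustrates the deadlock-prevention filtering rule described above; `env.peek` and `env.has_valid_load` are our own assumed helpers standing in for the environment queries of the two-step look-ahead.

```python
# Minimal sketch (not from the paper) of the deadlock-prevention filtering rule.
# `env.peek` (one-step simulation of an action) and `env.has_valid_load` are our
# own assumed helpers standing in for the environment queries described above.
def filter_deadlock_actions(state, valid_acts, env):
    loading_acts = [a for a in valid_acts if a.startswith("L")]
    if loading_acts or state.R != 1:
        return valid_acts                 # the deadlock check is only needed otherwise
    safe = []
    for a in valid_acts:
        if not a.startswith("U"):
            safe.append(a)                # only unloading actions need to be checked
            continue
        succ = env.peek(state, a)         # second look-ahead: the state after the unload
        if env.has_valid_load(succ):      # can the unloaded wafer still be loaded somewhere?
            safe.append(a)
    return safe
```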

Table 2: Q-values for Some States

3.5. Performance measurement
After learning is performed based on the proposed MDP model, Q(s, a) values for every state s ∈ S and action a ∈ A are obtained. The policy we obtain is the greedy policy that selects the action with the highest Q-value in each state, i.e., argmax_a Q(s, a). To report the policy, we would have to list the selected action for every state, as shown in Table 2. However, there are too many states to list them all in a table, and it is difficult to understand intuitively what action to take in some cases. Therefore, after obtaining the policy, we used measures that represent the performance of policies and compared the performance of the acquired policy against the existing swap sequence.

Through learning, Q(s, a) for s ∈ S, a ∈ A is continuously updated; define the Q(s, a) table recorded after each thousand wafers produced as q_0, q_1, ..., q_t ∈ R^{|S|×|A|}. However, we do not use all of q_0, q_1, ..., q_t to obtain the policy; we only use the Q(s, a) values after convergence. To verify the convergence of the sequence q_0, q_1, ..., q_t, the differences q_k - q_{k-1} for k ∈ {1, ..., t} were checked. If all elements of the matrix q_k - q_{k-1} are relatively small, q_k can be considered to have converged. Figure 5 shows the maximum element of the matrix q_k - q_{k-1} for k ∈ {1, ..., 543}. The value converges to 0, which means the sequence q_0, q_1, ..., q_t converged. We used the converged Q(s, a) values after the 543rd iteration, i.e., q_543, to obtain the policy.

To indicate the performance of the obtained policy, the makespan, the time it takes to process 50 wafers, is used. The makespan is measured in the same environment as the learning period: two processing chambers, wafer flow pattern (1,1), an average process time of four, and chamber process times following the beta(50,50) distribution. Because the process times are randomly generated, the measured makespan can differ even under the same policy. Therefore, we measured the makespan a thousand times under each designated policy and compared the averages.

Figure 5: Maximum Value of the Q-value Difference

4. ANALYSIS OF THE ROBOT POLICY OBTAINED BY LEARNING
Figure 6 shows the average Q-values of the obtained policy and the swap policy. Figure 7 and Table 3 show the makespan differences of the two policies as a histogram and a table, respectively. M_p and M_s denote the makespan under the obtained policy and the swap policy, respectively.

We confirmed that the makespan of the policy obtained through Q-learning based on the proposed MDP model is on average shorter than that of the swap policy, as shown in Figure 6.

Table 3: Frequency Distribution of the Q-value Difference

Figure 6: Average Q-values of the Two Policies

Except for 2.3% of the cases, the obtained policy achieved the same or a shorter makespan. This means that the above reinforcement learning found a good robot sequence and obtained a slightly better policy than the widely used swap policy. The average makespan is not significantly different, as illustrated in Figure 6. The small difference seems to be due to the settings, namely the average process time and the number of chambers in our learning environment; both settings may have reduced the effect of time variation on the environment. If we were to increase the average process time and the number of chambers to the scale of an actual cluster tool, the time variance would increase, and the difference in makespan might also increase. Nevertheless, the fact that the makespan is reduced in most cases means that, as a policy run in a time-variation environment, the policy obtained through learning in a time-variation environment is better than the policy obtained in a deterministic environment.

5. CONCLUSIONS
Cluster tool scheduling has usually been studied under a deterministic environment. In this paper, we modeled dual-armed cluster tool behaviors as a Markov decision process in a time-variation environment and then applied reinforcement learning to obtain robot policies. The newly obtained robot policies reduce the makespan of 50-wafer processes compared to the conventional swap sequence in most cases. From this study, it appears that the proposed MDP model can adequately express cluster tool behaviors in a time-variation environment. This point of view makes it possible to look at the cluster tool from a slightly different perspective; hence, we can continue to attempt to schedule cluster tools through reinforcement learning. However, since the model includes the remaining process times in the state in order to consider time variations, the state space can be very large. Furthermore, the environment settings we used in learning could not fully express real-world cluster tool behaviors, so learning with more general environment settings is needed. To conduct such learning in further research, function approximation seems necessary because of the time elements in the states.

Figure 7: Histogram Showing the Q-value Differences

ACKNOWLEDGMENTS
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2015R1D1A1A01057131).

REFERENCES
Bellman, R., 1954. The theory of dynamic programming (No. RAND-P-550). RAND Corp., Santa Monica, CA.

Bertsekas, D. P., 1995. Dynamic programming and optimal control (Vol. 1, No. 2). Belmont, MA: Athena Scientific.
Choi, B. K., & Kang, D., 2013. Modeling and simulation of discrete event systems. John Wiley & Sons.
Gosavi, A., 2003. Simulation-based optimization: parametric optimization techniques and reinforcement learning.
Jung, C., & Lee, T. E., 2012. An efficient mixed integer programming model based on timed Petri nets for diverse complex cluster tool scheduling problems. IEEE Transactions on Semiconductor Manufacturing, 25(2), 186-199.
Kim, D. K., Lee, T. E., & Kim, H. J., 2016. Optimal scheduling of transient cycles for single-armed cluster tools with parallel chambers. IEEE Transactions on Automation Science and Engineering, 13(2), 1165-1175.
Kim, J. H., & Lee, T. E., 2008. Schedulability analysis of time-constrained cluster tools with bounded time variation by an extended Petri net. IEEE Transactions on Automation Science and Engineering, 5(3), 490-503.
Lee, T. E., Lee, H. Y., & Shin, Y. H., 2004, December. Workload balancing and scheduling of a single-armed cluster tool. In Proceedings of the 5th APIEMS Conference (pp. 1-15). Gold Coast, Australia.
Lee, T. E., 2008, December. A review of scheduling theory and methods for semiconductor manufacturing cluster tools. In Proceedings of the 40th Conference on Winter Simulation (pp. 2127-2135). Winter Simulation Conference.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
Molloy, M. K., 1982. Performance analysis using stochastic Petri nets. IEEE Transactions on Computers, 31(9), 913-917.
Moody, J., Wu, L., Liao, Y., & Saffell, M., 1998. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6), 441-470.
Murata, T., 1989. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541-580.
Puterman, M. L., 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
Qiao, Y., Wu, N., & Zhou, M., 2012. Real-time scheduling of single-arm cluster tools subject to residency time constraints and bounded activity time variation. IEEE Transactions on Automation Science and Engineering, 9(3), 564-577.
Sutton, R. S., & Barto, A. G., 1998. Reinforcement learning: An introduction (Vol. 1, No. 1). Cambridge: MIT Press.
Zuberek, W. M., 2004. Cluster tools with chamber revisiting: modeling and analysis using timed Petri nets. IEEE Transactions on Semiconductor Manufacturing, 17(3), 333-344.