ECE 592 Topics in Data Science


1 ECE 592 Topics in Data Science. Dror Baron, Associate Professor, Dept. of Electrical and Computer Engr., North Carolina State University, NC, USA

2 Optimization Keywords: linear programming, dynamic programming, convex optimization, non-convex optimization

3 What is Optimization?

4 What is optimization? Wikipedia: "In mathematics, computer science and operations research, mathematical optimization (alternatively, mathematical programming or simply, optimization) is the selection of a best element (with regard to some criterion) from some set of available alternatives."

5 Application #1: Classroom scheduling. Real story: NCSU has classes on multiple campuses, dozens of buildings, etc. We want a good schedule. What's good? Availability of rooms; proximity of classroom to department; instructors have day/time preferences; match sizes of rooms and anticipated class enrollment; avoid conflicts between course pairs of interest to students.

6 Application #2: $\ell_1$ recovery. Among infinitely many solutions, seek one with smallest $\ell_1$ norm (sum of absolute values). Relation to compressed sensing recovery (later in course). Can express $x = x_p - x_n$, so that $\|x\|_1 = \sum_{i=1}^{N} (x_{p,i} + x_{n,i})$. Solve $\min \sum_{i=1}^{N} (x_{p,i} + x_{n,i})$ subject to (s.t.) $y = \Phi x_p - \Phi x_n$. Also need $x_p, x_n$ to be non-negative.

7 Application #3: Reducing fuel consumption. Suppose gas prices increase a lot. Truck fleet company wants to save money by reducing fuel consumption. Things are simple on flat highways. Challenges: 1) You see a hill; can push engine up hill and coast down, or accelerate before hill, then reduce speed while climbing. 2) We see red light; should we coast, accelerate, slam brakes? Main point: dynamic behavior links past, present, future.

8 Application #4: Process design in factories. Consider factory with complicated process. Want to buy fewer inputs (chemicals); want to use less energy; want product to be produced quickly (time); want robustness to surprises (e.g., power shortages). Goal: tune production process to minimize costs. Costs involve inputs, energy, time, robustness, ... Known as multi-objective optimization.

9 Dynamic Programming (DP) Keywords: Bellman equations, dynamic programming

10 What is dynamic programming (DP)? Wikipedia: "In mathematics, management science, economics, computer science, and bioinformatics, dynamic programming (also known as dynamic optimization) is a method for solving a complex problem by breaking it down into a collection of simpler subproblems, solving each of those subproblems just once, and storing their solutions, ideally using a memory-based data structure."

11 Resembles divide and conquer. Have a large problem; partition into parts. Dynamic nature of problem links past, present, future. Want decision whose combined costs (current plus future) are best. Whereas brute force optimization is computationally intense, DP is fast.

12 Problem setting. $t$: time. $T$: time horizon (maximal time). $x_t$: state at time $t$. Possible actions: $a_t \in \Gamma(x_t)$. $T(x,a)$: next state upon choosing action $a$. $F(x,a)$: payoff from action $a$. Want to maximize our payoff up to horizon $T$.

13 Solution approach. Basis case: $t = T-1$, have one time left for an action. Maximize payoff by maximizing $F(x_t, a)$: $a^* = \arg\max_{a \in \Gamma(x_t)} F(x_t, a)$. At time $T$ (end of problem) arrive at state $x_T = T(x_t, a^*)$. Don't care about final state, only about payoff.

14 Solution continued. Recursive case: $t < T-1$, have multiple decisions left. Let's keep it simple with $t = T-2$. Based on basis case, for each state $x_{T-1}$ reachable from $x_{T-2}$ can calculate $a^*$ for last decision (in next time step, $t = T-1$). Want optimal cost to account for current payoff and payoff in next step: $a^* = \arg\max_{a \in \Gamma(x_t)} \{F(x_t, a) + \text{next\_payoff}(T(x_t, a))\}$.

15 Recursive solution. Let's simplify recursive case for $t = T-2$ using notation for optimal actions / payoffs at time $t$. $a^*(x_t)$: optimal action at time $t$ given state $x_t$. $\Psi(x_t)$: optimal payoff starting from time $t$. Basis case provides $a^*(x_{T-1})$, $\Psi(x_{T-1})$ for all $x_{T-1}$. Recursive case for $t = T-2$: $a^* = \arg\max_{a \in \Gamma(x_t)} \{F(x_t, a) + \text{next\_payoff}(T(x_t, a))\} = \arg\max_{a \in \Gamma(x_t)} \{F(x_t, a) + \Psi(x_{T-1} = T(x_t, a))\}$. Repeat recursively for smaller $t$.

16 Computationally efficient DP solution. Instead of processing from $t$ up to $T$, reverse order. $t = T-1$: compute $a^*(x_t)$, $\Psi(x_t)$ for all possible $x_t$. $t = T-2$: $a^* = \arg\max_{a \in \Gamma(x_t)} \{F(x_t, a) + \Psi(x_{T-1} = T(x_{T-2}, a))\}$. $t = T-3$: $a^* = \arg\max_{a \in \Gamma(x_t)} \{F(x_t, a) + \Psi(x_{T-2} = T(x_{T-3}, a))\}$. General case: Bellman's optimality equations. Each time step, store optimal actions and payoffs. Lookup table (LUT) for $\Psi$ instead of recomputing. Can construct sequence of optimal actions with LUT.
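A minimal Python sketch of the reverse-order recursion above. It assumes a finite state set small enough to tabulate and zero payoff after the horizon. The toy problem at the bottom (positions 0 to 3, actions "stay" and "advance", and the payoff rule) is hypothetical and not from the course, and the names `backward_induction`, `brute_force`, `psi`, and `policy` are illustrative.

```python
def backward_induction(states, actions, transition, payoff, horizon):
    """Tabulate psi[t][x] (optimal payoff from time t) and policy[t][x] (optimal action)."""
    psi = [dict() for _ in range(horizon + 1)]
    policy = [dict() for _ in range(horizon)]
    for x in states:
        psi[horizon][x] = 0  # assumption: nothing is earned after the horizon
    for t in range(horizon - 1, -1, -1):  # reverse order: T-1, T-2, ..., 0
        for x in states:
            best_action, best_value = None, float("-inf")
            for a in actions(x):
                # Bellman step: current payoff plus the stored payoff of the next state
                value = payoff(x, a) + psi[t + 1][transition(x, a)]
                if value > best_value:
                    best_action, best_value = a, value
            psi[t][x] = best_value      # lookup table entry
            policy[t][x] = best_action
    return psi, policy


def brute_force(x, t, actions, transition, payoff, horizon):
    """Try every action sequence from state x at time t (exponential in the horizon)."""
    if t == horizon:
        return 0
    return max(
        payoff(x, a) + brute_force(transition(x, a), t + 1, actions, transition, payoff, horizon)
        for a in actions(x)
    )


# Hypothetical toy problem: staying at position x pays x, advancing costs 1.
STATES = [0, 1, 2, 3]

def toy_actions(x):
    return ["stay", "advance"] if x < 3 else ["stay"]

def toy_transition(x, a):
    return x + 1 if a == "advance" else x

def toy_payoff(x, a):
    return x if a == "stay" else -1

psi, policy = backward_induction(STATES, toy_actions, toy_transition, toy_payoff, horizon=4)
print(psi[0][0], policy[0][0])
```

The sequence of optimal actions can be read off by starting at the initial state, looking up `policy[t][x]`, and following `toy_transition` forward in time.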

17 Why computationally efficient? Let's contrast computational complexities. Brute force optimization: $|\Gamma|$ actions per time step and $T$ time steps; must evaluate $|\Gamma|^T$ trajectories of actions: $\Theta(|\Gamma|^T)$. DP: compute $F(x_t, a) + \Psi(x_{t+1} = T(x_t, a))$ for $|\Gamma|$ actions and $T$ time steps (per state): $\Theta(|\Gamma| T)$. Whereas brute force optimization is computationally intense, DP is fast.

18 Variations. Deterministic / random: next state and payoff could be random. Example: there could be more users than expected; adjust server (action) to account for future trajectory of software. Finite / infinite horizon: infinite horizon decision problems require discount factor $\beta$ to give future payoffs at time $t$ weight $\beta^t$; payoffs in far future matter less, $\beta < 1$. Discrete / continuous time.

19 Example [Cormen et al.]: Rod cutting problem. Have rod of integer length $n$. Have table of prices $p_i$ charged for length-$i$ cuts. Cutting is free. Want to cut rod into parts (or not cut at all) to maximize profit.

20 Example continued. Length $n = 4$. Can charge prices $p_1 = 1$, $p_2 = 5$, $p_3 = 8$, $p_4 = 9$. Could look at all possible sets of cuts (there are $2^{n-1} = 8$ of them).

21 Example using DP. Unrealistic to consider all cutting configurations for large $n$; use DP instead. Basis: $n = 1$, $\Psi(1) = p_1 = 1$. Recursion: $n = 2$, $\Psi(2) = \max\{2\Psi(1), p_2\} = 5$; $n = 3$, $\Psi(3) = \max\{\Psi(1) + \Psi(2), p_3\} = \max\{1 + 5, 8\} = 8$. At each stage, maximize over $\Psi(k) + \Psi(n-k)$ for $k = 1, 2, \dots, n-1$; and for $k = n$ use $p_n$. Ex: $\Psi(7) = \max\{\Psi(1) + \Psi(6), \Psi(2) + \Psi(5), \Psi(3) + \Psi(4), \dots, p_7\}$.
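The recursion above can be sketched in a few lines of Python. This is an illustrative sketch, not the course's own code; the function names `rod_cutting` and `pieces` and the `cut` table used to recover the actual pieces are assumptions made for the example.

```python
def rod_cutting(prices):
    """prices[i-1] is the price p_i of a length-i piece; returns (psi, cut) tables."""
    n = len(prices)
    psi = [0] * (n + 1)   # psi[m] = best revenue for a rod of length m
    cut = [0] * (n + 1)   # cut[m] = m if sold whole, else the size k of one part of the split
    for m in range(1, n + 1):
        best, best_k = prices[m - 1], m          # case k = m: sell the rod uncut
        for k in range(1, m):                    # case k < m: split into k and m - k
            value = psi[k] + psi[m - k]
            if value > best:
                best, best_k = value, k
        psi[m], cut[m] = best, best_k
    return psi, cut


def pieces(m, cut):
    """Recover the list of piece lengths for a rod of length m from the cut table."""
    if cut[m] == m:
        return [m]
    return pieces(cut[m], cut) + pieces(m - cut[m], cut)


psi, cut = rod_cutting([1, 5, 8, 9])   # prices from the slide, n = 4
print(psi[4], pieces(4, cut))          # best revenue is 10, from two pieces of length 2
```

Each `psi[m]` is computed once and reused, which is the lookup-table idea from the DP slides; the work is quadratic in $n$ instead of growing with the $2^{n-1}$ cutting configurations.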

22 Real-world application: Viterbi algorithm. Decodes convolutional codes in CDMA. Also used in speech recognition: text is hidden and (noisy) speech observations help estimate text. Relies on DP. Finds shortest path.
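A hedged sketch of the Viterbi recursion as a shortest-path DP, where the cost of a path is the negative log of its probability. The two-state hidden Markov model below is a small textbook-style toy, not taken from the course, and the function name `viterbi` and its argument layout are assumptions. All probabilities are assumed strictly positive so the logarithms exist.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (path, cost): the hidden-state path of least cost, cost = -log(probability)."""
    # cost[s] = cheapest cost of any path that ends in state s at the current time
    cost = {s: -math.log(start_p[s] * emit_p[s][observations[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at time t + 1
    for obs in observations[1:]:
        new_cost, pointers = {}, {}
        for s in states:
            prev = min(states, key=lambda r: cost[r] - math.log(trans_p[r][s]))
            new_cost[s] = cost[prev] - math.log(trans_p[prev][s]) - math.log(emit_p[s][obs])
            pointers[s] = prev
        cost = new_cost
        back.append(pointers)
    # Trace the stored pointers backward from the cheapest final state
    last = min(states, key=lambda s: cost[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, cost[last]


# Toy model (hypothetical numbers): hidden health state, observed symptoms.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

path, cost = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, math.exp(-cost))
```

For this toy model the most likely hidden sequence is Healthy, Healthy, Fever with probability 0.01512, which can be confirmed by hand over the eight possible paths.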

23 Linear Programming Keywords: linear programming, simplex method

24 Formulation. Canonical form: $\max_x c^T x$ s.t. $Ax \le b$, $x \ge 0$. Note: s.t. = subject to. Matrix manipulations/tricks create variations: $Ax = b$ by enforcing $Ax \le b$ and $Ax \ge b$. We've minimized $\|x\|_1$ (instead of $c^T x$).

25 What's it good for? Transportation: pose airline costs and revenue as linear model, maximize profit (revenue minus costs) with LP. Manufacturing: minimize costs by posing them as linear model. Common theme: many real-world problems are approximately linear, or can be linearized around working point (Taylor series).

26 History. Early formulations date back to early 20th century (rudimentary forms even earlier). Dantzig invented simplex method (solver) in the 1940s: polynomial average runtime; slow worst case. Interior point methods: much faster worst case.

27 Simplex algorithm. Linear constraints, $Ax \le b$, $x \ge 0$, correspond to convex polytope. Linear function being optimized, $c^T x$, is optimal on corner point of polytope. Simplex = outer shell of convex polytope. Start at some corner point (vertex). Examine neighboring vertices. Either $c^T x$ already optimal, or it's better at neighbor. Move to best neighboring vertex; iterate until done. Specific steps correspond to linear algebra.
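To illustrate the corner-point property, here is a small Python sketch for two variables. Note that it is brute-force vertex enumeration, not the simplex method: simplex walks from a vertex to a better neighbor, whereas this sketch checks every intersection of two constraint boundaries. The function name `best_vertex`, the constraint encoding, and the example numbers are all hypothetical, and the feasible region is assumed to be bounded and non-empty.

```python
from fractions import Fraction
from itertools import combinations


def best_vertex(c, constraints):
    """Maximize c[0]*x1 + c[1]*x2 where each row (a1, a2, b) means a1*x1 + a2*x2 <= b."""
    rows = [tuple(Fraction(v) for v in row) for row in constraints]
    # Encode x1 >= 0 and x2 >= 0 as -x1 <= 0 and -x2 <= 0
    rows += [(Fraction(-1), Fraction(0), Fraction(0)),
             (Fraction(0), Fraction(-1), Fraction(0))]
    best_value, best_point = None, None
    for (a1, a2, b1), (a3, a4, b2) in combinations(rows, 2):
        det = a1 * a4 - a2 * a3
        if det == 0:
            continue  # parallel boundaries do not meet in a single point
        # Cramer's rule for the intersection of the two boundary lines
        x1 = (b1 * a4 - a2 * b2) / det
        x2 = (a1 * b2 - a3 * b1) / det
        if all(p * x1 + q * x2 <= r for p, q, r in rows):  # keep feasible corners only
            value = c[0] * x1 + c[1] * x2
            if best_value is None or value > best_value:
                best_value, best_point = value, (x1, x2)
    return best_value, best_point


# Hypothetical LP: max 3*x1 + 2*x2 s.t. x1 + x2 <= 4, x1 + 3*x2 <= 6, x1 <= 3, x >= 0
value, point = best_vertex((3, 2), [(1, 1, 4), (1, 3, 6), (1, 0, 3)])
print(value, point)
```

Exact `Fraction` arithmetic avoids rounding issues in the feasibility test. The number of candidate vertices grows combinatorially with the number of constraints and variables, which is why practical solvers move between neighboring vertices instead.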

28 Convex Optimization Keywords: convex optimization

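29 What are convex/concave functions? Consider real-valued function defined on space $\mathcal{X}$, $f: \mathcal{X} \to \mathbb{R}$. Convex: $f(\lambda x + (1-\lambda) y) \le \lambda f(x) + (1-\lambda) f(y)$, $\forall x, y \in \mathcal{X}$, $\lambda \in (0,1)$. Concave: $f(\lambda x + (1-\lambda) y) \ge \lambda f(x) + (1-\lambda) f(y)$ under the same conditions. Note: $f$ convex if and only if $-f$ concave; for twice-differentiable functions, convex/concave imply non-negative/non-positive second derivatives. Any local optimum is global optimum.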

30 What is convex optimization? Basic convex problem: $x^* = \arg\min_{x \in \mathcal{X}} \{f(x)\}$. Set $\mathcal{X}$ and function $f(x)$ must both be convex. Alternate form: $\min f(x)$ s.t. $g_i(x) \le 0$, $\forall i$. Functions $f, g_1, \dots, g_m$ all convex.

31 Applications (Why is this interesting?). Many problems can be posed as convex: least squares, entropy maximization, linear programming.

32 Newton's method. Newton's method finds roots of equations, $f(x) = 0$. Instead, solve for derivative $f'(x) = 0$ or gradient $\nabla f = 0$. Taylor expansion: $f_T(x) = f(x_t) + f'(x_t)\Delta x + \frac{1}{2} f''(x_t) \Delta x^2 + \dots$ Root of derivative: $f'(x_t) + f''(x_t)\Delta x = 0$. Iterate with $x_{t+1} = x_t + \Delta x$. Newton's method is simple but O(1/t) convergence.
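A minimal one-dimensional sketch of the iteration above, solving $f'(x_t) + f''(x_t)\Delta x = 0$ for the step at each iteration. The test function $f(x) = e^x - 2x$ (minimized at $x = \ln 2$), the starting point, and the name `newton_minimize` are assumptions chosen for illustration; the sketch also assumes $f''(x) \neq 0$ along the way.

```python
import math


def newton_minimize(d1, d2, x0, tol=1e-12, max_iter=50):
    """Find a stationary point of f given its first (d1) and second (d2) derivatives."""
    x = x0
    for _ in range(max_iter):
        step = -d1(x) / d2(x)   # solves f'(x_t) + f''(x_t) * dx = 0 for dx
        x += step               # x_{t+1} = x_t + dx
        if abs(step) < tol:
            break
    return x


# Hypothetical example: f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x)
x_star = newton_minimize(lambda x: math.exp(x) - 2.0, lambda x: math.exp(x), x0=0.0)
print(x_star, math.log(2.0))
```

In several dimensions the same step becomes a linear solve with the Hessian in place of the division by `d2(x)`.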

33 Second order methods. Challenge: first order approximation to derivative slows down Newton's method. Solution: use higher order approximation. Instead of $f'(x_t) + f''(x_t)\Delta x$, use third derivative too. Multi-dimensional function? Use gradient, Hessians. Second order methods more complicated but faster.

34 Gradient descent Keywords: gradient descent, line search, golden section search

35 Gradient descent. In each iteration, select direction to pursue. Coordinate descent: move along one of the coordinates. Gradient descent: direction that minimizes cost function fastest. How far should we move along that direction? Undershooting or overshooting bad for convergence.
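A short sketch of gradient descent with a fixed step size, showing how the choice of step affects convergence. The quadratic cost, the two step sizes, and the names `gradient_descent` and `grad_f` are hypothetical choices for illustration only.

```python
def gradient_descent(grad, x0, step_size, iterations):
    """Repeatedly move against the gradient by a fixed step size."""
    x = list(x0)
    for _ in range(iterations):
        g = grad(x)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical cost f(x) = (x0 - 1)^2 + 10 * (x1 + 2)^2, minimized at (1, -2)
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


good = gradient_descent(grad_f, [0.0, 0.0], step_size=0.05, iterations=200)
bad = gradient_descent(grad_f, [0.0, 0.0], step_size=0.11, iterations=50)
print(good)  # close to the minimizer (1, -2)
print(bad)   # second coordinate overshoots further on every step and diverges
```

With step size 0.11 the error in the second coordinate is multiplied by $1 - 20 \cdot 0.11 = -1.2$ per iteration, so it grows; this is the overshooting problem that motivates choosing the step by a line search.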

36 Line search. Key sub-method is to move along direction just enough to minimize the function along that line. Line search = optimization along line. Many variations: binary search, golden section search. Let's make up an example for this and code it! Check course webpage for Matlab script.
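The course's Matlab script is not reproduced here. As a stand-in, the following Python sketch shows golden section search used as a line search along the negative gradient. The cost function, the starting point, the search interval $[0, 1]$, and the names `golden_section_search` and `phi` are assumptions; the method also assumes the function is unimodal on the interval.

```python
import math


def golden_section_search(phi, lo, hi, tol=1e-8):
    """Minimize a unimodal function phi on [lo, hi] by shrinking the bracket."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0      # inverse golden ratio, about 0.618
    c = hi - ratio * (hi - lo)
    d = lo + ratio * (hi - lo)
    fc, fd = phi(c), phi(d)
    while hi - lo > tol:
        if fc < fd:                            # minimum lies in [lo, d]
            hi, d, fd = d, c, fc               # reuse c as the new d
            c = hi - ratio * (hi - lo)
            fc = phi(c)
        else:                                  # minimum lies in [c, hi]
            lo, c, fc = c, d, fd               # reuse d as the new c
            d = lo + ratio * (hi - lo)
            fd = phi(d)
    return (lo + hi) / 2.0


# Hypothetical cost f(x) = (x0 - 1)^2 + 10 * (x1 + 2)^2 and its gradient
def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x = [0.0, 0.0]
g = grad_f(x)
# phi(s) is the cost after moving a distance s along the negative gradient
phi = lambda s: f([xi - s * gi for xi, gi in zip(x, g)])
s_best = golden_section_search(phi, 0.0, 1.0)
print(s_best)
```

Each iteration reuses one of the two interior function values, so only one new evaluation of `phi` is needed per shrink of the bracket. For this quadratic the exact minimizing step is $1604/32008 \approx 0.0501$, which the search should approach.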

37 Integer Programming Keywords: integer programming, integer linear programs, relaxation

38 What is integer programming? Integer program = optimization problem where some/all variables must be integers. Integer linear programs (ILP): $x^* = \arg\max c^T x$ s.t. $Ax + s = b$, $s \ge 0$, $x \in \mathbb{Z}^n$, with slack variables $s$.

39 Example: Support set detection. $y = Ax + z$, sparse $x$. Want to identify support set where $x \ne 0$. Can we do perfect support set detection? Are there tiny non-zeros? (yes: difficult) What's the SNR? (low: difficult)

40 Example continued. Support set detection, $y = Ax + z$, want support set. Algorithm: consider candidate support set, $s \in \{0,1\}^N$. Create matrix $A_s$ that contains column $i$ iff $s_i = 1$. Run least squares using $A_s$ (find low-energy solution to $y = A_s x$). Iterate over all $s$, select solution with smallest residual. Algorithm is optimal & slow.

41 More about ILP. Integer linear programs can be shown to be NP-hard. This means they are slow. Another algorithmic approach: relaxation. First ignore integer constraints, solve standard LP. Next, round (sort of!) to nearby (not necessarily nearest) integer solution. Various applications require integer solutions; we're just skimming the surface.

42 Non-Convex Optimization Keywords: non-convex optimization

43 What's the challenge? Many functions are non-convex. Convex: one local min (it's the global min). Non-convex: local min need not be global min. Various algorithms could get stuck in local min.

44 Is it hopeless? Maybe initialize an algorithm many different ways; could get stuck in different local mins, choose best. But could be tons of local mins (especially in higher dimensions).

45 Markov chain Monte Carlo. Markov chain Monte Carlo (MCMC) can solve some non-convex problems. Form expression $E(x)$ for energy (analogous to statistical physics). Distribution for signal: $\Pr(x) = Z \exp\{-sE(x)\}$; $s$ analogous to inverse temperature; normalization term $Z$. Sample next version of $x$ from Gibbs distribution. High temperature: small $s$, weak pull toward low energy. Low temperature: large $s$, strong pull to low energy. Gradual cooling.

46 MCMC continued. 1) Do we sample entire sequence $x$? Not necessarily. Can consider re-generating one $x_i$ at a time; only need marginal distribution for $x_i$. 2) Is MCMC guaranteed to converge to global min? Maybe. If you cool down very slowly. 3) So is it any good? Depends. MCMC is very slow but can converge to global min; some techniques to accelerate it.
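A hedged sketch of the ideas on the two MCMC slides: re-generate one $x_i$ at a time from a Gibbs distribution proportional to $\exp\{-sE(x)\}$, while gradually increasing the inverse temperature $s$. Everything specific here is an assumption made for illustration: the binary signal, the subset-sum style energy, the geometric cooling schedule, and the name `anneal`. No claim is made that this run reaches the global minimum.

```python
import math
import random


def anneal(energy, n, sweeps=200, s_start=0.01, s_end=2.0, seed=0):
    """Gibbs-sample a binary vector coordinate by coordinate while cooling."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    initial = list(x)
    best_x, best_e = list(x), energy(x)
    for k in range(sweeps):
        # Gradual cooling: inverse temperature s grows geometrically from s_start to s_end
        s = s_start * (s_end / s_start) ** (k / (sweeps - 1))
        for i in range(n):
            x[i] = 0
            e0 = energy(x)
            x[i] = 1
            e1 = energy(x)
            # Conditional probability that x_i = 1 given the other coordinates
            delta = s * (e1 - e0)
            p_one = 0.0 if delta > 700.0 else 1.0 / (1.0 + math.exp(delta))
            x[i] = 1 if rng.random() < p_one else 0
            e = e1 if x[i] == 1 else e0
            if e < best_e:                      # remember the lowest energy seen so far
                best_x, best_e = list(x), e
    return initial, best_x, best_e


# Hypothetical non-convex energy: squared gap between a subset sum and a target
WEIGHTS = [3, 5, 7, 11, 13]
TARGET = 21

def energy(x):
    return (sum(w * xi for w, xi in zip(WEIGHTS, x)) - TARGET) ** 2


initial, best_x, best_e = anneal(energy, len(WEIGHTS))
print(initial, best_x, best_e)
```

The guard on `delta` avoids an overflow in `math.exp` at low temperature, where a large energy gap makes the higher-energy value essentially impossible to sample.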

47 EM Algorithm Keywords: expectation maximization algorithm, Gaussian mixture models, latent variables

48 Main ideas. Iterate over expectation (E) & maximization (M). Expectation: create function for computing expected log likelihood (based on current parameters). Maximization: update parameters to maximize expected log likelihood from E step. Details coming up.

49 Statistical model & motivation. Model generates data $X$. $Z$: latent / missing values. $\theta$: parameter. Likelihood function: $L(\theta; X, Z) = \Pr(X, Z | \theta)$. Marginal likelihood: $L(\theta; X) = \Pr(X | \theta) = \int L(\theta; X, Z)\,dZ$. Might be intractable (e.g., due to many possible $Z$ sequences). Want to compute $L(\theta; X)$, then optimize parameter. Computationally intractable. Motivates EM.

50 Statistical model & motivation. Expectation: compute expected value of log likelihood for parameter $\theta^{(t)}$ in current iteration $t$: $Q(\theta | \theta^{(t)}) = E_{Z | X, \theta^{(t)}}[\log L(\theta; X, Z)]$. $Z$: typically discrete latent variables. Given parameter $\theta^{(t)}$, sequence $Z$ can be found; typically via fast algorithm, e.g., dynamic programming. Maximization: $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta | \theta^{(t)})$.

51 Example: Gaussian mixture models. What's a Gaussian mixture model (GMM)? $X \sim \sum_i \alpha_i \mathcal{N}(\mu_i, \sigma_i)$. Component $i$ has probability $\alpha_i$, mean $\mu_i$, standard deviation $\sigma_i$. Could be multi-dimensional data: covariance matrix $\Sigma_i$. Useful? Many distributions well-approximated by GMM. In principle can model almost everything as GMM. Trade-off between # components and model accuracy.

52 Example continued. Challenge: parameters $(\alpha_i, \mu_i, \sigma_i)$ often unavailable; must estimate from data $X$. To keep simple: $N$ scalar samples, $X \in \mathbb{R}^N$. Latent variable $Z \in \mathbb{Z}^N$; $z_n$ corresponds to the Gaussian component that $x_n$ belongs to. E step: compute sequence $Z$ given parameters $\theta = (\alpha_i, \mu_i, \sigma_i)$. M step: optimize $\theta$ given $Z$.
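A minimal sketch of EM for a scalar GMM. One swap should be stated plainly: instead of computing a single hard sequence $Z$ in the E step as on the slide, this sketch uses the common soft variant, where each sample gets a probability ("responsibility") of belonging to each component. The data, the initial parameters, the variance floor, and the names `em_gmm_1d` and `gaussian_pdf` are assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def em_gmm_1d(data, alpha, mu, sigma, iterations=50):
    """Return fitted (alpha, mu, sigma) and the log likelihood before each update."""
    alpha, mu, sigma = list(alpha), list(mu), list(sigma)
    k = len(alpha)
    log_likelihoods = []
    for _ in range(iterations):
        # E step: responsibility of component i for each sample, given current parameters
        resp, ll = [], 0.0
        for x in data:
            w = [alpha[i] * gaussian_pdf(x, mu[i], sigma[i]) for i in range(k)]
            total = sum(w)
            ll += math.log(total)
            resp.append([wi / total for wi in w])
        log_likelihoods.append(ll)
        # M step: re-estimate weights, means, and standard deviations
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            alpha[i] = n_i / len(data)
            mu[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - mu[i]) ** 2 for r, x in zip(resp, data)) / n_i
            sigma[i] = max(math.sqrt(var), 1e-6)   # floor keeps a component from collapsing
    return alpha, mu, sigma, log_likelihoods


# Hypothetical data: two well-separated clusters around -2 and +2
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 2.0, 2.1, 1.9, 2.2, 1.8]
alpha, mu, sigma, lls = em_gmm_1d(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(alpha, mu, sigma)
```

On this toy data the means move from the initial guesses of -1 and 1 toward the cluster centers, and the recorded log likelihood does not decrease from one iteration to the next, which is the usual sanity check for an EM implementation.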
