\( \newcommand{\blah}{blah-blah-blah} \newcommand{\eqb}[1]{\begin{eqnarray*}#1\end{eqnarray*}} \newcommand{\eqbn}[1]{\begin{eqnarray}#1\end{eqnarray}} \newcommand{\bb}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\begin{bmatrix}#1\end{bmatrix}} \newcommand{\nchoose}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)} \newcommand{\defn}{\stackrel{\vartriangle}{=}} \newcommand{\rvectwo}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)} \newcommand{\rvecthree}[3]{\left(\begin{array}{r} #1 \\ #2\\ #3\end{array}\right)} \newcommand{\rvecdots}[3]{\left(\begin{array}{r} #1 \\ #2\\ \vdots\\ #3\end{array}\right)} \newcommand{\vectwo}[2]{\left[\begin{array}{r} #1\\#2\end{array}\right]} \newcommand{\vecthree}[3]{\left[\begin{array}{r} #1 \\ #2\\ #3\end{array}\right]} \newcommand{\vecdots}[3]{\left[\begin{array}{r} #1 \\ #2\\ \vdots\\ #3\end{array}\right]} \newcommand{\eql}{\;\; = \;\;} \newcommand{\eqt}{& \;\; = \;\;&} \newcommand{\dv}[2]{\frac{#1}{#2}} \newcommand{\half}{\frac{1}{2}} \newcommand{\mmod}{\!\!\! \mod} \newcommand{\ops}{\;\; #1 \;\;} \newcommand{\implies}{\Rightarrow\;\;\;\;\;\;\;\;\;\;\;\;} \newcommand{\prob}[1]{\mbox{Pr}\left[ #1 \right]} \newcommand{\expt}[1]{\mbox{E}\left[ #1 \right]} \newcommand{\probm}[1]{\mbox{Pr}\left[ \mbox{#1} \right]} \newcommand{\simm}[1]{\sim\mbox{#1}} \definecolor{dkblue}{RGB}{0,0,120} \definecolor{dkred}{RGB}{120,0,0} \definecolor{dkgreen}{RGB}{0,120,0} \)


Module 10: Optimization


10.1   Types of optimization problems

 

First, what is an optimization problem?

 

Discrete optimization:

 

Continuous optimization:

 

Function vs. parameter (or variable) optimization:

 

Categories of continuous (parameter) optimization problems:

 

In this module:

 


10.2   What we learned in calculus

 

Let's go back to our example:

 
Exercise 3: Go back to your calc book and find an example of a "hard to differentiate" function.
 
Exercise 4: What's an example of a function that's continuous but not differentiable? Consider the (weird) function \(f(x)\) where \(f(x)=1\) if \(x\) is rational and \(f(x)=0\) otherwise. Is this continuous? Differentiable?
 
Exercise 5: Consider the function $$ f(x) \eql \frac{x}{\mu_1-\lambda x} + \frac{1-x}{\mu_2-\lambda(1-x)} $$
Compute the derivative \(f'(x)\). Can you solve \(f'(x)=0\)?
 

One other important point from our calc course:

 


10.3   Simple bracketing search

 

Key ideas:

 
Exercise 6: Download and execute BracketSearch.java.
  • What is the running time in terms of \(M\) and \(N\)?
  • If we keep \(MN\) constant (e.g., \(MN=24\)), what values of \(M\) and \(N\) produce best results?
 
Exercise 7: Draw an example of a function for which bracket-search fails miserably, that is, the true minimum is much lower than what's found by bracket search even for large \(M\) and \(N\).
 

How does one evaluate such algorithms?

  • In the above case, the actual computation inside the innermost loop was simple
              \(\Rightarrow\) Computing \(f(x)\) takes constant time.

  • However, in many real-world examples
              \(\Rightarrow\) Computing \(f(x)\) can take a lot of time.

  • Example: \(f(x)\) may be the result of solving a differential equation
              \(\Rightarrow\) Computing \(f(x)\) itself needs multiple iterations.

  • Thus, one would like to reduce the number of function evaluations.
 
Exercise 8: What is the number of function evaluations in terms of \(M\) and \(N\) for the bracket-search algorithm?
 

Can bracket search work for multivariate functions such as \(f(x,y) = 2x+3y^2\)?

  • Yes, one simply divides every dimension at each step
              \(\Rightarrow\) The brackets become "cells"

  • For example, in 2D (2 variables)
         Algorithm: bracketSearch (a1, b1, a2, b2)
         Input: the initial ranges for each dimension [a1, b1] and [a2, b2]
         1.    for i=1 to N                 
                   // The ranges could be different in each dimension.
         2.        δ1 = (b1 - a1) / M
         3.        δ2 = (b2 - a2) / M
         4.        x1* = a1,  x2* = a2
         5.        bestf = f(x1*, x2*)
         6.        for x1=a1 to b1 step  δ1     
         7.            for x2=a2 to b2 step  δ2
         8.                f = f(x1,x2)
         9.                if f < bestf
         10.                   bestf = f
         11.                   x1* = x1,  x2* = x2
         12.               endif
         13.           endfor
         14.       endfor
                   // Shrink cell.
         15.       a1 = x1* -  δ1,   b1 = x1* + δ1
         16.       a2 = x2* -  δ2,   b2 = x2* + δ2
         17.   endfor
         18.   return x1*, x2*
         

  • Notice that we've used the notation \((x_1,x_2)\) instead of \((x,y)\).
              \(\Rightarrow\) For an n-dimensional space we'd use \((x_1,x_2, ..., x_n)\) to represent a particular point.
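To make this concrete, here is a minimal Java sketch of the 2D pseudocode above. It is not one of the course's .java files; the test function and the values of M and N are illustrative assumptions.

     // A Java sketch of 2D bracket search. The test function f
     // (minimum at (1,2)) and the values of M and N are assumptions.
     public class BracketSearch2D {

         static double f(double x1, double x2) {
             return (x1-1)*(x1-1) + (x2-2)*(x2-2);    // Hypothetical test function.
         }

         static double[] bracketSearch(double a1, double b1, double a2, double b2,
                                       int M, int N) {
             double x1Star = a1, x2Star = a2;
             for (int i = 0; i < N; i++) {
                 double delta1 = (b1 - a1) / M;
                 double delta2 = (b2 - a2) / M;
                 x1Star = a1;  x2Star = a2;
                 double bestF = f(x1Star, x2Star);
                 for (double x1 = a1; x1 <= b1; x1 += delta1) {       // Scan the grid of cells.
                     for (double x2 = a2; x2 <= b2; x2 += delta2) {
                         double fval = f(x1, x2);
                         if (fval < bestF) {
                             bestF = fval;  x1Star = x1;  x2Star = x2;
                         }
                     }
                 }
                 a1 = x1Star - delta1;  b1 = x1Star + delta1;         // Shrink cell.
                 a2 = x2Star - delta2;  b2 = x2Star + delta2;
             }
             return new double[] {x1Star, x2Star};
         }

         public static void main(String[] args) {
             double[] best = bracketSearch(0, 5, 0, 5, 10, 20);
             System.out.println("Minimum near (" + best[0] + ", " + best[1] + ")");
         }
     }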
 
Exercise 9: What is the number of function evaluations in terms of \(M\) and \(N\) for the 2D bracket-search algorithm? How does this generalize to \(n\) dimensions?
 
Exercise 10: Add code to MultiBracketSearch.java to find the minimum of \(f(x_1,x_2)=(x_1-4.71)^2 + (x_2-3.2)^2 + 2(x_1-4.71)^2(x_2-3.2)^2\).
 

When to stop?

  • We have fixed the number of iterations at \(N\).

  • We could, instead, stop when some desired accuracy is achieved.

  • Let's try the following:
    • Let \(f_k\) be the best value after \(k\) iterations.
    • We will compare \(f_k\) and \(f_{k-1}\).
                If \(|f_k - f_{k-1}| \lt \epsilon\), then stop.
    • Here, \(\epsilon\) is some suitably small number.

  • In pseudocode:
         Algorithm: bracketSearch (a, b)
         Input: the initial range [a,b]
         1.    bestf = some-large-value,  N = 0      // N counts iterations.
         2.    prevBestf = bestf + 2ε         // So that we enter the loop.
         3.    while |bestf - prevBestf| > ε
         4.        δ = (b-a) / M              
         5.        x* = a
         6.        prevBestf = bestf
         7.        bestf = f(x*)
         8.        for x=a to b step δ
         9.            f = f(x)
         10.           if f < bestf
         11.               bestf = f
         12.               x* = x
         13.           endif
         14.       endfor
         15.       a = x* - δ               
         16.       b = x* + δ
         17.       N = N + 1                  // Track N for printing/evaluation
         18.   endfor
               // Print N if desired.
         19.   return x*
         

  • How do we choose \(\epsilon\)?
    • If, in our problem, \(f\) values happen to be very large (e.g., \(10^6\)),
                \(\Rightarrow\) \(\epsilon=0.1\) may be unnecessarily small.
    • If, on the other hand, \(f\) values happen to be very small (e.g., \(10^{-6}\)),
                \(\Rightarrow\) \(\epsilon=0.1\) may be too large.
                \(\Rightarrow\) The loop exits almost immediately, and little or no optimization occurs.

  • A better solution:
              Stop when $$ \left| \frac{f_k - f_{k-1}}{f_{k-1}}\right| \; \lt \; \epsilon $$
              \(\Rightarrow\) Thus, if the proportional change is very small, stop.
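In code, the proportional test might look like this small Java helper, a sketch whose name and zero-denominator guard are my own additions (it would sit inside a class such as BracketSearch2.java):

     // Sketch: true when the relative change from prevBestF to bestF
     // falls below epsilon. Falls back to an absolute test at zero.
     static boolean converged(double prevBestF, double bestF, double epsilon) {
         if (prevBestF == 0.0)
             return Math.abs(bestF) < epsilon;          // Guard against division by zero.
         return Math.abs((bestF - prevBestF) / prevBestF) < epsilon;
     }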
 
Exercise 11: Modify BracketSearch2.java to use the proportional-difference stopping condition.
 


10.4   Golden ratio search

 

First, an important observation:

  • Suppose \(a \lt b\) are two real numbers.
              \(\Rightarrow\) e.g., they represent an interval \([a,b]\).

  • Next, let \(r\) be a number strictly between \(0\) and \(1\).
              \(\Rightarrow\) i.e., \(0 \lt r \lt 1\).

  • Now consider the numbers \(c = ra + (1-r)b\), and \(d = (1-r)a + rb\).

  • Then, it is true that \(a \lt c \lt b\)
              \(\Rightarrow\) That is, \(c\) is between \(a\) and \(b\).

  • Similarly, \(a \lt d \lt b\)

  • Intuition:
    • Pick \(r=0.3\) and \(a=4, b=9\).
    • Then, \(c = 0.3*4 + 0.7*9 = 7.5\), a weighted average of \(4\) and \(9\)
                \(\Rightarrow\) a weighted average must lie between \(4\) and \(9\).
     
    Exercise 12: Prove this result.
 

Next, let's look at the ideas in ratio search (before we tackle golden-ratio search):

  • Recall that, in bracket-search, we fixed the number of intervals \(M\).

  • In ratio search, we will use \(M=1\) but adjust the ends.

  • Start with some interval \([a,b]\).

  • Compute the ends of a smaller interval contained within: \(c = ra + (1-r)b\) and \(d = (1-r)a + rb\), with \(\half \lt r \lt 1\) so that \(a \lt c \lt d \lt b\).

  • At each step, shrink the interval to either \([a,d]\) or \([c,b]\).

    • If \(f(c) \leq f(d)\), set \(b = d\).
    • If \(f(c) \gt f(d)\), set \(a = c\).

  • Then, repeat until interval is small enough to stop.

  • Why does this work?

    • First, we assume the function is unimodal
                \(\Rightarrow\) \(f(x)\) decreases from \(a\) to the optimal \(x\), and increases after that.
    • If \(f(c) \leq f(d)\), then the minimum cannot occur to the right of \(d\).
                \(\Rightarrow\) We can shrink the interval on the right.
    • Same reasoning for the left side.

  • After shrinking the interval, we again compute two interior points and repeat.
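  • For example, with \(a=0, b=1\) and \(r=0.618\): \(c = 0.618(0) + 0.382(1) = 0.382\) and \(d = 0.382(0) + 0.618(1) = 0.618\). If \(f(0.382) \leq f(0.618)\), the new interval is \([0, 0.618]\); otherwise it is \([0.382, 1]\).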
 

Golden-ratio search:

  • Recall what happens after shrinking an interval:

    • We re-compute new \(c\) and \(d\) values after shrinking.

  • Instead, suppose we could re-use one of the previous values:

  • How do we arrange for this to happen? Is there a value of \(r\) that will make this happen?

  • First, observe that \(c - a = ra + (1-r)b - a = (1-r)(b-a)\).


    Thus $$ \frac{(c-a)}{(b-a)} \eql (1-r) $$ That is, \(1-r\) is the ratio of the distance \(c-a\) to the length of the whole interval.

  • Similarly, $$ \frac{(d-a)}{(b-a)} \eql r $$
     
    Exercise 13: Prove this result.

  • Next, consider two successive iterations: after shrinking to the right, the new interval is \([a',b']\) with \(a' = a\) and \(b' = d\), and we want the old \(c\) to serve as the new \(d'\).


    The ratio $$\eqb{ r \eqt \frac{d' - a'}{b' - a'} \\ \eqt \frac{d' - a'}{d - a} \\ \eqt \frac{c - a}{d - a} \\ \eqt \frac{(1-r)(b - a)}{r(b - a)} \\ \eqt \frac{(1-r)}{r} \\ }$$

  • Simplifying, we get \(r^2 + r - 1 = 0\).

  • This famous quadratic has the positive root \(r = 0.618...\) (approx.)
              \(\Rightarrow\) the reciprocal of the golden ratio \(\phi \approx 1.618\), which gives the method its name.
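  • To verify, apply the quadratic formula to \(r^2 + r - 1 = 0\) and keep the root lying in \((0,1)\): $$ r \eql \frac{-1+\sqrt{5}}{2} \;\approx\; 0.6180 $$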

  • Pseudocode:
         Algorithm: goldenRatioSearch (a, b)
         Input: the initial range [a,b]
         1.    c = ra + (1-r)b,  d = (1-r)a + rb,  fc = f(c),  fd = f(d)      // Initialize interior points.
         2.    while |(c-d)/d| > ε
         3.        if fc ≤ fd
         4.            b = d                  // Shrink from right side.
         5.            d = c                  // Re-use old c, f(c).
         6.            fd = fc
         7.            c = ra + (1-r)b        // Compute new c, f(c).
         8.            fc = f(c)
         9.        else
         10.           a = c                  // Shrink from left side.
         11.           c = d                  // Re-use old d, f(d).
         12.           fc = fd  
         13.           d = (1-r)a + rb        // Compute new d, f(d).
         14.           fd = f(d)
         15.       endif
         16.   endwhile
         17.   return (c+d)/2
         
 
Exercise 14: Describe what could go wrong if we replaced the while-condition with
     2.    while |(fc-fd)/fd| > ε
     
How would you address this problem?
 
Exercise 15: Implement golden-ratio search in this template: GoldenRatio.java.
 


10.5   Gradient descent

 

Let's start by understanding what gradient means:

  • Consider a (single-dimensional) function \(f(x)\):

    • Let \(f'(x)\) denote the derivative of \(f(x)\).
    • The gradient at a point \(x\) is the value of \(f'(x)\).
                \(\Rightarrow\) Graphically, the slope of the tangent to the curve at \(x\).

  • Observe the following:

    • To the left of the optimal value \(x^*\), the gradient is negative.
    • To the right, it's positive.

  • We seek an iterative algorithm of the form
         while not over
             if gradient < 0
                 move rightwards
             else if gradient > 0
                 move leftwards
             else
                 stop               // gradient = 0 (unlikely in practice, of course)
             endif
         endwhile
         

  • The gradient descent algorithm is exactly this idea:
         while not over
             x = x - α f'(x)
         endwhile
         
    Here, we add a scaling factor \(\alpha\) in case \(f'(x)\) values are of a different order-of-magnitude:

  • Why we need \(\alpha\)
    • For example, it could be that \(x=0.1, x^*=0\) and \(f'(0.1)=1000\).
    • Then one iterative step without \(\alpha\) would produce \(x = 0.1 - 1000 = -999.9\)
                \(\Rightarrow\) Which would be way out of bounds.
    • For such a problem, we'd use \(\alpha = 0.0001\) so that
                \(x = 0.1 - 0.0001*1000 = 0.09\)

  • The algorithm parameter \(\alpha\) is sometimes called the stepsize.

  • In pseudocode:
         Algorithm: gradientDescent (a, b)
         Input: the range [a,b]
         1.    x = a                      // Alternatively, x = b.
         2.    while |f'(x)| > ε
         3.        x = x - α f'(x)        // Note: f'() is evaluated at current value of x (before changing x).
         4.    endwhile
         5.    return x
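
As a concrete illustration, here is a minimal Java sketch of this pseudocode for a hypothetical function \(f(x)=(x-3)^2\), whose derivative \(f'(x)=2(x-3)\) we can code directly; the values of \(\alpha\) and \(\epsilon\) are likewise illustrative:

     // Sketch of 1D gradient descent for the hypothetical f(x) = (x-3)^2.
     public class GradientDescent1D {

         static double fprime(double x) { return 2*(x - 3); }    // f'(x) for f(x)=(x-3)^2.

         static double gradientDescent(double a, double b, double alpha, double epsilon) {
             double x = a;                            // Alternatively, x = b.
             while (Math.abs(fprime(x)) > epsilon) {
                 x = x - alpha * fprime(x);           // f' evaluated before x changes.
             }
             return x;
         }

         public static void main(String[] args) {
             System.out.println(gradientDescent(0, 10, 0.1, 1e-6));   // Prints ~3.0
         }
     }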
         

  • Stopping conditions:
    • The obvious stopping condition is to see whether \(f'(x)\) is close enough to zero.
    • However, this may not always work:
                \(\Rightarrow\) If \(\alpha\) is too small, the steps are so tiny that the gradient may stay above the threshold for an impractically long time.
    • Thus, it may help to also evaluate the actual proportional change in \(x\), that is, $$ \left| \frac{\mbox{prevX-x}}{\mbox{prevX}} \right| $$
 
Exercise 16: Download and execute GradientDemo.java.
  • How many iterations does it take to get close to the optimum?
  • What is the effect of using a small \(\alpha\) (e.g, \(\alpha=0.001\))?
  • In the method nextStep(), print out the current value of \(x\), and the value of \(\alpha f'(x)\) before the update.
  • Set \(\alpha=1\). Explain what you observe.
  • What happens when \(\alpha=10\)?
 

Picking the right stepsize:

  • Clearly, the performance of gradient descent is sensitive to the choice of the stepsize \(\alpha\).
    • If \(\alpha\) is too small
                \(\Rightarrow\) It can take too long to converge.
    • If \(\alpha\) is too large
                \(\Rightarrow\) It can diverge or oscillate, and never converge.

  • One rule of thumb: the term \(\alpha f'(x)\) should be an order-of-magnitude less than \(x\).
              \(\Rightarrow\) This way, the changes in \(x\) are small relative to \(x\).

  • Another idea: try different \(\alpha\) values in each iteration:
    • Start with a small \(\alpha\).
    • Gradually increase until it "does something bad".
    • What is "bad"?
                \(\Rightarrow\) Causes an increase in \(f(x)\) value.

    • This idea is called line-search
                \(\Rightarrow\) Because we're searching along the (single) dimension of \(\alpha\).

  • At first, it may seem that one could merely try increasing \(α\):
      Algorithm: gradientDescentLineSearch (a, b)
      Input: the range [a,b]
      1.    x = a                      
      2.    while |f'(x)| > ε
      3.        αtrial = αsmall                  // αsmall is the first value to try.
      4.        do                             // Search for the right alpha.
      5.            α = αtrial
      6.            αtrial = α + δα
       7.            f = f(x - α f'(x))
       8.            ftrial = f(x - αtrial f'(x))
      9.        while ftrial < f                // Keep increasing stepsize until you get an increase in f
      10.       x = x - α f'(x)  
      11.   endwhile
      12.   return x
             

  • However, this approach has two problems:
    • It is highly dependent on the choice of \(\delta_{\alpha}\).
    • What if the true best value is extremely small (smaller than \(\delta_{\alpha}\))?
                \(\Rightarrow\) We'd overlook it because it falls inside the very first interval \([\alpha_{small}, \; \alpha_{small}+\delta_{\alpha}]\).

  • The better way is to realize that we can use bracketing search, which automatically adjusts to the "scale" by shrinking intervals:
      Algorithm: gradientDescentLineSearch (a, b)
      Input: the range [a,b]
      1.    x = a                      
      2.    Pick αsmall                   // Left end of  α interval
      3.    Pick αbig                     // Right end of  α interval
      4.    while |f'(x)| > ε
                // Define the function g(α) = f(x - αf'(x)) in the interval [αsmall, αbig]
      5.        α* = bracketSearch (g, αsmall, αbig)
      6.        x = x - α* f'(x)  
      7.    endwhile
      8.    return x
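
The sketch below shows this structure in Java, with a one-dimensional bracket search applied to \(g(\alpha)\). Everything here, the test function, the \(\alpha\) interval, and the values of M and N, is an illustrative assumption rather than the course's implementation:

     import java.util.function.DoubleUnaryOperator;

     // Sketch of gradient descent with a bracketing line-search over alpha.
     public class LineSearchDemo {

         static double f(double x)      { return (x-3)*(x-3); }   // Hypothetical f.
         static double fprime(double x) { return 2*(x-3); }

         // Plain 1D bracket search: M intervals, N shrink steps.
         static double bracketSearch(DoubleUnaryOperator g, double a, double b, int M, int N) {
             double xStar = a;
             for (int i = 0; i < N; i++) {
                 double delta = (b - a) / M;
                 xStar = a;
                 double bestG = g.applyAsDouble(xStar);
                 for (double x = a; x <= b; x += delta) {
                     double gval = g.applyAsDouble(x);
                     if (gval < bestG) { bestG = gval; xStar = x; }
                 }
                 a = xStar - delta;  b = xStar + delta;            // Shrink around best.
             }
             return xStar;
         }

         public static void main(String[] args) {
             double x = 0, epsilon = 1e-6;
             double alphaSmall = 0, alphaBig = 1;
             while (Math.abs(fprime(x)) > epsilon) {
                 final double xCur = x;
                 // g(alpha) = f(x - alpha f'(x)) for the current x.
                 DoubleUnaryOperator g = alpha -> f(xCur - alpha * fprime(xCur));
                 double alphaStar = bracketSearch(g, alphaSmall, alphaBig, 10, 10);
                 x = x - alphaStar * fprime(x);
             }
             System.out.println("Minimum near x = " + x);          // ~3.0
         }
     }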
             
 

Let's now turn to an important concept that pervades all of optimization: local vs. global minima.
 

Exercise 17: Download GradientDemo2.java and examine the function being optimized.

  • Fill in the code for computing the derivative.
  • Try an initial value of \(x\) at \(1.8\). Does it converge?
  • Next, try an initial value of \(x\) at \(5.8\). What is the gradient at the point of convergence?
 

About local vs. global optima:

  • Just because the gradient is zero doesn't mean we've found the optimum.

  • A function can have several different local minima, as we've seen.

     
    Exercise 18: Can a function have several global minima?

  • The gradient \(f'(x)\) at a point \(x\) merely describes the behavior of the function \(f\) near that point.

  • A gradient-descent algorithm will find a local minimum
              \(\Rightarrow\) There's no guarantee that it'll find the global minimum.

  • In general, apart from searching the whole space of solutions, there's no method that guarantees finding a global minimum.

  • Thus, finding a local minimum is considered a high-enough standard for an algorithm.
 

What if we cannot compute the gradient?

  • For some functions, a formula for the gradient may be difficult to obtain.

  • However, it's easy to approximate the gradient:
    • Pick some small value \(s\).
    • Compute
                \(\Rightarrow\) \(f'_{approx}(x) = \frac{f(x+s) - f(x)}{s}\)

  • In pseudocode:
         Algorithm: gradientDescentApprox (a, b)
         Input: the range [a,b]
         1.    x = a                      
         2.    f'approx = 2ε
         3.    while |f'approx| > ε
         4.        f'approx = (f(x+s) - f(x)) / s       // s is an algorithm parameter.
         5.        x = x - α f'approx       
         6.    endwhile
         7.    return x
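
A Java sketch of the same idea, again for a hypothetical \(f(x)=(x-3)^2\); note that, for this \(f\), the answer comes out biased by about \(s/2\), precisely because \(s\) is not zero (a point we return to below):

     // Sketch of gradient descent with a forward-difference gradient.
     public class GradientDescentApprox {

         static double f(double x) { return (x-3)*(x-3); }   // Hypothetical f.

         static double minimize(double x, double alpha, double s, double epsilon) {
             double fpApprox = 2 * epsilon;               // So that we enter the loop.
             while (Math.abs(fpApprox) > epsilon) {
                 fpApprox = (f(x + s) - f(x)) / s;        // Forward difference.
                 x = x - alpha * fpApprox;
             }
             return x;
         }

         public static void main(String[] args) {
             System.out.println(minimize(0.0, 0.1, 0.01, 1e-4));   // ~2.995, not 3.
         }
     }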
         
 
Exercise 19: Add code to GradientDemo3.java to implement approximate gradients. Use \(s=0.01\). Explain what could go wrong if \(s\) is too large.
 

Theoretical issues:

  • The big questions:
              Does the gradient-descent algorithm always converge to a local minimum?
              Under what conditions does the algorithm converge?

  • Theoreticians usually describe such iterative algorithms using this notation: $$ x^{(n+1)} \eql x^{(n)} - \alpha f'(x^{(n)}) $$
    Here, \(x^{(n)}\) is the \(n\)-th iterate.

  • The convergence question:
    • Let \(S^*\) be the set of local minima.
    • Question 1: Does the sequence \(x^{(n)}\) have a limit?
    • Question 2: If so, is the limit in \(S^*\)?

  • It can be shown that:
    • If the function \(f\) is "smooth" (twice-differentiable),
    • and if \(\alpha\) is small enough,
    then \(x^{(n)} \to x^*\), where \(x^*\) is some point in \(S^*\).

  • In the case that we use approximate gradients, we need an additional condition:
    • Recall that our approximate gradient was computed as:
                \(\Rightarrow\) \(f'_{approx}(x) = \frac{f(x+s) - f(x)}{s}\)
    • We'll write the algorithm as
               while |f'approx| > ε
                   f'approx = (f(x+s) - f(x)) / s 
                   x = x - α f'approx       
                   n = n + 1
               endwhile
           
    • Because \(s\) stays fixed and nonzero, the approximate gradient is biased, so the iterates need not converge to the true minimum.
    • To solve this problem, we need to gradually decrease \(s\) as we iterate.
                \(\Rightarrow\) i.e., let \(s \to 0\) as \(n\) increases.
    • One way to do this:
               n = 1                              // Start at n = 1, so sn = s/n is well-defined.
               while |f'approx| > ε
                   sn = s / n
                   f'approx = (f(x+sn) - f(x)) / sn
                   x = x - α f'approx       
                   n = n + 1
               endwhile
           
 


10.6   Gradient descent in multiple dimensions

 

Recall the intuition behind the gradient-descent algorithm in one dimension:

  • First, what does gradient mean? Roughly, the change in \(f\) with respect to increasing \(x\): $$ f'(x) \; \approx \; \frac{f(x+s) - f(x)}{s} $$ (for small \(s\))

  • We use the gradient to take a "step" in the direction towards the minimum:
             x = x - α f'(x)
         
  • Here, the "step" has two meanings:
    • A direction
                \(\Rightarrow\) determined by the sign of \(f'(x)\).
    • A magnitude
                \(\Rightarrow\) determined by the magnitude of \(f'(x)\).

  • The same idea works in multiple dimensions, provided we define "gradient" correctly.
 

Gradients for multivariable functions:

  • Let's first understand what we need:
    • We want an iterative algorithm for two variables.
    • What would this look like?
               while not over
                  x1 = x1 - (some gradient)
                  x2 = x2 - (some gradient)
               endwhile
               

  • Example: consider the function \(f(x_1,x_2)=(x_1-4.71)^2 + (x_2-3.2)^2 + 2(x_1-4.71)^2(x_2-3.2)^2\).
              \(\Rightarrow\) This is a function of two variables, \(x_1\) and \(x_2\).

  • One option for defining the gradient would be $$ f'(x_1,x_2) \eql \frac{f(x_1+s,x_2+s) - f(x_1,x_2)}{s} $$ (for small \(s\))
    • This results in a single number.
    • This means that the iteration would look like
               while not over
                  x1 = x1 - α f'(x1,x2)
                  x2 = x2 - α f'(x1,x2)
               endwhile
               
    • Thus, the two variables would change by the same amount in the same direction.
     
    Exercise 20: Explain why this won't work. Think up a function for which it won't work.

  • Instead, we need to compute gradients independently for each of the two variables: $$\eqb{ f'_1 \eqt \frac{f(x_1+s,x_2) - f(x_1,x_2)}{s} \\ f'_2 \eqt \frac{f(x_1,x_2+s) - f(x_1,x_2)}{s} }$$ (for small \(s\))
    These kinds of derivatives are called partial derivatives.
              \(\Rightarrow\) There are two partial derivatives above, one for each variable.

  • In this case, our gradient descent algorithm looks like:
             while not over
                x1 = x1 - α f'1(x1,x2)
                x2 = x2 - α f'2(x1,x2)
             endwhile
             
 
Exercise 21: Compute by hand the partial derivatives of \(f(x_1,x_2)=(x_1-4.71)^2 + (x_2-3.2)^2 + 2(x_1-4.71)^2(x_2-3.2)^2\).
 
Exercise 22: Compute by hand the partial derivatives of $$ f(x_1,x_2) \eql \frac{x_1}{\mu_1-\lambda x_1} \; + \; \frac{x_2}{\mu_2-\lambda x_2} $$
 

A little more detail:

  • What about the stopping condition?
              \(\Rightarrow\) We should keep going as long as any one of the gradients is not close to zero.

  • Same stepsize for both variables?
    • In many problems, a single stepsize will suffice.
    • A line-search will likely result in different stepsizes.

  • At first glance, it may seem that the pseudocode could be written as:
         1.    while |f1'(x1,x2)| > ε or |f2'(x1,x2)| > ε
         2.        x1 = x1 - α f1'(x1,x2)
         3.        x2 = x2 - α f2'(x1,x2)
         4.    endwhile
         
    While mathematically more elegant, this creates a small programming issue, as we'll see.

  • Instead, let's use this pseudocode:
         Algorithm: twoVariableGradientDescent (a1, b1, a2, b2)
         Input: the ranges [a1, b1] and [a2, b2]
         1.    x1 = a1, x2 = a2
         2.    f1' = 2ε
         3.    f2' = 2ε
         4.    while |f1'| > ε or |f2'| > ε
         5.        f1' = f1'(x1,x2)
         6.        f2' = f2'(x1,x2)
         7.        x1 = x1 - α f1'
         8.        x2 = x2 - α f2'
         9.    endwhile
         10.   return x1,x2
         
     
    Exercise 23: What would go wrong if we interchanged lines 6 and 7 above?

  • Note: just like the unidimensional case, we could use approximate gradients.
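
Here is a self-contained Java sketch of the two-variable algorithm for a different, hypothetical function \(f(x_1,x_2) = (x_1-1)^2 + 2(x_2+2)^2\), whose partials are \(2(x_1-1)\) and \(4(x_2+2)\); note that both partials are evaluated before either variable is updated, exactly as in the pseudocode above:

     // Sketch of two-variable gradient descent with analytic partials
     // for the hypothetical f(x1,x2) = (x1-1)^2 + 2(x2+2)^2.
     public class TwoVarGradientDescent {

         static double df1(double x1, double x2) { return 2*(x1 - 1); }   // Partial wrt x1.
         static double df2(double x1, double x2) { return 4*(x2 + 2); }   // Partial wrt x2.

         public static void main(String[] args) {
             double x1 = 0, x2 = 0, alpha = 0.1, epsilon = 1e-6;
             double g1 = 2*epsilon, g2 = 2*epsilon;        // So that we enter the loop.
             while (Math.abs(g1) > epsilon || Math.abs(g2) > epsilon) {
                 g1 = df1(x1, x2);                         // Evaluate both partials first...
                 g2 = df2(x1, x2);
                 x1 = x1 - alpha * g1;                     // ...then update both variables.
                 x2 = x2 - alpha * g2;
             }
             System.out.println("Minimum near (" + x1 + ", " + x2 + ")");   // ~(1, -2)
         }
     }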
 
Exercise 24: Add code to MultiGradient.java to compute the partial derivatives of the function \(f(x_1,x_2)=(x_1-4.71)^2 + (x_2-3.2)^2 + 2(x_1-4.71)^2(x_2-3.2)^2\). Then execute to find the minimum. You might need to experiment with different values of \(\alpha\).
 


10.7   Stochastic gradient descent and simulation optimization

 

Let's start with an application:

  • Recall the routing problem with two queues:

    • An arriving customer chooses queue 0 with \(\probm{choose 0} = x\).
                \(\Rightarrow\) \(\probm{choose 1} = 1-x\).
    • In the simulation,
               if uniform() < x
                   choose 0
               else
                   choose 1
               

  • Thus, one can consider \(x\) to be a routing variable.

  • For a given routing probability \(x\), there will be some average system time.

    • Here, for a particular value of \(x\), we will get some estimate of the system time.
    • Let the random variable \(S(x)\) denote the system time when using routing value \(x\).
    The goal: find that value \(x\) that minimizes average system time \(E[S(x)]\).
 
Exercise 25: Examine QueueControl.java and find the part of the code that chooses the queue. Try running the simulation with different values of \(x\) to guess the minimum.
 

Let's try using gradient descent:

  • Recall the gradient-descent algorithm (for one variable):
             x = x - α f'(x)
          

  • How do we know what \(f'(x)\) is?

  • One option: approximate it using finite differences
         while not over
             fxs = estimate system time with x+s       // Run the simulation with x+s
             fx  = estimate system time with x         // Run the simulation with x
             f' = (fxs - fx) / s                       // Compute finite difference.
             x = x - α f'                              // Apply gradient descent.
         endwhile
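
Here is a hedged Java sketch of this loop, in which a noisy quadratic stands in for the queueing simulation (in QueueOpt.java the estimates come from actually running the simulation); the function and all parameter values are illustrative assumptions:

     import java.util.Random;

     // Sketch of gradient descent driven by noisy estimates.
     public class NoisyGradientDemo {

         static Random rand = new Random();

         // Stand-in for "estimate system time using routing value x":
         // true curve (x-0.4)^2 plus zero-mean noise shrinking with k samples.
         static double estimate(double x, int k) {
             return (x-0.4)*(x-0.4) + rand.nextGaussian() / Math.sqrt(k);
         }

         public static void main(String[] args) {
             double x = 0.9, alpha = 0.05, s = 0.05;
             int k = 1000;                                // Samples per estimate.
             for (int n = 1; n <= 200; n++) {             // Fixed iteration budget.
                 double fxs = estimate(x + s, k);         // Estimate at x+s.
                 double fx  = estimate(x, k);             // Estimate at x.
                 double fp  = (fxs - fx) / s;             // Noisy finite difference.
                 x = x - alpha * fp;
             }
             System.out.println("Final x = " + x);        // Hovers near 0.4, wandering with the noise.
         }
     }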
         
 
Exercise 26: Confirm that QueueOpt.java implements this approach. Run the algorithm - does it work? How many samples are used in each estimate?
 

The problem with noise:

  • Observe that we are not calculating the approximate-gradient but, rather, are estimating it.

  • For any finite number of samples, an estimate will be off by some error.

  • Thus, the algorithm can "wander" all over \(x\) space when using "noisy" estimates.

  • Fortunately, there are (interesting) ways around this problem.
 

Addressing the noise problem:

  • First, we'll identify how many samples are being used in each estimate.

    • Recall that, in the example, we ran the simulation for 1000 departures.
    • Obviously the more departures, the better the estimate.
    • Let \(S(k,x)\) = estimate obtained using \(k\) samples.

  • We will allow the number of such samples to vary by iteration.

  • Second, we'll allow the stepsize to vary by iteration.

  • Putting these ideas together, our iterative algorithm becomes
         while not over
             k = kn                        // k = #samples to be used in iteration n
             fxs = S(k, x+s)               // Run the simulation with x+s
             fx  = S(k, x)                 // Run the simulation with x
             f' = (fxs - fx) / s           // Compute finite difference.
             α = αn                        // Stepsize to be used in iteration n
             x = x - α f'                  // Apply gradient descent.
         endwhile
         

  • We could write this more mathematically as: $$ x^{(n+1)} \eql x^{(n)} - \alpha_n \frac{S(k_n,x^{(n)}+s) - S(k_n,x^{(n)})}{s} $$

  • What values should be used for \(k_n\) and \(\alpha_n\)?
              \(\Rightarrow\) There are two approaches: stepsize-control and sampling-control.
    We'll look at each of these next.
 

Stepsize-control:

  • The key idea in stepsize control is to let the stepsize decrease gradually.
    • Initially, we make possibly large noise-driven steps.
    • Later, as we get closer to the optimum, the stepsizes get small.

  • One caveat: if the stepsize gets small too quickly, we may converge to some point away from the optimum.

  • Consider stepsizes that meet these conditions:
    • \(\alpha_n \to 0\).
    • \(\sum_n \alpha_n \to \infty\).
    Thus, the stepsizes \(\alpha_n\) decrease, but not too quickly.
     
    Exercise 27: Give an example of such a sequence. (Hint: we saw one such example in Module 1). Also give an example of a sequence that decreases too quickly and violates the second condition.

  • What about the numbers of samples \(k_n\)?
              \(\Rightarrow\) We'll pick a fixed number, e.g., \(k_n=1000\).

  • Then, the algorithm can be written as follows:
      Algorithm: stepsizeControl 
           // Initialize x, other variables ... (not shown)
    
      1.   while not over
      2.       fxs = S(k, x+s)               // Run the simulation with x+s using k samples
      3.       fx  = S(k, x)                 // Run the simulation with x using k samples
      4.       f' = (fxs - fx) / s           // Compute finite difference.
      5.       α = αn                        // stepsize to be used in iteration n
      6.       x = x - α f'                  // Apply gradient descent.
      7.       n = n + 1
      8.  endwhile
      9.  return x
         

  • The stepsize-control approach is sometimes called the Robbins-Monro algorithm.
 
Exercise 28: Modify StepsizeControl.java to use decreasing stepsizes. Does it work?
 

Sampling-control:

  • The other approach is to increase the number of samples:
    • Initially, use fewer samples.
    • Later, as we approach the optimum, use more samples.

  • For example, \(k_n = n\)

  • In this case, the algorithm can be written as:
      Algorithm: samplingControl 
           // Initialize x, other variables ... (not shown)
    
      1.  while not over
      2.       k = kn                        // # samples for iteration n
      3.       fxs = S(k, x+s)               // Run the simulation with x+s using k samples
      4.       fx  = S(k, x)                 // Run the simulation with x using k samples
      5.       f' = (fxs - fx) / s           // Compute finite difference.
      6.       x = x - α f'                  // Note fixed stepsize.
      7.       n = n + 1
      8.  endwhile
      9.  return x
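
As a sketch, either schedule drops straight into the noisy loop from the earlier sketch; here \(\alpha_n = \alpha_0/n\) (which satisfies both stepsize conditions) and \(k_n = k_0 n\) are illustrative choices, shown together only for brevity; in practice one typically uses one control or the other:

     import java.util.Random;

     // Sketch: the noisy-gradient loop with iteration-dependent schedules.
     public class ScheduleDemo {

         static Random rand = new Random();

         static double estimate(double x, int k) {        // Noisy stand-in, as before.
             return (x-0.4)*(x-0.4) + rand.nextGaussian() / Math.sqrt(k);
         }

         public static void main(String[] args) {
             double x = 0.9, alpha0 = 0.5, s = 0.1;
             int k0 = 1000;
             for (int n = 1; n <= 200; n++) {
                 double alpha = alpha0 / n;               // Stepsize-control (Robbins-Monro).
                 int k = k0 * n;                          // Sampling-control: more samples later.
                 double fp = (estimate(x + s, k) - estimate(x, k)) / s;
                 x = x - alpha * fp;
             }
             System.out.println("Final x = " + x);        // Settles toward 0.4 (biased slightly by s/2).
         }
     }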
         
 

Finally, how do we know these methods worked for our routing example?

  • It turns out that one can analytically derive the system-time in terms of \(x\): $$ f(x) \eql \frac{x}{\mu_1-\lambda x} \; + \; \frac{1-x}{\mu_2-\lambda(1-x)} $$

  • This can even be minimized analytically (see earlier exercise).

  • However, let's use our gradient-descent algorithm.
 
Exercise 29: Add code to QueueGradientDescent.java to find the minimum. Does this correspond with what you found when using the stepsize-control algorithm?
 


10.8   Summary

 

We have taken an in-depth look at unconstrained optimization, focusing on the single-variable case:

  • Simple non-gradient methods: bracketing, golden-ratio search.
  • Gradient descent.
  • The difference between local and global minima.
  • Why gradient descent can get stuck in a local minimum.
  • Stochastic version of gradient descent.
  • Multivariable version of gradient descent.
 

Related topics we haven't covered in non-linear optimization:

  • Ways to speed up gradient-descent
              \(\Rightarrow\) Using 2nd derivatives (e.g., Newton-Raphson method).

  • Constrained optimization
              \(\Rightarrow\) A big topic in itself.

  • Function optimization
              \(\Rightarrow\) Another big topic in itself.



© 2008, Rahul Simha