STATISTICS, PROBABILITY, OPTIMIZATION
An explanation of Bayesian Optimization with statistical details
Published in ·
--
Optimizing a function is super important in many of the real life analytics use cases. By optimization we mean, either find an maximum or minimum of the target function with a certain set of parameter combination. Finding out that min or max value as well as the parameters should be the objective. In this article, we will discuss about basics of optimizing an unknown costly function with Bayesian approach.
Costly Blackbox Function Optimization — Calculus
From basic calculus we know that to find an max or min value of a function we need to solve the derivative equation for x like below:
Roots of the equation will be the optimal values of parameter x which in turn gives the optimal value of the function f(x). It is simple as long as you know the full algebraic form of the function f(x). But, in many real life scenarios, things are not so simple. If it is a black box, then you will only know output & input values of any f(x), but the full form will be unknown to you. So, analytically you cannot find the derivatives. On top of that, the function may be very costly to invoke. Consider two use cases like below:
1. Finding out the optimal hyperparameter combination of a neural network
Consider a large & complex Neural Network that solves a classification problem. Hyperparameters may be the number of hidden units, number of hidden layers, etc in that network. Relation between these can be thought as a hypothetical function that takes the hyperparameters as input and returns classification accuracy as output. Off course, you don’t know the actual algebraic form of this function . One way of finding the optimal combination will be trying out various random combinations by training the network repeatedly. The root of the problem lies there. We cannot afford that repeated training as it is a very large & complex network and training takes a lot of time & resources. Bayesian optimization can help here.
2. Excavation of an archeological site — finding optimal ‘digs’
Not only for software (like Neural Netowork case), Bayesian optimization also helps to overcome a challenge in physical world. In an archeological site, the major question comes into the mind of the experts : “where to dig ?”. And need not to be mentioned, digging is a costly & “blackbox” operation, all in terms of man-power, money & time. The function can be thought as if it returns the list of resources available in an site and parameters could be location details, and some other very domain specific stuff. Finding that location is the challenge and Bayesian approach can solve it.
In all of the cases, some initial runs of the function are needed for estimation. Consider following function:
It is “white box” (as derivatives can be found analytically) and does not look very costly at first glance and , but for out convenience, assume that it is costly & “black box” in nature and takes lot of time to return the output. Let’s run some simulations,
x = np.random.randn(5,2)
y = costly_function(x)
pd.DataFrame(data={'y':y, 'x0':x[:,0], 'x1':x[:,1]})
And the result,
The function takes two parameters x0 & x1. For costly functions like the above one, we can only afford to call that a few number of times. It may happen, that the initial function result & input may be given to us in a file or database. We then have to find out the optimal(minimum or maximum) value and the parameter combination responsible (x0 & x1, in this case) for that. We will never know the actual algebraic form, the analytical form of the derivatives and thus have to do the optimization numerically.
In Bayesian Optimization, an initial set of input/output combination is generally given as said above or may be generated from the function. For two use cases discussed above, it can be achieved like below:
- Neural Network is trained a number of times on different hyper-parameter combinations and the accuracies are captured & stored. This set can be used as initial data points.
- In an archeological excavation operation , several initial digs can be performed to gather information about the site. It works as initial data points.
In short, it is a constrained optimization which solves two problem as given below:
i) Finding out the optimal parameters that give optimal value of the black box function in a numerical way as analytically derivatives cannot be found.
ii) Keeping the number of function calls in the overall process as minimum as possible as it is very costly. (Apart from initial few runs)
Bayesian Optimization Nomenclatures
Bayesian approach is based on statistical modelling of the “blackbox” function and intelligent exploration of the parameter space. Few nomenclatures are important to know.
1. Surrogate Model
It is the statistical/probabilistic modelling of the “blackbox” function. It works as a proxy to the later. For experimenting with different parameters, this model is used to simulate function output instead of calling the actual costly function. A Gaussian Process Regression (it is a multivariate Gaussian Stochastic process) is used as “surrogate” in Bayesian Optimization.
2. Acquisition Function
It is a metric function which decides which parameter value that can return the optimal value from the function. There are many variations of it. We will work with the one “Expected Improvement”.
3. Exploration vs Exploitation
It is typical strategy to compensate between local & global optimal values in the parameter space. Consider the graph as given below:
While doing the parameter space exploration, many such local optimal data points can be found where the function has high or low value. But, the process should not stop there as more optimal values may be there in some other area. It is known as “Exploration”. On the other hand, importance should also be given to the points those are returning optimal (high or low) values from the function consistently. It is “Exploitation”. So, both have some significance. It is a trivial decision, “when to explore for more optimal data points in different locations or when to exploit & go in the same direction”. This is the area where Bayesian Optimization beats traditional Random search or Grid search approach for parameter space as it takes a middle ground. It helps to achieve the target more quickly with a small number of actual function calls. Other two methods go for blind searching that completely ignores this fact. Searching has to be very precise & “to the point” to reduce cost. Bayesian approach takes care of it pretty well.
In short, acquisition function uses “Exploration vs Exploitation” strategy to decide optimal parameter search in an iterative manner. Inside these iterations, surrogate model helps to get simulated output of the function. Any Bayesian Approach is based on the concept of “Prior/Posterior” duo. Initial runs of the function as mentioned in previous section are used as starting points or “Priors” and in each iteration, these “Priors” are enriched with “Posterior” data points. After few iterations, optimal data points are reached and the entire process stops there. We will see all of these in action next.
Step by Step implementation
We will use the same “costly_function” with two parameters declared in previous section as our target function to be optimized i.e., we need to maximize it at present case. It is assumed that the function is “black box” and output of few runs will be given to us. We will define a class “BayesianOptimizer” and declare its functions & attributes in a step by step manner.
Let us first create the structure of the class and declare its attributes.
“gauss_pr” is the surrogate model as mentioned in previous sections. “x_init” & “y_init” are as usual the parameters & outputs of the function from the initial set of runs. You will see usage of other attributes in a timely manner.
Now, we will create a function named “_get_expected_improvement” inside the same class which is the heart of Bayesian approach. It is the desired acquisition function mentioned earlier. Let’s discuss a few theory before jumping into this.
Whenever a new data point is tried, we need to compute a metric known as “Expected Improvement” or “EI” that gives the weightage of the data point. The formulae is given by:
where
This looks little complex at first. “mu(f(x))” is the prediction (i.e, the ‘y’ value) from the Gaussian process for a new data point x. “max{f(x)}” is the maximum of the predictions from the entire list of priors at current stage (again it is obtained from the Gaussian process). “sigma(f(x))” is as usual is the standard deviation of the prediction for the new data point x. The difference between “mu(f(x))” & “max{f(x)}” is just to check the improvement of the search process. A higher value indicates that the new data point is returning a high value from the function which is “significantly” higher than the maximum obtained so far. This “significance” is obtained by multiplying the difference with the cumulative probability density. So, it gives an expected or overall “mean” improvement and off course it is the “Exploitation” part. This value is also augmented by a magical factor “sigma(f(x))”. It gives the uncertainty to the “EI” metric and is the secret of handling the “Exploration” part. “mu(f(x))” and “sigma(f(x))” balance each others effect quite well and as a result EI(x) takes a middle ground.
Now, we will see the implementation of this function,
It avoids the actual function call and uses the Gaussian process as a proxy. So, the trial with different points happens through the proxy Gaussian Process, not the actual function. This is one of the secrets of “cost reduction”.
How this Gaussian process is built iteratively, we will see later.
Next, this acquisition function is used to compute the “EI” metrics in a neighborhood of randomly selected data points, numerically it is maximized with numerical derivative computation. Don’t get confused here !! We just again said “numerical derivatives” of the function. But, it is not the target costly function. It is about maximizing the acquisition function. Just think for a moment !! We will only be interested in that data point which gives maximum value of the “EI”, i.e., gives the maximum improvement of the target function (‘y’ value) from the current maximum one.
Off course, it will give the right direction where we should keep searching the parameter space and avoid unnecessary blind exploration . This is the second secret of “cost reduction”.
We will see the implementation next,
A batch (mentioned by “batch_size”) of randomly generated data points are tried and for each one, acquisition function is maximized (actually a negative function is minimized just to match the “scipy” library support. Minimizing negative of a function is equivalent to maximizing it) in that direction. Again, maximum of the “EI” from iterations is taken and returned with the corresponding parameter x value. Maximizing numerically with a starting point and again taking the maximum from the results actually gives right direction to parameter x with a right amount of mean & uncertainty in it. It is a trade-off between breadth-search (uncertainty) and depth-search(mean) and gets best effects from the the both.
Now, the next part is the actual work. You saw that there is a Gaussian process regressor as surrogate model. It is used to build a model iteratively on top of “Prior” data points. “optimize” function is the one which does all of this. It is shown below.
Steps of the above one can be summarized as follows:
i) Find out the maximum and corresponding parameter x from initial “prior” data points
ii) Build a Gaussian Process with the initial “prior” data points. Gaussian Process uses Maximum Likelihood Estimation to find a right joint distribution between parameters.
iii) Get the next best parameter x using acquisition function. As said in earlier sections, it is the step where trial & error with different data points happens with the help of the Gaussian model without calling the actual “costly” target function.
iv) Get the ‘y’ value for that parameter value x using the real target function. It is the “posterior” data point.
v) Update “Priors” with the “Posterior” data, update the current maximum value from the “priors” and then again go to step ii)
vi) At the end, return the current maximum “y” value and the corresponding parameter x.
You will notice that we are also doing some distance computation and capturing current best samples (nothing but current maximum y and the corresponding x). Distances are computed between two consecutive probable next x value from function “_get_next_probable_point”. It shows the progress of the algorithm. You will see that in results.
Results
We will use two sample parameters with 2-dimensions as initial data points. We can use n-dimension as the “costly_function” is generic enough to handle that.
sample_x = np.array([[8,1],[6.2,5.3]])
sample_y = costly_function(sample_x)
Now, it is time to use “BayesianOptimizer” class,
bopt = BayesianOptimizer(target_func=costly_function, x_init=sample_x, y_init=sample_y, n_iter=200, scale=10, batch_size=30)
bopt.optimize()
It triggers the searching with a neighborhood size of 30 and for 200 iterations. The result is shown below:
So, the maximum value of the costly function is 3.7 for parameters x0=1.92 and x1=3.62. You might have noticed that in the definition of costly function we have introduced a random noise just to make its derivatives difficult to compute.
Remember as said earlier many times, body or definition of “costly_function” will never be available to you. Here, for our understanding we just created one dummy function. But, in reality, that won’t be the case. sample x & y may be given to you as dataset or there will be some public/privately hosted “blackbox” API which can be invoked to get y values for any x.
Now, we will some other results from the computations,
Distances can be plotted,
pd.DataFrame(bopt.distances_).plot()
Notice, distances between two consecutive probable next x point is decreasing over the iterations. It is expected. Gaussian model becomes more mature after each iteration and its predictions become more perfect which results accurate “EI” values. Ultimately it reaches more closer to the optimal x after each iteration and thus the distance starts decreasing.
Now, we will see how ‘y’ values are changing over the iterations,
bopt.best_samples_['y'].plot()
Notice, y is reaching to its maximum in a step wise manner. Each iteration is producing a better estimate of maximum ‘y’.
Now, the same plot for computed ‘EI’
bopt.best_samples_['ei'].plot()
‘EI’ is decreasing as expected. Think about it !!. After each iteration, a better y is produced, so that means there is less chance for bigger improvement as we are progressing towards optimal.
Interesting Facts
One big question “How do you decide n_iter ?” Changing its value will definitely affect our estimation of maximum. One thing to be noted here,
n_iter is equal to the number of times the real costly function is called in the entire optimization process except initialization part
So, off course its value depends on how much cost are we ready to bear.
Consider use case 1 of Neural Network model. If there is a cloud environment which charges its customer per “training” run of the deep neural network and a budget constraint is there, then “n_iter” has to be set accordingly (as in this case target function is training of the network & getting the accuracy).
Bayesian approach tries to give an estimate of the function by reducing real calls, so its accuracy may not be as good as RandomSearch or GridSearch in some cases.
Bayesian Optimization is useful when cost is more important rather than very minute level accuracy.
Source code can be found here,