Training modes
Training a model with ApplicationDrivenLearning is done by calling the train! function. Its options include a mode that selects the algorithm used to train the model. In this tutorial, we show how to use the different modes and explore the trade-offs between them.
The data used in this tutorial is the same as in the getting started tutorial, so all the code on this page assumes that the full model (ApplicationDrivenLearning.Model) is already defined.
Bilevel mode
The bilevel mode builds the bilevel optimization problem associated with training the predictive model and uses the BilevelJuMP package to solve it. To use this mode, we have to install the BilevelJuMP package, import it, and set a BilevelJuMP solution mode (see the BilevelJuMP.jl docs).
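If these packages are not installed yet, they can be added with the standard Julia package manager (HiGHS is the MIP solver used in the example below):

import Pkg
Pkg.add(["BilevelJuMP", "HiGHS"])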
Arguments
- optimizer: JuMP.jl optimizer used to solve the bilevel model through BilevelJuMP.jl.
- mode: The mode to use for the bilevel optimization problem. It can be any of the modes supported by the BilevelJuMP package.
- silent: Controls whether the progress of the bilevel optimization problem is printed.
Example
julia> import BilevelJuMP
julia> import HiGHS
julia> opt = ApplicationDrivenLearning.Options(
           ApplicationDrivenLearning.BilevelMode,
           optimizer = HiGHS.Optimizer,
           mode = BilevelJuMP.FortunyAmatMcCarlMode(
               primal_big_M = 100,
               dual_big_M = 100,
           )
       );
julia> sol = ApplicationDrivenLearning.train!(model, X, Y, opt)
Running HiGHS 1.9.0 (git hash: 66f735e60): Copyright (c) 2024 HiGHS under MIT licence terms
Coefficient ranges:
Matrix [1e+00, 2e+04]
Cost [5e-01, 5e+00]
Bound [1e+00, 1e+00]
RHS [5e-01, 1e+02]
Assessing feasibility of MIP using primal feasibility and integrality tolerance of 1e-06
WARNING: Row 0 has infeasibility of 10 from [lower, value, upper] = [10; 0; 10]
Solution has num max sum
Col infeasibilities 0 0 0
Integer infeasibilities 0 0 0
Row infeasibilities 8 20 42
Row residuals 0 0 0
Attempting to find feasible solution by solving LP for user-supplied values of discrete variables
Coefficient ranges:
Matrix [1e+00, 2e+04]
Cost [5e-01, 5e+00]
Bound [0e+00, 0e+00]
RHS [5e-01, 1e+02]
Presolving model
12 rows, 7 cols, 24 nonzeros 0s
6 rows, 7 cols, 12 nonzeros 0s
4 rows, 5 cols, 8 nonzeros 0s
4 rows, 5 cols, 8 nonzeros 0s
Presolve : Reductions: rows 4(-32); columns 5(-28); elements 8(-80)
Solving the presolved LP
Using EKK dual simplex solver - serial
Iteration Objective Infeasibilities num(sum)
0 -3.1250000000e+02 Ph1: 4(5623); Du: 1(0.3125) 0s
3 3.0000000000e+02 Pr: 0(0) 0s
Solving the original LP from the solution after postsolve
Model status : Optimal
Simplex iterations: 3
Objective value : 3.0000000000e+02
Relative P-D gap : 0.0000000000e+00
HiGHS run time : 0.05
Presolving model
24 rows, 17 cols, 52 nonzeros 0s
17 rows, 14 cols, 40 nonzeros 0s
10 rows, 11 cols, 20 nonzeros 0s
4 rows, 5 cols, 8 nonzeros 0s
MIP start solution is feasible, objective value is 300
Solving MIP model with:
4 rows
5 cols (0 binary, 0 integer, 0 implied int., 5 continuous)
8 nonzeros
Src: B => Branching; C => Central rounding; F => Feasibility pump; H => Heuristic; L => Sub-MIP;
P => Empty MIP; R => Randomized rounding; S => Solve LP; T => Evaluate node; U => Unbounded;
z => Trivial zero; l => Trivial lower; u => Trivial upper; p => Trivial point
        Nodes      |   B&B Tree   |     Objective Bounds     | Dynamic Constraints |   Work
Src  Proc. InQueue | Leaves Expl. | BestBound  BestSol   Gap | Cuts  InLp  Confl.  | LpIters  Time

         0       0       0   0.00%   -inf       300    Large     0     0      0        0   0.0s
         1       0       1 100.00%   300        300    0.00%     0     0      0        4   0.1s
Solving report
Status Optimal
Primal bound 300
Dual bound 300
Gap 0% (tolerance: 0.01%)
P-D integral 0
Solution status feasible
300 (objective)
0 (bound viol.)
0 (int. viol.)
0 (row viol.)
Timing 0.11 (total)
0.00 (presolve)
0.00 (solve)
0.00 (postsolve)
Max sub-MIP depth 0
Nodes 1
Repair LPs 0 (0 feasible; 0 iterations)
LP iterations 4 (total)
0 (strong br.)
0 (separation)
0 (heuristics)
ApplicationDrivenLearning.Solution(300.0, Real[20.0f0])
The output shows the solution found by the optimizer. The first value is the cost of the solution and the second is the vector of predictive model parameters.
Pros
- Is the most accurate option as it represents the bilevel optimization problem exactly.
Cons
- Requires solving a MIP problem whose size grows very quickly with the number of data points.
- Relies on parameters specific to the BilevelJuMP package.
- Does not support integer variables in the plan and assess models.
Nelder-Mead mode
Uses the Nelder-Mead algorithm implemented in the Optim.jl package. This is a gradient-free optimization method: it does not require the gradient of the objective function and is robust to the choice of the initial guess.
Arguments
- initial_simplex: A vector of n + 1 vectors, each of dimension n, where n is the number of predictive model parameters. It is used to start the search algorithm.
- parameters: Used to generate parameters for the algorithm. Same parameter as in the Optim implementation of Nelder-Mead.
- Any other parameter accepted by Optim.Options, such as iterations, time_limit and g_tol, can be passed directly (combined usage is sketched after the example below).
Example
julia> opt = ApplicationDrivenLearning.Options(
           ApplicationDrivenLearning.NelderMeadMode,
           show_trace = true
       );
julia> sol = ApplicationDrivenLearning.train!(model, X, Y, opt)
Iter Function value √(Σ(yᵢ-ȳ)²)/n
------ -------------- --------------
0 1.473289e+03 5.201782e+00
* time: 0.0009999275207519531
1 1.473289e+03 1.560535e+01
* time: 0.0019998550415039062
2 1.442079e+03 4.681610e+01
* time: 0.003000020980834961
3 1.348446e+03 1.404482e+02
* time: 0.004999876022338867
4 1.067550e+03 2.962982e+02
* time: 0.006000041961669922
5 4.749537e+02 3.450557e+01
* time: 0.007999897003173828
6 4.059426e+02 3.511205e+01
* time: 0.009999990463256836
7 3.357185e+02 6.064606e-01
* time: 0.013000011444091797
8 3.345056e+02 8.778015e+00
* time: 0.018999814987182617
9 3.169496e+02 8.171539e+00
* time: 0.020999908447265625
10 3.006065e+02 1.588013e+00
* time: 0.021999835968017578
11 3.006065e+02 5.784607e-02
* time: 0.023999929428100586
12 3.004908e+02 1.371613e-01
* time: 0.026000022888183594
13 3.002165e+02 7.929993e-02
* time: 0.029999971389770508
14 3.000579e+02 2.355957e-02
* time: 0.03399991989135742
15 3.000107e+02 2.166748e-03
* time: 0.03600001335144043
16 3.000064e+02 2.151543e-03
* time: 0.03699994087219238
17 3.000021e+02 5.342755e-04
* time: 0.03900003433227539
18 3.000010e+02 4.882812e-04
* time: 0.039999961853027344
19 3.000001e+02 9.155273e-05
* time: 0.04199981689453125
20 3.000001e+02 3.051758e-05
* time: 0.046000003814697266
21 3.000000e+02 0.000000e+00
* time: 0.04999995231628418
ApplicationDrivenLearning.Solution(300.0f0, Real[20.0f0])
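The example above only enables the trace. As noted in the arguments list, an initial simplex and Optim.Options keywords such as iterations, time_limit and g_tol can be passed in the same way. The sketch below combines them; the keyword names come from the list above, the values are illustrative rather than a tested configuration, and the one-dimensional simplex assumes the single-parameter predictive model of the ongoing example.

opt = ApplicationDrivenLearning.Options(
    ApplicationDrivenLearning.NelderMeadMode,
    initial_simplex = [[15.0], [25.0]],  # n + 1 = 2 vertices for n = 1 parameter
    iterations = 500,                    # forwarded to Optim.Options
    time_limit = 60.0,                   # forwarded to Optim.Options
    g_tol = 1e-6,                        # forwarded to Optim.Options
    show_trace = false,
)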
Pros
- Is the most general option, as it can be used with essentially any plan, assess and forecast model.
- Does not require the gradient of the objective function.
Cons
- Typically needs many iterations to accurately optimize the model parameters.
- Depends on the initial guess, which can be supplied by the user or generated randomly by the Optim.jl package.
Gradient mode
By computing the gradient of the assessed cost with respect to the forecast values, this mode propagates the gradient to the predictive model parameters, guiding the parameter update process. Since it uses the model structure end-to-end to guide training, it typically requires fewer iterations to achieve good results. However, it may suffer from known issues of gradient-based optimization methods, such as sensitivity to learning rate selection.
The gradient algorithm runs for a specified number of iterations, keeping the best parameter values found, until the iteration budget is exhausted or convergence is reached.
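To make the idea concrete, the toy sketch below mimics a single gradient step outside the package: the assessed cost is differentiated with respect to the forecasts and the gradient is chained back to the predictive model parameters via Flux. This is a conceptual illustration only, not the package's internal implementation; the model, cost function and data are made-up stand-ins.

using Flux

# Toy stand-ins (not part of ApplicationDrivenLearning): a one-parameter linear
# predictive model and a placeholder "assessed cost", played here by a squared error.
predictive_model = Flux.Dense(1 => 1; bias = false)
assessed_cost(yhat, y) = sum(abs2, yhat .- y)

Xtoy = rand(Float32, 1, 8)   # 8 samples, 1 feature
Ytoy = 2f0 .* Xtoy           # toy targets

state = Flux.setup(Flux.Descent(0.01), predictive_model)

# One gradient step: differentiate the assessed cost with respect to the
# forecasts and propagate the gradient back to the model parameters.
grads = Flux.gradient(predictive_model) do m
    yhat = m(Xtoy)
    assessed_cost(yhat, Ytoy)
end
Flux.update!(state, predictive_model, grads[1])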
Arguments
- rule: The optimizer for the gradient algorithm. Has to be an instance of an optimization rule from Flux.jl.
- epochs: The number of iterations to run the gradient algorithm.
- batch_size: The batch size to use for the gradient algorithm. If -1, the entire dataset is used.
- verbose: Whether to print the progress of the gradient algorithm.
- compute_cost_every: Allows the cost computation over every sample to be run only every given number of epochs. This enables faster iterations, with the drawback of possibly missing sets of parameters with low associated cost.
- time_limit: The time limit for the gradient algorithm in seconds.
- g_tol: Convergence condition on the infinity norm of the gradient vector (a sketch combining these arguments appears after the example below).

Below, we illustrate the use of GradientMode to optimize the predictive model used in the ongoing example.
Example
julia> opt = ApplicationDrivenLearning.Options(
           ApplicationDrivenLearning.GradientMode,
           rule = Flux.Descent(0.01),
       );
julia> sol = ApplicationDrivenLearning.train!(model, X, Y, opt)
Epoch 1 | Time = 0.0s | Cost = 1411.12
Epoch 2 | Time = 0.0s | Cost = 1330.12
Epoch 3 | Time = 0.1s | Cost = 1249.12
Epoch 4 | Time = 0.1s | Cost = 1168.12
Epoch 5 | Time = 0.1s | Cost = 1087.12
Epoch 6 | Time = 0.1s | Cost = 1006.12
Epoch 7 | Time = 0.2s | Cost = 925.12
Epoch 8 | Time = 0.2s | Cost = 844.12
Epoch 9 | Time = 0.2s | Cost = 763.12
Epoch 10 | Time = 0.2s | Cost = 682.12
Epoch 11 | Time = 0.2s | Cost = 601.12
Epoch 12 | Time = 0.3s | Cost = 573.37
Epoch 13 | Time = 0.3s | Cost = 564.37
Epoch 14 | Time = 0.3s | Cost = 555.37
Epoch 15 | Time = 0.3s | Cost = 546.37
Epoch 16 | Time = 0.4s | Cost = 537.37
Epoch 17 | Time = 0.4s | Cost = 528.37
Epoch 18 | Time = 0.4s | Cost = 519.37
Epoch 19 | Time = 0.4s | Cost = 510.37
Epoch 20 | Time = 0.5s | Cost = 501.37
Epoch 21 | Time = 0.5s | Cost = 492.37
Epoch 22 | Time = 0.5s | Cost = 483.37
Epoch 23 | Time = 0.5s | Cost = 474.37
Epoch 24 | Time = 0.5s | Cost = 465.37
Epoch 25 | Time = 0.6s | Cost = 456.37
Epoch 26 | Time = 0.6s | Cost = 447.37
Epoch 27 | Time = 0.6s | Cost = 438.37
Epoch 28 | Time = 0.7s | Cost = 429.37
Epoch 29 | Time = 0.7s | Cost = 420.37
Epoch 30 | Time = 0.7s | Cost = 411.37
Epoch 31 | Time = 0.7s | Cost = 402.37
Epoch 32 | Time = 0.7s | Cost = 393.37
Epoch 33 | Time = 0.8s | Cost = 384.37
Epoch 34 | Time = 0.8s | Cost = 375.37
Epoch 35 | Time = 0.8s | Cost = 366.37
Epoch 36 | Time = 0.8s | Cost = 357.37
Epoch 37 | Time = 0.9s | Cost = 348.37
Epoch 38 | Time = 0.9s | Cost = 339.37
Epoch 39 | Time = 0.9s | Cost = 330.37
Epoch 40 | Time = 0.9s | Cost = 321.37
Epoch 41 | Time = 1.0s | Cost = 312.38
Epoch 42 | Time = 1.0s | Cost = 303.38
Epoch 43 | Time = 1.0s | Cost = 305.62
Epoch 44 | Time = 1.0s | Cost = 303.38
Epoch 45 | Time = 1.1s | Cost = 305.62
Epoch 46 | Time = 1.1s | Cost = 303.38
Epoch 47 | Time = 1.1s | Cost = 305.62
Epoch 48 | Time = 1.1s | Cost = 303.38
Epoch 49 | Time = 1.2s | Cost = 305.62
Epoch 50 | Time = 1.2s | Cost = 303.38
Epoch 51 | Time = 1.2s | Cost = 305.62
Epoch 52 | Time = 1.2s | Cost = 303.38
Epoch 53 | Time = 1.3s | Cost = 305.62
Epoch 54 | Time = 1.3s | Cost = 303.38
Epoch 55 | Time = 1.3s | Cost = 305.62
Epoch 56 | Time = 1.3s | Cost = 303.38
Epoch 57 | Time = 1.4s | Cost = 305.62
Epoch 58 | Time = 1.4s | Cost = 303.38
Epoch 59 | Time = 1.4s | Cost = 305.62
Epoch 60 | Time = 1.4s | Cost = 303.38
Epoch 61 | Time = 1.5s | Cost = 305.62
Epoch 62 | Time = 1.5s | Cost = 303.38
Epoch 63 | Time = 1.5s | Cost = 305.62
Epoch 64 | Time = 1.5s | Cost = 303.38
Epoch 65 | Time = 1.6s | Cost = 305.62
Epoch 66 | Time = 1.6s | Cost = 303.38
Epoch 67 | Time = 1.6s | Cost = 305.62
Epoch 68 | Time = 1.6s | Cost = 303.38
Epoch 69 | Time = 1.7s | Cost = 305.62
Epoch 70 | Time = 1.7s | Cost = 303.38
Epoch 71 | Time = 1.7s | Cost = 305.62
Epoch 72 | Time = 1.7s | Cost = 303.38
Epoch 73 | Time = 1.8s | Cost = 305.62
Epoch 74 | Time = 1.8s | Cost = 303.38
Epoch 75 | Time = 1.8s | Cost = 305.62
Epoch 76 | Time = 1.8s | Cost = 303.38
Epoch 77 | Time = 1.9s | Cost = 305.62
Epoch 78 | Time = 1.9s | Cost = 303.38
Epoch 79 | Time = 1.9s | Cost = 305.62
Epoch 80 | Time = 1.9s | Cost = 303.38
Epoch 81 | Time = 2.0s | Cost = 305.62
Epoch 82 | Time = 2.0s | Cost = 303.38
Epoch 83 | Time = 2.0s | Cost = 305.62
Epoch 84 | Time = 2.0s | Cost = 303.38
Epoch 85 | Time = 2.1s | Cost = 305.62
Epoch 86 | Time = 2.1s | Cost = 303.38
Epoch 87 | Time = 2.1s | Cost = 305.62
Epoch 88 | Time = 2.1s | Cost = 303.38
Epoch 89 | Time = 2.2s | Cost = 305.62
Epoch 90 | Time = 2.2s | Cost = 303.38
Epoch 91 | Time = 2.2s | Cost = 305.62
Epoch 92 | Time = 2.2s | Cost = 303.38
Epoch 93 | Time = 2.3s | Cost = 305.62
Epoch 94 | Time = 2.3s | Cost = 303.38
Epoch 95 | Time = 2.3s | Cost = 305.62
Epoch 96 | Time = 2.3s | Cost = 303.38
Epoch 97 | Time = 2.4s | Cost = 305.62
Epoch 98 | Time = 2.4s | Cost = 303.38
Epoch 99 | Time = 2.4s | Cost = 305.62
Epoch 100 | Time = 2.4s | Cost = 303.38
ApplicationDrivenLearning.Solution(303.3750343322754, Real[19.887499f0])
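The example above only sets rule. The other arguments listed earlier can be combined in the same call; the sketch below, with illustrative values rather than a tested configuration, shows what a fuller options object might look like.

opt = ApplicationDrivenLearning.Options(
    ApplicationDrivenLearning.GradientMode,
    rule = Flux.Adam(0.01),       # any Flux.jl optimization rule
    epochs = 200,
    batch_size = -1,              # -1 uses the entire dataset
    verbose = true,
    compute_cost_every = 10,      # evaluate the full cost only every 10 epochs
    time_limit = 60,              # seconds
    g_tol = 1e-6,
)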
Pros
- Tends to require fewer iterations to achieve good results.
- Can use different optimizers from Flux.jl and strategies such as stochastic gradient descent with small batches.
Cons
- Relies on the gradient of the objective function, which can be costly to compute, and is sensitive to the learning rate and to the initial guess.
- Does not support integer variables in the plan and assess models.