Execution Time Estimator for Computational Processes
Background
It is straightforward to mathematically estimate the execution time for a computational process. However, when such an estimate is applied in the real world, it may often end up being almost useless. This is typically because:
The effects of constant terms are neglected at short timescales.
Processing-system related issues are ignored at long timescales.
For this reason, the current system for execution-time estimation is based on:
The specification of a computational order, combined with a set of empirical measurements of the process behavior.
In combination, these are then used to construct an estimator function, which is applied to predict execution-time behavior.
Each measurement consists of:
A single scalar reference value
N_ref, combined with the associated wall-clock execution time for the process step under considerationdt.
An a priori computational order (selected from O_0, O_N, O_LOGN, O_N2, …) is used to convert this N_ref into a value N, which is then used to construct the estimator function as dt(N). This construction uses a spline-fitting technique, with a constrained maximum degree. Such an estimator will produce highly accurate values for N_ref falling within the interval of the measurements used in its construction, but somewhat less accurate values for N_ref falling outside of that interval.
Design objectives
Use an efficient technique to obtain empirical timing measurements from relevant subsections of the running code. This technique should:
Be as flexible and non-intrusive as possible
Easy for a developer to apply initially
Easy to modify where it is applied.
Provide automatic identification of which section of code a measurement comes from.
Provide automatic detection of changes to the
N_reffunction used to produce the estimator. This allows the user to be notified when the existing set of empirical measurements is no longer valid, and should be discarded.Note that detecting changes to the selected order is not a requirement, as the estimator can easily be reconstructed from existing measurements when the order is modified.
The technique must work with:
Methods in a
Serviceclass.Methods in a
Recipeclass.Any subsections of code within those methods.
Both a decorator form and context-manager form of the WallClockTime profiler should be provided.
The decorator must decorate either a class or a function (in the discussion,
RecipeandService.methodare used as examples).When a class is decorated, the method to profile must be specified explicitly. For example
Recipe.cookis usually specified when aRecipeclass is decorated.For context-manager application, the local stack frame determines the calling method automatically, but this is insufficient to identify the precise location in the code. This creates the additional requirement to specify an explicit step name in this case.
In all cases, an explicit step name (optional for decorator application) may be specified, and in general doing this will provide additional clarity to a developer attempting to read and understand the recorded measurement values.
Time estimates should be available whenever required for any running process step. The estimated completion time for each step should be logged.
Process steps may be arbitrarily nested. Logging from process substeps should be abbreviated.
Customized display of in-progress process steps should be facilitated by deriving from the
Logging.Handlerclass. For example, in this manner the existing logging updates might be used to control a progress bar. Any required numerical information for such use should be entered into theLogRecord.extradictso that it can be accessed by such a derivedHandler.
Implementation details
The primary class implementing both timing measurement and execution-time estimation is ProgressRecorder.
Applied by calling
record()at the start of an execution step, andstop()at the end.Exposed via its decorator / context-manager
WallClockTime. This is what should normally be used;record()andstop()should not be called directly.
Timing data from ProgressRecorder is persistent:
Stored at:
${user.application.data.home}/workflows_data/timingRetained timing data is:
limited by the number of retained files
limited by the size of each file.
Progress-recording steps are automatically named using details from the scope of the decorated function / class, or from the local scope where the context manager is applied. When used as a context manager, an explicit step name must be provided (but this is optional when used as a decorator).
Explicit and implicit step-name information is combined to generate a step key to identify the progress-recording step.
Steps must be unique under this keying system, but otherwise they may be arbitrarily nested.
A stack keeps track of which steps are currently being recorded (“active steps”). Steps that become active while others are also active are called “substeps”.
Logging and estimation
For each application, either as a decorator or a context manager, a suitable N_ref function must be specified. At the start of each execution step, the current estimator instance is used to predict execution time. In order to compute this prediction, the current value of N_ref is required. Often, special cases occur during the calculation of N_ref which indicate that no estimate can be calculated. For example, when running the reduction workflow in live_data mode, no time estimate is presently possible.
In such cases, the
N_reffunction should returnNone, and the logging for the step will display simply no data available.
For a step that is not a substep, an automatic logging chain is initiated:
This chain reports progress of execution at regular intervals, until either the step completes, or the step exceeds its time estimate.
Logging from substeps is much reduced (no automatic logging chain starts).
Each generated LogRecord includes details (step key, time estimate, remaining time) in its extra dict that will be useful for an ergonomic display. For example, a class might be derived from Logging.Handler, and then connected to a GUI progress bar.
Updating the estimator
At successful completion of a step:
The current measurement is recorded.
If an exception occurs, the step is marked as completed (and therefore inactive), but no measurement is recorded.
The current estimate is compared to the actual execution time.
If the disparity is above a certain threshold, the estimator updates.
The estimator function updates only if enough measurements are available (usually this will require at least three).
Preferably, measurements should be at distinct
N_refvalues (although this is not strictly required).Over time, estimator quality improves rapidly.
Known issues and what to expect when testing
Only primary workflows and a few key recipes have been decorated so far. It would be possible to decorate all recipes automatically, but this has not been attempted due to the discrepancy between the derivation of
RecipeandGenericRecipe.Progress logging is only activated for selected target steps. This means that obtaining as much useful profiling data as possible does not necessarily increase the logging output.
To facilitate testing, a new IPC-based logging feature is provided which allows logging output to be viewed in separate terminal windows:
IPC-handler names are specified in
application.yml.Each handler has a list of logger names that are associated with it.
Example: the
ProgressRecorderlogger associated with theSNAPRed-progressIPC handler.
For security reasons, IPC-based logging uses Unix-domain sockets (
UDS), and not localhost.
At present, the provided logging output for progress recording is not particularly ergonomic:
When displayed in a separate terminal window, the current output might be adequate for some use cases (e.g
SNAPWrap).A better method to display this data is almost certainly needed for both:
SNAPRedGUI panelSNAPRedbackend viaSNAPWrap.
For testing, it may help if you already have a diffraction calibration and a normalization calibration for the run numbers you intend to use. Due to the required N_ref treatment of special-cases mentioned above:
At present, profiling only covers the case where both a diffraction calibration and a normalization calibration are available.
To test - Dev testing
Remove any data files at
${user.application.data.home}/workflows_data/timing.These may have been auto-generated.
It’s OK to simply delete these JSON files, whenever required.
Start
SNAPRed(either asenv=dev python -m snapred, or frommantid_workbench).Make a note of the process id (PID) using
ps -u USER | grep python. Note here thatmantid_workbenchmay have launched multiple python processes; use the first one.In a separate terminal (activate the
SNAPRedconda environment), start an IPC-server for theSNAPRed-progresshandler:python tests/cis_tests/util/logging/IPC_server.py -n SNAPRed-progress -p ${PID}
Run any workflow, e.g., reduction. Start by reducing data for one run.
Exit. In an IPC-logging terminal, type CNTL-C to exit. Look at the JSON in
${user.application.data.home}/workflows_data/timing. Details ofdtand estimateddt_estshould have been recorded for several service steps and possibly from various recipes, depending on which workflow was executed. Note that the default estimator is linear: 1GB in 3.0s – this default instance should still be current after running only one workflow. Check the automatically generated step keys.Repeat steps 1-3 to start a new process with an IPC-logger.
Run the reduction workflow for several distinct runs (e.g., 58810, 58812, 58813, 59039).
Somewhere in the middle of step 8, note that the time estimates indicated in the log will become much more accurate.
Exit as before. Inspect the latest JSON files – estimates should reflect improved accuracy, and you can see that the applied spline degree has increased.
Notes on estimator accuracy
The default linear estimator is not very accurate. It will update automatically when enough timing measurements are available.
The effect of this update can be seen in the JSON data at
${user.application.data.home}/workflows_data/timing.If distinct
N_refvalues are not used, the estimator may become accurate for those run numbers only; extrapolation to other run numbers will be less accurate.Regardless, automatic updating continues as needed.
Given enough distinct
N_refvalues, the estimator will quickly become accurate, even when extrapolation is required.
Definition of the N_ref function
Special cases in workflows complicate the definition of the N_ref function. For example, in the normalization workflow, the distinction between cylindrical / spherical samples has not yet been treated.
The system presently identifies but does not attempt to estimate such special cases.
When defining an
N_refmethod, it’s important to returnNoneif a case is identified that should not be estimated. This ensures that any recorded measurement can be effectively used to create a useful time estimate for the workflow.If every single measurement were recorded, certainly the estimate would be able to predict the average execution time for any workflow, but that average might be over cases including so many special parameter combinations that it would not be useful to the end user.
The ideal way to fix this issue of special casing the N_ref, in order to allow these estimators to function more broadly, is probably to move such special cases out of the service layer and up to workflow level. Alternatively, the step keying system could be made more elaborate.