Execution Time Estimator for Computational Processes

Background

It is straightforward to mathematically estimate the execution time for a computational process. However, when such an estimate is applied in the real world, it may often end up being almost useless. This is typically because:

The effects of constant terms are neglected at short timescales.
Processing-system related issues are ignored at long timescales.

For this reason, the current system for execution-time estimation is based on:

The specification of a computational order, combined with a set of empirical measurements of the process behavior.

In combination, these are then used to construct an estimator function, which is applied to predict execution-time behavior.

Each measurement consists of:

A single scalar reference value N_ref, combined with the associated wall-clock execution time for the process step under consideration dt.

An a priori computational order (selected from O_0, O_N, O_LOGN, O_N2, …) is used to convert this N_ref into a value N, which is then used to construct the estimator function as dt(N). This construction uses a spline-fitting technique, with a constrained maximum degree. Such an estimator will produce highly accurate values for N_ref falling within the interval of the measurements used in its construction, but somewhat less accurate values for N_ref falling outside of that interval.

Design objectives

Use an efficient technique to obtain empirical timing measurements from relevant subsections of the running code. This technique should:

Be as flexible and non-intrusive as possible
- Easy for a developer to apply initially
- Easy to modify where it is applied.
Provide automatic identification of which section of code a measurement comes from.
Provide automatic detection of changes to the N_ref function used to produce the estimator. This allows the user to be notified when the existing set of empirical measurements is no longer valid, and should be discarded.
- Note that detecting changes to the selected order is not a requirement, as the estimator can easily be reconstructed from existing measurements when the order is modified.

The technique must work with:

Methods in a Service class.
Methods in a Recipe class.
Any subsections of code within those methods.

Both a decorator form and context-manager form of the WallClockTime profiler should be provided.

The decorator must decorate either a class or a function (in the discussion, Recipe and Service.method are used as examples).
When a class is decorated, the method to profile must be specified explicitly. For example Recipe.cook is usually specified when a Recipe class is decorated.
For context-manager application, the local stack frame determines the calling method automatically, but this is insufficient to identify the precise location in the code. This creates the additional requirement to specify an explicit step name in this case.
In all cases, an explicit step name (optional for decorator application) may be specified, and in general doing this will provide additional clarity to a developer attempting to read and understand the recorded measurement values.

Time estimates should be available whenever required for any running process step. The estimated completion time for each step should be logged.

Process steps may be arbitrarily nested. Logging from process substeps should be abbreviated.
Customized display of in-progress process steps should be facilitated by deriving from the Logging.Handler class. For example, in this manner the existing logging updates might be used to control a progress bar. Any required numerical information for such use should be entered into the LogRecord.extra dict so that it can be accessed by such a derived Handler.

Implementation details

The primary class implementing both timing measurement and execution-time estimation is ProgressRecorder.

Applied by calling record() at the start of an execution step, and stop() at the end.
Exposed via its decorator / context-manager WallClockTime. This is what should normally be used; record() and stop() should not be called directly.

Timing data from ProgressRecorder is persistent:

Stored at: ${user.application.data.home}/workflows_data/timing
Retained timing data is:
- limited by the number of retained files
- limited by the size of each file.

Progress-recording steps are automatically named using details from the scope of the decorated function / class, or from the local scope where the context manager is applied. When used as a context manager, an explicit step name must be provided (but this is optional when used as a decorator).

Explicit and implicit step-name information is combined to generate a step key to identify the progress-recording step.
Steps must be unique under this keying system, but otherwise they may be arbitrarily nested.
A stack keeps track of which steps are currently being recorded (“active steps”). Steps that become active while others are also active are called “substeps”.

Logging and estimation

For each application, either as a decorator or a context manager, a suitable N_ref function must be specified. At the start of each execution step, the current estimator instance is used to predict execution time. In order to compute this prediction, the current value of N_ref is required. Often, special cases occur during the calculation of N_ref which indicate that no estimate can be calculated. For example, when running the reduction workflow in live_data mode, no time estimate is presently possible.

In such cases, the N_ref function should return None, and the logging for the step will display simply no data available.

For a step that is not a substep, an automatic logging chain is initiated:

This chain reports progress of execution at regular intervals, until either the step completes, or the step exceeds its time estimate.
Logging from substeps is much reduced (no automatic logging chain starts).

Each generated LogRecord includes details (step key, time estimate, remaining time) in its extra dict that will be useful for an ergonomic display. For example, a class might be derived from Logging.Handler, and then connected to a GUI progress bar.

Updating the estimator

At successful completion of a step:

The current measurement is recorded.
- If an exception occurs, the step is marked as completed (and therefore inactive), but no measurement is recorded.
The current estimate is compared to the actual execution time.
- If the disparity is above a certain threshold, the estimator updates.
The estimator function updates only if enough measurements are available (usually this will require at least three).
Preferably, measurements should be at distinct N_ref values (although this is not strictly required).
Over time, estimator quality improves rapidly.

Known issues and what to expect when testing

Only primary workflows and a few key recipes have been decorated so far. It would be possible to decorate all recipes automatically, but this has not been attempted due to the discrepancy between the derivation of Recipe and GenericRecipe.
Progress logging is only activated for selected target steps. This means that obtaining as much useful profiling data as possible does not necessarily increase the logging output.

To facilitate testing, a new IPC-based logging feature is provided which allows logging output to be viewed in separate terminal windows:

IPC-handler names are specified in application.yml.
Each handler has a list of logger names that are associated with it.
- Example: the ProgressRecorder logger associated with the SNAPRed-progress IPC handler.
For security reasons, IPC-based logging uses Unix-domain sockets (UDS), and not localhost.

At present, the provided logging output for progress recording is not particularly ergonomic:

When displayed in a separate terminal window, the current output might be adequate for some use cases (e.g SNAPWrap).
A better method to display this data is almost certainly needed for both:
- SNAPRed GUI panel
- SNAPRed backend via SNAPWrap.

For testing, it may help if you already have a diffraction calibration and a normalization calibration for the run numbers you intend to use. Due to the required N_ref treatment of special-cases mentioned above:

At present, profiling only covers the case where both a diffraction calibration and a normalization calibration are available.

To test - Dev testing

Remove any data files at ${user.application.data.home}/workflows_data/timing.
- These may have been auto-generated.
- It’s OK to simply delete these JSON files, whenever required.
Start SNAPRed (either as env=dev python -m snapred, or from mantid_workbench).
Make a note of the process id (PID) using ps -u USER | grep python. Note here that mantid_workbench may have launched multiple python processes; use the first one.
In a separate terminal (activate the SNAPRed conda environment), start an IPC-server for the SNAPRed-progress handler:
- python tests/cis_tests/util/logging/IPC_server.py -n SNAPRed-progress -p ${PID}
Run any workflow, e.g., reduction. Start by reducing data for one run.
Exit. In an IPC-logging terminal, type CNTL-C to exit. Look at the JSON in ${user.application.data.home}/workflows_data/timing. Details of dt and estimated dt_est should have been recorded for several service steps and possibly from various recipes, depending on which workflow was executed. Note that the default estimator is linear: 1GB in 3.0s – this default instance should still be current after running only one workflow. Check the automatically generated step keys.
Repeat steps 1-3 to start a new process with an IPC-logger.
Run the reduction workflow for several distinct runs (e.g., 58810, 58812, 58813, 59039).
Somewhere in the middle of step 8, note that the time estimates indicated in the log will become much more accurate.
Exit as before. Inspect the latest JSON files – estimates should reflect improved accuracy, and you can see that the applied spline degree has increased.

Notes on estimator accuracy

The default linear estimator is not very accurate. It will update automatically when enough timing measurements are available.
The effect of this update can be seen in the JSON data at ${user.application.data.home}/workflows_data/timing.
If distinct N_ref values are not used, the estimator may become accurate for those run numbers only; extrapolation to other run numbers will be less accurate.
- Regardless, automatic updating continues as needed.
Given enough distinct N_ref values, the estimator will quickly become accurate, even when extrapolation is required.

Definition of the `N_ref` function

Special cases in workflows complicate the definition of the N_ref function. For example, in the normalization workflow, the distinction between cylindrical / spherical samples has not yet been treated.

The system presently identifies but does not attempt to estimate such special cases.
When defining an N_ref method, it’s important to return None if a case is identified that should not be estimated. This ensures that any recorded measurement can be effectively used to create a useful time estimate for the workflow.
- If every single measurement were recorded, certainly the estimate would be able to predict the average execution time for any workflow, but that average might be over cases including so many special parameter combinations that it would not be useful to the end user.

The ideal way to fix this issue of special casing the N_ref, in order to allow these estimators to function more broadly, is probably to move such special cases out of the service layer and up to workflow level. Alternatively, the step keying system could be made more elaborate.