Main CIF observation operator standard/std#

Description#

This is the main observation operator for pyCIF. It is called by most execution modes and heavily relies on so-called transforms for elementary operations.

The observation operator can be decomposed into a sequence of sub-operations:

\[\mathcal{H}(\mathbf{x}) = ( \mathcal{H}_1 \circ \mathcal{H}_2 \circ \cdots \circ \mathcal{H}_N ) (\mathbf{x})\]

See the transforms documentation for details, in particular the individual documentation of each transform and the general input/output format.
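The composition above can be illustrated with a minimal Python sketch. The three functions below (scale_fluxes, run_toy_model, sample_at_obs) are hypothetical stand-ins for elementary transforms and are not part of pyCIF; the point is only that the rightmost operator is applied to x first.

```python
from functools import reduce


def compose(*transforms):
    """Return H = H_1 o H_2 o ... o H_N, so that H_N is applied to x first."""
    def composed(x):
        # Apply the transforms from the rightmost one to the leftmost one
        return reduce(lambda value, transform: transform(value),
                      reversed(transforms), x)
    return composed


# Hypothetical stand-ins for elementary transforms (not pyCIF functions)
def scale_fluxes(x):
    return [2.0 * v for v in x]


def run_toy_model(x):
    return [v + 1.0 for v in x]


def sample_at_obs(x):
    return x[0]


# H = sample_at_obs o run_toy_model o scale_fluxes
obs_operator = compose(sample_at_obs, run_toy_model, scale_fluxes)
result = obs_operator([1.0, 2.0])
```

Here scale_fluxes plays the role of the innermost operator and sample_at_obs the outermost one.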

Transform pipeline#

In pyCIF, the successive transforms are arranged into a so-called pipeline. The steps to initialize a pipeline consistent with the user-defined configuration are carried out in the function:

pycif.plugins.obsoperators.standard.transforms.init_transform(self)

Initialize the complete transform pipeline for the observation operator.

Assembles the ordered Transform pipeline that the operator will execute at run time. The pipeline is built from four sub-pipelines applied in sequence:

  1. Observation-vector side — transforms from obsvect.transform_pipe plus a mandatory toobsvect step for each observed species, and a satellites step for satellite components (via init_obsvect_transformations()).

  2. Main pipe — transforms from obsoperator.transform_pipe (defaults to a single run_model step when none are specified), via init_mainpipe().

  3. Control-vector side — transforms from controlvect.transform_pipe (via init_control_transformations()).

  4. Dump / load wrappers: dump2inputs and loadfromoutputs transforms inserted automatically where force_dump or force_loadout flags are set (via dump_read_inout()).

After assembly the pipeline is ordered so that precursor transforms always run before their successors via period_pipe(). Data availability is verified by check_datavect(), and a human-readable description is written to disk by dump_transform_description().

If self.batch_computation is configured, the pipeline is further modified for Monte-Carlo batch execution via batch_computation().

Parameters:

self (ObsOperator) – the obs-operator plugin instance. On return, self.transform_pipe, self.period_order_fwd, and self.period_order_adj are populated.

Note

To compute a given pipeline, the observation operator first walks the pipeline backwards in a dry-run mode. This initialization step allows propagating metadata about what output format is needed for transformations.

For instance, metadata about observations needs to be propagated backwards so that pyCIF knows where to extract concentrations in the CTM before running it forward.
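The backward dry-run described in this note can be sketched as follows. This is a simplified illustration, not the pyCIF implementation: the pipeline representation, the metadata dictionaries, and the function dry_run_backward are all assumptions made for the example. Each step translates what it must output into what its precursor must output.

```python
def dry_run_backward(pipeline, final_request):
    """Walk the pipeline from last to first and record what each step must output."""
    required_outputs = {}
    request = final_request
    for name, inputs_needed_for in reversed(pipeline):
        required_outputs[name] = request
        # Translate "what this step must output" into "what its precursor must output"
        request = inputs_needed_for(request)
    return required_outputs


# Hypothetical observation metadata: (station, date) pairs
obs_points = [("station_A", "2019-01-01T12:00")]

# Hypothetical pipeline, listed in forward order
pipeline = [
    ("scale_fluxes", lambda req: {"kind": "scaling_factors", "grid": req["grid"]}),
    ("run_model", lambda req: {"kind": "fluxes", "grid": "model_grid"}),
    ("toobsvect", lambda req: {"kind": "concentrations", "points": req["points"]}),
]

required = dry_run_backward(pipeline, {"kind": "obsvect", "points": obs_points})
```

After the backward pass, required["run_model"] carries the observation points, which is the information the model step needs before the forward run starts.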

Main pipeline#

The observation operator builds the transformation pipeline from the transform_pipe entries specified in the control vector, in the observation vector, and in the observation operator itself.

The functions used to determine the main pipe are the following (in order of execution):

pycif.plugins.obsoperators.standard.transforms.init_mainpipe(self, all_transforms, backup_comps, mapper)

Initialize the core of the transform pipeline.

Reads self.transform_pipe (transforms defined directly on the observation operator in the YAML) and inserts each of its transforms before the first element already present in self.mainpipe. If no transform_pipe is defined and self.ignore_model is False, a default run_model transform is added automatically.

Warning

If transform_pipe is specified in the observation operator, only the explicitly listed transforms are used — the CTM model is not added automatically. To run the model on top of custom transforms, include run_model explicitly in the list. For most applications it is preferable to define extra transforms in the controlvect or obsvect transform_pipe instead.

Parameters:
  • self (ObsOperator) – the obs-operator plugin instance. On return, self.mainpipe is updated with the IDs of the newly inserted transforms.

  • all_transforms – the Transform object holding all transforms; modified in-place.

  • backup_comps (dict) – backed-up component definitions forwarded to add_default().

  • mapper (dict) – the pipeline mapper dictionary; updated in-place.
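The default behaviour documented for init_mainpipe() can be summarized with a small sketch. The function resolve_main_pipe below is hypothetical and only mirrors the rules stated above (explicit transform_pipe used as is, run_model added only when nothing is specified and ignore_model is False); it does not reproduce the actual insertion logic.

```python
def resolve_main_pipe(transform_pipe=None, ignore_model=False):
    """Return the names of the main-pipe transforms under the documented defaults."""
    if transform_pipe:
        # Explicit list: used as is, run_model is NOT added automatically
        return list(transform_pipe)
    if not ignore_model:
        # Nothing specified: fall back to a single run_model step
        return ["run_model"]
    # Nothing specified and the model is ignored: empty main pipe
    return []


default_pipe = resolve_main_pipe()
custom_pipe = resolve_main_pipe(["my_transform"])
custom_with_model = resolve_main_pipe(["my_transform", "run_model"])
```

The second call illustrates the warning above: with a custom list, the model only runs if run_model is listed explicitly, as in the third call.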

pycif.plugins.obsoperators.standard.transforms.init_control_transformations(self, all_transforms, controlvect, backup_comps, mapper)

Initialize transforms on the control-vector side.

Reads controlvect.transform_pipe and inserts each of its transforms before the first element of self.mainpipe in all_transforms, preserving the user-defined order.

Also loops over all components/tracers of the datavect and, for those that specify the unit_conversion argument, automatically inserts a unit_conversion transform.

Parameters:
  • self (ObsOperator) – the obs-operator plugin instance; uses self.mainpipe to determine the insertion point.

  • all_transforms – the Transform object holding all transforms; modified in-place.

  • controlvect (ControlVect) – control-vector object; its transform_pipe is read to determine which transforms to insert.

  • backup_comps (dict) – backed-up component definitions forwarded to add_default().

  • mapper (dict) – the pipeline mapper dictionary; updated in-place.

pycif.plugins.obsoperators.standard.transforms.init_obsvect_transformations(self, all_transforms, obsvect, backup_comps, mapper)

Initialize transforms on the observation-vector side.

Appends a toobsvect transform to the pipeline for every observed species (component/tracer pair where param.isobs is True). For satellites components, a satellites transform is inserted immediately before the corresponding toobsvect step.

Then reads obsvect.transform_pipe and prepends each of its transforms before all other transforms in all_transforms, preserving the user-defined order.

Parameters:
  • self (ObsOperator) – the obs-operator plugin instance. On return, self.mainpipe is populated with the IDs of the newly inserted toobsvect (and satellites) transforms.

  • all_transforms – the Transform object holding all transforms; modified in-place.

  • obsvect (ObsVect) – observation-vector object; its transform_pipe and datavect are read to determine which transforms to insert.

  • backup_comps (dict) – backed-up component definitions forwarded to add_default().

  • mapper (dict) – the pipeline mapper dictionary; updated in-place.

Connecting and ordering transforms into a pipeline#

pycif.plugins.obsoperators.standard.transforms.connect_pipes(all_transforms, mapper, transform)

Connect transforms based on their inputs and outputs.

pycif.plugins.obsoperators.standard.transforms.period_pipe(self, all_transforms, mapper)

Arrange all transforms into ordered forward and adjoint execution pipes.

Determines the chronologically correct execution order for every (transform, sub-simulation date) pair by:

  1. Propagating sub-simulation periods from each transform to its precursors and successors via default_subsimus().

  2. Building a dependency graph and walking it in forward order with fwd_adj_pipe() (mode='forward').

  3. Walking the same graph in reverse order (mode='adjoint').

Each returned pipe is a list of (date, transform_id, direction) tuples, where direction is either 'forward' or 'adjoint' and controls whether a transform runs in its normal or dry-run mode.

Parameters:
  • self (ObsOperator) – the obs-operator plugin instance.

  • all_transforms – the Transform object holding all initialized transforms.

  • mapper (dict) – the pipeline mapper dictionary mapping transform IDs to their sub-simulation, input/output and precursor/successor metadata.

Returns:

(pipe_fwd, pipe_adj) where each element is a list of (datetime.datetime, str, str) tuples giving the execution order for forward and adjoint runs respectively.

Return type:

tuple[list, list]
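The ordering idea behind period_pipe() can be sketched with a topological sort from the Python standard library (graphlib, Python 3.9+). This is an illustration under assumptions, not the pyCIF algorithm: the function order_pipes, the precursors dictionary, and the rule that a transform at one period follows the same transform at the previous period are all hypothetical.

```python
from datetime import datetime
from graphlib import TopologicalSorter


def order_pipes(precursors, dates):
    """Return (pipe_fwd, pipe_adj) as lists of (date, transform_id, direction)."""
    graph = {}
    for i, date in enumerate(dates):
        for trid, preds in precursors.items():
            # A transform depends on its precursors within the same period
            deps = {(date, p) for p in preds}
            if i > 0:
                # Assumption: it also follows itself at the previous period
                deps.add((dates[i - 1], trid))
            graph[(date, trid)] = deps
    ordered = list(TopologicalSorter(graph).static_order())
    pipe_fwd = [(d, t, "forward") for d, t in ordered]
    # The adjoint pipe walks the same graph in reverse order
    pipe_adj = [(d, t, "adjoint") for d, t in reversed(ordered)]
    return pipe_fwd, pipe_adj


precursors = {
    "scale_fluxes": [],
    "run_model": ["scale_fluxes"],
    "toobsvect": ["run_model"],
}
dates = [datetime(2019, 1, 1), datetime(2019, 2, 1)]
pipe_fwd, pipe_adj = order_pipes(precursors, dates)
```

Any order returned by the sorter respects the dependencies, so precursors always appear before their successors in pipe_fwd and after them in pipe_adj.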

Automatic pipeline#

After initializing the main pipeline of required transforms, the observation operator checks the consistency of the horizontal and vertical extent, of the temporal resolution, and of the data unit to determine which extra intermediate transformations must be carried out.

More precisely, for every successive transform of the main pipeline, the observation operator checks whether the output format of the precursor transform is consistent with the input format of the successor transform. This check includes the definition of the domain (horizontal and vertical extent), of the input_dates (temporal definition) and of the unit.

The corresponding transforms that may be included at this step are:

  1. regrid

  2. time_interpolation

  3. vertical_interpolation

  4. unit_conversion

For each of the above-mentioned transforms, it is possible to explicitly specify extra parameters in the related component/tracer of the datavect as follows:

datavect :
  components:
    flux:
      parameters:
        CO2:
          dir: XXX
          file: XXX
          regrid:
            method: mass-conservation

All these operations are done in the function:

pycif.plugins.obsoperators.standard.transforms.utils.init_default_transformations(self, all_transforms, backup_comps, mapper, transform, do_pipe_entry=False, trid_to_check=None)

Initialize default transformations based on compatibility of input/output formats of successive transforms.
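The compatibility check between successive transforms can be sketched as below. This is a simplified illustration, not the code of init_default_transformations(): the format dictionaries, the key names (in particular the separate "levels" key for the vertical extent) and the function insert_default_transforms are assumptions made for the example.

```python
# Format key compared between two transforms, and the transform inserted on mismatch.
# Key names are hypothetical; the transform names are those listed above.
CHECKS = [
    ("domain", "regrid"),
    ("input_dates", "time_interpolation"),
    ("levels", "vertical_interpolation"),
    ("unit", "unit_conversion"),
]


def insert_default_transforms(pipeline):
    """pipeline: list of dicts with 'name', 'input_format' and 'output_format'."""
    completed = []
    for precursor, successor in zip(pipeline, pipeline[1:]):
        completed.append(precursor["name"])
        produced = precursor["output_format"]
        expected = successor["input_format"]
        for key, transform in CHECKS:
            # Insert an intermediate transform wherever the formats disagree
            if produced.get(key) != expected.get(key):
                completed.append(transform)
    if pipeline:
        completed.append(pipeline[-1]["name"])
    return completed


pipeline = [
    {"name": "read_fluxes", "input_format": {},
     "output_format": {"domain": "0.5deg", "input_dates": "monthly",
                       "levels": "surface", "unit": "kg/m2/s"}},
    {"name": "run_model",
     "input_format": {"domain": "1deg", "input_dates": "monthly",
                      "levels": "surface", "unit": "molec/cm2/s"},
     "output_format": {}},
]
completed = insert_default_transforms(pipeline)
```

In this toy case the grids and units differ, so a regrid and a unit_conversion step are placed between the two transforms, while no temporal or vertical interpolation is needed.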

Debugging options#

Two options help inspect what happens at each step of the pipeline without modifying the transforms themselves:

  • save_debug — dumps the full inputs and outputs of every transform to $workdir/obsoperator/$run_id/transform_debug/. Useful for detailed inspection but slow and disk-intensive.

  • save_debug_meta — requires save_debug: True. Replaces the full NetCDF / datastore files with lightweight plain-text summaries that record dimensions, value ranges, and NaN flags. Suitable for routine debugging or large runs where a full dump would be impractical.
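To give an idea of what a lightweight summary of this kind contains, here is a minimal sketch using only the standard library. It is not the pyCIF writer: a plain dictionary of columns stands in for a pd.DataFrame, the column names maindata, spec and incr follow the save_debug_meta description, and the text layout produced by summarize_columns is purely hypothetical.

```python
import math


def summarize_columns(table, columns=("maindata", "spec", "incr")):
    """Summarize a dict of column name -> list of floats (stand-in for a DataFrame)."""
    nrows = len(next(iter(table.values()), []))
    lines = [f"rows: {nrows}"]
    for col in columns:
        if col not in table:
            # Only columns that exist are reported
            continue
        values = table[col]
        finite = [v for v in values if not math.isnan(v)]
        has_nan = len(finite) < len(values)
        vmin = min(finite) if finite else float("nan")
        vmax = max(finite) if finite else float("nan")
        lines.append(f"{col}: min={vmin} max={vmax} has_nan={has_nan}")
    return "\n".join(lines)


summary = summarize_columns({"spec": [1.0, float("nan"), 3.0]})
```

A summary like this is enough to spot an unexpected NaN or an out-of-range value at a given step without storing the full intermediate data.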

YAML arguments#

The following arguments are used to configure the plugin. pyCIF raises an exception at initialization if mandatory arguments are not specified, or if any argument does not match the accepted values or type.

Optional arguments#

autorestart : bool, optional, default False

If interrupted, computations restart from the last simulated period. WARNING: the CIF cannot detect whether this period has been correctly written or is corrupt. It is necessary to check the relevant directories manually and to remove the last simulated period if a file has not been correctly written.

autoflush : bool, optional, default False

Remove large temporary files when the run is done.

force-full-flush : bool, optional, default False

Complementary to autoflush. Also flushes files needed to run an adjoint. Use this option when no adjoint is needed later. The option is triggered only if autoflush is True.

save_debug : bool, optional, default False

Force transforms to save debugging information. Intermediate datastores will be saved in the directory $workdir/obsoperator/$run_id/transform_debug/

Warning

This option saves every intermediate state of the transformation pipeline. It drastically slows down the computation of the observation operator and can take a lot of disk space. It should be used only for debugging or for understanding what happens along the way.

save_debug_meta : bool, optional, default False

Complementary to save_debug. When both save_debug and save_debug_meta are True, only lightweight plain-text metadata files are written instead of full NetCDF / datastore files. Dramatically reduces wall-time and disk overhead while still capturing enough information to trace data flow and detect anomalies.

Each text file records, per intermediate datastore entry:

  • For xr.Dataset — dimension names and sizes, min/max values, and NaN presence for the spec variable (and incr when present).

  • For pd.DataFrame — row count and, for each of the maindata, spec, and incr columns that exist, min/max values and NaN presence.

Files are written to the same $workdir/obsoperator/$run_id/transform_debug/ directory as the full debug files, but use _meta_ in their names rather than _debug_, and carry a .txt extension.

Has no effect if save_debug is False.

force_full_operator : bool, optional, default False

Force computing all transforms in the observation operator, even if no observation is to be simulated.

init_inputs : optional

Structure of components and parameters to initialize. With this option, there is no need to define an execution mode, and only the requested inputs are computed. It is also possible to provide a partial yaml paragraph for the datavect object: only the components needed to generate the requested inputs are checked before execution.

Argument structure:
any_key : optional

Name of a given component to be initialized

Argument structure:
parameters : list, optional

List of parameters to initialize for the corresponding component. Initialize all parameters if not specified

transform_pipe : optional

List of transformations to build the main observation operator pipeline

Argument structure:
any_key : optional

Name of a given transformation to be included. The name has no impact on the way the observation operator is computed, although it is recommended to use explicit names to help debugging.

Argument structure:
**args : optional

Arguments to set-up the given transform

parallel : optional

Physical parallelization of the computation of the TL and adjoint

Argument structure:
segments : str, mandatory

Length of each parallel segment

overlap : str, mandatory

Length of the initial overlap with previous segments

subprocess : bool, optional, default False

If True submit the segments in subprocesses, else submit them in new jobs with the platform plugin

nproc : int, optional

Number of processes to allocate to each segment when ‘subprocess’ is True (works with LMDz only)
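One possible reading of the segments and overlap arguments is sketched below, under loudly stated assumptions: each segment covers a fixed length of the simulation window and starts its computation one overlap earlier than its nominal start, so that it overlaps the previous segment. The function split_into_segments is hypothetical, and it takes datetime.timedelta objects for simplicity, whereas the YAML arguments are strings.

```python
from datetime import datetime, timedelta


def split_into_segments(datei, datef, segment, overlap):
    """Return (run_start, nominal_start, end) tuples covering [datei, datef]."""
    if segment <= timedelta(0):
        raise ValueError("segment length must be positive")
    segments = []
    start = datei
    while start < datef:
        end = min(start + segment, datef)
        # Assumption: the overlap is an initial period shared with the previous segment
        run_start = max(start - overlap, datei)
        segments.append((run_start, start, end))
        start = end
    return segments


segments = split_into_segments(
    datetime(2019, 1, 1), datetime(2019, 1, 11),
    segment=timedelta(days=5), overlap=timedelta(days=1),
)
```

With a 10-day window, 5-day segments and a 1-day overlap, this gives two segments, the second one starting its computation on 5 January for a nominal start on 6 January.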

ref_fwd_dir : str, optional, default “”

Path to a reference forward run. This is used when using the approximate operator to accelerate its computation.

approx_operator : optional

Approximate the observation operator outside the given interval

Argument structure:
datei : str, mandatory

Start date of the interval on which to compute the real operator

datef : str, mandatory

End date of the interval on which to compute the real operator

batch_computation : optional

Compute perturbed samples of the control vector within the same observation operator

Argument structure:
nsamples : int, mandatory

Number of samples to generate

dir_samples : str, mandatory

Directory where to fetch sample control vectors

file_samples : str, optional, default “controlvect_ensemble.pickle”

Sample control vectors file name

dont_propagate : list, optional

List of (component, parameter) tuples that should not be propagated

dont_propagate_obsvect : list, optional

List of (component, parameter) tuples that the ‘toobsvect’ transformation should not propagate

ignore_model : bool, optional, default False

Do not run the model as part of the observation operator.

force_propagate_attributes : bool, optional, default False

Force the propagation of attributes throughout transforms. Use with caution.

monitor_memory : bool, optional, default False

Print memory usage for each transform.

clean_memory : bool, optional, default True

Clean datastores that are not used anymore

autokill_time : str, optional

Stop the running simulation after a given time and automatically re-submit it in a new job. The value should be one of Pandas’ offset aliases; for example, use ‘23h’ to stop the simulation after 23 hours. This option requires a platform plugin configured with the options needed to submit a job.

max_resubmissions : int, optional, default 0

Maximum number of times the simulation can be automatically re-submitted in a job.

rename_resubmit_logfile : int, optional, default True

Rename the log file for re-submitted simulations.

onlyinit : bool, optional, default False

Only initialize the observation operator, without running it.

use_dask : bool, optional, default False

Prototype: Use dask to manage transform graph tree

Requirements#

This plugin requires the following plugins to run properly:

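================  ================  ===================  =========  ============  ===============
Requirement name  Requirement type  Explicit definition  Any valid  Default name  Default version
================  ================  ===================  =========  ============  ===============
model             Model             False                True       None          None
obsvect           ObsVect           True                 True       standard      std
controlvect       ControlVect       True                 True       standard      std
datavect          DataVect          True                 True       standard      std
platform          Platform          True                 True       None          None
================  ================  ===================  =========  ============  ===============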

YAML template#

Please find below a template for a YAML configuration:

obsoperator:
  plugin:
    name: standard
    version: std
    type: obsoperator

  # Optional arguments
  autorestart: XXXXX  # bool
  autoflush: XXXXX  # bool
  force-full-flush: XXXXX  # bool
  save_debug: XXXXX  # bool
  save_debug_meta: XXXXX  # bool
  force_full_operator: XXXXX  # bool
  init_inputs:
    any_key:
      parameters: XXXXX  # list
  transform_pipe:
    any_key:
      **args: XXXXX  # any
  parallel:
    segments: XXXXX  # str
    overlap: XXXXX  # str
    subprocess: XXXXX  # bool
    nproc: XXXXX  # int
  ref_fwd_dir: XXXXX  # str
  approx_operator:
    datei: XXXXX  # str
    datef: XXXXX  # str
  batch_computation:
    nsamples: XXXXX  # int
    dir_samples: XXXXX  # str
    file_samples: XXXXX  # str
    dont_propagate: XXXXX  # list
    dont_propagate_obsvect: XXXXX  # list
  ignore_model: XXXXX  # bool
  force_propagate_attributes: XXXXX  # bool
  monitor_memory: XXXXX  # bool
  clean_memory: XXXXX  # bool
  autokill_time: XXXXX  # str
  max_resubmissions: XXXXX  # int
  rename_resubmit_logfile: XXXXX  # int
  onlyinit: XXXXX  # bool
  use_dask: XXXXX  # bool