Standard CIF data vector standard/std#

Description#

This is the standard pyCIF implementation of the datavect class. Information about inputs is split into component/parameter categories. The component/parameter categories are fully flexible in terms of names, but they should be consistent with the rest of the configuration.

General component categories include for instance:

  • concs: observed concentrations

  • fluxes: emission fluxes

  • inicond: initial conditions

  • meteo: meteorological fields

For each component, multiple parameters can be defined depending on diverse species, sectors, etc.

The datavect object is used to define the controlvect and obsvect objects. Therefore, arguments complementary to those specific to the datavect can be used in each component/parameter. Please see details of such additional arguments here and here.
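
For illustration, a minimal sketch of such a component/parameter structure is given below; the directory, file pattern and parameter name are hypothetical and only meant to show the nesting:

components:
  fluxes:
    dir: /path/to/fluxes/          # hypothetical directory
    file: fluxes_%Y%m.nc           # hypothetical file pattern
    file_freq: 1MS
    parameters:
      CH4:
        hresol: hpixels
        tresol: 1D
        err: 1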

YAML arguments#

The following arguments are used to configure the plugin. pyCIF will raise an exception at initialization if mandatory arguments are not specified, or if any argument does not fit the accepted values or types:

Optional arguments#

dump_debug : bool, optional, default False

Save extra information for debugging purposes. This includes the list of files and dates for each input, saved in $workdir/datavect/

components : optional

List of components in the data vector

Argument structure:
any_key : optional

Name of a given component

Argument structure:
dir : str, optional, default “”

Path to the corresponding component. This value is used if not provided in parameters

file : str, optional, default “”

File format in the given directory. This value is used if not provided in parameters

varname : str, optional, default “”

Variable name to use when reading data files, if different from the parameter name

file_freq : str, optional, default “”

Temporal frequency to fetch files

split_freq : str, optional

Force splitting the processing at a given frequency different to file_freq
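
For instance, the sketch below (with a hypothetical directory and file pattern) fetches yearly input files but processes them in monthly chunks:

components:
  fluxes:
    dir: /path/to/fluxes/       # hypothetical directory
    file: fluxes_%Y.nc          # hypothetical file pattern, one file per year
    file_freq: 1YS              # files are available at a yearly frequency
    split_freq: 1MS             # but the processing is split monthly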

parameters : optional

Store the list of parameters for this component

Argument structure:
any_key : optional

Name of a given parameter

Argument structure:
dir : str, optional, default “”

Path where to find the files for this parameter. If not specified, the component-level value is used

file : str, optional, default “”

File format in the given directory for this parameter. If not specified, the component-level value is used

varname : str, optional, default “”

Variable name to use when reading data files, if different from the parameter name

file_freq : str, optional, default “”

Temporal frequency to fetch files

split_freq : str, optional

Force splitting the processing at a given frequency different to file_freq

hresol : “hpixels” or “regions” or “hbands” or “ibands” or “global”, optional

the horizontal resolution of the control vector.

Warning

This argument determines whether the parameter is included in the control vector. All other arguments will be ignored if this one is not specified.

  • “hpixels”: use the native resolution of the corresponding data

  • “regions”: aggregate pixels into regions using a mask specified by the user

  • “hbands”: aggregate pixels by lon/lat bands

  • “ibands”: aggregate pixels by column/row index bands

  • “global”: optimize one factor for the whole spatial extent of the data
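
As an illustration, the hedged sketch below (arbitrary component and parameter names) optimizes fluxes at their native resolution while applying a single global factor to initial conditions:

components:
  fluxes:
    parameters:
      CH4:
        hresol: hpixels   # optimize every pixel of the flux data
  inicond:
    parameters:
      CH4:
        hresol: global    # one scaling factor for the whole spatial extent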

vresol : “vpixels” or “kbands” or “column”, optional, default “column”

the vertical resolution of the control vector.

  • “vpixels”: use the native resolution of the corresponding data

  • “kbands”: aggregate pixels into vertical bands by level index

  • “column”: (default) optimize one factor for the whole vertical extent of the data

tresol : str, optional

the main temporal resolution of the control vector. Should be a pandas frequency string. If not specified, a single increment is used for the full inversion window

tsubresol : None, optional

secondary temporal resolution for the control vector. If tsubresol is not a divisor of tresol, the final temporal resolution keeps the tresol periods as anchors and splits them according to tsubresol, adjusting the size of the last sub-period of each period.

For instance, if tresol is 1MS and tsubresol is 10D, the control vector will have a monthly resolution with 3 sub-periods per month: the first two sub-periods are 10 days long, according to tsubresol, and the third sub-period fills the remaining days of the month, hence between 8 days (for February) and 11 days (for 31-day months).
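
The corresponding configuration would read as follows (a hedged sketch for an arbitrary parameter):

parameters:
  CH4:
    hresol: hpixels
    tresol: 1MS        # monthly anchor periods
    tsubresol: 10D     # each month split into 10-day sub-periods plus a remainder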

type : “scalar” or “physical”, optional, default “scalar”

type of increments

  • “scalar”: (default) multiplicative increments. The control vector and the uncertainty matrix store unitless scaling factors

  • “physical”: additive increments. The control vector and the uncertainty matrix store the values in the original prior data set

xb_scale : float, optional

a scalar to apply to the prior before any computation

xb_value : float, optional

an offset to apply to the prior before any computation

err : float, optional

scaling factor to apply to the prior to compute the standard deviation of prior uncertainties.

err_type : “max” or “avg”, optional, default “avg”

complement to err; approach used to compute prior uncertainties from prior values; used only when type = physical:

  • “max”: Take the maximum prior value of the surrounding grid cells and scale it by err.

  • “avg”: (default) Take the average prior value of all the spatial extent of the prior data and scale it by err.
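
For example, the hedged sketch below assigns 100 % uncertainties based on the maximum prior value of the surrounding grid cells, using physical increments (the parameter name is arbitrary):

parameters:
  CH4:
    hresol: hpixels
    type: physical
    err: 1            # 100 % of the reference prior value
    err_type: max     # reference = maximum of the surrounding grid cells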

lower_bound : float, optional

lower boundary for the value of this control variable

upper_bound : float, optional

upper boundary for the value of this control variable.

glob_err : optional

used only when type = physical. Can be used to specify a total error for the whole spatial extent of the prior. The standard deviation of each spatial component of the control vector is scaled so that the total error (accounting for horizontal correlations, if any) matches the specified one

Argument structure:
total : float, mandatory

the area-weighted sum of all prior values is scaled according to this value

unit_scale : float, optional, default 1

scaling factor to apply to the sum of prior values. Use if the value specified in total is not in the same unit as the one in the prior values

surface_unit : bool, optional, default False

set to True if the total value is given per unit of surface

frequency_unit : bool, optional, default False

set to True if the total value is given per unit of time

account_correlations : bool, optional, default True

account or not for correlations to compute the total errors, i.e. also summing non-diagonal terms of the covariance matrix
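
A hedged sketch of a glob_err block is given below; the total value and the unit conversion factor are purely illustrative:

glob_err:
  total: 50                    # illustrative total budget error
  unit_scale: 1.0e+9           # hypothetical conversion to the prior unit
  surface_unit: False          # the total is not given per unit of surface
  frequency_unit: True         # the total is given per unit of time
  account_correlations: True   # include non-diagonal covariance terms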

lowlim_error : optional

lower limit for the standard deviation of prior uncertainties. The threshold is computed using the physical values of the prior data

Argument structure:
err : float, mandatory

lower threshold for errors

unit_scale : float, optional, default 1

scaling factor to apply to prior values. Use if the value specified in err is not in the same unit as the one in the prior values

hcorrelations : optional

horizontal correlations. In most cases, the matrix B is not built explicitly; instead, Kronecker products are used, and the horizontal correlations are applied to each temporal slice of the control vector

Argument structure:
sigma : float, optional

the horizontal correlation length in kilometers

landsea : bool, optional, default False

separate land and sea pixels

sigma_land : float, optional

the horizontal correlation length for land pixels

sigma_sea : float, optional

the horizontal correlation length for sea pixels

filelsm : str, optional

the path to the land-sea mask; it is a NetCDF with a variable lsm; ocean pixels are pixels with lsm < 0.5

dump_hcorr : bool, optional, default False

save horizontal correlations (as eigenvectors and eigenvalues) for later use; they are saved in the folder $WORKDIR/controlvect/correlations/; the name of each file is: horcor_{hresol}_{nlon}x{nlat}_cs{sigma_sea}_cl{sigma_land}.bin; a suffix _lbc is appended if correlations are computed for a lateral boundary condition component

dircorrel : str, optional

where to look for pre-computed correlations; files are looked for in the folder following the same format as for dump_hcorr

evalmin : float, optional, default 0

minimum value for eigenvalues; smaller eigenvalues are filtered out

crop_chi : bool, optional, default False

if True, the regularized vector \(\mathbf{\chi}\) has a reduced dimension (consistent with evalmin) compared to the full control vector
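
A hedged sketch of an hcorrelations block with separate land and sea correlation lengths (the land-sea mask path is hypothetical):

hcorrelations:
  landsea: True
  sigma_land: 200                 # km, over land pixels
  sigma_sea: 1000                 # km, over sea pixels
  filelsm: /path/to/lsm.nc        # hypothetical NetCDF file with a variable lsm
  dump_hcorr: True                # save eigenvectors/eigenvalues for reuse
  dircorrel: /path/to/correl/     # hypothetical folder with pre-computed correlations
  evalmin: 0.001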

tcorrelations : optional

temporal correlations of the prior uncertainties. Like the horizontal correlations, they are combined through Kronecker products rather than building the full matrix B explicitly

Argument structure:
multi_sigmas : bool, optional, default False

it is possible to convolve multiple temporal correlation lengths and types (see below). If multi_sigmas is True, add a sub-paragraph sigmas with multiple entries; for each entry (the name has no importance), specify the sigma_t and type; this reads as follows:

tcorrelations:
  multi_sigmas: True
  sigmas:
    sigma1:
      type: isotrope
      sigma_t: "3D"
    sigma2:
      type: frequency
      freq: "1D"
      sigma_t: "10D"
    sigma3:
      type: category
      scale: "hourofday"
      sigma_t: "50D"

Note

Please note that if multi_sigmas is True, only the correlation values below sigmas will be accounted for.

sigmas : optional

temporal correlation lengths and types, to be used with multi_sigmas

Argument structure:
any_key : optional

correlation length and type

Argument structure:
sigma_t : float, mandatory

correlation length

type : str, mandatory

correlation type

sigma_t : str, optional

temporal correlation length; should be a pandas frequency string

type : “isotrope” or “frequency” or “category”, optional

the type of temporal correlation.

  • “isotrope”: correlations are simply computed from the temporal distance: \(r = \exp(-(\delta t / \sigma_t) ^ 2)\)

  • “frequency”: only control vector components separated by a period of exactly the given frequency are correlated, still following the same formula as for isotrope; for instance, if frequency = 1D, only components at the same hour of the day are correlated with each other

  • “category”: the temporal distance used in the correlation formula is computed per temporal category; accepted values: [hourofday, dayofweek, monthofyear]. For instance, with hourofday, a component at 12:00 on a given day will be more correlated with a component at 13:00 on another day than with a component at 18:00 of the same day
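
When multi_sigmas is not used, a single correlation length and type can be given directly, as in the hedged sketch below:

tcorrelations:
  sigma_t: "7D"        # illustrative correlation length
  type: isotrope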

dump_tcorr : bool, optional, default False

save temporal correlations (as eigenvectors and eigenvalues) for later use; they are saved in the folder $WORKDIR/controlvect/correlations/; the name of each file is: tempcor_{datei}_{datef}_per{period}_ct{sigma_t}_{sigma_type}.bin; a suffix _lbc is appended if correlations are computed for a lateral boundary condition component

dircorrel : str, optional

where to look for pre-computed correlations

evalmin : float, optional, default 0

minimum value for eigenvalues; smaller eigenvalues are filtered out

crop_chi : None, optional, default False

if True, the regularized vector \(\mathbf{\chi}\) has a reduced dimension (consistent with evalmin) compared to the full control vector

bands_lat, bands_lon : list, optional

To be used with hresol = hbands. A list of longitudes/latitudes defining a chessboard pattern for aggregating the pixels. The values are the edges of each band, hence N + 1 values are needed for N bands

bands_i, bands_j : list, optional

To be used with hresol = ibands. Same as bands_lat / bands_lon but with column/row indexes
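
For instance, the hedged sketch below (with illustrative band edges) aggregates a parameter into 2 latitude bands by 3 longitude bands:

parameters:
  CH4:
    hresol: hbands
    bands_lat: [30, 45, 60]        # 3 edges define 2 latitude bands
    bands_lon: [-10, 0, 10, 20]    # 4 edges define 3 longitude bands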

regions_infos : optional

To be used with hresol = regions. Information about the file to be read to define the regions.

The region file can either follow the default format, which is a NetCDF file with a variable regions; the variable should have the same dimensions as the domain of the prior data. It is also possible to use the format of another data type recognized by pyCIF; in that case, a plugin sub-paragraph should be included in regions_infos

Argument structure:
dir : str, mandatory

Path where to find the region-defining file

file : str, mandatory

name of the file

plugin : mandatory

plugin used to read the region-defining file

Argument structure:
name : str, mandatory

name of the plugin

version : str, mandatory

version of the plugin

regions_lsm : bool, optional, default False

To be used with hresol = regions. Use the index of each region to determine land and ocean regions. Positive indexes are land regions. Negative and null indexes are ocean regions. This information is used to compute horizontal correlations if the correlation length is different for land and ocean.
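
A hedged sketch of a region-based configuration is given below; the directory, file name and plugin identifiers are hypothetical:

parameters:
  CH4:
    hresol: regions
    regions_infos:
      dir: /path/to/regions/       # hypothetical directory
      file: my_regions.nc          # hypothetical NetCDF file with a variable regions
      plugin:
        name: standard             # hypothetical plugin name
        version: std               # hypothetical plugin version
    regions_lsm: True              # positive indexes = land, others = ocean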

Requirements#

The current plugin requires the following plugins to run properly:

Requirement name   Requirement type   Explicit definition   Any valid   Default name   Default version
domain             Domain             True                  True        None           None
model              Model              True                  True        None           None
components         DataStream         True                  True        None           None
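
In practice, these requirements are typically fulfilled by declaring the corresponding plugins elsewhere in the same YAML configuration; the hedged sketch below uses placeholders instead of actual plugin names and versions:

model:
  plugin:
    name: XXXXX        # any valid Model plugin
    version: XXXXX

domain:
  plugin:
    name: XXXXX        # any valid Domain plugin
    version: XXXXX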

YAML template#

Please find below a template for a YAML configuration:

datavect:
  plugin:
    name: standard
    version: std
    type: datavect

  # Optional arguments
  dump_debug: XXXXX  # bool
  components:
    any_key:
      dir: XXXXX  # str
      file: XXXXX  # str
      varname: XXXXX  # str
      file_freq: XXXXX  # str
      split_freq: XXXXX  # str
      parameters:
        any_key:
          dir: XXXXX  # str
          file: XXXXX  # str
          varname: XXXXX  # str
          file_freq: XXXXX  # str
          split_freq: XXXXX  # str
          hresol: XXXXX  # hpixels|regions|hbands|ibands|global
          vresol: XXXXX  # vpixels|kbands|column
          tresol: XXXXX  # str
          tsubresol: XXXXX  # None
          type: XXXXX  # scalar|physical
          xb_scale: XXXXX  # float
          xb_value: XXXXX  # float
          err: XXXXX  # float
          err_type: XXXXX  # max|avg
          lower_bound: XXXXX  # float
          upper_bound: XXXXX  # float
          glob_err:
            total: XXXXX  # float
            unit_scale: XXXXX  # float
            surface_unit: XXXXX  # bool
            frequency_unit: XXXXX  # bool
            account_correlations: XXXXX  # bool
          lowlim_error:
            err: XXXXX  # float
            unit_scale: XXXXX  # float
          hcorrelations:
            sigma: XXXXX  # float
            landsea: XXXXX  # bool
            sigma_land: XXXXX  # float
            sigma_sea: XXXXX  # float
            filelsm: XXXXX  # str
            dump_hcorr: XXXXX  # bool
            dircorrel: XXXXX  # str
            evalmin: XXXXX  # float
            crop_chi: XXXXX  # bool
          tcorrelations:
            multi_sigmas: XXXXX  # bool
            sigmas:
              any_key:
                sigma_t: XXXXX  # float
                type: XXXXX  # str
            sigma_t: XXXXX  # str
            type: XXXXX  # isotrope|frequency|category
          dump_tcorr: XXXXX  # bool
          dircorrel: XXXXX  # str
          evalmin: XXXXX  # float
          crop_chi: XXXXX  # None
          bands_lat, bands_lon: XXXXX  # list
          bands_i, bands_j: XXXXX  # list
          regions_infos:
            dir: XXXXX  # str
            file: XXXXX  # str
            plugin:
              name: XXXXX  # str
              version: XXXXX  # str
          regions_lsm: XXXXX  # bool