Observations

pyCIF uses observations to optimize fluxes or other control variables, as well as to simply compare them to forward simulations. The user must provide information on the observations in a NetCDF file. The observations are then read by pyCIF and further information is deduced, such as the relation of the observations to the model’s world. Within pyCIF, the data is stored in a Pandas DataFrame and can be dumped to/read from compatible NetCDF files. pyCIF stores observations in NetCDF monitor.nc files, the format of which is inspired by the ObsPack standard format.

Observation data to provide

The user must provide information in one NetCDF file per species and per dataset. The observations are ordered along an index, which is simply the ID number of each observation.

The basic information is:

date:

date at which the observation begins. The date must be a datetime object. When dumped to a NetCDF file, it is encoded as “seconds since YYYY-MM-DD HH:MM:SS” and is correctly read and interpreted by pyCIF afterwards (see example codes)

duration:

duration of the observation in hours

station:

ID of the station/instrument/satellite

network:

name of the network/retrieval

parameter:

name of the observed parameter or species

lon:

longitude of the measurement in degrees East i.e. between -180 degrees (to the West) and 180 degrees (to the East)

lat:

latitude of the measurement in degrees North i.e. between -90 degrees (to the South) and 90 degrees (to the North)

obs:

observed value, in units consistent with the units produced by the observation operator. As of today, ppm should be used as the standard

obserror:

error \(\epsilon\) on the observation, in the same units as the observation; it is used to define the matrix \(\mathbf{R}\) so that \(R_{ii} = \epsilon_i^2\) (see the sketch below)
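
For illustration, the diagonal of \(\mathbf{R}\) can be built from the obserror values as in the following sketch (plain numpy, not pyCIF code; the values are arbitrary):

import numpy as np

# Arbitrary observation errors, in the same units as obs (e.g. ppm)
obserror = np.array([0.5, 1.2, 0.8])

# Diagonal observation-error covariance matrix with R_ii = epsilon_i**2
R = np.diag(obserror ** 2)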

Additional information for in-situ data (fixed surface sites, mobile platforms such as aircraft):

alt:

altitude of the measurement in m a.s.l. (above sea level)

level:

level number in the model

Note

Please note that the vertical correspondence between observations and simulations is still under construction. This means that the alt information is not yet processed in the CIF. Only two options are accepted:

  1. to provide the number of the level matching each measurement site in a text file, indicated in the yaml (see the tutorials for comparing observations to simulations), with the following columns: name of the site, latitude, longitude, number of the level in the model (counting begins at 0 at the surface) and altitude; at the moment, only the site and level columns are used in the CIF; see an example of such a file for the model LMDZ below (the first row is not read by the CIF), followed by a sketch of how to read it

  2. to provide the number of the level matching each data point directly in the netCDF file, in the column level (see example code here)

#STAT LAT     LON     lmdzLEV  alt(m)
ALT   82.45   -62.52  1        210
JFJ   46.548  7.987   10       3580
MHD   53.33   -9.9    1        8
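
Such a file can be parsed, for instance, as in the following sketch (not pyCIF’s actual reader; the file name stations_levels.txt is only an example):

import pandas as pd

# Whitespace-separated columns; the first (header) row is not read by the CIF
levels = pd.read_csv(
    "stations_levels.txt",
    sep=r"\s+",
    skiprows=1,
    names=["station", "lat", "lon", "level", "alt"],
)

# At the moment, only the station and level columns are used by the CIF
print(levels[["station", "level"]])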

Additional information for satellite data:

pavg0:

the pressure grid, including the pressures at the tops of the satellite levels plus the surface pressure. In Pa or hPa, to be specified in the yaml. Dimensions are index (number of data) and level_pressure, the maximum number of levels of the satellite’s vertical grid plus one for the surface.

qa0:

the prior profiles used for the retrieval, on the satellite’s pressure grid. Units must be consistent with the field to which the formula is applied. Dimensions are index (number of data) and level, the maximum number of levels of the satellite’s vertical grid.

ak:

the averaging kernels, on the satellite’s pressure grid. Usually no units. Dimensions are index (number of data) and level, the maximum number of levels of the satellite’s vertical grid.

The vertical ordering must be the same for the three fields (i.e. all three fields must be ordered either from bottom to top or from top to bottom).

In these three fields, fill with NaNs where values are not available or not relevant.

For more information on the formulas used for satellites, please consult the dedicated page.
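
As a purely illustrative example (the actual computation depends on the formula chosen in the yaml and is detailed on the dedicated page), a common convention for column retrievals combines the three fields with pressure-thickness weights derived from pavg0, as in the following hypothetical sketch:

import numpy as np

def simulated_column(ak, qa0, x_model, pavg0):
    """Illustrative column convolution on the satellite grid (not pyCIF code).

    ak, qa0: averaging kernels and prior profile on the satellite levels
    x_model: model profile interpolated onto the same levels (same units as qa0)
    pavg0: pressures at the satellite level tops plus the surface pressure
    """
    dp = np.abs(np.diff(pavg0))  # pressure thickness of each satellite level
    weights = dp / dp.sum()      # pressure weighting of the levels
    # prior profile corrected by the kernel-weighted model departure,
    # then integrated vertically into a column
    return np.sum(weights * (qa0 + ak * (x_model - qa0)))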

Structure of the netCDF file

The netCDF file provided for each species and dataset must have the following structure:

  • for in-situ data

dimensions:
 index = IIII ;
variables:
  int64 date(index) ;
      date:units = "seconds since YYYY-MM-DD HH:MM:SS" ;
      date:calendar = "proleptic_gregorian" ;
  double duration(index) ;
      duration:_FillValue = NaN ;
  string station(index) ;
  string network(index) ;
  string parameter(index) ;
  double lon(index) ;
      lon:_FillValue = NaN ;
  double lat(index) ;
      lat:_FillValue = NaN ;
  double alt(index) ;
      alt:_FillValue = NaN ;
  double obs(index) ;
      obs:_FillValue = NaN ;
  double obserror(index) ;
      obserror:_FillValue = NaN ;

Code example: here (random values) and here (from a given csv)
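
For instance, a file with this structure can be built with xarray as in the following sketch (hypothetical station and values, random observations; not an official pyCIF script):

import numpy as np
import pandas as pd
import xarray as xr

nobs = 10
ds = xr.Dataset(
    {
        "date": ("index", pd.date_range("2019-01-01", periods=nobs, freq="h")),
        "duration": ("index", np.full(nobs, 1.0)),            # hours
        "station": ("index", np.array(["MHD"] * nobs)),
        "network": ("index", np.array(["my_network"] * nobs)),
        "parameter": ("index", np.array(["CH4"] * nobs)),
        "lon": ("index", np.full(nobs, -9.9)),
        "lat": ("index", np.full(nobs, 53.33)),
        "alt": ("index", np.full(nobs, 8.0)),
        "obs": ("index", np.random.normal(1.9, 0.05, nobs)),   # e.g. ppm
        "obserror": ("index", np.full(nobs, 0.01)),            # same units as obs
    },
    coords={"index": np.arange(nobs)},
)

# The datetime column is encoded as "seconds since ..." in the NetCDF file
ds.to_netcdf(
    "monitor.nc",
    encoding={"date": {"units": "seconds since 2019-01-01 00:00:00"}},
)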

  • for satellite data

dimensions:
 index = IIII ;
 level = LL ;
 level_pressure = LL+1 ;
variables:
 double qa0(index, level) ;
     qa0:_FillValue = NaN ;
 double ak(index, level) ;
     ak:_FillValue = NaN ;
 double pavg0(index, level_pressure) ;
     pavg0:_FillValue = NaN ;
 int64 date(index) ;
     date:units = "seconds since YYYY-MM-DD HH:MM:SS" ;
     date:calendar = "proleptic_gregorian" ;
 double duration(index) ;
     duration:_FillValue = NaN ;
 string station(index) ;
 string network(index) ;
 string parameter(index) ;
 double lon(index) ;
     lon:_FillValue = NaN ;
 double lat(index) ;
     lat:_FillValue = NaN ;
 double obs(index) ;
     obs:_FillValue = NaN ;
 double obserror(index) ;
     obserror:_FillValue = NaN ;

Code example: here
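
For instance, the satellite-specific variables can be sketched as follows (random values, hypothetical dimensions), to be combined with the in-situ variables shown above:

import numpy as np
import xarray as xr

nobs, nlev = 10, 12   # number of observations and of satellite levels
sat = xr.Dataset(
    {
        "qa0": (("index", "level"), np.random.rand(nobs, nlev)),
        "ak": (("index", "level"), np.random.rand(nobs, nlev)),
        # level edges: tops of the satellite levels plus the surface pressure
        "pavg0": (
            ("index", "level_pressure"),
            np.tile(np.linspace(1e5, 1e3, nlev + 1), (nobs, 1)),
        ),
    },
    coords={"index": np.arange(nobs)},
)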

How to specify what to do in the yml file

The information on the observations is described in the section datavect.

For in-situ data, the paragraph is named concs (see the documentation for the standard case) and is organised as in the following example:

concs:
   parameters: # Put as many parameters as necessary
     HCHO:  # this name must match the name of a species in the model
        dir: /home/users/ipison/cif/  # directory where the monitor files are located
        file: coman_monitor_lastformat.nc  # monitor file dedicated to this parameter; here, it either contains the matching levels in the model, or the automatic placement in the vertical is enabled

For satellite data, the paragraph is named satellites, related to the satellites plugin, and is organised as in the following example (see also tutorials such as this one):

satellites:
   parameters: # Put as many parameters as necessary
     NO2:  # this name must match the name of a species in the model
        dir: /home/users/afortems/PYTHON/pycif_monitor/ # directory where the monitor files are located
        file: monitor_OMIQA4ECV_NO2_201502_7day.nc    # monitor file dedicated to this parameter, as provided by the satellite described in the following items
        formula: 3    # formula to use to compute the equivalent of the data (e.g. from model's cells to columns)
        pressure: Pa  # units in which the pressure levels are provided
        product: column  # type of data provided (often column; can be a level, i.e. a partial column)
        chosenlev: 0    # level retained (relevant when partial columns are provided)

Observation information computed by pyCIF

From the information provided by the user, pyCIF deduces further information that is useful in the model’s world:

i:

latitudinal index of the grid point corresponding to lon/lat

j:

longitudinal index of the grid point corresponding to lon/lat

level:

level number in the model

tstep:

time-step number at which the observation starts in the sub-simulation

dtstep:

number of time steps in the model over which the measurement spans for the corresponding sub-period

dtstep_glo:

total number of time steps in the model over which the measurement spans, across the whole chain of sub-simulations

tstep_glo:

time-step number at which the observation starts, counted over the complete chain of model sub-simulations

Observations are often not snapshots. Instead, they overlap with several model time steps. In pyCIF, time steps are computed as illustrated below:

                  |    1st sub-period     |    2nd sub-period     |
Global time scale | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |10 |11 |12 |
Local time scale  | 1 | 2 | 3 | 4 | 5 | 6 | 1 | 2 | 3 | 4 | 5 | 6 |
Observation 1                    |sampling period|
Observation 2                            |sampling period|

In the example, the transport model is run over two chained sub-periods of six time steps each.

Observation #1 starts during time-step 5 of the first period and ends during time-step 2 of the second period. In that case, tstep_glo is 5 and dtstep_glo is 4. When pyCIF computes a model sub-period, it updates the time steps locally: tstep is 5 for the first period and 1 for the second period, and dtstep is 2 for both periods.

Observation #2 starts during time-step 1 of the second period and ends during time-step 4 of the same period. In that case, tstep is 1 for the second period and tstep_glo is 7; dtstep and dtstep_glo are both 4.
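
The following hypothetical sketch (not pyCIF’s actual implementation) illustrates how the local tstep and dtstep of an observation can be derived within one sub-period from its date and duration, assuming regular model time steps:

import datetime

def local_timesteps(obs_date, obs_duration_hours, subperiod_start, step, nsteps):
    """Return (tstep, dtstep) of an observation within one model sub-period.

    obs_date: datetime at which the observation begins
    obs_duration_hours: duration of the observation in hours
    subperiod_start: datetime at which the sub-period begins
    step: model time step, as a datetime.timedelta
    nsteps: number of time steps in the sub-period
    """
    obs_end = obs_date + datetime.timedelta(hours=obs_duration_hours)
    # Edges of the model time steps within this sub-period
    edges = [subperiod_start + i * step for i in range(nsteps + 1)]
    # Time steps overlapped by the observation (numbered from 1, as in the figure)
    overlapped = [i + 1 for i in range(nsteps)
                  if obs_date < edges[i + 1] and obs_end > edges[i]]
    return overlapped[0], len(overlapped)

# Observation #1 of the example above, with hourly time steps and an arbitrary
# start date: it begins at 04:30 and lasts 3 hours, i.e. until 07:30
obs_date = datetime.datetime(2019, 1, 1, 4, 30)
step = datetime.timedelta(hours=1)
print(local_timesteps(obs_date, 3, datetime.datetime(2019, 1, 1, 0), step, 6))  # (5, 2)
print(local_timesteps(obs_date, 3, datetime.datetime(2019, 1, 1, 6), step, 6))  # (1, 2)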

When the relevant simulations have run, pyCIF provides, for each observation:

sim:

the matching simulated value

sim_tl:

the matching simulated increments in the tangent-linear model

obs_incr:

the matching increment computed by the adjoint

Utility functions are available in pyCIF to read and dump observation files.

Observation data structure in pyCIF

pyCIF handles observations as pandas.DataFrames. The python module pandas offers very powerful data manipulation tools. Some details on pandas as used in the CIF can be found here.

It is possible to import/export observations from/to a monitor.nc file with the following commands:

from pycif.utils.datastores import dump

# Reading the monitor file into a Pandas dataframe
datastore = dump.read_datastore(your_monitor_file)

print(datastore)

# Exporting to a NetCDF
dump.dump_datastore(datastore, file_monit=your_monitor_file)

If one needs to generate a pyCIF datastore from scratch, it is possible to initialize an empty one and fill it manually:

from pycif.utils.datastores import empty
from pycif.utils.datastores.dump import dump_datastore

# Initializes an empty datastore
datastore = empty.init_empty()

# Fill the datastore manually
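# For instance, rows can be appended as a small pandas DataFrame using the
# standard column names described above (illustrative values only)
import pandas as pd

datastore = pd.concat(
    [
        datastore,
        pd.DataFrame(
            {
                "date": pd.date_range("2019-01-01", periods=3, freq="h"),
                "duration": 1.0,                 # hours
                "station": "MHD",
                "lon": -9.9,
                "lat": 53.33,
                "obs": [1.90, 1.92, 1.91],       # e.g. ppm
                "obserror": 0.01,
            }
        ),
    ],
    ignore_index=True,
)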

# Save the datastore to some netCDF
dump_datastore(datastore, file_monitor="/some/path/where/to/save")

Updating monitors from other versions of pyCIF

The monitor file format evolves with pyCIF. You may have monitor files that are no longer compatible with the latest version of pyCIF. Below are examples of how to update your monitor files.

Old monitors with dates as index

The index in some versions of pyCIF used to be the date corresponding to the measurements. In the latest version, the index is only a list of IDs and the date is stored as an independent column.

import xarray as xr

# Open with xarray
ds = xr.open_dataset("old_monitor.nc")

# Create a new column
ds["date"] = (("index"), ds.index.data)

# Replace the date index by a simple range of IDs
ds = ds.assign_coords(index=range(len(ds.index.data)))

# Save new monitor
ds.to_netcdf("new_monitor.nc")