Observations#
pyCIF uses observations to optimise fluxes or other control variables,
as well as to compare forward simulations with measurements.
The user must provide observations in a NetCDF file; pyCIF then reads
them and derives further information linking each observation to the
model grid and time steps.
The data is handled internally as a Pandas DataFrame and can be
dumped to or read from NetCDF monitor.nc files, whose format is
inspired by the ObsPACK
standard.
Observation data to provide#
The user must provide information in one NetCDF file per species per dataset. The observations are ordered along an index, which is simply the ID number of each observation.
The basic information is:#
- date:
date at which the observation begins. The date must be a datetime object. When dumped to a NetCDF file, it is encoded as “seconds since YYYY-MM-DD HH:MM:SS” and is correctly read and interpreted by pyCIF afterwards (see example codes)
- duration:
duration of the observation in hours
- station:
ID of the station/instrument/satellite
- network:
name of the network/retrieval
- parameter:
name of the observed parameter or species
- lon:
longitude of the measurement in degrees East i.e. between -180 degrees (to the West) and 180 degrees (to the East)
- lat:
latitude of the measurement in degrees North i.e. between -90 degrees (to the South) and 90 degrees (to the North)
- obs:
observed value in units consistent with the units returned by the observation operator (typically dry-air mole fraction in ppm)
- obserror:
error \(\epsilon\) on the observation, in the same units as the observation (used to define the matrix \(\mathbf{R}\) so that \(R_{ii} = \epsilon_i^2\))
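As a rough illustration of the fields above, the sketch below builds two made-up observation records, checks the lon/lat ranges, encodes the dates as seconds since a reference date, and derives the diagonal of \(\mathbf{R}\) from obserror. The reference date, the helper names (`check_coordinates`, `encode_date`, `decode_date`) and all values are assumptions made for the example; this is not pyCIF code.

```python
from datetime import datetime, timedelta

# Hypothetical reference date for the "seconds since ..." encoding.
REFERENCE = datetime(2019, 1, 1, 0, 0, 0)

# Two made-up observations using the field names listed above.
observations = [
    {"date": datetime(2019, 1, 1, 12, 0), "duration": 1.0, "station": "MHD",
     "network": "demo", "parameter": "CH4", "lon": -9.9, "lat": 53.33,
     "obs": 1.92, "obserror": 0.01},
    {"date": datetime(2019, 1, 2, 0, 0), "duration": 1.0, "station": "JFJ",
     "network": "demo", "parameter": "CH4", "lon": 7.987, "lat": 46.548,
     "obs": 1.89, "obserror": 0.02},
]


def check_coordinates(record):
    """Return True if lon/lat fall in the ranges given in the field list."""
    return -180.0 <= record["lon"] <= 180.0 and -90.0 <= record["lat"] <= 90.0


def encode_date(date, reference=REFERENCE):
    """Encode a datetime as whole seconds elapsed since the reference date."""
    return int((date - reference).total_seconds())


def decode_date(seconds, reference=REFERENCE):
    """Inverse of encode_date: rebuild the datetime from elapsed seconds."""
    return reference + timedelta(seconds=seconds)


encoded_dates = [encode_date(rec["date"]) for rec in observations]

# Diagonal of R: R_ii = epsilon_i ** 2
r_diagonal = [rec["obserror"] ** 2 for rec in observations]
```

With this encoding, the units attribute of the date variable would read "seconds since 2019-01-01 00:00:00".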
Additional information for in-situ data (fixed surface sites, mobile such as aircraft):#
- alt:
altitude of the measurement in m a.s.l (above sea level)
- level:
level number in the model
Note
Please note that the vertical correspondence between observations and simulations is still under construction.
It means that the alt information is not yet processed in the CIF.
Only two options are accepted:
- to provide the number of the level matching each measurement site in a text file, indicated in the yaml (see the tutorials for comparing observations to simulations), with columns: name of the site, latitude, longitude, number of the level in the model (counting begins at 0 at the surface) and altitude; at the moment, only the site and level columns are used in the CIF; see an example of such a file below for the model LMDZ (the first row is not read by the CIF)
- to provide the number of the level matching each data directly in the netCDF file in the column level (see example code here)
#STAT LAT LON lmdzLEV alt(m)
ALT 82.45 -62.52 1 210
JFJ 46.548 7.987 10 3580
MHD 53.33 -9.9 1 8
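To make the layout of this file concrete, here is a minimal sketch of how such a site/level file could be parsed outside pyCIF. The function `read_site_levels` is hypothetical and is not the reader used by the CIF; it assumes whitespace-separated columns in the order shown above, skips the first row, and keeps only the site and level columns.

```python
def read_site_levels(text):
    """Map each site name to its model level from whitespace-separated rows.

    The first row is skipped; columns are assumed to be
    site, latitude, longitude, level, altitude.
    """
    levels = {}
    for line in text.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or incomplete rows
        levels[fields[0]] = int(fields[3])
    return levels


example = """#STAT LAT LON lmdzLEV alt(m)
ALT 82.45 -62.52 1 210
JFJ 46.548 7.987 10 3580
MHD 53.33 -9.9 1 8
"""

site_levels = read_site_levels(example)
```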
Additional information for satellite data:#
- pavg0:
the pressure grid, including the pressures at the tops of the satellite levels plus the surface pressure. In Pa or hPa, to be specified in the yaml. Dimensions are index (number of data) and level_pressure, the maximum number of levels of the satellite’s vertical grid plus one for the surface.
- qa0:
the prior profiles used for the retrieval, on the satellite’s pressure grid. Units must be consistent with the field to which the formula is applied. Dimensions are index (number of data) and level, the maximum number of levels of the satellite’s vertical grid.
- ak:
the averaging kernels, on the satellite’s pressure grid. Usually no units. Dimensions are index (number of data) and level, the maximum number of levels of the satellite’s vertical grid.
The vertical ordering must be the same for the three fields (i.e. the three fields must be either from bottom to top or the reverse).
In these three fields, fill in with NaNs where values are not available or not relevant.
For more information on the formulas used for satellites, please consult the dedicated page.
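The dimension rules above can be sketched as follows for retrievals that do not all have the same number of levels. Everything here is illustrative: the values are made up, the helper `pad` is hypothetical, and placing the NaNs at the end of each profile is an assumption of this sketch rather than a documented pyCIF requirement.

```python
import math


def pad(values, length):
    """Right-pad a list with NaN up to the requested length."""
    return list(values) + [math.nan] * (length - len(values))


# Made-up retrievals with 3 and 2 vertical levels, all ordered bottom to top.
# Each pressure list has one more value than the matching ak/qa0 lists.
retrievals = [
    {"pavg0": [100000.0, 70000.0, 40000.0, 10000.0],
     "qa0": [1.0, 0.8, 0.5], "ak": [0.9, 1.0, 1.1]},
    {"pavg0": [95000.0, 50000.0, 10000.0],
     "qa0": [1.2, 0.6], "ak": [0.8, 1.0]},
]

n_level = max(len(r["ak"]) for r in retrievals)  # "level" dimension
n_level_pressure = n_level + 1                   # "level_pressure" dimension

# One row per observation (the "index" dimension).
ak = [pad(r["ak"], n_level) for r in retrievals]
qa0 = [pad(r["qa0"], n_level) for r in retrievals]
pavg0 = [pad(r["pavg0"], n_level_pressure) for r in retrievals]
```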
Structure of the netCDF file#
The netCDF file provided for each species and dataset must have the following structure:
for in-situ data
dimensions:
index = IIII ;
variables:
int64 date(index) ;
date:units = "seconds since YYYY-MM-DD HH:MM:SS" ;
date:calendar = "proleptic_gregorian" ;
double duration(index) ;
duration:_FillValue = NaN ;
string station(index) ;
string network(index) ;
string parameter(index) ;
double lon(index) ;
lon:_FillValue = NaN ;
double lat(index) ;
lat:_FillValue = NaN ;
double alt(index) ;
alt:_FillValue = NaN ;
double obs(index) ;
obs:_FillValue = NaN ;
double obserror(index) ;
obserror:_FillValue = NaN ;
Code example: here (random values) and here (from given csv)
for satellite data
dimensions:
index = IIII ;
level = LL ;
level_pressure = LL+1 ;
variables:
double qa0(index, level) ;
qa0:_FillValue = NaN ;
double ak(index, level) ;
ak:_FillValue = NaN ;
double pavg0(index, level_pressure) ;
pavg0:_FillValue = NaN ;
int64 date(index) ;
date:units = "seconds since YYYY-MM-DD HH:MM:SS" ;
date:calendar = "proleptic_gregorian" ;
double duration(index) ;
duration:_FillValue = NaN ;
string station(index) ;
string network(index) ;
string parameter(index) ;
double lon(index) ;
lon:_FillValue = NaN ;
double lat(index) ;
lat:_FillValue = NaN ;
double obs(index) ;
obs:_FillValue = NaN ;
double obserror(index) ;
obserror:_FillValue = NaN ;
Code example: here
How to specify what to do in the YAML file#
Observation data streams are declared in the datavect section of the
YAML configuration file, under the components key.
For in-situ data, the component is named concs (see
documentation for the standard datavect)
and is structured as in the following example:
datavect:
plugin:
name: standard
version: std
components:
concs:
parameters: # one entry per species
HCHO: # must match the species name used in the model
plugin:
name: insitu
version: nc
dir: /path/to/monitor/files/
file: monitor_HCHO.nc
For satellite data, the component is named satellites, linked to the
satellites transform
(see also this tutorial):
datavect:
plugin:
name: standard
version: std
components:
satellites:
parameters: # one entry per species
NO2: # must match the species name used in the model
plugin:
name: satellites
version: std
dir: /path/to/monitor/files/
file: monitor_NO2.nc
formula: 3 # formula used to compute the column equivalent
pressure: Pa # pressure units in the satellite file
product: column # type of data (column or partial column)
chosenlev: 0 # level retained for partial columns
Observation information computed by pyCIF#
From the information provided by the user, pyCIF derives further quantities that are useful on the model side:
- i:
latitudinal index of the grid point corresponding to lon/lat
- j:
longitudinal index of the grid point corresponding to lon/lat
- level:
level number in the model
- tstep:
time-step number at which the observation starts in the sub-simulation
- dtstep:
number of time steps in the model over which the measurement spans for the corresponding sub-period
- dtstep_glo:
number of time steps in the model over which the measurement spans
- tstep_glo:
time-step number in the complete chain of model sub-simulations
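For intuition on the i and j indices, the sketch below locates a lon/lat pair on a hypothetical regular 10-degree grid described by its cell edges. The grid, the helper `cell_index` and the 0-based numbering are assumptions made for the example; the actual lookup in pyCIF depends on the model domain and may differ.

```python
from bisect import bisect_right


def cell_index(value, edges):
    """Return the 0-based index of the cell [edges[k], edges[k+1]) holding value."""
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} lies outside the grid")
    # Clamp so that a value sitting on the last edge falls in the last cell.
    return min(bisect_right(edges, value) - 1, len(edges) - 2)


# Hypothetical 10-degree regular grid, described by its cell edges.
lat_edges = [-90.0 + 10.0 * k for k in range(19)]   # 18 latitude cells
lon_edges = [-180.0 + 10.0 * k for k in range(37)]  # 36 longitude cells

# MHD coordinates taken from the example site file.
i = cell_index(53.33, lat_edges)  # latitudinal index
j = cell_index(-9.9, lon_edges)   # longitudinal index
```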
Observations are often not snapshots. Instead, they overlap with several model time steps. In pyCIF, time steps are computed as illustrated below:
![Global time scale (time steps 1 to 12) and local time scale (time steps 1 to 6 in each of the 1st and 2nd sub-periods), with the sampling periods of Observation 1 and Observation 2](../../_images/graphviz-611969706b4c07898423ffd3b8a36f3465ac4d07.png)
In the example, the transport model is run over two chained sub-periods of six time steps each.
Observation #1 starts during time step 5 of the first sub-period and ends during time step 2 of the second sub-period. In that case, tstep_glo is 5 and dtstep_glo is 4. When pyCIF computes a model sub-period, it updates the time steps locally: tstep is 5 for the first sub-period and 1 for the second, and dtstep is 2 in both.
Observation #2 starts during time step 1 of the second sub-period and ends during time step 4 of the same sub-period. In that case, tstep is 1 for the second sub-period and tstep_glo is 7; dtstep and dtstep_glo are both 4.
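The numbering can be reproduced with a few lines of arithmetic. The function below is only a sketch of the bookkeeping described here, not pyCIF's implementation; `observation_time_steps` and its arguments are hypothetical names, and time steps are counted from 1 as in the figure. It recovers tstep_glo 5 and dtstep_glo 4 for observation #1, and tstep_glo 7 and dtstep_glo 4 for observation #2.

```python
def observation_time_steps(start_glo, end_glo, steps_per_period):
    """Return global and per-sub-period time-step numbers for one observation.

    start_glo and end_glo are 1-based global time steps, both inclusive.
    """
    result = {
        "tstep_glo": start_glo,
        "dtstep_glo": end_glo - start_glo + 1,
        "periods": {},
    }
    first_period = (start_glo - 1) // steps_per_period
    last_period = (end_glo - 1) // steps_per_period
    for period in range(first_period, last_period + 1):
        # Global time steps covered by this sub-period.
        period_start = period * steps_per_period + 1
        period_end = (period + 1) * steps_per_period
        # Part of the observation that falls inside this sub-period.
        local_start = max(start_glo, period_start)
        local_end = min(end_glo, period_end)
        result["periods"][period + 1] = {
            "tstep": local_start - period_start + 1,
            "dtstep": local_end - local_start + 1,
        }
    return result


# Observation 1 covers global time steps 5 to 8, observation 2 covers 7 to 10.
obs1 = observation_time_steps(5, 8, steps_per_period=6)
obs2 = observation_time_steps(7, 10, steps_per_period=6)
```

The keys of `periods` are the sub-period numbers (1 for the first sub-period, 2 for the second).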
When the relevant simulations have run, pyCIF provides, for each observation:
- sim:
the matching simulated value
- sim_tl:
the matching simulated increments in the tangent-linear model
- obs_incr:
the matching increment computed by the adjoint
Utility functions are available in pyCIF to read and dump observation files.
Observation data structure in pyCIF#
pyCIF handles observations as pandas DataFrames. See here for details on how pandas is used in CIF.
Reading and writing monitor.nc files:
from pycif.utils.datastores import dump
# Read a monitor file into a Pandas DataFrame
datastore = dump.read_datastore(your_monitor_file)
print(datastore)
# Write back to NetCDF
dump.dump_datastore(datastore, file_monit=your_monitor_file)
Generating a new datastore from scratch:
from pycif.utils.datastores import empty
from pycif.utils.datastores.dump import dump_datastore
# Initialise an empty datastore with the standard column schema
datastore = empty.init_empty()
# Fill columns manually (see monitor.nc field descriptions above)
# ...
# Save to NetCDF
dump_datastore(datastore, file_monit='/some/path/monitor.nc')
Updating monitors from other versions of pyCIF#
Monitor file formats evolve with pyCIF. You may have monitor files that are no longer compatible with the latest version of pyCIF. Below are examples of how to update your monitor files.
Old monitors with dates as index#
The index in some versions of pyCIF used to be the date corresponding to the measurements. In the latest version, the index is only a list of IDs and the date is stored as an independent column.
import numpy as np
import xarray as xr
# Open with xarray
ds = xr.open_dataset("old_monitor.nc")
# Copy the old date index into a new "date" column
ds["date"] = ("index", ds["index"].data)
# Replace the index with integer IDs (assign_coords returns a new dataset)
ds = ds.assign_coords(index=np.arange(ds.sizes["index"]))
# Save new monitor
ds.to_netcdf("new_monitor.nc")