How to make super-observations

Tutorial on how to make super-observations from a set of data and for a given set-up of a CTM.

Super-observations are data to be assimilated which are built from a set of more numerous data. Super-observations can be averages, medians or other combinations of these “raw” data.

Making super-observations is a way to reduce the number of data to assimilate while keeping a relevant amount of information and, potentially, avoiding the assimilation of incompatible information or of redundant information.

The computation of super-observations is not included as such in the CIF as it requires to be carefully thought out for each case.

Nevertheless, following the main steps described here ensures an optimal use of the CIF’s capacities and a good consistency in the various studies.

1. Run a forward simulation with the so-called raw data.

XXX renvoyer a comment faire + points importants = bien ajuster la config du modele = a la fin de cette simu, tout doit etre fixe sauf le jeu d’obs XXX

2. Use the output monitor to select data

From this first simulation, a monitor file is generated: $WORKIDR/obsoperator/fwd_0000/obsvect/satellites/SPEC/monitor.nc. It contains information computed by the CIF with the configuration of the CTM: the spatial coordinates of the grid cell matching each data (i, j, level), its temporal coodinates (tstep, tstep_glo, dtstep) and the equivalent of the observation (sim).

In case of satellite data, an information file is also generated to keep track of the specific data associated to each obs: $WORKDIR/obsoperator/fwd_0000/chain/satellites/default_00001/infos_DATEDEB.nc with DATEDEB the date of the beginning of the simulation XXwhy??XX.

This complete set of information makes it possible to filter out data according to criteria taking into account the simulation e.g. filtering out data with too large a discrepancy with the simulation.

XXX example codes for plotting the data+sim?XXXX

An example code to use monitor and info together, apply some filtering and changes and generate a “clean” monitor is provided here:

Show/Hide Code

from pycif.utils.datastores import dump
import copy
import numpy as np
import pandas as pd
from pycif.utils.netcdf import readnc
import xarray as xr
from itertools import zip_longest


def retrieve_aks(fin,index,list_var): 

   # fin is the original input monitor 
   # containing satellite-specific information: ak, qa0, etc
   # fin is read to retrieve this info for data number index
   index = int(index)
   qa0, ak, pavg0, dryair = readnc(fin, list_var)

   return qa0[index].tolist(), ak[index].tolist(), pavg0[index].tolist(), dryair[index].tolist()

# WORKDIR of the forward simulation with all "raw" data
refdir = '/home/chimereicos/fwd_all_data/obsoperator/fwd_0000/'
# directory for the info file
dirinfo = 'chain/satellites/default_00001'
# directory for the output monitor file
dirmonit = 'obsvect/satellites/CH4/'
# files to use
monitor_in = '{}{}/monitor.nc'.format(refdir,dirmonit)
infos = '{}{}/infos_201901010000.nc'.format(refdir,dirinfo)

# basic data to provide in an input monitor
list_basic_cols = [ 'date', 'duration', 'station', 'network', 'parameter', 'lon', 'lat', 'obs', 'obserror' ]

# Monitor and info data to use
ds = dump.read_datastore(monitor_in)
dsinf = dump.read_datastore(infos,col2dump= ['orig_index', 'file_aks'])

# paste ds and dsinf together to get full information for each obs
ds3 = pd.concat( [ds,dsinf],axis = 1)

# example filter: filter out obs outside the domain
ds3 = ds3.loc[ ~np.isnan(ds3['orig_index']) ] 

# example change: set obserror at a given value
ds3 = ds3.assign(obserror = 20)

# generate a satellite input monitor with these new data:
# for each line in ds2, retrieve qa0,ak,pavg0,dryair from the right file
list_var = ['qa0', 'ak', 'pavg0', 'dryair']
ds4 = ds3.apply(lambda x: retrieve_aks(x.file_aks,x.orig_index,list_var), axis = 1)
# reformat ds4 to put satellite specific info in a xarray
# keep index to keep track of selection - no impact on the CIF
ds5 = pd.DataFrame((item for item in ds4), columns = list_var, index=ds3.index)
fixed_nlevels = len(ds5['ak'].iloc[0])
list_tab = []
for k,var in enumerate(list_var):
  list_tab.append(pd.DataFrame((item for item in ds5[var])).values)
# satellite specific info
ds_sat = xr.Dataset({'qa0': (['index', 'level'], list_tab[0]),
                     'ak': (['index', 'level'], list_tab[1]),
                     'pavg0': (['index', 'level_pressure'], list_tab[2]),
                     'dryair': (['index', 'level'], list_tab[3])},
                     coords={'index': ds5.index,
                             'level': range(fixed_nlevels),
                              'level_pressure': range(fixed_nlevels + 1)})
# basic data 
basic_data = ds3[list_basic_cols].to_xarray()

# merge basic and satellite-specific data
alldata = xr.merge([ds_sat, basic_data])

# create new clean input monitor
alldata.to_netcdf('new_monitor.nc')


XXX what about le bouchage de trous??XXXX

From this clean monitor, it may be useful to re-run a forward simulation, same as in step 1.

3. Generate super-observations

From an output monitor, it is easy to combine data at given spatial and temporal scales, either in the data’s world (using lat, lon, date) or in the model’s grid (using i, j, tstep_glo). Using the matching info file, a super-obs input monitor can then be generated. Using the same type of code as above, the super-observations can be generated from grouping on the dataframe containing the full information i.e. ds3 (after or instead of lines 40-44) then performing the retrieving of the relevant satellite specific information in the same way.

XXX put example codes XXX

4. Use the super-observations

Use the monitor containing the super-observations for a forward simulation and a adjoint test.

If everything is normal, it can then be used as input for inversions with the same configuration of the CTM.