############
Observations
############

.. role:: bash(code)
   :language: bash


pyCIF uses observations to optimize fluxes or other control variables,
or simply to compare them with forward simulations.
The user must provide information on the observations in a NetCDF file.
pyCIF then reads the observations and deduces further information, such as how each observation maps onto the model's grid and time steps.
Within pyCIF, the data is stored in a pandas DataFrame, which can be dumped to and read from compatible NetCDF files.
pyCIF stores observations in NetCDF :bash:`monitor.nc` files, whose format is inspired by the
`ObsPACK <https://www.esrl.noaa.gov/gmd/ccgg/obspack/>`__ standard format.

Observation data to provide
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The user must provide information in one NetCDF file per species per dataset.
The observations are ordered along an index, which is simply the ID number of each observation.

The basic information is:
++++++++++++++++++++++++++

:date: 
  date at which the observation begins.
  The date must be a datetime object.
  When dumped to a NetCDF file, it is encoded as "seconds since YYYY-MM-DD HH:MM:SS", which pyCIF reads back and interprets correctly (see the example codes).
:duration: 
  duration of the observation in hours
:station: 
  ID of the station/instrument/satellite 
:network: 
  name of the network/retrieval 
:parameter: 
  name of the observed parameter or species
:lon:
    longitude of the measurement in degrees East i.e. between -180 degrees (to the West) and 180 degrees (to the East)
:lat:
    latitude of the measurement in degrees North i.e. between -90 degrees (to the South) and 90 degrees (to the North)
:obs:
    observed value, in units consistent with those produced by the observation operator. As of today, ppm should be used as the standard.
:obserror:
    error :math:`\epsilon` on the observation, in the same units as the observation (used to define the matrix :math:`\mathbf{R}` so that :math:`R_{ii} = \epsilon_i^2`)
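
The date encoding described above can be sketched with the standard library alone.
This is an illustration rather than pyCIF code: the reference date and the helper names `encode_date` and `decode_date` are assumptions made for the example.

.. code:: python

    from datetime import datetime, timedelta

    # Hypothetical reference date: any reference works, as long as it is the
    # one written in the "units" attribute of the date variable.
    REFERENCE = datetime(2010, 1, 1, 0, 0, 0)
    UNITS = "seconds since " + REFERENCE.strftime("%Y-%m-%d %H:%M:%S")


    def encode_date(date):
        """Return the number of whole seconds between REFERENCE and date."""
        return int((date - REFERENCE).total_seconds())


    def decode_date(seconds):
        """Inverse of encode_date: turn a number of seconds into a datetime."""
        return REFERENCE + timedelta(seconds=seconds)


    # An observation starting on 2 January 2010 at 12:00
    start = datetime(2010, 1, 2, 12, 0, 0)
    encoded = encode_date(start)
    decoded = decode_date(encoded)

Python's `datetime` works in the proleptic Gregorian calendar, which matches the `calendar` attribute expected in the monitor file.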

Additional information for in-situ data (fixed surface sites, mobile such as aircraft):
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
:alt:
    altitude of the measurement in m a.s.l (above sea level)
:level:
    level number in the model

.. note::
    Please note that the vertical correspondence between observations and simulations is still under construction.
    This means that the :bash:`alt` information is not yet processed by the CIF.
    Only two options are accepted:

        #. provide the model level matching each measurement site in a text file
           indicated in the yaml (see :doc:`the tutorials for comparing observations to simulations</usertutos/first-comp-to-obs/index>`).
           The columns are the site name, latitude, longitude, model level (counting begins at 0 at the surface) and altitude.
           At the moment, the CIF uses only the columns :bash:`site` and :bash:`level`; an example of such a file for the model LMDZ is given below (the CIF does not read the first row).

        #. provide the model level matching each datum directly in the netCDF file, in the column `level` (see example code :doc:`here<code_reformat_monitor_insitu>`).

.. code-block:: text

    #STAT LAT     LON     lmdzLEV  alt(m)
    ALT   82.45   -62.52  1        210
    JFJ   46.548  7.987   10       3580
    MHD   53.33   -9.9    1        8
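
As a rough sketch of how such a file can be interpreted, the snippet below maps each site to its model level.
It is not the reader used by the CIF: the function name `read_station_levels` is hypothetical, and the code assumes that the site is the first column and the level the fourth, as in the example above.

.. code:: python

    def read_station_levels(lines):
        """Map each site name to its model level, skipping the first row."""
        levels = {}
        for line in lines[1:]:       # the first row is a header and is not read
            fields = line.split()    # split on any run of spaces or tabs
            if not fields:           # ignore empty lines
                continue
            site, level = fields[0], int(fields[3])
            levels[site] = level
        return levels


    # Content of the example file, one string per row
    example = [
        "#STAT LAT     LON     lmdzLEV  alt(m)",
        "ALT   82.45   -62.52  1        210",
        "JFJ   46.548  7.987   10       3580",
        "MHD   53.33   -9.9    1        8",
    ]

    station_levels = read_station_levels(example)

Only the site name and the level are kept, since the other columns are not used at the moment.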


.. I can provide codes for getting the number of level(s) for a station from CHIMERE METEO.nc files 

Additional information for satellite data:
++++++++++++++++++++++++++++++++++++++++++


:pavg0:
  the pressure grid, including the pressures at the tops of the satellite levels plus the surface pressure. In Pa or hPa, to be specified in the yaml. Dimensions are `index` (number of data) and `level_pressure`, the maximum number of levels of the satellite's vertical grid plus one for the surface.

:qa0:
  the prior profiles used for the retrieval, on the satellite's pressure grid. Units must be consistent with the field to which the formula is applied. Dimensions are `index` (number of data) and `level`, the maximum number of levels of the satellite's vertical grid.

:ak:
  the averaging kernels, on the satellite's pressure grid. Usually no units. Dimensions are `index` (number of data) and `level`, the maximum number of levels of the satellite's vertical grid. 

The vertical ordering must be the same for the three fields (i.e. the three fields must be either from bottom to top or the reverse).

In these three fields, fill with NaNs any value that is not available or not relevant.
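
The NaN filling can be sketched as follows for soundings that do not all have the same number of levels.
This is only an illustration: the helper `pad_profile` and the numerical values are hypothetical, and appending the NaNs at the end of each profile is an assumption, not a documented pyCIF convention.

.. code:: python

    import math

    NAN = float("nan")


    def pad_profile(values, n_levels):
        """Pad a profile with NaNs so that it holds n_levels entries."""
        if len(values) > n_levels:
            raise ValueError("profile longer than the level dimension")
        return list(values) + [NAN] * (n_levels - len(values))


    # Two hypothetical soundings with 3 and 2 retrieval levels, bottom to top
    ak_profiles = [[0.9, 1.0, 1.1], [0.8, 1.2]]
    # Matching pressures in Pa: surface pressure plus the top of each level
    pavg0_profiles = [
        [100000.0, 70000.0, 40000.0, 10000.0],
        [101000.0, 50000.0, 10000.0],
    ]

    # "level" is the maximum number of levels, "level_pressure" is one more
    n_levels = max(len(profile) for profile in ak_profiles)
    ak = [pad_profile(profile, n_levels) for profile in ak_profiles]
    pavg0 = [pad_profile(profile, n_levels + 1) for profile in pavg0_profiles]

After padding, `ak` has dimensions (index, level) and `pavg0` has dimensions (index, level_pressure), with both fields in the same vertical order.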

For more information on the formulas used for satellites, please consult :doc:`the dedicated page</documentation/plugins/transforms/satellites>`.

Structure of the netCDF file
++++++++++++++++++++++++++++
The netCDF file provided for each species and dataset must have the following structure:

* for in-situ data

.. code-block:: text

   dimensions:
    index = IIII ;
   variables:
    int64 date(index) ;
        date:units = "seconds since YYYY-MM-DD HH:MM:SS" ;
        date:calendar = "proleptic_gregorian" ;
    double duration(index) ;
        duration:_FillValue = NaN ;
    string station(index) ;
    string network(index) ;
    string parameter(index) ;
    double lon(index) ;
        lon:_FillValue = NaN ;
    double lat(index) ;
        lat:_FillValue = NaN ;
    double alt(index) ;
        alt:_FillValue = NaN ;
    double obs(index) ;
        obs:_FillValue = NaN ;
    double obserror(index) ;
        obserror:_FillValue = NaN ;


Code example: :doc:`here (random values) <code_insitu>` and :doc:`here (from given csv) <code_insitu_csv>`

* for satellite data

.. code-block:: text

   dimensions:
    index = IIII ;
    level = LL ;
    level_pressure = LL+1 ;
   variables:
    double qa0(index, level) ;
        qa0:_FillValue = NaN ;
    double ak(index, level) ;
        ak:_FillValue = NaN ;
    double pavg0(index, level_pressure) ;
        pavg0:_FillValue = NaN ;
    int64 date(index) ;
        date:units = "seconds since YYYY-MM-DD HH:MM:SS" ;
        date:calendar = "proleptic_gregorian" ;
    double duration(index) ;
        duration:_FillValue = NaN ;
    string station(index) ;
    string network(index) ;
    string parameter(index) ;
    double lon(index) ;
        lon:_FillValue = NaN ;
    double lat(index) ;
        lat:_FillValue = NaN ;
    double obs(index) ;
        obs:_FillValue = NaN ;
    double obserror(index) ;
        obserror:_FillValue = NaN ;

Code example: :doc:`here<code_satellite>`
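
Before handing a file to pyCIF, it can be useful to check that it contains the variables listed in the structures above.
The sketch below does so on a plain list of variable names; the helper `missing_variables` and the keys `insitu` and `satellite` are hypothetical names chosen for this example, and treating every listed variable as mandatory is an assumption.

.. code:: python

    # Variables common to the two structures listed above
    COMMON = {
        "date", "duration", "station", "network", "parameter",
        "lon", "lat", "obs", "obserror",
    }

    REQUIRED = {
        "insitu": COMMON | {"alt"},
        "satellite": COMMON | {"qa0", "ak", "pavg0"},
    }


    def missing_variables(variable_names, kind):
        """Return the sorted names of the expected variables that are absent."""
        return sorted(REQUIRED[kind] - set(variable_names))


    # An incomplete in-situ file holding only four variables
    missing = missing_variables(["date", "lon", "lat", "obs"], "insitu")

The list of names can be taken from any NetCDF reader, for instance from the variables of a dataset opened with xarray.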

How to specifiy what to do in the yml file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The information on the observations is described in the section datavect.

For in-situ data, the paragraph is named concs (see :doc:`documentation for standard case</documentation/plugins/datavects/standard>`) and is structured as in the following example:

.. code-block:: yaml

   concs:
      parameters: # Put as many parameters as necessary
        HCHO:  # this name must match the name of the species in the model
          dir: /home/users/ipison/cif/  # directory where the monitor files are located
          file: coman_monitor_lastformat.nc  # monitor file dedicated to this parameter; either it contains the matching model levels, or the automatic vertical placement is enabled

For satellite data, the paragraph is named satellites, related to the plugin
:doc:`satellites</documentation/plugins/transforms/complex/satellites>`,
and is structured as in the following example (see also tutorials such as :doc:`this one</documentation/usertutos/first-comp-to-obs/chimere>`):

.. code-block:: yaml

 satellites:
    parameters: # Put as many parameters as necessary
      NO2:  # this name must match the name of the species in the model
        dir: /home/users/afortems/PYTHON/pycif_monitor/  # directory where the monitor files are located
        file: monitor_OMIQA4ECV_NO2_201502_7day.nc  # monitor file dedicated to this parameter, as provided by the satellite described in the following items
        formula: 3  # formula used to compute the equivalent of the data (e.g. from model's cells to columns)
        pressure: Pa  # units in which the pressure levels are provided
        product: column  # type of data provided (often column, can be a level i.e. partial column)
        chosenlev: 0  # level retained (relevant when partial columns are provided)


Observation information computed by pyCIF
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
From the information provided by the user, pyCIF deduces information that is useful in the model's world:

:i:
    latitudinal index of the grid point corresponding to lon/lat
:j:
    longitudinal index of the grid point corresponding to lon/lat
:level:
    level number in the model
:tstep:
    time-step number at which the observation starts in the sub-simulation 
:dtstep:
    number of time steps in the model over which the measurement spans for the corresponding sub-period
:dtstep_glo:
    number of time steps in the model over which the measurement spans
:tstep_glo:
    time-step number in the complete chain of model sub-simulations


Observations are often not snapshots. Instead, they overlap with several model time steps.
In pyCIF, time steps are computed as illustrated below:

.. graphviz:: ../time_steps.dot
    :align: center

In the example, the transport model is run over two chained sub-periods of six time steps each.

Observation #1 starts during time step 5 of the first period and ends during time step 2 of the second period.
In that case, `tstep_glo` is 5 and `dtstep_glo` is 4.
When pyCIF computes a model sub-period, it updates the time steps locally: `tstep` is 5 and 1 for the first and second periods respectively, and `dtstep` is 2 for both periods.

Observation #2 starts during time step 1 of the second period and ends during time step 4 of the same period.
In that case, `tstep` is 1 for the second period and `tstep_glo` is 7; `dtstep` and `dtstep_glo` are both 4.
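
The split of global time steps into local ones can be sketched as below.
This is not the pyCIF implementation: the function `split_time_steps` is hypothetical, and it assumes sub-periods of equal length with time steps and periods numbered from 1, as in the example above.

.. code:: python

    def split_time_steps(tstep_glo, dtstep_glo, steps_per_period):
        """Return {period: (tstep, dtstep)} for one observation.

        Periods and time steps are numbered from 1.
        """
        local = {}
        first = tstep_glo                    # first global time step covered
        last = tstep_glo + dtstep_glo - 1    # last global time step covered
        period = (first - 1) // steps_per_period + 1
        while first <= last:
            period_end = period * steps_per_period
            chunk_last = min(last, period_end)
            tstep = first - (period - 1) * steps_per_period
            dtstep = chunk_last - first + 1
            local[period] = (tstep, dtstep)
            first = chunk_last + 1           # continue in the next sub-period
            period += 1
        return local


    # Observation spanning two sub-periods: tstep_glo = 5, dtstep_glo = 4
    obs_a = split_time_steps(5, 4, steps_per_period=6)

    # Observation contained in the second sub-period: tstep_glo = 7, dtstep_glo = 4
    obs_b = split_time_steps(7, 4, steps_per_period=6)

With six time steps per sub-period, the first call gives `tstep` 5 and `dtstep` 2 in the first period, then `tstep` 1 and `dtstep` 2 in the second one; the second call gives `tstep` 1 and `dtstep` 4 in the second period only.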


When the relevant simulations have run, pyCIF provides, for each observation:

:sim:
    the matching simulated value
:sim_tl:
    the matching simulated increments in the tangent-linear model
:obs_incr:
    the matching increment computed by the adjoint

Utility functions are available in pyCIF to read and dump observation files.


Observation  data structure in pyCIF
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

pyCIF handles observations as `pandas.DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`__ objects.
The Python module pandas offers powerful tools for data manipulation.
Some details on pandas as used in the CIF can be found :doc:`here<pandas>`.

It is possible to import/export observations from/to a :bash:`monitor.nc` file with the following commands:

.. code:: python

        from pycif.utils.datastores import dump

        # Path to your monitor file (placeholder, adapt to your case)
        your_monitor_file = "monitor.nc"

        # Read the monitor file into a pandas DataFrame
        datastore = dump.read_datastore(your_monitor_file)

        print(datastore)

        # Export to a NetCDF file
        dump.dump_datastore(datastore, file_monit=your_monitor_file)

If one needs to build a pyCIF datastore from scratch, it is possible to initialize an empty one and fill it manually:

.. code:: python

        from pycif.utils.datastores import empty
        from pycif.utils.datastores.dump import dump_datastore

        # Initialize an empty datastore
        datastore = empty.init_empty()

        # Fill the datastore manually

        # Save the datastore to a netCDF file (the path must be given as a string)
        dump_datastore(datastore, file_monit="/some/path/where/to/save")

Updating monitors from other versions of pyCIF
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The monitor file format evolves with pyCIF.
Some of your monitor files may no longer be compatible with the latest version of pyCIF.
Below are examples of how to update your monitor files.

Old monitors with dates as index
++++++++++++++++++++++++++++++++

The index in some versions of pyCIF used to be the date corresponding to the measurements.
In the latest version, the index is only a list of IDs and the date is stored as an independent column.

.. code:: python

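    import numpy as np
    import xarray as xr

    # Open the old monitor file with xarray
    ds = xr.open_dataset("old_monitor.nc")

    # Copy the dates held by the old index into a new "date" variable
    ds["date"] = ("index", ds["index"].data)

    # Replace the index with plain integer IDs;
    # assign_coords returns a new dataset, so the result must be reassigned
    ds = ds.assign_coords(index=np.arange(ds.sizes["index"]))

    # Save the new monitor
    ds.to_netcdf("new_monitor.nc")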