Datastreams#

Available Datastreams#

The following sub-types and datastreams are implemented in pyCIF so far:

Documentation#

Description#

The datastream Plugin type includes interfaces to input data for pycif, with the exception of observations. It includes the sub-types flux, meteo and field.

It is used for the following purposes:

  1. fetching relevant input files for direct use by, e.g., CTMs, only linking to the original files

  2. reading relevant input files when data manipulation is required, e.g., for defining the control vector or for auxiliary transformations such as temporal interpolation or horizontal regridding

  3. writing data from pycif to the corresponding format; this can be used either when data from pycif needs to be read as input by a CTM, or to share data from pycif in a known standard data format

Required parameters, dependencies and functions#

Functions#

A given datastream Plugin requires the following functions to work properly within pycif:

  • fetch

  • get_domain (optional)

  • read

  • write (optional)

Please find below details on these functions.
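As a rough orientation, these functions live in a plugin module registered among the pycif datastream plugins. A minimal skeleton of such a module is sketched below; the file layout and the exact structure of the input_arguments dictionary are assumptions and may vary with the pyCIF version:

# __init__.py of a hypothetical datastream plugin

from .fetch import fetch            # mandatory
from .get_domain import get_domain  # optional
from .read import read              # mandatory
from .write import write           # optional

# Default values for yaml arguments (see the note under fetch below);
# the doc/default/accepted keys are assumptions for this sketch
input_arguments = {
    "file_freq": {
        "doc": "Frequency of the input files",
        "default": "1MS",
        "accepted": str,
    },
}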

fetch#

The fetch function determines what files and corresponding dates are available for running the present case. The structure of the fetch function is shown below:

pycif.plugins.datastreams.fluxes.flux_plugin_template.fetch(ref_dir, ref_file, input_dates, target_dir, tracer=None, component=None, **kwargs)[source]

Fetch files and dates for the given simulation interval. Determine what dates are available in the input data within the simulation interval. Link reference files to the working directory to avoid interactions with the outer world.

The output should include input data dates encompassing the simulation interval, which means that, e.g., if input data are at the monthly scale and the simulation interval runs from 2010-01-15 to 2010-03-15, the output should at least include the input data dates for 2010-01, 2010-02 and 2010-03.

Note:

The three main arguments (ref_dir, ref_file and file_freq) can either be defined as dir, file and file_freq respectively in the relevant datavect/flux/my_spec paragraph in the yaml, or, if not available there, they are fetched from the corresponding components/flux paragraph. If one of the three needs to have a default value, it can be integrated in the input_arguments dictionary in __init__.py
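For illustration, a corresponding yaml paragraph could look like the sketch below; the plugin name and all values are hypothetical, and the nesting follows the datavect structure described in the Args sections below:

datavect:
  components:
    flux:
      parameters:
        my_spec:
          plugin:
            name: my_flux_plugin    # hypothetical plugin name
            version: std
          dir: /path/to/input/      # passed to fetch as ref_dir
          file: fluxes_%Y%m.nc      # passed to fetch as ref_file
          file_freq: 1MS            # frequency of the input files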

Args:
    ref_dir (str): the path to the input files
    ref_file (str): format of the input files
    input_dates (list): simulation interval (start and end dates)
    target_dir (str): where to copy
    tracer: the tracer Plugin, corresponding to the paragraph
        datavect/components/fluxes/parameters/my_species in the
        configuration yaml; can be needed to fetch extra information
        given by the user
    component: the component Plugin, same as tracer; corresponds to the
        paragraph datavect/components/fluxes in the configuration yaml

Return:
    (dict, dict): returns two dictionaries: list_files and list_dates

    list_files: for each date that begins a period, a list containing
        the names of the files that are available for the dates within
        this period
    list_dates: for each date that begins a period, a list containing
        the date intervals (in the form of a list of two dates each)
        matching the files listed in list_files

Note:

The output format can be illustrated as follows (the dates are shown as strings, but datetime.datetime objects are expected):

list_dates = {
    "2019-01-01 00:00":
        [["2019-01-01 00:00", "2019-01-01 03:00"],
         ["2019-01-01 03:00", "2019-01-01 06:00"],
         ["2019-01-01 06:00", "2019-01-01 09:00"],
         ["2019-01-01 09:00", "2019-01-01 12:00"]],
    "2019-01-01 12:00":
        [["2019-01-01 12:00", "2019-01-01 15:00"],
         ["2019-01-01 15:00", "2019-01-01 18:00"],
         ["2019-01-01 18:00", "2019-01-01 21:00"],
         ["2019-01-01 21:00", "2019-01-02 00:00"]]
}

list_files = {
    "2019-01-01 00:00":
        ["path_to_file_for_20190101_0000",
         "path_to_file_for_20190101_0300",
         "path_to_file_for_20190101_0600",
         "path_to_file_for_20190101_0900"],
    "2019-01-01 12:00":
        ["path_to_file_for_20190101_1200",
         "path_to_file_for_20190101_1500",
         "path_to_file_for_20190101_1800",
         "path_to_file_for_20190101_2100"]
}

In the example above, the native temporal resolution is 3-hourly, and files are available every 12 hours.

Note:

There is no specific rule for sorting dates and files into separate keys of the output dictionaries. The usual rule is to have one dictionary key per input file, therein unfolding all the dates available in the corresponding file; with that rule, each entry of list_files repeats the same file name for every date of the corresponding key.

But any combination of the keys is valid as long as the list of dates in each key corresponds exactly to the file with the same index. Hence, it is acceptable to have, e.g., one key with all dates and files, or one key per date even though there are several dates per file.

The balance between the number of keys and the size of each key should be determined by the standard usage expected for the data. Overall, a good practice is to have one key in the input data for each sub-simulation in which the data will later be used by the model.

For instance, CHIMERE emission files store hourly emissions for CHIMERE sub-simulations, typically 24 hours long. It thus makes sense to have one key per 24-hour period, containing the hourly emissions in each key. A minimal implementation sketch is given after this note.
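To make the expected behaviour concrete, here is a minimal sketch of a fetch implementation for monthly files, with one key per file. This is an illustration only, not an actual pyCIF plugin; it assumes that ref_file contains strftime date patterns (e.g. fluxes_%Y%m.nc) and that one file per month is present in ref_dir:

import os

import pandas as pd


def fetch(ref_dir, ref_file, input_dates, target_dir,
          tracer=None, component=None, **kwargs):
    list_files = {}
    list_dates = {}

    # Months encompassing the simulation interval
    start, end = input_dates
    first = pd.Timestamp(start).to_period("M").to_timestamp()
    for month in pd.date_range(first, end, freq="MS"):
        month_start = month.to_pydatetime()
        month_end = (month + pd.offsets.MonthBegin(1)).to_pydatetime()

        # Link the original file into the working directory
        file_name = month.strftime(ref_file)
        target = os.path.join(target_dir, file_name)
        if not os.path.isfile(target):
            os.symlink(os.path.join(ref_dir, file_name), target)

        # One key per monthly file, each covering one monthly interval
        list_files[month_start] = [target]
        list_dates[month_start] = [[month_start, month_end]]

    return list_files, list_dates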




get_domain (optional)#
pycif.plugins.datastreams.fluxes.flux_plugin_template.get_domain(ref_dir, ref_file, input_dates, target_dir, tracer=None)[source]

Read the information needed to define the horizontal and, if relevant, vertical domain of the data.

There are several possible approaches:

  • read a reference file that must be present in ref_dir

  • read a file among the available data files

  • read a file specified in the yaml, by using the corresponding variable name; for instance, tracer.my_file

From the chosen file, obtain the coordinates of the centers and/or the corners of the grid cells. If corners or centers are not available, deduce them from the available information.

Warning:

The grid must not be overlapping: e.g., for a global grid, the last grid cell must not be the same as the first.

Warning:

Longitudes must be in the range [-180, 180]. For datasets with longitudes beyond -180 or 180, please shift them and adapt the read function accordingly.

Warning:

Sort the center and corner latitudes and longitudes in increasing order.

Note:

If the domain information needs to be read from one of the files returned by the fetch function, one should use the variable tracer.input_files as follows:

import itertools

# Flatten the per-key file lists returned by fetch and take the first file
ref_file = list(itertools.chain.from_iterable(tracer.input_files.values()))[0]

Args:
    ref_dir (str): the path to the input files
    ref_file (str): format of the input files
    input_dates (list): simulation interval (start and end dates)
    target_dir (str): where to copy
    tracer: the tracer Plugin, corresponding to the paragraph
        datavect/components/fluxes/parameters/my_species in the
        configuration yaml; can be needed to fetch extra information
        given by the user

Return:
    Domain: a domain class object, with the definition of the center
        grid cells coordinates, as well as corners
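For illustration, the reading and corner-deduction steps could look like the sketch below for a regular lat/lon NetCDF file. The variable names lon and lat are assumptions, and the final construction of the Domain object is only indicated, since its exact constructor depends on the pycif version:

import os

import numpy as np
import xarray as xr


def get_domain(ref_dir, ref_file, input_dates, target_dir, tracer=None):
    # Read cell centers from a reference file assumed to exist in ref_dir
    ds = xr.open_dataset(os.path.join(ref_dir, ref_file))
    lon = ds["lon"].values  # must be increasing, within [-180, 180]
    lat = ds["lat"].values

    # Deduce corners from centers, assuming a regular, non-overlapping grid
    dlon = lon[1] - lon[0]
    dlat = lat[1] - lat[0]
    lon_corners = np.append(lon - dlon / 2, lon[-1] + dlon / 2)
    lat_corners = np.append(lat - dlat / 2, lat[-1] + dlat / 2)

    # The actual function must build a Domain plugin object from these
    # arrays; a plain dict stands in for it in this sketch
    return {
        "lon_centers": lon, "lat_centers": lat,
        "lon_corners": lon_corners, "lat_corners": lat_corners,
    }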




read#
pycif.plugins.datastreams.fluxes.flux_plugin_template.read(self, name, varnames, dates, files, interpol_flx=False, tracer=None, model=None, ddi=None, **kwargs)[source]

Get fluxes from raw files and load them into pyCIF variables.

The list of date intervals and corresponding files is provided directly, as returned by the fetch function. One should loop over dates and files and extract the corresponding temporal slice of data.

Warning:

Make sure to optimize the opening of files. There is a high chance that the same file would otherwise be opened and closed over and over again while looping on the dates. If this is the case, make sure not to close it between consecutive dates.

Args:
    name (str): name of the component
    varnames (list[str]): original names of variables to read; use name
        if varnames is empty
    dates (list): list of the date intervals to extract
    files (list): list of the files matching dates

Return:
    xr.DataArray: the actual data with dimensions:
        time, levels, latitudes, longitudes
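As an illustration of the file-caching advice above, a read sketch could look as follows. It assumes NetCDF files with a time dimension; the variable selection and the final dimension handling are schematic:

import xarray as xr


def read(self, name, varnames, dates, files, interpol_flx=False,
         tracer=None, model=None, ddi=None, **kwargs):
    # Use name if varnames is empty
    varname = varnames[0] if varnames else name

    opened = {}  # cache so that each file is opened only once
    slices = []
    for (date_start, _), path in zip(dates, files):
        if path not in opened:
            opened[path] = xr.open_dataset(path)

        # Extract the temporal slice matching the start of the interval
        slices.append(
            opened[path][varname]
            .sel(time=date_start, method="nearest")
            .load()
        )

    for ds in opened.values():
        ds.close()

    # Stack the slices into (time, levels, latitudes, longitudes)
    data = xr.concat(slices, dim="time")
    if "levels" not in data.dims:
        data = data.expand_dims("levels", axis=1)
    return data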




write (optional)#