.. _release.1.0.0-notes: ========================= smash 1.0.0 Release Notes ========================= The smash 1.0.0 release brings major updates aimed at simplifying the user experience while continuing the ongoing work to improve the handling, fix possible bugs, clarify the documentation. The highlights are: - ``PyPI installation`` `smash` is now available on `PyPI `__ with the following distribution name ``hydro-smash``. - ``CPU parallel computation`` CPU parallel computation in simulation and optimization. This has been made possible using the `OpenMP `__ library, whose most of the directives are handled by Tapenade to generate a parallel adjoint code. - ``Memory management`` The memory management has been improved in two critical sections: - Atmospheric data storage The memory management of atmospheric data has been improved by using sparse matrices to store precipitation, potential evapotranspiration and snow data. - Storage of variables during the adjoint code forward sweep The time-stepping loop fills up the memory space during the forward sweep which can cause the code to crash. To reduce the memory peak, checkpoints have been added during the time-stepping loop. - ``lat-lon meshing`` It is now possible to work with spatial data whose unit is in degree. - ``bayesian estimation`` A bayesian estimation module has been added to perform parameter estimation and uncertainty quantification .. hint:: See the :ref:`math_num_documentation.bayesian_estimation` section. - ``modular structure`` The structure of the model has been modularised so that the snow, hydrological and routing modules can now be selected independently. - ``API redesign`` The API has been redesigned. The essence remains the same, but there have been a lot of changes to the names of methods and arguments and the way variables are passed to functions. ------------ Contributors ------------ This release was made possible thanks to the contributions of: - François Colleoni (``__) - Ngo Nghi Truyen Huynh (``__) - Benjamin Renard (``__) - Thomas de Fournas (``__) - Apolline El Baz (``__) - Pierre-André Garambois (``__) - Maxime Jay-Allemand (``__) --------------- Compatibilities --------------- - The `smash` package is now available on `PyPI `__ and can be installed as follows: .. code-block:: none pip install hydro-smash - The `gdal `__ package has been replaced by `rasterio `__. This change has been made purely to improve the distribution of `smash`. Despite the better performance in terms of raster reading time, gdal does not provide a binary wheel directly, unlike rasterio (see `discussion `__). - The `pytest-cov `__ plugin has been added to the list of dependencies to perform test coverage. - The `black `__ package has been removed from the list of dependencies and replaced by the `ruff `__ package. - The `SALib `__ package has been removed from the list of dependencies. The sensitivity analysis can still be performed but external to `smash`. ------------ Deprecations ------------ Signatures sensitivity ********************** The method to perform sensitivity analysis on hydrological signatures is no longer available. Sensitivity analysis is not specific to `smash` and can be considered as `smash` post-processing. This is why we have decided to withdraw the direct use of sensitivity analysis tools in `smash`, without preventing independent post-processing. Nelder-Mead optimizer ********************* One of the optimizers available for spatially uniform parameter optimization was the ``Nelder-Mead`` optimizer. Seeing no particular advantage in using this optimizer rather than the ``step-by-step`` optimizer, it was decided to withdraw it. kge2 efficiency metric ********************** This efficiency metric has been removed as it is not used and is of no interest compared to the ``kge`` efficiency metric. ------------ Improvements ------------ Atmospheric data sparse storage ******************************* The memory management of atmospheric data has been improved by using sparse matrices to store precipitation, potential evapotranspiration and snow data. The previous sparse storage method consisted of simply passing from a 2D matrix storage of size (``nrow``, ``ncol``) to a 1D vector of active cells of size (``nac``) where :math:`\text{nac} \leq \text{nrow} \times \text{ncol}`. In this version, this method has been improved by adding management of redundant values in the input data. We know for precipitation, potential evapotranspiration and snow that most of the arrays are filled with 0. If more than 50% of the array is filled with 0, then it's worth saving only the indices and values of cells whose value is greater than 0. This method drastically reduces the amount of memory used by not storing arrays where it is not raining or arrays of potential evapotranspiration in the middle of the night. Adjoint code time-stepping checkpoints ************************************** We use checkpoint to reduce the maximum memory usage of the adjoint model. Without checkpoints, the maximum memory required is equal to :math:`K \times T`, where :math:`K \in \left[0, \inf \right[` is the memory used at each time step and :math:`T \in \left[1, \inf \right[` the total number of time steps. With checkpoints, the maximum memory required is equal to :math:`K \times C + K \times \frac{T}{C}`, where :math:`C \in \left[1, T \right]` is the number of checkpoints. Finding out what value of :math:`C` minimizes it, :math:`C` must be equal to the square root of the number of time steps :math:`T`. :math:`K \times C + K \times \frac{T}{C}` becomes :math:`2K \sqrt{T}`. Therefore, the memory gain is equivalent to :math:`M = 1 - \frac{2}{\sqrt{T}}` Modular structure ***************** In the previous version, the model structure was defined by modifying the structure option in setup. The possible choices were ``gr-a``, ``gr-b``, ``gr-c`` and ``gr-d`` and described both the ``hydrological`` module and the ``routing`` module. Now, the structure is defined via 3 modules, ``snow`` module containing ``zero`` and ``ssn``, ``hydrological`` module containing ``gr4``, ``gr5``, ``grd``, ``loieau`` and ``vic3l`` and ``routing`` module containing ``lag0``, ``lr`` and ``kw``. The combination of the 3, forms a structure. Below is the table of equivalence between the two versions: .. list-table:: Equivalence between v0.5 and v1.0 :widths: 25 25 :header-rows: 1 * - v0.5 - v1.0 * - ``gr-a`` - No equivalence * - ``gr-b`` - No equivalence * - ``gr-c`` - No equivalence * - ``gr-d`` - ``zero`` - ``grd`` - ``lr`` Mesh generation *************** The mesh generation has been improved to be equally effective at delineating a catchment regardless of the domain size of the input flow direction file. In the previous version, it was necessary to crop the flow direction file over a region to speed up calculations. Now, before delineating a catchment, the flow direction file is clipped by taking as its maximum extension the number of cells required to delineate a hypothetically rectangular catchment one cell wide. In this way, it is possible to create a mesh of a French catchment using worldwide flow direction. Reading of atmospheric and physiograhic data ******************************************** The reading of geo-referenced data (i.e. precipitation, snow, descriptor, etc) has been improved to manage: - ``Resolution missmatch`` Nearest-Neighbour resample algorithm is used to match the resolution of the data and the resolution of the mesh. - ``Overlap missmatch`` The reading window is correctly shifted to overlap the data and the mesh on the nearest overlapping cell. - ``Out Of Bound`` In the previous version, when the mesh extent was larger than the data extent a not a very informative error was returned. It is now possible to read data whose extent partially include the mesh extent. The corresponding out of bound extent will be filled in with no data. A warning is returned if one of the 3 cases below is encountered when reading the data. Mean atmospheric data ********************* Computing the spatial averages of atmospheric data for each catchment took a considerable amount of time, particularly on large domains with a lot of catchments. It still take some times but has been reduced by more than half. The spatial averages of atmospheric data are optional and are only required for precipitation when optimizing with signatures. Therefore, we added a new option in the setup, ``compute_mean_atmos`` which by default is set to True, but can be set to False to disable the computation of the spatial averages of atmospheric data. Hydrograph segmentation and signatures-based optimization ********************************************************* The baseflow separation method used in the hydrograph segmentation algorithm has been rewritten in Fortran to improve the efficiency of signature calculation, especially for signature-based optimization. The multi-criteria optimization method is now applicable for all studied event-based signatures. Model attributes **************** As the API has been redesigned, the attributes of the Model object have changed, below is the equivalence between the two versions: .. list-table:: Equivalence between v0.5 and v1.0 :widths: 25 25 :header-rows: 1 * - v0.5 - v1.0 * - ``Model.setup`` - ``Model.setup`` * - ``Model.mesh`` - ``Model.mesh`` * - ``Model.input_data`` - ``Model.response_data``, ``Model.physio_data``, ``Model.atmos_data`` * - ``Model.parameters`` - ``Model.rr_parameters`` * - ``Model.states`` - ``Model.rr_initial_states`` * - ``Model.output`` - ``Model.response``, ``Model.rr_final_states`` Only two attributes have not been changed, ``setup`` and ``mesh``. For the rest, either they have been split into several attributes (``input_data`` and ``ouput``) or the structure of the attribute itself has been modified (``parameters`` and ``states``). Below some equivalences for each modified attributes on how to access variables. .. list-table:: Equivalence between v0.5 and v1.0 for Model.input_data :widths: 25 25 :header-rows: 1 * - v0.5 - v1.0 * - ``Model.input_data.qobs`` - ``Model.response_data.q`` * - ``Model.input_data.prcp`` - ``Model.atmos_data.prcp`` * - ``Model.input_data.mean_prcp`` - ``Model.atmos_data.mean_prcp`` * - ``Model.input_data.descriptor`` - ``Model.physio_data.descriptor`` .. list-table:: Equivalence between v0.5 and v1.0 for Model.parameters :widths: 25 25 :header-rows: 1 * - v0.5 - v1.0 * - ``Model.parameters.cp`` - ``Model.get_rr_parameters("cp")`` .. list-table:: Equivalence between v0.5 and v1.0 for Model.states :widths: 25 25 :header-rows: 1 * - v0.5 - v1.0 * - ``Model.states.hp`` - ``Model.get_rr_initial_states("hp")`` .. list-table:: Equivalence between v0.5 and v1.0 for Model.output :widths: 25 25 :header-rows: 1 * - v0.5 - v1.0 * - ``Model.output.qsim`` - ``Model.response.q`` * - ``Model.output.fstates`` - ``Model.rr_final_states`` * - ``Model.output.fstates.hp`` - ``Model.get_rr_final_states("hp")`` ------------ New Features ------------ CPU parallel computation ************************ It is now possible to run parallel simulations within a single Model object. This has been made possible using the `OpenMP `__ library, whose most of the directives are handled by Tapenade to generate a parallel adjoint code. To activate this option, simply pass a number of CPUs greater than 1 to the ``common_options`` argument of the various simulation methods (``forward_run``, ``optimize``, ``bayesian_optimize``, etc.). .. code-block:: python >>> model.forward_run() # Sequential run >>> model.forward_run(common_options={"ncpu": 5}) # Parallel run on 5 CPUs Lat-Lon meshing *************** It is now possible to work with spatial data whose unit is in degree. A mesh can be produced from flow direction whose unit is the degree, so that atmospheric data can be read on the same projection. Bayesian estimation ******************* A bayesian estimation module has been added to perform parameter estimation and uncertainty quantification. .. hint:: See the :ref:`math_num_documentation.bayesian_estimation` section and `smash.Model.bayesian_optimize` API reference. This bayesian estimation method can be invoked as follows: .. code-block:: python >>> model.bayesian_optimize() Strutural error mu and sigma can be accessed as follows once the bayesian estimation has been performed: .. code-block:: python >>> model.get_serr_mu() >>> model.get_serr_sigma() New snow module *************** A new snow module has been introduced: - ``ssn`` This snow module is a simple degree-day snow module. .. hint:: See the :ref:`Snow Module ` section. This snow module can be selected using the ``snow_module`` setup option: .. code-block:: yaml snow_module: "ssn" Precipitation partitioning ************************** Since a snow module has been introduced, a partitioning of the total precipitation into liquid and solid precipitation has been implemented. The partitioning method is derived from a classical parametric S-shaped curve :cite:p:`garavaglia2017impact`. The precipitation partitioning can be selected using the ``prcp_partitioning`` setup option: .. code-block:: yaml prcp_partitioning: True New hydrological modules ************************ 4 new hydrological modules have been introduced: - ``gr4`` This hydrological module is derived from the GR4 model :cite:p:`perrin2003improvement`. - ``gr5`` This hydrological module is derived from the GR5 model :cite:p:`LeMoine_2008`. - ``loieau`` This hydrological module is derived from the GR model :cite:p:`Folton_2020`. - ``vic3l`` This hydrological module is derived from the VIC model :cite:p:`liang1994simple`. .. hint:: See the :ref:`Hydrological Module ` section. These hydrological modules can be selected using the ``hydrological_module`` setup option. .. code-block:: yaml hydrological_module: "gr5" hydrological_module: "loieau" hydrological_module: "vic3l" New routing modules ******************* 2 new routing modules have been introduced: - ``lag0`` This routing module is a simple aggregation of upstream discharge to downstream following the drainage plan - ``kw`` This routing module is based on a conceptual 1D kinematic wave model that is numerically solved with a linearized implicit numerical scheme :cite:p:`ChowAppliedhydrology`. This is applicable given the drainage plan :math:`\mathcal{D}_{\Omega}\left(x\right)` that enables reducing the routing problem to 1D. .. hint:: See the :ref:`Routing Module ` section. These routing modules can be selected using the ``routing_module`` setup option. .. code-block:: yaml routing_module: "lag0" routing_module: "kw" New efficiency metrics ********************** 4 new efficiency metrics have been introduced: - ``NNSE`` The Normalized Nash-Sutcliffe Efficiency - ``MAE`` The Mean Absolute Error - ``MAPE`` The Mean Absolute Percentage Error - ``MSE`` The Mean Squared Error .. hint:: See the :ref:`math_num_documentation.efficiency_error_metric` section. These efficiency metrics can be selected using the ``cost_options`` argument in a simulation method. .. code-block:: python >>> model.forward_run(cost_options={"jobs_cmpt": "mape"}) >>> model.optimize(cost_options={"jobs_cmpt": "nnse"}) or in the `smash.metrics` method .. code-block:: python >>> smash.metrics(model, "mse") Efficiency metric discharge transformation ****************************************** It is now possible to apply transformations to the discharge to calculate the efficiency metrics. Two transformations are available: - ``sqrt`` Square root transformation - ``inv`` Multiplicative inverse transformation This can be used as follows: .. code-block:: python >>> model.optimize(cost_options={"jobs_cmpt": ["kge", "kge"], ... "jobs_cmpt_tfm": ["keep", "inv"]}) Return optional variables ************************* It is now possible to return certain optional variables directly from the simulation function call. Each function has its own possible optional variables to return. It can be used as follows: .. code-block:: python >>> ret = model.optimize(return_options={"q_domain": True, "iter_cost": True, "cost": True}) >>> ret cost: iter_cost: q_domain: time_step: >>> ret.cost 0.09739679098129272 >>> ret.iter_cost [0.64318967 0.09739679] >>> ret.q_domain [[[-9.9000000e+01 -9.9000000e+01 -9.9000000e+01 ... -9.9000000e+01 -9.9000000e+01 -9.9000000e+01] ... [-9.9000000e+01 -9.9000000e+01 -9.9000000e+01 ... -9.9000000e+01 -9.9000000e+01 -9.9000000e+01]]] >>> ret.q_domain.shape # shape (nrow, ncol, ntime_step) (28, 28, 1440) Two variables can be returned at certain time steps only, ``q_domain`` and ``rr_states`` through the ``time_step`` option, by default all the time steps are returned. .. code-block:: python >>> ret = model.optimize(return_options={"q_domain": True, "rr_states": True}) >>> ret.q_domain.shape # shape (nrow, ncol, ntime_step) (28, 28, 1440) >>> len(ret.rr_states) 1440 >>> ret = model.optimize(return_options={"q_domain": True, "rr_states": True, ... "time_step": model.setup.end_time}) >>> ret.q_domain.shape # shape (nrow, ncol, ntime_step) (28, 28, 1) >>> len(ret.rr_states) 1 Multi-linear/polynomial descriptors *********************************** It is now possible, only for ``multi-linear`` and ``multi-polynomial`` mappings, to choose the descriptors to be linked to the optimized parameters. In the previous version, all the descriptors where choosen for all optimized parameters. It can be specified as follows: .. code-block:: python >>> model.optimize("multi-linear", ... optimize_options={"descriptor": dict(cp=["slope", "dd"], ct=["slope"], ... kexc=["slope", "dd"], llr=["dd"])}) New early stopping feature for ANN-based regionalization ******************************************************** A new feature for ANN-based regionalization method to stop training when the loss function does not decrease below the current optimal value for ``early_stopping`` consecutive epochs. It can be used as follows: .. code-block:: python >>> model.optimize(mapping="ann", ... optimize_options={"termination_crit": dict(early_stopping=10)}) Optimize At epoch 1 J = 1.150821 |proj g| = 0.002341 At epoch 2 J = 1.142061 |proj g| = 0.002610 ... At epoch 26 J = 0.103324 |proj g| = 0.005455 At epoch 27 J = 0.089769 |proj g| = 0.016618 At epoch 28 J = 0.104150 |proj g| = 0.019718 ... At epoch 36 J = 0.185123 |proj g| = 0.011936 At epoch 37 J = 0.179911 |proj g| = 0.011819 Training: 92%|██████████████████████████▊ | 37/40 [00:30<00:02, 1.23it/s] New multiple optimize method **************************** A new simulation method has been introduced, ``smash.Model.multiple_optimize``. In a similar way to ``smash.Model.multiple_forward_run``, it is now possible to run multiple optimization from different starting points in parallel. The output of this method can be passed to the ``smash.Model.multiset_estimate`` method. It can be used as follows: .. code-block:: python >>> problem = { 'num_vars': 4, 'names': ['cp', 'ct', 'kexc', 'llr'], 'bounds': [[1, 2000], [1, 1000], [-20, 5], [1, 1000]] } >>> sr = generate_samples(problem, n=3, random_state=11) >>> mopt = smash.multiple_optimize( model, samples=sr, optimize_options={"termination_crit": {"maxiter": 2}} ) >>> mest = smash.multiset_estimate(model, multiset=mopt) New metric method ***************** A new method to compute efficiency metric or error from a `smash.Model` object been introduced. It can be used as follows: .. code-block:: python >>> model.optimize() >>> ret = smash.metrics(model, "kge") >>> ret array([0.96001273, 0.86749017, 0.81862521]) New default optimize options method *********************************** New methods, `smash.default_optimize_options` and `smash.default_bayesian_optimize_options` have been added to retrieve what default options will be used with a certain ``mapping`` and ``optimizer``. These methods will return a dictionary containing default value that can be modified afterwards by the user. It can be used as follows: .. code-block:: python >>> opt_u = smash.default_optimize_options(model, mapping="uniform") >>> opt_u { 'parameters': ['cp', 'ct', 'kexc', 'llr'], 'bounds': { 'cp': (1e-06, 1000.0), 'ct': (1e-06, 1000.0), 'kexc': (-50, 50), 'llr': (1e-06, 1000.0) }, 'control_tfm': 'sbs', 'termination_crit': {'maxiter': 50}, } >>> opt_u["termination_crit"] = 100 >>> opt_u { 'parameters': ['cp', 'ct', 'kexc', 'llr'], 'bounds': { 'cp': (1e-06, 1000.0), 'ct': (1e-06, 1000.0), 'kexc': (-50, 50), 'llr': (1e-06, 1000.0) }, 'control_tfm': 'sbs', 'termination_crit': {'maxiter': 100}, } >>> model.optimize(mapping="uniform", optimize_options=opt_u) New optimize control information method *************************************** New methods, `smash.optimize_control_info` and `smash.bayesian_optimize_control_info` have been added to retrieve control vector information from a certain ``mapping``, ``optimizer``, ``optimize_options`` and ``cost_options``. These methods will return a dictionary containing control vector information such as, initial value, name of each element, bounds, kind of bounds, etc.. These methods can be used to check that the control vector created for a given optimization configuration is correct. In addition, in the case of bayesian estimation, the names of each element in the control vector can be retrieved in this way in order to impose prior distribution on the control vector. .. code-block:: python >>> control_info = smash.optimize_control_info(model) >>> control_info { 'l': array([-13.815511 , -13.815511 , -4.6052704, -13.815511 ], dtype=float32), 'l_bkg': array([ 1.e-06, 1.e-06, -5.e+01, 1.e-06], dtype=float32), 'n': 4, 'name': array(['cp0', 'ct0', 'kexc0', 'llr0'], dtype='