smash 1.0.0 Release Notes#
The smash 1.0.0 release brings major updates aimed at simplifying the user experience, while continuing the ongoing work to improve handling, fix bugs and clarify the documentation. The highlights are:
CPU parallel computation
CPU parallel computation in simulation and optimization. This has been made possible using the OpenMP library, most of whose directives are handled by Tapenade to generate a parallel adjoint code.
Memory management
The memory management has been improved in two critical sections:
- Atmospheric data storage
The memory management of atmospheric data has been improved by using sparse matrices to store precipitation, potential evapotranspiration and snow data.
- Storage of variables during the adjoint code forward sweep
The time-stepping loop fills up memory during the forward sweep, which can cause the code to crash. To reduce the memory peak, checkpoints have been added to the time-stepping loop.
Lat-lon meshing
It is now possible to work with spatial data whose unit is the degree.
Bayesian estimation
A Bayesian estimation module has been added to perform parameter estimation and uncertainty quantification.
Hint
See the Bayesian Estimation section.
Modular structure
The structure of the model has been modularised so that the snow, hydrological and routing modules can now be selected independently.
API redesign
The API has been redesigned. The essence remains the same, but there have been a lot of changes to the names of methods and arguments and the way variables are passed to functions.
Contributors#
This release was made possible thanks to the contributions of:
François Colleoni (inoelloc)
Ngo Nghi Truyen Huynh (nghi-truyen)
Benjamin Renard (benRenard)
Thomas de Fournas (ThomasdeFournas)
Apolline El Baz (asjeb)
Pierre-André Garambois (pag13)
Maxime Jay-Allemand (maximejay)
Compatibilities#
The smash package is now available on PyPI and can be installed as follows:
pip install hydro-smash
The gdal package has been replaced by rasterio. This change has been made purely to improve the distribution of smash: despite its better performance in terms of raster reading time, gdal does not provide a binary wheel directly, unlike rasterio (see discussion).
The pytest-cov plugin has been added to the list of dependencies to perform test coverage.
The black package has been removed from the list of dependencies and replaced by the ruff package.
The SALib package has been removed from the list of dependencies. Sensitivity analysis can still be performed, but externally to smash.
Deprecations#
Signatures sensitivity#
The method to perform sensitivity analysis on hydrological signatures is no longer available. Sensitivity
analysis is not specific to smash
and can be considered as smash
post-processing. This is why we have
decided to withdraw the direct use of sensitivity analysis tools in smash
, without preventing independent
post-processing.
Nelder-Mead optimizer#
One of the optimizers available for spatially uniform parameter optimization was the Nelder-Mead
optimizer. Seeing no particular advantage in using this optimizer rather than the step-by-step
optimizer,
it was decided to withdraw it.
kge2 efficiency metric#
This efficiency metric has been removed as it is not used and is of no interest compared to the kge
efficiency metric.
Improvements#
Atmospheric data sparse storage#
The memory management of atmospheric data has been improved by using sparse matrices to store precipitation,
potential evapotranspiration and snow data. The previous sparse storage method consisted of simply passing
from a 2D matrix storage of size (nrow, ncol) to a 1D vector of active cells of size (nac) where
\(\text{nac} \leq \text{nrow} \times \text{ncol}\). In this version, this method has been improved by
adding management of redundant values in the input data. We know for precipitation, potential
evapotranspiration and snow that most of the arrays are filled with 0. If more than 50% of the array is filled
with 0, then it’s worth saving only the indices and values of cells whose value is greater than 0. This method
drastically reduces the amount of memory used by not storing arrays where it is not raining or arrays of
potential evapotranspiration in the middle of the night.
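As an illustration of the idea (a minimal sketch in Python with NumPy, not smash's internal Fortran implementation), the packed representation only keeps the indices and values of non-zero cells when zeros dominate:
import numpy as np

def pack_field(grid, zero_fraction_threshold=0.5):
    # Keep only the indices and values of non-zero cells when more than
    # `zero_fraction_threshold` of the field is filled with 0.
    flat = grid.ravel()
    if np.count_nonzero(flat == 0) / flat.size > zero_fraction_threshold:
        idx = np.flatnonzero(flat)
        return {"kind": "sparse", "idx": idx, "val": flat[idx], "shape": grid.shape}
    return {"kind": "dense", "val": flat.copy(), "shape": grid.shape}

def unpack_field(packed):
    # Rebuild the full 2D field from the packed representation.
    out = np.zeros(packed["shape"], dtype=packed["val"].dtype)
    if packed["kind"] == "sparse":
        out.ravel()[packed["idx"]] = packed["val"]
    else:
        out.ravel()[:] = packed["val"]
    return out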
Adjoint code time-stepping checkpoints#
We use checkpoints to reduce the maximum memory usage of the adjoint model. Without checkpoints, the maximum memory required is equal to \(K \times T\), where \(K \in \left[0, \infty\right)\) is the memory used at each time step and \(T \in \left[1, \infty\right)\) is the total number of time steps. With checkpoints, the maximum memory required is equal to \(K \times C + K \times \frac{T}{C}\), where \(C \in \left[1, T\right]\) is the number of checkpoints. This expression is minimized when \(C\) equals the square root of the number of time steps, \(C = \sqrt{T}\), for which \(K \times C + K \times \frac{T}{C}\) becomes \(2K\sqrt{T}\). Therefore, the memory gain relative to the non-checkpointed case is \(M = 1 - \frac{2}{\sqrt{T}}\).
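As a quick numerical check of the formula above (a standalone sketch, not part of the smash API):
import math

def checkpoint_memory_gain(n_time_steps):
    # Relative peak-memory gain M = 1 - 2 / sqrt(T) brought by the checkpoints.
    return 1.0 - 2.0 / math.sqrt(n_time_steps)

print(checkpoint_memory_gain(8760))  # ~0.98 for roughly one year of hourly time steps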
Modular structure#
In the previous version, the model structure was defined by modifying the structure option in setup.
The possible choices were gr-a, gr-b, gr-c and gr-d and described both the hydrological module and the routing module. Now, the structure is defined via 3 modules: a snow module containing zero and ssn, a hydrological module containing gr4, gr5, grd, loieau and vic3l, and a routing module containing lag0, lr and kw. The combination of the 3 forms a structure.
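For instance, assuming a dictionary-based setup passed to smash.Model (a minimal sketch showing only the three module keys; all other setup options are omitted):
# Hypothetical setup fragment: the three module choices together define the structure.
setup = {
    # ... other setup options (time step, dates, data directories, etc.)
    "snow_module": "zero",          # no snow operator
    "hydrological_module": "gr4",   # GR4-derived hydrological operators
    "routing_module": "lr",         # linear reservoir routing
}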
Below is the table of equivalence between the two versions:
[Table: equivalence between v0.5 structures and v1.0 module combinations; three of the four v0.5 structures have no direct v1.0 equivalent.]
Mesh generation#
The mesh generation has been improved to be equally effective at delineating a catchment regardless of the domain size of the input flow direction file. In the previous version, it was necessary to crop the flow direction file over a region to speed up calculations. Now, before delineating a catchment, the flow direction file is clipped to a maximum extent corresponding to the number of cells required to delineate a hypothetical rectangular catchment one cell wide. In this way, it is possible to create the mesh of a French catchment using a worldwide flow direction file.
Reading of atmospheric and physiographic data#
The reading of geo-referenced data (i.e., precipitation, snow, descriptors, etc.) has been improved to manage:
Resolution mismatch
A nearest-neighbour resampling algorithm is used to match the resolution of the data to the resolution of the mesh.
Overlap mismatch
The reading window is correctly shifted to overlap the data and the mesh on the nearest overlapping cell.
Out of bound
In the previous version, when the mesh extent was larger than the data extent, a rather uninformative error was returned. It is now possible to read data whose extent partially includes the mesh extent. The corresponding out-of-bound extent will be filled with no data.
A warning is returned if one of the 3 cases above is encountered when reading the data.
Mean atmospheric data#
Computing the spatial averages of atmospheric data for each catchment took a considerable amount of time, particularly on large domains with a lot of catchments. It still takes some time but has been reduced by more than half. The spatial averages of atmospheric data are optional and are only required for precipitation when optimizing with signatures. Therefore, we added a new option in the setup, compute_mean_atmos, which is set to True by default but can be set to False to disable the computation of the spatial averages of atmospheric data.
Hydrograph segmentation and signatures-based optimization#
The baseflow separation method used in the hydrograph segmentation algorithm has been rewritten in Fortran to improve the efficiency of signature calculation, especially for signature-based optimization. The multi-criteria optimization method is now applicable for all studied event-based signatures.
Model attributes#
As the API has been redesigned, the attributes of the Model object have changed. Below is the equivalence between the two versions:
[Table: equivalence between v0.5 and v1.0 Model attribute names.]
Only two attributes have not been changed, setup and mesh. For the rest, either they have been split into several attributes (input_data and output) or the structure of the attribute itself has been modified (parameters and states). Below are some equivalences, for each modified attribute, showing how to access variables.
[Tables: equivalences showing how variables are accessed through each modified attribute in v0.5 and v1.0.]
New Features#
CPU parallel computation#
It is now possible to run parallel simulations within a single Model object. This has been made possible using the OpenMP library, most of whose directives are handled by Tapenade to generate a parallel adjoint code. To activate this option, simply pass a number of CPUs greater than 1 to the common_options argument of the various simulation methods (forward_run, optimize, bayesian_optimize, etc.).
>>> model.forward_run() # Sequential run
>>> model.forward_run(common_options={"ncpu": 5}) # Parallel run on 5 CPUs
Lat-Lon meshing#
It is now possible to work with spatial data whose unit is the degree. A mesh can be produced from a flow direction file whose unit is the degree, so that atmospheric data can be read in the same projection.
Bayesian estimation#
A Bayesian estimation module has been added to perform parameter estimation and uncertainty quantification.
Hint
See the Bayesian Estimation section and smash.Model.bayesian_optimize
API reference.
This Bayesian estimation method can be invoked as follows:
>>> model.bayesian_optimize()
Structural error mu and sigma can be accessed as follows once the Bayesian estimation has been performed:
>>> model.get_serr_mu()
>>> model.get_serr_sigma()
New snow module#
A new snow module has been introduced:
ssn
This snow module is a simple degree-day snow module.
Hint
See the Snow Module section.
This snow module can be selected using the snow_module
setup option:
snow_module: "ssn"
Precipitation partitioning#
Since a snow module has been introduced, a partitioning of the total precipitation into liquid and solid precipitation has been implemented. The partitioning method is derived from a classical parametric S-shaped curve [Garavaglia et al., 2017].
The precipitation partitioning can be selected using the prcp_partitioning
setup option:
prcp_partitioning: True
New hydrological modules#
4 new hydrological modules have been introduced:
gr4
This hydrological module is derived from the GR4 model [Perrin et al., 2003].
gr5
This hydrological module is derived from the GR5 model [Le Moine, 2008].
loieau
This hydrological module is derived from the GR model [Folton and Arnaud, 2020].
vic3l
This hydrological module is derived from the VIC model [Liang et al., 1994].
Hint
See the Hydrological Module section.
These hydrological modules can be selected using the hydrological_module
setup option.
hydrological_module: "gr5"
hydrological_module: "loieau"
hydrological_module: "vic3l"
New routing modules#
2 new routing modules have been introduced:
lag0
This routing module is a simple aggregation of upstream discharge to downstream following the drainage plan.
kw
This routing module is based on a conceptual 1D kinematic wave model that is numerically solved with a linearized implicit numerical scheme [Chow et al., 1998]. This is applicable given the drainage plan \(\mathcal{D}_{\Omega}\left(x\right)\) that enables reducing the routing problem to 1D.
Hint
See the Routing Module section.
These routing modules can be selected using the routing_module
setup option.
routing_module: "lag0"
routing_module: "kw"
New efficiency metrics#
4 new efficiency metrics have been introduced:
NNSE
The Normalized Nash-Sutcliffe Efficiency
MAE
The Mean Absolute Error
MAPE
The Mean Absolute Percentage Error
MSE
The Mean Squared Error
Hint
See the Efficiency & Error Metric section.
These efficiency metrics can be selected using the cost_options
argument in a simulation method.
>>> model.forward_run(cost_options={"jobs_cmpt": "mape"})
>>> model.optimize(cost_options={"jobs_cmpt": "nnse"})
or in the smash.metrics
method
>>> smash.metrics(model, "mse")
Efficiency metric discharge transformation#
It is now possible to apply transformations to the discharge to calculate the efficiency metrics. Two transformations are available:
sqrt
Square root transformation
inv
Multiplicative inverse transformation
This can be used as follows:
>>> model.optimize(cost_options={"jobs_cmpt": ["kge", "kge"],
... "jobs_cmpt_tfm": ["keep", "inv"]})
Return optional variables#
It is now possible to return certain optional variables directly from the simulation function call. Each function has its own possible optional variables to return. It can be used as follows:
>>> ret = model.optimize(return_options={"q_domain": True, "iter_cost": True, "cost": True})
>>> ret
cost: <class 'float'>
iter_cost: <class 'numpy.ndarray'>
q_domain: <class 'numpy.ndarray'>
time_step: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
>>> ret.cost
0.09739679098129272
>>> ret.iter_cost
[0.64318967 0.09739679]
>>> ret.q_domain
[[[-9.9000000e+01 -9.9000000e+01 -9.9000000e+01 ... -9.9000000e+01
-9.9000000e+01 -9.9000000e+01]
...
[-9.9000000e+01 -9.9000000e+01 -9.9000000e+01 ... -9.9000000e+01
-9.9000000e+01 -9.9000000e+01]]]
>>> ret.q_domain.shape # shape (nrow, ncol, ntime_step)
(28, 28, 1440)
Two variables, q_domain and rr_states, can be returned at certain time steps only through the time_step option; by default, all the time steps are returned.
>>> ret = model.optimize(return_options={"q_domain": True, "rr_states": True})
>>> ret.q_domain.shape # shape (nrow, ncol, ntime_step)
(28, 28, 1440)
>>> len(ret.rr_states)
1440
>>> ret = model.optimize(return_options={"q_domain": True, "rr_states": True,
... "time_step": model.setup.end_time})
>>> ret.q_domain.shape # shape (nrow, ncol, ntime_step)
(28, 28, 1)
>>> len(ret.rr_states)
1
Multi-linear/polynomial descriptors#
It is now possible, only for multi-linear and multi-polynomial mappings, to choose the descriptors to be linked to the optimized parameters. In the previous version, all the descriptors were chosen for all optimized parameters. It can be specified as follows:
>>> model.optimize("multi-linear",
... optimize_options={"descriptor": dict(cp=["slope", "dd"], ct=["slope"],
... kexc=["slope", "dd"], llr=["dd"])})
New early stopping feature for ANN-based regionalization#
A new feature has been added to the ANN-based regionalization method to stop training when the loss function does not decrease below the current optimal value for early_stopping consecutive epochs. It can be used as follows:
>>> model.optimize(mapping="ann",
... optimize_options={"termination_crit": dict(early_stopping=10)})
</> Optimize
At epoch 1 J = 1.150821 |proj g| = 0.002341
At epoch 2 J = 1.142061 |proj g| = 0.002610
...
At epoch 26 J = 0.103324 |proj g| = 0.005455
At epoch 27 J = 0.089769 |proj g| = 0.016618
At epoch 28 J = 0.104150 |proj g| = 0.019718
...
At epoch 36 J = 0.185123 |proj g| = 0.011936
At epoch 37 J = 0.179911 |proj g| = 0.011819
Training: 92%|██████████████████████████▊ | 37/40 [00:30<00:02, 1.23it/s]
New multiple optimize method#
A new simulation method has been introduced, smash.Model.multiple_optimize
. In a similar way to
smash.Model.multiple_forward_run
, it is now possible to run multiple optimization from different
starting points in parallel. The output of this method can be passed to the smash.Model.multiset_estimate
method. It can be used as follows:
>>> problem = {
...     'num_vars': 4,
...     'names': ['cp', 'ct', 'kexc', 'llr'],
...     'bounds': [[1, 2000], [1, 1000], [-20, 5], [1, 1000]]
... }
>>> sr = smash.factory.generate_samples(problem, n=3, random_state=11)
>>> mopt = smash.multiple_optimize(
...     model,
...     samples=sr,
...     optimize_options={"termination_crit": {"maxiter": 2}}
... )
>>> mest = smash.multiset_estimate(model, multiset=mopt)
New metric method#
A new method to compute an efficiency or error metric from a smash.Model object has been introduced. It can be used as follows:
>>> model.optimize()
>>> ret = smash.metrics(model, "kge")
>>> ret
array([0.96001273, 0.86749017, 0.81862521])
New default optimize options method#
New methods, smash.default_optimize_options and smash.default_bayesian_optimize_options, have been added to retrieve the default options that will be used with a certain mapping and optimizer. These methods return a dictionary containing default values that can be modified afterwards by the user. It can be used as follows:
>>> opt_u = smash.default_optimize_options(model, mapping="uniform")
>>> opt_u
{
'parameters': ['cp', 'ct', 'kexc', 'llr'],
'bounds': {
'cp': (1e-06, 1000.0),
'ct': (1e-06, 1000.0),
'kexc': (-50, 50),
'llr': (1e-06, 1000.0)
},
'control_tfm': 'sbs',
'termination_crit': {'maxiter': 50},
}
>>> opt_u["termination_crit"]["maxiter"] = 100
>>> opt_u
{
'parameters': ['cp', 'ct', 'kexc', 'llr'],
'bounds': {
'cp': (1e-06, 1000.0),
'ct': (1e-06, 1000.0),
'kexc': (-50, 50),
'llr': (1e-06, 1000.0)
},
'control_tfm': 'sbs',
'termination_crit': {'maxiter': 100},
}
>>> model.optimize(mapping="uniform", optimize_options=opt_u)
New optimize control information method#
New methods, smash.optimize_control_info and smash.bayesian_optimize_control_info, have been added to retrieve control vector information for a certain mapping, optimizer, optimize_options and cost_options. These methods return a dictionary containing control vector information such as the initial value, the name of each element, the bounds and the kind of bounds. They can be used to check that the control vector created for a given optimization configuration is correct. In addition, in the case of Bayesian estimation, the names of each element in the control vector can be retrieved in this way in order to impose prior distributions on the control vector.
>>> control_info = smash.optimize_control_info(model)
>>> control_info
{
'l': array([-13.815511 , -13.815511 , -4.6052704, -13.815511 ], dtype=float32),
'l_bkg': array([ 1.e-06, 1.e-06, -5.e+01, 1.e-06], dtype=float32),
'n': 4,
'name': array(['cp0', 'ct0', 'kexc0', 'llr0'], dtype='<U5'),
'nbd': array([2, 2, 2, 2], dtype=int32),
'nbk': array([4, 0, 0, 0], dtype=int32),
'u': array([6.9077554, 6.9077554, 4.6052704, 6.9077554], dtype=float32),
'u_bkg': array([1000., 1000., 50., 1000.], dtype=float32),
'x': array([5.2983174, 6.214608 , 0. , 1.609438 ], dtype=float32),
'x_bkg': array([200., 500., 0., 5.], dtype=float32),
}
New package architecture#
Two sub-modules have been created: io and factory. The io sub-module contains all the In/Out methods (i.e., smash.io.read_setup, smash.io.save_model, etc.). The factory sub-module contains all the methods that are used to work around the smash.Model object but do not depend on it (i.e., smash.factory.load_dataset, smash.factory.generate_mesh, etc.).
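As an illustration of the new layout (a minimal sketch assuming the cance demo dataset shipped with smash, an HDF5 output path and a save_model(model, path) call signature; check the API reference for exact signatures):
>>> import smash
>>> setup, mesh = smash.factory.load_dataset("cance")  # factory: helpers not tied to a Model instance
>>> model = smash.Model(setup, mesh)
>>> smash.io.save_model(model, "model.hdf5")  # io: read/write helpers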
New precipitation index#
The vertical gap VG
[Emmanuel et al., 2015] has been added to the list of precipitation indices.
New setup options#
New options have been added to the setup:
prcp_access, pet_access, snow_access and temp_access
These options (one per atmospheric data type) can be used to specify how the atmospheric data files are stored (i.e., the directory structure).
Hint
See the Input Data Convention section.
It can be specified as follows:
prcp_access: "%Y/%m/%d" # files stored by YYYY/MM/DD
prcp_access: "%Y/%m" # files stored by YYYY/MM
adjust_interception
When constructing a smash.Model object with a hydrological_module set to gr4 or gr5 at a sub-daily time step, the maximum capacity of the interception reservoir is automatically adjusted. This can be deactivated (to use another value, for example) by setting the adjust_interception option to False.
adjust_interception: False # deactivate
compute_mean_atmos
When constructing a smash.Model object, the spatial averages of atmospheric data are computed. This can be deactivated (to use other values, for example) by setting the compute_mean_atmos option to False.
compute_mean_atmos: False # deactivate
read_snow, snow_directory, snow_conversion_factor, read_temp and temp_directory
With the addition of the snow module, new options for reading snow and temperature data have been added to the setup. They are used in a similar way to the precipitation and potential evapotranspiration data.
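For example (a minimal sketch assuming a dictionary-based setup; the directory paths and the conversion factor value are hypothetical):
setup = {
    # ... other setup options
    "read_snow": True,
    "snow_directory": "/path/to/snow",  # hypothetical path
    "snow_conversion_factor": 1.0,      # hypothetical value
    "read_temp": True,
    "temp_directory": "/path/to/temp",  # hypothetical path
}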
Fixes#
No observed discharge available between end_warmup and end_time#
Solves a problem when the user chooses to optimize the model by taking stations where no observed discharge data is available over the calibration period between end_warmup and end_time.
Error message when no input files were found#
The error message has been improved to make it easier to read if a lot of files are missing.
Missing descriptor or spatially uniform descriptor#
Fix the error message when a descriptor file is missing and add an error if the physiographic descriptor is spatially uniform.
Model object save and read method with sparse storage#
The smash.Model object was not correctly saved, and therefore not correctly read, when it was constructed with the sparse_storage option. FortranDerivedTypeArray objects were not handled.
Error in projected gradient computation with ANN mapping#
There was an error in the computation of the projected gradient when using the ANN mapping. The returned projected gradients were computed based only on \(\nabla_{\boldsymbol{\rho}} \boldsymbol{\theta}\), while the correct term is \(\nabla_{\boldsymbol{\rho}} J = \nabla_{\boldsymbol{\theta}} J \cdot \nabla_{\boldsymbol{\rho}} \boldsymbol{\theta}\).
Optimization of initial states with ANN mapping#
There was a typing error leading to a bug when optimizing the initial states using ANN mapping.
Final states not computed when using ANN mapping#
The final states were not computed when using ANN mapping for optimization. This issue has been fixed by adding a forward run when the optimization process with ANN has been finished.
Error when computing min and max of physiographic descriptor#
There was an error when computing the min and max of a physiographic descriptor when there is no data (i.e., sea in the domain). No-data values from the TIFF file are converted to -99, which leads to a wrong minimum value.
Interception reservoir capacity calculation#
Missing values were not properly taken into account in the calculation of the interception reservoir capacity. Missing values for precipitation and potential evapotranspiration are stored as -99 in Fortran, resulting in erroneous values. This was fixed simply by not incorporating these values into the calculation.