1 Introduction

Large-scale quantitative assessment of water resources, which is useful in hydrology, hydrogeology, agriculture, and other fields, is generally carried out using models that take into account soil–atmosphere interaction and the hydraulic behaviour of the soil (Brocca et al. 2007; Koster et al. 2009; Brocca et al. 2014; Mimeau et al. 2021). The shallow part of the soil, which is the most affected by atmospheric variables, is normally unsaturated. Soil water content (SWC) and soil water potential (SWP) are the main variables to be considered in the evaluation of the hydraulic behaviour of unsaturated soil in relation to rainfall events. In fact, such variables are used as input data for different types of physically based models to quantify the soil water balance (Bittelli et al. 2010, 2015).

In particular, SWC is a fundamental property that affects a large variety of biophysical processes, such as seed germination, plant growth, and plant nutrition. Given that it determines water infiltration, percolation, evaporation, and plant transpiration, it is a key variable for computing the soil water budget. Moreover, SWC is an important quantity often required for agricultural practices (tillage, soil fertilization, and irrigation), assessment of drought conditions, estimation of run-off, management of water resources, triggering of shallow landslides, and impact on climatic features of an area (Koster et al. 2004; Liu et al. 2008; Godt et al. 2009; Ahmad et al. 2010).

SWC is also used to model the coupled hydraulic and mechanical behaviour of unsaturated soils in geotechnical problems such as stability analysis of natural slopes, levees, dikes, and dams. With regard to soil–atmosphere interactions, some researchers have demonstrated that SWC might regulate the atmospheric variables that are relevant to the dynamics of storms and the occurrence of future rainfall (Eltahir 1998). Soil moisture conditions not only reflect past occurrences of rainfall, but also determine a positive feedback mechanism between soil moisture and subsequent precipitation due to convection-related parameters (Findell and Eltahir 1997). However, the identification of a relationship between soil moisture and precipitation feedback is not simple, due to a complex interplay between various factors that favour or inhibit convection initiation (Hauck et al. 2011).

Regarding the coupled hydraulic–mechanical behaviour of unsaturated soils in stability analysis of both natural and artificial slopes, many authors have highlighted how small pores in soil induce a strength contribution enabling slope stability even for slopes that are steeper than the soil friction angle. However, such a contribution decreases under increasing water content (Rianna et al. 2014; Leung and Ng 2013). In most slope stability analyses, the behaviour of an unsaturated soil is modelled using the soil water characteristic curve (SWCC), which represents the relationship between SWC and SWP (Rahardjo et al. 2005; Fredlund et al. 2012; Fredlund 2019). In any case, whenever the phenomenon under investigation concerns soil, plants, or atmosphere interactions, the estimation of SWC is very important when direct measurement is not available.

SWC can be measured with a variety of methods in the spatial scale, ranging from a few cubic centimetres (small soil sensors used in the greenhouse or field applications) to kilometres (global microwave satellites). Different time scales can also be employed with measurements that can be performed on a minute-based scale (by using soil sensors) or daily with satellites. When measurement is dependent on the acquisition schedule, it is performed with discontinuous methods, such as ground-penetrating radar (Gerhards et al. 2008). Bittelli (2011) provides a review of the fundamental principles employed for SWC measurement and a discussion about the time and spatial scale measurements. In many practical applications (for instance, irrigation management at the farm scale), soil moisture sensors are not available, and satellite data do not provide the necessary spatial resolution. In this regard, the International Soil Moisture Network aims at collecting data at the global level for a variety of applications in climate science, hydrology, agriculture, and other fields (Dorigo et al. 2021). Additionally, soil moisture modelling and forecast have become important as management tools and require reliable data for model parameterization and testing. Many models are available for quantification of vadose zone processes as discussed in some recent review papers (Vereecken et al. 2016; Zheng et al. 2019).

Prediction methods for the SWC can be grouped into two main categories: data-driven empirical models and process-based models. The data-driven empirical models used for producing soil moisture maps are mostly based on satellite remote sensing data and microwave radar data. They include statistical methods such as Bayesian models (Kim et al. 2017), support vector machines (Yu et al. 2012; Raghavendra and Deka 2014; Liu et al. 2016), multiple linear regression models (Qiu et al. 2003; Jung et al. 2017; Mei et al. 2019; Cai et al. 2019), random forests (Pan et al. 2019), artificial intelligence methods (Nguyen 2022), and artificial neural network algorithms (Zou et al. 2010; Schmidt et al. 2020; Hegazi et al. 2021). Despite the good prediction capabilities of these models, the relationships between one or more predictors and SWC are rather difficult to interpret from a physical and hydrological point of view (Raghavendra and Deka 2014).

Process-based models focus on the hydrological processes that control the soil moisture transfer mechanisms through physical equations, and calculate the explanatory variables as part of the land surface data assimilation techniques (Dai and Cheng 2022). An extended description of numerical methods and computer code for solving flow equations with process-based models is provided by Bittelli et al. (2015). Observationally obtained factors such as precipitation, atmospheric temperature, and solar radiation can be used for the seasonal dynamic prediction of SWC (Panigrahi and Panda 2003; Bittelli et al. 2010; Valentino et al. 2011; Mo and Lettenmaier 2014).

Process-based models also include numerical models that calculate SWC by solving equations of soil water flow. They are based on water balance parameters and on the main soil hydrological properties, namely the soil water characteristic curve (SWCC) and the hydraulic conductivity function (Van Dam et al. 1997; Šimunek and Van Genuchten 2008). The main advantage of these methods is the physical meaning of the equations used to solve SWC calculations (Lamorski et al. 2013). However, these equations need many soil parameters (hydraulic properties, soil properties, land coverage) that can be difficult to collect over large areas and sometimes require a preliminary calibration of the adopted hydrological parameters (Deng et al. 2011). In this framework, statistical models based on time series analysis and the adoption of robust statistical analysis are an alternative to process-based modelling and can be used with data that are more easily obtained, such as weather data. Robust statistics is a peculiar branch of statistics: broadly speaking, it is referred to as a collection of methods which provide fully reliable estimates and prediction even in the presence of multiple outliers and large errors in the collected data (Atkinson and Riani 2000; Riani 2004).

The aim of this research is to provide a new statistical model to estimate the SWC within a thickness of 1.4 m from ground level. The rationale is to develop a statistical function linking the quantities involved in both infiltration and evapotranspiration phenomena, namely soil volumetric water content, water potential, air temperature, rainfall, and solar radiation, without considering the feedback effect of soil moisture on convection-related parameters. To achieve this goal, a time series of field experimental data was employed. The time series was collected from continuous monitoring over a long period at a test site in Oltrepò Pavese in northern Italy (Bordoni et al. 2021). These data are treated in the framework of robust statistics by combining a robust parametric model with a non-parametric one: least trimmed squares (LTS) and singular spectrum analysis (SSA).

The paper shows how the proposed model can capture the relevant features present in the data and how it can be used for prediction purposes. The approach is based on models introduced in the paper by Rousseeuw et al. (2019) and uses the MATLAB Flexible Statistics Data Analysis (FSDA) toolbox, which is freely available on the MATLAB marketplace, with fine-tuning on seasonal identifications. Other statistical approaches exist, but none of the available software is sufficiently fine-tuned to handle gross errors or outliers (Hosseini et al. 2015).

The main novelty of the proposed model is its ability to accurately predict the SWC at various soil depths based on daily rainfall data. Among the evaluated meteorological variables that were available in our study, it was found that daily air temperature paired with prior rainfall accumulation was the most important. Therefore, the model is able to self-tune and predict seasonal fluctuations using very few field data. Compared with other models, the proposed model requires very little computational effort and uses readily available input data. These characteristics make it particularly suitable for large-scale implementation in areas with scarce experimental data.

The structure of the paper is as follows: Section 2 illustrates the test site, the available observations, and the processing of field data, while Sect. 3 introduces the model and the methodology for analysing a time series which contains a trend, time-varying multiple seasonal components, and isolated or consecutive outliers. Section 4 shows the results of the methodology application and the comparison between model results and time series of field measurements. The relevant aspects of the methodology and results are discussed in Sect. 4 as well. Finally, Sect. 5 presents the concluding remarks.

Fig. 1 Location of the site, scheme of the devices, and soil composition

2 Data and Methods

2.1 Monitoring Test Site

The selected test site is located near the village of Montuè (Fig. 1) in the north-eastern Oltrepò Pavese (northern Italian Apennines, Lombardy region, northern Italy), within the catchment of Scuropasso creek. The test site covers an area of 0.02 km\(^2\) and is representative of the main geological and geomorphological features of the study area.

The bedrock is made of gravel, sand, and poorly cemented conglomerates, overlying marls and gypsum (Vercesi and Scagni 1984). The groundwater is characterized by deep water circulation, which is confined in fractured levels located at different depths in the bedrock, without forming a continuous aquifer. The test site faces east, at altitudes ranging between 170 and 210 m a.s.l. The slope angle is between 26\(^\circ \) and 35\(^\circ \), so the hillslope is very steep along its entire length. The top of the slope is mostly covered by grass and shrubs, while the slope toe is covered by a woodland of black locust trees.

According to Köppen’s classification of world climates, the climatic regime is temperate/mesothermal (Csa: Mediterranean hot summer climate), with a mean yearly temperature of 13 \(^\circ \)C and mean yearly rainfall of around 694 mm (Canevino meteorological station, ARPA Lombardia monitoring network).

The test site is located in a catchment very prone to shallow landslides. In particular, an extreme rainfall event (160 mm accumulated rain in 62 h) that occurred on 27 and 28 April 2009 triggered many shallow landslides (mean density of 29 landslides per km\(^2\)) in the surrounding area (Bordoni et al. 2015) (Fig. 1). The same event caused nine shallow landslides in the test site. This slope was affected by a further shallow failure that occurred between 28 February and 2 March 2014 as a consequence of rainfall of 68.9 mm in 42 h (Bordoni et al. 2015). Shallow landslides on this slope involved areas of a few hundred square metres, with sliding surfaces at 1 m from ground level, mostly corresponding to slope steepness between 30\(^\circ \) and 35\(^\circ \).

The shallow landslides involved clayey-sandy silts and clayey-silty sands, which derive from bedrock weathering and are characterized by three main layers (Fig. 1). In the first layer (US), from the ground surface down to 0.7 m, the soil is clayey-sandy silt with low plasticity, high carbonate content, and unit weight between 16.7 and 17.0 kN/m\(^3\). The second soil layer (LS), between 0.7 and 1.1 m from the ground level, has characteristics similar to those of the US layer but a higher unit weight of 18.6 kN/m\(^3\). At a depth between 1.1 and 1.3 m, the soil has the same textural, plasticity, and density features as the LS layer, but it is characterized by a significant increase in carbonate content, up to 35.3%. This layer can be classified as a calcic horizon (CAL), where the carbonate concretions have higher density than in other levels. The weathered bedrock (WB), composed of sand and poorly cemented conglomerates, begins 1.3 m below the ground surface. These soil layers are characterized by hydraulic conductivity that decreases as depth increases. Hydraulic conductivity was measured in the field through a compact constant head permeameter (Amoozemeter; Amoozegar 1989). The US layer has the highest value, on the order of 10\(^{-5}\) m/s, while LS and CAL are characterized by a saturated hydraulic conductivity equal to 10\(^{-6}\) m/s and 10\(^{-7}\) m/s, respectively. With regard to the mechanical features of the soils, the peak shear strength parameters were obtained through triaxial tests. The US and LS layers are characterized by similar friction angles between 31\(^\circ \) and 33\(^\circ \), and by zero effective cohesion. The CAL layer has a smaller friction angle (26\(^\circ \)) than the other layers, but it has effective cohesion of 29 kPa. Moreover, all the soil layers are over-consolidated, as demonstrated by oedometric tests. Table 1 summarizes the main soil features at the Montuè test site.

Table 1 Description of different soil layers

A monitoring station, which integrates meteorological and hydrological sensors, was installed at the test site in March 2012 (Fig. 1). The meteorological sensors measure rainfall, air temperature, air humidity, atmospheric pressure, wind speed and direction, and net solar radiation. The soil probes measure water content, water potential, and soil temperature. Details on the devices are reported in Table 2.

Table 2 Devices and sensors for hydrological monitoring with data logger: No. 1 CR1000X (Campbell Scientific, Inc.)

Hydrological sensors included six time-domain reflectometer (TDR) probes installed at different depths, three jet-fill tensiometers, and three heat dissipation (HD) sensors installed in pairs at three different depths based on the characteristics of the soil layers. Jet-fill tensiometers and HD sensors are installed in pairs because the jet-fill tensiometer measures soil–water potential higher than −10 J/kg (less negative values, lower absolute values), whereas the HD sensor allows one to obtain soil–water potential lower than −10 J/kg (more negative values, higher absolute values). The HD sensor is based on the Flint et al. (2002) equation to convert the measured change in soil temperature after a constant heating period (Bittelli et al. 2012). All field data were collected by a data logger powered by a photovoltaic panel and recorded with a frequency of 10 min. A more detailed description of the monitoring station and the probes is reported in Bordoni et al. (2015). As described in the following sections, field-measured data over 8 years relating to both soil hydrological quantities and atmospheric variables (Bordoni et al. 2021) were taken into account for the development of the proposed model.

2.2 Field Data Processing

Field measurements of both soil and atmospheric variables were recorded with a frequency of 10 min, but for the purpose of this research, accumulated hourly data were deemed more appropriate. The final hourly time series presented randomly scattered missing values, and this was the first issue to be solved. Several methods exist for replacing missing values, and interpolation is a common choice. A more robust alternative is to replace each missing data point with the median of a small block of data, using some of the previous and subsequent records. When a long run of consecutive values is missing, the replacement would be constant over time; in that case, a small jitter drawn from a uniform distribution could be added to the replaced values.
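As an illustration of the block-median replacement described above, the following minimal Python sketch fills each gap with the median of the neighbouring observed records and optionally adds uniform jitter. The function name, the window half-width, and the jitter amplitude are illustrative assumptions, not the settings used for the Montuè data.

```python
import random
import statistics

def fill_missing_with_block_median(values, half_window=3, jitter=0.0, seed=0):
    """Replace None entries with the median of nearby observed values.

    half_window: number of records inspected before and after the gap
                 (hypothetical default, not taken from the paper).
    jitter:      half-width of the uniform noise added to each replacement;
                 0.0 disables jittering.
    """
    rng = random.Random(seed)
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        block = [x for x in values[lo:hi] if x is not None]
        if not block:
            continue  # nothing observed nearby: leave the gap untouched
        filled[i] = statistics.median(block) + rng.uniform(-jitter, jitter)
    return filled

# Hourly SWC values (made-up numbers) with scattered gaps.
hourly_swc = [0.30, 0.31, None, 0.33, 0.32, None, None, 0.29]
print(fill_missing_with_block_median(hourly_swc, half_window=2))
```

Only originally observed values enter each median, so a replaced point never influences the replacement of its neighbours.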

In the subsequent analysis, daily data are obtained by aggregating or averaging hourly data. Aggregation to the daily scale reduces the impact of the arbitrary choices underlying missing data replacement, and both alternatives discussed above lead to similar outcomes once daily data are considered.
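The daily aggregation step can be sketched as follows, assuming (as is customary, but not stated explicitly here) that rainfall is summed over the day while temperature and SWC are averaged. The helper name and the synthetic hourly values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean

def aggregate_daily(timestamps, values, how):
    """Group hourly records by calendar day and reduce them with 'sum' or 'mean'."""
    groups = defaultdict(list)
    for ts, v in zip(timestamps, values):
        groups[ts.date()].append(v)
    reducer = sum if how == "sum" else fmean
    return {day: reducer(vals) for day, vals in sorted(groups.items())}

# Two days of synthetic hourly records.
start = datetime(2012, 11, 21)
hours = [start + timedelta(hours=h) for h in range(48)]
rain_mm = [0.5 if h % 12 == 0 else 0.0 for h in range(48)]
air_temp_c = [10.0 + (h % 24) * 0.25 for h in range(48)]

daily_rain = aggregate_daily(hours, rain_mm, "sum")      # accumulated rainfall
daily_temp = aggregate_daily(hours, air_temp_c, "mean")  # average air temperature
print(daily_rain)
print(daily_temp)
```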

3 The Statistical Model

Based on the time series of field data discussed in Bordoni et al. (2021), the aim of this research is to provide a unified statistical framework for modelling and prediction of SWC at different soil depths. In this section, the statistical features of the data and the structure of the proposed model are discussed. A preliminary discussion concerns the approach followed to validate the model. We split the data into two parts: in the so-called training part, daily time series (21/11/2012 to 31/12/2019) are used to estimate all parameters of the model. In-sample diagnostics are assessed via residual analysis (see Sect. 4.3). Subsequently, in the testing part, the validation of the model is explored using daily out-of-sample forecasts for the year 2020, with details reported below (see Sect. 4.4). We recall that we have daily data, properly cleaned with the robust filters discussed in Sect. 2.2. Field SWC data measured at depths of 0.2 m and 1.4 m are plotted in Fig. 2 in black and blue, respectively. The red vertical lines of Fig. 2 denote the daily cumulative precipitation. A similar plot is presented in Fig. 3, where the red line denotes the daily average temperature.

Fig. 2 Time series of daily values for SWC (first axis) at soil depth of 0.2 m (black line), 1.4 m (blue line), and the daily accumulated rain (red line on the second axis)

Fig. 3 Time series of daily values for SWC (first axis) at soil depth of 0.2 m (black line), 1.4 m (blue line), and the daily average air temperature

From visual inspection of both Figs. 2 and 3, it is clear that there is a seasonal variation in SWC at all depths, but whether there is a clear direct link between SWC and atmospheric variables is far from obvious.

Table 3 Variables included in our dataset

Table 3 lists all the variables that were originally available in the data loggers. The superscript in \(Y_t^{(m)}\) denotes the value of the outcome at a soil depth of m metres. A similar notation is used for the explanatory variables \(X_{t,j}^{(m)}\) (with \(j=1, 2, \ldots , 9\)). Our aim is to model SWC at a specific soil depth via a minimal set of explanatory variables that are easy to obtain. By “easy to obtain”, we mean that such variables do not require the installation of specific devices in the soil.

The pairwise scatter of daily data does not suggest any specific relationship between the available variables. On the contrary, the time series plot shows some regularity, mostly related to seasonal factors and common trends among the variables. The building bricks of the proposed model are formulated by the regression-like expression

$$\begin{aligned} Y_t^{(m)} ={}& c_0 + \sum _{a=0}^{A} \alpha _a t^{a} + \sum _{j=1}^{P}\theta _j X_{t,j}^{(m)} + \left[ \sum _{b=1}^B \beta _{b,1} \cos (\omega _b t)+ \sum _{b=1}^B \beta _{b,2} \sin (\omega _b t) \right] \left( 1 + \sum _{g = 1}^G \gamma _g t^g \right) \nonumber \\ &+ \delta _1 {\textbf{I}}(t \ge \delta _2) + W_t. \end{aligned}$$
(1)

The details and rationale of model (1) are discussed for monthly data in Rousseeuw et al. (2019); here we review the most important features. The model has four main components: (i) polynomial time terms for long-term trends, with coefficients \(\alpha _a\); (ii) linear effects of time-varying explanatory variables, with coefficients \(\theta _j\) (the same notation is used when the explanatory variables of Table 3 do not carry the superscript (m) or when they enter with a lag k, that is, when \(X_{t-k,j}\) is considered); (iii) a seasonality term modelled by trigonometric waves with coefficients \(\beta _{b,1}\) and \(\beta _{b,2}\), whose magnitude varies over time through \(\gamma _g\); and (iv) a level shift of magnitude \(\delta _1\) at time \(\delta _2\), included to capture a major sudden break in level. The angular frequencies are \(\omega _b = 2b\pi /T\), where T is the length of the seasonal period (1 year of daily data, so \(T = 365.25\)); \(\omega _b\) is therefore determined by the time-frequency of the recorded data.
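To make the regression structure concrete, the sketch below builds one row of a design matrix containing an intercept, polynomial trend terms, explanatory variables, and the harmonic regressors with \(\omega _b = 2b\pi /T\). It is a simplified illustration under stated assumptions: the level shift and the time-varying amplitude factor are omitted, the constant column stands in for both \(c_0\) and the \(a=0\) term, and the function names are hypothetical rather than part of the FSDA toolbox.

```python
import math

T = 365.25  # length of the seasonal period in days, as in the text

def harmonic_terms(t, B, period=T):
    """Return [cos(w_1 t), sin(w_1 t), ..., cos(w_B t), sin(w_B t)], w_b = 2*b*pi/period."""
    row = []
    for b in range(1, B + 1):
        omega_b = 2.0 * b * math.pi / period
        row.extend([math.cos(omega_b * t), math.sin(omega_b * t)])
    return row

def design_row(t, x, A=1, B=1):
    """One regression row: intercept, t^1..t^A, explanatory variables x, harmonics."""
    trend = [float(t) ** a for a in range(1, A + 1)]
    return [1.0] + trend + list(x) + harmonic_terms(t, B)

# Day 10 with two explanatory variables (made-up values), linear trend, three waves.
row = design_row(10, [12.5, 30.0], A=1, B=3)
print(len(row), row[:4])
```

Stacking such rows for all days gives a matrix that can be passed to any linear or robust regression routine.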

For the random disturbance \(W_t\) we assume a Gaussian-like distribution with zero mean and finite variance \(\sigma _W^2\). Despite the non-linear structure, the model introduced in Eq. (1) can be recast into a regression-like framework and enjoys simplicity of estimation coupled with robustness (see Sect. 2.2 in Rousseeuw et al. (2019) for further details). One can note the presence of an intercept \(c_0\). Additionally, the explanatory variables may have a “lag effect” on \( Y_t^{(m)}\); in that case, the explanatory variable is written, for example, as \(X_{t-k,j}^{(m)}\), with integer \(k \ge 1\) (and with the superscript (m) removed when the explanatory variable is related to ground-level measurements).

Model (1) is fitted to all soil depths of \(Y_t^{(m)}\) and, for each single analysis, a careful variable selection is performed. A relatively common structure considers as significant only two predictors: the daily average air temperature \(X_{t,4}\) and the cumulative daily lagged rain \(X_{t-k,7}\), with the value of k depending on the soil depth under investigation. The seasonal sine/cosine waves are significant for values of B in the set \(\{1, 2, 3\}\), depending on the soil depth. At first glance it seems that the interaction term between seasonal sine/cosine and polynomial components is unnecessary. Finally, in some cases we also found a significant linear trend, with negative drift, which might suggest global warming issues.
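The lagged rainfall predictor can be constructed in more than one way, and the text does not fix a single definition. The sketch below shows two plausible readings, both labelled as assumptions: a plain shift of the daily rainfall series by k days, and the rainfall accumulated over the k days preceding day t. Function names and data are hypothetical.

```python
def lagged(series, k):
    """Return x_{t-k} for each day t; the first k entries are undefined (None)."""
    return [None] * k + list(series[:len(series) - k])

def prior_window_sum(series, k):
    """Return the sum of the k values strictly before day t (None until k days exist)."""
    return [sum(series[t - k:t]) if t >= k else None for t in range(len(series))]

daily_rain_mm = [1, 0, 2, 0, 0, 3]
print(lagged(daily_rain_mm, 2))            # daily rain shifted by 2 days
print(prior_window_sum(daily_rain_mm, 3))  # rain accumulated over the prior 3 days
```

In either reading, the lag k would be chosen separately for each soil depth, for instance by scanning candidate values and retaining the one with the most significant coefficient.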

In other words, based on our experimental data, the model introduced in Eq. (1) reduces to the following special case

$$\begin{aligned} Y_t^{(m)} = c_0 + \sum _{a=0}^{A} \alpha _a t^{a}+ \sum _{j=1}^{P}\theta _j X_{t,j} + \left[ \sum _{b=1}^B \beta _{b,1} \cos (\omega _b t) + \sum _{b=1}^B \beta _{b,2} \sin (\omega _b t) \right] + W_t. \end{aligned}$$
(2)
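Because the reduced model is linear in its coefficients, it can be estimated with a robust regression criterion such as least trimmed squares. The following sketch is a generic approximation of LTS based on random elemental starts followed by concentration steps; it is not the FSDA implementation, and the trimming level, number of starts, and synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def lts_fit(X, y, h=None, n_starts=50, n_csteps=20, seed=0):
    """Approximate least trimmed squares via random starts and concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if h is None:
        h = (n + p + 1) // 2  # number of observations kept in the trimmed sum
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=p, replace=False)
        beta = np.linalg.lstsq(X[subset], y[subset], rcond=None)[0]
        for _ in range(n_csteps):
            # keep the h observations with the smallest squared residuals, refit
            keep = np.argsort((y - X @ beta) ** 2)[:h]
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_beta, best_obj = beta, obj
    return best_beta

# Synthetic daily series: intercept + one annual wave + noise, with 10% gross errors.
rng = np.random.default_rng(1)
t = np.arange(400.0)
omega = 2.0 * np.pi / 365.25
X = np.column_stack([np.ones_like(t), np.cos(omega * t), np.sin(omega * t)])
truth = np.array([0.30, 0.05, 0.02])
y = X @ truth + rng.normal(0.0, 0.005, t.size)
bad = rng.choice(t.size, size=40, replace=False)
y[bad] += 0.5  # simulated sensor glitches

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_lts = lts_fit(X, y)
print("OLS:", beta_ols.round(3))
print("LTS:", beta_lts.round(3))
```

In this toy setting the ordinary least-squares intercept is pulled towards the contaminated observations, whereas the trimmed fit stays close to the generating coefficients.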

The focus is now on the specific values of unknown parameters for all studied soil depths. Before discussing the features of significant coefficients in each sub-model at a specific depth, we anticipate that the relevant predictors are a mixture of trend-seasonal deterministic components (low-degree polynomial functions and sine/cosine waves) and atmospheric stochastic components, driven by rain and temperature. These findings have important practical implications, as the water content can be estimated with a very minimal set of explanatory variables for which data values are easily retrievable (simple devices installed on the surface). Additionally, due to the availability of existing software such as Weather Generator (Tomei et al. 2022), future scenarios can be easily simulated for long-term assessment.

3.1 Hints from Singular Spectrum Analysis for Seasonal Components

In this work, a very powerful signal processing technique (singular spectrum analysis, SSA) is used to reduce the impact of noise on the measured data and to detect structural variations in the data (Huffaker et al. 2017). SSA separates time series data into structured variation (signal), including trend and oscillatory components, and unstructured variation (noise). Since the proposed model can be implemented with a different number of oscillatory components (periods), we used the SSA to enable optimal selection of the number of periods. After identifying the proper number of periods contributing to the signal, the result was used to fine-tune the structure of model (1) and to obtain a statistical estimation of the associated parameters.
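A basic SSA decomposition can be sketched in a few lines: embed the series in a trajectory (Hankel) matrix, take its singular value decomposition, and map each rank-one term back to a series by diagonal averaging. The code below is a textbook-style illustration with an assumed window length and synthetic data; it does not reproduce the software or settings used for the results in this paper.

```python
import numpy as np

def ssa_decompose(series, window):
    """Return the elementary SSA components (one per row) and the singular values."""
    x = np.asarray(series, dtype=float)
    n = x.size
    k = n - window + 1
    # Trajectory matrix: column i holds x[i], ..., x[i + window - 1]
    traj = np.column_stack([x[i:i + window] for i in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    components = []
    for i in range(s.size):
        elem = s[i] * np.outer(u[:, i], vt[i])
        # Diagonal averaging: average entries sharing the same time index
        flipped = elem[::-1]
        comp = [flipped.diagonal(d).mean() for d in range(-window + 1, k)]
        components.append(comp)
    return np.array(components), s

# Synthetic series: level with slow decline, one oscillation, and noise.
rng = np.random.default_rng(0)
t = np.arange(200)
series = (0.30 - 0.0002 * t + 0.05 * np.sin(2 * np.pi * t / 50)
          + rng.normal(0.0, 0.005, t.size))

components, sing = ssa_decompose(series, window=50)
share = sing ** 2 / np.sum(sing ** 2)   # contribution of each component
signal = components[:3].sum(axis=0)     # trend plus one oscillatory pair
residual = series - signal
print(share[:5].round(4))
```

The shares of the leading singular values indicate how many components are needed to describe the signal, which is the information used to choose the number of periods in the parametric model.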

By using SSA, it was possible to obtain information about which seasonal effects are dominant and which are, instead, negligible. From the visual inspection of the eigenvectors (individual and pairwise comparisons) of the SSA for SWC at 0.2 m, it appears that there is a strong seasonal pattern and a long-term trend, suggesting that the location under investigation is potentially subject to long-term climate changes. All these findings are visible from inspection of both panels of Fig. 4. Similar results hold for all other SWC depths (not reported).

Fig. 4 Eigenvectors of the SSA decomposition for each component and pairwise comparisons. The first three eigenvectors are responsible for almost all signal in the data (about 97% of the signal) and display a steady long-term trend and marked seasonality

It is possible to extract the components of SSA for convenient visual inspection of any regularity. As an illustration, we show the extraction of the long-term trend and seasonal components in Fig. 5. In particular, the four panels represent (i) the original series of SWC recorded at 0.2 m; (ii) the trend (whose decline looks linear at first glance and consistent with findings reported in Table 4—see the sign of the estimate of \(\alpha _1\)); (iii) the overall effect of the two seasonal components associated with eigenvectors 2 and 3; and (iv) the “residual” part from the decomposition, which still appears to be far from white noise. As stated previously, this issue will be investigated below, in Sect. 4, where some model improvements will be discussed, but other adjustments are subject to further research.

Fig. 5 Plot of original series and reconstruction of components after SSA for SWC at 0.2 m. The components responsible for the overall signal are the long-term trend and two seasonal components

4 Results, Diagnostics, and Validation

4.1 SWC at Superficial Levels: Depth <1 m

We report the results of model fitting for soil depths of SWC located at 0.2, 0.4, and 0.6 m, which we refer to as “superficial levels”. We report the estimated parameters of the model (2) after a careful, statistically motivated variable selection in Table 4. Using the training data, the adjusted \(R^2\) value for all the fitted models considered here is around 0.7 (or even larger), with better performance at more superficial levels. In all cases, there is a temporal correlation in the residuals, and we provide comments on this evidence below.
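For reference, the adjusted \(R^2\) quoted above can be computed from observed and fitted values as in the short sketch below, using the standard definition with n observations and p predictors (intercept excluded). The numbers are made up for illustration.

```python
def adjusted_r2(y, fitted, n_predictors):
    """Adjusted coefficient of determination: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.8, 5.0]
print(round(adjusted_r2(observed, fitted, n_predictors=1), 4))
```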

Table 4 Significant variables included for SWC at depth of 0.2 m

From a temporal viewpoint, the most important findings are the presence of a negative linear trend and the presence of a single sine/cosine wave, implying one strong seasonality pattern. At the superficial depth of 0.2 m, there is a positive effect of accumulated rainfall, lagged by about 50 days. In other words, the contribution of accumulated rainfall is strongest with a lag of approximately 50 days, implying that the amount of SWC at day t is mostly driven by the accumulated rainfall over the prior 50 days. This last piece of evidence indicates a positive effect and relatively long persistence of accumulated rainfall, holding constant the effect of all the other explanatory variables. This finding is not new, and one of the first attempts at modelling this persistence dates to Yu and Cruise (1982).

The temperature at a superficial depth of 0.2 m has a negative effect on the SWC. Stated more precisely, the value of the average air temperature at day \(t-1\) negatively influences the level of SWC. The choice of lagged temperature at \(t-1\) rather than t is for practical use of the model: using the temperature recorded “yesterday” gives no uncertainty on such explanatory variable when daily predictions are sought. Additionally, we report that using \(X_{t, 4}\) instead of \(X_{t-1, 4}\) yielded very marginal model improvements.

Similar comments hold for the models fitted at depths of 0.4 and 0.6 m, reported in Table 5 and Table 6, respectively. The main differences lie in the selection of more involved seasonal effects, as three sine/cosine waves are found by our variable selection algorithm. The negative gradient of the long-term trend is significant at a depth of 0.4 m and no longer significant at a depth of 0.6 m. We note the longer persistence effect of the accumulated rainfall, which is always positively related to the amount of SWC, but with longer-lasting effects as depth increases, suggesting a longer time span needed for drying the soil. At depths of both 0.4 m and 0.6 m, the effect of the average daily surface temperature is negative, with magnitude decreasing with increasing depth, following the results obtained at 0.2 m. This feature anticipates that the average air temperature might reverse its effect at some stage.

Table 5 Significant variables included for SWC at depth of 0.4 m
Table 6 Significant variables included for SWC at depth of 0.6 m

4.2 SWC at Deeper Levels: Depth of 1 m and More

For deeper levels, the structure of the best-fitted model is still in the form of expression (2). Using our robust fit and robust variable selection algorithm, the coefficients are reported in Tables 7, 8, and 9.

The main finding is that the coefficient associated with the air temperature has a positive sign, as we highlighted earlier, and this feature seems to have a natural physical explanation in the interaction between SWC and air temperature. The effect of cumulative rainfall is still significant, but the time lag at which the most important peak is found is longer for this soil depth, suggesting a longer persistence effect at deeper levels than at superficial levels (we find this very sensible). The number of the multiple seasonal cycles is generally lower than those found at superficial levels, as it appears that only long-term seasonality is found. We found a negative linear trend at 1 m, the magnitude of which is similar to what we have at a depth of 0.6 m. The actual presence of a significant long-term trend would require further investigation, perhaps including more data from several nearby sites.

Table 7 Significant variables included for SWC at depth of 1 m
Table 8 Significant variables included for SWC at depth of 1.2 m
Table 9 Significant variables included for SWC at depth of 1.4 m

4.3 Diagnostic Check and Analysis of Residuals

In this section we analyse the residuals \(e_t = y_t - {\hat{y}}_t\), \(t = 1, 2, \ldots , N\), where N is the sample size used in the fit, and \({\hat{y}}_t\) are the fitted values after estimating the parameters of model (2). We comment only on the residuals of SWC at 0.2 m, but results are similar for other depths. Estimated coefficients are reported in Table 4. Residuals are standardized so that they have zero mean and unit variance, which makes it simpler to contrast their values against the quantiles of a standard normal distribution. The comparison against a standard normal is useful for checking marginal features of the residuals. Another feature to inspect is the temporal correlation of the residuals via the analysis of the empirical autocorrelation and the empirical partial autocorrelation; these diagnostic checks are routinely performed to assess a model’s mis-specification, and are all summarized, for example, in Brockwell and Davis (2016, Sect. 5.3, pp. 144–147).
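The diagnostic quantities mentioned above can be computed with elementary formulas. The sketch below standardizes a residual series, evaluates the sample autocorrelation, and derives the partial autocorrelation with the Durbin–Levinson recursion. The residuals are simulated from a first-order autoregression purely for illustration; they are not the residuals of the fitted SWC model.

```python
import math
import random

def standardize(resid):
    """Centre and scale residuals to zero mean and unit variance."""
    n = len(resid)
    mean = sum(resid) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in resid) / n)
    return [(e - mean) / sd for e in resid]

def sample_acf(x, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    return [sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def sample_pacf(r, max_lag):
    """Partial autocorrelations from r[0..max_lag] via the Durbin-Levinson recursion."""
    pacf, prev = [1.0], []
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

# Simulated serially correlated residuals (AR(1) with coefficient 0.8).
rng = random.Random(0)
e = [0.0]
for _ in range(1999):
    e.append(0.8 * e[-1] + rng.gauss(0.0, 1.0))

z = standardize(e)
acf = sample_acf(z, 10)
pacf = sample_pacf(acf, 10)
band = 1.96 / math.sqrt(len(z))  # approximate 95% band under white noise
print([round(v, 2) for v in acf[:4]], [round(v, 2) for v in pacf[:4]], round(band, 3))
```

A slowly decaying autocorrelation combined with a partial autocorrelation that cuts off after the first lag is the typical signature of an autoregressive structure left in the residuals.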

Fig. 6

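Diagnostic plots of the standardized residuals for SWC at 0.2 m: residuals over time (top left), density of the residuals with the standard normal density as a dotted reference line (top right), empirical autocorrelation (bottom left), and empirical partial autocorrelation (bottom right)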

The four panels of Fig. 6 highlight some interesting findings. The plot of residuals over time (top left panel of Fig. 6) shows some time dependence, so the residuals are not white noise. This is confirmed by the estimates of both the autocorrelation and the partial autocorrelation of the residuals (bottom left and bottom right panels, respectively, of Fig. 6): these two diagnostic plots suggest that an auto-regressive model should provide some improvement of the fit. A robust fit of model (2) with auto-regressive moving average (ARMA) components is still under development in the FSDA toolbox that we used for this research. Finally, from the top right panel of Fig. 6 we observe that the residuals are not Gaussian, but the deviation from normality appears very minor (see the reference dotted line, which is the density of a standard normal). For deeper soil levels, results are broadly similar and thus not reported, but are available upon request.

A concern that needs to be investigated is associated with the “direction” of the model’s errors: it seems that the model underestimates some large values, as standardized residuals larger than 3 occur more often than expected under the normal assumption. In summary, the model diagnostics suggest including ARMA components and some adjustment for the possible presence of heavy tails; this will be the subject of further ongoing research.
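The comparison with the normal assumption can be made concrete with a short sketch: count the standardized residuals above 3 and below -3, and compare each count with the number expected under a standard normal distribution. The function names and the residual values below are hypothetical and serve only as an illustration.

```python
import math

def normal_tail_prob(c):
    """P(|Z| > c) for a standard normal Z, via the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))

def exceedance_summary(z, c=3.0):
    """Count residuals above +c and below -c, with the expected count per side under normality."""
    upper = sum(1 for v in z if v > c)
    lower = sum(1 for v in z if v < -c)
    expected_each_side = len(z) * normal_tail_prob(c) / 2.0
    return {"upper": upper, "lower": lower, "expected_each_side": expected_each_side}

# Hypothetical standardized residuals (illustration only)
z = [0.1, -0.4, 3.5, 1.2, -0.8, 3.2, 0.0, -1.1, 4.1, 0.7]
summary = exceedance_summary(z)
```

An upper count well above both the lower count and the expected value would be consistent with the one-sided underestimation of large values described above.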

4.4 Forecast Scenario for 2020

Despite its simplicity, the model demonstrates good performance at all depths, with an average adjusted \(R^2\) exceeding 0.7 for the observed data up to 31 December 2019. As already highlighted in Sect. 4.3, we also noted some serial correlation in the residuals; approaches for handling this feature will be addressed in further research. We now turn to the investigation of a genuine forecast scenario using generated climate data. Daily precipitation and air temperature for all of 2020 are generated via scenario simulation, using the Weather Generator software developed by Tomei et al. (2022). Here, we discuss in some detail the two most “extreme” cases (i.e. the shallowest and deepest soil levels) for illustrative purposes. For all intermediate depths, we show the results and give brief comments.
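For reference, the adjusted \(R^2\) can be computed as in the sketch below. It assumes the standard definition \(1 - (1 - R^2)(N - 1)/(N - p - 1)\), with p regressors excluding the intercept; the function name and the toy data are hypothetical, and the exact variant used in our software may differ.

```python
def adjusted_r2(y, y_hat, n_params):
    """Adjusted R^2 for len(y) observations and n_params regressors (intercept excluded)."""
    n = len(y)
    if n - n_params - 1 <= 0:
        raise ValueError("need more observations than parameters")
    y_bar = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    ss_tot = sum((a - y_bar) ** 2 for a in y)             # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)

# Toy observed and fitted values (illustration only)
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]
adj = adjusted_r2(y, y_hat, n_params=1)
```

The penalty factor \((N - 1)/(N - p - 1)\) grows with the number of regressors, so adding seasonal or lagged terms raises the adjusted value only if they reduce the residual sum of squares enough.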

Figure 7 includes the training part and the forecast part, separated by a dotted vertical line. In the training part, the agreement between the observed data and the model results appears convincing. In the forecast part (2020), some sharp observed peaks are not accurately predicted by the model (mostly during the dry period). This is probably due to malfunctioning of the field device. This lack of accuracy is still visible at depths of 0.4 m and 0.6 m (see top left and top right panels of Fig. 9).
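The split into training and forecast parts can be sketched as follows. The cutoff date of 31 December 2019 comes from the text; the function names, the toy series, and the use of the root mean squared error (RMSE) as a summary of forecast accuracy are assumptions made for illustration.

```python
import math
from datetime import date, timedelta

def split_by_date(dates, values, cutoff):
    """Split a daily series into training (on or before cutoff) and testing (after cutoff)."""
    train = [(d, v) for d, v in zip(dates, values) if d <= cutoff]
    test = [(d, v) for d, v in zip(dates, values) if d > cutoff]
    return train, test

def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Toy daily series spanning the cutoff (hypothetical SWC values)
start = date(2019, 12, 29)
dates = [start + timedelta(days=i) for i in range(6)]  # 29 Dec 2019 to 3 Jan 2020
observed = [0.30, 0.31, 0.29, 0.28, 0.33, 0.27]
predicted = [0.30, 0.30, 0.30, 0.29, 0.30, 0.28]

cutoff = date(2019, 12, 31)
obs_train, obs_test = split_by_date(dates, observed, cutoff)
_, pred_test = split_by_date(dates, predicted, cutoff)
test_rmse = rmse([v for _, v in obs_test], [v for _, v in pred_test])
```

Sharp observed peaks that the model misses, such as those in the dry period of 2020, would dominate a squared-error summary of this kind.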

Fig. 7

Time series of daily SWC at 0.2 m split into training (up to the end of 2019) and testing, using a scenario for simulation of precipitation and air temperature

Fig. 8

Time series of daily SWC at 1.4 m split into training (up to the end of 2019) and testing, using a scenario for simulation of precipitation and air temperature

Subsequently, we used the same generated weather to perform a similar check but at a deeper level (SWC at 1.4 m). For this level, the training part is not fully satisfactory, especially in 2017 and 2019 when periods of severe drought were observed throughout northern Italy. In this case, the forecast is relatively smooth, and the real observed values are in agreement with the model results. Similar arguments hold for levels at 1 m and 1.2 m, which are displayed in the bottom panels of Fig. 9.

Fig. 9

Time series of daily SWC at several depths, ranging from 0.4 to 1.2 m, with data split into training (up to the end of 2019) and testing, using a scenario for simulation of precipitation and air temperature

5 Discussion and Final Remarks

We have developed a statistical model to describe the temporal pattern of water content at different depths in soil. SWC is a fundamental variable of water balance in soil, influencing several agronomic, geological, and hydrological processes. The model was developed starting from a dataset of meteorological and hydrological parameters measured by a monitoring station on a hillslope very prone to shallow landslides (Bordoni et al. 2015). In fact, shallow landslide triggering depends strongly on SWC values. Shallow failures are triggered when soil approaches or reaches saturated conditions, namely values of SWC close to or equal to the total volume of voids, during or immediately after intense rainfall events (Godt et al. 2009).

Despite the simplicity of the underlying mathematical model, the results obtained are very satisfactory. The proposed model might be beneficial for water management and for the prediction of shallow landslides. We tested our methods using standard goodness-of-fit measures and via a long-term scenario (1 year of daily data).

One of the major benefits of our data-driven approach is the possibility of obtaining accurate daily predictions relying on past data only (i.e., on data that are known without uncertainty). Another benefit is that we require very little physical instrumentation, none of which is located underground, making water content estimation feasible for a very wide range of users.

There are limitations in our study, and we have left some issues open to further research, some of which are currently under investigation in parallel research projects. From a statistical viewpoint, the selected models all display some correlation in the residuals, which suggests that more involved time series modelling is needed. We tried to address this feature by adding components with an ARMA structure (and their seasonal generalization), but this quickly led to over-fitting. Additionally, fitting seasonal ARMA models when outlying observations are present requires specific software, which is not yet available for multiple seasonalities such as those we found in our data via the SSA.

Another feature that we have overlooked is the mutual interaction of SWC at different depths and at different temporal lags. Addressing this multivariate response problem requires methods that generalize those illustrated in Lowther et al. (2020), which from our perspective require some fine-tuning for robustness checks. We believe that the joint modelling of water content at different depths, robust fitting, and software development open an avenue for further research.

In terms of the usability of our approach, we are investigating other sites with different soil types and soil use, retrieving data from official worldwide sources. At the moment, we have evidence that different soil compositions and plants have an effect on SWC and on the speed of drying of the soil. The ability to make valid inferences regarding the specific soil composition and plant coverage would require a larger set of data, which are currently being collected.