22 Spatial Cross-Validation

22.1 Cross-validation

22.1.1 Why?

Cross-validation can be used for evaluating an estimator performance and to compare different estimators.

Cross-validation may help choosing a kriging method (e.g. the type of kriging), a variogram model (e.g. spherical, exponential, or other configurations about the nugget, sill and range parameters) or a neighbourhood.

22.1.2 What is it?

Its principle is the following: at each point location, remove the sample value and estimate its value from the other samples. Then, compare actual \(z({\mathbf u})\) to estimated \(\hat{z}({\mathbf u})\) values using statistics on the errors \(\epsilon_{\beta}\):

\(\epsilon_{\beta} = z_{\beta}({\mathbf u}) - \hat{z}_{\beta}({\mathbf u})\,,\;\; \forall\; \beta=1,...,n_{VAL}\)

where \(n_{VAL}\) is the number of known data included in the validation subset.

Once a model is built (or trained or calibrated, according to its type) it can be used to make predictions. Usually, the interest of the modeler is to estimate a value for the attribute at hand at an unvisited geospatial location.

22.1.2.1 What about the accuracy of the prediction?

The only way to acknowledge about the accuracy is to make a new measurement in the point location where the prediction is made. This operation isn’t always possible for cost or technical constraints (e.g. we measured a dynamic variable for which we cannot get currently a measurement back at the time of the monitoring or survey). Therefore, it is possible to use some samples from the data at hand for validation. To make it properly works, an independent subset (called validation subset) must be created and used only for the purpose of measuring the model performance (and never participating to the training).

One can get useful information about the fitting capabilities of the model, including

  • overfitting (model too complex which badly fits other data than training),
  • underfitting (model too simple and unable to find relationships and patterns),
  • underestimation (the model consistently predicts lower values than the actual values), and
  • overestimation (the model consistently predicts higher values than the actual values).

22.2 Split the data to create subsets

The operation of splitting data into subsets can be usefull to destinate each subset to a distinct scope in the procedure of model building and validation. It is very common to split the whole dataset into at least two subsets:

  • Training subset. The training subset is used to build / calibrate the model and generally it contains 60%-90% of the whole data. It means that only the information available in the training subset is actually used either to make the interpolation at unknown locations (during validation the unknown locations are the validation locations) or to recognize the parameters of the model (e.g. think to the parameters of the model variogram).
  • Validation subset. The validation subset is used to validate the model by measuring the error of the estimation on 10%-40% of the data not used for training / calibration. Any error measure takes into consideration the difference of the measured vs predicted value. It means that any record belonging to the validation subset can be used (or “seen”) by the model during calibration.

22.2.2 Import data

tmean <- read.csv("XY_Tmean.csv")

Dimension:

dim(tmean)
#> [1] 493   3
n <- dim(tmean)[1]
n
#> [1] 493

22.2.3 Transform into simple feature

tmean_sf <- st_as_sf(tmean, coords = c("Xcoord", "Ycoord"), crs=32633)
summary(tmean_sf)
#>      Tmean                  geometry  
#>  Min.   : 0.3347   POINT        :493  
#>  1st Qu.: 1.4522   epsg:32633   :  0  
#>  Median : 3.2497   +proj=utm ...:  0  
#>  Mean   : 3.5548                      
#>  3rd Qu.: 5.0767                      
#>  Max.   :12.6371

Map of points

tmap_mode("view")
#> ℹ tmap mode set to "view".
tm_shape(tmean_sf) + 
  tm_dots(fill="black")

22.2.4 Validation subset

The definition of a random subset for validation purpose is crucial to evaluate the accuracy (or the error) of any interpolation technique (e.g. nearest neighbours, IDW, kriging, etc.).

The validation subset is made by more than few records (let’s say from 10% to 40% of the whole dataset) in order to get a good estimate of the error associated with the spatial interpolation.

In the following example, we can use 20% of the whole data to create the validation subset:

n_val <- n * 20 / 100
n_val
#> [1] 98.6

Let’s round to the nearest integer, since we cannot select a fraction of a single record:

n_val <- round(n * 20 / 100,0)
n_val
#> [1] 99
val_idx <- sample(1:n, n_val)
val_idx
#>  [1] 353  13 102 129 487 415 137 391 354 441 464 323 218 491
#> [15] 429 104 122 324 340 453 250 131 322  88 128 348  29 300
#> [29] 419  39 408 265 240  46   5 452 191 109  33  72 455 266
#> [43]  41  74 412 450 289 304 385 182 392 309 344 420 208 239
#> [57] 150  82 379 460 236 418 124  16 409 439 278 226 413 364
#> [71] 260  14 148  19 478 316 386 395 249 477 349 446 241 445
#> [85] 258 493 397  15 285 438 138 118 359 267 231 483 352 345
#> [99] 425

Split into validation:

tm_val <- tmean_sf[val_idx,]
dim(tm_val)
#> [1] 99  2

22.2.5 Training subset

The residual subset remaining after the removal of the validation subset forms the training subset.

tm_tr <- tmean_sf[-val_idx,]
dim(tm_tr)
#> [1] 394   2

22.2.6 Map of points

tmap_mode("view")
#> ℹ tmap mode set to "view".
tm_shape(tm_tr) + 
  tm_dots(fill="black") + 
  tm_shape(tm_val) + 
  tm_dots(fill = "magenta")

22.2.7 Random vs spatial partitioning

In the plot below there are four partitions for each of the following partitioning types:

  • purely random (first row in in the plot)
  • spatial, clustered (second row)
  • spatial, vertical bands (third row)

We shall use the random partitioning scheme in the course.

22.3 Seed

One can set the seed in the random number generator in order to get always the same list of random indexes. This can be useful in the case in which it is desirable to repeat the experiment in the same conditions.

set.seed(3008)
val_idx <- sample(1:n, n_val)
val_idx
#>  [1] 388 385 387 462 436 381 116 146 448 204  43  49 459 325
#> [15] 453 166 460 125 468 226  12 232 243 304  67 423 256  11
#> [29] 111  45 425 162 149 221 474 276 246 118  75  97 316 212
#> [43] 193 300  74 330 234 433 223 318  15 441 168 473 173 137
#> [57] 294 250 353 307 384 249 184  27 217 485 247 261 337  20
#> [71] 229 406 415 144 117 192  39 366   2 421  10 390 358  44
#> [85] 467 338 163   4 109 491 225 251 449  38 335 153  55 450
#> [99] 478

The seed definition helps to implement reproducibility and brings the randomness in the geospatial interpolation algorithms (e.g. Machine Learning algorithms) for using random numbers. Random numbers are of two types: pseudo-random numbers and true random numbers. Pseudorandom numbers are numbers that appears to be random, but they are not truly random. Typically, pseudorandom numbers (see Wikipedia > Pseudorandom number generator) can be generated using a seed value (which is provided by a user with the set.seed() function) which is then passed to an algorithm that uses the value to generate a new number.

22.4 Monte Carlo analysis

The error calculation is dependent of the particular subset selected by the random number generator, hence it should be convenient to repeat several times the splitting procedure with a different set of random number in each. One can calculate the error for each repetition and then calculate the average error. This procedure is called Monte Carlo analysis.

The Monte Carlo analysis (see Wikipedia > Monte Carlo method) follows this pattern:

  • Define a domain of possible inputs (select the dataset to be used for spatial interpolation)
  • Generate inputs randomly from a probability distribution over the domain (split the whole dateset in training and validation using the random number generator)
  • Perform a deterministic computation of the outputs (calibrate the model using the training subset and predict with it at validation locations)
  • Aggregate the results (calculate the error for each Monte Carlo repetition and compute the aggregated average error)

22.5 Error definition (optional)

We now define a custom function to calculate the root mean square error. Of course, there are different packages in R already providing the RMSE calculation (please, search for them by yourself). Here we use a custom RMSE function as an example and just to underscore a viable opportunity to set up any kind of custom error measurement.

22.5.1 Version 1

rmse_v1 <- function(M,P){
    # INPUTS
    #  M    : measurements
    #  P    : predictions
    
    n <- length(M)
    
    E = (M - P)
    S = E^2
    M = sum(S) / n
    R = sqrt(M)
    
    R
}
rmse_v1( tm_val$Tmean , tm_val$Tmean )
#> [1] 0
rmse_v1( tm_val$Tmean , tm_val$Tmean[1:98] )
#> Warning in M - P: longer object length is not a multiple of
#> shorter object length
#> [1] 0.3389811

22.5.2 Version 2

rmse_v2 <- function(M,P){
    # INPUTS
    #  M    : measurements
    #  P    : predictions
    
    # Checek that M & P have the same size:
    n1 <- length(M)
    n2 <- length(P)
    if(n1>n2 | n1<n2){
        print("Input vectors have different size!")
        return(NA)
    }
    
    n <- length(M)
    
    E = (M - P)
    S = E^2
    M = sum(S) / n
    R = sqrt(M)
    
    R
}

Now we can manage different lengths about actual and predicted values:

rmse_v2( tm_val$Tmean , tm_val$Tmean[1:98] )
#> [1] "Input vectors have different size!"
#> [1] NA

Anyway, the second version of the RMSE function cannot handle the situation in which missing values are generated by the estimator, as an NA is returned:

# insert a dummy NA in the data:
actual = tm_val$Tmean
predicted = tm_val$Tmean
predicted[1] <- NA
rmse_v2( actual , predicted )
#> [1] NA

Indeed, it may happen that the estimator returns NA in neighborhoods where an insufficient number of observations are available for interpolation (e.g. below the nmin parameter).

22.5.3 Version 3

The following third version of the RMSE function is created to overcome the previous situation:

rmse_v3 <- function(M,P){
    # INPUTS
    #  M    : measurements
    #  P    : predictions
    
    # Checek that M & P have the same size:
    n1 <- length(M)
    n2 <- length(P)
    if(n1>n2 | n1<n2){
        print("Input vectors have different size!")
        return(NA)
    }
    
    n <- length(M)
    
    E = (M - P)
    S = E^2
    M = sum(S, na.rm=TRUE) / n
    R = sqrt(M)
    
    R
}

Indeed, the RMSE is correctly computed:

# insert a dummy NA in the data:
actual = tm_val$Tmean
predicted = tm_val$Tmean
predicted[1] <- NA
rmse_v3( actual , predicted )
#> [1] 0

Do not forget that the result of the previous RMSE calculations were always zero, since we are comparing the measurements with themselves.

22.6 Assignment

Use the training and validation subsets to build and then validate the models of interpolation:

  • NULL (global average)
  • Nearest neighbour (proximity polygon)
  • Nearest neighbours (local average)
  • Inverse Distance Weighting (IDW)
  • Kriging, using two different variogram models

Measure the performance by each model on the validation subset using the custom RMSE function above.
Finally, compare models and select the most performing one (the model having the lowest error).