recorder 0.8.1 is now available on CRAN. recorder is a lightweight toolkit to validate new observations before computing their corresponding predictions with a predictive model.
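With recorder the validation process consists of two steps: first, record relevant statistics and meta data of the variables in the training data; second, use these recordings to run a set of basic validation tests on the new observations.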
There can be many data-specific reasons why you might not be confident in the predictions of a predictive model on new data.
Some of them are obvious, e.g.:
Others are more subtle, for instance when observations in new data are not within the “span” of the training data. One example is a variable that is “N/A” (missing) for a new observation to be predicted, although no missing values appeared for the same variable in the training data. This implies that the new observation is not within the “span” of the training data. Put another way: the model has never encountered an observation like this before, so there is good reason to doubt the quality of the prediction.
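To make the idea of the “span” concrete, here is a minimal base R sketch. It does not use recorder; the toy data frames `train` and `new` and all variable names are made up for illustration. It flags new observations that carry an NA or a factor level the training data never contained.

```r
# Toy training data (hypothetical): no missing ages, two known cities.
train <- data.frame(age = c(31, 45, 52),
                    city = factor(c("A", "B", "A")))
# Toy new data (hypothetical): one missing age, one city never seen in training.
new <- data.frame(age = c(40, NA),
                  city = c("B", "C"))

# An NA is only suspicious if the training data contained no NA at all.
na_unseen <- is.na(new$age) & !anyNA(train$age)
# A level is suspicious if it is not among the training levels.
level_unseen <- !(new$city %in% levels(train$city))

# TRUE marks observations outside the "span" of the training data.
outside_span <- na_unseen | level_unseen
outside_span
```

The first new observation looks like the training data, while the second one does not, so only the second is flagged.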
We will need some data in order to demonstrate the recorder workflow. As so many times before, the famous iris data set will be used as an example. The data set is divided into training data, which can be used for model development, and new data for predictions after modelling, which we can validate with recorder.
set.seed(1)
trn_idx <- sample(seq_len(nrow(iris)), 100)
data_training <- iris[trn_idx, ]
data_new <- iris[-trn_idx, ]
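As a quick sanity check of the split, the following sketch repeats the code above and verifies that the two parts have the expected sizes and share no rows. It relies only on base R and on the row names that iris carries by default.

```r
# Repeat the split from above so that this block runs on its own.
set.seed(1)
trn_idx <- sample(seq_len(nrow(iris)), 100)
data_training <- iris[trn_idx, ]
data_new <- iris[-trn_idx, ]

# 100 rows for training, the remaining 50 rows as new data.
n_training <- nrow(data_training)
n_new <- nrow(data_new)

# Row names are preserved when subsetting, so an empty intersection
# means that no observation ended up in both parts.
n_shared <- length(intersect(rownames(data_training), rownames(data_new)))
c(training = n_training, new = n_new, shared = n_shared)
```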
What we want to achieve is to validate the new observations (before computing their predictions with a predictive model) based on relevant statistics and meta data of the variables in the training data. Therefore, relevant statistics and meta data of the variables must first be learned (recorded) from the training data of the model. This is done with the record() function.
library(recorder)
tape <- record(data_training)
#>
#> [RECORD]
#>
#> ... recording meta data and statistics of 100 rows with 5 columns...
#>
#> [STOP]
This provides us with an object belonging to the data.tape class. The data.tape contains the statistics and meta data recorded from the training data.
str(tape)
#> List of 2
#> $ class_variables:List of 5
#> ..$ Sepal.Length: chr "numeric"
#> ..$ Sepal.Width : chr "numeric"
#> ..$ Petal.Length: chr "numeric"
#> ..$ Petal.Width : chr "numeric"
#> ..$ Species : chr "factor"
#> $ parameters :List of 5
#> ..$ Sepal.Length:List of 3
#> .. ..$ min : num 4.3
#> .. ..$ max : num 7.9
#> .. ..$ any_NA: logi FALSE
#> ..$ Sepal.Width :List of 3
#> .. ..$ min : num 2
#> .. ..$ max : num 4.4
#> .. ..$ any_NA: logi FALSE
#> ..$ Petal.Length:List of 3
#> .. ..$ min : num 1.1
#> .. ..$ max : num 6.9
#> .. ..$ any_NA: logi FALSE
#> ..$ Petal.Width :List of 3
#> .. ..$ min : num 0.1
#> .. ..$ max : num 2.5
#> .. ..$ any_NA: logi FALSE
#> ..$ Species :List of 2
#> .. ..$ levels: chr [1:3] "setosa" "versicolor" "virginica"
#> .. ..$ any_NA: logi FALSE
#> - attr(*, "class")= chr [1:2] "list" "data.tape"
As you can see, the meta data and statistics recorded for an individual variable depend on its class: for a numeric variable the min and max values are computed, whilst levels are recorded for a factor variable.
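The class-dependent recording can be imitated in a few lines of base R. The function `record_sketch()` below is a hypothetical helper written for this illustration only; it is not how record() is implemented, it merely produces a structure similar to the parameters element shown above.

```r
# Hypothetical helper: record simple statistics per column, by class.
record_sketch <- function(data) {
  lapply(data, function(x) {
    if (is.numeric(x)) {
      # Numeric variables: observed range and presence of NA values.
      list(min = min(x, na.rm = TRUE),
           max = max(x, na.rm = TRUE),
           any_NA = anyNA(x))
    } else if (is.factor(x)) {
      # Factor variables: observed levels and presence of NA values.
      list(levels = levels(x), any_NA = anyNA(x))
    } else {
      # Any other class: only record whether NA values occurred.
      list(any_NA = anyNA(x))
    }
  })
}

# Applied to the full iris data set (not only the training part).
params <- record_sketch(iris)
str(params$Sepal.Length)
str(params$Species)
```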
First, to spice things up, we will give the new observations a twist by inserting some extreme values and some missing values. On top of that, we will create a new column that was not observed in the training data.
# create three samples of row indices.
samples <- lapply(1:3, function(x) {
  set.seed(x)
  sample(nrow(data_new), 5, replace = FALSE)
})
# insert numeric values outside any recorded range: -Inf and Inf.
data_new$Sepal.Width[samples[[1]]] <- -Inf
data_new$Petal.Width[samples[[2]]] <- Inf
# insert NAs in a numeric vector.
data_new$Petal.Length[samples[[3]]] <- NA_real_
# insert a new column.
data_new$junk <- "junk"
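Before validating anything, it can be reassuring to confirm that the tampering did what was intended. The sketch below repeats the steps above in base R so that it runs on its own, and then counts the inserted values per column.

```r
# Repeat the split and the tampering from above.
set.seed(1)
trn_idx <- sample(seq_len(nrow(iris)), 100)
data_new <- iris[-trn_idx, ]
samples <- lapply(1:3, function(x) {
  set.seed(x)
  sample(nrow(data_new), 5, replace = FALSE)
})
data_new$Sepal.Width[samples[[1]]] <- -Inf
data_new$Petal.Width[samples[[2]]] <- Inf
data_new$Petal.Length[samples[[3]]] <- NA_real_
data_new$junk <- "junk"

# Count the inserted values; each sample holds 5 distinct row indices.
n_neg_inf <- sum(data_new$Sepal.Width == -Inf)
n_pos_inf <- sum(data_new$Petal.Width == Inf)
n_na <- sum(is.na(data_new$Petal.Length))
c(neg_inf = n_neg_inf, pos_inf = n_pos_inf, na = n_na, columns = ncol(data_new))
```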
Now we will validate the new observations by running a number of basic validation tests on each of them. The tests are based on the data.tape with the recorded statistics and meta data of the variables in the training data.
You can get an overview of the validation tests with get_tests_meta_data().
get_tests_meta_data()
#> test_name evaluate_level evaluate_class
#> 1: missing_variable col all
#> 2: mismatch_class col all
#> 3: mismatch_levels col factor
#> 4: new_variable col all
#> 5: outside_range row numeric, integer
#> 6: new_level row factor
#> 7: new_NA row all
#> 8: new_text row character
#> description
#> 1: variable observed in training data but missing in new data
#> 2: 'class' in new data does not match 'class' in training data
#> 3: 'levels' in new data and training data are not identical
#> 4: variable observed in new data but not in training data
#> 5: value in new data outside recorded range in training data
#> 6: new 'level' in new data compared to training data
#> 7: NA observed in new data but not in training data
#> 8: new text in new data compared to training data
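The tests with evaluate_level "col" compare whole columns rather than individual rows. As a rough illustration of what the missing_variable and new_variable tests check, here is a base R sketch on two made-up sets of column names; it is an assumption about the underlying idea, not recorder's actual code.

```r
# Hypothetical column names of training data and new data.
train_cols <- c("Sepal.Length", "Sepal.Width", "Species")
new_cols <- c("Sepal.Length", "Species", "junk")

# Observed in training data but missing in new data.
missing_variable <- setdiff(train_cols, new_cols)
# Observed in new data but not in training data.
new_variable <- setdiff(new_cols, train_cols)

list(missing_variable = missing_variable, new_variable = new_variable)
```

Because such a test has a single outcome per column, a failure applies to every row of the new data at once.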
To run the tests, simply invoke the play() function with the recorded data.tape on the new data.
playback <- play(tape, data_new)
#>
#> [PLAY]
#>
#> ... playing data.tape on new data with 50 rows with 6 columns ...
#>
#> [STOP]
What we actually have here is an object belonging to the new data.playback class.
class(playback)
#> [1] "data.playback" "list"
Great, now let us have a detailed look at the test results with the print() method.
playback
#>
#> [PLAY]
#>
#> # of rows in new data: 50
#> # of rows passing all tests: 0
#> # of rows failing one or more tests: 50
#>
#> Test results (failures):
#> > 'missing_variable': no failures
#> > 'mismatch_class': no failures
#> > 'mismatch_levels': no failures
#> > 'new_variable': junk
#> > 'outside_range': Sepal.Width[row(s): #10, #14, #19, #28, #43],
#> Petal.Length[row(s): #11],
#> Petal.Width[row(s): #8, #10, #28, #35, #44]
#> > 'new_level': no failures
#> > 'new_NA': Petal.Length[row(s): #9, #16, #19, #28, #40]
#> > 'new_text': no failures
#>
#> Test descriptions:
#> 'missing_variable': variable observed in training data but missing in new data
#> 'mismatch_class': 'class' in new data does not match 'class' in training data
#> 'mismatch_levels': 'levels' in new data and training data are not identical
#> 'new_variable': variable observed in new data but not in training data
#> 'outside_range': value in new data outside recorded range in training data
#> 'new_level': new 'level' in new data compared to training data
#> 'new_NA': NA observed in new data but not in training data
#> 'new_text': new text in new data compared to training data
#>
#> [STOP]
As you can see, we are in a lot of trouble here. All rows failed because a new variable (junk) that did not appear in the training data was suddenly observed in the new data. By assumption this invalidates all rows.
Apart from that, some rows failed because the values Inf and -Inf were outside the recorded range in the training data for the variables Sepal.Width and Petal.Width. The outside_range test also flagged Petal.Length in row 11, a column where no extreme values were inserted. Finally, a handful of NA values were encountered in the new data for Petal.Length. This is a new phenomenon compared to the training data, where no NA values were observed.
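The row-level checks behind these failures can be sketched in base R. The recorded range and the vector `x` below are made up, and the logic is an assumption about what outside_range and new_NA test, not recorder's implementation. Note that comparisons with NA yield NA, so missing values have to be handled separately from range violations.

```r
# Hypothetical recorded statistics for one numeric variable.
recorded <- list(min = 2.0, max = 4.4, any_NA = FALSE)
# Hypothetical new values, including -Inf, Inf and a missing value.
x <- c(3.1, -Inf, NA, 4.4, Inf)

# outside_range: the value is present and lies outside [min, max].
outside_range <- !is.na(x) & (x < recorded$min | x > recorded$max)
# new_NA: the value is missing although training data had no NA values.
new_NA <- is.na(x) & !recorded$any_NA

data.frame(x, outside_range, new_NA)
```

A value equal to the recorded maximum (4.4 here) still passes, whereas -Inf and Inf fail for any finite recorded range.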
recorder allows you to extract the results of the validation tests in a number of ways.
You might want to extract the results as a data.frame with the results of the (failed) tests as columns. To do this, invoke get_failed_tests() on playback:
knitr::kable(head(get_failed_tests(playback), 15))
outside_range.Sepal.Width | outside_range.Petal.Length | outside_range.Petal.Width | new_NA.Petal.Length | new_variable.junk |
---|---|---|---|---|
FALSE | FALSE | FALSE | FALSE | TRUE |
FALSE | FALSE | FALSE | FALSE | TRUE |
FALSE | FALSE | FALSE | FALSE | TRUE |
FALSE | FALSE | FALSE | FALSE | TRUE |
FALSE | FALSE | FALSE | FALSE | TRUE |
FALSE | FALSE | FALSE | FALSE | TRUE |
FALSE | FALSE | FALSE | FALSE | TRUE |
FALSE | FALSE | TRUE | FALSE | TRUE |
FALSE | FALSE | FALSE | TRUE | TRUE |
TRUE | FALSE | TRUE | FALSE | TRUE |
FALSE | TRUE | FALSE | FALSE | TRUE |
FALSE | FALSE | FALSE | FALSE | TRUE |
FALSE | FALSE | FALSE | FALSE | TRUE |
TRUE | FALSE | FALSE | FALSE | TRUE |
FALSE | FALSE | FALSE | FALSE | TRUE |
It can also be useful to get the results of the (failed) tests as a character vector with one entry per row in the new data, in which the names of the failed tests for the given row are concatenated.
head(get_failed_tests_string(playback))
#> [1] "new_variable.junk;" "new_variable.junk;" "new_variable.junk;"
#> [4] "new_variable.junk;" "new_variable.junk;" "new_variable.junk;"
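To see how such strings relate to the table of failed tests, here is a base R sketch that collapses a small, made-up logical data.frame into one string per row. The separator handling is simplified (no trailing semicolon), so the result only resembles the output of get_failed_tests_string().

```r
# Hypothetical table of failed tests: one column per test and variable.
failed <- data.frame(
  outside_range.Petal.Width = c(FALSE, TRUE, FALSE),
  new_NA.Petal.Length = c(TRUE, TRUE, FALSE)
)

# For each row, paste together the names of the columns that are TRUE.
failed_string <- vapply(seq_len(nrow(failed)), function(i) {
  paste(names(failed)[unlist(failed[i, ])], collapse = ";")
}, character(1))
failed_string
```

A row without any failures ends up as an empty string.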
As a third option, you can extract a logical vector that indicates which rows passed the validation tests.
get_clean_rows(playback)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] FALSE FALSE FALSE FALSE FALSE FALSE
TRUE means that a given row is clean and has passed all tests; FALSE, on the other hand, implies that a given row failed one or more tests.
In this case, all rows are invalid due to the strange column junk that appears in the new data (you might think this is a strict rule, but it is consistent nonetheless).
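The effect of a column-level failure on the clean rows can be reproduced with a small base R sketch. The failure table is made up, and defining a clean row as one without any TRUE entry is an assumption about the idea behind get_clean_rows(), not its actual code.

```r
# Hypothetical row-level failures for four new observations.
row_failures <- c(FALSE, TRUE, FALSE, FALSE)

# A column-level failure (a new variable) is repeated for every row.
failed <- data.frame(
  outside_range.Petal.Width = row_failures,
  new_variable.junk = rep(TRUE, length(row_failures))
)

# A row is clean when none of its tests failed.
clean_strict <- unname(rowSums(failed) == 0)
# Without the column-level failure only the second row is invalid.
clean_relaxed <- unname(rowSums(failed["outside_range.Petal.Width"]) == 0)

rbind(clean_strict, clean_relaxed)
```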
The user might, for various reasons, want to ignore one or more of the failed tests. recorder handles this easily whenever you invoke one of the functions get_clean_rows(), get_failed_tests() or get_failed_tests_string().
Let us assume that we do not care whether the new data contains a column that was not observed in the training data. The results of a specific test can be ignored with the ignore_tests argument.
Let us try it out and ignore the results of the new_variable validation test.
get_clean_rows(playback, ignore_tests = "new_variable")
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
#> [12] TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
#> [23] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
#> [34] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
#> [45] TRUE TRUE TRUE TRUE TRUE TRUE
According to this less restrictive selection, 38 of the new observations are now valid.
Maybe you do not care about the test results for a specific column. You can ignore the results from tests of a specific variable with the ignore_cols argument. Let us go ahead and suppress the test results from tests of the Petal.Length variable.
get_clean_rows(playback,
               ignore_tests = "new_variable",
               ignore_cols = "Petal.Length")
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
#> [12] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
#> [23] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
#> [34] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
#> [45] TRUE TRUE TRUE TRUE TRUE TRUE
With this modification, a total of 42 of the new observations are valid.
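Conceptually, ignoring a test or a column amounts to dropping columns from the table of failed tests before the clean rows are determined. The base R sketch below mimics this on a made-up table; it assumes column names of the form test.variable with no dot inside the test name, and it is not recorder's implementation.

```r
# Hypothetical table of failed tests, named "<test>.<variable>".
failed <- data.frame(
  outside_range.Sepal.Width = c(TRUE, FALSE, FALSE),
  new_NA.Petal.Length = c(FALSE, TRUE, FALSE),
  new_variable.junk = c(TRUE, TRUE, TRUE)
)

# Split each column name at the first dot into test and variable.
test_of <- sub("\\..*$", "", names(failed))
col_of <- sub("^[^.]*\\.", "", names(failed))

# Keep only results that are neither from an ignored test
# nor from an ignored column.
ignore_tests <- "new_variable"
ignore_cols <- "Petal.Length"
keep <- !(test_of %in% ignore_tests) & !(col_of %in% ignore_cols)

# A row is clean when none of the remaining tests failed.
clean <- unname(rowSums(failed[, keep, drop = FALSE]) == 0)
clean
```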
It is also possible to ignore the test results of specific tests of specific columns with the ignore_combinations argument. Let us try to ignore the outside_range test, but only for the Sepal.Width variable.
knitr::kable(head(get_failed_tests(playback,
                                   ignore_tests = "new_variable",
                                   ignore_cols = "Petal.Length",
                                   ignore_combinations = list(outside_range = "Sepal.Width")),
                  15))
outside_range.Petal.Width |
---|
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
TRUE |
FALSE |
TRUE |
FALSE |
FALSE |
FALSE |
FALSE |
FALSE |
As you can see, with this additional removal the only test failures that remain are the ones from the outside_range test of the Petal.Width variable.
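The combination rule can be sketched in base R as well: a result is dropped only when both its test and its variable match an entry of the list. The failure table is made up and the matching logic is an assumption for illustration, not recorder's implementation.

```r
# Hypothetical table of failed tests, named "<test>.<variable>".
failed <- data.frame(
  outside_range.Sepal.Width = c(TRUE, FALSE),
  outside_range.Petal.Width = c(FALSE, TRUE),
  new_NA.Sepal.Width = c(FALSE, FALSE)
)
test_of <- sub("\\..*$", "", names(failed))
col_of <- sub("^[^.]*\\.", "", names(failed))

# Same shape as the ignore_combinations argument used above.
ignore_combinations <- list(outside_range = "Sepal.Width")

# Drop a column only if its variable is listed under its own test.
drop <- unname(mapply(function(test, variable) {
  variable %in% ignore_combinations[[test]]
}, test_of, col_of))

remaining <- names(failed)[!drop]
remaining
```

The outside_range result for Petal.Width and the new_NA result for Sepal.Width both survive, because neither matches the test and the variable at the same time.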
That is it. I hope that you will enjoy the recorder package :)