collapse’s Handling of R Objects

A Quick View Behind the Scenes of Class-Agnostic R Programming

Sebastian Krantz

2024-11-02

This much-requested vignette provides some details about how collapse deals with various R objects. It is principally a digest of cumulative details provided in the NEWS for various releases since v1.4.0.

Overview

collapse is a class-agnostic programming framework that can deal with a broad range of R objects. It provides explicit support for base R classes and data types (integer, double, character, logical, list, data.frame, matrix, factor, Date, POSIXct, ts), as well as data.table, tibble, grouped_df, xts, zoo, pseries, pdata.frame, units, and sf (no geometric operations).

It also introduces GRP_df as a more performant and class-agnostic grouped data frame, and indexed_series and indexed_frame classes as modern class-agnostic successors of pseries and pdata.frame. These objects inherit the classes they succeed and are handled through .pseries, .pdata.frame, and .grouped_df methods, which also support the original (plm / dplyr) implementations (more details at the end).

All other objects are handled internally at the C or R level using general principles extended by specific considerations for some of the above classes. I start by summarizing the general principles, which enable the usage of collapse with further classes it was not designed for.

General Principles

In general, in collapse, attributes and classes of R objects are preserved in statistical and data manipulation operations unless their preservation involves a high risk of yielding something wrong or useless. Risky operations are those that change the dimensions or internal data type (typeof()) of an R object.

To collapse’s R and C code, there exist 3 principal types of objects: atomic vectors, matrices, and lists - which are often assumed to be data frames. Most data manipulation functions in collapse like fmutate() only support lists, whereas statistical functions - notably the S3 generic Fast Statistical Functions like fmean() - generally support all 3 types of objects.

S3 generic functions initially dispatch to .default, .matrix, .data.frame, and (hidden) .list methods. The .list method generally dispatches to the .data.frame method. These basic methods, and other non-generic functions in collapse, then decide how exactly to handle the object based on the statistical operation performed and attribute handling principles mostly implemented in C.

The simplest case arises when an operation preserves the dimensions of the object, such as fscale(x) or fmutate(data, across(a:c, log)). In this case, all attributes of x / data are preserved [1].

Another simple case for matrices and lists arises when a statistical operation reduces them to a single dimension such as fmean(x), where, under the drop = TRUE default of Fast Statistical Functions, all attributes apart from (column-)names are dropped and a (named) vector of means is returned.

For atomic vectors, a statistical operation like fmean(x) will preserve the attributes (except for ts objects), as the object could have useful properties such as labels or units.

More complex cases involve changing the dimensions (number of rows or columns) of an object. If the number of rows is preserved, e.g. fmutate(data, a_b = a / b) or flag(x, -1:1), only the (column-)names attribute of the object is modified. If the number of rows is reduced, e.g. fmean(x, g), all attributes are also retained under suitable modifications of the (row-)names attribute. However, if x is a matrix, attributes other than row- or column-names are only retained if !is.object(x), that is, if the matrix does not have a ‘class’ attribute. For atomic vectors, attributes are retained if !inherits(x, "ts"), as aggregating a time series will break the class. This also applies to columns in a data frame being aggregated.
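The cases above can be checked on small objects. The following is a minimal sketch, assuming collapse is installed; the labelled vector and matrix are made up for illustration and the "label" attribute is just an arbitrary attribute name.

```r
library(collapse)

# Hypothetical labelled inputs, invented for this illustration
x <- structure(c(1, 2, 3, 4, 5, 6), label = "Income")
m <- structure(matrix(c(1, 2, 3, 4, 5, 6), ncol = 2,
                      dimnames = list(NULL, c("a", "b"))),
               label = "Two columns")
g <- c(1, 1, 1, 2, 2, 2)

scaled  <- fscale(x)     # dimensions preserved: attributes kept
avg     <- fmean(x)      # atomic vector: attributes kept
col_avg <- fmean(m)      # matrix, drop = TRUE: only column names survive
grp_avg <- fmean(x, g)   # grouped aggregation of a non-ts vector

# Aggregating a univariate time series by calendar month breaks the 'ts' class
ts_avg <- fmean(AirPassengers, rep(1:12, 12))

attr(scaled, "label")
attr(avg, "label")
attributes(col_avg)
attr(grp_avg, "label")
is.ts(ts_avg)
```

Inspecting attributes() on each result is a quick way to see which rule applied to a class that collapse was not explicitly designed for.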

When data is transformed using statistics as provided by the TRA() function e.g. TRA(x, STATS, operation, groups) and the like-named argument to the Fast Statistical Functions, operations that simply modify the input (x) in a statistical sense ("replace_NA", "-", "-+", "/", "+", "*", "%%", "-%%") just copy the attributes to the transformed object. Operations "replace_fill" and "replace" are more tricky, since here x is replaced with STATS, which could be of a different class or data type. The following rules apply: (1) the result has the same data type as STATS; (2) if is.object(STATS), the attributes of STATS are preserved; (3) otherwise the attributes of x are preserved unless is.object(x) && typeof(x) != typeof(STATS); (4) an exemption to this rule is made if x is a factor and an integer replacement is offered to STATS e.g. fnobs(factor, group, TRA = "replace_fill"). In that case, the attributes of x are copied except for the ‘class’ and ‘levels’ attributes. These rules were devised considering the possibility that x may have important information attached to it which should be preserved in data transformations, such as a "label" attribute.
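Two of these rules can be illustrated with a short example. This is a sketch assuming collapse is installed; the data and the "label" attribute are invented for illustration.

```r
library(collapse)

# Hypothetical labelled numeric vector and a grouping vector
x <- structure(c(1, 2, 3, 4), label = "Income")
g <- c(1, 1, 2, 2)

# "-" only modifies x in a statistical sense, so the attributes of x are copied
centered <- fmean(x, g, TRA = "-")
attr(centered, "label")

# Rule (4): counting observations of a factor and filling the counts back in
# gives an integer vector; 'class' and 'levels' of the factor are not copied
f <- factor(c("a", "b", "a", "b"))
n <- fnobs(f, g, TRA = "replace_fill")
is.factor(n)
typeof(n)
```

The second call is the situation the exemption was written for: without it, the group counts would be returned as a malformed factor.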

Another rather complex case arises when manipulating data with collapse using base R functions, e.g. BY(mtcars$mpg, mtcars$cyl, mad) or mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mad_mpg = mad(mpg)). In this case, collapse internally uses base R functions lapply() and unlist(), following efficient splitting with gsplit() (which preserves all attributes). Concretely, the result is computed as y = unlist(lapply(gsplit(x, g), FUN, ...), FALSE, FALSE), where in the examples x is mtcars$mpg, g is the grouping variable(s), FUN = mad, and y is mad(x) in each group. To follow its policy of attribute preservation as closely as possible, collapse then calls an internal function y_final = copyMostAttributes(y, x), which copies the attributes of x to y if both are deemed compatible [2] (≈ of the same data type). If they are deemed incompatible, copyMostAttributes still checks if x has a "label" attribute and copies that one to y.
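The split-apply-combine step and the subsequent attribute check can be mimicked in base R. The sketch below is not collapse's implementation: split() stands in for gsplit(), and compatible() and copy_attr_sketch() are hypothetical helpers that transcribe the compatibility condition given in the footnote.

```r
# Compatibility condition as stated in the footnote (hypothetical helper)
compatible <- function(x, y) {
  typeof(x) == typeof(y) &&
    (identical(class(x), class(y)) || typeof(y) != "integer" ||
       inherits(x, c("IDate", "ITime"))) &&
    !(length(x) != length(y) && inherits(x, "ts"))
}

# Hypothetical stand-in for the internal attribute-copying step
copy_attr_sketch <- function(y, x) {
  a <- attributes(x)
  keep <- if (compatible(x, y)) {
    setdiff(names(a), c("names", "dim", "dimnames"))
  } else {
    intersect(names(a), "label")   # incompatible: only a "label" is carried over
  }
  for (nm in keep) attr(y, nm) <- a[[nm]]
  y
}

g <- mtcars$cyl

# Case 1: double in, double out, so the attributes are copied
x <- structure(mtcars$mpg, label = "Miles per gallon")
y <- unlist(lapply(split(x, g), mad), FALSE, FALSE)
y_final <- copy_attr_sketch(y, x)

# Case 2: counting distinct values of a factor yields plain integers,
# so 'class' and 'levels' must not be copied, only the label
f <- structure(factor(mtcars$gear), label = "Gears")
n <- unlist(lapply(split(f, g), function(v) length(unique(v))), FALSE, FALSE)
n_final <- copy_attr_sketch(n, f)

attributes(y_final)
attributes(n_final)
```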

So to summarize the general principles: collapse tries to preserve attributes in all cases except where doing so is likely to break something, bearing in mind how the most commonly used R classes and objects behave. The operations most likely to break something are aggregating matrices which have a class (such as mts/xts) or univariate time series (ts), replacing data by another object, and applying an unknown function to a vector by groups and assembling the result with unlist(). In the latter cases, particular attention is paid to integer vectors and factors, as we often count something generating integers, and malformed factors need to be avoided.

The following section provides some further details for some collapse functions and supported classes.

Specific Functions and Classes

Object Conversions

Quick conversion functions qDF(), qDT(), qTBL() and qM() (to create data.frame’s, data.table’s, tibble’s and matrices from arbitrary R objects) by default (keep.attr = FALSE) perform very strict conversions, where all attributes non-essential to the class are dropped from the input object. This is to ensure that, following conversion, objects behave exactly the way users expect. This is different from the behavior of functions like as.data.frame(), as.data.table(), as_tibble() or as.matrix(), e.g. as.matrix(EuStockMarkets) just returns EuStockMarkets whereas qM(EuStockMarkets) returns a plain matrix without time series attributes. This behavior can be changed by setting keep.attr = TRUE, i.e. qM(EuStockMarkets, keep.attr = TRUE).
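The difference is easy to verify with the EuStockMarkets example from the text. A minimal sketch, assuming collapse is installed:

```r
library(collapse)

base_m  <- as.matrix(EuStockMarkets)            # still a multivariate time series
plain_m <- qM(EuStockMarkets)                   # strict conversion
kept_m  <- qM(EuStockMarkets, keep.attr = TRUE) # attributes retained

is.ts(base_m)
is.ts(plain_m)
attr(plain_m, "tsp")
attr(kept_m, "tsp")
```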

Selecting Columns by Data Type

Functions num_vars(), cat_vars() (the opposite of num_vars()), char_vars() etc. are implemented in C to quickly select columns by data type without the need to check data frame columns by applying a function such as is.numeric() in R. For is.numeric, the C implementation is equivalent to is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") && (!is.object(x) || inherits(x, "ts") || inherits(x, "units") || inherits(x, "integer64")). This of course does not respect the behavior of other classes that define methods for is.numeric e.g. is.numeric.foo <- function(x) TRUE, then for y = structure(rnorm(100), class = "foo"), is.numeric(y) is TRUE but num_vars(data.frame(y)) returns an empty frame. Correct behavior in this case requires get_vars(data.frame(y), is.numeric). A particular case to be aware of here is when using collap() with the FUN and catFUN arguments, where the C code (is_numeric_C) is used internally to decide whether a column is numeric or categorical. Thus numeric columns with a class attribute other than “ts”, “units”, and “integer64” are not recognized as numeric by collap(). collapse also does not support statistical operations on complex data.
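The consequence of this rule can be explored in base R by running the R equivalent given above on a few objects. The class "foo" is the hypothetical class from the text; nothing here calls collapse itself.

```r
# R equivalent of the C check, as stated in the text
is_numeric_C <- function(x) {
  typeof(x) %in% c("integer", "double") &&
    (!is.object(x) || inherits(x, "ts") || inherits(x, "units") ||
       inherits(x, "integer64"))
}

# A hypothetical class that declares itself numeric via an S3 method
is.numeric.foo <- function(x) TRUE
y <- structure(rnorm(100), class = "foo")

is.numeric(y)                # method dispatch: treated as numeric
is_numeric_C(y)              # type/class check: not treated as numeric
is_numeric_C(AirPassengers)  # 'ts' is one of the exempted classes
is_numeric_C(factor("a"))    # integer storage, but classed
is_numeric_C(1:3)            # plain integer vector
```

For such classes, selecting columns with a function, as in get_vars(data, is.numeric), respects the S3 method at the cost of an R-level call per column.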

Parsing of Time-IDs

Time Series Functions flag, fdiff, fgrowth and psacf/pspacf/psccf (and the operators L/F/D/Dlog/G) have a t argument to pass time-ids for fully identified temporal operations on time series and panel data. If t is a plain numeric vector or a factor, it is coerced to integer using as.integer(), and the integer steps are used as time steps. This is premised on the observation that the most common form of temporal identifier is a numeric variable denoting calendar years. If on the other hand t is a numeric time object such that is.object(t) && is.numeric(unclass(t)) (e.g. Date, POSIXct, etc.), then it is passed through timeid() which computes the greatest common divisor of the vector and generates an integer time-id in that way. Users are therefore advised to use appropriate classes to represent time steps e.g. for monthly data zoo::yearmon would be appropriate. It is also possible to pass non-numeric t, such as character or list/data.frame. In such cases ordered grouping is applied to generate an integer time-id, but this should rather be avoided.
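To see why the class used for t matters, the greatest-common-divisor idea can be sketched in base R. The function timeid_sketch() below is a hypothetical illustration of the principle only; the actual timeid() is implemented differently and may differ in details.

```r
# Euclid's algorithm for two non-negative whole numbers
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Hypothetical sketch: derive an integer time-id from the GCD of the time steps
timeid_sketch <- function(t) {
  v <- as.numeric(unclass(t))
  step <- Reduce(gcd2, diff(sort(unique(v))))
  as.integer(round((v - min(v)) / step)) + 1L
}

# Weekly dates with one missing week: the step is 7 days, the gap shows up as a jump
weekly <- as.Date("2024-01-01") + c(0, 7, 14, 28)
timeid_sketch(weekly)   # 1 2 3 5

# Monthly data stored as Date: months have 31 and 29 days here, so the step is 1 day
monthly <- as.Date(c("2024-01-01", "2024-02-01", "2024-03-01"))
timeid_sketch(monthly)  # 1 32 61
```

With Date, consecutive months end up 29 to 31 steps apart, so a one-period lag would not find the previous month. A class whose unit is the month avoids this, which is the reason for the advice above.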

xts/zoo Time Series

xts/zoo time series are handled through .zoo methods to all relevant functions. These methods are simple and all follow this pattern: FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...). Thus the general principles apply. Time series functions do not automatically use the index for indexed computations, partly for consistency with native methods where this is also not the case (e.g. lag.xts does not perform an indexed lag), and partly because, as outlined above, the index does not necessarily accurately reflect the time structure. Thus the user must exercise discretion to perform an indexed lag on xts/zoo. For example: flag(xts_daily, 1:3, t = index(xts_daily)) or flag(xts_monthly, 1:3, t = zoo::as.yearmon(index(xts_monthly))).
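The effect of passing t does not depend on xts and can be seen on a plain vector. A minimal sketch, assuming a recent version of collapse with support for irregular time series; the values and years are invented for illustration.

```r
library(collapse)

# Hypothetical yearly series with no observation for 2002
x    <- c(10, 20, 30, 40)
year <- c(2000, 2001, 2003, 2004)

plain_lag   <- flag(x, 1)            # positional lag, ignores the gap
indexed_lag <- flag(x, 1, t = year)  # lag by one time step

plain_lag     # NA 10 20 30
indexed_lag   # NA 10 NA 30
```

The positional lag pairs 2003 with the value from 2001, whereas the indexed lag returns NA because there is no observation for 2002.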

Support for sf and units

collapse internally supports sf by seeking to avoid undue destruction of sf objects through removal of the ‘geometry’ column in data manipulation operations. This is simply implemented through an additional check in the C programs used to subset columns of data: if the object is an sf data frame, the ‘geometry’ column is added to the column selection. Other functions like funique() or roworder() have internal facilities to avoid sorting or grouping on the ‘geometry’ column. Again other functions like descr() and qsu() simply omit the geometry column in their statistical calculations. A short vignette describes the integration of collapse and sf in a bit more detail. In summary: collapse supports sf by seeking to appropriately deal with the ‘geometry’ column. It cannot perform any geometrical operations, for example after subsetting sf with fsubset(), the bounding box attribute of the geometry is unaltered and likely too large.

Regarding units objects, all relevant functions also have simple methods of the form FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...). According to the general principles, the default method preserves the units class, whereas the matrix method does not if FUN aggregates the data. The use of copyMostAttrib(), which copies all attributes apart from "dim", "dimnames", and "names", ensures that the returned objects are still units objects.

Support for data.table

collapse provides quite thorough support for data.table. The simplest level of support is that it avoids assigning descriptive (character) row names to data.table’s e.g. fmean(mtcars, mtcars$cyl) has row-names corresponding to the groups but fmean(qDT(mtcars), mtcars$cyl) does not.
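The row-name behavior can be checked directly. This sketch assumes collapse is installed; it only inspects the returned objects and does not call any data.table functions.

```r
library(collapse)

df_res <- fmean(mtcars, mtcars$cyl)       # data.frame input
dt_res <- fmean(qDT(mtcars), mtcars$cyl)  # data.table input

rownames(df_res)   # group labels as row names
rownames(dt_res)   # default integer sequence, no descriptive row names
class(dt_res)
```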

collapse further supports data.table’s reference semantics (set*, :=). To be able to add columns by reference (e.g. DT[, new := 1]), data.table’s are implemented as overallocated lists [3]. collapse copied some C code from data.table to do the overallocation and generate the ".internal.selfref" attribute, so that qDT() creates a valid and fully functional data.table. To enable seamless data manipulation combining collapse and data.table, all data manipulation functions in collapse call this C code at the end and return a valid (overallocated) data.table. However, because this overallocation comes at a computational cost of 2-3 microseconds, I have opted against also adding it to the .data.frame methods of statistical functions. Concretely, this means that res <- DT |> fgroup_by(id) |> fsummarise(mu_a = fmean(a)) gives a fully functional data.table i.e. res[, new := 1] works, but res2 <- DT |> fgroup_by(id) |> fmean() gives a non-overallocated data.table such that res2[, new := 1] will still work but issue a warning. In this case, res2 <- DT |> fgroup_by(id) |> fmean() |> qDT() can be used to avoid the warning. This, to me, seems a reasonable trade-off between flexibility and performance. More details and examples are provided in the collapse and data.table vignette.

Class-Agnostic Grouped and Indexed Data Frames

As indicated in the introductory remarks, collapse provides a fast class-agnostic grouped data frame created with fgroup_by(), and fast class-agnostic indexed time series and panel data, created with findex_by()/reindex(). Class-agnostic means that the object that is grouped/indexed continues to behave as before except in collapse operations utilizing the ‘groups’/‘index_df’ attributes.

The grouped data frame is implemented as follows: fgroup_by() saves the class of the input data, calls GRP() on the columns being grouped, and attaches the resulting ‘GRP’ object in a "groups" attribute. It then assigns a class attribute as follows

clx <- class(.X) # .X is the data frame being grouped, clx is its class
m <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
class(.X) <- c("GRP_df",  if(length(mp <- m[m != 0L])) clx[-mp] else clx, "grouped_df", if(m[3L]) "data.frame") 

In words: a class "GRP_df" is added in front, followed by the classes of the original object [4], followed by "grouped_df" and finally "data.frame", if present. The first class "GRP_df" is for dealing appropriately with the object through methods for print() and subsetting ([, [[), e.g. print.GRP_df fetches the grouping object, prints fungroup(.X) [5], and then prints a summary of the grouping. [.GRP_df works similarly: it saves the groups, calls [ on fungroup(.X), and attaches the groups again if the result is a list with the same number of rows. So collapse has no issues printing and handling grouped data.table’s, tibbles, sf data frames, etc.: they continue to behave as usual. Now collapse has various functions with a .grouped_df method to deal with grouped data frames. For example fmean.grouped_df, in a nutshell, fetches the attached ‘GRP’ object using GRP.grouped_df, and calls fmean.data.frame on fungroup(data), passing the ‘GRP’ object to the g argument for grouped computation. Here the general principles outlined above apply so that the resulting object has the same classes and attributes as the original one.
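The class assignment can be tried out on plain class vectors in base R. The wrapper grouped_class_sketch() is a hypothetical helper that applies the same expression to a character vector instead of a data frame.

```r
# Hypothetical wrapper around the class construction used by fgroup_by()
grouped_class_sketch <- function(clx) {
  m <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
  c("GRP_df",
    if (length(mp <- m[m != 0L])) clx[-mp] else clx,  # original classes, cleaned
    "grouped_df",
    if (m[3L]) "data.frame")                          # only if it was a data frame
}

tbl_cls <- grouped_class_sketch(c("tbl_df", "tbl", "data.frame"))
dt_cls  <- grouped_class_sketch(c("data.table", "data.frame"))

tbl_cls   # "GRP_df" "tbl_df" "tbl" "grouped_df" "data.frame"
dt_cls    # "GRP_df" "data.table" "grouped_df" "data.frame"

# Re-grouping an already grouped object does not duplicate classes
identical(grouped_class_sketch(tbl_cls), tbl_cls)
```

Because the original classes sit between "GRP_df" and "grouped_df", methods of the original class remain reachable once the "GRP_df" method has done its part.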

This architecture has an additional advantage: it allows GRP.grouped_df to examine the grouping object and check if it was created by collapse (class ‘GRP’) or by dplyr. If the latter is the case, an efficient C routine is called to convert the dplyr grouping object to a ‘GRP’ object so that all .grouped_df methods in collapse apply to data frames created with either dplyr::group_by() or fgroup_by().

The indexed_frame works more or less in the same way. It inherits from pdata.frame so that .pdata.frame methods in collapse deal with both indexed_frame’s of arbitrary classes and pdata.frame’s created with plm.

A notable difference to both grouped_df and pdata.frame is that indexed_frame is a deeply indexed data structure: each variable inside an indexed_frame is an indexed_series which contains in its index_df attribute an external pointer to the index_df attribute of the frame. Functions with pseries methods operating on indexed_series stored inside the frame (such as with(data, flag(column))) can fetch the index from this pointer. This allows worry-free application inside arbitrary data masking environments (with, %$%, attach, etc.) and estimation commands (glm, feols, lmrob etc.) without duplication of the index in memory. As you may have guessed, indexed_series are also class-agnostic and inherit from pseries. Any vector or matrix of any class can become an indexed_series.

Further levels of generality are that indexed series and frames allow one, two or more variables in the index to support both time series and complex panels, natively deal with irregularity in time [6], and provide a rich set of methods for subsetting and manipulation which also subset the index_df attribute, including internal methods for fsubset(), funique(), roworder(v) and na_omit(). So indexed_frame and indexed_series are rich and general structures permitting fully time-aware computations on nearly any R object. See ?indexing for more information.
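A small panel makes the deep indexing concrete. This is a minimal sketch, assuming collapse is installed; the data frame and its column names are invented for illustration.

```r
library(collapse)

# Hypothetical balanced panel: two units observed over three years
d <- data.frame(id   = rep(1:2, each = 3),
                year = rep(2001:2003, 2),
                y    = c(1, 2, 3, 10, 20, 30))

di <- findex_by(d, id, year)

class(di)      # indexed_frame in front, pdata.frame and data.frame retained
class(di$y)    # each column is an indexed_series inheriting from pseries

# The lag is computed within id, even inside a data masking environment
lag_y <- with(di, flag(y))
unattrib(lag_y)   # NA 1 2 NA 10 20
```

The call inside with() never sees the frame, only the column, so the index must travel with the column for the lag to respect the panel structure.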

Conclusion

collapse handles R objects in a preserving and fairly intelligent manner, allowing seamless compatibility with many common data classes in R, and statistical workflows that preserve attributes (labels, units, etc.) of the data. This is done by general principles and some specific considerations/exemptions mostly implemented in C - as detailed in this vignette.

The main benefits of this design are generality and execution speed: collapse has far fewer R-level method dispatches and function calls than other frameworks used to perform statistical or data manipulation operations, it behaves predictably, and may also work well with your simple new class.

The main disadvantage is that many of the general principles and exemptions are hard-coded in C and thus may not work with specific classes. A prominent example where collapse simply fails is lubridate’s interval class (#186, #418), which has a "starts" attribute of the same length as the data that is preserved but not subset in collapse operations.


  1. Preservation implies a shallow copy of the attribute lists from the original object to the result object. A shallow copy is memory-efficient and means we are copying the list containing the attributes in memory, but not the attributes themselves. Whenever I talk about copying attributes, I mean a shallow copy, not a deep copy. You can perform shallow copies with helper functions copyAttrib() or copyMostAttrib(), and directly set attribute lists using setAttrib() or setattrib().

  2. Concretely, attributes are copied if (typeof(x) == typeof(y) && (identical(class(x), class(y)) || typeof(y) != "integer" || inherits(x, c("IDate", "ITime"))) && !(length(x) != length(y) && inherits(x, "ts"))). The first part of the condition is easy: if x and y are of different data types we do not copy attributes. The second condition states that to copy attributes we also need to ensure that x and y are either of the same class, or y is not integer, or x is an integer-based date or time (= classes provided by data.table). The main reason for this clause is to guard against cases where we are counting something on an integer-based variable such as a factor e.g. BY(factor, group, function(x) length(unique(x))). The case where the result is also a factor e.g. BY(factor, group, function(x) x[1]) is dealt with because unlist() preserves factors, so identical(class(x), class(y)) is TRUE. The last part of the expression again guards against reducing the length of univariate time series and then copying the attributes.

  3. Notably, additional (hidden) column pointers are allocated to be able to add columns without taking a shallow copy of the data.table, and an ".internal.selfref" attribute containing an external pointer is used to check if any shallow copy was made using base R commands like <-.

  4. Removing c("GRP_df", "grouped_df", "data.frame") if present to avoid duplicate classes and allowing grouped data to be re-grouped.

  5. Which reverses the changes of fgroup_by() so that the print method for the original object .X is called.

  6. This is done through the creation of a time-factor in the index_df attribute whose levels represent time steps i.e. the factor will have unused levels for gaps in time.