This much-requested vignette provides some details about how collapse deals with various R objects. It is principally a digest of cumulative details provided in the NEWS for various releases since v1.4.0.
collapse is a class-agnostic programming framework that can deal with a broad range of R objects. It provides explicit support for base R classes and data types (integer, double, character, logical, list, data.frame, matrix, factor, Date, POSIXct, ts), as well as data.table, tibble, grouped_df, xts, zoo, pseries, pdata.frame, units, and sf (no geometric operations).
It also introduces GRP_df as a more performant and class-agnostic grouped data frame, and the indexed_series and indexed_frame classes as modern class-agnostic successors of pseries and pdata.frame. These objects inherit the classes they succeed and are handled through .pseries, .pdata.frame, and .grouped_df methods, which also support the original (plm / dplyr) implementations (more details at the end).
All other objects are handled internally at the C or R level using general principles extended by specific considerations for some of the above classes. I start by summarizing the general principles, which enable the use of collapse with further classes it was not designed for.
In general, in collapse, attributes and classes of R objects are preserved in statistical and data manipulation operations unless their preservation involves a high risk of yielding something wrong or useless. Risky operations are those that change the dimensions or internal data type (typeof()) of an R object.
To collapse’s R and C code, there exist 3 principal types of objects: atomic vectors, matrices, and lists - which are often assumed to be data frames. Most data manipulation functions in collapse like fmutate() only support lists, whereas statistical functions - notably the S3 generic Fast Statistical Functions like fmean() - generally support all 3 types of objects.
S3 generic functions initially dispatch to .default, .matrix, .data.frame, and (hidden) .list methods. The .list method generally dispatches to the .data.frame method. These basic methods, and other non-generic functions in collapse, then decide how exactly to handle the object based on the statistical operation performed and attribute handling principles mostly implemented in C.
The simplest case arises when an operation preserves the dimensions of the object, such as fscale(x) or fmutate(data, across(a:c, log)). In this case, all attributes of x / data are preserved (see footnote 1).
Another simple case for matrices and lists arises when a statistical operation reduces them to a single dimension such as fmean(x), where, under the drop = TRUE default of Fast Statistical Functions, all attributes apart from (column-)names are dropped and a (named) vector of means is returned.
For atomic vectors, a statistical operation like fmean(x) will preserve the attributes (except for ts objects), as the object could have useful properties such as labels or units.
More complex cases involve changing the dimensions (number of rows or columns) of an object. If the number of rows is preserved, e.g. fmutate(data, a_b = a / b) or flag(x, -1:1), only the (column-)names attribute of the object is modified. If the number of rows is reduced, e.g. fmean(x, g), all attributes are also retained under suitable modifications of the (row-)names attribute. However, if x is a matrix, attributes other than row- or column-names are only retained if !is.object(x), that is, if the matrix does not have a ‘class’ attribute. For atomic vectors, attributes are retained if !inherits(x, "ts"), as aggregating a time series will break the class. This also applies to columns in a data frame being aggregated.
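The rules so far can be checked interactively. The following is a minimal sketch, assuming collapse is installed; the vector x, its "label" attribute and the grouping vector g are made up for illustration, and the comments restate the behavior described above rather than guaranteed output.

```r
library(collapse)

# A plain numeric vector carrying a custom attribute (hypothetical example data)
x <- structure(c(1, 2, 3, 4), label = "Some variable")
g <- c(1, 1, 2, 2)

# Dimension-preserving transformation: attributes of x are kept
x_scaled <- fscale(x)
attr(x_scaled, "label")

# Aggregating a vector: attributes are kept because x is not a 'ts' object
x_mean <- fmean(x)
attr(x_mean, "label")

# Grouped aggregation: attributes are kept, names reflect the groups
x_gmean <- fmean(x, g)
attributes(x_gmean)

# Aggregating a data frame under drop = TRUE: only the column names survive
col_means <- fmean(mtcars)
names(col_means)
```

Inspecting attributes() before and after an operation in this way is also a quick test of whether a class that collapse was not designed for survives a given function.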
When data is transformed using statistics as provided by the TRA() function, e.g. TRA(x, STATS, operation, groups), and the like-named argument to the Fast Statistical Functions, operations that simply modify the input (x) in a statistical sense ("replace_NA", "-", "-+", "/", "+", "*", "%%", "-%%") just copy the attributes to the transformed object. Operations "replace_fill" and "replace" are more tricky, since here x is replaced with STATS, which could be of a different class or data type. The following rules apply: (1) the result has the same data type as STATS; (2) if is.object(STATS), the attributes of STATS are preserved; (3) otherwise the attributes of x are preserved unless is.object(x) && typeof(x) != typeof(STATS); (4) an exception to this rule is made if x is a factor and an integer replacement is supplied as STATS, e.g. fnobs(factor, group, TRA = "replace_fill"). In that case, the attributes of x are copied except for the ‘class’ and ‘levels’ attributes. These rules were devised considering the possibility that x may have important information attached to it which should be preserved in data transformations, such as a "label" attribute.
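The four rules for "replace_fill" and "replace" can be restated as a small decision function. This is only a base R sketch of the rules as worded above, not collapse's C implementation; the helper replace_attr_source() and the example objects are hypothetical, and the "none" outcome is my reading of the "unless" clause in rule (3).

```r
# Hypothetical helper (not part of collapse): reports which attributes the
# result of a "replace_fill" / "replace" transformation would carry.
# Rule (1), the data type of the result, always follows typeof(STATS).
replace_attr_source <- function(x, STATS) {
  if (is.object(STATS)) return("attributes of STATS")              # rule (2)
  if (is.factor(x) && is.integer(STATS))                           # rule (4)
    return("attributes of x except 'class' and 'levels'")
  if (is.object(x) && typeof(x) != typeof(STATS)) return("none")   # rule (3), 'unless' clause
  "attributes of x"                                                # rule (3)
}

f <- factor(c("a", "b", "a"))
d <- as.Date("2024-01-01") + 0:2
v <- structure(c(1.5, 2.5, 3.5), label = "Some variable")

replace_attr_source(v, 2.5)          # labelled double replaced by a double
replace_attr_source(f, 2L)           # factor replaced by integer counts
replace_attr_source(d, "a")          # Date replaced by a character value
replace_attr_source(v, Sys.Date())   # the replacement itself has a class
```

Rule (4) is checked before rule (3) in the sketch because a factor and an integer replacement share the same typeof(), so rule (3) alone would copy the ‘class’ and ‘levels’ attributes and produce a malformed factor.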
Another rather complex case arises when manipulating data with collapse using base R functions, e.g. BY(mtcars$mpg, mtcars$cyl, mad) or mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mad_mpg = mad(mpg)). In this case, collapse internally uses the base R functions lapply() and unlist(), following efficient splitting with gsplit() (which preserves all attributes). Concretely, the result is computed as y = unlist(lapply(gsplit(x, g), FUN, ...), FALSE, FALSE), where in the examples x is mtcars$mpg, g is the grouping variable(s), FUN = mad, and y is mad(x) in each group. To follow its policy of attribute preservation as closely as possible, collapse then calls an internal function y_final = copyMostAttributes(y, x), which copies the attributes of x to y if both are deemed compatible (see footnote 2; roughly, of the same data type). If they are deemed incompatible, copyMostAttributes still checks if x has a "label" attribute and copies that one to y.
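The split-apply-combine step and the subsequent attribute copy can be mimicked in base R. In the sketch below, split() stands in for gsplit(), compatible() transcribes the condition given in the footnote on copyMostAttributes, and copy_most_attributes_sketch() is a hypothetical R-level stand-in for the internal C function; the "label" value is made up.

```r
# Example data: a numeric vector with a label, and a grouping variable
x <- structure(mtcars$mpg, label = "Miles per gallon")
g <- mtcars$cyl

# Split-apply-combine; base split() stands in for collapse's gsplit() here
y <- unlist(lapply(split(x, g), mad), FALSE, FALSE)

# Compatibility condition as stated in the footnote on copyMostAttributes
compatible <- function(x, y) {
  typeof(x) == typeof(y) &&
    (identical(class(x), class(y)) || typeof(y) != "integer" ||
       inherits(x, c("IDate", "ITime"))) &&
    !(length(x) != length(y) && inherits(x, "ts"))
}

# Hypothetical stand-in for the internal copyMostAttributes(y, x):
# copy everything except names/dim/dimnames if compatible, else only "label"
copy_most_attributes_sketch <- function(y, x) {
  keep <- setdiff(names(attributes(x)), c("names", "dim", "dimnames"))
  if (!compatible(x, y)) keep <- intersect(keep, "label")
  for (nm in keep) attr(y, nm) <- attr(x, nm)
  y
}

y_final <- copy_most_attributes_sketch(y, x)
attributes(y_final)

# Counting on a factor yields plain integers: class and levels are not copied
f <- factor(c("a", "b", "a", "c"))
n <- unlist(lapply(split(f, c(1, 1, 2, 2)), function(v) length(unique(v))), FALSE, FALSE)
compatible(f, n)
```

The factor example shows why the condition inspects classes when the result is an integer: copying ‘levels’ and ‘class’ onto a vector of counts would create a malformed factor.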
So to summarize the general principles: collapse tries to preserve attributes in all cases except where doing so is likely to break something, bearing in mind how the most commonly used R classes and objects behave. The operations most likely to break something are aggregating matrices which have a class (such as mts/xts) or univariate time series (ts), replacing data with another object, and applying an unknown function to a vector by groups and assembling the result with unlist(). In the latter cases, particular attention is paid to integer vectors and factors, as we often count something generating integers, and malformed factors need to be avoided.
The following section provides some further details for some collapse functions and supported classes.
Quick conversion functions qDF(), qDT(), qTBL() and qM() (to create data.frame’s, data.table’s, tibble’s and matrices from arbitrary R objects) by default (keep.attr = FALSE) perform very strict conversions, where all attributes non-essential to the class are dropped from the input object. This is to ensure that, following conversion, objects behave exactly the way users expect. This is different from the behavior of functions like as.data.frame(), as.data.table(), as_tibble() or as.matrix(). For example, as.matrix(EuStockMarkets) just returns EuStockMarkets, whereas qM(EuStockMarkets) returns a plain matrix without time series attributes. This behavior can be changed by setting keep.attr = TRUE, i.e. qM(EuStockMarkets, keep.attr = TRUE).
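The difference between strict and attribute-preserving conversion is easy to inspect. The following is a short sketch, assuming collapse is installed; it only prints classes and attributes so that the behavior described above can be verified locally.

```r
library(collapse)

# EuStockMarkets is a multivariate time series shipped with base R
class(EuStockMarkets)

# Strict conversion (default keep.attr = FALSE): time series attributes are dropped
m <- qM(EuStockMarkets)
class(m)
attr(m, "tsp")

# Retaining the attributes of the input object
m_ts <- qM(EuStockMarkets, keep.attr = TRUE)
class(m_ts)
```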
Functions num_vars(), cat_vars() (the opposite of num_vars()), char_vars() etc. are implemented in C to quickly select columns by data type without the need to check data frame columns by applying a function such as is.numeric() in R. For is.numeric, the C implementation is equivalent to is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") && (!is.object(x) || inherits(x, "ts") || inherits(x, "units") || inherits(x, "integer64")). This of course does not respect the behavior of other classes that define methods for is.numeric. For example, given is.numeric.foo <- function(x) TRUE and y = structure(rnorm(100), class = "foo"), is.numeric(y) is TRUE but num_vars(data.frame(y)) returns an empty frame. Correct behavior in this case requires get_vars(data.frame(y), is.numeric). A particular case to be aware of here is when using collap() with the FUN and catFUN arguments, where the C code (is_numeric_C) is used internally to decide whether a column is numeric or categorical. Thus numeric columns with a class attribute other than “ts”, “units”, and “integer64” are not recognized as numeric by collap(). collapse also does not support statistical operations on complex data.
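The R-level equivalent given above can be run directly in base R to see which objects count as numeric. The class "foo" is the hypothetical class from the text; the remaining objects are ordinary base R examples.

```r
# R-level equivalent of the C check, as given in the text
is_numeric_C <- function(x) typeof(x) %in% c("integer", "double") &&
  (!is.object(x) || inherits(x, "ts") || inherits(x, "units") || inherits(x, "integer64"))

# Hypothetical class "foo" wrapping a double vector
y <- structure(rnorm(100), class = "foo")

is.numeric(y)                       # base R: a classed double is still numeric
is_numeric_C(y)                     # classed, and not ts/units/integer64
is_numeric_C(Sys.Date())            # Date is a classed double
is_numeric_C(AirPassengers)         # 'ts' objects are explicitly allowed
is_numeric_C(factor(c("a", "b")))   # factors are classed integers
is_numeric_C(1:3)                   # plain integer vector
```

A quick way to find columns affected by this in a data frame is to compare vapply(data, is.numeric, TRUE) with vapply(data, is_numeric_C, TRUE).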
Time Series Functions flag, fdiff, fgrowth and psacf/pspacf/psccf (and the operators L/F/D/Dlog/G) have a t argument to pass time-ids for fully identified temporal operations on time series and panel data. If t is a plain numeric vector or a factor, it is coerced to integer using as.integer(), and the integer steps are used as time steps. This is premised on the observation that the most common form of temporal identifier is a numeric variable denoting calendar years. If, on the other hand, t is a numeric time object such that is.object(t) && is.numeric(unclass(t)) (e.g. Date, POSIXct, etc.), then it is passed through timeid(), which computes the greatest common divisor of the vector and generates an integer time-id in that way. Users are therefore advised to use appropriate classes to represent time steps, e.g. for monthly data zoo::yearmon would be appropriate. It is also possible to pass non-numeric t, such as character or list/data.frame. In such cases ordered grouping is applied to generate an integer time-id, but this is best avoided.
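To illustrate the greatest-common-divisor idea for classed time objects, here is a base R sketch. The functions gcd2() and time_id_sketch() are hypothetical and only mimic the principle described above; they are not collapse's timeid() and make no claim about its exact output or edge-case handling.

```r
# Greatest common divisor of two non-negative numbers (Euclid's algorithm)
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Hypothetical stand-in for the idea behind timeid(): express each time point
# as a whole number of GCD-sized steps from the earliest observation
time_id_sketch <- function(t) {
  d <- unclass(t) - min(unclass(t))   # offsets from the earliest time point
  step <- Reduce(gcd2, d)             # largest step that divides all offsets
  if (step == 0) step <- 1            # all time points identical
  as.integer(d / step) + 1L
}

# Weekly dates with one missing week (2020-01-15)
dates <- as.Date(c("2020-01-01", "2020-01-08", "2020-01-22"))
time_id_sketch(dates)   # 1 2 4: the gap shows up as a skipped id

# Plain numeric years are simply coerced with as.integer()
as.integer(c(2001, 2002, 2004))
```

The sketch also shows why the class matters: monthly data stored as Date has steps of 28 to 31 days, so a day-based divisor would not recover monthly steps, whereas a class that counts in months does.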
xts/zoo time series are handled through .zoo methods to all relevant functions. These methods are simple and all follow this pattern: FUN.zoo <- function(x, ...) if(is.matrix(x)) FUN.matrix(x, ...) else FUN.default(x, ...). Thus the general principles apply. Time series functions do not automatically use the index for indexed computations, partly for consistency with native methods where this is also not the case (e.g. lag.xts does not perform an indexed lag), and partly because, as outlined above, the index does not necessarily accurately reflect the time structure. Thus the user must exercise discretion to perform an indexed lag on xts/zoo. For example: flag(xts_daily, 1:3, t = index(xts_daily)) or flag(xts_monthly, 1:3, t = zoo::as.yearmon(index(xts_monthly))).
collapse internally supports sf by seeking to avoid undue destruction of sf objects through removal of the ‘geometry’ column in data manipulation operations. This is simply implemented through an additional check in the C programs used to subset columns of data: if the object is an sf data frame, the ‘geometry’ column is added to the column selection. Other functions like funique() or roworder() have internal facilities to avoid sorting or grouping on the ‘geometry’ column. Yet other functions like descr() and qsu() simply omit the geometry column in their statistical calculations. A short vignette describes the integration of collapse and sf in a bit more detail. In summary: collapse supports sf by seeking to appropriately deal with the ‘geometry’ column. It cannot perform any geometrical operations. For example, after subsetting sf with fsubset(), the bounding box attribute of the geometry is unaltered and likely too large.
Regarding units objects, all relevant functions also have simple methods of the form FUN.units <- function(x, ...) if(is.matrix(x)) copyMostAttrib(FUN.matrix(x, ...), x) else FUN.default(x, ...). According to the general principles, the default method preserves the units class, whereas the matrix method does not if FUN aggregates the data. The use of copyMostAttrib(), which copies all attributes apart from "dim", "dimnames", and "names", ensures that the returned objects are still units objects.
collapse provides quite thorough support for data.table. The simplest level of support is that it avoids assigning descriptive (character) row names to data.table’s, e.g. fmean(mtcars, mtcars$cyl) has row-names corresponding to the groups but fmean(qDT(mtcars), mtcars$cyl) does not.
collapse further supports data.table’s reference semantics (set*, :=). To be able to add columns by reference (e.g. DT[, new := 1]), data.table’s are implemented as overallocated lists (see footnote 3). collapse copied some C code from data.table to do the overallocation and generate the ".internal.selfref" attribute, so that qDT() creates a valid and fully functional data.table. To enable seamless data manipulation combining collapse and data.table, all data manipulation functions in collapse call this C code at the end and return a valid (overallocated) data.table. However, because this overallocation comes at a computational cost of 2-3 microseconds, I have opted against also adding it to the .data.frame methods of statistical functions. Concretely, this means that res <- DT |> fgroup_by(id) |> fsummarise(mu_a = fmean(a)) gives a fully functional data.table, i.e. res[, new := 1] works, but res2 <- DT |> fgroup_by(id) |> fmean() gives a non-overallocated data.table such that res2[, new := 1] will still work but issue a warning. In this case, res2 <- DT |> fgroup_by(id) |> fmean() |> qDT() can be used to avoid the warning. This, to me, seems a reasonable trade-off between flexibility and performance. More details and examples are provided in the collapse and data.table vignette.
As indicated in the introductory remarks, collapse provides a fast class-agnostic grouped data frame created with fgroup_by(), and fast class-agnostic indexed time series and panel data, created with findex_by()/reindex(). Class-agnostic means that the object that is grouped/indexed continues to behave as before except in collapse operations utilizing the ‘groups’/‘index_df’ attributes.
The grouped data frame is implemented as follows: fgroup_by() saves the class of the input data, calls GRP() on the columns being grouped, and attaches the resulting ‘GRP’ object in a "groups" attribute. It then assigns a class attribute as follows:
clx <- class(.X) # .X is the data frame being grouped, clx is its class
m <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
class(.X) <- c("GRP_df", if(length(mp <- m[m != 0L])) clx[-mp] else clx, "grouped_df", if(m[3L]) "data.frame")
In words: a class "GRP_df" is added in front, followed by the classes of the original object (see footnote 4), followed by "grouped_df" and finally "data.frame", if present. The first class "GRP_df" is for dealing appropriately with the object through methods for print() and subsetting ([, [[), e.g. print.GRP_df fetches the grouping object, prints fungroup(.X) (see footnote 5), and then prints a summary of the grouping. [.GRP_df works similarly: it saves the groups, calls [ on fungroup(.X), and attaches the groups again if the result is a list with the same number of rows. So collapse has no issues printing and handling grouped data.table’s, tibbles, sf data frames, etc.: they continue to behave as usual. Now collapse has various functions with a .grouped_df method to deal with grouped data frames. For example fmean.grouped_df, in a nutshell, fetches the attached ‘GRP’ object using GRP.grouped_df, and calls fmean.data.frame on fungroup(data), passing the ‘GRP’ object to the g argument for grouped computation. Here the general principles outlined above apply so that the resulting object has the same classes and attributes as the original one.
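The class assignment shown above can be wrapped in a small base R function to see what it produces for different inputs. The wrapper grouped_class() is hypothetical; its body is the logic from the three lines of code above, applied to a class vector rather than to a data frame.

```r
# Class vector assigned to a grouped data frame, following the code above
grouped_class <- function(clx) {
  m <- match(c("GRP_df", "grouped_df", "data.frame"), clx, nomatch = 0L)
  c("GRP_df", if (length(mp <- m[m != 0L])) clx[-mp] else clx,
    "grouped_df", if (m[3L]) "data.frame")
}

grouped_class(c("tbl_df", "tbl", "data.frame"))
grouped_class(c("data.table", "data.frame"))

# Re-grouping an already grouped object does not duplicate classes
once  <- grouped_class(c("sf", "data.frame"))
twice <- grouped_class(once)
identical(once, twice)
```

Because the original classes sit between "GRP_df" and "grouped_df", S3 dispatch for functions without a GRP_df method still reaches the methods of the underlying class first.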
This architecture has an additional advantage: it allows GRP.grouped_df to examine the grouping object and check if it was created by collapse (class ‘GRP’) or by dplyr. If the latter is the case, an efficient C routine is called to convert the dplyr grouping object to a ‘GRP’ object so that all .grouped_df methods in collapse apply to data frames created with either dplyr::group_by() or fgroup_by().
The indexed_frame works more or less in the same way. It inherits from pdata.frame so that .pdata.frame methods in collapse deal with both indexed_frame’s of arbitrary classes and pdata.frame’s created with plm.
A notable difference to both grouped_df and pdata.frame is that indexed_frame is a deeply indexed data structure: each variable inside an indexed_frame is an indexed_series which contains in its index_df attribute an external pointer to the index_df attribute of the frame. Functions with pseries methods operating on indexed_series stored inside the frame (such as with(data, flag(column))) can fetch the index from this pointer. This allows worry-free application inside arbitrary data masking environments (with, %$%, attach, etc.) and estimation commands (glm, feols, lmrob etc.) without duplication of the index in memory. As you may have guessed, indexed_series are also class-agnostic and inherit from pseries. Any vector or matrix of any class can become an indexed_series.
Further levels of generality are that indexed series and frames allow one, two or more variables in the index to support both time series and complex panels, natively deal with irregularity in time (see footnote 6), and provide a rich set of methods for subsetting and manipulation which also subset the index_df attribute, including internal methods for fsubset(), funique(), roworder(v) and na_omit(). So indexed_frame and indexed_series are rich and general structures permitting fully time-aware computations on nearly any R object. See ?indexing for more information.
collapse handles R objects in a preserving and fairly intelligent manner, allowing seamless compatibility with many common data classes in R, and statistical workflows that preserve attributes (labels, units, etc.) of the data. This is done by general principles and some specific considerations/exemptions mostly implemented in C - as detailed in this vignette.
The main benefits of this design are generality and execution speed: collapse has far fewer R-level method dispatches and function calls than other frameworks used to perform statistical or data manipulation operations, it behaves predictably, and may also work well with your simple new class.
The main disadvantage is that many of the general principles and exceptions are hard-coded in C and thus may not work with specific classes. A prominent example where collapse simply fails is lubridate’s interval class (#186, #418), which has a "starts" attribute of the same length as the data that is preserved but not subset in collapse operations.
[1] Preservation implies a shallow copy of the attribute lists from the original object to the result object. A shallow copy is memory-efficient and means we are copying the list containing the attributes in memory, but not the attributes themselves. Whenever I talk about copying attributes, I mean a shallow copy, not a deep copy. You can perform shallow copies with helper functions copyAttrib() or copyMostAttrib(), and directly set attribute lists using setAttrib() or setattrib().
[2] Concretely, attributes are copied if (typeof(x) == typeof(y) && (identical(class(x), class(y)) || typeof(y) != "integer" || inherits(x, c("IDate", "ITime"))) && !(length(x) != length(y) && inherits(x, "ts"))). The first part of the condition is easy: if x and y are of different data types we do not copy attributes. The second condition states that to copy attributes we also need to ensure that x and y are either of the same class, or y is not integer, or x is an integer-based date or time (= classes provided by data.table). The main reason for this clause is to guard against cases where we are counting something on an integer-based variable such as a factor, e.g. BY(factor, group, function(x) length(unique(x))). The case where the result is also a factor, e.g. BY(factor, group, function(x) x[1]), is dealt with because unlist() preserves factors, so identical(class(x), class(y)) is TRUE. The last part of the expression again guards against reducing the length of univariate time series and then copying the attributes.
[3] Notably, additional (hidden) column pointers are allocated to be able to add columns without taking a shallow copy of the data.table, and an ".internal.selfref" attribute containing an external pointer is used to check if any shallow copy was made using base R commands like <-.
[4] Removing c("GRP_df", "grouped_df", "data.frame") if present to avoid duplicate classes and to allow grouped data to be re-grouped.
[5] Which reverses the changes of fgroup_by() so that the print method for the original object .X is called.
[6] This is done through the creation of a time-factor in the index_df attribute whose levels represent time steps, i.e. the factor will have unused levels for gaps in time.