‘compositions’ v2.0: R classes for compositional analysis

counts?	total?	relative?	geometry	class
no	no	no	real compositional	rcomp
no	no	yes	relative compositional	acomp
no	yes	no	real amounts	rplus
no	yes	yes	relative amounts	aplus
yes	yes	yes	count multinomial	ccomp
no	–	no	real multivariate	rmult

Compositional classes, old and new

Once you have chosen your geometry (or a couple of them that you want to compare), you can load “compositions” with

library(compositions)
#> Welcome to compositions, a package for compositional data analysis.
#> Find an intro with "? compositions"
#> 
#> Attaching package: 'compositions'
#> The following objects are masked from 'package:stats':
#> 
#>     anova, cor, cov, dist, var
#> The following object is masked from 'package:graphics':
#> 
#>     segments
#> The following objects are masked from 'package:base':
#> 
#>     %*%, norm, scale, scale.default

and load your data with any data reading function (read.csv(), read_delim(), read_excel(), etc). Here we use for illustration purposes the data set “Hydrochem”

data("Hydrochem")
dim(Hydrochem)
#> [1] 485  19
colnames(Hydrochem)
#>  [1] "Code"     "Site"     "Location" "River"    "Date"     "H"       
#>  [7] "Na"       "K"        "Mg"       "Ca"       "Sr"       "Ba"      
#> [13] "NH4"      "Cl"       "NO3"      "PO4"      "SO4"      "HCO3"    
#> [19] "TOC"

having 5 descriptive variables and 14 chemical components (from H to TOC). These last form a composition, which we preliminarily consider of an aplus geometry

xp = aplus(Hydrochem[, c("Ca","K", "Na", "Mg")])
summary(xp)
#>             Ca        K       Na      Mg
#> Min.     40.39   0.0251    2.233   4.545
#> 1st Qu.  87.13   3.2230   30.360  20.430
#> Median  104.70  13.3700  135.100  34.260
#> Mean    119.30  10.1600   90.300  32.110
#> 3rd Qu. 170.10  32.6200  227.000  58.110
#> Max.    441.50 217.6000 1121.000 190.700
#> attr(,"class")
#> [1] "summary.aplus" "matrix"        "array"

plot(xp, col=Hydrochem$River, cex=0.5)

The object is of course an “aplus” object

is.aplus(xp)    # check with S3 paradigm
#> [1] TRUE
is(xp, "aplus") # check with S4 paradigm
#> [1] TRUE

but it is also an amounts object

is(xp, "amounts") # only S4 paradigm
#> [1] TRUE

The “amounts” class is an abstract superclass, that cannot be created directly. Also objects of class “rplus” and “ccomp” are amounts

is(rplus(xp), "amounts")
#> [1] TRUE
is(ccomp(xp), "amounts")
#> [1] TRUE

If we consider the total sum of each row to be irrelevant, the appropriate class is an “acomp”

xc = acomp(Hydrochem[, c("Ca","K", "Na", "Mg")])
summary(xc)
#> $mean
#>           Ca            K           Na           Mg 
#> "0.47366650" "0.04034558" "0.35849821" "0.12748971" 
#> attr(,"class")
#> [1] "acomp"
#> 
#> $mean.ratio
#>            Ca         K        Na        Mg
#> Ca 1.00000000 11.740233 1.3212521 3.7153311
#> K  0.08517718  1.000000 0.1125405 0.3164614
#> Na 0.75685786  8.885688 1.0000000 2.8119776
#> Mg 0.26915501  3.159943 0.3556216 1.0000000
#> 
#> $variation
#>           Ca        K        Na        Mg
#> Ca 0.0000000 1.697259 1.1117720 0.2195639
#> K  1.6972588 0.000000 0.4538740 1.1885007
#> Na 1.1117720 0.453874 0.0000000 0.7442923
#> Mg 0.2195639 1.188501 0.7442923 0.0000000
#> 
#> $expsd
#>          Ca        K       Na       Mg
#> Ca 1.000000 3.679544 2.870270 1.597718
#> K  3.679544 1.000000 1.961485 2.974821
#> Na 2.870270 1.961485 1.000000 2.369606
#> Mg 1.597718 2.974821 2.369606 1.000000
#> 
#> $invexpsd
#>           Ca         K        Na        Mg
#> Ca 1.0000000 0.2717728 0.3483993 0.6258926
#> K  0.2717728 1.0000000 0.5098179 0.3361547
#> Na 0.3483993 0.5098179 1.0000000 0.4220110
#> Mg 0.6258926 0.3361547 0.4220110 1.0000000
#> 
#> $min
#>              Ca         K          Na          Mg
#> Ca 1.0000000000 0.4980473 0.146331281 0.853172522
#> K  0.0004626728 1.0000000 0.008240315 0.005522552
#> Na 0.0381578947 1.7109562 1.000000000 0.259131129
#> Mg 0.0806127186 0.4801286 0.053971402 1.000000000
#> 
#> $q1
#>            Ca        K         Na        Mg
#> Ca 1.00000000 3.535891 0.59303483 2.8447810
#> K  0.02604998 1.000000 0.06717914 0.1342102
#> Na 0.30390625 5.067818 1.00000000 1.5538624
#> Mg 0.22329706 1.109722 0.18490000 1.0000000
#> 
#> $med
#>            Ca         K        Na       Mg
#> Ca 1.00000000 12.281003 0.9759519 3.499224
#> K  0.08142658  1.000000 0.1214027 0.272508
#> Na 1.02464066  8.237046 1.0000000 3.581553
#> Mg 0.28577763  3.669617 0.2792085 1.000000
#> 
#> $q3
#>           Ca         K        Na        Mg
#> Ca 1.0000000 38.387745 3.2904884 4.4783394
#> K  0.2828142  1.000000 0.1973236 0.9011264
#> Na 1.6862416 14.885572 1.0000000 5.4083288
#> Mg 0.3515209  7.451001 0.6435576 1.0000000
#> 
#> $max
#>          Ca         K         Na        Mg
#> Ca 1.000000 2161.3546 26.2068966 12.404990
#> K  2.007841    1.0000  0.5844685  2.082775
#> Na 6.833809  121.3546  1.0000000 18.528331
#> Mg 1.172096  181.0757  3.8590501  1.000000
#> 
#> $missingness
#>         missingType
#> variable NMV BDL MAR MNAR  SZ Err
#>       Ca 485   0   0    0   0   0
#>       K  485   0   0    0   0   0
#>       Na 485   0   0    0   0   0
#>       Mg 485   0   0    0   0   0
#> 
#> attr(,"class")
#> [1] "summary.acomp"

plot(xc, col=Hydrochem$River, cex=0.5)

This can be thus checked

is.acomp(xc)    # check with S3 paradigm
#> [1] TRUE
is(xc, "acomp") # check with S4 paradigm
#> [1] TRUE

There is also an abstract superclass for compositional objects, i.e. those where the total is irrelevant or an artifact

is(xc, "compositional") 
#> [1] TRUE

that contains objects of class “acomp”, “rcomp” and “ccomp”

is(rcomp(xc), "compositional")
#> [1] TRUE
is(ccomp(xc), "compositional")
#> [1] TRUE

Additionally, all comopostional classes have natural ways to be automatically converted to “data.frame” and to “structure” classes. This works like this

as(xc, "data.frame")
as(xc, "structure")

and it allows S4 classes with slots expecting a “data.frame” or one of “vector”, “matrix” or “array” to accept a compositional object. This will be discussed in the last section of this vignette.

Subsetting and transformations

Compositional data sets react now to the dollar notation. Since compositions v2 you can extract one particular variable with

summary(xp$Ca)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   40.39   87.13  104.70  132.12  170.10  441.50

As discussed in UsingCompositions.pdf, the key idea of compositional analysis is to transform the data prior to the analysis. All transformations are now accessible using the dollar notation

summary(xp$clr)
#>                 Ca          K         Na          Mg
#> Min.    -0.5765232 -4.4190306 -0.7626504 -1.14624527
#> 1st Qu.  0.4801820 -1.9501308  0.3878456 -0.69086350
#> Median   0.8936557 -1.5989396  0.8498314 -0.36473684
#> Mean     1.0135173 -1.4495044  0.7349375 -0.29895048
#> 3rd Qu.  1.5489812 -0.7802982  1.0572174  0.03228459
#> Max.     3.2594599  0.1205369  1.8932905  1.36923306
#> attr(,"class")
#> [1] "summary.rmult" "matrix"        "array"

This includes each of alr, ilr, clr, log, ilt, iit, apt, ipt, cpt, their generic versions cdt and idt, as well as all their inverses (e.g. cptInv or alrInv). Also raw, clo, unclass and pwlr resp. pwlrInv (pairwise logratio and its inverse) work with this trick

summary(xp)
#>             Ca        K       Na      Mg
#> Min.     40.39   0.0251    2.233   4.545
#> 1st Qu.  87.13   3.2230   30.360  20.430
#> Median  104.70  13.3700  135.100  34.260
#> Mean    119.30  10.1600   90.300  32.110
#> 3rd Qu. 170.10  32.6200  227.000  58.110
#> Max.    441.50 217.6000 1121.000 190.700
#> attr(,"class")
#> [1] "summary.aplus" "matrix"        "array"
summary(cdt(xp))
#>               Ca         K       Na       Mg
#> Min.    3.698582 -3.684887 0.803346 1.514028
#> 1st Qu. 4.467401  1.170313 3.413126 3.017004
#> Median  4.651099  2.593013 4.906015 3.533978
#> Mean    4.781715  2.318693 4.503135 3.469247
#> 3rd Qu. 5.136386  3.484926 5.424950 4.062338
#> Max.    6.090178  5.382888 7.022422 5.250702
#> attr(,"class")
#> [1] "summary.rmult" "matrix"        "array"
summary(xp$cdt)
#>               Ca         K       Na       Mg
#> Min.    3.698582 -3.684887 0.803346 1.514028
#> 1st Qu. 4.467401  1.170313 3.413126 3.017004
#> Median  4.651099  2.593013 4.906015 3.533978
#> Mean    4.781715  2.318693 4.503135 3.469247
#> 3rd Qu. 5.136386  3.484926 5.424950 4.062338
#> Max.    6.090178  5.382888 7.022422 5.250702
#> attr(,"class")
#> [1] "summary.rmult" "matrix"        "array"
summary(xp$cdt$cdtInv)
#>             Ca        K       Na      Mg
#> Min.     40.39   0.0251    2.233   4.545
#> 1st Qu.  87.13   3.2230   30.360  20.430
#> Median  104.70  13.3700  135.100  34.260
#> Mean    119.30  10.1600   90.300  32.110
#> 3rd Qu. 170.10  32.6200  227.000  58.110
#> Max.    441.50 217.6000 1121.000 190.700
#> attr(,"class")
#> [1] "summary.aplus" "matrix"        "array"

Another (potentially conflictive) novelty is the [-subsetting, in particular for closed classes (“acomp” and “rcomp”). In compositions-v1, subseting invariably destroyed the class, returning the selected rows or columns in a matrix, or even in a vector if only one row or one column was selected. From compositions-v2 on, the nature of the subsetting depends on class of the parent object and the number of selected elements:

selecting rows returns an object of the same class as the parent object; if only one row is chosen, the output will be a vector of the parent class;

xp0 = xp[1:5,]
xp0
#>   Ca      K       Na      Mg     
#> 1 "279.1" "6.318" "159.0" "92.41"
#> 2 "300.3" "6.449" "179.2" "96.20"
#> 3 "305.1" "5.978" "186.0" "96.83"
#> 4 "343.4" "5.171" "181.1" "88.58"
#> 5 "369.0" "7.371" "151.0" "84.90"
#> attr(,"class")
#> [1] "aplus"
xc0 = xc[1:5,]
xc0
#>   Ca          K             Na          Mg         
#> 1 "0.5199058" "0.011769133" "0.2961843" "0.1721408"
#> 2 "0.5158473" "0.011077920" "0.3078250" "0.1652498"
#> 3 "0.5137159" "0.010065532" "0.3131798" "0.1630387"
#> 4 "0.5554378" "0.008363917" "0.2929231" "0.1432751"
#> 5 "0.6026743" "0.012038787" "0.2466228" "0.1386641"
#> attr(,"class")
#> [1] "acomp"
xc[10,]
#>           Ca            K           Na           Mg 
#> "0.52155495" "0.01018437" "0.30386080" "0.16439989" 
#> attr(,"class")
#> [1] "acomp"

selecting columns in classes “rmult”, “ccomp”, “aplus” and “rplus” invariably return an object of that class, with the columns selected;

xp0[,"Ca"]
#>     1     2     3     4     5 
#> 279.1 300.3 305.1 343.4 369.0
xp0[,c("Ca","K")]
#>   Ca      K      
#> 1 "279.1" "6.318"
#> 2 "300.3" "6.449"
#> 3 "305.1" "5.978"
#> 4 "343.4" "5.171"
#> 5 "369.0" "7.371"
#> attr(,"class")
#> [1] "aplus"

selecting two or more columns of objects of class “acomp” or “rcomp” returns the selected subcomposition, that is, another object of the parent class with the selected components reclosed to sum to 1

xc0[,c("Ca","K")]
#>   Ca          K           
#> 1 "0.9778640" "0.02213595"
#> 2 "0.9789763" "0.02102370"
#> 3 "0.9807830" "0.01921705"
#> 4 "0.9851651" "0.01483485"
#> 5 "0.9804156" "0.01958440"
#> attr(,"class")
#> [1] "acomp"

selecting one single column returns that individual column as a vector; the same happens with the dollar notation; both destroy the class

xc0$Ca
#>         1         2         3         4         5 
#> 0.5199058 0.5158473 0.5137159 0.5554378 0.6026743
xc0[,"Ca"]
#>         1         2         3         4         5 
#> 0.5199058 0.5158473 0.5137159 0.5554378 0.6026743
xp0[,"Ca"]
#>     1     2     3     4     5 
#> 279.1 300.3 305.1 343.4 369.0

you can override this special bevavior and return to the behavior of compositions v1 by specifying the extra argument drop=TRUE in the subsetting, or globally by calling setStickyClassOption(FALSE)

xc0[,c("Ca","K"), drop=TRUE]
#>          Ca           K
#> 1 0.5199058 0.011769133
#> 2 0.5158473 0.011077920
#> 3 0.5137159 0.010065532
#> 4 0.5554378 0.008363917
#> 5 0.6026743 0.012038787
xp0[,"Ca", drop=TRUE]
#>     1     2     3     4     5 
#> 279.1 300.3 305.1 343.4 369.0

You can check which is the current status of sticky classes with getStickyClassOption().

Compositions as columns

Matrix-like compositions can be embedded in other data containers, typically “data.frame”s or tibbles. For this, we just need to define an extra column

Hydrochem$comp <- xc
head(Hydrochem)
#>      Code Site Location River       Date            H    Na     K    Mg    Ca
#> 1 9535612   95    Anoia Anoia 01.07.1997 7.943282e-09 159.0 6.318 92.41 279.1
#> 2 9535643   95    Anoia Anoia 01.08.1997 5.011872e-09 179.2 6.449 96.20 300.3
#> 3 9535674   95    Anoia Anoia 01.09.1997 7.943282e-09 186.0 5.978 96.83 305.1
#> 4 9535704   95    Anoia Anoia 01.10.1997 6.165950e-09 181.1 5.171 88.58 343.4
#> 5 9535735   95    Anoia Anoia 01.11.1997 1.202264e-08 151.0 7.371 84.90 369.0
#> 6 9535765   95    Anoia Anoia 01.12.1997 1.028016e-08 170.1 6.475 89.82 337.6
#>        Sr       Ba        NH4       Cl       NO3       PO4      SO4     HCO3
#> 1 8.58366 0.049900 0.22341340 345.9561 16.965779 0.4014300 730.6014 349.5047
#> 2 7.29700 0.054983 0.02694842 280.2735  8.647828 0.5218775 773.2336 314.6050
#> 3 7.34100 0.057367 0.07923994 300.2572 11.394282 0.5084960 884.5392 342.4435
#> 4 7.63000 0.059473 0.02231826 297.9269 11.293516 0.4014442 757.7849 335.2200
#> 5 8.24600 0.053410 0.46398830 229.2619  9.547616 0.4415887 616.2988 308.6500
#> 6 7.57200 0.050168 0.09921348 267.8800 13.253693 0.6958367 881.1794 334.1700
#>     TOC      comp.1      comp.2      comp.3      comp.4
#> 1 3.022 0.519905817 0.011769133 0.296184253 0.172140797
#> 2 3.039 0.515847317 0.011077920 0.307824973 0.165249790
#> 3 3.196 0.513715929 0.010065532 0.313179819 0.163038720
#> 4 3.871 0.555437840 0.008363917 0.292923101 0.143275142
#> 5 3.400 0.602674306 0.012038787 0.246622819 0.138664088
#> 6 3.625 0.558945024 0.010720287 0.281624848 0.148709840
dim(Hydrochem)
#> [1] 485  20

This composition can be then recovered with the dollar notation, and also used in further calls to statistical methods, e.g.

head(Hydrochem$comp)
#>   [,1]        [,2]          [,3]        [,4]       
#> 1 "0.5199058" "0.011769133" "0.2961843" "0.1721408"
#> 2 "0.5158473" "0.011077920" "0.3078250" "0.1652498"
#> 3 "0.5137159" "0.010065532" "0.3131798" "0.1630387"
#> 4 "0.5554378" "0.008363917" "0.2929231" "0.1432751"
#> 5 "0.6026743" "0.012038787" "0.2466228" "0.1386641"
#> 6 "0.5589450" "0.010720287" "0.2816248" "0.1487098"
#> attr(,"class")
#> [1] "acomp"
codalm = lm(alr(comp)~River, data=Hydrochem) 
summary(codalm)
#> Response v1 :
#> 
#> Call:
#> lm(formula = v1 ~ River, data = Hydrochem)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.90144 -0.19728  0.06689  0.26947  0.82324 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          1.13957    0.03242  35.152   <2e-16 ***
#> RiverCardener        0.09677    0.05131   1.886   0.0599 .  
#> RiverLowerLLobregat  0.01237    0.04652   0.266   0.7904    
#> RiverUpperLLobregat  0.65173    0.04892  13.324   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3877 on 481 degrees of freedom
#> Multiple R-squared:  0.3198, Adjusted R-squared:  0.3155 
#> F-statistic: 75.38 on 3 and 481 DF,  p-value: < 2.2e-16
#> 
#> 
#> Response v2 :
#> 
#> Call:
#> lm(formula = v2 ~ River, data = Hydrochem)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.3186 -0.3621  0.1977  0.4878  1.6234 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         -1.68613    0.07202 -23.413   <2e-16 ***
#> RiverCardener        0.99021    0.11399   8.687   <2e-16 ***
#> RiverLowerLLobregat  1.38843    0.10334  13.435   <2e-16 ***
#> RiverUpperLLobregat -0.19421    0.10867  -1.787   0.0745 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8612 on 481 degrees of freedom
#> Multiple R-squared:  0.3798, Adjusted R-squared:  0.376 
#> F-statistic:  98.2 on 3 and 481 DF,  p-value: < 2.2e-16
#> 
#> 
#> Response v3 :
#> 
#> Call:
#> lm(formula = v3 ~ River, data = Hydrochem)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.1345 -0.4578  0.2273  0.5083  1.7514 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          0.95089    0.06784  14.017  < 2e-16 ***
#> RiverCardener        0.21806    0.10738   2.031 0.042821 *  
#> RiverLowerLLobregat  0.45005    0.09735   4.623 4.86e-06 ***
#> RiverUpperLLobregat -0.36803    0.10236  -3.595 0.000357 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.8112 on 481 degrees of freedom
#> Multiple R-squared:  0.1213, Adjusted R-squared:  0.1158 
#> F-statistic: 22.13 on 3 and 481 DF,  p-value: 1.94e-13

The transformation can even be called with the dollar notation, and transformed data keep the information necessary to construct their back-transformation.

codalm = lm(comp$idt~River, data=Hydrochem) 
head(predict(codalm)$idtInv)
#>   [,1]        [,2]         [,3]        [,4]       
#> 1 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 2 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 3 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 4 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 5 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 6 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> attr(,"class")
#> [1] "acomp"

To recover the column names in the backtransformation you might need to make use of the orig argument of the explicit transformation (that is, without the dollar notation)

head(idtInv(predict(codalm), orig=xc))
#>   Ca          K            Na          Mg         
#> 1 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 2 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 3 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 4 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 5 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 6 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> attr(,"class")
#> [1] "acomp"

Note as well the existence of the new function backtransform, which will take an “rmult” object and return its expression in the appropriate original components. This can even be allowed to be controlled by a third object through an extra argument as

head(backtransform(idt(xp)))
#>   Ca      K       Na      Mg     
#> 1 "279.1" "6.318" "159.0" "92.41"
#> 2 "300.3" "6.449" "179.2" "96.20"
#> 3 "305.1" "5.978" "186.0" "96.83"
#> 4 "343.4" "5.171" "181.1" "88.58"
#> 5 "369.0" "7.371" "151.0" "84.90"
#> 6 "337.6" "6.475" "170.1" "89.82"
#> attr(,"class")
#> [1] "aplus"
head(backtransform(predict(codalm), as=codalm$residuals))
#>   [,1]        [,2]         [,3]        [,4]       
#> 1 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 2 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 3 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 4 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 5 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> 6 "0.4530465" "0.02685076" "0.3751471" "0.1449556"
#> attr(,"class")
#> [1] "acomp"

In the case of tibbles, the column names of the embedded object are even preserved, so that

tbHydro = tibble::as_tibble(Hydrochem)
tbHydro$comp = xc
tbHydro$comp$Ca

would return the same as xc$Ca. This is not included here to redude the dependencies of this package. Note that this trick does not work with “data.frame” objects, because [<- takes care of removing the column names of the composition before attaching it.

This property of self-back-transformability is the single most important addition to “compositions” v2, and will experiment improvements in the future. In particular, we strive to make the recovery of column names automatic in the near future.

‘compositions’ v2.0: R classes for compositional analysis

The “compositions” package

Compositional classes, old and new

Subsetting and transformations

Compositions as columns

Embedding in S4 objects