This vignette shows how to use healthcare.antitrust to implement the semiparametric demand estimation technique developed in Raval, Rosenbaum, and Tenn (2017) “A Semiparametric Discrete Choice Model: An Application to Hospital Mergers” https://doi.org/10.1111/ecin.12454. This model is useful to estimate “diversion ratios” which are used to assess substitutability between merging parties’ products in antitrust analysis.
The method is illustrated using a simulated dataset of hospital discharges that is included with the healthcare.antitrust package. This simulated dataset mimics the basic structure of real hospital discharge data, with each observation including a hospital identifier and patient characteristics (DRG diagnosis code, patient age, and patient zip code of residence). The data includes 1,200 discharges from eight hospitals.
Suppose that the analyst is interested in estimating diversion ratios between hospitals in System 1, which includes Hospitals 1 and 2, and Hospital 5, which is an independent hospital. This vignette will show how to implement each step to arrive at the estimate.
The first step is to assign observations to ‘cells,’ within which the
substitution proportional to share will be assumed to hold. This
assignment of observations can be done with the cell_defn
function. Begin by loading the example simulation dataset of hospital
discharges:
The observations will be grouped into cells based on variables in the
dataset. Let’s define cells based on three variables: the DRG diagnosis
code, the patient age, and the patient zip code of residence. We do this
by specifying the variables and then saving them in a list variable
called layers
.
A minimum cell size threshold also needs to be set. Observations will be assigned to a cell only if that cell includes at least the minimum threshold number of observations. Let’s set the minimum size to 15:
Then use cell_defn
to allocate observations to cells,
based on the variables in the layer. The one variable that needs to be
included in the dataframe is a variable specifying the number of
admissions represented by each observation. By default, this variable
name is count
, and if present the function
cell_defn
will use it as the admission count.
Alternatively, another variable can be indicated as the count variable.
If no variable is specified as the count and count
is not
present in the dataframe, cell_defn
will assume each
observation represents one admission and create a variable
count
always equal to one. We will go ahead and create the
variable in our dataframe, and then run cell_defn
.
discharge_data$count <- 1
outList <- cell_defn(discharge_data,th,layers)
#> Layer 1: 886 obs allocated
#> Number of Excluded Obs: 314
The cell_defn
function prints a message indicating the
number of observations assigned to cells, and also returns a list. The
first item in the return list is a new dataset that contains the
observations that have been allocated to a cell, as well as their
assignment.
The new dataset has nrow(D0)
observations.
cell_defn
stated that it excluded some observations, those
that were not allocated to a cell that met the minimum size threshold.
The number of unassigned observations added to the number of assigned
observations in the new dataset equals the number of observations in the
original dataset. A data frame of unassigned observations is the second
item in the return list and can be accessed with
outList$unassigned
.
Because some observations may not be assigned to a cell, it is useful
to specify several “layers” of variables in order to assign observations
to cells. Using a more coarse set of variables will assign more
observations, since the cells will be larger and more likely to exceed
the minimum size threshold. But finer layers allow one to better account
for heterogeneity in preferences across different types of people. That
is why cell_defn
allows for input of many layers, and they
should be ordered by increasing coarseness. This will assign
observations by more granular categories at first, and then for
remaining unassigned individuals, they will be assigned into more coarse
groupings.
Instead of using only one layer, we can modify our example to instead use three layers to define cells. Adding more coarse layers will allow more observations to be allocated to cells that meet the minimum size threshold. Let’s add layers that use three digit zip codes instead of the five digit zip code.
discharge_data$zip3 <- floor(discharge_data$zip5/100)
list1 <- c("drg","age","zip5")
list2 <- c("drg","age","zip3")
list3 <- c("zip3")
layers <- list(list1, list2, list3)
With this new definition of layers, we can allocate observations like before:
outList <- cell_defn(discharge_data,th,layers)
#> Layer 1: 886 obs allocated
#> Layer 2: 154 obs allocated
#> Layer 3: 160 obs allocated
#> Number of Excluded Obs: 0
D0 <- outList$assigned
print(nrow(D0))
#> [1] 1200
print(nrow(discharge_data))
#> [1] 1200
The observations are first allocated to cells as before, and the remaining observations are allocated to a cell based on three digit zip code only. With these definitions, every observation is allocated to a cell that meets the minimum size threshold that was set.
Note that in each layer, cell_defn
only assigns
observations that have not been assigned in previous layers. That is,
the function assigns observations without replacement. An alternative
method can be used, in which assignment at each layer is done by
sampling with replacement from all observation that meet the cell
criteria, even if they have been previously assigned in a finer layer.
This option of assigning with replacement is currently not
implemented.
We are now able to calculate diversion ratios based on the cell definitions.
Once observations have been assigned to cells, we can now calculate
diversion ratios with the function div_calc
. Within each
cell \(c\), the diversion from a
hospital \(k\) to hospital \(h\) is given by \(d^c_{k,h} = \frac{s^c_h}{1 - s^c_{m(k)}}\)
where \(s^c_h\) is the share of cell
\(c\) admissions to hospital \(h\), and \(s^c_{m(k)}\) is the share of cell \(c\) admissions at hospital \(k\) or any other hospital in the same
system as \(k\). The overall diversions
are then calculated by aggregating over cells (See Raval, Rosenbaum, and
Tenn (2017) for details.)
The required variables in the input dataframe are cell
,
which has been defined by the cell_defn
function, as well
as provider and system identifiers: provider_id
,
provider
, sys_id
, and system
. The
function also requires count
, which is created by
cell_defn
if it does not exist.
Finally, we need to specify an indicator for the set of hospitals we
are interested in diversions from. In the analysis of a merger, this
will typically be the hospitals that are parties to the proposed merger.
The variable to indicate such systems is focal_sys_id
,
which should be a numeric vector listing the sys_id
’s of
the systems of interest. Diversion ratios will be calculated for
providers belonging to the focal systems. Since in this example we are
interested in studying a proposed merger including Systems 1 and 5, we
define the variable accordingly.
Now we can calculate diversions using the div_calc
function.
focal_systems <- c(1,5)
out <- div_calc(D0, provider_id = "hosp_id", provider = "hospital",
focal_sys_id = focal_systems)
The function div_calc
returns a list with two
components, the first is a matrix of provider-level diversions and the
second is a matrix aggregated to system-level diversions. The
provider-level matrix gives diversions from each of the merging
providers to all other providers The diversions should sum to 1. We can
verify that our diversions sum to one for Hospital 1:
We can also print the provider-level and system-level diversions.
print(out$provider_level)
#> hosp_id hospital sys_id N_h div_from_1 div_from_2 div_from_5
#> 1 1 Hospital 1 1 126 NA NA 0.019
#> 2 2 Hospital 2 1 62 NA NA 0.116
#> 5 5 Hospital 5 5 281 0.029 0.242 NA
#> 3 3 Hospital 3 3 68 0.014 0.073 0.350
#> 4 4 Hospital 4 3 67 0.340 0.059 0.013
#> 6 6 Hospital 6 6 198 0.033 0.164 0.185
#> 7 7 Hospital 7 7 107 0.008 0.161 0.046
#> 8 8 Hospital 8 8 291 0.576 0.301 0.273
print(out$sys_level)
#> hosp_id hospital sys_id N_h div_from_sys_1 div_from_sys_5
#> 1 1 Hospital 1 1 126 NA 0.019
#> 2 2 Hospital 2 1 62 NA 0.116
#> 5 5 Hospital 5 5 281 0.099 NA
#> 3 3 Hospital 3 3 68 0.033 0.350
#> 4 4 Hospital 4 3 67 0.247 0.013
#> 6 6 Hospital 6 6 198 0.076 0.185
#> 7 7 Hospital 7 7 107 0.058 0.046
#> 8 8 Hospital 8 8 291 0.485 0.273
If the total diversion for a given hospital does not sum to 1, that means that at least one degenerate cell exists, meaning that there exists a cell where every individual assigned to that cell visits the same hospital. Such cells cause problems for the method used here, which predicts diversions for individuals in a cell based on the shares within that cell. If the share is 100% in a cell, the model cannot make any prediction about to where those individuals would substitute.
By default, div_calc
will drop any degenerate cells so
that diversions will still sum to 1. If a degenerate cell exists and the
dropDegenerateCell
option is set to FALSE
, the
diversion for a hospital may sum to less than one because degenerate
cells will not be dropped. A notification flag will be printed whenever
a degenerate cell exists for a given hospital.
Once observations have been assigned to cells, one can also use the
function wtp_calc
to calculate the willingness-to-pay (WTP)
of a hospital system (see Capps, Dranove, and Satterthwaite (2003)
“Competition and Market Power in Option Demand Markets” https://www.jstor.org/stable/1593786). This can be used
to estimate the change in WTP that would occur after a merger, using the
discharge data as we have been using.
Within this semiparametric estimation method, the formulas for WTP are as follows. For individuals in a given cell \(c\), the choice probabiities for a given hospital \(j\) are estimated by \(s_j^c = \frac{N^c_j}{N^c}\), where \(N^c\) is the number of admissions belonging to cell \(c\), and \(N^c_j\) is the number of admissions from cell \(c\) that visit hospital \(j\). The WTP for the cell is then given by \(WTP_c(j) = \ln \frac{1}{1-s_j^c}\). The population WTP for hospital \(j\) is then simply the weighted sum across cells:
\(WTP(j) = \sum_c N^c \times WTP_c(j)\)
To calculate the change in WTP resulting from a merger, first note the general formula for calcuating WTP for a system \(M\) is:
\(WTP_c(M) = \ln \frac{1}{1- \sum_{j \in M} s_j^c}\)
Aggregating across cells gives the population WTP. The change in WTP from a merger can then be given as the WTP of the post-merger system minus the WTP of the sum of the systems pre-merger. If we denote by \(M\) the hypothetical post-merger system and \(M_1\) and \(M_2\) the systems before they merge, then as a percentage increase the change in willingness-to-pay is given by \(\Delta WTP = \frac{WTP(M) - (WTP(M_1) + WTP(M_2)) }{WTP(M_1) + WTP(M_2)}\)
To implement this calculation using the wtp_calc
function, we calculate the WTP for both systems pre-merger as well as
for a hypothetical combined system.
out <- wtp_calc(D0)
y_pre <- subset(out, sys_id %in% focal_systems)
D0_post <- D0
D0_post$sys_id[D0_post$sys_id == 5] <- 1
out <- wtp_calc(D0_post)
y_post <- subset(out, sys_id == 1)
y_pre$sys_id <- 1
y_pre <- aggregate(list(WTP_s = y_pre$WTP_s, WTP_s_wt = y_pre$WTP_s_wt, N_s=y_pre$N_s),by=list(y_party=y_pre$sys_id),sum)
print("% Change in WTP")
#> [1] "% Change in WTP"
print((y_post$WTP_s-y_pre$WTP_s)/(y_pre$WTP_s)*100)
#> [1] 6.857682
This standard approach makes the assumption that WTP per dollar is
constant across diagnoses. One alternative approach is to weight each
admission by the diagnoses code (such as DRG) for each admission. This
can be done with wtp_calc
by providing a variable
weight
. For example, it is sometimes desirable to weight
the WTP calculation by DRG weights. If weight is provided,
wtp_calc
will provide both a weighted an unweighted
calculation for WTP. Otherwise, wtp_calc
will assume a
weight of 1 and both the unweighted and weighted WTP estimates will be
equivalent.