Semiparametric Estimation with Hospital Discharge Data

Matthew T. Panhans

2024-01-08

This vignette shows how to use healthcare.antitrust to implement the semiparametric demand estimation technique developed in Raval, Rosenbaum, and Tenn (2017) “A Semiparametric Discrete Choice Model: An Application to Hospital Mergers” https://doi.org/10.1111/ecin.12454. This model is useful to estimate “diversion ratios” which are used to assess substitutability between merging parties’ products in antitrust analysis.

The method is illustrated using a simulated dataset of hospital discharges that is included with the healthcare.antitrust package. This simulated dataset mimics the basic structure of real hospital discharge data, with each observation including a hospital identifier and patient characteristics (DRG diagnosis code, patient age, and patient zip code of residence). The data includes 1,200 discharges from eight hospitals.

Suppose that the analyst is interested in estimating diversion ratios between hospitals in System 1, which includes Hospitals 1 and 2, and Hospital 5, which is an independent hospital. This vignette will show how to implement each step to arrive at the estimate.

Cell definitions

The first step is to assign observations to ‘cells,’ within which the substitution proportional to share will be assumed to hold. This assignment of observations can be done with the cell_defn function. Begin by loading the example simulation dataset of hospital discharges:

library(healthcare.antitrust)
data(discharge_data, package = "healthcare.antitrust")

The observations will be grouped into cells based on variables in the dataset. Let’s define cells based on three variables: the DRG diagnosis code, the patient age, and the patient zip code of residence. We do this by specifying the variables and then saving them in a list variable called layers.

list1 <- c("drg","age","zip5")
layers <- list(list1)

A minimum cell size threshold also needs to be set. Observations will be assigned to a cell only if that cell includes at least the minimum threshold number of observations. Let’s set the minimum size to 15:

th <- 25

Then use cell_defn to allocate observations to cells, based on the variables in the layer. The one variable that needs to be included in the dataframe is a variable specifying the number of admissions represented by each observation. By default, this variable name is count, and if present the function cell_defn will use it as the admission count. Alternatively, another variable can be indicated as the count variable. If no variable is specified as the count and count is not present in the dataframe, cell_defn will assume each observation represents one admission and create a variable count always equal to one. We will go ahead and create the variable in our dataframe, and then run cell_defn.

discharge_data$count <- 1
outList <- cell_defn(discharge_data,th,layers)
#> Layer 1: 886 obs allocated
#> Number of Excluded Obs: 314

The cell_defn function prints a message indicating the number of observations assigned to cells, and also returns a list. The first item in the return list is a new dataset that contains the observations that have been allocated to a cell, as well as their assignment.

D0 <- outList$assigned
print(nrow(D0))
#> [1] 886
print(nrow(discharge_data))
#> [1] 1200

The new dataset has nrow(D0) observations. cell_defn stated that it excluded some observations, those that were not allocated to a cell that met the minimum size threshold. The number of unassigned observations added to the number of assigned observations in the new dataset equals the number of observations in the original dataset. A data frame of unassigned observations is the second item in the return list and can be accessed with outList$unassigned.

Because some observations may not be assigned to a cell, it is useful to specify several “layers” of variables in order to assign observations to cells. Using a more coarse set of variables will assign more observations, since the cells will be larger and more likely to exceed the minimum size threshold. But finer layers allow one to better account for heterogeneity in preferences across different types of people. That is why cell_defn allows for input of many layers, and they should be ordered by increasing coarseness. This will assign observations by more granular categories at first, and then for remaining unassigned individuals, they will be assigned into more coarse groupings.

Instead of using only one layer, we can modify our example to instead use three layers to define cells. Adding more coarse layers will allow more observations to be allocated to cells that meet the minimum size threshold. Let’s add layers that use three digit zip codes instead of the five digit zip code.

discharge_data$zip3 <- floor(discharge_data$zip5/100)
list1 <- c("drg","age","zip5")
list2 <- c("drg","age","zip3")
list3 <- c("zip3")
layers <- list(list1, list2, list3)

With this new definition of layers, we can allocate observations like before:

outList <- cell_defn(discharge_data,th,layers)
#> Layer 1: 886 obs allocated
#> Layer 2: 154 obs allocated
#> Layer 3: 160 obs allocated
#> Number of Excluded Obs: 0
D0 <- outList$assigned
print(nrow(D0))
#> [1] 1200
print(nrow(discharge_data))
#> [1] 1200

The observations are first allocated to cells as before, and the remaining observations are allocated to a cell based on three digit zip code only. With these definitions, every observation is allocated to a cell that meets the minimum size threshold that was set.

Note that in each layer, cell_defn only assigns observations that have not been assigned in previous layers. That is, the function assigns observations without replacement. An alternative method can be used, in which assignment at each layer is done by sampling with replacement from all observation that meet the cell criteria, even if they have been previously assigned in a finer layer. This option of assigning with replacement is currently not implemented.

We are now able to calculate diversion ratios based on the cell definitions.

Diversion ratio calculation

Once observations have been assigned to cells, we can now calculate diversion ratios with the function div_calc. Within each cell \(c\), the diversion from a hospital \(k\) to hospital \(h\) is given by \(d^c_{k,h} = \frac{s^c_h}{1 - s^c_{m(k)}}\) where \(s^c_h\) is the share of cell \(c\) admissions to hospital \(h\), and \(s^c_{m(k)}\) is the share of cell \(c\) admissions at hospital \(k\) or any other hospital in the same system as \(k\). The overall diversions are then calculated by aggregating over cells (See Raval, Rosenbaum, and Tenn (2017) for details.)

The required variables in the input dataframe are cell, which has been defined by the cell_defn function, as well as provider and system identifiers: provider_id, provider, sys_id, and system. The function also requires count, which is created by cell_defn if it does not exist.

Finally, we need to specify an indicator for the set of hospitals we are interested in diversions from. In the analysis of a merger, this will typically be the hospitals that are parties to the proposed merger. The variable to indicate such systems is focal_sys_id, which should be a numeric vector listing the sys_id’s of the systems of interest. Diversion ratios will be calculated for providers belonging to the focal systems. Since in this example we are interested in studying a proposed merger including Systems 1 and 5, we define the variable accordingly.

Now we can calculate diversions using the div_calc function.

focal_systems <- c(1,5)

out <- div_calc(D0, provider_id = "hosp_id", provider = "hospital",
                focal_sys_id = focal_systems)

The function div_calc returns a list with two components, the first is a matrix of provider-level diversions and the second is a matrix aggregated to system-level diversions. The provider-level matrix gives diversions from each of the merging providers to all other providers The diversions should sum to 1. We can verify that our diversions sum to one for Hospital 1:

divratio_hosp <- out$provider_level
sum(divratio_hosp$div_from_1, na.rm = TRUE)
#> [1] 1

We can also print the provider-level and system-level diversions.

print(out$provider_level)
#>   hosp_id   hospital sys_id N_h div_from_1 div_from_2 div_from_5
#> 1       1 Hospital 1      1 126         NA         NA      0.019
#> 2       2 Hospital 2      1  62         NA         NA      0.116
#> 5       5 Hospital 5      5 281      0.029      0.242         NA
#> 3       3 Hospital 3      3  68      0.014      0.073      0.350
#> 4       4 Hospital 4      3  67      0.340      0.059      0.013
#> 6       6 Hospital 6      6 198      0.033      0.164      0.185
#> 7       7 Hospital 7      7 107      0.008      0.161      0.046
#> 8       8 Hospital 8      8 291      0.576      0.301      0.273
print(out$sys_level)
#>   hosp_id   hospital sys_id N_h div_from_sys_1 div_from_sys_5
#> 1       1 Hospital 1      1 126             NA          0.019
#> 2       2 Hospital 2      1  62             NA          0.116
#> 5       5 Hospital 5      5 281          0.099             NA
#> 3       3 Hospital 3      3  68          0.033          0.350
#> 4       4 Hospital 4      3  67          0.247          0.013
#> 6       6 Hospital 6      6 198          0.076          0.185
#> 7       7 Hospital 7      7 107          0.058          0.046
#> 8       8 Hospital 8      8 291          0.485          0.273

Some notes

If the total diversion for a given hospital does not sum to 1, that means that at least one degenerate cell exists, meaning that there exists a cell where every individual assigned to that cell visits the same hospital. Such cells cause problems for the method used here, which predicts diversions for individuals in a cell based on the shares within that cell. If the share is 100% in a cell, the model cannot make any prediction about to where those individuals would substitute.

By default, div_calc will drop any degenerate cells so that diversions will still sum to 1. If a degenerate cell exists and the dropDegenerateCell option is set to FALSE, the diversion for a hospital may sum to less than one because degenerate cells will not be dropped. A notification flag will be printed whenever a degenerate cell exists for a given hospital.

Willingness-to-pay calculation

Once observations have been assigned to cells, one can also use the function wtp_calc to calculate the willingness-to-pay (WTP) of a hospital system (see Capps, Dranove, and Satterthwaite (2003) “Competition and Market Power in Option Demand Markets” https://www.jstor.org/stable/1593786). This can be used to estimate the change in WTP that would occur after a merger, using the discharge data as we have been using.

Within this semiparametric estimation method, the formulas for WTP are as follows. For individuals in a given cell \(c\), the choice probabiities for a given hospital \(j\) are estimated by \(s_j^c = \frac{N^c_j}{N^c}\), where \(N^c\) is the number of admissions belonging to cell \(c\), and \(N^c_j\) is the number of admissions from cell \(c\) that visit hospital \(j\). The WTP for the cell is then given by \(WTP_c(j) = \ln \frac{1}{1-s_j^c}\). The population WTP for hospital \(j\) is then simply the weighted sum across cells:

\(WTP(j) = \sum_c N^c \times WTP_c(j)\)

To calculate the change in WTP resulting from a merger, first note the general formula for calcuating WTP for a system \(M\) is:

\(WTP_c(M) = \ln \frac{1}{1- \sum_{j \in M} s_j^c}\)

Aggregating across cells gives the population WTP. The change in WTP from a merger can then be given as the WTP of the post-merger system minus the WTP of the sum of the systems pre-merger. If we denote by \(M\) the hypothetical post-merger system and \(M_1\) and \(M_2\) the systems before they merge, then as a percentage increase the change in willingness-to-pay is given by \(\Delta WTP = \frac{WTP(M) - (WTP(M_1) + WTP(M_2)) }{WTP(M_1) + WTP(M_2)}\)

To implement this calculation using the wtp_calc function, we calculate the WTP for both systems pre-merger as well as for a hypothetical combined system.

out <- wtp_calc(D0)
y_pre <- subset(out, sys_id %in% focal_systems)

D0_post <- D0
D0_post$sys_id[D0_post$sys_id == 5] <- 1

out <- wtp_calc(D0_post)
y_post <- subset(out, sys_id == 1)

y_pre$sys_id <- 1
y_pre <- aggregate(list(WTP_s = y_pre$WTP_s, WTP_s_wt = y_pre$WTP_s_wt, N_s=y_pre$N_s),by=list(y_party=y_pre$sys_id),sum)

print("% Change in WTP")
#> [1] "% Change in WTP"
print((y_post$WTP_s-y_pre$WTP_s)/(y_pre$WTP_s)*100)
#> [1] 6.857682

This standard approach makes the assumption that WTP per dollar is constant across diagnoses. One alternative approach is to weight each admission by the diagnoses code (such as DRG) for each admission. This can be done with wtp_calc by providing a variable weight. For example, it is sometimes desirable to weight the WTP calculation by DRG weights. If weight is provided, wtp_calc will provide both a weighted an unweighted calculation for WTP. Otherwise, wtp_calc will assume a weight of 1 and both the unweighted and weighted WTP estimates will be equivalent.