COMPAS - Task 1

Courts in Broward County, Florida, use a machine learning algorithm to predict whether individuals released on parole are at high risk of re-offending within two years (\(Y\)). The algorithm uses demographic information \(Z\) (\(Z_1\) for gender, \(Z_2\) for age), race \(X\) (\(x_0\) denoting White, \(x_1\) denoting Non-White), juvenile offense counts \(J\), prior offense count \(P\), and degree of charge \(D\).

In this vignette, we perform the task of bias detection on this dataset. We begin by loading and pre-processing the original data:

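# `root` is not defined in this excerpt; it is assumed to hold the path to the package source directory
data <- read.csv(file.path(root, "inst", "extdata",
                           "compas-scores-two-years.csv"))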
col.keep <- which(
  names(data) %in% c("age", "sex", "juv_fel_count",
                     "juv_misd_count", "juv_other_count", "priors_count",
                     "c_charge_degree", "race", "two_year_recid", "decile_score")
)
data <- data[, col.keep]
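data$race <- factor(data$race)
# factor() sorts levels alphabetically, so the third level is "Caucasian";
# all other levels are merged into "Minority"
levels(data$race) <- c("Minority", "Minority", "Majority", "Minority",
                       "Minority", "Minority")
# make "Majority" the reference (first) level
data$race <- relevel(data$race, "Majority")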
cumsum(table(data$decile_score)) / sum(table(data$decile_score))
        1         2         3         4         5         6         7         8 
0.1996119 0.3300527 0.4336013 0.5401996 0.6345994 0.7234544 0.8055171 0.8764902 
        9        10 
0.9469088 1.0000000 
# about 54% of scores are at most 4, so decile_score > 4 marks roughly the top 46% as high risk
data$northpointe <- as.integer(data$decile_score > 4)
data$decile_score <- NULL
names(data) <- gsub("_count", "", names(data))
names(data)[which(names(data) == "c_charge_degree")] <- "charge"

knitr::kable(head(data), caption = "COMPAS dataset.")
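COMPAS dataset.

| sex  | age | race     | juv_fel | juv_misd | juv_other | priors | charge | two_year_recid | northpointe |
|:-----|----:|:---------|--------:|---------:|----------:|-------:|:-------|---------------:|------------:|
| Male |  69 | Minority |       0 |        0 |         0 |      0 | F      |              0 |           0 |
| Male |  34 | Minority |       0 |        0 |         0 |      0 | F      |              1 |           0 |
| Male |  24 | Minority |       0 |        0 |         1 |      4 | F      |              1 |           0 |
| Male |  23 | Minority |       0 |        1 |         0 |      1 | F      |              0 |           1 |
| Male |  43 | Minority |       0 |        0 |         0 |      2 | F      |              0 |           0 |
| Male |  44 | Minority |       0 |        0 |         0 |      0 | M      |              0 |           0 |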
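
The race recoding above assigns new level names by position, which works only because `factor()` sorts levels alphabetically. The following is a minimal sketch, on a made-up vector rather than the COMPAS file, showing that a name-based recode gives the same grouping without relying on level order. The names `race_raw`, `by_position`, and `by_name` are illustrative and do not appear in the vignette.

```r
# Toy vector; the values are illustrative, not rows of the COMPAS file.
race_raw <- c("African-American", "Caucasian", "Hispanic", "Other",
              "Caucasian", "Asian")

# Positional recode, mirroring the vignette. The sorted levels are
# African-American, Asian, Caucasian, Hispanic, Other, so "Caucasian" is third.
by_position <- factor(race_raw)
levels(by_position) <- c("Minority", "Minority", "Majority",
                         "Minority", "Minority")
by_position <- relevel(by_position, "Majority")

# Name-based recode: does not depend on how the levels happen to be sorted.
by_name <- factor(ifelse(race_raw == "Caucasian", "Majority", "Minority"),
                  levels = c("Majority", "Minority"))

# Cross-tabulate the two recodes; all counts should sit on the diagonal.
print(table(by_position, by_name))
```

The name-based form is a little longer, but it keeps working if a level is added to or missing from the input file.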

In Causal Fairness Analysis, we are interested in decomposing the TV (total variation) measure, also known as the parity gap, into its direct, indirect, and spurious components. The TV measure is the difference in outcome probability between the two groups, \(P(y \mid x_1) - P(y \mid x_0)\). The figures below show the causal diagram associated with the data and how each of the target effects can be visualized:

Figure 1: COMPAS Causal Diagram

Figure 2: Direct effect visualization.

Figure 3: Indirect effect visualization.

Figure 4: Confounded effect visualization.
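
Before any decomposition, the TV measure itself is just a difference of two group means for a binary outcome. As a minimal sketch, the block below computes it on a made-up data frame that borrows the vignette's column names; `tv_measure` is a hypothetical helper written for this illustration and is not part of the package used in this vignette.

```r
# Toy data with the vignette's column names; the values are made up.
toy <- data.frame(
  race = factor(c("Majority", "Majority", "Majority", "Majority",
                  "Minority", "Minority", "Minority", "Minority"),
                levels = c("Majority", "Minority")),
  two_year_recid = c(0, 0, 1, 0, 1, 0, 1, 1)
)

# Hypothetical helper: TV = P(y | x1) - P(y | x0) for a binary outcome y.
tv_measure <- function(df, x, y, x0, x1) {
  mean(df[[y]][df[[x]] == x1]) - mean(df[[y]][df[[x]] == x0])
}

tv_toy <- tv_measure(toy, x = "race", y = "two_year_recid",
                     x0 = "Majority", x1 = "Minority")
print(tv_toy)
```

In the toy data the outcome rate is 0.75 in the `x1` group and 0.25 in the `x0` group, so the gap is 0.5. The decomposition discussed in this vignette splits such a gap into direct, indirect, and spurious parts.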

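After obtaining the data, we specify the Standard Fairness Model, with race as the protected attribute \(X\), age and sex as the confounders \(Z\), and the juvenile counts, prior count, and charge degree as the mediators \(W\). We then decompose the TV measure for the true outcome \(Y\):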

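library(faircause)  # fairness_cookbook()
library(ggplot2)    # autoplot(), ggtitle()
library(latex2exp)  # TeX()

X <- "race"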
Z <- c("age", "sex")
W <- c("juv_fel", "juv_misd", "juv_other", "priors", "charge")
Y <- c("two_year_recid")
two_year <- fairness_cookbook(data, X = X, W = W, Z = Z, Y = Y,
                              x0 = "Majority", x1 = "Minority")

autoplot(two_year, decompose = "xspec") + 
  ggtitle(TeX("$Y$ disparity decomposition COMPAS"))

Causal decomposition of the TV measure for two-year recidivism.
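
Because the model is specified purely through column names, a typo or a variable listed under two roles is easy to miss. The block below is a minimal sketch of a sanity check one might run before the decomposition; `check_sfm_roles` is a hypothetical helper, not a function of the package, and the data frame holds made-up values under the vignette's column names.

```r
# Toy data frame with the vignette's column names; the values are made up.
toy <- data.frame(
  race = c("Majority", "Minority"), age = c(30, 25),
  sex = c("Male", "Female"), juv_fel = 0, juv_misd = 0, juv_other = 0,
  priors = c(1, 2), charge = c("F", "M"), two_year_recid = c(0, 1)
)

roles <- list(
  X = "race",
  Z = c("age", "sex"),
  W = c("juv_fel", "juv_misd", "juv_other", "priors", "charge"),
  Y = "two_year_recid"
)

# Hypothetical helper: every variable must exist in the data and have one role.
check_sfm_roles <- function(df, roles) {
  vars <- unlist(roles, use.names = FALSE)
  missing_vars <- setdiff(vars, names(df))
  duplicated_vars <- unique(vars[duplicated(vars)])
  list(missing = missing_vars,
       duplicated = duplicated_vars,
       ok = length(missing_vars) == 0 && length(duplicated_vars) == 0)
}

res <- check_sfm_roles(toy, roles)
print(res$ok)

# A faulty specification: a misspelled mediator and race listed under two roles.
bad_roles <- roles
bad_roles$W <- c("juv_fel", "prior", "race")
res_bad <- check_sfm_roles(toy, bad_roles)
print(res_bad)
```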

We are also interested in the disparity in the predictor \(\widehat{Y}\), here the binarized Northpointe score, so we decompose its TV measure in the same way:

Yhat <- "northpointe"
north_decompose <- fairness_cookbook(data, X = X, W = W, Z = Z, Y = Yhat,
                                     x0 = "Majority", x1 = "Minority")
autoplot(north_decompose, decompose = "xspec") +
  ggtitle(TeX("$\\widehat{Y}$ disparity decomposition COMPAS"))

Causal decomposition of the TV measure for Northpointe’s predictions.

To perform the complete analysis, we plot the two results side-by-side, and shade the areas depending on whether the associated measure is included in the business necessity set or not:

Causal decompositions of the TV measure for the true and predicted outcomes visualized together.