Using Causal Forests to Unveil Gender Bias in Hiring

30 Jun 2024


(1) Anuar Assamidanov, Department of Economics, Claremont Graduate University, 150 E 10th St, Claremont, CA 91711. (Email:






Discussion and Conclusion, and References

A Appendix Tables and Figures

3 Methodology

The identification strategy used in this study is Difference in Differences with Dummy Outcomes (Price and Wolfers, 2010), which focuses on coach and artist gender interaction. Specifically, we estimate the following linear probability model:

The regression model in the difference-in-differences framework does not capture potential heterogeneous effects across covariates. There could be a positive effect on some groups in the covariates but a negative or no effect on other groups. For example, the coaches in the initial stage of the blind audition could elicit own-gender bias, but in the later stage, opposite-gender bias. The standard regression model will provide a weighted average of effect size, even if an effect varies across a population represented by different groups. One plausible solution is to estimate heterogeneous treatment effects. Instead of just estimating one effect, I estimate a distribution of effect size for a given unit with a given set of attributes and what their impact might be. To analyze this heterogeneous effect of own-gender bias, I incorporated the first difference approach into the causal forest approach. I will discuss the performance of the proposed causal forest approach and how to apply it to my empirical example.

The causal forest approach is consistent and asymptotically Gaussian and provides an estimator for their asymptotic variance that allows us to construct valid confidence intervals (Athey and Wager, 2019). Causal forest utilizes a forest-based approach to calculate a similarity weight and then applies the local generalized method of moments to estimate based on a weighted set of neighbors. The authors also show that using the centered instead of the original outcome and treatment allows a forest to focus its splits on the features that affect the treatment effect instead of the primary outcome. Intuitively, a causal forest consists of regression trees (Athey and Imbens, 2016; Breiman et al., 1984), which estimate conditional average treatment effects (CATE) at the leaves of the trees instead of predicting the outcome variables.

I utilize a two-step approach by incorporating the first-difference approach into the causal forest approach, the similar two-step approach in the standard differencein-difference method. In the first step, I estimate the first difference, which is the change in the selection of individual contestants from a male coach to a female coach. This step teases out the systematic differences between coaches in female and male coach groups. In the second step, I use the outcome variables obtained from the first step and apply the causal forest approach for heterogeneous treatment-effect analysis. This step teases out the effect of contestant gender trends that exist in both women and men.

My methods rely heavily on the causal forest approach, a machine learning method that can estimate heterogeneous causal effect functions under the above assumptions. The main strength associated with the causal forest is that it provides a strategy for learning patterns of heterogeneity from the data. This method requires little researcher discretion compared to traditional subgroup analyses. For instance, one does not have to decide which cutoffs to use to analyze heterogeneity in effects along continuous covariates, nor do we require to define the functional condition of the heterogeneity; the causal forest algorithm is designed to automate this search and find the most appropriate model that best characterizes the heterogeneous effect in the data (Bonander and Svensson, 2021).

This paper is available on arxiv under CC 4.0 license.