Historical data analysis that is naïve to past discrimination is doomed to parrot bias. How do we combat bias in our data analytics?
First, we already know that complex, multi-talented teams are best-suited to face complex, multi-faceted problems. A more diverse group of researchers and engineers is more likely to recognize and think harder about bias problems that may impact them personally. “It’s difficult to build technology that serves the world without first building a team that reflects the diversity of the world,” wrote Microsoft President Brad Smith in Tools and Weapons (2019).
Second, data collection is often provided to us through real-world interactions, subject to real-world bias. No real-world data exists in a vacuum. We should understand that bias in data collection and analysis may be inevitable but is not acceptable. It is not your responsibility to end bias (though that’s a worthy cause), but rather to be proactively informed and transparent.
Look for systemic outcomes, not intentions. Only in outcomes are potential disparate impacts measured. Let’s review a couple of examples.
Many Americans are familiar with the ACT test, an exam many students take as part of the college application process. In 2016, ACT admitted an achievement gap in composite scores based on family income. According to ACT’s own data, there is a 3-4 point difference in scores between poorer and wealthier households, and the gap continues to widen.
Credit to ACT for disclosing their research. Transparency is part of accounting for bias in historical data and data collection and is critically important to furthering a larger conversation about inequality of opportunity.
Knowing the variables involved in your analysis is important. When conducting analysis, researchers must review, identify, and anticipate variables. You will never find the variables unless you are looking for them.
Centuries of racially-divided housing policies in the United States evolved into legalized housing discrimination known as redlining, ensconced in federal-backed mortgage lending starting in the 1930s. The long history of legal racial housing discrimination in the United States was arguably not directly addressed until the 1977 Community Reinvestment Act. Yet today, 75% of neighborhoods “redlined” on government maps 80 years ago continue to struggle economically, and minority communities continue to be denied housing loans at rates far higher than their white counterparts. If discriminatory outcomes in housing are to change for the better, the algorithmic orchestration of mortgage lending should not be excused from scrutiny.
In both cases, we examined industries where algorithms draw from data revealing larger societal outcomes. These outcomes are the result of trends of economic inequality and a pervasive opportunity gap. Are such data systems to be trusted as the result of an algorithm, and thereby inherently ethical? No, not without scrutiny.
These provide examples of when data professionals must be aware of the historical and societal contexts from which our data is drawn, and how the outcomes of our data findings could be leveraged. Would our outcomes contribute to justice? For example, the industries of financial marketing, healthcare, criminal justice, education, or public contracting have histories checkered with injustice. We should learn that history.
Transparency is desirable. It is needed to aid an informed societal conversation. We should not that assume an algorithm can overcome historical biases or other latent discrimination in its source data. Informed scrutiny should gaze upon historical data with a brave and honest eye.