Data diversity is an important step toward improving health equity as data science applications continue to drive health care decisions.
Big data is used throughout the health care industry. Health systems, health insurance providers, pharmaceutical companies and government agencies use data to predict risk, advance drug discovery, guide treatment protocols, manage costs and allocate resources.
Data science applications like artificial intelligence and machine learning can improve health care delivery and patient outcomes and make care more affordable. But if those algorithms use data that do not accurately represent an entire population, they can negatively impact underrepresented groups.
“… algorithms trained with gender-imbalanced data do worse at reading chest x-rays for an underrepresented gender, and researchers are already concerned that skin-cancer detection algorithms, many of which are trained primarily on light-skinned individuals, do worse at detecting skin cancer affecting darker skin.”
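One way to see this kind of gap is to report a model's accuracy separately for each group instead of as a single overall number. The following is a minimal sketch, not a method described by the researchers quoted here: the function name `accuracy_by_group`, the group labels and the toy predictions are all hypothetical and invented for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for (group, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy predictions (not real patient data): group_a is
# well represented, group_b has only two records.
toy_records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0),
]

per_group = accuracy_by_group(toy_records)
overall = sum(y_true == y_pred for _, y_true, y_pred in toy_records) / len(toy_records)

print("overall accuracy:", round(overall, 2))
for group, acc in per_group.items():
    print(group, "accuracy:", acc)
```

In this toy example the overall accuracy looks reasonable while the smaller group fares much worse, which is the pattern a single aggregate score can hide.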
Using diverse data sets in health care is important to health equity. The Healthcare Information and Management Systems Society says that the data sets at the heart of health care must be based on a “. . . diverse range of race, gender or geography. . .” to be effective. But it is also important to increase the diversity of the people who are collecting that data and framing the problems it is intended to solve.
“The technologies we create using data tend to have the same biases of humans, because they are created by people,” says Fortune Mhlanga, dean of the School of Applied Computational Sciences.
From collecting and preparing data to implementing and testing a machine learning or artificial intelligence algorithm, data scientists are involved throughout the process. It is important to avoid bias at each step along the way.
While that human element is often the cause of unintended biases, it also offers a solution.
“As far as human involvement goes, increasing the representation of Black data scientists can go a long way in avoiding bias,” says Mhlanga. “Only 3 percent of data scientists are Black. It is critical to educate more Black data scientists.”
Creating a diverse data ecosystem
The School of Applied Computational Sciences is using data from the Meharry Health System to develop a diverse data ecosystem that will live on Meharry Medical College’s campus. The data includes a significant number of records from underrepresented groups and is essential to creating the School’s self-reliant, modern ecosystem that will support research applications.
“The goal is to make Meharry self-dependent and capable of integrating data from multiple and disparate sources such as electronic health records (EHR), in-silico biology, genomic technologies, medical devices, biosensors, social and environmental exposure, and financial data into Meharry’s own modern data science ecosystem,” says Ashutosh Singhal, medical research, development and strategy director.
This data set, powered by Meharry’s high-performance computing facility, will enable Meharry researchers to analyze health care big data through machine learning and artificial intelligence algorithms.
“Harnessing this data has tremendous potential at the macro, population-level health, as well as the micro, evidence-based precision medicine, level,” says Singhal.
An important objective in developing the diverse data ecosystem is to adapt it to a common data model and standards to improve the integration of clinical and observational data into biomedical science.
“Moving to the common data model enables us to participate in the national Clinical Research Network and help expedite the translation of research resources into knowledge, products and procedures to improve human health,” says Singhal.
“That also means our rich dataset featuring underrepresented people can enhance other datasets that might otherwise be biased,” says Mhlanga.
Health disparity and COVID-19
In the SACS Population Health Informatics and Disparities Research Lab, Dr. Aize Cao, associate professor of biomedical data science, works to address health disparities for underserved populations.
“We leverage Electronic Health Records to advance health care informatics, predictive modeling and machine learning tools to help with patient health improvement,” says Dr. Cao.
Unfortunately, the coronavirus pandemic has provided new opportunities to explore health disparities. Dr. Cao is using an electronic health records (EHR) dataset of more than 125,000 COVID-19-positive patients, provided by the HCA Healthcare CHARGE Consortium, for a COVID-19 health disparity study. The dataset includes patients from its more than 180 affiliated facilities nationwide, across 20 states, from March 2020 to February 2021.
“We are developing risk adjustment prediction models leveraging EHR, health informatics, statistical modeling and machine learning techniques to improve understanding of health disparity among COVID-19 patients,” says Dr. Cao.
Through participation in the HCA CHARGE Consortium, the lab also aims to develop risk adjustment models to predict the long-term health outcomes of COVID-19 patients. The developed models will be validated on a dataset of the underserved population at Meharry to study patient health disparities.