Six data science graduate school projects with real data

Data science graduate school projects using real-world datasets give students practical experience that prepares them for a data scientist career.
Students in a data science graduate school program seek the expertise necessary for a successful data science career. That includes learning programming languages such as Python and R as well as the latest methods for data and predictive analytics. Applying those skills to real-world data shows students how their work can make an impact.
At Meharry SACS, students can gain hands-on, practical experience working with health care, financial and other data. Applying classroom concepts to these datasets is important to strengthening their data science expertise and also helps them build a portfolio of projects for future career opportunities.
These six projects with real-world datasets are part of the data science graduate school academic experience at Meharry SACS.
1) Analysis and machine learning data prep of heart disease data
Students in Dr. Vibhuti Gupta’s Computer Programming Foundations for Data Science class explored a heart disease dataset. The data included features such as chest pain type, resting blood pressure, blood sugar levels and other attributes. The class analyzed the data to identify the features responsible for cardiovascular risk and transformed the data for building machine learning models that predict that risk.
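As a rough illustration of what transforming data for a machine learning model can involve, here is a minimal Python sketch. The article does not say which transformations the class used, so the two steps shown here (one-hot encoding a categorical feature and standardizing a numeric one) are assumptions, and the column names and values are hypothetical rather than taken from the class dataset.

```python
from statistics import mean, pstdev

# Hypothetical records: column names and values are illustrative only,
# not rows from the dataset used in the class.
records = [
    {"chest_pain_type": "typical", "resting_bp": 120.0},
    {"chest_pain_type": "atypical", "resting_bp": 140.0},
    {"chest_pain_type": "typical", "resting_bp": 160.0},
]


def prepare(rows):
    """Return (categories, feature_rows) ready for a model.

    Each feature row holds a one-hot encoding of chest_pain_type
    followed by the standardized resting_bp value.
    """
    categories = sorted({r["chest_pain_type"] for r in rows})
    bp_values = [r["resting_bp"] for r in rows]
    mu, sigma = mean(bp_values), pstdev(bp_values)

    prepared = []
    for r in rows:
        # One-hot: 1.0 in the slot matching this row's category, 0.0 elsewhere.
        one_hot = [1.0 if r["chest_pain_type"] == c else 0.0 for c in categories]
        # Standardize: subtract the mean, divide by the standard deviation.
        scaled = (r["resting_bp"] - mu) / sigma if sigma else 0.0
        prepared.append(one_hot + [scaled])
    return categories, prepared


categories, features = prepare(records)
print(categories)
print(features)
```

Many models expect numeric inputs on comparable scales, which is why preparation steps like these usually come before any training.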
2) Using a convolutional neural network to identify COVID-19
In Dr. Bishnu Sarker’s Computational Machine Learning class, student Shara Taylor used deep learning to predict COVID-19 from X-ray data. She used X-ray images from the COVID-19 Radiography Database to build a convolutional neural network. She then trained the model to distinguish between COVID-19, viral pneumonia, lung infections, and normal lungs. Her model classifies new images with approximately 85 percent accuracy.
3) Data analysis of electronic health record (EHR) data
Dr. Aize Cao designs projects in Statistical Methods for Biomedical Data Science, Statistical Inference and Modeling and Population Health Informatics. The classes used data from published studies and de-identified EHR data. From data management through building analytical datasets, the students in her classes gained hands-on experience moving from health care informatics to knowledge discovery. They applied statistical theories, built models and learned statistical inference. The course projects prepared students for the complete experience of conducting analysis from raw data to statistical inference.
4) Data analysis to understand health insurance data and costs
Dr. Gupta’s programming course also challenged students to analyze a US health insurance dataset. The data included demographics, health history, lifestyle choices and other information that affects medical costs. The students performed exploratory analysis, data cleaning and data visualization. Their work helped them understand how those factors contribute to the medical costs billed by health insurance companies.
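To give a sense of what this kind of exploratory analysis looks like in practice, the following Python sketch compares average billed costs across groups after dropping rows with missing values. It is only an illustration: the field names (`smoker`, `charges`) and the amounts are hypothetical, and the article does not describe the specific steps the students took.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: field names and amounts are illustrative only,
# not records from the dataset used in the course.
rows = [
    {"smoker": "yes", "charges": 30000.0},
    {"smoker": "no", "charges": 8000.0},
    {"smoker": "yes", "charges": 34000.0},
    {"smoker": "no", "charges": None},  # a missing value to clean out
    {"smoker": "no", "charges": 12000.0},
]


def mean_by_group(rows, group_key, value_key):
    """Average value_key within each group, skipping rows where it is missing."""
    groups = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:  # data cleaning: drop missing values
            groups[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}


averages = mean_by_group(rows, "smoker", "charges")
print(averages)
```

A grouped summary like this is often a first step before plotting, since it shows which factors are worth visualizing in more detail.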
5) Food nutrition management using USDA FoodData Central
The US Department of Agriculture food and nutrition database, FoodData Central, is an integrated data system that provides expanded nutrient profile information. Students in the Data Management Foundation for Data Science class, taught by Dr. Sarker, Dr. Qingguo Wang and Dr. Eugene Levin, applied course concepts to assess the nutritional value of a meal. The class pre-processed big data organized in numerous structured CSV files to handle missing values. They then produced statistical summaries to discover meaningful information. In the end, they were able to quickly calculate the total protein, fiber, vitamins, energy (kcal), and more for an arbitrary combination of foods.
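The final calculation described above can be sketched in a few lines of Python. This is a simplified illustration, not the class's implementation: the food names and nutrient values are placeholders rather than figures from FoodData Central, the per-100-gram layout is an assumption, and treating a missing value as zero is just one possible way to handle gaps.

```python
# Hypothetical per-100 g nutrient table. Names and values are placeholders,
# not figures taken from FoodData Central.
NUTRIENTS_PER_100G = {
    "food_a": {"energy_kcal": 200.0, "protein_g": 10.0, "fiber_g": 2.0},
    "food_b": {"energy_kcal": 50.0, "protein_g": 1.0, "fiber_g": None},  # missing
}


def meal_totals(portions, table=NUTRIENTS_PER_100G):
    """Sum nutrients for a meal given as {food name: grams eaten}.

    Missing nutrient values are counted as 0.0 (an assumption for this sketch).
    """
    totals = {}
    for food, grams in portions.items():
        for nutrient, per_100g in table[food].items():
            value = 0.0 if per_100g is None else per_100g
            # Scale the per-100 g value to the portion size and accumulate.
            totals[nutrient] = totals.get(nutrient, 0.0) + value * grams / 100.0
    return totals


meal = {"food_a": 150, "food_b": 200}
totals = meal_totals(meal)
print(totals)
```

Because the meal is just a mapping from food to portion size, the same function works for any combination of foods that appears in the table.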
6) Data analysis to explore conditions that contribute to diabetes
Students in Dr. Gupta’s course also applied their programming and analysis skills to a dataset of diabetic conditions. They applied exploratory analysis, data cleaning and data visualization to understand the factors that contribute to diabetes.