Data Science Ph.D. COURSES

Data Science Ph.D. students will enroll in three concurrent courses during the Fall, Spring, and Summer Semesters.

The Pathway to a Data Science Ph.D. degree provides an outline of all degree requirements and a curriculum map, organized by semester, for completing them.

Foundation Courses (15 hours)

3 credit hours.

Introduction to the foundations of computer programming for data science, using Python, R, and SAS as problem-solving tools. 1) Introduction to Python. Python syntax to write basic computer programs; Using the interpreter; Built-in and user-defined functions; Introduction to object-oriented programming in Python. 2) Introduction to R. Simple graphing; R Basics: variables, strings, vectors; Data Structures: arrays, matrices, lists, data frames; Programming Fundamentals: conditions and loops, functions, objects and classes, debugging. 3) Introduction to SAS Programming. The SAS Operating Environment; Understanding Data and the quality characteristics it exhibits; SAS Programming Essentials: SAS Program Structure, SAS Program Syntax; Getting Data In and Out of SAS; Printing and Displaying Data; Introduction to SAS Graphics.

3 credit hours.

The concepts and structures used to store, analyze, manage, and present (visualize) information and navigation using Python, SQL, SAS, and QGIS. Topics will include information analysis and organizational methods, and metadata concepts and applications. Students will be guided in identifying the disparate data sources needed to analyze a given real-world problem; typically, data from a single source will not be adequate for the required analysis. Students will pull data from these sources, import it into SAS, and use several SAS procedures to detect invalid data; format, validate, and clean the data; and impute missing values. This will prepare the data for statistical analysis and decision modeling in SAS.

3 credit hours.

This course covers other useful mainstream programming languages for data science, beyond Python, R, SQL, and SAS. These “other” programming languages supplement the ability to crunch numbers and equip the data scientist with good all-round programming skills. Programming languages covered will vary depending on industry popularity. While some of the programming languages may not be covered in detail, examples include: Java, Scala, Julia, TensorFlow, Go, Spark.

3 credit hours.

Deep dive into recent advances in AI, focusing on deep learning approaches. Foundations of neural networks. Cutting-edge deep learning models for image, text, multimodal, and time-series data. Advanced topics on open challenges of integrating AI into societal applications, including interpretability, robustness, privacy, and fairness.

3 credit hours.

This course covers the fundamental mathematical background for statistical theory. Probability spaces as models for phenomena with statistical regularity. Discrete spaces (binomial, hypergeometric, Poisson). Continuous spaces (normal, exponential) and densities. Random variables, expectation, independence, conditional probability. The course will cover probability, multivariate distributions and special distributions, statistical inference, maximum likelihood methods, sufficiency, tests of hypotheses, inference about normal models, nonparametric statistics, and Bayesian statistics.

Core Courses (36 hours)

3 credit hours.

Introduction to machine learning with business applications. Survey of machine learning techniques, including traditional statistical methods, resampling techniques, model selection and regularization, tree-based methods, principal components analysis, cluster analysis, artificial neural networks, and deep learning. Students implement machine learning models with open-source software for data science. They explore data and learn from data, finding underlying patterns useful for data reduction, feature analysis, prediction, and classification.

3 credit hours.

An overview of modern data science: the practice of obtaining, storing, modeling, manipulating, analyzing, and interpreting data. Emerging big data processing frameworks. NoSQL storage solutions. Memory-resident databases and graph databases. Ability to initiate and design highly scalable systems that can accept, store, and analyze large volumes of unstructured data in batch mode and/or real time. Organization, administration, and governance of large volumes of both structured and unstructured data.

3 credit hours.

Tools and techniques for building statistical or machine learning models to make predictions based on data. NLP and Text Analytics, Time Series, Experimentation and Optimization.

3 credit hours.

Data visualization tools and technologies essential to analyze massive amounts of disparate information and make data-driven decisions. Information and geographic visualization of health data. Hands-on experience in planning, creating, and using compelling multimedia visualizations such as online maps, responsive graphs, interactive animations, and GIS dashboards. Use of different visualizations to support various research activities, including hypothesis formulation, data synthesis, analysis, and exploration, as well as to communicate and share health information. Application of usability and user experience (UX) principles to evaluate the extent to which various visualizations meet expectations.

3 credit hours.

The research process investigating information needs, creation, organization, flow, retrieval, and use. Stages include: research definition, question, objectives, data collection and management, data analysis and data interpretation. Techniques include: observation, interviews, questionnaires, and transaction-log analysis. 

3 credit hours.

Introduction to database concepts and the relational database model. Topics include ER Model, Relational Model, Relational Algebra, SQL, normalization, Indexing, Normal Forms, design methodology, DBMS functions, Security, Transaction Management, database administration, and other database management approaches such as client/server databases, object-oriented databases, and data warehouses. Strong emphasis on database system design and application development.

3 credit hours.

Principles, practices, and techniques for effective data modeling in the age of big data.

3 credit hours.

Utilize current statistical techniques to assess and analyze biomedical and public health related data. Read and critique the use of such techniques in published research. Review of linear models, matrix algebra, and multiple analysis of variance. Introduction to random effects models, understanding and computing power for the GLM, GLM assumption diagnostics, transformations, polynomial regression, coding schemes for regression, multicollinearity. Determine what analytical approaches are appropriate under different research scenarios.

3 credit hours.

Study of Monte Carlo methods, a diverse class of algorithms that rely on repeated random sampling to compute the solution to problems whose solution space is too large to explore systematically or whose systemic behavior is too complex to model. Introduction to important principles of Monte Carlo techniques and their power. Bayesian analysis and Markov chain Monte Carlo samplers, slice sampling, multigrid Monte Carlo, Hamiltonian Monte Carlo, parallel tempering and multi-nested methods, and streaming methods such as particle filters/sequential Monte Carlo. Related topics in stochastic optimization and inference such as genetic algorithms, simulated annealing, probabilistic Gaussian models, and Gaussian processes. Applications to Bayesian inference and machine learning. Python or R for all programming assignments and projects.

3 credit hours.

Deep learning is a sub-field of machine learning that focuses on learning complex, hierarchical feature representations from raw data. The dominant method for achieving this, artificial neural networks, has revolutionized the processing of data (e.g., images, videos, text, and audio) as well as decision-making tasks (e.g., game-playing). Its success has enabled a tremendous number of practical commercial applications and has had a significant impact on society. In this course, students will learn the fundamental principles, underlying mathematics, and implementation details of deep learning. This includes the concepts and methods used to optimize these highly parameterized models (gradient descent and backpropagation, and more generally computation graphs), the modules that make them up (linear, convolution, and pooling layers, activation functions, etc.), and common neural network architectures (convolutional neural networks, recurrent neural networks, etc.). Applications ranging from computer vision to natural language processing and decision-making (reinforcement learning) will be demonstrated. Through in-depth programming assignments, students will learn how to implement these fundamental building blocks as well as how to put them together using a popular deep learning library, PyTorch.

3 credit hours.

Examination of case studies. Introduction to healthcare law and ethics, making ethical decisions, contracts, medical records and informed consent, privacy law and HIPAA.

3 credit hours.

Security issues related to the safeguarding of sensitive personal and corporate information against inadvertent disclosure; Policy and societal questions concerning the value of security and privacy regulations, the real world effects of data breaches on individuals and businesses, and the balancing of interests among individuals, government, and enterprises; Current and proposed laws and regulations that govern information security and privacy; Private sector regulatory efforts and self-help measures; Emerging technologies that may affect security and privacy concerns; and Issues related to the development of enterprise data security programs, policies, and procedures that take into account the requirements of all relevant constituencies; e.g., technical, business, and legal.

Candidacy Exam (1 hour)

1 credit hour.

Preparation for the Candidacy Exam intended to demonstrate advanced knowledge of content and materials of the six required classes.

Special Topics and Electives (6 hours)

3 credit hours.

A comprehensive review of text analytics and natural language processing with a focus on recent developments in computational linguistics and machine learning. Students work with unstructured and semi-structured text from online sources, document collections, and databases. Using methods of artificial intelligence and machine learning, students learn how to parse text into numeric vectors and to convert higher dimensional vectors into lower dimensional vectors for subsequent analysis and modeling. Applications include speech recognition, semantic processing, text classification, relevant search, recommendation systems, sentiment analysis, and topic modeling. This is a project-based course with extensive programming assignments.

3 credit hours.

Networks are discrete mathematical objects that describe systems of entities with pairwise relationships. Over the past several decades, technological advances in data collection and extraction have fueled an explosion of data in the form of networks from seemingly all corners of science. This course aims to provide the mathematical foundations of networks with a particular emphasis on their applications in modern data science, using tools from algorithmic graph theory and linear algebra. The topics include basic graph theory, network statistics, search algorithms, community detection, duality theorems, and applications. The course will utilize Python (e.g., NetworkX and Jupyter Notebook) to implement and test the techniques of graph theory and network science on synthetic and real data. Students are strongly encouraged to have some familiarity with Python prior to taking this course.

3 credit hours.

This course introduces fundamentals of signal processing along with its applications in wearable sensor devices. The course includes topics on signal acquisition, techniques on processing the signals captured, including time domain approaches for event detection, time-varying signal processing for understanding the dynamical aspects of complex systems, and finally the application of machine learning algorithms to build predictive models for early insights.

3 credit hours.

What is artificial intelligence (AI)? What does it mean for cybersecurity? And how can AI be integrated to achieve the goals of cybersecurity? This course is designed to answer these questions. A mix of key AI technologies will be introduced to support understanding of the decision-making process where cybersecurity is concerned, and the course will address the role of those technologies in cybersecurity. AI will definitely complement and strengthen cybersecurity practices and improve their application in enhancing our security.

3 credit hours.

This course presents fundamental concepts and techniques in digital image processing and understanding. Both theoretical material and computing techniques are introduced. The analytical tools and methods which are currently used in digital image processing are introduced and applied to practical scenarios. Basic digital computing knowledge and programming skills are reinforced by solving real world problems. Computational studies may be performed in R or Python.

Research Seminar (5 hours total; at least one (1) hour in each)

Variable hours per semester may be offered (1–3 hours).

This course provides students an opportunity to delve into a special study of interest related to data science selected by the student under the guidance of a faculty member. The student and faculty member meet weekly to discuss the studies; the student will be required to write a comprehensive review paper on the semester’s studies.

Variable hours per semester may be offered (1–3 hours).

This course provides doctoral students with advanced research skills and strategies for conducting a literature review leading to a dissertation. Through this course, students will produce an extensive and integrative literature review related to their dissertation topic. Students will search, retrieve, summarize, and synthesize relevant studies to produce a comprehensive literature review.

Variable hours per semester may be offered (1–3 hours).

This course provides the student with the opportunity to concisely describe a data science research problem and methodology. It covers preparation and defense of the dissertation proposal, which clearly articulates the problem to be investigated in the field of data science, the literature review, and what would need to be done to complete the dissertation. The student must successfully defend the proposal before a Dissertation Proposal Committee, which will determine whether the student proceeds to complete the dissertation.

Dissertation and Defense (12 hours)

12 credit hours.

Variable hours may be offered.

Completion of the Ph.D. dissertation is the culmination of the doctoral degree in this graduate program. The research topic of the dissertation must be related to the Ph.D. in Data Science program.