What is Data Science?


Many areas of modern scientific research and development rely on increasingly large and complex data sets. Discovery and application in science thus rely on the ability to manage these large data sets and extract meaning from them. In other words, science now relies heavily on data science, which has been variously defined as “…an umbrella term to describe the entire complex and multistep processes used to extract value from data” (Wing, 2019), and the ability to “bring structure to large quantities of formless data and make analysis possible” (Davenport & Patil, 2012, p. 73).

In modern kinesiological research, data science is an increasingly necessary and valued skill. Data from techniques like EMG, motion capture, and MRI are complex and multidimensional. Being able to understand, manipulate, and visualize the structure of these complex data sets is essential to performing the research. On top of this, it is increasingly clear that very large data sets, often built collaboratively by many labs, are frequently required to make reliable inferences about scientific processes.

Is data science just a trendy name for statistics?

While data science and statistics are overlapping fields, statistics is generally focused on the specific task of testing hypotheses based on data. Data science more broadly includes the storage, manipulation, visualization, filtering, and preparation of data that is typically required prior to statistical analysis. In fact, we will, by necessity, devote more time in this course to the steps that precede data analysis than to actual algorithms (although there may be some of that as well). Data science also encompasses statistics and machine learning: whereas statistics generally involves deriving conclusions from existing data, machine learning involves making predictions from a data set that will generalize to other data. Since statistics is covered in other courses in the kinesiology curriculum, this course focuses instead on the other “front-end” aspects of data science described above. Other areas of data science, including software development and “back-end” data science (engineering, hardware, databases), will not be covered.
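To make the "front-end" steps described above more concrete, here is a minimal sketch of filtering and preparing a signal before any statistics are run. The sample values, the artifact threshold, and the `prepare` function are all invented for illustration; they are not real EMG data or a recommended processing pipeline.

```python
from statistics import mean

# Hypothetical raw samples (arbitrary units), made up for this example.
# None marks a dropped sample; 5.0 stands in for a movement artifact.
raw_emg = [0.2, -0.4, None, 0.6, -0.8, 5.0, 0.4, -0.2]


def prepare(samples, artifact_threshold=3.0, window=3):
    """Clean, rectify, and smooth a list of samples."""
    # Filtering: drop missing values and implausibly large values.
    cleaned = [s for s in samples
               if s is not None and abs(s) <= artifact_threshold]
    # Manipulation: full-wave rectification (take the absolute value).
    rectified = [abs(s) for s in cleaned]
    # Preparation: moving-average smoothing over `window` samples.
    smoothed = [mean(rectified[i:i + window])
                for i in range(len(rectified) - window + 1)]
    return smoothed


envelope = prepare(raw_emg)
print(envelope)
```

Only after steps like these would the smoothed values typically be passed on to a statistical test or a plot. In practice, libraries such as NumPy and pandas are commonly used for this kind of work rather than plain lists.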

This highlights a mindset that differs quite dramatically in data science, as compared to the basic statistics taught in undergraduate curricula. Data science includes practices that are more exploratory. In experimentally oriented disciplines such as exercise physiology or biomechanics, statistics are a natural approach to deriving meaning from data. This is because data typically come from experiments, in which the researcher(s) systematically and intentionally manipulated certain variables. A good experiment is hypothesis-driven, meaning that the researcher has predictions in advance as to how the data will systematically vary with the experimental manipulations. These predictions are usually based on past experimental findings, or models of the process being studied.

Statistics are fundamentally embedded in data science (indeed, the concept of “data science” as a discipline emerged from the field of statistics), but data science can be thought of as a larger set of practices that includes statistics, machine learning, data cleaning and transformations, and visualization. Many of these approaches are more exploratory than hypothesis-driven. That is, rather than looking for a specific, predicted pattern, the data scientist explores the data to find whatever systematic patterns may emerge. For example, researchers using techniques like motion capture have attempted to use machine learning algorithms to detect subtle alterations in walking mechanics, with the aim of one day being able to “classify” clinically asymptomatic people as being at higher risk of developing a neurological disease, such as Parkinson’s.
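The idea of "classifying" people from gait measurements can be illustrated with a very small sketch. Everything here is an assumption made for illustration: the stride and cadence numbers are invented, the labels are hypothetical, and the nearest-centroid rule is a deliberately simple stand-in, not the method used in the research mentioned above.

```python
# Invented (stride_length_m, cadence_steps_per_min) pairs. NOT real data.
train = {
    "typical": [(1.40, 112.0), (1.35, 110.0), (1.45, 115.0)],
    "altered": [(1.10, 98.0), (1.05, 101.0), (1.15, 96.0)],
}
# Held-out examples the classifier never sees while "learning".
held_out = [((1.38, 113.0), "typical"), ((1.08, 99.0), "altered")]


def centroid(points):
    """Average each feature across a group of points."""
    return tuple(sum(axis) / len(axis) for axis in zip(*points))


def classify(point, centroids):
    """Return the label whose centroid is closest to `point`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(point, centroids[label]))


# "Training": summarize each labelled group by its centroid.
centroids = {label: centroid(pts) for label, pts in train.items()}

# "Testing": predict labels for data that was not used in training.
predictions = [classify(p, centroids) for p, _ in held_out]
accuracy = sum(pred == truth
               for pred, (_, truth) in zip(predictions, held_out)) / len(held_out)
print(predictions, accuracy)
```

The key point is the split between training data and held-out data: the predictions are judged on examples that played no part in computing the centroids, which is what "generalizing to other data" means. A real analysis would also put the features on a common scale first, since cadence has much larger numeric values than stride length and dominates the distance here.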

Tools for Data Science

Central to data science is the ability to use scientific programming languages, such as Python, MATLAB, and R. This ability includes a strong understanding of the fundamentals of at least one programming language, and the ability to extend one’s knowledge through continuous learning and problem-solving. This course teaches Python, a mature and widely used language in modern scientific research and data science more broadly. However, many of the fundamentals of scientific programming and data science are shared across languages. Thus, having learned Python, you will be well-prepared to learn new languages in the future, as necessary.

Another important facet of data science is that it is a team endeavour. On the one hand, it is founded on open, shared software developed by widely distributed teams of contributors. On the other hand, the practice of data science typically involves teams of individuals with complementary skillsets, due to both the size and the complexity of many projects. In science, these teams often comprise students and faculty members in collaborating labs distributed around the world. Team members with different skillsets can also teach each other new things, often through demonstration in a shared project. We will not have any team-based projects in the course, but you will learn some teamwork skills through occasional pair programming (more about that later) and through interactions with your peers and instructors.

The skills learned in this class will benefit students working in a wide range of areas of kinesiology. As well, the class will provide an introductory foundation in data science that can be applied to a range of areas beyond kinesiology, in academia, industry, and government.


This section was adapted from Aaron J. Newman’s Data Science for Psychology and Neuroscience - in Python.