Introduction to Statistics: an integrated textbook and workbook using R
2023-09-26
Introduction
Welcome to statistics!
If you want, you can also download this book as a PDF or EPUB file. Be aware that the print versions are missing some of the richer formatting of the online version. In any case, the recommended way to work through this material is to download the R notebook file (.Rmd) at the top of each chapter and work through it in RStudio.
History and goals
In 2015, a group of interdisciplinary faculty at Westminster College (Salt Lake City, UT) started a process that led to the creation of a new Data Science program. In preparation for creating a more rigorous introductory statistics course using the statistical software R, I wrote a series of 22 modules that filled a gap in the R training literature. Most R training at the time was focused either on learning to program using R as a computer language, or on using R to do sophisticated statistical analysis. We needed our students to use R as a tool for elementary statistical methods, and we needed the learning curve to be as gentle as possible. I decided early on that to make the modules more useful, they needed to be structured more like an interactive textbook than just a series of lab exercises, and so I spent the summer of 2016 writing a free, open-source, self-contained, and nearly fully-featured introductory statistics textbook. The first sections of the newly-created DATA 220 were offered in Fall 2016, using the materials I created.
Since then, I have been revising and updating the modules a little every semester. At some point, however, it became clear that some big changes needed to happen:
- The modules were more or less aligned with the OpenIntro book Introduction to Statistics with Randomization and Simulation (ISRS) by David Diez, Christopher Barr, and Mine Çetinkaya-Rundel. That book has now been supplanted by Introduction to Modern Statistics (IMS) by Mine Çetinkaya-Rundel and Johanna Hardin, also published through the OpenIntro project.
- The initial materials were written mostly using a mix of base R tools, some tidyverse tools, and the amazing resources of the mosaic package. I wanted to convert everything to be more aligned with tidyverse packages now that they are mature, well-supported, and becoming a de facto standard for doing data analysis in R.
- The initial choice of data sets that served as examples and exercises for students was guided by convenience. As I had only a short amount of time to write an entire textbook from scratch, I tended to grab the first data sets I could find that met the conditions needed for the statistical principles I was trying to illustrate. It has become clear in the last few years that the material will be more engaging with more interesting data sets. Ideally, we should use at least some data sets that speak to issues of social justice.
- Making statistics more inclusive requires us to confront some ugly chapters in the development of the subject. Statistical principles are often named after people. (These are supposedly the people who “discovered” the principle, but keep in mind Stigler’s Law of Eponymy, which states that no scientific discovery is truly named after its original discoverer. In a neat bit of self-referential irony, Stephen Stigler was not the first person to make this observation.) The beliefs of some of these people were problematic. For example, Francis Galton (famous for the concept of “regression to the mean”), Karl Pearson (of the Pearson correlation coefficient), and Ronald Fisher (famous for many things, including the P-value) were all deeply involved in the eugenics movement of the late 19th and early 20th centuries. The previous modules almost never referenced this important historical background and context. Additionally, it’s important to discuss ethics, whether that means issues of data provenance, data manipulation, choice of analytic techniques, framing of conclusions, or many other topics.
The results of my revisions are here online. I’ve tried to address all the concerns mentioned above:
- The chapters are arranged to align somewhat with IMS. There isn’t quite a one-to-one correspondence, but teachers who want to use the chapters of my book to supplement instruction from IMS, or vice versa, should be able to do so pretty easily. In the Appendix, I’ve included a concordance that shows how the books’ chapters match up, along with some notes that explain when one book does more or less than the other.
- The book is now completely aligned with the tidyverse and other packages that are designed to integrate into the tidyverse. All plotting is done with ggplot2 and all data manipulation is done with dplyr, tidyr, and forcats. Tables are created using tabyl from the janitor package. Inference is taught using the cool tools in the infer package. (A minimal sketch of this toolchain appears after this list.)
- I have made an effort to find more interesting data sets. It’s tremendously difficult to find data that is both fascinating on its merits and also meets the pedagogical requirements of an introductory statistics course. I would like to use even more data that addresses social justice issues. There’s some in the book now, and I plan to incorporate even more in the future as I come across data sets that are suitable.
- When statistical tools are introduced, I have tried to give a little historical context about their development if I can. I’ve also tried to frame every step of the inferential process as a decision-making process that requires not only analytical expertise, but also solid ethical grounding. Again, there’s a lot more I could do here, and my goal is to continue to develop more such discussion as I can in future revisions.
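To give a sense of what this toolchain looks like in practice, here is a minimal sketch using the built-in mtcars data set; the variable choices are mine for illustration, not examples from the book:

```r
library(dplyr)
library(ggplot2)
library(janitor)

# dplyr for data manipulation, tabyl from janitor for a frequency table
mtcars %>%
    mutate(cyl = factor(cyl)) %>%
    tabyl(cyl)

# ggplot2 for all plotting
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
    geom_boxplot()
```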
Now, instead of a bunch of separate module files, all the material is gathered in one place as chapters of a book. In each chapter (starting with Chapter 2), students can download the chapter as an R notebook file, open it in RStudio, and work through the material.
Philosophy and pedagogy
To understand my statistics teaching philosophy, it’s worth telling you a little about my background in statistics.
At the risk of undermining my own credibility, I’d like to tell you about the first statistics class I took. In the mid-2000s, I was working on my Ph.D. at the University of California, San Diego, studying geometric topology. To make a little extra money and get some teaching experience under my belt, I started teaching night and summer classes at Miramar College, a local community college in the San Diego Community College District. I had been there for several semesters, mostly teaching pre-calculus, calculus, and other lower-division math classes. One day, I got a call from my department chair with my assignment for the upcoming semester. I was scheduled to teach intro stats. I was about to respond, “Oh, I’ve never taken a stats class before.” But remembering this was the way I earned money to be able to live in expensive San Diego County, I said, “Sounds great. By the way, do you happen to have an extra copy of the textbook we’ll be using?”
Yes, the first statistics class I took was the one I taught. Not ideal, I know.
I was lucky to start teaching with Intro Stats by De Veaux, Velleman, and Bock, a book that was incredibly well-written and included a lot of resources for teachers like me. (I learned quickly that I wasn’t the only math professor in the world who got thrown into teaching statistics classes with little to no training.) I got my full-time appointment at Westminster College in 2008 and continued to teach intro stats classes for many years to follow. As I mentioned earlier, we started the Data Science program at Westminster College in 2016 and moved everything from our earlier hodgepodge of calculators, spreadsheets, and SPSS, over to R.
Eventually, I got interested in Bayesian statistics and read everything I could get my hands on. I became convinced that Bayesian statistics is the “right” way to do statistical analysis. I started teaching special topics courses in Bayesian Data Analysis and working with students on research projects that involved Bayesian methods. **If it were up to me, every introductory statistics class in the world would be taught using Bayesian methods.** I know that sounds like a strong statement. (And I put it in boldface, so it looks even stronger.) But I truly believe that in an alternate universe where Fisher and his disciples didn’t “win” the stats wars of the 20th century (and perhaps one in which computing power got a little more advanced a little earlier in the development of statistics), we would all be Bayesians. Bayesian thinking is far more natural and more closely aligned with our intuitions about probability and uncertainty.
Unfortunately, our universe’s timeline didn’t play out that way. So we are left with frequentism. It’s not that I necessarily object to frequentist tools. All tools are just tools, after all. However, the standard form of frequentist inference, with its null hypothesis significance testing, P-values, and confidence intervals, can be confusing. It’s bad enough that professional researchers struggle with these ideas; we are teaching them to undergraduate students in introductory classes.
Okay, so we are stuck not in the world we want, but the world we’ve got. At my institution and most others, intro stats is a service course that trains far more students from outside the fields of mathematics and statistics than from within them. Those students will go on to careers where they interact with research that reports P-values and confidence intervals.
So what’s the best we can do for our students, given that limitation? We need to be laser-focused on teaching the frequentist logic of inference the best we can. I want students to see P-values in papers and know how to interpret those P-values correctly. I want students to understand what a confidence interval tells them—and even more importantly, what it does not tell them. I want students to respect the severe limitations inherent in tests of significance. If we’re going to train frequentists, the least we can do is help them become good frequentists.
One source of inspiration for good statistical pedagogy comes from the Guidelines for Assessment and Instruction in Statistics Education (GAISE), a set of recommendations made by experienced stats educators and endorsed by the American Statistical Association. Their college guidelines are as follows:
- Teach statistical thinking.
- Teach statistics as an investigative process of problem-solving and decision-making.
- Give students experience with multivariable thinking.
- Focus on conceptual understanding.
- Integrate real data with a context and purpose.
- Foster active learning.
- Use technology to explore concepts and analyze data.
- Use assessments to improve and evaluate student learning.
In every element of this book, I’ve tried to follow these guidelines:
- The first part of the book is an extensive guide for exploratory data analysis. The rest of the book is about inference in the context of specific research questions that are answered using statistical tools. While multivariable thinking is a little harder to do in an intro stats class, I take the opportunity whenever possible to use graphs to explore more variables than we can handle with intro stats inferential techniques. I point out that the simple analyses taught in this class are only the first step in more comprehensive analyses that incorporate more information and control for confounders. I emphasize that students can continue their statistical growth by enrolling in more advanced stats classes.
- I often tell students that if they forget everything else from their stats class, the one thing I want them to be able to do is interpret a P-value correctly. It’s not intuitive, so it takes an entire semester to set up the idea of a sampling distribution and explain over and over again how the P-value relates to it. In this book, I try to reinforce the logic of inference until the students know it almost instinctively. A huge pedagogical advantage is derived by using randomization and simulation to keep students from getting lost in the clouds of theoretical probability distributions. But they need to know about the latter too. Every hypothesis test is presented both ways, a task made easy when using the infer package.
- This is the thing I struggle with the most. Finding good data is hard. Over the years, I’ve found a few data sets I really like, but my goal is to continue to revise the book to incorporate more interesting data, especially data that serves to highlight issues of social justice.
- Back when I wrote the first set of modules that eventually became this book, the goal was to create assignments that merged content with activities so that students would be engaged in active learning. When these chapters are used in the classroom, students can collaborate with each other and with their professor. They learn by doing.
- Unlike most books out there, this book does not try to be agnostic about technology. This book is about doing statistics in R.
- This one I’ll leave in the capable hands of the professors who use these materials. The chapter assignments should be completed and submitted, and that is one form of assessment. But I also believe in augmenting this material with other forms of assessment that may include supplemental assignments, open-ended data exploration, quizzes and tests, projects, etc.
Course structure
As explained above, this book is meant to be a workbook that students complete as they’re reading.
At Westminster College, we host RStudio Workbench on a server that is connected to our single sign-on (SSO) systems so that students can access RStudio through a browser using their campus online usernames and passwords. If you have the ability to convince your IT folks to get such a server up and running, it’s well worth it. Rather than spending the first day of class troubleshooting while students try to install software on their machines, you can just have them log in and get started right away. Campus admins can install packages and tweak settings to make sure all students have a standardized interface and a consistent experience.
If you don’t have that luxury, you will need to have students download and install both R and RStudio. The installation processes for both pieces of software are very easy and straightforward for the majority of students. The book chapters here assume that the necessary packages are installed already, so if your students are running R on their own machines, they will need to use install.packages at the beginning of some of the chapters for any new packages that are introduced. (They are mentioned at the beginning of each chapter with instructions for installing them.)
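For instance, a chapter that relies on the janitor and infer packages (to pick two arbitrary examples) would have students run something like this once at the Console:

```r
# Install once per machine; the package names here are only examples
install.packages("janitor")
install.packages("infer")
```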
Chapter 1 is fully online and introduces R and RStudio very gently using only commands at the Console. By the end of Chapter 1, students will have created a project called intro_stats in RStudio that should be used all semester to organize their work. There is a reminder at the beginning of all subsequent chapters to make sure they are in that project before starting to do any work. (Generally, there is no reason they will exit the project, but some students get curious and click on stuff.)
In Chapter 2, students are taught to click a link to download an R Notebook file (.Rmd). I have found that students struggle initially to get this file to the right place. If students are using RStudio Workbench online, they will need to use the “Upload” button in the Files tab in RStudio to get the file from their Downloads folder (or wherever they tell their machine to put downloaded files from the internet) into RStudio. If students are using R on their own machines, they will need to move the file from their Downloads folder into their project directory. Some students have never had to move files around on their computers, so this is a task that might require some guidance from classmates, TAs, or the professor. The location of the project directory and the downloaded files can vary from one machine to the next. Students will have to use something like File Explorer for Windows or the Finder for macOS, so there isn’t a single set of instructions that will get all students’ files successfully in the right place. Once the file is in the correct location, students can just click on it to open it in RStudio and start reading. Chapter 2 is all about using R Notebooks: markdown syntax, R code chunks, and inline code.
By Chapter 3, a rhythm is established that students will start to get used to:
- Open the book online and open RStudio.
- Install any packages in RStudio that are new to that chapter. (Not necessary for those using RStudio Workbench in a browser.)
- Check to make sure they’re in the intro_stats project.
- Click the link online to download the R Notebook file.
- Move the R Notebook file from the Downloads folder to the project directory.
- Open up the R Notebook file.
- Restart R and Run All Chunks.
- Start reading and working.
Chapters 3 and 4 focus on exploratory data analysis for categorical and numerical data, respectively.
Chapter 5 is a primer on data manipulation using dplyr.
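As a taste of what that primer covers, here is a small illustrative pipeline (the data set and variables are my choices, not the chapter’s):

```r
library(dplyr)

mtcars %>%
    filter(cyl == 4) %>%      # keep only the four-cylinder cars
    select(mpg, wt) %>%       # keep only two columns of interest
    arrange(desc(mpg))        # sort from most to least fuel efficient
```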
Chapters 6 and 7 cover correlation and regression. This “early regression” approach mirrors the IMS text. (IMS eventually circles back to hypothesis testing for regression, but this book does not. That’s a topic that is covered extensively in most second-semester stats classes.)
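For the flavor of those chapters, here is a minimal example of correlation and a fitted line (again with an illustrative data set of my choosing):

```r
library(ggplot2)

cor(mtcars$wt, mtcars$mpg)    # correlation between weight and fuel efficiency

ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)   # overlay the least-squares line
```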
Chapters 8–11 are crucial for building the logical foundations for inference. The idea of a sampling distribution under the assumption of a null hypothesis is built up slowly and intuitively through randomization and simulation. By the end of Chapter 11, students will be fully introduced to the structure of a hypothesis test, and hopefully will have experienced the first sparks of intuition about why it “works.” All inference in this book is conducted using a “rubric” approach—basically, the steps are broken down into bite-sized pieces and students are expected to work through each step of the rubric every time they run a test. (The rubric steps are shown in the Appendix.)
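As a rough sketch of what a simulation-based test looks like with infer (this is not the book’s rubric itself, and the hypothesis is an invented example):

```r
library(dplyr)
library(infer)

# Invented example: test whether the proportion of manual transmissions is 0.5
cars <- mtcars %>% mutate(am = factor(am))

obs_prop <- cars %>%
    specify(response = am, success = "1") %>%
    calculate(stat = "prop")

null_dist <- cars %>%
    specify(response = am, success = "1") %>%
    hypothesize(null = "point", p = 0.5) %>%
    generate(reps = 1000, type = "draw") %>%   # simulate under the null
    calculate(stat = "prop")

get_p_value(null_dist, obs_stat = obs_prop, direction = "two-sided")
```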
Chapter 12 introduces a few more steps to the rubric for confidence intervals. As we are still using randomization to motivate inference, confidence intervals are calculated using the bootstrap approach for now.
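A minimal bootstrap interval with infer might look like this (a sketch, not the book’s worked example):

```r
library(dplyr)
library(infer)

boot_dist <- mtcars %>%
    specify(response = mpg) %>%
    generate(reps = 1000, type = "bootstrap") %>%   # resample with replacement
    calculate(stat = "mean")

get_confidence_interval(boot_dist, level = 0.95, type = "percentile")
```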
Once students have developed a conceptual intuition for sampling distributions using simulation, we can introduce probability models as well. Chapter 13 introduces normal models and Chapter 14 explains why they are often appropriate for modeling sampling distributions.
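Once normal models are in play, base R supplies their probabilities directly; for example:

```r
pnorm(1.96)     # P(Z < 1.96) for a standard normal, about 0.975
qnorm(0.975)    # the z-value with 97.5% of the distribution below it, about 1.96
```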
The final chapters of the book (Chapters 15–22) are simply applications of inference in specific data settings: inference for one (Ch. 15) and two (Ch. 16) proportions, Chi-square tests for goodness-of-fit (Ch. 17) and independence (Ch. 18), inference for one mean (Ch. 19), paired data (Ch. 20), and two independent means (Ch. 21), and finally ANOVA (Ch. 22). Along the way, students learn about the chi-square, Student t, and F distributions. Although the last part of the book follows a fairly traditional parametric approach, every chapter still includes randomization and simulation to some degree so that students don’t lose track of the intuition behind sampling distributions under the assumption of a null hypothesis.
Onward and upward
I hope you enjoy the textbook. You can provide feedback in two ways:
The preferred method is to file an issue on the GitHub page: https://github.com/jingsai/intro_stats/issues
Alternatively, send me an email: jliang@westminsteru.edu