Introduction to Regression and Data Science

Lucy D’Agostino McGowan

  mcgowald@wfu.edu
  By appointment
  bit.ly/lucystats-office-hours

Lucy D’Agostino McGowan

  • Biostatistician focused on: data science, causal inference, analytic design theory, and statistical communication
  • UNC Chapel Hill (BA Religious Studies and Romance Languages 2012)
  • Washington University in St. Louis (MS Biostatistics 2013)
  • Vanderbilt University (PhD Biostatistics 2018)
  • Johns Hopkins (Postdoc Biostatistics 2019)
  • Fun fact: I host a podcast called Casual Inference, sponsored by the American Journal of Epidemiology
  • Statistical communication: Atlantic, New York Times, USA Today, BBC Radio
  • Consulting: Pharmaceutical companies, software development, app development
  • See more: lucymcgowan.com

bit.ly/sta-112-s24

data = model + error

\(y = f(\mathbf{X}) + \epsilon\)

🗣 math speak

\(y = \color{orange}{f(\mathbf{X})} + \epsilon\)

model

\(y = f(\color{orange}{\mathbf{X}}) + \epsilon\)

data (to build the model)

\(\color{orange}y = f(\mathbf{X}) + \epsilon\)

data (outcome)

\(y = \color{orange}{\beta_0 + \beta_1X }+ \epsilon\)

simple linear regression
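
To make "data = model + error" concrete, here is a minimal sketch of fitting a simple linear regression by least squares. It is written in plain Python rather than the R used in class, the function name is hypothetical, and the x and y values are made up for illustration.

```python
# Simple linear regression by ordinary least squares, written out by hand.
# The data are invented for illustration; they are not course data.

def fit_simple_linear_regression(x, y):
    """Return (b0, b1), the least-squares estimates of beta_0 and beta_1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # Intercept: makes the fitted line pass through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear_regression(x, y)

# model: the fitted values; error: what is left over (the residuals).
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Each observed y equals its fitted value plus its residual, which is the decomposition on the earlier slide.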

Bob at Easel from Wikipedia

Bob Ross

\(y = \color{orange}{\beta_0 + \beta_1X}+ \epsilon\)

Bob Ross

\(\text{# of paintings with clouds} = \color{orange}{\beta_0 + \beta_1 \text{season}}+ \epsilon\)

Bob Ross

\(\text{# of paintings with clouds} = \color{orange}{\beta_0 + \boldsymbol\beta f(\text{season})}+ \epsilon\)

\(\tiny y = \color{orange}{\beta_0 + \beta_1X_1 + \beta_2X_2+...}+ \epsilon\)

multiple linear regression
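
With more than one predictor, the same least-squares idea applies. The sketch below solves the normal equations in plain Python. The helper names and the data are hypothetical, and y is generated without noise so the known coefficients are recovered.

```python
# Multiple linear regression via the normal equations (X'X) b = X'y.
# Pure-Python sketch for illustration; in practice a statistics package does this.

def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def fit_multiple_linear_regression(predictors, y):
    """predictors: rows of [x1, x2, ...]. Returns [b0, b1, b2, ...]."""
    design = [[1.0] + list(row) for row in predictors]  # add intercept column
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Made-up predictors; y is built with beta_0 = 1, beta_1 = 2, beta_2 = -0.5.
predictors = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [1.0 + 2.0 * x1 - 0.5 * x2 for x1, x2 in predictors]

coefs = fit_multiple_linear_regression(predictors, y)
print([round(c, 6) for c in coefs])
```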

\(\tiny\color{orange}{f(y)} = \beta_0 + \beta_1X_1 + \beta_2X_2+...\)

\(\tiny\color{orange}{\text{logit}(P(y = 1))} = \beta_0 + \beta_1X_1 + \beta_2X_2+...\)

logistic regression
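
The logit is the log of the odds, log(p / (1 - p)), and its inverse maps any value of the linear predictor back to a probability between 0 and 1. A minimal sketch, with hypothetical function names and made-up coefficients that are not estimates from any data:

```python
import math


def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Map a linear predictor eta back to a probability."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical coefficients, chosen only to illustrate the calculation.
b0, b1 = -1.0, 0.5

for x in [0, 2, 4]:
    eta = b0 + b1 * x          # linear predictor: beta_0 + beta_1 * x
    prob = inv_logit(eta)      # P(y = 1) implied by the model
    print(f"x = {x}: P(y = 1) = {prob:.3f}")
```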

Plan

  • Thinking about, visualizing, and wrangling data
  • Simple Linear Regression
  • Multiple Linear Regression
  • Logistic Regression

Let’s go!

Log in to RStudio Pro

RStudio Pro Setup

Step 1: Create a New Project

Click File > New Project

Step 2: Click “Version Control”

Click the third option.

Step 3: Click Git

Click the first option

Step 4: Copy my starter files

Paste this link in the top box (Repository URL):

https://github.com/sta-112-s24/appex-01-welcome-penguins.git

Penguin fun!

  • Once you log on to RStudio Pro, create a new project from version control (Git)
  • Paste https://github.com/sta-112-s24/appex-01-welcome-penguins.git in the Repository URL box
  • Find the file pane (on the bottom right) and click the welcome-penguins.qmd file
  • Click the “Render” button
  • Go back to the file and change your name on top (in the yaml – we’ll talk about what this means later) and render again.
  • Then scroll to the plot chunk below Palmer Penguins. Instead of plotting the relationship between flipper length and bill length, plot the relationship between flipper length and bill depth. Hint: look at the full dataset at the bottom of the document for variable names, and update the captions to match your new plot.
  • Render again and voilà!

Two truths and a lie

Within your group:

  1. One person tells three personal statements, one of which is a lie.
  2. Others discuss and guess which statement is the lie, and they jointly construct a numerical statement of their certainty in the guess (on a 0–10 scale).
  3. The storyteller reveals which was the lie.
  4. Enter the certainty number and the outcome (success or failure) and submit in the Google form.

Rotate through everyone in your group so that each person plays the storyteller role once.

Two truths and a lie

bit.ly/sta-112-s24-ae1

Two truths and a lie data

  • What do you think the range of certainty scores will look like: will there be any 0’s or 10’s?
  • Will there be a positive relation between x and y: will guesses with higher certainty be more accurate, on average?
  • How strong will the relation be between x and y: what will a plot look like?
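
Once the form responses are in, questions like these can be checked with a few summaries. The sketch below assumes x is the certainty score (0 to 10) and y is the outcome coded 1 for a correct guess and 0 otherwise. The values are invented placeholders, not class data, and the cutoff of 7 for "high certainty" is arbitrary.

```python
# Hypothetical responses: one entry per round of the game.
certainty = [2, 3, 4, 5, 6, 7, 8, 8, 9, 10]   # x: stated certainty, 0-10
success = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]      # y: 1 = guessed the lie, 0 = did not


def pearson_correlation(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / (s_xx * s_yy) ** 0.5


def success_rate(outcomes):
    """Proportion of correct guesses."""
    return sum(outcomes) / len(outcomes)


# Range of certainty scores: were there any 0's or 10's?
print("range:", min(certainty), "to", max(certainty))

# Direction and strength of the relation between x and y.
r = pearson_correlation(certainty, success)
print(f"correlation: {r:.2f}")

# Accuracy among low-certainty versus high-certainty guesses.
low = [s for c, s in zip(certainty, success) if c < 7]
high = [s for c, s in zip(certainty, success) if c >= 7]
low_rate, high_rate = success_rate(low), success_rate(high)
print(f"low certainty: {low_rate:.2f}, high certainty: {high_rate:.2f}")
```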

Let’s take a tour - class website

  • Concepts introduced:
    • How to find slides
    • How to find assignments
    • How to find RStudio Pro
    • How to get help
    • How to find policies

Course structure and policies

Class meetings

  • Interactive
  • Some lectures, lots of learn-by-doing
  • Bring your laptop to class every day

Diversity & Inclusiveness:

  • Intent: I intend for students from all backgrounds and perspectives to be well served by this course, for students’ learning needs to be addressed both in and out of class, and for the diversity that students bring to this class to be viewed as a resource, strength, and benefit. It is my intent to present materials and activities that are respectful of diversity: gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture. Let me know ways to improve the effectiveness of the course for you personally, or for other students or student groups.
  • If you have a name and/or set of pronouns that differ from those that appear in your official Wake Forest records, please let me know!

Diversity & Inclusiveness:

  • If you feel like your performance in the class is being impacted by your experiences outside of class, please don’t hesitate to come and talk with me. I want to be a resource for you. If you prefer to speak with someone outside of the course, your academic dean is an excellent resource.
  • I (like many people) am still in the process of learning about diverse perspectives and identities. If something was said in class (by anyone) that made you feel uncomfortable, please talk to me about it.

Disability Policy

Students with disabilities who believe that they may need accommodations in the class are encouraged to contact CLASS as soon as possible to better ensure that such accommodations are implemented in a timely fashion.

How to get help

All course questions can be posted on the Canvas Q&A board

  • This is a place to post your course-related questions. I encourage you to try to answer each other’s questions.
  • At the end of the semester, I will tally up the total number of questions you answered, and you can earn up to 1 point of extra credit on your final grade.
  • For personal and grade-related questions, use email.

How to get help

Math & Stats Center

Academic integrity

Adhere to the Wake Forest Honor Code. Academic dishonesty will not be tolerated.

Sharing/reusing code

  • There are many online resources for sharing code (for example, Stack Overflow and ChatGPT). You may use these resources, but you must explicitly cite where you obtained any code (both code you used directly and “paraphrased” code or code used as inspiration). Any reused code that is not explicitly cited will be treated as plagiarism.
  • You may discuss the content of assignments with others in this class. If you do so, please acknowledge your collaborator(s) at the top of your assignment, for example: “Collaborators: Gertrude Cox, Florence Nightingale David”. Failure to acknowledge collaborators will result in a grade of 0. You may not copy code and/or answers directly from another student. If you copy someone else’s work, both parties will receive a grade of 0.
  • Rather than copying someone else’s work, ask for help. You are not alone in this course!

Course components:

  • Application exercises: Usually start in class and finish in teams by the next class period, check/no check
  • Check-ins
  • Lab: start in class
  • Exams: 2 in-class midterms
  • Final project: Presentations during the last week of class

Grading

Application exercises & Annotations 5%
Midterm 01 20%
Midterm 02 20%
Check-ins 20%
Labs 20%
Final Project 15%
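
As a rough illustration of how these weights combine, the sketch below computes a weighted course percentage. The weights come from the list above. The function name and the component scores are hypothetical, and the sketch ignores extra credit and late penalties.

```python
# Grading weights from the slide, in percent of the final grade.
WEIGHTS = {
    "Application exercises & Annotations": 5,
    "Midterm 01": 20,
    "Midterm 02": 20,
    "Check-ins": 20,
    "Labs": 20,
    "Final Project": 15,
}


def final_grade(scores):
    """scores maps each component to a percent (0-100); returns the weighted percent."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Made-up component scores for one hypothetical student.
example_scores = {
    "Application exercises & Annotations": 100,
    "Midterm 01": 85,
    "Midterm 02": 90,
    "Check-ins": 95,
    "Labs": 92,
    "Final Project": 88,
}

print(f"course grade: {final_grade(example_scores):.1f}%")
```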

Late/missed work policy

  • Late work policy for homework assignments:
    • late, but within 24 hours of due date/time: -50%
    • any later: no credit
  • Late work will not be accepted for the final project.
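
One way to read the policy above as arithmetic is sketched below. The sketch assumes that "-50%" means the earned score is halved and that work submitted exactly 24 hours late still counts as within 24 hours. The function name is hypothetical, and the syllabus governs if this reading differs.

```python
def credit_after_late_policy(score, hours_late, is_final_project=False):
    """Return the credit earned after applying the late policy.

    score: points earned before any penalty.
    hours_late: hours past the due date/time (0 or negative means on time).
    """
    if hours_late <= 0:
        return score            # on time: full credit
    if is_final_project:
        return 0.0              # late final projects are not accepted
    if hours_late <= 24:
        return score * 0.5      # assumed reading of "-50%": half credit
    return 0.0                  # any later: no credit


print(credit_after_late_policy(90, 0))
print(credit_after_late_policy(90, 3))
print(credit_after_late_policy(90, 30))
```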

Other policies

  • Please refrain from texting or using your computer for anything other than coursework during class.
  • You must be in class on any day you’re scheduled to present; there are no make-ups for presentations.
  • Regrade requests must be made within 1 week of when the assignment is returned.

RStudio Pro

  • If you had issues creating your RStudio Pro account, opening the project, or running the analysis, stick around to try it again.