Khan Academy - Statistics and Probability - Unit 5 EXPLORING BIVARIATE NUMERICAL DATA

EXPLORING BIVARIATE NUMERICAL DATA

PART 1 Introduction to scatterplots

PART 2 Introduction to trend lines

PART 3 Least-squares regression equations

PART 4 Assessing the fit in least-squares regression

PART 5 More on regression

 


PART 1 Introduction to scatterplots

1. Bivariate data: data on two variables, where each value of one variable is paired with a value of the other. The usual goal is to investigate a possible association between the two variables.

2. Scatterplot: uses dots to represent values for two different numeric variables. 

(1) It shows the relationship between two numeric variables.

(2) Each member of the dataset gets plotted as a point whose $(x, y)$ coordinates relate to its values for the two variables.

(3) A good scatterplot uses a reasonable scale on both axes and puts the explanatory variable on the $x$-axis.

3. Correlation: describes whether and how strongly pairs of variables are related.

(1) When the $y$ variable tends to increase as the $x$ variable increases, we say there is a positive correlation between the variables.

(2) When the $y$ variable tends to decrease as the $x$ variable increases, we say there is a negative correlation between the variables.

(3) when there is no clear relationship between the two variables, we say there is no correlation between the two variables. 

4. Describing the association in a scatterplot should always include a description of the form, direction, and strength of the association, along with the presence of any outliers

(1) Form: is the association linear or non-linear?

(2) Direction: is the association positive or negative?

(3) Strength: does the association appear to be strong, moderately strong, or weak?

(4) Outliers: do there appear to be any data points that are unusually far away from the general pattern?


5. Outliers in scatterplots: scatterplots often have a pattern. We call a data point an outlier if it doesn't fit the pattern.

6. Clusters in scatterplots: sometimes the data points in a scatter plot form distinct groups; these groups are called clusters.

7. Correlation coefficient ($r$): measures the direction and strength of a linear relationship

(1) it always has a value between -1 and 1

(2) strong positive linear relationships have values of $r$ closer to 1

(3) strong negative linear relationships have values of $r$ closer to -1

(4) weaker relationships have values of $r$ closer to 0

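(5) Formula: $r = \dfrac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)$, where $\bar{x}$, $\bar{y}$ are the sample means and $s_x$, $s_y$ are the sample standard deviations of $x$ and $y$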
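
The correlation coefficient can be computed directly from data. The sketch below assumes the z-score form of the formula, $r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, with sample standard deviations. The function name `correlation` and the small dataset are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation coefficient r via the z-score formula."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    # Average product of z-scores, using n - 1 in the denominator
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Illustrative data (not from the course)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))  # positive value: y tends to increase with x
```

Negating every $y$ value flips the sign of $r$ but keeps its magnitude, which matches the idea that the sign carries the direction and the size carries the strength.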

 


PART 2 Introduction to trend lines

1. Linear regression: it attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.  

  • When we see a relationship in a scatterplot, we can use a line to summarize the relationship in the data. 
  • We can also use that line to make predictions in the data.

2. After we fit the best line to the data, we can find its equation and use that equation to make predictions.

3. Line of best fit: the line (or curve) that most closely follows the trend of the data points.

 


PART 3 Least-squares regression equations

1. Least squares regression line: the line that makes the sum of the squared vertical distances from the data points to the line as small as possible.

(1) It's called "least squares" because the line of best fit is the one that minimizes the sum of the squares of the errors (the residuals).

(2) The quantity being minimized is $\sum (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the value predicted by the line and $y_i - \hat{y}_i$ is called the residual

(3) Why we square the residuals: some residuals are positive and some are negative, so squaring them prevents positive and negative residuals from canceling each other out.

(4) Squaring also makes large residuals count even more. This is useful because it gives extra weight to points that sit far away from the model, such as significant outliers.

2. Residual: the difference between the actual value of the point and its prediction based on the line of fit

(1) Residual is a measure of how well a line fits an individual data point.

(2) residual = actual - prediction

(3) For data points above the line, the residuals are positive; and for data points below the line, the residuals are negative.

(4) The closer a data point's residual is to 0, the better the fit.
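
The rule residual = actual - prediction can be sketched in a few lines. The line $\hat{y} = 0.6x + 2.2$ and the data below are assumptions chosen for illustration, and `residuals` is a hypothetical helper name.

```python
def residuals(xs, ys, slope, intercept):
    """Return actual - predicted for each point, for the line y = slope*x + intercept."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Illustrative data and a hypothetical fitted line: y-hat = 0.6x + 2.2
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys, slope=0.6, intercept=2.2)

for x, y, e in zip(xs, ys, res):
    # Positive residual: point above the line; negative: point below
    position = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"({x}, {y}): residual {e:+.1f}, point is {position} the line")
```

Points with positive residuals are reported as above the line and points with negative residuals as below it, in line with item (3).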

3. Calculate the equation of a regression line

(1) The least squares regression line always passes through the point formed by the sample mean of $x$ and the sample mean of $y$, that is, $(\bar{x}, \bar{y})$.

(2) The equation of the least squares regression line is: $\hat{y} = mx + b$, where

  • $m = r \dfrac{s_y}{s_x}$ (slope)
  • $b = \bar{y} - m\bar{x}$ ($y$-intercept)
  • $r$ is the correlation coefficient, and $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$
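
As a rough sketch of this calculation, the code below assumes the standard summary-statistic form of the least squares line, slope $m = r \cdot s_y / s_x$ and intercept $b = \bar{y} - m\bar{x}$. The function name `least_squares_line` and the data are illustrative only.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (slope, intercept) using m = r * s_y / s_x and b = y_bar - m * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Correlation coefficient r
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    m = r * s_y / s_x
    b = y_bar - m * x_bar
    return m, b

# Illustrative data (not from the course)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = least_squares_line(xs, ys)
print(f"y-hat = {m:.2f}x + {b:.2f}")

# The line passes through the point of means (x_bar, y_bar)
print(abs(m * mean(xs) + b - mean(ys)) < 1e-9)
```

Because $b$ is defined as $\bar{y} - m\bar{x}$, plugging $\bar{x}$ into the line always returns $\bar{y}$, which is why the final check holds.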

4. Impact of removing outliers on regression lines

 


PART 4 Assessing the fit in least-squares regression

1. Residual plot: a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis.

(1) A residual plot gives a sense of how good the fit is and whether a line is good at explaining the relationship between the variables.

(2) If the points in a residual plot are randomly dispersed around the horizontal axis with no visible trend, a linear regression model is appropriate for the data; otherwise, a nonlinear model may be more appropriate.

2. Standard deviation of residuals / Root-mean-square error(RMSE): 

(1) $\sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$, where $y_i - \hat{y}_i$ is the residual of the $i$-th point (some treatments divide by $n - 1$ or $n$ instead of $n - 2$)

(2) It is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit / how well the data points agree with the model

(3) The lower the RMSE is, the better the fit of the model.
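
A minimal sketch of this measure follows. It assumes the $n - 2$ denominator often used for simple linear regression; because some treatments divide by $n - 1$ or $n$, the denominator adjustment is exposed as a hypothetical `ddof` parameter. The data and both lines are made up for illustration.

```python
from math import sqrt

def residual_std(xs, ys, slope, intercept, ddof=2):
    """Standard deviation of residuals: sqrt(sum of squared residuals / (n - ddof))."""
    n = len(xs)
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return sqrt(sse / (n - ddof))

# Illustrative data (not from the course)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

good = residual_std(xs, ys, slope=0.6, intercept=2.2)  # least squares line for this data
flat = residual_std(xs, ys, slope=0.0, intercept=4.0)  # horizontal line at the mean of y
print(round(good, 4), round(flat, 4))
```

The least squares line gives the smaller value of the two, matching the idea that a lower value means the points sit closer to the line.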

3. R-squared ($r^2$): represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.

(1) In other words, $r^2$ tells us what percent of the prediction error in the $y$ variable is eliminated when we use least-squares regression on the $x$ variable.

(2) $r^2$ is also called the coefficient of determination

(3) Coefficient of determination ($r^2$) versus coefficient of correlation ($r$)

  • The coefficient of determination is the square of the coefficient of correlation
  • $r^2$ represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.
  • $r$ shows whether and how strongly the pairs of variables are related; it measures the direction and strength of a linear relationship

(4) Total squared error between the points and the line: $SE_{line} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

  • it is the total variation in $y$ that is not described by the regression line

(5) Total squared error between the points and the mean of $y$: $SE_{\bar{y}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

  • it is the total variation in $y$
  • Variance in $y$: $\dfrac{SE_{\bar{y}}}{n}$

(6) $\dfrac{SE_{line}}{SE_{\bar{y}}}$

  • it is what percent of the variation in $y$ is not described by the variation in $x$ / by the regression line

(7) $r^2 = 1 - \dfrac{SE_{line}}{SE_{\bar{y}}}$

  • it is what percent of the variation in $y$ is explained by the variation in $x$ / by the regression line.
  • If $SE_{line}$ is small, then this regression line is a good fit
  • If $SE_{line}$ is small relative to $SE_{\bar{y}}$, then $r^2$ is close to 1, which tells us that a lot of the variation in $y$ is described by the variation in $x$ and the line is a good fit.
  • If $SE_{line}$ is large (close to $SE_{\bar{y}}$), then $r^2$ is close to 0, which tells us that very little of the variation in $y$ is described by the variation in $x$ and the line is not a good fit.
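
The relationship between the two squared errors and R-squared can be sketched as follows, assuming the usual definition $r^2 = 1 - SE_{line} / SE_{\bar{y}}$. The names `r_squared`, `se_line`, and `se_mean`, and the data, are illustrative.

```python
from statistics import mean

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SE_line / SE_mean."""
    y_bar = mean(ys)
    # Squared error between the points and the line
    se_line = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Squared error between the points and the mean of y (total variation in y)
    se_mean = sum((y - y_bar) ** 2 for y in ys)
    return 1 - se_line / se_mean

# Illustrative data (not from the course); 0.6x + 2.2 is its least squares line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys, slope=0.6, intercept=2.2)
print(round(r2, 4))  # share of the variation in y described by the line

# A horizontal line at the mean of y explains none of the variation
print(round(r_squared(xs, ys, slope=0.0, intercept=4.0), 4))
```

For the least squares line, this value agrees with the square of the correlation coefficient computed from the same data.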

4. Important note: correlation does not necessarily imply causation.

(1) It means that, even if there is a strong relationship between $x$ and $y$, we cannot say that the changes in $x$ are the cause of the changes in $y$.

(2) Causation: it indicates that one event is the result of the occurrence of the other event.

 


PART 5 More on regression

How to minimize the squared error to the regression line

(1) Formula of the squared error to be minimized:

$SE_{line} = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2$

(2) For simple linear regression, when $\dfrac{\partial SE_{line}}{\partial m} = 0$ and $\dfrac{\partial SE_{line}}{\partial b} = 0$, then $y = mx + b$ is the best line of fit

(3) The partial derivatives with respect to the coefficients are:

$\dfrac{\partial SE_{line}}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - (m x_i + b) \right)$

$\dfrac{\partial SE_{line}}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)$

(4) Setting both to zero and solving the system of equations, we get the values of $m$ and $b$ for the best line of fit

$m = \dfrac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^2 - \overline{x^2}}$        $b = \bar{y} - m\bar{x}$         $y = mx + b$
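
As a numerical check on this derivation, the sketch below assumes the closed-form solution $m = \frac{\bar{x}\bar{y} - \overline{xy}}{(\bar{x})^2 - \overline{x^2}}$ and $b = \bar{y} - m\bar{x}$, then evaluates both partial derivatives of the squared error at that solution. The function names and data are illustrative, not from the course.

```python
from statistics import mean

def fit_by_means(xs, ys):
    """Closed-form least squares slope and intercept written in terms of means."""
    x_bar, y_bar = mean(xs), mean(ys)
    xy_bar = mean([x * y for x, y in zip(xs, ys)])  # mean of the products x*y
    x2_bar = mean([x * x for x in xs])              # mean of the squares x^2
    m = (x_bar * y_bar - xy_bar) / (x_bar ** 2 - x2_bar)
    b = y_bar - m * x_bar
    return m, b

def gradient(xs, ys, m, b):
    """Partial derivatives of SE = sum (y - (m*x + b))^2 with respect to m and b."""
    d_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys))
    d_b = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys))
    return d_m, d_b

# Illustrative data (not from the course)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = fit_by_means(xs, ys)
d_m, d_b = gradient(xs, ys, m, b)
print(f"m = {m:.2f}, b = {b:.2f}")
print(abs(d_m) < 1e-9, abs(d_b) < 1e-9)  # both partial derivatives vanish at the fit
```

Both partial derivatives come out as zero up to floating-point rounding, which is the condition used above to identify the best line of fit.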