Khan Academy - Statistics and Probability - Unit 5 EXPLORING BIVARIATE NUMERICAL DATA
PART 1 Introduction to scatterplots
PART 2 Introduction to trend lines
PART 3 Least-squares regression equations
PART 4 Assessing the fit in least-squares regression
PART 5 More on regression
PART 1 Introduction to scatterplots
1. Bivariate data: data on each of two variables, where each value of one variable is paired with a value of the other variable. Typically, it is of interest to investigate the possible association between the two variables.
2. Scatterplot: uses dots to represent values for two different numeric variables.
(1) It shows the relationship between two numeric variables.
(2) Each member of the dataset gets plotted as a point whose (x, y) coordinates correspond to its values for the two variables.
(3) A good scatterplot uses a reasonable scale on both axes and puts the explanatory variable on the x-axis.
3. Correlation: shows whether and how strongly a pair of variables is related.
(1) When the y variable tends to increase as the x variable increases, we say there is a positive correlation between the variables.
(2) When the y variable tends to decrease as the x variable increases, we say there is a negative correlation between the variables.
(3) When there is no clear relationship between the two variables, we say there is no correlation between the two variables.
4. Describing the association in a scatterplot should always include a description of the form, direction, and strength of the association, along with the presence of any outliers
(1) Form: is the association linear or non-linear?
(2) Direction: is the association positive or negative?
(3) Strength: does the association appear to be strong, moderately strong, or weak?
(4) Outliers: do there appear to be any data points that are unusually far away from the general pattern?
5. Outliers in scatterplots: scatterplots often have a pattern; we call a data point an outlier if it doesn’t fit the pattern.
6. Clusters in scatterplots: sometimes the data points in a scatter plot form distinct groups; these groups are called clusters.
7. Correlation coefficient (r): measures the direction and strength of a linear relationship.
(1) It always has a value between -1 and 1.
(2) Strong positive linear relationships have values of r closer to 1.
(3) Strong negative linear relationships have values of r closer to -1.
(4) Weaker relationships have values of r closer to 0.
(5) Formula: $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$, where $\bar{x}$, $\bar{y}$ are the sample means and $s_x$, $s_y$ are the sample standard deviations (see the sketch below).
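As a quick numeric check on the formula above, here is a minimal Python sketch; the dataset and variable names are made up purely for illustration, and NumPy is assumed available:

```python
import numpy as np

# Small made-up dataset: hours studied (x) vs. test score (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 71.0])

n = len(x)
s_x, s_y = np.std(x, ddof=1), np.std(y, ddof=1)  # sample standard deviations

# r = (1 / (n - 1)) * sum of the products of the z-scores.
r = np.sum(((x - x.mean()) / s_x) * ((y - y.mean()) / s_y)) / (n - 1)
print(r)                        # hand-rolled value
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in r, should match
```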
PART 2 Introduction to trend lines
1. Linear regression: it attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.
- When we see a relationship in a scatterplot, we can use a line to summarize the relationship in the data.
- We can also use that line to make predictions based on the data.
2. After we fit the best line to the data, we can find its equation and use that equation to make predictions.
3. Line of best fit: the line that best represents the trend of the points in a scatterplot.
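A minimal sketch of fitting a trend line and using its equation to predict, using NumPy's np.polyfit (which performs a least-squares fit; the data is made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 71.0])

# Fit a degree-1 polynomial (a line) by least squares.
slope, intercept = np.polyfit(x, y, 1)

# Use the fitted equation y-hat = slope * x + intercept to predict.
x_new = 6.0
y_pred = slope * x_new + intercept
print(f"yhat = {slope:.2f}x + {intercept:.2f}; prediction at x = 6: {y_pred:.1f}")
```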
PART 3 Least-squares regression equations
1. Least-squares regression line: the line that makes the sum of the squared vertical distances from the data points to the line as small as possible.
(1) It is called "least squares" because the best-fit line is the one that minimizes the sum of the squares of the errors (the residuals).
(2) The difference $y - \hat{y}$ between an actual value and the value predicted by the line is called a residual.
(3) Why we square the residuals: some residuals are positive and some are negative, and squaring them keeps the positive and negative residuals from canceling each other out.
(4) Squaring also magnifies larger residuals, so the fit takes into account significant outliers, points that sit far away from the model.
2. Residual: the difference between the actual value of the point and its prediction based on the line of fit
(1) Residual is a measure of how well a line fits an individual data point.
(2) residual = actual - prediction
(3) For data points above the line, the residuals are positive; and for data points below the line, the residuals are negative.
(4) The closer a data point's residual is to 0, the better the fit.
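To make the definition concrete, a small sketch that computes residual = actual - prediction for every point (made-up data; np.polyfit supplies the fitted line):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 71.0])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept   # predictions from the fitted line

residuals = y - y_hat           # residual = actual - prediction
print(residuals)
# Points above the line give positive residuals, points below give negative,
# and least-squares residuals always sum to (numerically) zero.
print(residuals.sum())
```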
3. Calculate the equation of a regression line
(1) The least-squares regression line always passes through the point of the sample means, $(\bar{x}, \bar{y})$.
(2) The equation of the least-squares regression line is $\hat{y} = a + bx$, where the slope is $b = r\frac{s_y}{s_x}$ and the intercept is $a = \bar{y} - b\bar{x}$ (see the sketch below).
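A sketch checking the slope and intercept formulas above against NumPy's own least-squares fit (same made-up data as before):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 71.0])

r = np.corrcoef(x, y)[0, 1]
s_x, s_y = np.std(x, ddof=1), np.std(y, ddof=1)

b = r * s_y / s_x             # slope = r * (s_y / s_x)
a = y.mean() - b * x.mean()   # forces the line through (x-bar, y-bar)
print(b, a)
print(np.polyfit(x, y, 1))    # should agree: [slope, intercept]
```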
4. Impact of removing outliers on regression lines: removing an outlier can noticeably change the slope, intercept, and correlation of the fitted line, especially when the outlier lies far from the other points (see the sketch below).
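A minimal sketch of the effect, with a deliberately placed outlier (all numbers are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 71.0, 40.0])  # last point is an outlier

print(np.polyfit(x, y, 1))            # with the outlier, the slope is dragged
                                      # down (here it even flips sign)
print(np.polyfit(x[:-1], y[:-1], 1))  # refit after removing the outlier
```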
PART 4 Assessing the fit in least-squares regression
1. Residual plot: a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis.
(1) A residual plot gives a sense of how good the fit is and whether a line is good at explaining the relationship between the variables.
(2) If the points in a residual plot are randomly dispersed around the horizontal axis and there is not any trend, a linear regression model is appropriate for the data; otherwise, a nonlinear model is more appropriate.
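A sketch of drawing a residual plot with matplotlib (assumed available; made-up data):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 71.0])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Residuals on the vertical axis, explanatory variable on the horizontal.
plt.scatter(x, residuals)
plt.axhline(0, color="gray", linewidth=1)  # reference line at residual = 0
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```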
2. Standard deviation of the residuals / root-mean-square error (RMSE):
(1) $\text{RMSE} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n}}$; the closely related standard deviation of the residuals divides by $n - 2$ instead of $n$.
(2) It is a measure of how spread out the residuals are. In other words, it tells you how concentrated the data points are around the line of best fit / how well the data points agree with the model.
(3) The lower the RMSE is, the better the fit of the model (see the sketch below).
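A sketch computing both versions of the spread of the residuals, the n-denominator RMSE and the (n - 2)-denominator standard deviation of the residuals, on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 71.0])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

rmse = np.sqrt(np.mean(residuals**2))                    # divides by n
s_resid = np.sqrt(np.sum(residuals**2) / (len(x) - 2))   # divides by n - 2
print(rmse, s_resid)
```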
3. R-squared ($R^2$): represents the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model.
(1) In other words, $R^2$ tells us what percent of the prediction error in the y variable is eliminated when we use least-squares regression on the x variable.
(2) $R^2$ is also called the coefficient of determination.
(3) Coefficient of determination ($R^2$) versus coefficient of correlation ($r$)
- The coefficient of determination is the square of the coefficient of correlation: $R^2 = r^2$.
- $R^2$ represents the proportion of the variance in the dependent variable that is explained by the independent variable(s) in the regression model.
- $r$ shows whether and how strongly a pair of variables is related; it measures the direction and strength of a linear relationship.
(4) Total squared error between the points and the line: $\text{SE}_{\text{line}} = \sum (y_i - \hat{y}_i)^2$
- It is the total variation in y that is not described by the regression line.
(5) Total squared error between the points and the mean of y: $\text{SE}_{\bar{y}} = \sum (y_i - \bar{y})^2$
- It is the total variation in y.
- Variance in y: $\text{SE}_{\bar{y}} / n$
(6) $\frac{\text{SE}_{\text{line}}}{\text{SE}_{\bar{y}}}$
- It is the percent of the variation in y that is not described by the variation in x / by the regression line.
(7) $R^2 = 1 - \frac{\text{SE}_{\text{line}}}{\text{SE}_{\bar{y}}}$
- It is the percent of the variation in y that is explained by the variation in x / by the regression line.
- If $\text{SE}_{\text{line}}$ is small, then this regression line is a good fit.
- If $\frac{\text{SE}_{\text{line}}}{\text{SE}_{\bar{y}}}$ is small, then $R^2$ is close to 1, which tells us that a lot of the variation in y is described by the variation in x and the line is a good fit.
- If $\frac{\text{SE}_{\text{line}}}{\text{SE}_{\bar{y}}}$ is large, then $R^2$ is close to 0, which tells us that very little of the variation in y is described by the variation in x and the line is not a good fit (see the sketch below).
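A numeric sketch of computing $R^2$ from the two squared-error totals above, checked against $r^2$ (made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 71.0])

slope, intercept = np.polyfit(x, y, 1)
se_line = np.sum((y - (slope * x + intercept))**2)  # error around the line
se_mean = np.sum((y - y.mean())**2)                 # error around y-bar

r_squared = 1 - se_line / se_mean
print(r_squared)
print(np.corrcoef(x, y)[0, 1]**2)  # should match: R^2 = r^2
```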
4. Important note: correlation does not necessarily imply causation.
(1) It means that, even if there is a strong relationship between x and y, we cannot say that the changes in x are the cause of the changes in y.
(2) Causation: it indicates that one event is the result of the occurrence of the other event.
PART 5 More on regression
How to minimize the squared error to the regression line
(1) Formula for the squared error of the line: $\text{SE}_{\text{line}} = \sum_{i=1}^{n} \left(y_i - (mx_i + b)\right)^2$
(2) For simple linear regression, when $m = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}$ and $b = \bar{y} - m\bar{x}$, then $y = mx + b$ is the best-fit line (see the sketch below).
(3) The partial derivatives with respect to the coefficients are: $\frac{\partial \text{SE}_{\text{line}}}{\partial m} = -2\sum_{i=1}^{n} x_i\left(y_i - (mx_i + b)\right)$ and $\frac{\partial \text{SE}_{\text{line}}}{\partial b} = -2\sum_{i=1}^{n} \left(y_i - (mx_i + b)\right)$
(4) Setting both partial derivatives to zero and solving the resulting system of equations gives the values of $m$ and $b$ for the best-fit line.
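A sketch checking the closed-form $m$ and $b$ from step (2) against np.polyfit (made-up data; overline denotes a mean):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 58.0, 61.0, 67.0, 71.0])

# Closed-form solution from setting the partial derivatives to zero:
# m = (mean(xy) - mean(x) * mean(y)) / (mean(x^2) - mean(x)^2)
m = (np.mean(x * y) - x.mean() * y.mean()) / (np.mean(x**2) - x.mean()**2)
b = y.mean() - m * x.mean()
print(m, b)
print(np.polyfit(x, y, 1))  # should agree with the formulas above
```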