class: center, middle

## IMSE 586
## Big Data Analytics and Visualization
### More on linear regression
### Instructor: Fred Feng

---

$$Y=\beta_0+\beta_1x_1+\cdots+\beta_px_p+\epsilon$$

$$\text{where }\;\epsilon \sim \text{N}(0, \sigma^2)$$

One of the assumptions is that the errors are [i.i.d.](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables).

--

For some data, the values are *not* independent.

- Time series data (or [longitudinal](https://en.wikipedia.org/wiki/Longitudinal_study) data)

--

.center[![:scale 70%](images/msft.png)]

--

Regression is typically used for [cross-sectional data](https://en.wikipedia.org/wiki/Cross-sectional_data).

---

# Nonlinear relationship

--

.center[Ice cream sales vs. outside temperature]

--

.center[.gray[(hypothetical data)]

![:scale 90%](images/icecream_vs_temp.png)
]

---

### Knowing .red[what variables to consider] is crucial.

--

.center[Distance vs. elevation gain for Fred's 25 bike rides]

![:scale 53%](images/dist_vs_elevation_1.png)

--

![:scale 44%](images/dist_vs_elevation_2.png)

---

- X: Ice cream sales
- Y: Number of people drowning in swimming pools

Are X and Y correlated?

---
class: middle, center

# .red[Correlation does not equal causation.]

---

# Spurious correlation

https://www.tylervigen.com/spurious-correlations
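---

# Spurious correlation: a simulated sketch

The ice cream and drowning example can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not course data: a hypothetical common cause (outside temperature) drives both X and Y, and every number, coefficient, and variable name below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(586)  # fixed seed so the sketch is reproducible
n = 365                           # one simulated observation per day

# Hypothetical confounder: daily outside temperature (made-up scale)
temp = rng.normal(loc=20.0, scale=8.0, size=n)

# X and Y each depend on temperature, but NOT on each other
ice_cream = 50.0 + 10.0 * temp + rng.normal(scale=20.0, size=n)
drownings = 2.0 + 0.3 * temp + rng.normal(scale=1.0, size=n)

# Raw correlation between X and Y
r_xy = np.corrcoef(ice_cream, drownings)[0, 1]


def residuals(y, x):
    """Residuals from a simple least-squares fit of y on x (with intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (intercept + slope * x)


# Partial correlation: correlate what is left of X and Y
# after the linear effect of temperature has been removed from each
r_partial = np.corrcoef(
    residuals(ice_cream, temp), residuals(drownings, temp)
)[0, 1]

print(f"corr(X, Y)               = {r_xy:.2f}")
print(f"corr(X, Y | temperature) = {r_partial:.2f}")
```

Under these assumptions, the raw correlation should come out strong while the correlation after accounting for temperature should sit close to zero, which is one way to see why knowing what variables to consider matters.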