# 4.5 Statistical Analysis II
## Modelling Relationships
As mentioned in the previous tutorial, data is what gives us the basis to create models. By now you have probably used Excel to create a line of best fit. In this tutorial, we will go deeper into how this works and how we can apply it to build our own models and make our own predictions.
## Least Squares Regression and the Line of Best Fit
### What is Regression?
Linear regression is one of the most fundamental techniques in data analysis. It models the relationship between two (or more) variables by fitting a **straight line** that best describes the trend in the data.
### Linear
The simplest form of regression is linear regression. It is based on the principle of finding a straight line through our data that minimizes the error between the data points and the predicted line of best fit. This is quite intuitive to do visually, but is there a way we can do it mathematically to make sure we find the optimal line? Let's consider a straight line
$$
y = mx + b
$$
The least-squares approach chooses the slope $m$ and intercept $b$ that minimize the sum of the squared residuals between the data and the line,
$$
S_r = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2 .
$$
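As a minimal sketch (with made-up data), the closed-form least-squares solution for $m$ and $b$ can be written out directly and checked against NumPy's `np.polyfit`:

```python
import numpy as np

# Made-up example data (any paired x/y measurements would do)
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2, 4.8])

n = len(x)

# Closed-form least-squares solution:
#   m = (n*sum(x*y) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
#   b = mean(y) - m*mean(x)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b = np.mean(y) - m * np.mean(x)

# NumPy's polyfit with degree 1 solves the same least-squares problem
m_np, b_np = np.polyfit(x, y, 1)

print(f"by hand:    m = {m:.4f}, b = {b:.4f}")
print(f"np.polyfit: m = {m_np:.4f}, b = {b_np:.4f}")
```

Both approaches give the same line, since `np.polyfit` with degree 1 solves exactly this minimization problem.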
### Exponential and Power functions
You may have asked yourself, "What if my data is not linear?" If the variables in your data are related by an exponential or power function, we can use a logarithm trick: apply a logarithmic transformation to linearize the relationship, and then apply the linear least-squares method. For an exponential relationship $y = a e^{bx}$, taking logs gives $\ln y = \ln a + bx$, which is linear in $x$; for a power law $y = a x^b$, $\log y = \log a + b \log x$ is linear in $\log x$.
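As a rough sketch, assuming (hypothetically) that the data follows an exponential model $y = a e^{bx}$, fitting a straight line to $(x, \ln y)$ recovers $b$ as the slope and $\ln a$ as the intercept:

```python
import numpy as np

# Synthetic data assumed to follow y = a * exp(b*x), with a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 20)
y = 2.0 * np.exp(0.8 * x) * (1 + 0.05 * rng.standard_normal(x.size))

# Logarithm trick: ln(y) = ln(a) + b*x, which is linear in x
b, ln_a = np.polyfit(x, np.log(y), 1)
a = np.exp(ln_a)

print(f"estimated a = {a:.3f}, b = {b:.3f}")  # should be close to a = 2.0, b = 0.8
```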
### Polynomial
The least-squares method can also be applied to polynomial functions. For non-linear functions such as polynomials, NumPy has a convenient feature: `np.polyfit` takes the degree of the polynomial as an argument (see, for example, https://www.geeksforgeeks.org/machine-learning/python-implementation-of-polynomial-regression/).
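As a minimal sketch with made-up data that roughly follows a quadratic trend, `np.polyfit` with a higher degree and `np.polyval` handle the polynomial case:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data that roughly follows a quadratic trend
x = np.array([0, 1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1.1, 2.8, 7.2, 13.9, 23.1, 34.8, 49.2])

# Fit a degree-2 polynomial by least squares
coeffs = np.polyfit(x, y, 2)        # [c2, c1, c0] for c2*x^2 + c1*x + c0
x_fit = np.linspace(x.min(), x.max(), 100)
y_fit = np.polyval(coeffs, x_fit)   # evaluate the fitted polynomial

plt.scatter(x, y, label="data")
plt.plot(x_fit, y_fit, color="red", label="degree-2 fit")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```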
### How well did we do?
After fitting a regression model, we may ask ourselves how closely the model actually represents the data. To quantify this, we use **error metrics** that compare the predicted values from our model to the measured data.
#### Sum of Squares
We define several *sum of squares* quantities that measure total variation and error:

$$
\begin{aligned}
S_t &= \sum (y_i - \bar{y})^2 &\text{(total variation in the data)}\\
S_r &= \sum (y_i - \hat{y}_i)^2 &\text{(residual variation, unexplained by the model)}\\
S_l &= S_t - S_r &\text{(variation explained by the regression line)}
\end{aligned}
$$
where:
* $y_i$ = observed data
* $\hat{y}_i$ = predicted value from the model
* $\bar{y}$ = mean of the observed data
#### Standard Error of the Estimate

If the scatter of the data about the regression line is approximately normal, the **standard error of the estimate** represents the typical deviation of a point from the fitted line:

$$
s_{y/x} = \sqrt{\frac{S_r}{n - 2}}
$$
where $n$ is the number of data points.
A smaller $s_{y/x}$ means the regression line passes closer to the data points.

#### Coefficient of Determination – $R^2$
The coefficient of determination, $R^2$, tells us how much of the total variation in $y$ is explained by the regression:

$$
R^2 = \frac{S_l}{S_t} = 1 - \frac{S_r}{S_t}
$$
* $R^2 = 1$ → perfect fit (all points lie on the line)
* $R^2 = 0$ → the model explains none of the variation

In engineering terms, a high $R^2$ indicates that your model captures most of the physical trend, for example how deflection scales with load.

#### Correlation Coefficient – $r$
For linear regression, the **correlation coefficient** $r$ is the square root of $R^2$, with its sign matching the slope of the line:

$$
r = \pm \sqrt{R^2}
$$
* $r > 0$: positive correlation (both variables increase together)
* $r < 0$: negative correlation (one increases while the other decreases)
#### Example – Evaluating Fit in Python
```python
import numpy as np

# Example data
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([0, 1.2, 2.3, 3.1, 3.9, 5.2])

# Linear fit
m, b = np.polyfit(x, y, 1)
y_pred = m*x + b

# Calculate residuals and error metrics
Sr = np.sum((y - y_pred)**2)        # residual sum of squares
St = np.sum((y - np.mean(y))**2)    # total sum of squares
syx = np.sqrt(Sr / (len(y) - 2))    # standard error of the estimate
R2 = 1 - Sr/St                      # coefficient of determination
r = np.sign(m) * np.sqrt(R2)        # correlation coefficient

print(f"s_y/x = {syx:.3f}")
print(f"R^2 = {R2:.3f}")
print(f"r = {r:.3f}")
```
## Extrapolation
basis funct
## Moving average