# 4.5 Statistical Analysis II
## Modelling Relationships
As mentioned in the previous tutorial, data gives us the basis to create models. By now you have probably used Excel to create a line of best fit. In this tutorial, we will look deeper into how this works and how we can apply it to build our own models and make our own predictions.

## Least Squares Regression and the Line of Best Fit

### What is Regression?
Linear regression is one of the most fundamental techniques in data analysis. It models the relationship between two (or more) variables by fitting a **straight line** that best describes the trend in the data.

### Linear
The simplest form of regression is linear regression. It is based on the principle of finding a straight line through our data that minimizes the error between the data and the predicted line of best fit. Drawing such a line by eye is quite intuitive, but is there a way to do this mathematically so that we are guaranteed the optimal line? Let's consider a straight line
$$
y = mx + b
$$
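
The least-squares criterion chooses the slope $m$ and intercept $b$ that minimize the sum of the squared residuals between the data points and the line. Setting the partial derivatives of this sum with respect to $m$ and $b$ to zero gives the well-known closed-form solution (sketched here using the same notation as later in this tutorial):

$$
\begin{aligned}
S_r(m, b) &= \sum_{i=1}^{n} \big(y_i - (m x_i + b)\big)^2 \\
m &= \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \\
b &= \bar{y} - m\bar{x}
\end{aligned}
$$

In NumPy, `np.polyfit(x, y, 1)` solves exactly this least-squares problem and returns the slope and intercept.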


### Exponential and Power functions
You may have asked yourself, "What if my data is not linear?" If the variables in your data are related to each other by an exponential or power function, we can use a logarithm trick: applying a logarithm linearizes the relationship, and we can then apply the linear least-squares method, as sketched below.
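
As a minimal sketch (the data values below are made up purely for illustration): an exponential model $y = a e^{bx}$ becomes a straight line after taking the natural log, $\ln y = \ln a + b x$, so a linear fit of $(x, \ln y)$ recovers the parameters.

```python
import numpy as np

# made-up data that roughly follows y = 2 * exp(0.8 * x)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([2.1, 3.0, 4.3, 6.8, 10.1, 14.6, 22.5])

# linearize: ln(y) = ln(a) + b*x, then fit a straight line to (x, ln(y))
b, ln_a = np.polyfit(x, np.log(y), 1)
a = np.exp(ln_a)
print(f"exponential fit: y = {a:.3f} * exp({b:.3f} * x)")

# a power law y = a * x**b linearizes the same way, but with log(x) too:
# log(y) = log(a) + b*log(x)   (requires x > 0)
```

Note that fitting in log space minimizes the error in $\ln y$ rather than in $y$ itself, which weights small values more heavily than a direct nonlinear fit such as the SciPy approach shown later.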

### Polynomial
The least-squares method can also be applied to polynomial functions. For non-linear functions such as polynomials, NumPy provides a convenient pair of functions, `np.polyfit` and `np.polyval` (see also [this GeeksforGeeks tutorial on polynomial regression](https://www.geeksforgeeks.org/machine-learning/python-implementation-of-polynomial-regression/)).


```python
import numpy as np
import matplotlib.pyplot as plt

# sample data to fit
x_d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
y_d = np.array([0, 0.8, 0.9, 0.1, -0.6, -0.8, -1, -0.9, -0.4])

plt.figure(figsize = (12, 8))
for i in range(1, 7):
    
    # get the polynomial coefficients
    y_est = np.polyfit(x_d, y_d, i)
    plt.subplot(2,3,i)
    plt.plot(x_d, y_d, 'o')
    # evaluate the values for a polynomial
    plt.plot(x_d, np.polyval(y_est, x_d))
    plt.title(f'Polynomial order {i}')

plt.tight_layout()
plt.show()
```

### Using SciPy
When the model cannot be linearized easily, `scipy.optimize.curve_fit` fits an arbitrary function directly by nonlinear least squares. For example, fitting an exponential model to some sample data:
```python
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt

# example data (illustrative values only)
np.random.seed(0)
x = np.linspace(0, 2, 20)
y = 2.5 * np.exp(1.3 * x) + 0.5 * np.random.normal(size=x.size)

# let's define the function form we want to fit
def func(x, a, b):
    return a * np.exp(b * x)

# curve_fit returns (optimal parameters, covariance matrix); take the parameters
alpha, beta = optimize.curve_fit(func, xdata=x, ydata=y)[0]
print(f'alpha={alpha}, beta={beta}')

# let's have a look at the data and the fitted curve
plt.figure(figsize=(10, 8))
plt.plot(x, y, 'b.', label='data')
plt.plot(x, alpha * np.exp(beta * x), 'r', label='fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
```


### How well did we do?
After fitting a regression model, we may ask ourselves how closely the model actually represents the data. To quantify this, we use **error metrics** that compare the predicted values from our model to the measured data.
#### Sum of Squares
We define several *sum of squares* quantities that measure total variation and error:

$$
\begin{aligned}
S_t &= \sum (y_i - \bar{y})^2 &\text{(total variation in data)}\\
S_r &= \sum (y_i - \hat{y}_i)^2 &\text{(residual variation — unexplained by the model)}\\
S_l &= S_t - S_r &\text{(variation explained by the regression line)}
\end{aligned}
$$
Where:
* $y_i$ = observed data
* $\hat{y}_i$ = predicted data from the model
* $\bar{y}$ = mean of observed data
#### Standard Error of the Estimate

If the scatter of data about the regression line is approximately normal, the **standard error of the estimate** represents the typical deviation of a point from the fitted line:

$$
s_{y/x} = \sqrt{\frac{S_r}{n - 2}}
$$
where $n$ is the number of data points (the $n - 2$ in the denominator accounts for the two fitted parameters, slope and intercept).
A smaller $s_{y/x}$ means the regression line passes closer to the data points.

#### Coefficient of Determination – $R^2$
The coefficient of determination, $R^2$, tells us how much of the total variation in $y$ is explained by the regression:

$$
R^2 = \frac{S_l}{S_t} = 1 - \frac{S_r}{S_t}
$$
* $R^2 = 1.0$ → perfect fit (all points lie on the line)
* $R^2 = 0$ → the model explains none of the variation

In engineering terms, a high $R^2$ indicates that your model captures most of the physical trend, for example how deflection scales with load.


#### Correlation Coefficient – $r$
For linear regression, the **correlation coefficient** $r$ is the square root of $R^2$, with its sign matching the slope of the line:

$$
r = \pm \sqrt{R^2}
$$
* $r > 0$: positive correlation (both variables increase together)
* $r < 0$: negative correlation (one increases while the other decreases)
#### Example – Evaluating Fit in Python

```python
import numpy as np

# Example data
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([0, 1.2, 2.3, 3.1, 3.9, 5.2])

# Linear fit
m, b = np.polyfit(x, y, 1)
y_pred = m*x + b

# Calculate residuals and metrics
Sr = np.sum((y - y_pred)**2)
St = np.sum((y - np.mean(y))**2)
syx = np.sqrt(Sr / (len(y) - 2))
R2 = 1 - Sr/St
r = np.sign(m) * np.sqrt(R2)

print(f"s_y/x = {syx:.3f}")
print(f"R^2 = {R2:.3f}")
print(f"r = {r:.3f}")
```

## Extrapolation
basis funct
## Moving average