# 4.5 Statistical Analysis II
## Modelling Relationships
As mentioned in the previous tutorial, data gives us the basis to create models. By now you have probably used Excel to create a line of best fit. In this tutorial, we will go deeper into how this works and how we can apply it to build our own models and make predictions.

## Least Square Regression and Line of Best Fit

### What is Regression?
Linear regression is one of the most fundamental techniques in data analysis. It models the relationship between two (or more) variables by fitting a **straight line** that best describes the trend in the data.

### Linear
The simplest form of regression is a linear regression line. It is based on the principle of finding a straight line through our data that minimizes the error between the data and the predicted line of best fit. This is fairly intuitive to do visually, but is there a way to do it mathematically to ensure we find the optimal line? Let's consider a straight line
$$
y = mx + b
$$
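The least-squares approach makes this precise: choose $m$ and $b$ so that the sum of the squared residuals $\sum_i \left(y_i - (mx_i + b)\right)^2$ is as small as possible. Setting the derivatives with respect to $m$ and $b$ to zero gives the closed-form solution

$$
m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \bar{y} - m\bar{x}
$$

where $n$ is the number of data points. This is what `np.polyfit` computes when you ask for a degree-1 fit. A minimal sketch, assuming some small example arrays:

```python
import numpy as np

# hypothetical example data (assumed for illustration)
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([0.1, 1.1, 2.3, 2.9, 4.2, 4.8])

# closed-form least-squares slope and intercept
n = len(x)
m = (n*np.sum(x*y) - np.sum(x)*np.sum(y)) / (n*np.sum(x**2) - np.sum(x)**2)
b = np.mean(y) - m*np.mean(x)

# np.polyfit with degree 1 gives the same result (slope first, then intercept)
m_np, b_np = np.polyfit(x, y, 1)
print(f"manual: m={m:.3f}, b={b:.3f}  |  polyfit: m={m_np:.3f}, b={b_np:.3f}")
```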


### Exponential and Power functions
You may have asked yourself, "What if my data is not linear?" If the variables in your data are related by an exponential or a power law, we can use a logarithm trick: applying a logarithm linearizes the function, and we can then use the linear least-squares method. For an exponential $y = \alpha e^{\beta x}$, taking the natural log gives $\ln y = \ln\alpha + \beta x$, a straight line in $x$ versus $\ln y$; for a power law $y = \alpha x^{\beta}$, $\log y = \log\alpha + \beta \log x$ is a straight line in $\log x$ versus $\log y$. A sketch of this idea in code is shown below.
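A minimal sketch, assuming synthetic exponential data, fitting $\ln y$ against $x$ with `np.polyfit` and transforming the coefficients back:

```python
import numpy as np

# hypothetical exponential data (assumed for illustration): y = 2*exp(0.5x) with small multiplicative noise
x = np.linspace(0.5, 4, 20)
y = 2.0*np.exp(0.5*x) * (1 + 0.05*np.random.randn(20))

# linearize: ln(y) = ln(alpha) + beta*x, then fit a straight line
beta, ln_alpha = np.polyfit(x, np.log(y), 1)
alpha = np.exp(ln_alpha)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}")
```

The same approach works for a power law by fitting `np.log(y)` against `np.log(x)` instead.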

### Polynomial
The least-squares method can also be applied to polynomial functions. For fitting polynomials of arbitrary order, NumPy provides `np.polyfit`, which returns the polynomial coefficients, and `np.polyval`, which evaluates the fitted polynomial.

```python
import numpy as np
import matplotlib.pyplot as plt

x_d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
y_d = np.array([0, 0.8, 0.9, 0.1, -0.6, -0.8, -1, -0.9, -0.4])

# dense grid so the fitted curves plot smoothly
x_fine = np.linspace(x_d.min(), x_d.max(), 200)

plt.figure(figsize=(12, 8))
for i in range(1, 7):
    # get the polynomial coefficients (highest power first)
    coeffs = np.polyfit(x_d, y_d, i)
    plt.subplot(2, 3, i)
    plt.plot(x_d, y_d, 'o')
    # evaluate the fitted polynomial
    plt.plot(x_fine, np.polyval(coeffs, x_fine))
    plt.title(f'Polynomial order {i}')

plt.tight_layout()
plt.show()
```
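A higher order is not automatically better: an order $n-1$ polynomial can pass through all $n$ data points exactly while oscillating wildly between them, so in practice you choose the lowest order that captures the underlying trend. Note also that `np.polyfit` returns the coefficients highest power first, which is the order `np.polyval` expects.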

### Using SciPy
You can also use SciPy's `scipy.optimize.curve_fit` to fit an arbitrary function form directly, without linearizing it first. For example, fitting an exponential $y = \alpha e^{\beta x}$:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

# example data (assumed for illustration): a noisy exponential
x = np.linspace(0, 4, 25)
y = 1.5*np.exp(0.7*x) + 2.0*np.random.randn(25)

# let's define the function form
def func(x, a, b):
    return a*np.exp(b*x)

# curve_fit returns the optimal parameters (and their covariance matrix)
alpha, beta = optimize.curve_fit(func, xdata=x, ydata=y)[0]
print(f'alpha={alpha}, beta={beta}')

# let's have a look at the data and the fitted curve
plt.figure(figsize=(10, 8))
plt.plot(x, y, 'b.', label='data')
plt.plot(x, alpha*np.exp(beta*x), 'r', label='fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
```
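`curve_fit` actually returns two things: the array of fitted parameters and their covariance matrix. Keeping both, `popt, pcov = optimize.curve_fit(func, x, y)`, lets you estimate the parameter uncertainties via `np.sqrt(np.diag(pcov))`.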


### How well did we do?
After fitting a regression model, we may ask ourselves how closely the model actually represents the data. To quantify this, we use **error metrics** that compare the predicted values from our model to the measured data.
#### Sum of Squares
We define several *sum of squares* quantities that measure total variation and error:

$$
\begin{aligned}
S_t &= \sum (y_i - \bar{y})^2 &\text{(total variation in data)}\\
S_r &= \sum (y_i - \hat{y}_i)^2 &\text{(residual variation — unexplained by the model)}\\
S_l &= S_t - S_r &\text{(variation explained by the regression line)}
\end{aligned}
$$
Where:
* $y_i$ = observed data
* $\hat{y}_i$ = predicted data from the model
* $\bar{y}$ = mean of observed data
#### Standard Error of the Estimate
If the scatter of data about the regression line is approximately normal, the **standard error of the estimate** represents the typical deviation of a point from the fitted line:

$$
s_{y/x} = \sqrt{\frac{S_r}{n - 2}}
$$
where $n$ is the number of data points.
Smaller $s_{y/x}$ means the regression line passes closer to the data points.

#### Coefficient of Determination – $R^2$
The coefficient of determination, $R^2$, tells us how much of the total variation in $y$ is explained by the regression:
$$
R^2 = \frac{S_l}{S_t} = 1 - \frac{S_r}{S_t}
$$
- $R^2 = 1.0$ → perfect fit (all points on the line)
- $R^2 = 0$ → model explains none of the variation

In engineering terms, a high $R^2$ indicates that your model captures most of the physical trend, for example, how deflection scales with load.

#### Correlation Coefficient – $r$
For linear regression, the correlation coefficient $r$ is the square root of $R^2$, with its sign matching the slope of the line:

$$
r = \pm \sqrt{R^2}
$$
- $r > 0$: positive correlation (both variables increase together)
- $r < 0$: negative correlation (one increases, the other decreases)
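As a quick sanity check, NumPy can compute $r$ for you directly. A minimal sketch, using the same example data as Problem 1 below:

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([0, 1.2, 2.3, 3.1, 3.9, 5.2])

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")
```

This should agree with the value obtained from $\pm\sqrt{R^2}$ in the problem below.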
## Problem 1: 
Fit a linear and a polynomial model to stress–strain data. Compute $R^2$ and discuss which model fits better.

```python
import numpy as np

# Example data
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([0, 1.2, 2.3, 3.1, 3.9, 5.2])

# Linear fit
m, b = np.polyfit(x, y, 1)
y_pred = m*x + b

# Calculate residuals and metrics
Sr = np.sum((y - y_pred)**2)
St = np.sum((y - np.mean(y))**2)
syx = np.sqrt(Sr / (len(y) - 2))
R2 = 1 - Sr/St
r = np.sign(m) * np.sqrt(R2)

print(f"s_y/x = {syx:.3f}")
print(f"R^2 = {R2:.3f}")
print(f"r = {r:.3f}")

# Quadratic fit for comparison
coeffs = np.polyfit(x, y, 2)
y_pred2 = np.polyval(coeffs, x)
R2_quad = 1 - np.sum((y - y_pred2)**2) / St
print(f"R^2 (quadratic) = {R2_quad:.3f}")
```

## Extrapolation
Once we have a regression model, it’s tempting to use it to predict values beyond the range of measured data. This process is called extrapolation.

In interpolation, the model is supported by real data on both sides of the point. In extrapolation, we are assuming that the same physical relationship continues indefinitely, and that is often not true in engineering systems.

Most regression equations are empirical: they describe the trend within the range of observed conditions but may not capture the true physics. Common issues include:
- nonlinear behavior outside the fitted range, such as stress–strain curves;
- physical limits, such as temperatures below absolute zero or efficiencies greater than 100%;
- a change in the underlying mechanism that makes the model inapplicable, such as heat transfer switching from convection to radiation at higher temperatures.

Some guidelines for using extrapolation:
- Plot the data used for fitting together with any extrapolated predictions, as in the sketch below
- Avoid predicting far beyond the range of your data unless the extrapolation is supported by physical models
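To see how quickly an empirical fit can go wrong outside its data range, here is a minimal sketch (with assumed synthetic data) that fits a high-order polynomial and then evaluates it well beyond the measurements:

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical measurements (assumed for illustration) over 0 <= x <= 5
x_d = np.linspace(0, 5, 10)
y_d = np.sqrt(x_d) + 0.05*np.random.randn(10)

# fit a 5th-order polynomial to the data
coeffs = np.polyfit(x_d, y_d, 5)

# evaluate both inside and well beyond the fitted range
x_ext = np.linspace(0, 10, 200)
y_ext = np.polyval(coeffs, x_ext)

plt.plot(x_d, y_d, 'o', label='data (fitted range)')
plt.plot(x_ext, y_ext, '-', label='polynomial fit, extrapolated')
plt.axvline(x_d.max(), color='k', linestyle='--', label='edge of data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
```

Inside the fitted range the polynomial follows the data closely, but beyond the last measurement it will generally run away rapidly, which is exactly the risk the guidelines above warn about.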
## Moving average
Real experimental data often contains small random fluctuations, i.e. noise, that obscure the underlying trend. Rather than fitting a complex equation, we can smooth the data using a moving average, which replaces each point with the average of its nearby neighbors. This simple method reduces random variation while preserving the overall shape of the signal.

A moving average or rolling mean takes the average over a sliding window of data points given by the equation:
$$\bar{y}_i = \frac{1}{N} \sum_{j=i-k}^{i+k} y_j$$
where:
- $N$ = window size (number of points averaged),
- $k = (N-1)/2$ if the window is centered,
- $y_j$​ = original data values.

A larger window gives a smoother curve but loses detail; a smaller window retains more detail but removes less noise.
### Example: Smoothing sensor noise
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Generate noisy signal
x = np.linspace(0, 4*np.pi, 100)
y = np.sin(x) + 0.2*np.random.randn(100)

# Apply moving average with different window sizes
df = pd.DataFrame({'x': x, 'y': y})
df['y_smooth_5'] = df['y'].rolling(window=5, center=True).mean()
df['y_smooth_15'] = df['y'].rolling(window=15, center=True).mean()

plt.plot(df['x'], df['y'], 'k.', alpha=0.4, label='Raw data')
plt.plot(df['x'], df['y_smooth_5'], 'r-', label='Window = 5')
plt.plot(df['x'], df['y_smooth_15'], 'b-', label='Window = 15')
plt.xlabel('Time (s)')
plt.ylabel('Signal')
plt.title('Effect of Moving Average Window Size')
plt.legend()
plt.show()
```
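Note that with `center=True` the rolling mean is undefined near the ends of the series (pandas returns `NaN` until a full window of points is available). Passing `min_periods=1` to `rolling()` fills those edge points using however many neighbors are available.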

## Problem 2: Moving average
Apply a moving average to noisy temperature data and compare the raw and smoothed signals. A possible starting point is sketched below.
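A minimal starting point, assuming you generate (or load) your own noisy temperature readings:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical noisy temperature readings (assumed for illustration)
t = np.arange(0, 120)                                              # time in minutes
temp = 20 + 5*np.sin(2*np.pi*t/60) + 0.8*np.random.randn(len(t))   # slow drift plus noise

df = pd.DataFrame({'t': t, 'temp': temp})
df['temp_smooth'] = df['temp'].rolling(window=9, center=True, min_periods=1).mean()

plt.plot(df['t'], df['temp'], 'k.', alpha=0.4, label='Raw')
plt.plot(df['t'], df['temp_smooth'], 'r-', label='Moving average (window = 9)')
plt.xlabel('Time (min)')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()
```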