# 4.5 Statistical Analysis II
## Modelling Relationships
As mentioned in the previous tutorial, data gives us the basis to create models. By now you've probably used Excel to create a line of best fit. In this tutorial, we will go deeper into how this works and how we can apply it to build our own models and make our own predictions.

## Least Squares Regression and Line of Best Fit

### What is Regression?
Linear regression is one of the most fundamental techniques in data analysis. It models the relationship between two (or more) variables by fitting a **straight line** that best describes the trend in the data.

### Linear
The simplest form of regression is a linear regression line. It is based on the principle of finding a straight line through our data that minimizes the error between the data and the predicted line of best fit. This is quite intuitive to do visually, but is there a way to do it mathematically so that we are guaranteed the optimal line? Let's consider a straight line

$$
y = mx + b
$$

Least-squares regression chooses the slope $m$ and intercept $b$ that minimize the sum of the squared residuals, $\sum (y_i - (mx_i + b))^2$. Setting the derivatives with respect to $m$ and $b$ to zero gives the standard solution

$$
m = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
b = \bar{y} - m\bar{x}
$$

where $n$ is the number of data points and $\bar{x}$, $\bar{y}$ are the means of the data.

### Exponential and Power functions
You may have asked yourself, "What if my data is not linear?" If the variables in your data are related by an exponential or power function, we can use a logarithm trick: taking logarithms linearizes the function, and we can then apply the linear least-squares method. For an exponential model $y = \alpha e^{\beta x}$, taking the natural log gives $\ln y = \ln \alpha + \beta x$, which is linear in $x$. For a power model $y = \alpha x^{\beta}$, $\log y = \log \alpha + \beta \log x$ is linear in $\log x$.
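As a quick illustration of this logarithm trick, here is a minimal sketch (using made-up synthetic data, an assumption for demonstration) that fits an exponential model $y = \alpha e^{\beta x}$ by taking the natural log of $y$ and fitting a straight line with `np.polyfit`:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data roughly following y = 2 * exp(0.5 x) with a little noise (assumed example)
x = np.linspace(0, 4, 20)
y = 2.0 * np.exp(0.5 * x) * (1 + 0.05 * np.random.randn(20))

# Linearize: ln(y) = ln(alpha) + beta * x, then fit a first-order polynomial
beta, ln_alpha = np.polyfit(x, np.log(y), 1)
alpha = np.exp(ln_alpha)
print(f'alpha={alpha:.3f}, beta={beta:.3f}')

# Compare the data with the fitted exponential
plt.plot(x, y, 'b.', label='data')
plt.plot(x, alpha * np.exp(beta * x), 'r', label='fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
```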
### Polynomial
The least-squares method can also be applied to polynomial functions. For a polynomial model, NumPy has a convenient pair of functions: `np.polyfit` finds the coefficients and `np.polyval` evaluates the fitted polynomial.

```python
import numpy as np
import matplotlib.pyplot as plt

x_d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
y_d = np.array([0, 0.8, 0.9, 0.1, -0.6, -0.8, -1, -0.9, -0.4])

plt.figure(figsize=(12, 8))
for i in range(1, 7):
    # get the polynomial coefficients for a fit of order i
    y_est = np.polyfit(x_d, y_d, i)
    plt.subplot(2, 3, i)
    plt.plot(x_d, y_d, 'o')
    # evaluate the fitted polynomial at the data points
    plt.plot(x_d, np.polyval(y_est, x_d))
    plt.title(f'Polynomial order {i}')

plt.tight_layout()
plt.show()
```

### Using SciPy
You can also use SciPy's `optimize.curve_fit` to fit a model of any functional form directly, without linearizing first.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize

# Example data following an approximately exponential trend (made up for illustration)
x = np.linspace(0, 4, 20)
y = 2.0 * np.exp(0.5 * x) + 0.5 * np.random.randn(20)

# let's define the function form
def func(x, a, b):
    return a * np.exp(b * x)

alpha, beta = optimize.curve_fit(func, xdata=x, ydata=y)[0]
print(f'alpha={alpha}, beta={beta}')

# Let's have a look at the data and the fitted curve
plt.figure(figsize=(10, 8))
plt.plot(x, y, 'b.')
plt.plot(x, alpha * np.exp(beta * x), 'r')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```

### How well did we do?
After fitting a regression model, we should ask how closely the model actually represents the data. To quantify this, we use **error metrics** that compare the values predicted by the model to the measured data.

#### Sum of Squares
We define several *sum of squares* quantities that measure total variation and error:

$$
\begin{aligned}
S_t &= \sum (y_i - \bar{y})^2 &\text{(total variation in the data)}\\
S_r &= \sum (y_i - \hat{y}_i)^2 &\text{(residual variation, unexplained by the model)}\\
S_l &= S_t - S_r &\text{(variation explained by the regression line)}
\end{aligned}
$$

where:
* $y_i$ = observed data
* $\hat{y}_i$ = predicted data from the model
* $\bar{y}$ = mean of the observed data

#### Standard Error of the Estimate
If the scatter of the data about the regression line is approximately normal, the **standard error of the estimate** represents the typical deviation of a point from the fitted line:

$$
s_{y/x} = \sqrt{\frac{S_r}{n - 2}}
$$

where $n$ is the number of data points. A smaller $s_{y/x}$ means the regression line passes closer to the data points.

#### Coefficient of Determination – $R^2$
The coefficient of determination, $R^2$, tells us how much of the total variation in $y$ is explained by the regression:

$$
R^2 = \frac{S_l}{S_t} = 1 - \frac{S_r}{S_t}
$$

- $R^2 = 1.0$ → perfect fit (all points lie on the line)
- $R^2 = 0$ → the model explains none of the variation

In engineering terms, a high $R^2$ indicates that your model captures most of the physical trend, for example how deflection scales with load.

#### Correlation Coefficient – $r$
For linear regression, the correlation coefficient $r$ is the square root of $R^2$, with its sign matching the slope of the line:

$$
r = \pm \sqrt{R^2}
$$

- $r > 0$: positive correlation (both variables increase together)
- $r < 0$: negative correlation (one increases while the other decreases)

## Problem 1:
Fit a linear and a polynomial model to stress–strain data. Compute $R^2$ for each and discuss which model fits better. The starter code below computes the metrics for the linear fit.

```python
import numpy as np

# Example data
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([0, 1.2, 2.3, 3.1, 3.9, 5.2])

# Linear fit
m, b = np.polyfit(x, y, 1)
y_pred = m*x + b

# Calculate residuals and metrics
Sr = np.sum((y - y_pred)**2)
St = np.sum((y - np.mean(y))**2)
syx = np.sqrt(Sr / (len(y) - 2))
R2 = 1 - Sr/St
r = np.sign(m) * np.sqrt(R2)

print(f"s_y/x = {syx:.3f}")
print(f"R^2 = {R2:.3f}")
print(f"r = {r:.3f}")
```

## Extrapolation
Once we have a regression model, it's tempting to use it to predict values beyond the range of the measured data. This is called extrapolation.

In interpolation, the model is supported by real data on both sides of the point. In extrapolation, we assume that the same physical relationship continues indefinitely, and that is often not true in engineering systems.

Most regression equations are empirical: they describe the trend over the range of observed conditions but may not capture the true physics. Common issues include nonlinear behaviour outside the fitted range (such as stress–strain curves), predictions that violate physical limits (temperatures below absolute zero, efficiencies greater than 100%), and cases where the governing mechanism changes, making the model inapplicable (for example, heat transfer switching from convection to radiation at higher temperatures).

Some guidelines for using extrapolation (see the sketch after this list):
- Plot the data used for fitting alongside any extrapolated predictions
- Avoid predicting far beyond the range of your data unless supported by physical models
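To see why extrapolation is risky, here is a small illustrative sketch (the data is made up for demonstration). A linear fit and a higher-order polynomial fit can agree closely inside the measured range yet diverge dramatically once evaluated beyond it:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up measurements over a limited range (illustrative only)
x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([0, 1.2, 2.3, 3.1, 3.9, 5.2])

# Fit a linear model and a higher-order polynomial model
p1 = np.polyfit(x, y, 1)
p4 = np.polyfit(x, y, 4)

# Evaluate both fits well beyond the measured range
x_ext = np.linspace(0, 10, 200)

plt.plot(x, y, 'ko', label='measured data')
plt.plot(x_ext, np.polyval(p1, x_ext), 'b-', label='linear fit')
plt.plot(x_ext, np.polyval(p4, x_ext), 'r--', label='order-4 polynomial fit')
plt.axvline(x.max(), color='grey', linestyle=':', label='edge of data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
```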
## Moving average
Real experimental data often contains small random fluctuations, i.e. noise, that obscure the underlying trend. Rather than fitting a complex equation, we can smooth the data using a moving average, which replaces each point with the average of its nearby neighbours. This simple method reduces random variation while preserving the overall shape of the signal.

A moving average (or rolling mean) takes the average over a sliding window of data points:

$$\bar{y}_i = \frac{1}{N} \sum_{j=i-k}^{i+k} y_j$$

where:
- $N$ = window size (number of points averaged),
- $k = (N-1)/2$ if the window is centred,
- $y_j$ = original data values.

A larger window gives a smoother curve but loses detail; a smaller window retains more detail but removes less noise.

### Example: Smoothing sensor noise
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Generate a noisy signal
x = np.linspace(0, 4*np.pi, 100)
y = np.sin(x) + 0.2*np.random.randn(100)

# Apply moving averages with different window sizes
df = pd.DataFrame({'x': x, 'y': y})
df['y_smooth_5'] = df['y'].rolling(window=5, center=True).mean()
df['y_smooth_15'] = df['y'].rolling(window=15, center=True).mean()

plt.plot(df['x'], df['y'], 'k.', alpha=0.4, label='Raw data')
plt.plot(df['x'], df['y_smooth_5'], 'r-', label='Window = 5')
plt.plot(df['x'], df['y_smooth_15'], 'b-', label='Window = 15')
plt.xlabel('Time (s)')
plt.ylabel('Signal')
plt.title('Effect of Moving Average Window Size')
plt.legend()
plt.show()
```

## Problem 2: Moving average
Apply a moving average to noisy temperature data and compare the raw and smoothed signals. A possible starting point is sketched below.
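Since the tutorial does not provide a temperature dataset, this sketch generates synthetic noisy temperature data (an assumption, for illustration only); the smoothing and comparison are left as the exercise:

```python
import numpy as np
import pandas as pd

# Synthetic "temperature" data: a slow daily trend plus sensor noise (assumed example)
t = np.linspace(0, 24, 288)                                   # 24 hours sampled every 5 minutes
temp = 20 + 5*np.sin(2*np.pi*t/24) + 0.8*np.random.randn(t.size)

df = pd.DataFrame({'time_h': t, 'temp_C': temp})

# TODO: apply df['temp_C'].rolling(window=..., center=True).mean() with a few window
# sizes and plot the raw and smoothed signals together for comparison.
print(df.head())
```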