# Statistical Analysis
**Learning Objectives:**
- Descriptive statistics (mean, median, variance, std deviation)
- Histograms and probability distributions
- Correlation and regression
- Uncertainty, error bars, confidence intervals
---
## Statistical Tools
NumPy comes with some useful statistical tools for analyzing data. When working with data, it is important to understand the **central tendency** and **spread** of your dataset. NumPy provides several built-in functions to quickly compute common statistical metrics such as the **mean**, **median**, **standard deviation**, and **variance**. These are fundamental tools for analyzing measurement consistency, quantifying uncertainty, and identifying trends in data.
```python
import numpy as np

data = [1, 2, 3, 4, 5]
mean = np.mean(data)      # arithmetic average
median = np.median(data)  # middle value of the sorted data
std = np.std(data)        # population standard deviation (ddof=0 by default)
variance = np.var(data)   # population variance (pass ddof=1 for sample statistics)
```
As seen in the previous lecture, pandas also includes several built-in statistical tools that make it easy to analyze entire datasets directly from a DataFrame. Instead of applying individual NumPy functions to each column, you can use methods such as `.mean()`, `.std()`, `.var()`, and especially `.describe()` to generate quick summaries of your data. These tools are convenient when working with experimental or simulation data that contain multiple variables, allowing you to assess trends, variability, and potential outliers all at once.
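For example, here is a minimal sketch using a small DataFrame of hypothetical measurements (the column names and values below are invented for illustration):
```python
import pandas as pd

# Hypothetical measurement data (values invented for illustration)
df = pd.DataFrame({
    "temperature_C": [20.1, 20.4, 19.8, 20.0, 20.3],
    "pressure_kPa": [101.2, 101.5, 100.9, 101.1, 101.4],
})

print(df.mean())      # column-wise means
print(df.std())       # sample standard deviations (pandas uses ddof=1 by default)
print(df.describe())  # count, mean, std, min, quartiles, and max for each column
```
Note that pandas computes the *sample* standard deviation by default (`ddof=1`), whereas NumPy's `np.std` defaults to the population value (`ddof=0`).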
## Linear Regression
### What is Linear Regression?
Linear regression is one of the most fundamental techniques in data analysis.
It models the relationship between two (or more) variables by fitting a **straight line** that best describes the trend in the data.
Mathematically, the model assumes a linear equation:
$$
y = m x + b
$$
where
- $y$ = dependent variable
- $x$ = independent variable
- $m$ = slope (rate of change)
- $b$ = intercept (value of $y$ when $x = 0$)
Linear regression helps identify proportional relationships, estimate calibration constants, or model linear system responses.
### Problem 1: Stress–Strain Relationship
Let’s assume we’ve measured the stress (σ) and strain (ε) for a material test and want to estimate Young’s modulus (E) from the slope.
```python
import numpy as np
import matplotlib.pyplot as plt

# Example data (strain, stress)
strain = np.array([0.000, 0.0005, 0.0010, 0.0015, 0.0020, 0.0025])
stress = np.array([0.0, 52.0, 104.5, 157.2, 208.1, 261.4])  # MPa

# Fit a first-degree polynomial; np.polyfit returns coefficients
# from the highest degree down, so coeffs = [slope, intercept]
coeffs = np.polyfit(strain, stress, deg=1)
m, b = coeffs
print(f"Slope (E) = {m:.2f} MPa, Intercept = {b:.2f}")
# Predicted stress
stress_pred = m * strain + b
# Plot
plt.figure()
plt.scatter(strain, stress, label="Experimental Data", color="navy")
plt.plot(strain, stress_pred, color="red", label="Linear Fit")
plt.xlabel("Strain (mm/mm)")
plt.ylabel("Stress (MPa)")
plt.title("Linear Regression – Stress–Strain Curve")
plt.legend()
plt.grid(True)
plt.show()
```
The slope `m` corresponds to Young's modulus (E), a measure of the material's stiffness in the linear elastic region.
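The learning objectives also mention uncertainty, and `np.polyfit` can report this directly: when called with `cov=True`, it returns the covariance matrix of the fitted coefficients, and the square roots of its diagonal give one-standard-deviation uncertainties on the slope and intercept. A minimal sketch, reusing the `strain` and `stress` arrays from above:
```python
# cov=True also returns the covariance matrix of the coefficients
coeffs, cov = np.polyfit(strain, stress, deg=1, cov=True)
m_err, b_err = np.sqrt(np.diag(cov))  # 1-sigma uncertainties on slope and intercept
print(f"E = {coeffs[0]:.2f} ± {m_err:.2f} MPa")
```
The same fit can also be performed with scikit-learn, which makes it easy to evaluate the quality of the fit using metrics such as the coefficient of determination (R²) and the root mean square error (RMSE):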
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# ------------------------------------------------
# 1. Example Data: Stress vs. Strain
# (Simulated material test data)
strain = np.array([0.000, 0.0005, 0.0010, 0.0015, 0.0020, 0.0025])
stress = np.array([0.0, 52.0, 104.5, 157.2, 208.1, 261.4]) # MPa
# Reshape strain for scikit-learn (expects 2D input)
X = strain.reshape(-1, 1)
y = stress
# ------------------------------------------------
# 2. Fit Linear Regression Model
model = LinearRegression()
model.fit(X, y)
# Extract slope and intercept
m = model.coef_[0]
b = model.intercept_
print(f"Linear model: Stress = {m:.2f} * Strain + {b:.2f}")
# ------------------------------------------------
# 3. Predict Stress Values and Evaluate the Fit
y_pred = model.predict(X)
# Coefficient of determination (R²)
r2 = r2_score(y, y_pred)
# Root mean square error (RMSE)
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(f"R² = {r2:.4f}")
print(f"RMSE = {rmse:.3f} MPa")
# ------------------------------------------------
# 4. Visualize Data and Regression Line
plt.figure(figsize=(6, 4))
plt.scatter(X, y, color="navy", label="Experimental Data")
plt.plot(X, y_pred, color="red", label="Linear Fit")
plt.xlabel("Strain (mm/mm)")
plt.ylabel("Stress (MPa)")
plt.title("Linear Regression – Stress–Strain Relationship")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
```
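An R² close to 1 indicates that the line explains nearly all of the variation in the measured stress, while the RMSE expresses the typical fit error in the same units as the data (MPa here). Together they quantify how well the linear elastic model describes the measurements.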