# Statistical Analysis
*Using statistics to reduce uncertainty*
**Learning Objectives:**
- Descriptive statistics (mean, median, variance, std deviation)
- Histograms and probability distributions
- Correlation and regression
- Uncertainty, error bars, confidence intervals
---
## Engineering Models
#### Why Do We Care?
In engineering, data is more than just a collection of numbers; it tells a story about how systems behave. By analyzing data, we can develop mathematical models that describe and predict physical behavior. These models help us answer questions such as:
- How does stress relate to strain for a given material?
- How does temperature affect efficiency in a heat engine?
- How does flow rate change with pressure in a pipe?
When we fit equations to experimental data, we turn observations into predictive tools. This process allows engineers to forecast performance, optimize designs, and identify system limitations.
#### From Data to Models
A common way to build an engineering model is through curve fitting: finding a mathematical expression that best represents the trend in your data.
You’ve likely done this before in Excel by adding a “trendline” to a plot. In this module, we’ll take that concept further using Python, which allows for more control, flexibility, and insight into the underlying math.
We’ll learn to:
- Fit linear, exponential, and polynomial relationships.
- Evaluate how well our model fits the data using metrics like R².
- Use models to predict outcomes beyond the measured range (carefully).
By the end of this section, you’ll understand not just how to fit data, but why certain models work better for specific engineering problems.
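As a preview of what this looks like in practice, here is a minimal sketch of a linear fit with NumPy's `np.polyfit` on made-up data (the `x` and `y` values are purely illustrative), including a hand-computed R² value:

```python
import numpy as np

# Synthetic data for illustration (roughly linear, y ≈ 2x)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a first-degree polynomial (a straight line): y ≈ m*x + b
m, b = np.polyfit(x, y, deg=1)
y_pred = m * x + b

# Coefficient of determination: R² = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope = {m:.3f}, intercept = {b:.3f}, R² = {r_squared:.4f}")
```

An R² close to 1 indicates the line explains nearly all of the variation in the data; we will look at this metric more carefully later in the module.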
## Statistics Review
Let's take a second to remind ourselves of some statistical terms and how we define them mathematically.
| | Formula | Measurement |
| ------------------------ | ---------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
| Arithmetic Mean | $$\bar{y} = \frac{\sum y_i}{n}$$ | Average value of the data. (same units as the data) |
| Standard Deviation | $$s_y = \sqrt{\frac{S_t}{n - 1}}, \quad S_t = \sum (y_i - \bar{y})^2$$ | Absolute measure of spread about the mean. (same units as the data) |
| Variance | $$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}{n- 1}$$ | How spread out the data is about the mean. (squared units) |
| Coefficient of Variation | $$c.v. = \frac{s_y}{\bar{y}} \times 100\%$$ | Relative measure of spread, or the consistency of the data. (unitless) |
## Statistics Functions in Python
Both NumPy and Pandas come with useful statistical tools for analyzing data. When working with data, it's important to understand the **central tendency** and **spread** of your dataset. NumPy provides several built-in functions to quickly compute common statistical metrics such as the **mean**, **median**, **standard deviation**, and **variance**. These are fundamental tools for analyzing measurement consistency, quantifying uncertainty, and identifying trends in data.
```python
import numpy as np

data = [1, 2, 3, 4, 5]
mean = np.mean(data)             # arithmetic mean
median = np.median(data)         # middle value
# NumPy defaults to the population formulas (ddof=0); pass ddof=1 to get the
# sample definitions from the table above (n - 1 in the denominator).
std = np.std(data, ddof=1)       # sample standard deviation
variance = np.var(data, ddof=1)  # sample variance
```
Pandas also includes several built-in statistical tools that make it easy to analyze entire datasets directly from a DataFrame. When working with pandas we can use methods such as `.mean()`, `.std()`, `.var()`, and especially `.describe()` to generate quick summaries of your data. These tools are convenient when working with experimental or simulation data that contain multiple variables, allowing you to assess trends, variability, and potential outliers all at once.
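As a quick sketch of these methods, the snippet below builds a small DataFrame of hypothetical sensor readings (the column names and values are made up for illustration) and summarizes it:

```python
import pandas as pd

# Hypothetical dataset: repeated temperature readings from two sensors
df = pd.DataFrame({
    "sensor_A": [21.1, 21.3, 20.9, 21.2, 21.0],
    "sensor_B": [22.4, 22.9, 21.8, 22.6, 22.3],
})

print(df.mean())      # column means
print(df.std())       # sample standard deviation (pandas uses ddof=1 by default)
print(df.var())       # sample variance
print(df.describe())  # count, mean, std, min, quartiles, max for every column
```

Note that, unlike NumPy, pandas uses the sample (n − 1) formulas by default, which matches the definitions in the table above.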
## Problem: Use `.describe()` to report on a data series
---
## Reducing uncertainty using statistics
Great, so we can summarize a dataset, but how does that help us reduce uncertainty? Every individual reading is contaminated by random noise. If we take many repeated readings and average them, the positive and negative errors tend to cancel, so the mean converges toward the true value, while the standard deviation tells us how far any single reading is likely to stray from it.
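This effect can be demonstrated with simulated noisy readings. In the sketch below, the "true" value and noise level are invented for illustration; the standard error of the mean shrinks as the number of readings grows:

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 100.0  # the quantity we are trying to measure (assumed)
noise_sd = 2.0      # standard deviation of the random sensor noise (assumed)

for n in [5, 50, 500]:
    readings = true_value + rng.normal(0.0, noise_sd, size=n)
    mean = np.mean(readings)
    # Standard error of the mean: uncertainty in the *average*, not one reading
    sem = np.std(readings, ddof=1) / np.sqrt(n)
    print(f"n = {n:4d}: mean = {mean:7.3f}, standard error = {sem:.3f}")
```

With more readings, the mean settles near the true value and its standard error shrinks in proportion to $1/\sqrt{n}$.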
## Statistical Distributions
Every engineering measurement contains some amount of variation. Whether you’re measuring the intensity of a spectral line, the pressure in a cylinder, or the thickness of a machined part, small deviations are inevitable. Statistical distributions help us quantify and visualize that variation, giving engineers a way to decide what is normal and what is error.
### Normal Distribution
Most experimental data follows a normal (Gaussian) distribution, where values cluster around the mean and taper off symmetrically toward the extremes.
<img src="image_1761513820040.png" width="650">
In a normal distribution:
- About 68% of the data lies within $\pm 1\sigma$ of the mean
- About 95% lies within $\pm 2\sigma$
- About 99.7% lies within $\pm 3\sigma$
These ranges help engineers assess confidence in their results and identify outliers that may indicate bad readings or faulty sensors.
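You can verify these percentages yourself. The sketch below draws a large sample of simulated measurements (the mean and spread are arbitrary choices) and counts how many fall within each band:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated measurements: normally distributed, arbitrary mean and spread
data = rng.normal(loc=50.0, scale=5.0, size=100_000)

mean = np.mean(data)
sd = np.std(data)
for k in (1, 2, 3):
    # Fraction of samples within k standard deviations of the mean
    frac = np.mean(np.abs(data - mean) <= k * sd)
    print(f"within ±{k}σ: {frac:.1%}")
```

The printed fractions should land close to the 68 / 95 / 99.7 rule above, with small deviations due to the finite sample size.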
>[!NOTE] Design Thinking - Reliability
>Motorola popularized Six Sigma design to minimize manufacturing defects. The goal was to design processes where the probability of failure is less than 3.4 per million parts, essentially operating six standard deviations from the mean. The mindset here is proactive design: if we understand how variability behaves, we can design systems that tolerate it.
>
>Takeaway: Statistical distributions aren't just for data analysis; they guide how reliable we make our products.
## Practical Application: Spectroscopy
### Background
Spectroscopy is the study of how matter interacts with electromagnetic radiation, including the absorption and emission of light and other forms of radiation. It examines how these interactions depend on the wavelength of the radiation, providing insight into the physical and chemical properties of materials. This is how NASA determines the composition of planetary surfaces and atmospheres. It's also applied in combustion and thermal analysis, where spectroscopy measures plasma temperature and monitors exhaust composition in rocket engines.
In simple terms, spectroscopy helps us understand what substances are made of and how they behave when exposed to high levels of energy. These insights allow us to better understand material behavior under extreme conditions and to improve system performance and efficiency.
### Spectrometer
The instrument used to measure the spectrum of light is called a spectrometer. It works by taking in light, scattering it, and projecting the resulting spectrum onto a detector, allowing us to capture the intensity of the light at different wavelengths. See the supplementary video of the inside of a spectrometer.

Once the data is collected, we can compare it with the known spectra of elements to identify the sample's composition. The figure below shows the spectra of different elements.
<img src="image_1762366586870.png" width="500">
## Problem: Eliminating uncertainty in Spectroscopy readings
When using spectroscopy to measure emission intensity, each reading fluctuates slightly due to sensor noise, temperature drift, or electronic fluctuations. By taking multiple readings and averaging them, random errors (positive and negative) tend to cancel out, and the mean converges to the true value. The standard deviation quantifies how precise the measurement is.
- Plot all readings of intensity as a function of wavelength on top of each other.
- Calculate the mean, standard deviation, and variance.
- Plot the intensity readings as a histogram.
- Comment on the type of distribution you observe.
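Since no data file is specified here, the sketch below simulates 20 noisy readings of a single emission line (the wavelength grid, line location, and noise level are all invented) to show one way the steps above could be carried out:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
wavelength = np.linspace(400, 700, 300)  # nm (synthetic grid)

# Simulate 20 noisy intensity readings of a Gaussian emission line at 550 nm
true_intensity = np.exp(-((wavelength - 550.0) ** 2) / (2 * 10.0 ** 2))
readings = true_intensity + rng.normal(0.0, 0.05, size=(20, wavelength.size))

# Step 1: overlay all readings, then plot their mean on top
for r in readings:
    plt.plot(wavelength, r, color="gray", alpha=0.3)
plt.plot(wavelength, readings.mean(axis=0), color="red", label="mean")
plt.xlabel("Wavelength (nm)")
plt.ylabel("Intensity (a.u.)")
plt.legend()
plt.show()

# Step 2: statistics of the readings at the peak wavelength
peak = np.argmax(true_intensity)
print("mean:", readings[:, peak].mean())
print("std: ", readings[:, peak].std(ddof=1))
print("var: ", readings[:, peak].var(ddof=1))

# Step 3: histogram of the readings at the peak wavelength
plt.hist(readings[:, peak], bins=10, edgecolor="black")
plt.xlabel("Intensity at peak (a.u.)")
plt.ylabel("Count")
plt.show()
```

With real spectrometer data you would load the repeated scans from files instead of simulating them, but the plotting and statistics steps are the same.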