| author | Christian Kolset <christian.kolset@gmail.com> | 2025-11-01 19:07:33 -0600 |
|---|---|---|
| committer | Christian Kolset <christian.kolset@gmail.com> | 2025-11-01 19:07:33 -0600 |
| commit | ef1d5ab76a8ebfea3038b15e90cc61ac14f4fbed (patch) | |
| tree | 74f44ec95990d6f7edeed56e0252a0e07414fcb9 /tutorials/module_4 | |
| parent | 81da658b3000aa02bd72771b06dfdcd726c0a075 (diff) | |
Worked through adding material to module 4
Diffstat (limited to 'tutorials/module_4')
| -rw-r--r-- | tutorials/module_4/4.0 Outline.md | 50 |
| -rw-r--r-- | tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md | 5 |
| -rw-r--r-- | tutorials/module_4/4.4 Statistical Analysis.md | 61 |
| -rw-r--r-- | tutorials/module_4/4.5 Statistical Analysis II.md | 96 |
4 files changed, 163 insertions, 49 deletions
diff --git a/tutorials/module_4/4.0 Outline.md b/tutorials/module_4/4.0 Outline.md index f847484..8156651 100644 --- a/tutorials/module_4/4.0 Outline.md +++ b/tutorials/module_4/4.0 Outline.md @@ -1,47 +1,67 @@ # Module 4: Outline 1. Introduction to Data and Scientific Datasets - a. What is scientfic data + a. What is scientific data b. Data Processing flow work c. Intro to Pandas d. Manipulating data frames - e. Problem: Create a daraframe from Numpy arrays + e. Problem 1: Create a dataframe from Numpy arrays + f. Problem 2: Selecting data from a dataframe to calculate work done. 2. Interpreting Data a. Understanding your data b. Purpose c. Composition d. Color - e. Problem 1: Composing or fixing a plot + e. Problem 1: Composing or fixing a plot. Apply PCC f. Data don't lie - g. Problem 2: Misleading plots + g. Problem 2: Misleading plots by changing axis limits or omitting context. Explain *why* it's misleading. 3. Importing, Exporting and Managing Data a. File types b. Importing spreadsheets with pandas c. Handling header, units and metadata d. Writing and editing data in pandas - e. Problem: Importing time stamped data + e. Problem: Importing time-stamped pressure and temperature data. Convert timestamps to datetime and plot temperature vs. time + f. Problem: Add metadata () [TBD] -4. Statistical Analysis +4. Statistical Analysis I a. Engineering Models b. Statistics Review c. Statistics function in python (Numpy and Pandas describe) d. Statistical Distributions e. Spectrocopy (basics) - f. Problem: Statistical tools in Spectroscopy readings + f. Problem: Statistical tools in Spectroscopy readings (intensity vs wavelength) to compute mean, variance and detect outliers. + g. Problem 2: Fit a Gaussian distribution to the same data and overlay it on the histogram. -5. Statistical Analysis +5. Statistical Analysis II: Regression and Smoothing a. Linear Square Regression and Line of Best Fit - b. Linear - c. Exponential and Power functions - d. Polynomial**m 2:** From the DataFrame, add a + b. Linear, Exponential and Power functions + d. Polynomial e. Using scipy f. How well did we do? (R and R^2) - g. Extrapolation - h. Moving average + g. Extrapolation and limitations + h. Moving averages + i. Problem 1: Fit a linear and polynomial model to stress-strain data. Compute R^2 and discuss which model fits better. + j. Problem 2: Apply a moving average to noisy temperature data and compare raw vs. smoothed signals. 6. Data Filtering and Signal Processing - + a. What is it and why it matters - noise vs. signal + b. Moving average and window functions + c. Frequency domain basics (sampling rate, Nyquist frequency) + d. Fourier transform overview (numpy.fft, scipy.fft) + e. Low-pass and high-pass filters (scipy.signal.butter, filtfilt) + f. Example: Removing high-frequency noise from a displacement signal + g. Example: Removing noise from an image to help with further analysis (PIV) + h. Problem 1: Generate a synthetic signal (sum of two sine waves + random noise). Apply a moving average and FFT to show frequency components. + i. Problem 2: Design a Butterworth low-pass filter to isolate the fundamental frequency of a vibration signal (e.g. rotating machinery). Plot before and after. + 7. Data Visualization and Presentation - a. Problem: Using pandas to plot spectroscopy data from raw data.
\ No newline at end of file + a. Review of PCC framework + b. Plotting with Pandas and Matplotlib + c. Subplots, twin axes, and annotations + d. Colormaps and figure aesthetics + e. Exporting plots for reports (DPI, figure size) + f. Creating dashboards or summary figures + g. Problem 1: Using pandas to plot spectroscopy data from raw data. Add labels, units, title, and annotations for peaks + h. Problem 2: Create a multi-panel figure showing raw data, fitted curve, and residuals. Format with consistent style, legend and color scheme for publication-ready quality.
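The outline's Problem 1 for section 6 (synthetic signal, moving average, FFT) could be sketched roughly as below. The sampling rate, component frequencies, window size, and noise level are illustrative assumptions, not values taken from the course material:

```python
import numpy as np

# Synthetic signal: sum of two sine waves plus random noise
# (5 Hz and 50 Hz components are arbitrary illustrative choices)
fs = 1000                         # sampling rate, Hz
t = np.arange(0, 1.0, 1 / fs)     # 1 second of samples
rng = np.random.default_rng(0)
signal = (np.sin(2 * np.pi * 5 * t)
          + 0.5 * np.sin(2 * np.pi * 50 * t)
          + 0.2 * rng.standard_normal(t.size))

# A simple 5-point moving average as a first smoothing pass
smoothed = np.convolve(signal, np.ones(5) / 5, mode='same')

# FFT of the raw signal to expose the frequency components
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

# The two largest peaks (ignoring the DC bin) sit at the sine frequencies
peak_order = np.argsort(spectrum[1:])[::-1] + 1
top_two = sorted(freqs[peak_order[:2]])   # ≈ 5 Hz and 50 Hz
```

Because each sine completes a whole number of cycles in the 1 s record, its energy lands in a single FFT bin, which makes the peaks easy to pick out even with noise present.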
\ No newline at end of file diff --git a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md b/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md index 5cd3879..882ce59 100644 --- a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md +++ b/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md @@ -100,7 +100,10 @@ air_quality["ratio_paris_antwerp"] = ( https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html https://pandas.pydata.org/docs/user_guide/reshaping.html -### Problem: Create a dataframe from data +### Problem: Create a dataframe from Numpy arrays + + + Given the file `force_displacement_data.txt`, use pandas to tabulate the data into a dataframe ```python import pandas as pd diff --git a/tutorials/module_4/4.4 Statistical Analysis.md b/tutorials/module_4/4.4 Statistical Analysis.md index 1112ca8..c4c32b0 100644 --- a/tutorials/module_4/4.4 Statistical Analysis.md +++ b/tutorials/module_4/4.4 Statistical Analysis.md @@ -1,4 +1,6 @@ # Statistical Analysis +## Using statistics to reduce uncertainty + **Learning Objectives:** - Descriptive statistics (mean, median, variance, std deviation) @@ -7,19 +9,33 @@ - Uncertainty, error bars, confidence intervals --- ## Engineering Models -Why care? - By analyzing data engineers can use statistical tools to create a mathematical model to help us predict something. You've probably used excel for this before, we will do it with python. -- Curve fitting - You've probably used excel for this before, we will do it with python. +#### Why Do We Care? +In engineering, data is more than just a collection of numbers: it tells a story about how systems behave. By analyzing data, we can develop mathematical models that describe and predict physical behavior. These models help us answer questions such as: +- How does stress relate to strain for a given material?
+- How does temperature affect efficiency in a heat engine? +- How does flow rate change with pressure in a pipe? + +When we fit equations to experimental data, we turn observations into predictive tools. This process allows engineers to forecast performance, optimize designs, and identify system limitations. +#### From Data to Models +A common way to build an engineering model is through curve fitting: finding a mathematical expression that best represents the trend in your data. + +You’ve likely done this before in Excel by adding a “trendline” to a plot. In this module, we’ll take that concept further using Python, which allows for more control, flexibility, and insight into the underlying math. +We’ll learn to: +- Fit linear, exponential, and polynomial relationships. +- Evaluate how well our model fits the data using metrics like R². +- Use models to predict outcomes beyond the measured range (carefully). + +By the end of this section, you’ll understand not just how to fit data, but why certain models work better for specific engineering problems. ## Statistics Review Let's take a second to remind ourselves of some statistical terms and how we define them mathematically -| | Formula | -| ------------------------ | ---------------------------------------------------------------------------------------------------------------- | -| Arithmetic Mean | $$\bar{y} = \frac{\sum y_i}{n}$$ | -| Standard Deviation | $$s_y = \sqrt{\frac{S_t}{n - 1}}, \quad S_t = \sum (y_i - \bar{y})^2$$ | -| Variance | $$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}{n- 1}$$ | -| Coefficient of Variation | $$c.v. = \frac{s_y}{\bar{y}} \times 100\%$$ | +| | Formula | Measurement | +| ------------------------ | ---------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- | +| Arithmetic Mean | $$\bar{y} = \frac{\sum y_i}{n}$$ | Average. (in Units) | +| Standard Deviation | $$s_y = \sqrt{\frac{S_t}{n - 1}}, \quad S_t = \sum (y_i - \bar{y})^2$$ | Absolute measure of spread from the average. (in Units) | +| Variance | $$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}{n- 1}$$ | How spread out the data is from the average. (in Squared Units) | +| Coefficient of Variation | $$c.v. = \frac{s_y}{\bar{y}} \times 100\%$$ | Relative measure of spread from the average or the consistency of the data. (Unitless) | ## Statistics function in python Both Numpy and Pandas come with some useful statistical tools that we can use to analyze our data. When working with data, it’s important to understand the **central tendency** and **spread** of your dataset. NumPy provides several built-in functions to quickly compute common statistical metrics such as **mean**, **median**, **standard deviation**, and **variance**. These are fundamental tools for analyzing measurement consistency, uncertainty, and identifying trends in data. @@ -39,19 +55,30 @@ Pandas also includes several built-in statistical tools that make it easy to ana --- -Great, so we +## Reducing uncertainty using statistics +Great, so we ## Statistical Distributions -Normal distributions +Every engineering measurement contains some amount of variation. Whether you’re measuring the intensity of a spectral line, the pressure in a cylinder, or the thickness of a machined part, small deviations are inevitable.
Statistical distributions help us quantify and visualize that variation, giving engineers a way to decide what is normal and what is error. +### Normal Distribution +Most experimental data follows a normal (Gaussian) distribution, where values cluster around the mean and taper off symmetrically toward the extremes. + <img src="image_1761513820040.png" width="650"> -- Design thinking -> Motorola starting Six sigma organization based on the probability of a product to fail. Adopted world wide. -- Statistical analysis of data. +In a normal distribution: +- About 68 % of data lies within $\pm 1 \sigma$ of the mean +- About 95 % lies within $\pm 2 \sigma$ +- About 99.7 % lies within $\pm 3 \sigma$ + +This helps engineers assess confidence in their results and identify outliers that may indicate bad readings or faulty sensors. + +>[!NOTE] Design Thinking - Reliability +>Motorola popularized Six Sigma design to minimize manufacturing defects. The goal was to design processes where the probability of failure is less than 3.4 per million parts, essentially operating six standard deviations from the mean. The mindset here is proactive design: if we understand how variability behaves, we can design systems that tolerate it. +> +>Takeaway: Statistical distributions aren’t just for data analysis; they guide how reliable we make our products. ## Spectroscopy ### Background Spectroscopy is the study of how matter interacts with electromagnetic radiation, including the absorption and emission of light and other forms of radiation. It examines how these interactions depend on the wavelength of the radiation, providing insight into the physical and chemical properties of materials. In simple terms, spectroscopy helps us understand what substances are made of and how they behave when exposed to energy. In engineering applications, spectroscopy is a powerful diagnostic and analysis tool.
It can be used for material identification, such as how NASA determines the composition of planetary surfaces and atmospheres. It’s also applied in combustion and thermal analysis, where emission spectroscopy measures plasma temperatures and monitors exhaust composition in rocket engines. These applications allow engineers to better understand material behavior under extreme conditions and improve system performance and efficiency. - - -## Problem: Spectroscopy - +## Problem: Reducing uncertainty in Spectroscopy readings +When using spectroscopy to measure emission intensity, each reading fluctuates slightly due to sensor noise, temperature drift or electronic fluctuations. By taking multiple readings and averaging them, random errors (positive and negative) tend to cancel out and the mean converges toward the true value. The standard deviation quantifies how precise the measurement is. Plot all readings of intensity as a function of wavelength on top of each other. Calculate the mean, standard deviation and variance. Then plot the intensity readings as a histogram. Comment on the distribution's type.
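The averaging idea in the spectroscopy problem above can be sketched with synthetic readings. The "true" intensity, noise level, and number of readings below are invented for illustration only:

```python
import numpy as np

# Hypothetical repeated intensity readings: a true value plus Gaussian sensor noise
rng = np.random.default_rng(42)
true_intensity = 100.0                                       # assumed true value (arbitrary units)
readings = true_intensity + rng.normal(0.0, 2.0, size=500)   # 500 readings, sigma = 2

mean = readings.mean()         # converges toward the true value as readings accumulate
std = readings.std(ddof=1)     # sample standard deviation (n - 1 in the denominator)
var = readings.var(ddof=1)     # sample variance
cv = std / mean * 100          # coefficient of variation, %

print(f"mean = {mean:.2f}, std = {std:.2f}, variance = {var:.2f}, c.v. = {cv:.2f}%")
```

Note the `ddof=1` arguments: they give the sample statistics (dividing by $n-1$) used in the formulas of the Statistics Review table, whereas NumPy's default divides by $n$.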
\ No newline at end of file diff --git a/tutorials/module_4/4.5 Statistical Analysis II.md b/tutorials/module_4/4.5 Statistical Analysis II.md index 458bada..da25643 100644 --- a/tutorials/module_4/4.5 Statistical Analysis II.md +++ b/tutorials/module_4/4.5 Statistical Analysis II.md @@ -1,24 +1,25 @@ # 4.5 Statistical Analysis II -As mentioned in the previous tutorial. Data is what gives us the basis to create models. By now you've probably used excel to create a line of best fit. In this tutorial, we will go deeper into how this works and how we can apply this to create our own models to make our own predictions.ile changes in local repository -======= - File changes in remote repository +## Modelling Relationships +As mentioned in the previous tutorial, data is what gives us the basis to create models. By now you've probably used Excel to create a line of best fit. In this tutorial, we will go deeper into how this works and how we can apply it to create our own models and make our own predictions. ## Least Square Regression and Line of Best Fit - -### What is Linear Regression? +### What is Regression? Linear regression is one of the most fundamental techniques in data analysis. It models the relationship between two (or more) variables by fitting a **straight line** that best describes the trend in the data. - - ### Linear -To find a linear regression line we can apply the +The simplest form of regression is a linear regression line. This is based on the principle of finding a straight line through our data that minimizes the error between the data and the predicted line of best fit. It is quite intuitive to do visually. However, is there a way we can do this mathematically to ensure we find the optimal line? Let's consider a straight line +$$ +y=mx+b +$$ + ### Exponential and Power functions -Logarithm trick +You may have asked yourself, "What if my data is not linear?"
If the variables in your data are related by an exponential or power function, we can use a logarithm trick: apply a logarithm to linearize the function and then apply the linear least-squares method. ### Polynomial - For non-linear equations function such as a polynomial Numpy has a nice feature. +https://www.geeksforgeeks.org/machine-learning/python-implementation-of-polynomial-regression/ +The least-squares method can also be applied to polynomial functions. For non-linear functions such as polynomials, Numpy has a nice feature. ```python @@ -61,14 +62,77 @@ plt.show() ``` - - - ### How well did we do? +After fitting a regression model, we may ask ourselves how closely the model actually represents the data. To quantify this, we use **error metrics** that compare the predicted values from our model to the measured data. +#### Sum of Squares +We define several *sum of squares* quantities that measure total variation and error: + +$$ +\begin{aligned} +S_t &= \sum (y_i - \bar{y})^2 &\text{(total variation in data)}\\ +S_r &= \sum (y_i - \hat{y}_i)^2 &\text{(residual variation, unexplained by the model)}\\ +S_l &= S_t - S_r &\text{(variation explained by the regression line)} +\end{aligned} +$$ +Where: +* $y_i$ = observed data +* $\hat{y}_i$ = predicted data from the model +* $\bar{y}$ = mean of observed data +#### Standard Error of the Estimate + +If the scatter of data about the regression line is approximately normal, the **standard error of the estimate** represents the typical deviation of a point from the fitted line: + +$$ +s_{y/x} = \sqrt{\frac{S_r}{n - 2}} +$$ +where $n$ is the number of data points. +Smaller $s_{y/x}$ means the regression line passes closer to the data points.
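The logarithm trick for exponential data, mentioned above, can be sketched as follows. For $y = a e^{bx}$, taking logs gives $\ln y = \ln a + bx$, which is linear in $x$; the coefficients and x-range below are illustrative assumptions:

```python
import numpy as np

# Hypothetical noiseless data following y = a * exp(b * x)
a_true, b_true = 2.0, 0.5
x = np.linspace(0, 4, 20)
y = a_true * np.exp(b_true * x)

# Logarithm trick: ln(y) = ln(a) + b*x is a straight line in x,
# so an ordinary linear least-squares fit recovers b and ln(a)
b_fit, ln_a_fit = np.polyfit(x, np.log(y), 1)
a_fit = np.exp(ln_a_fit)

print(f"a = {a_fit:.3f}, b = {b_fit:.3f}")  # a = 2.000, b = 0.500
```

A power law $y = a x^b$ linearizes the same way with $\log y$ against $\log x$ instead of $x$.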
+ +#### Coefficient of Determination, $R^2$ +The coefficient of determination, $R^2$, tells us how much of the total variation in $y$ is explained by the regression: + +$$ +R^2 = \frac{S_l}{S_t} = 1 - \frac{S_r}{S_t} +$$ +* $R^2 = 1.0$ → perfect fit (all points on the line) +* $R^2 = 0$ → model explains none of the variation + +In engineering terms, a high $R^2$ indicates that your model captures most of the physical trend, for example, how deflection scales with load. + + +#### Correlation Coefficient, $r$ +For linear regression, the **correlation coefficient** $r$ is the square root of $R^2$, with sign matching the slope of the line: + +$$ +r = \pm \sqrt{R^2} +$$ +* $r > 0$: positive correlation (both variables increase together) +* $r < 0$: negative correlation (one increases, the other decreases) +#### Example – Evaluating Fit in Python -Using the - +```python +import numpy as np + +# Example data +x = np.array([0, 1, 2, 3, 4, 5]) +y = np.array([0, 1.2, 2.3, 3.1, 3.9, 5.2]) + +# Linear fit +m, b = np.polyfit(x, y, 1) +y_pred = m*x + b + +# Calculate residuals and metrics +Sr = np.sum((y - y_pred)**2) +St = np.sum((y - np.mean(y))**2) +syx = np.sqrt(Sr / (len(y) - 2)) +R2 = 1 - Sr/St +r = np.sign(m) * np.sqrt(R2) + +print(f"s_y/x = {syx:.3f}") +print(f"R^2 = {R2:.3f}") +print(f"r = {r:.3f}") +``` ## Extrapolation basis funct -## Moving average +## Moving average
\ No newline at end of file
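The `## Moving average` section in the diff above is still a stub. A minimal sketch of a simple moving average, using an invented noisy temperature signal, could look like:

```python
import numpy as np

# Hypothetical noisy temperature signal (values are illustrative, not course data)
rng = np.random.default_rng(1)
t = np.linspace(0, 10, 200)
raw = 20 + 5 * np.sin(t) + rng.normal(0, 1.0, t.size)

# Simple moving average with a 5-point window:
# each output sample is the mean of 5 consecutive input samples
window = 5
smoothed = np.convolve(raw, np.ones(window) / window, mode='valid')

# Smoothing reduces sample-to-sample scatter while preserving the slow trend
print(raw.std(), smoothed.std())
```

With `mode='valid'` the output is shorter than the input by `window - 1` samples; larger windows smooth more aggressively but blur fast features of the signal.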