| author | Christian Kolset <christian.kolset@gmail.com> | 2025-10-22 14:21:30 -0600 |
|---|---|---|
| committer | Christian Kolset <christian.kolset@gmail.com> | 2025-10-22 14:21:30 -0600 |
| commit | 6a9d212c80848e7601fe114ef569f7355e4c1f22 (patch) | |
| tree | 3c49793d362fe63cbf9b3b7a2e2a599cca4f5dbd /tutorials/module_4 | |
| parent | 1630ff5771ba7aa4623d25fd9d97af3a6facecbb (diff) | |
Made progress on statistics tutorials.
Diffstat (limited to 'tutorials/module_4')
| -rw-r--r-- | tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md | 109 |
| -rw-r--r-- | tutorials/module_4/4.3 Importing and Managing Data.md | 2 |
| -rw-r--r-- | tutorials/module_4/4.4 Statistical Analysis.md | 14 |
| -rw-r--r-- | tutorials/module_4/4.5 Statistical Analysis II.md | 17 |
| -rw-r--r-- | tutorials/module_4/image_1761156588079.png | bin, 0 -> 11452 bytes |
5 files changed, 97 insertions, 45 deletions
diff --git a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md b/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
index d52c33c..3ad34e4 100644
--- a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
+++ b/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
@@ -3,7 +3,6 @@
 **Learning objectives:**
 - Understand what makes data “scientific” (units, precision, metadata)
-- Recognize types of data: time-series, experimental, simulation, and imaging data
 - Identify challenges in data processing (missing data, noise, outliers)
 - Overview of the data-analysis workflow
 ---
@@ -14,6 +13,11 @@ We may collect this in the following ways:
 - **Experiments** – temperature readings from thermocouples, strain or force from sensors, vibration accelerations, or flow velocities.
 - **Simulations** – outputs from finite-element or CFD models such as pressure, stress, or temperature distributions.
 - **Instrumentation and sensors** – digital or analog signals from transducers, encoders, or DAQ systems.
+## Data Processing flow works
+```mermaid
+flowchart
+    A[Collecting] --> B[Cleaning & Filtering] --> C[Analysis] --> D[Visualization]
+```
 ## Introduction to pandas
 `pandas` (**Pan**el **Da**ta) is a Python library designed for data analysis and manipulation, widely used in engineering, science, and data analytics. It provides two core data structures: the **Series** and the **DataFrame**.
@@ -21,45 +25,26 @@ A `Series` represents a single column or one-dimensional labeled array, while
 DataFrames can be created from dictionaries, lists, NumPy arrays, or imported from external files such as CSV or Excel. Once data is loaded, you can **view and explore** it using methods like `head()`, `tail()`, and `describe()`. Data can be **selected by label** or **by position**. These indexing systems make it easy to slice, filter, and reorganize datasets efficiently.
+### Problem 1: Create a dataframe from an array
+Given the data `force_N` and `time_s`
-### Problem 1: Create a dataframe from a text file
-Given the the file `force_displacement_data.txt`. Use pandas to tabulate the data into a dataframe
 ```python
 import pandas as pd
-file_path = "force_displacement_data.txt"
+force_N = [10, 20, 30, 25, 15]
+time_s = [0, 1, 2, 3, 4]
-df_txt = pd.read_csv(
-    file_path,
-    delim_whitespace=True,
-    comment="#",
-    skiprows=0,
-    header=0
-)
+df = pd.DataFrame({
+    'Time (s)': time_s,
+    'Force (N)': force_N
+})
 print("\n=== Basic Statistics ===")
-print(df_txt.describe())232
-
-if "Force_N" in df_txt.columns:
-    print("\nFirst five Force readings:")
-    print(df_txt["Force_N"].head())
-
-try:
-    import matplotlib.pyplot as plt
-
-    plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1])
-    plt.xlabel(df_txt.columns[0])
-    plt.ylabel(df_txt.columns[1])
-    plt.title("Loaded Data from Text File")
-    plt.grid(True)
-    plt.show()
-
-except ImportError:
-    print("\nmatplotlib not installed — skipping plot.")
-
+print(df.describe())
 ```
-
-## Subsetting and Conditional filtering
+Notice how `the describe()` function outputs some statistical data the we may find useful.
+### Manipulating dataframes
+#### Subsets and Conditional filtering
 You can select rows, columns, or specific conditions from a DataFrame.
 ```python
@@ -73,10 +58,9 @@ subset = df[["Time_s", "Force_N"]]
 df_high_force = df[df["Force_N"] > 50]
 ```
-
 ![[Pasted image 20251013064718.png]]
-## Combining and Merging Datasets
+#### Combining and Merging Datasets
 Often, multiple sensors or experiments must be merged into one dataset for analysis.
 ```python
@@ -88,21 +72,62 @@ combined = pd.concat([df_run1, df_run2], axis=0)
 ```
-## Problem 1: Describe a dataset
-Use pandas built-in describe data to report on the statistical mean of the given experimental data.
+https://pandas.pydata.org/docs/user_guide/merging.html
+
+#### Creating new columns based on existing ones
+<img src="image_1761156588079.png" width="600">
+Much like excel, pandas allows you to manipulate columns using the dataframe header. In this examples we want to multiply a dataframe column by a constant.
 ```python
-import matplotlib.pyplot as plt
+air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
+```
-plt.plot(df["Time_s"], df["Force_N"])
-plt.xlabel("Time (s)")
-plt.ylabel("Force (N)")
-plt.title("Force vs. Time")
-plt.show()
+We may want to the new column as a function of other columns, we can do so by simply applying a mathematical operation as follows:
+```python
+air_quality["ratio_paris_antwerp"] = (
+    air_quality["station_paris"] / air_quality["station_antwerp"]
+    )
 ```
+https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html
+https://pandas.pydata.org/docs/user_guide/reshaping.html
+### Problem 1: Create a dataframe from data
+Given the the file `force_displacement_data.txt`. Use pandas to tabulate the data into a dataframe
+```python
+import pandas as pd
+
+file_path = "force_displacement_data.txt"
+
+df_txt = pd.read_csv(
+    file_path,
+    delim_whitespace=True,
+    comment="#",
+    skiprows=0,
+    header=0
+)
+
+print("\n=== Basic Statistics ===")
+print(df_txt.describe())
+
+if "Force_N" in df_txt.columns:
+    print("\nFirst five Force readings:")
+    print(df_txt["Force_N"].head())
+
+try:
+    import matplotlib.pyplot as plt
+
+    plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1])
+    plt.xlabel(df_txt.columns[0])
+    plt.ylabel(df_txt.columns[1])
+    plt.title("Loaded Data from Text File")
+    plt.grid(True)
+    plt.show()
+
+except ImportError:
+    print("\nmatplotlib not installed — skipping plot.")
+```
 **Activities & Examples:**
diff --git a/tutorials/module_4/4.3 Importing and Managing Data.md b/tutorials/module_4/4.3 Importing and Managing Data.md
index ef44a7a..cd66164 100644
--- a/tutorials/module_4/4.3 Importing and Managing Data.md
+++ b/tutorials/module_4/4.3 Importing and Managing Data.md
@@ -1,4 +1,4 @@
-# Importing and Managing Data
+# Importing and Exporting Data
 **Learning objectives:**
diff --git a/tutorials/module_4/4.4 Statistical Analysis.md b/tutorials/module_4/4.4 Statistical Analysis.md
index bf3a8bd..09ac1fb 100644
--- a/tutorials/module_4/4.4 Statistical Analysis.md
+++ b/tutorials/module_4/4.4 Statistical Analysis.md
@@ -6,8 +6,12 @@
 - Correlation and regression
 - Uncertainty, error bars, confidence intervals
 ---
+## Engineering Models
+
+- Curve fitting
+-
 ## Statistical tools
-Numpy comes with some useful statistical tools that we can use to analyze our data. We can use these tools when working with data, it’s important to understand the **central tendency** and **spread** of your dataset. NumPy provides several built-in functions to quickly compute common statistical metrics such as **mean**, **median**, **standard deviation**, and **variance**. These are fundamental tools for analyzing measurement consistency, uncertainty, and identifying trends in data.
+Both Numpy and Pandas come with some useful statistical tools that we can use to analyze our data. We can use these tools when working with data, it’s important to understand the **central tendency** and **spread** of your dataset. NumPy provides several built-in functions to quickly compute common statistical metrics such as **mean**, **median**, **standard deviation**, and **variance**. These are fundamental tools for analyzing measurement consistency, uncertainty, and identifying trends in data.
 ```python
 import numpy as np
@@ -17,4 +21,10 @@ std = np.std([1, 2, 3, 4, 5])
 variance = np.var([1, 2, 3, 4, 5])
 ```
-As seen in the previous lecture, pandas also includes several built-in statistical tools that make it easy to analyze entire datasets directly from a DataFrame. Instead of applying individual NumPy functions to each column, you can use methods such as `.mean()`, `.std()`, `.var()`, and especially `.describe()` to generate quick summaries of your data. These tools are convenient when working with experimental or simulation data that contain multiple variables, allowing you to assess trends, variability, and potential outliers all at once.
\ No newline at end of file
+Pandas also includes several built-in statistical tools that make it easy to analyze entire datasets directly from a DataFrame. When working with pandas we can use methods such as `.mean()`, `.std()`, `.var()`, and especially `.describe()` to generate quick summaries of your data. These tools are convenient when working with experimental or simulation data that contain multiple variables, allowing you to assess trends, variability, and potential outliers all at once.
+
+## Statistical Distribution
+
+
+
+## Problem: Spectroscopy
diff --git a/tutorials/module_4/4.5 Statistical Analysis II.md b/tutorials/module_4/4.5 Statistical Analysis II.md
new file mode 100644
index 0000000..df1b585
--- /dev/null
+++ b/tutorials/module_4/4.5 Statistical Analysis II.md
@@ -0,0 +1,17 @@
+# 4.5 Statistical Analysis II
+[Introduction text]
+
+## Least Square Regression and Line of Best Fit
+### What is Linear Regression?
+Linear regression is one of the most fundamental techniques in data analysis. It models the relationship between two (or more) variables by fitting a **straight line** that best describes the trend in the data.
+
+Linear regression helps identify proportional relationships, estimate calibration constants, or model linear system responses.
+
+
+## Least square fitting
+
+
+## Extrapolation
+
+
+## Moving average
diff --git a/tutorials/module_4/image_1761156588079.png b/tutorials/module_4/image_1761156588079.png
new file mode 100644
index 0000000..dd18131
--- /dev/null
+++ b/tutorials/module_4/image_1761156588079.png
Binary files differ
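Note on the `pd.read_csv` call in Problem 1 of the 4.1 tutorial: `delim_whitespace=True` is deprecated in recent pandas releases (2.2 and later) in favor of `sep=r"\s+"`. A minimal sketch of the equivalent call follows. It assumes a small in-memory sample in place of `force_displacement_data.txt`; the column names and values are made up for illustration and are not taken from the commit.

```python
import io

import pandas as pd

# Hypothetical stand-in for force_displacement_data.txt (values are made up).
sample_text = """# force-displacement test, illustrative values only
Displacement_mm Force_N
0.0 0.0
0.5 12.5
1.0 25.0
1.5 37.5
"""

# sep=r"\s+" splits on any run of whitespace, which is what
# delim_whitespace=True did before it was deprecated.
df_txt = pd.read_csv(
    io.StringIO(sample_text),  # replace with the real file path
    sep=r"\s+",
    comment="#",   # the fully commented first line is skipped
    header=0,      # first non-comment line holds the column names
)

print(df_txt.describe())
```

With a real file, passing the path string in place of `io.StringIO(sample_text)` should behave the same way, provided the file is whitespace-delimited with one header row.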
