From 5b9fd00087ca8594ddb716134cd210a9f9ae5876 Mon Sep 17 00:00:00 2001 From: Christian Kolset Date: Wed, 3 Dec 2025 15:50:47 -0700 Subject: Made some updates to older files and renamed files. Also updated README.md --- tutorials/module_1/basics_of_python.md | 13 +- tutorials/module_1/classes_and_objects.md | 33 ++- tutorials/module_1/control_structures.md | 2 - ...Introduction to Data and Scientific Datasets.md | 144 ------------- ...Introduction_to_Data_and_Scientific_Datasets.md | 144 +++++++++++++ tutorials/module_4/4.2 Interpreting Data.md | 111 ---------- tutorials/module_4/4.2_Interpreting_Data.md | 111 ++++++++++ .../module_4/4.3 Importing and Managing Data.md | 107 ---------- .../module_4/4.3_Importing_and_Managing_Data.md | 107 ++++++++++ tutorials/module_4/4.4 Statistical Analysis.md | 94 --------- tutorials/module_4/4.4_Statistical_Analysis.md | 94 +++++++++ tutorials/module_4/4.5 Statistical Analysis II.md | 179 ---------------- tutorials/module_4/4.5_Statistical_Analysis_II.md | 179 ++++++++++++++++ .../4.6 Data Filtering and Signal Processing.md | 225 --------------------- .../4.6_Data_Filtering_and_Signal_Processing.md | 225 +++++++++++++++++++++ .../4.7 Data Visualization and Presentation.md | 72 ------- .../4.7_Data_Visualization_and_Presentation.md | 72 +++++++ 17 files changed, 969 insertions(+), 943 deletions(-) delete mode 100644 tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md create mode 100644 tutorials/module_4/4.1_Introduction_to_Data_and_Scientific_Datasets.md delete mode 100644 tutorials/module_4/4.2 Interpreting Data.md create mode 100644 tutorials/module_4/4.2_Interpreting_Data.md delete mode 100644 tutorials/module_4/4.3 Importing and Managing Data.md create mode 100644 tutorials/module_4/4.3_Importing_and_Managing_Data.md delete mode 100644 tutorials/module_4/4.4 Statistical Analysis.md create mode 100644 tutorials/module_4/4.4_Statistical_Analysis.md delete mode 100644 tutorials/module_4/4.5 Statistical Analysis II.md create mode 100644 tutorials/module_4/4.5_Statistical_Analysis_II.md delete mode 100644 tutorials/module_4/4.6 Data Filtering and Signal Processing.md create mode 100644 tutorials/module_4/4.6_Data_Filtering_and_Signal_Processing.md delete mode 100644 tutorials/module_4/4.7 Data Visualization and Presentation.md create mode 100644 tutorials/module_4/4.7_Data_Visualization_and_Presentation.md (limited to 'tutorials') diff --git a/tutorials/module_1/basics_of_python.md b/tutorials/module_1/basics_of_python.md index 80b05d3..8dcd37b 100644 --- a/tutorials/module_1/basics_of_python.md +++ b/tutorials/module_1/basics_of_python.md @@ -1,19 +1,16 @@ # Basics of Python - -^d2a6b6 - -This page contains important fundamental concepts used in Python such as syntax, operators, order or precedence and more. +This page contains important fundamental concepts and terminology used in Python such as syntax, operators, order or precedence and more. Although this course uses python, many of these terms are transferable to other programming languages. ## Syntax +Syntax in Python works like grammar in natural languages: it defines the rules for how we must arrange words, symbols, and indentation so the code becomes a valid instruction. ### Indentations and blocks In python *indentations* or the space at the start of each line, signifies a block of code. This becomes important when we start working with function and loops. We will talk more about this in the controls structures tutorial. 
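+As a small illustration (the values here are made up), the two indented `print` calls below form one block that belongs to the `if` statement:
+```python
+temperature = 25  # hypothetical reading
+if temperature > 20:
+    # These indented lines are one block: they only run when the condition is true.
+    print("Warm")
+    print("No jacket needed")
+print("Done")  # not indented, so it runs regardless of the condition
+```
+Removing the indentation from a line moves it out of the block, changing which statements the `if` controls.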
- ### Comments -Comments can be added to your code using the hash operator (#). Any text behind the comment operator till the end of the line will be rendered as a comment. -If you have an entire block of text or code that needs to be commented out, the triple quotation marks (""") can be used. Once used all the code after it will be considered a comment until the comment is ended with the triple quotation marks.f +Comments can be added to your code using the hash operator (`#`). Any text behind the comment operator till the end of the line will be rendered as a comment. +If you have an entire block of text or code that needs to be commented out, the triple quotation marks (`"""`) can be used. Once used all the code after it will be considered a comment until the comment is ended with the triple quotation marks. ## Operators -In python, operators are special symbols or keywords that perform operations on values or variables. This section covers some of the most common operator that you will see in this course. +Operators are special symbols or keywords that perform operations on values or variables. This section covers some of the most common operator that you will see in this course. ### Arithmetic operators | Operator | Name | diff --git a/tutorials/module_1/classes_and_objects.md b/tutorials/module_1/classes_and_objects.md index 759a3a3..ba1dab9 100644 --- a/tutorials/module_1/classes_and_objects.md +++ b/tutorials/module_1/classes_and_objects.md @@ -42,6 +42,31 @@ - D. Bonus: Integrate two objects to simulate interaction --- +# Background +To understand what modular programming is and why using it, let's take a look a the origin of computer code. +# Traditional Programming +Procedural Oriented Programming. + +| OOP | POP | +| ---------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| [Object oriented](https://www.geeksforgeeks.org/dsa/introduction-of-object-oriented-programming/). | [Structure oriented](https://www.geeksforgeeks.org/computer-networks/difference-between-structured-programming-and-object-oriented-programming/). | +| Program is divided into objects. | Program is divided into functions. | +| Bottom-up approach. | Top-down approach. | +| Inheritance property is used. | Inheritance is not allowed. | +| Encapsulation is used to hide the data. | No data hiding. | +| Concept of virtual function. | No virtual function. | +| Object functions are linked through message passing. | Parts of program are linked through parameter passing. | +| Adding new data and functions is easy | Expanding new data and functions is not easy. | +| The existing code can be reused. | No code reusability. | +| use for solving big problems. | Not suitable for solving big problems. | +| Python, [C++](https://www.geeksforgeeks.org/cpp/c-plus-plus/), [Java](https://www.geeksforgeeks.org/java/java/). | [C](https://www.geeksforgeeks.org/c/c-programming-language/), Pascal. | + + +https://www.geeksforgeeks.org/cpp/difference-between-oop-and-pop/ +## POP vs. OOP + + + # Modular Programming Modular programming or better known as Object-Oriented Programming (OOP) is a way we can structure programs so that properties and behaviors of objects are grouped together. It allows us to re-use code which simplifies the code for better readability in larger programs and reduces the potential for bugs. 
You're probably familiar with the saying "Don't re-invent the wheel". OOP is the programmers solution to this. Python allows us to import libraries so that we don't have to write everything from scratch. The libraries used in this course provide us with algorithms to calculate a solution, structure data and plot data. When looking at the source code of these library you may see keywords, such as `class`. This tutorial will delve into the fundamental concepts of OOP. @@ -51,6 +76,10 @@ One analogy for OOP goes as follows. Imagine you are an urban planner in the tow #### Objects Once we have the general structure of what a house is going to look like, we can use this blueprint to start building houses faster. With this blueprint defined. We can start building houses from the blueprint. In python we call this *initializing objects*. Where an actual house built from this blueprint with specific attributes (e.g. house #305, blue house, single-hung window, attached garage) is referred to as an `object`. +>[!NOTE] Difference between functions and objects +> **Functions:** A procedure. Such as a force. +> **Objects:** Include properties. Such as a spring, with a defined spring constant. + #### Example Let's take a look at an example. Of how this looks like in python. Do not worry if you don't understand the syntax below, we will explain this later. Let's define the class: ```python @@ -270,4 +299,6 @@ Notice how the `__init__` method defaults the inputs`breed="Mixed"` and `age=1` ```python -``` \ No newline at end of file +``` + +https://eng.libretexts.org/Courses/Arkansas_Tech_University/Engineering_Modeling_and_Analysis_with_Python/01%3A_Introduction_to_Engineering_Modeling_and_Analysis_with_Python \ No newline at end of file diff --git a/tutorials/module_1/control_structures.md b/tutorials/module_1/control_structures.md index 68db0c5..5b222ce 100644 --- a/tutorials/module_1/control_structures.md +++ b/tutorials/module_1/control_structures.md @@ -1,7 +1,5 @@ # Control Structures -^e8fcff - Control structures allow us to control the flow of execution in a Python program. The two main types are **conditional statements (`if` statements)** and **loops (`for` and `while` loops)**. [Input complete flowchart here] diff --git a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md b/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md deleted file mode 100644 index 882ce59..0000000 --- a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md +++ /dev/null @@ -1,144 +0,0 @@ -# Introduction to Data and Scientific Datasets - -**Learning objectives:** - -- Understand what makes data “scientific” (units, precision, metadata) -- Identify challenges in data processing (missing data, noise, outliers) -- Overview of the data-analysis workflow ---- -### What is scientific data? -Scientific data refers to **measured or simulated information** that describes a physical phenomenon in a quantitative and reproducible way. Scientific data is rooted in physical laws and carries information about the system’s behavior, boundary conditions, and measurement uncertainty whether this is collected experimentally or predicted with a model. - -We may collect this in the following ways: -- **Experiments** – temperature readings from thermocouples, strain or force from sensors, vibration accelerations, or flow velocities. -- **Simulations** – outputs from finite-element or CFD models such as pressure, stress, or temperature distributions. 
-- **Instrumentation and sensors** – digital or analog signals from transducers, encoders, or DAQ systems. -## Data Processing flow works -```mermaid -flowchart - A[Collecting] --> B[Cleaning & Filtering] --> C[Analysis] --> D[Visualization] -``` -Data processing begins with **collection**, where measurements are recorded either manually using instruments or electronically through sensors. Regardless of the method, every measurement contains some degree of error, whether due to instrument limitations or external interference. In engineering, recognizing and quantifying this uncertainty is essential, as it defines the confidence range of our predictions. - -Once the data has been collected, the next step is **cleaning and filtering**. This involves addressing missing data points, managing outliers, and reducing noise. Errors can arise from faulty readings, sensor drift, or transcription mistakes. By cleaning and filtering the data, we ensure it accurately represents the system being measured. - -After the data is refined, we move into **analysis**. Here, statistical methods and computational tools are applied to model the data, uncover trends, and test hypotheses. This stage transforms raw numbers into meaningful insight. - -Finally, **visualization** allows us to communicate these insights effectively. Visualization can occur alongside analysis to guide interpretation or as the concluding step to present results clearly and purposefully. Well-designed visualizations make complex findings intuitive and accessible to the intended audience. - -To carry out this workflow efficiently, particularly during the cleaning, analysis, and visualization stages, we rely on powerful computational tools. In Python, one of the most versatile and widely used libraries for handling tabular data is pandas. It simplifies the process of managing, transforming, and analyzing datasets, allowing engineers and scientists to focus on interpreting results rather than wrestling with raw data. - -## Introduction to pandas -`pandas` (**Pan**el **Da**ta) is a Python library designed for data analysis and manipulation, widely used in engineering, science, and data analytics. It provides two core data structures: the **Series** and the **DataFrame**. - -A `Series` represents a single column or one-dimensional labeled array, while a `DataFrame` is a two-dimensional table of data, similar to a spreadsheet table, where each column is a `Series` and each row has a labeled index. - -DataFrames can be created from dictionaries, lists, NumPy arrays, or imported from external files such as CSV or Excel. Once data is loaded, you can **view and explore** it using methods like `head()`, `tail()`, and `describe()`. Data can be **selected by label** or **by position**. These indexing systems make it easy to slice, filter, and reorganize datasets efficiently. - -### Problem: Create a dataframe from an array -Given the data `force_N` and `time_s` - -```python -import pandas as pd - -force_N = [10, 20, 30, 25, 15] -time_s = [0, 1, 2, 3, 4] - -df = pd.DataFrame({ - 'Time (s)': time_s, - 'Force (N)': force_N -}) - -print("\n=== Basic Statistics ===") -print(df.describe()) -``` -Notice how `the describe()` function outputs some statistical data the we may find useful. -### Manipulating dataframes -#### Subsets and Conditional filtering -You can select rows, columns, or specific conditions from a DataFrame. 
- -```python -# Select a column -force = df["Force_N"] - -# Select multiple columns -subset = df[["Time_s", "Force_N"]] - -# Conditional filtering -df_high_force = df[df["Force_N"] > 50] -``` - -![[Pasted image 20251013064718.png]] - -#### Combining and Merging Datasets -Often, multiple sensors or experiments must be merged into one dataset for analysis. - -```python -# Merge on a common column (e.g., time) -merged = pd.merge(df_force, df_temp, on="Time_s") - -# Stack multiple test runs vertically -combined = pd.concat([df_run1, df_run2], axis=0) -``` -https://pandas.pydata.org/docs/user_guide/merging.html - -#### Creating new columns based on existing ones - - -Much like excel, pandas allows you to manipulate columns using the dataframe header. In this examples we want to multiply a dataframe column by a constant. -```python -air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882 -``` - -We may want to the new column as a function of other columns, we can do so by simply applying a mathematical operation as follows: -```python -air_quality["ratio_paris_antwerp"] = ( - air_quality["station_paris"] / air_quality["station_antwerp"] - ) -``` - -https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html -https://pandas.pydata.org/docs/user_guide/reshaping.html - -### Problem: Create a dataframe from Numpy arrays - - - -Given the the file `force_displacement_data.txt`. Use pandas to tabulate the data into a dataframe -```python -import pandas as pd - -file_path = "force_displacement_data.txt" - -df_txt = pd.read_csv( - file_path, - delim_whitespace=True, - comment="#", - skiprows=0, - header=0 -) - -print("\n=== Basic Statistics ===") -print(df_txt.describe()) - -if "Force_N" in df_txt.columns: - print("\nFirst five Force readings:") - print(df_txt["Force_N"].head()) - -try: - import matplotlib.pyplot as plt - - plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1]) - plt.xlabel(df_txt.columns[0]) - plt.ylabel(df_txt.columns[1]) - plt.title("Loaded Data from Text File") - plt.grid(True) - plt.show() - -except ImportError: - print("\nmatplotlib not installed — skipping plot.") -``` - - -## **Activities & Examples:** -- Discuss real ME examples: strain gauge data, thermocouple readings, pressure transducers \ No newline at end of file diff --git a/tutorials/module_4/4.1_Introduction_to_Data_and_Scientific_Datasets.md b/tutorials/module_4/4.1_Introduction_to_Data_and_Scientific_Datasets.md new file mode 100644 index 0000000..882ce59 --- /dev/null +++ b/tutorials/module_4/4.1_Introduction_to_Data_and_Scientific_Datasets.md @@ -0,0 +1,144 @@ +# Introduction to Data and Scientific Datasets + +**Learning objectives:** + +- Understand what makes data “scientific” (units, precision, metadata) +- Identify challenges in data processing (missing data, noise, outliers) +- Overview of the data-analysis workflow +--- +### What is scientific data? +Scientific data refers to **measured or simulated information** that describes a physical phenomenon in a quantitative and reproducible way. Scientific data is rooted in physical laws and carries information about the system’s behavior, boundary conditions, and measurement uncertainty whether this is collected experimentally or predicted with a model. + +We may collect this in the following ways: +- **Experiments** – temperature readings from thermocouples, strain or force from sensors, vibration accelerations, or flow velocities. 
+- **Simulations** – outputs from finite-element or CFD models such as pressure, stress, or temperature distributions. +- **Instrumentation and sensors** – digital or analog signals from transducers, encoders, or DAQ systems. +## Data Processing flow works +```mermaid +flowchart + A[Collecting] --> B[Cleaning & Filtering] --> C[Analysis] --> D[Visualization] +``` +Data processing begins with **collection**, where measurements are recorded either manually using instruments or electronically through sensors. Regardless of the method, every measurement contains some degree of error, whether due to instrument limitations or external interference. In engineering, recognizing and quantifying this uncertainty is essential, as it defines the confidence range of our predictions. + +Once the data has been collected, the next step is **cleaning and filtering**. This involves addressing missing data points, managing outliers, and reducing noise. Errors can arise from faulty readings, sensor drift, or transcription mistakes. By cleaning and filtering the data, we ensure it accurately represents the system being measured. + +After the data is refined, we move into **analysis**. Here, statistical methods and computational tools are applied to model the data, uncover trends, and test hypotheses. This stage transforms raw numbers into meaningful insight. + +Finally, **visualization** allows us to communicate these insights effectively. Visualization can occur alongside analysis to guide interpretation or as the concluding step to present results clearly and purposefully. Well-designed visualizations make complex findings intuitive and accessible to the intended audience. + +To carry out this workflow efficiently, particularly during the cleaning, analysis, and visualization stages, we rely on powerful computational tools. In Python, one of the most versatile and widely used libraries for handling tabular data is pandas. It simplifies the process of managing, transforming, and analyzing datasets, allowing engineers and scientists to focus on interpreting results rather than wrestling with raw data. + +## Introduction to pandas +`pandas` (**Pan**el **Da**ta) is a Python library designed for data analysis and manipulation, widely used in engineering, science, and data analytics. It provides two core data structures: the **Series** and the **DataFrame**. + +A `Series` represents a single column or one-dimensional labeled array, while a `DataFrame` is a two-dimensional table of data, similar to a spreadsheet table, where each column is a `Series` and each row has a labeled index. + +DataFrames can be created from dictionaries, lists, NumPy arrays, or imported from external files such as CSV or Excel. Once data is loaded, you can **view and explore** it using methods like `head()`, `tail()`, and `describe()`. Data can be **selected by label** or **by position**. These indexing systems make it easy to slice, filter, and reorganize datasets efficiently. + +### Problem: Create a dataframe from an array +Given the data `force_N` and `time_s` + +```python +import pandas as pd + +force_N = [10, 20, 30, 25, 15] +time_s = [0, 1, 2, 3, 4] + +df = pd.DataFrame({ + 'Time (s)': time_s, + 'Force (N)': force_N +}) + +print("\n=== Basic Statistics ===") +print(df.describe()) +``` +Notice how `the describe()` function outputs some statistical data the we may find useful. +### Manipulating dataframes +#### Subsets and Conditional filtering +You can select rows, columns, or specific conditions from a DataFrame. 
+ +```python +# Select a column +force = df["Force_N"] + +# Select multiple columns +subset = df[["Time_s", "Force_N"]] + +# Conditional filtering +df_high_force = df[df["Force_N"] > 50] +``` + +![[Pasted image 20251013064718.png]] + +#### Combining and Merging Datasets +Often, multiple sensors or experiments must be merged into one dataset for analysis. + +```python +# Merge on a common column (e.g., time) +merged = pd.merge(df_force, df_temp, on="Time_s") + +# Stack multiple test runs vertically +combined = pd.concat([df_run1, df_run2], axis=0) +``` +https://pandas.pydata.org/docs/user_guide/merging.html + +#### Creating new columns based on existing ones + + +Much like excel, pandas allows you to manipulate columns using the dataframe header. In this examples we want to multiply a dataframe column by a constant. +```python +air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882 +``` + +We may want to the new column as a function of other columns, we can do so by simply applying a mathematical operation as follows: +```python +air_quality["ratio_paris_antwerp"] = ( + air_quality["station_paris"] / air_quality["station_antwerp"] + ) +``` + +https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html +https://pandas.pydata.org/docs/user_guide/reshaping.html + +### Problem: Create a dataframe from Numpy arrays + + + +Given the the file `force_displacement_data.txt`. Use pandas to tabulate the data into a dataframe +```python +import pandas as pd + +file_path = "force_displacement_data.txt" + +df_txt = pd.read_csv( + file_path, + delim_whitespace=True, + comment="#", + skiprows=0, + header=0 +) + +print("\n=== Basic Statistics ===") +print(df_txt.describe()) + +if "Force_N" in df_txt.columns: + print("\nFirst five Force readings:") + print(df_txt["Force_N"].head()) + +try: + import matplotlib.pyplot as plt + + plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1]) + plt.xlabel(df_txt.columns[0]) + plt.ylabel(df_txt.columns[1]) + plt.title("Loaded Data from Text File") + plt.grid(True) + plt.show() + +except ImportError: + print("\nmatplotlib not installed — skipping plot.") +``` + + +## **Activities & Examples:** +- Discuss real ME examples: strain gauge data, thermocouple readings, pressure transducers \ No newline at end of file diff --git a/tutorials/module_4/4.2 Interpreting Data.md b/tutorials/module_4/4.2 Interpreting Data.md deleted file mode 100644 index bbb4240..0000000 --- a/tutorials/module_4/4.2 Interpreting Data.md +++ /dev/null @@ -1,111 +0,0 @@ -# Interpreting Data for Plotting -# How to represent data scientifically -Philosophy of visualizing data - -A useful tool is using the acronym **PCC** to enforce legibility of your data. These three principles form the foundation of data visualization and ensure your figures communicate meaning rather than just display numbers. - -```mermaid -flowchart LR - A[Purpose] --> B[Composition] - B --> C[Color] - C --> |Clarify| A -``` - -Whether we are preparing figures for a lab report or a research paper, these three elements should always be applied when presenting data. They ensure that our figures are clear, effective, and convey the intended message to our audience. -- **Purpose** -> What are you trying to communicate? Are you explaining a process, comparing results, showing change, or revealing a relationship? -- **Composition** -> How do you arrange the elements of your figure so that the story is clear? 
-- **Color** -> How can you use contrast and tone to highlight key insights and guide your viewer’s attention? - -Remember: great figures rarely emerge on the first attempt. Iterating, refining layout, simplifying elements, or adjusting colors, helps ensure your data is represented honestly and effectively. - -*Remember:* Data don't lie and neither should your figures, even unintentionally. -## Syntax and semantics in Mathematics - The meaning of our data -In the English language, grammar defines the syntax, the structural rules that determine how words are arranged in a sentence. However, meaning arises only through semantics, which tells us what the sentence actually conveys. - -Similarly, in the language of mathematics, syntax consists of the formal rules that govern how we combine symbols, perform operations, and manipulate equations. Yet it is semantics, the interpretation of those symbols and relationships, that gives mathematics its meaning and connection to the real world. - -As engineers and scientists, we must grasp the semantics of our work, not merely the procedures, it is our responsibility to understand the meaning behind it. YouTube creator and rocket engineer Destin Sandlin, better known as SmarterEveryDay, illustrates this concept in his video on the “backwards bicycle,” which demonstrates how syntax and semantics parallel the difference between knowledge and understanding. - -![Backwards Brain Bike](https://www.youtube.com/watch?v=MFzDaBzBlL0) - - -## Purpose - Why? -> Does the figure show the overall story or main point when you hide the text? - -Starting with the most important aspect of a figure is the purpose. What do you want to show? Why are we showing this? What is so important? These questions will help us decide on what time of plot we need. There are many types of plots and some are better for different purposes. - -Often in engineering you find yourself **comparing** or **contrasting** or **show a change** between sets of data. For these cases you should use either a *line chart* or a *scatter plot*. This is often used when plotting mathematical function. - -In a lab report you may find yourself **explaining a process**. For this you may want to use a: *flowchart*, *diagram*, info graphic, illustration, *Gantt chart*, timeline, .etc. - -There are many other types of plots that you can choose from so it can be useful to think about who you're sharing your data with. This may be -- Colleague/Supervisor .etc -- Research conference -- Clients (may not always be technical professionals). - -## Composition - Making good plots ->Can you remove or adjust unnecessary elements that attract your attention? - -Composition refers to how you choose to format your plot, including labeling, gridlines, and axis scaling. - -Often, the main message of a figure can be obscured by too much information. To improve clarity, consider removing or simplifying unnecessary elements such as repetitive labels, bounding boxes, background colors, extra lines or colors, redundant text, and shadows or shading. You can also reduce clutter by adjusting or removing excess data and moving supporting information to supplementary figures. - -If applicable, be sure to follow any additional formatting or figure guidelines required by your target journal. - - - -## Color - Highlight Meaning ->Does the color palette enhance or distract from the story? - -Similarly to composition using color or the absence thereof (gray scale) can help you draw the attention of the read to a specific element of the plot. 
Here is an example of how color can be used to enhance the difference between the private-for-profit. - - - -Checklist - - [ ] Select appropriate type - - [ ] Labels - - [ ] Grid - - [ ] Axis - - [ ] Clarity - -## Problem 1: - -```python -import matplotlib.pyplot as plt -import numpy as np - -# Pseudo data -time_s = np.linspace(0,300,15) -temperature_C = 20 + 0.05 * time_s + 2 * np.random.randn(len(time_s)) - - -# Plot -plt.figure(figsize=(8,6)) -plt.plot(time_s, temperature_C, 'r--o', linewidth=5) -plt.title("Experiment 3") -plt.xlabel("x") -plt.ylabel("y") -plt.grid(True) -plt.legend(["line1"]) -plt.show() - -# Plot (IMPROVED) -plt.figure(figsize=(7,5)) -plt.plot(time_s, temperature_C, color='steelblue', marker='o', linewidth=2, label='Measured Temperature') -plt.title("Temperature Rise of Metal Rod During Heating", fontsize=14, weight='bold') -plt.xlabel("Time [s]", fontsize=12) -plt.ylabel("Temperature [°C]", fontsize=12) -plt.grid(True, linestyle='--', alpha=0.5) -plt.legend(frameon=False) -plt.tight_layout() -plt.show() -``` - - -## Data don't lie -And neither should your figures, even unintentionally. So it's important that you understand every step that stands between your raw data and the final figure. One way to think of this is that your data undergoes a series of transformations to get from what you measure to what ends up in your final results. Nothing in the workflow should be a magic "black box". - -## Problem 2: Misleading plots - - - diff --git a/tutorials/module_4/4.2_Interpreting_Data.md b/tutorials/module_4/4.2_Interpreting_Data.md new file mode 100644 index 0000000..bbb4240 --- /dev/null +++ b/tutorials/module_4/4.2_Interpreting_Data.md @@ -0,0 +1,111 @@ +# Interpreting Data for Plotting +# How to represent data scientifically +Philosophy of visualizing data + +A useful tool is using the acronym **PCC** to enforce legibility of your data. These three principles form the foundation of data visualization and ensure your figures communicate meaning rather than just display numbers. + +```mermaid +flowchart LR + A[Purpose] --> B[Composition] + B --> C[Color] + C --> |Clarify| A +``` + +Whether we are preparing figures for a lab report or a research paper, these three elements should always be applied when presenting data. They ensure that our figures are clear, effective, and convey the intended message to our audience. +- **Purpose** -> What are you trying to communicate? Are you explaining a process, comparing results, showing change, or revealing a relationship? +- **Composition** -> How do you arrange the elements of your figure so that the story is clear? +- **Color** -> How can you use contrast and tone to highlight key insights and guide your viewer’s attention? + +Remember: great figures rarely emerge on the first attempt. Iterating, refining layout, simplifying elements, or adjusting colors, helps ensure your data is represented honestly and effectively. + +*Remember:* Data don't lie and neither should your figures, even unintentionally. +## Syntax and semantics in Mathematics - The meaning of our data +In the English language, grammar defines the syntax, the structural rules that determine how words are arranged in a sentence. However, meaning arises only through semantics, which tells us what the sentence actually conveys. + +Similarly, in the language of mathematics, syntax consists of the formal rules that govern how we combine symbols, perform operations, and manipulate equations. 
Yet it is semantics, the interpretation of those symbols and relationships, that gives mathematics its meaning and connection to the real world. + +As engineers and scientists, we must grasp the semantics of our work, not merely the procedures, it is our responsibility to understand the meaning behind it. YouTube creator and rocket engineer Destin Sandlin, better known as SmarterEveryDay, illustrates this concept in his video on the “backwards bicycle,” which demonstrates how syntax and semantics parallel the difference between knowledge and understanding. + +![Backwards Brain Bike](https://www.youtube.com/watch?v=MFzDaBzBlL0) + + +## Purpose - Why? +> Does the figure show the overall story or main point when you hide the text? + +Starting with the most important aspect of a figure is the purpose. What do you want to show? Why are we showing this? What is so important? These questions will help us decide on what time of plot we need. There are many types of plots and some are better for different purposes. + +Often in engineering you find yourself **comparing** or **contrasting** or **show a change** between sets of data. For these cases you should use either a *line chart* or a *scatter plot*. This is often used when plotting mathematical function. + +In a lab report you may find yourself **explaining a process**. For this you may want to use a: *flowchart*, *diagram*, info graphic, illustration, *Gantt chart*, timeline, .etc. + +There are many other types of plots that you can choose from so it can be useful to think about who you're sharing your data with. This may be +- Colleague/Supervisor .etc +- Research conference +- Clients (may not always be technical professionals). + +## Composition - Making good plots +>Can you remove or adjust unnecessary elements that attract your attention? + +Composition refers to how you choose to format your plot, including labeling, gridlines, and axis scaling. + +Often, the main message of a figure can be obscured by too much information. To improve clarity, consider removing or simplifying unnecessary elements such as repetitive labels, bounding boxes, background colors, extra lines or colors, redundant text, and shadows or shading. You can also reduce clutter by adjusting or removing excess data and moving supporting information to supplementary figures. + +If applicable, be sure to follow any additional formatting or figure guidelines required by your target journal. + + + +## Color - Highlight Meaning +>Does the color palette enhance or distract from the story? + +Similarly to composition using color or the absence thereof (gray scale) can help you draw the attention of the read to a specific element of the plot. Here is an example of how color can be used to enhance the difference between the private-for-profit. 
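+A rough sketch of this idea (the data below is invented purely for demonstration) draws the background series in light gray and reserves a single saturated color for the series the reader should focus on:
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+
+years = np.arange(2000, 2011)
+# Hypothetical data: several background series plus one series of interest
+background = [50 + 3*i + 2*np.random.randn(len(years)) for i in range(4)]
+series_of_interest = 40 + 4.5*(years - years[0])
+
+for series in background:
+    plt.plot(years, series, color="lightgray", linewidth=1.5)  # de-emphasized context
+plt.plot(years, series_of_interest, color="crimson", linewidth=2.5, label="Series of interest")
+
+plt.xlabel("Year")
+plt.ylabel("Value")
+plt.legend(frameon=False)
+plt.tight_layout()
+plt.show()
+```
+Because only one line carries color, the eye lands on it first while the gray series still provide context without competing for attention.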
+ + + +Checklist + - [ ] Select appropriate type + - [ ] Labels + - [ ] Grid + - [ ] Axis + - [ ] Clarity + +## Problem 1: + +```python +import matplotlib.pyplot as plt +import numpy as np + +# Pseudo data +time_s = np.linspace(0,300,15) +temperature_C = 20 + 0.05 * time_s + 2 * np.random.randn(len(time_s)) + + +# Plot +plt.figure(figsize=(8,6)) +plt.plot(time_s, temperature_C, 'r--o', linewidth=5) +plt.title("Experiment 3") +plt.xlabel("x") +plt.ylabel("y") +plt.grid(True) +plt.legend(["line1"]) +plt.show() + +# Plot (IMPROVED) +plt.figure(figsize=(7,5)) +plt.plot(time_s, temperature_C, color='steelblue', marker='o', linewidth=2, label='Measured Temperature') +plt.title("Temperature Rise of Metal Rod During Heating", fontsize=14, weight='bold') +plt.xlabel("Time [s]", fontsize=12) +plt.ylabel("Temperature [°C]", fontsize=12) +plt.grid(True, linestyle='--', alpha=0.5) +plt.legend(frameon=False) +plt.tight_layout() +plt.show() +``` + + +## Data don't lie +And neither should your figures, even unintentionally. So it's important that you understand every step that stands between your raw data and the final figure. One way to think of this is that your data undergoes a series of transformations to get from what you measure to what ends up in your final results. Nothing in the workflow should be a magic "black box". + +## Problem 2: Misleading plots + + + diff --git a/tutorials/module_4/4.3 Importing and Managing Data.md b/tutorials/module_4/4.3 Importing and Managing Data.md deleted file mode 100644 index 91411c6..0000000 --- a/tutorials/module_4/4.3 Importing and Managing Data.md +++ /dev/null @@ -1,107 +0,0 @@ -# Importing and Exporting Data - -**Learning objectives:** - -- Import data from CSV, Excel, and text files using Pandas -- Handle headers, delimiters, and units -- Combine and merge multiple datasets -- Manage data with time or index labels ---- -## File types -Once data is collected, the first step is importing it into a structured form that Python can interpret. The `pandas` library provides the foundation for this, it can read nearly any file format used in engineering (text files, CSV, Excel sheets, CFD results, etc. as well as many python formats such as, arrays, lists, dicitonaries, Numpy arrays etc.) and organize the data in a DataFrame, a tabular structure similar to an Excel sheet but optimized for coding. -![](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg) -## Importing spreadsheets using Pandas -Comma-Separated Values (CSV) files is a common spreadsheet type file. It is essentially a text file where each line is a now row of tables and commas indicate that a new column has stated. It is a standard convention to save spreadsheets in this format. - -Let's take a look at how this works in python. -```python -import pandas as pd - -# Read a CSV file -df = pd.read_csv("data_experiment.csv") - -# Optional arguments -df_csv = pd.read_csv( - "data_experiment.csv", - delimiter=",", # specify custom delimiter - header=0, # row number to use as header - index_col=None, # or specify a column as index - skiprows=0, # skip metadata lines -) -print df -``` - -We now created a new dataframe with the data from our .csv file. - -We can also do this for **excel files**. Pandas has a built-in function to make this easier for us. 
-```python -df_xlsx = pd.read_excel("temperature_log.xlsx", sheet_name="Sheet1") -print(df_xlsx.head()) -``` - -Additionally, although not a very common practice in engineering but very useful: Pandas can import a wide variety of file types such as JSON, HTML, SQL or even your clipboard. - -### Handling Headers, Units, and Metadata -Raw data often contains metadata or units above the table. Pandas can account for this metadata by skipping the first few rows. - -```python -df = pd.read_csv("sensor_data.csv", skiprows=3) -df.columns = ["Time_s", "Force_N", "Displacement_mm"] - -# Convert units -df["Displacement_m"] = df["Displacement_mm"] / 1000 -``` - -### Writing and Editing Data in pandas -https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html - -Once data has been analyzed or cleaned, `pandas` allows you to **export results** to multiple file types for reporting or further processing. Similarily to importing we can also export .csv files and Excel files. Pandas makes it easy to modify individual datapoints directly within a DataFrame. You can localize entries either by label or position - -```python -# by name -df.loc[row_label, column_label]`  -#or by position  -df.iloc[row_index, column_index] -``` - - -```python -import pandas as pd - -# Create DataFrame manually -data = { - "Time_s": [0, 1, 2, 3], - "Force_N": [0.0, 5.2, 10.4, 15.5], - "Displacement_mm": [0.0, 0.3, 0.6, 0.9] -} -df = pd.DataFrame(data) - -# Edit a single value -df.loc[1, "Force_N"] = 5.5 - -# Export to CSV -df.to_csv("edited_experiment.csv", index=False) -``` - -This workflow makes pandas ideal for working with tabular data, you can quickly edit or generate datasets, verify values, and save clean, structured files for later visualization or analysis. - - -### Problem: Import time stamped data - - - - - - - - - -# Further Docs -[Comparison with Spreadsheets](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_spreadsheets.html#compare-with-spreadsheets) -[Intro to Reading/Writing Files](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html) -[Subsetting Data](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html) -[Adding Columns](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html) -[Reshaping Data](https://pandas.pydata.org/docs/user_guide/reshaping.html) -[Merging DataFrames](https://pandas.pydata.org/docs/user_guide/merging.html) -[Combining DataFrames](https://pandas.pydata.org/docs/getting_started/intro_tutorials/08_combine_dataframes.html) - diff --git a/tutorials/module_4/4.3_Importing_and_Managing_Data.md b/tutorials/module_4/4.3_Importing_and_Managing_Data.md new file mode 100644 index 0000000..91411c6 --- /dev/null +++ b/tutorials/module_4/4.3_Importing_and_Managing_Data.md @@ -0,0 +1,107 @@ +# Importing and Exporting Data + +**Learning objectives:** + +- Import data from CSV, Excel, and text files using Pandas +- Handle headers, delimiters, and units +- Combine and merge multiple datasets +- Manage data with time or index labels +--- +## File types +Once data is collected, the first step is importing it into a structured form that Python can interpret. The `pandas` library provides the foundation for this, it can read nearly any file format used in engineering (text files, CSV, Excel sheets, CFD results, etc. as well as many python formats such as, arrays, lists, dicitonaries, Numpy arrays etc.) 
and organize the data in a DataFrame, a tabular structure similar to an Excel sheet but optimized for coding. +![](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg) +## Importing spreadsheets using Pandas +Comma-Separated Values (CSV) files is a common spreadsheet type file. It is essentially a text file where each line is a now row of tables and commas indicate that a new column has stated. It is a standard convention to save spreadsheets in this format. + +Let's take a look at how this works in python. +```python +import pandas as pd + +# Read a CSV file +df = pd.read_csv("data_experiment.csv") + +# Optional arguments +df_csv = pd.read_csv( + "data_experiment.csv", + delimiter=",", # specify custom delimiter + header=0, # row number to use as header + index_col=None, # or specify a column as index + skiprows=0, # skip metadata lines +) +print df +``` + +We now created a new dataframe with the data from our .csv file. + +We can also do this for **excel files**. Pandas has a built-in function to make this easier for us. +```python +df_xlsx = pd.read_excel("temperature_log.xlsx", sheet_name="Sheet1") +print(df_xlsx.head()) +``` + +Additionally, although not a very common practice in engineering but very useful: Pandas can import a wide variety of file types such as JSON, HTML, SQL or even your clipboard. + +### Handling Headers, Units, and Metadata +Raw data often contains metadata or units above the table. Pandas can account for this metadata by skipping the first few rows. + +```python +df = pd.read_csv("sensor_data.csv", skiprows=3) +df.columns = ["Time_s", "Force_N", "Displacement_mm"] + +# Convert units +df["Displacement_m"] = df["Displacement_mm"] / 1000 +``` + +### Writing and Editing Data in pandas +https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html + +Once data has been analyzed or cleaned, `pandas` allows you to **export results** to multiple file types for reporting or further processing. Similarily to importing we can also export .csv files and Excel files. Pandas makes it easy to modify individual datapoints directly within a DataFrame. You can localize entries either by label or position + +```python +# by name +df.loc[row_label, column_label]`  +#or by position  +df.iloc[row_index, column_index] +``` + + +```python +import pandas as pd + +# Create DataFrame manually +data = { + "Time_s": [0, 1, 2, 3], + "Force_N": [0.0, 5.2, 10.4, 15.5], + "Displacement_mm": [0.0, 0.3, 0.6, 0.9] +} +df = pd.DataFrame(data) + +# Edit a single value +df.loc[1, "Force_N"] = 5.5 + +# Export to CSV +df.to_csv("edited_experiment.csv", index=False) +``` + +This workflow makes pandas ideal for working with tabular data, you can quickly edit or generate datasets, verify values, and save clean, structured files for later visualization or analysis. 
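+Time-stamped measurements are a common special case. A minimal sketch (the file name `sensor_log.csv` and the `Timestamp` column are assumptions for illustration) parses the timestamps while reading and uses them as the row index so the data can be sliced by time:
+```python
+import pandas as pd
+
+# Hypothetical log with a "Timestamp" column such as 2025-01-01 09:00:00
+df_log = pd.read_csv(
+    "sensor_log.csv",
+    parse_dates=["Timestamp"],   # convert the text column to datetime objects
+    index_col="Timestamp",       # use the timestamps as the row index
+)
+
+print(df_log.head())
+
+# With a datetime index, rows can be selected by time range using date strings
+morning = df_log.loc["2025-01-01 09:00":"2025-01-01 12:00"]
+print(morning.describe())
+```
+Working from a datetime index also makes later steps, such as resampling or plotting against time, much more convenient.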
+ + +### Problem: Import time stamped data + + + + + + + + + +# Further Docs +[Comparison with Spreadsheets](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_spreadsheets.html#compare-with-spreadsheets) +[Intro to Reading/Writing Files](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html) +[Subsetting Data](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html) +[Adding Columns](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html) +[Reshaping Data](https://pandas.pydata.org/docs/user_guide/reshaping.html) +[Merging DataFrames](https://pandas.pydata.org/docs/user_guide/merging.html) +[Combining DataFrames](https://pandas.pydata.org/docs/getting_started/intro_tutorials/08_combine_dataframes.html) + diff --git a/tutorials/module_4/4.4 Statistical Analysis.md b/tutorials/module_4/4.4 Statistical Analysis.md deleted file mode 100644 index f61caa9..0000000 --- a/tutorials/module_4/4.4 Statistical Analysis.md +++ /dev/null @@ -1,94 +0,0 @@ -# Statistical Analysis -## Subtitle: Using statistics to reduce uncertainty - -**Learning Objectives:** - -- Descriptive statistics (mean, median, variance, std deviation) -- Histograms and probability distributions -- Correlation and regression -- Uncertainty, error bars, confidence intervals ---- -## Engineering Models -#### Why Do We Care? -In engineering, data is more than just a collection of numbers, it tells a story about how systems behave. By analyzing data, we can develop mathematical models that describe and predict physical behavior. These models help us answer questions such as: -- How does stress relate to strain for a given material? -- How does temperature affect efficiency in a heat engine? -- How does flow rate change with pressure in a pipe? - -When we fit equations to experimental data, we turn observations into predictive tools. This process allows engineers to forecast performance, optimize designs, and identify system limitations. -#### From Data to Models -A common way to build an engineering model is through curve fitting, finding a mathematical expression that best represents the trend in your data. - -You’ve likely done this before in Excel by adding a “trendline” to a plot. In this module, we’ll take that concept further using Python, which allows for more control, flexibility, and insight into the underlying math. - -We’ll learn to: -- Fit linear, exponential, and polynomial relationships. -- Evaluate how well our model fits the data using metrics like R². -- Use models to predict outcomes beyond the measured range (carefully). - -By the end of this section, you’ll understand not just how to fit data, but why certain models work better for specific engineering problems. -## Statistics Review -Let's take a second to remind ourselves of some statistical terms and how we define it mathematically - -| | Formula | Measurement | -| ------------------------ | ---------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- | -| Arithmetic Mean | $$\bar{y} = \frac{\sum y_i}{n}$$ | Average. (in Units) | -| Standard Deviation | $$s_y = \sqrt{\frac{S_t}{n - 1}}, \quad S_t = \sum (y_i - \bar{y})^2$$ | Absolute measure of spread from the average. 
(in Units) | -| Variance | $$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}{n- 1}$$ | How spread out the data is from the average. (in Squared Units) | -| Coefficient of Variation | $$c.v. = \frac{s_y}{\bar{y}} \times 100\%$$ | Relative measure of spread from the average or the consistency of the data. (Unitless) | - -## Statistics function in python -Both Numpy and Pandas come with some useful statistical tools that we can use to analyze our data. We can use these tools when working with data, it’s important to understand the **central tendency** and **spread** of your dataset. NumPy provides several built-in functions to quickly compute common statistical metrics such as **mean**, **median**, **standard deviation**, and **variance**. These are fundamental tools for analyzing measurement consistency, uncertainty, and identifying trends in data. -```python -import numpy as np - -mean = np.mean([1, 2, 3, 4, 5]) -median = np.median([1, 2, 3, 4, 5]) -std = np.std([1, 2, 3, 4, 5]) -variance = np.var([1, 2, 3, 4, 5]) -``` - -Pandas also includes several built-in statistical tools that make it easy to analyze entire datasets directly from a DataFrame. When working with pandas we can use methods such as `.mean()`, `.std()`, `.var()`, and especially `.describe()` to generate quick summaries of your data. These tools are convenient when working with experimental or simulation data that contain multiple variables, allowing you to assess trends, variability, and potential outliers all at once. - -## Problem: Use pd.describe() to report on a dataseries - - ---- - -## Reducing uncertainty using statistics -Great, so we -## Statistical Distributions -Every engineering measurement contains some amount of variation. Whether you’re measuring the intensity of a spectral line, the pressure in a cylinder, or the thickness of a machined part, small deviations are inevitable. Statistical distributions help us quantify and visualize that variation, giving engineers a way to decide what is normal and what is error. -### Normal Distribution -Most experimental data follows a normal (Gaussian) distribution, where values cluster around the mean and taper off symmetrically toward the extremes. - - -In a normal distribution: -- About 68 % of data lies within $\pm 1 \sigma$ of the mean -- About 95 % lies within $\pm 2 \sigma$ -- About 99.7 % lies within $± 3 \sigma$ - -This helps to assess confidence in their results and identify outliers that may indicate bad readings or faulty sensors. - ->[!NOTE] Design Thinking - Reliability ->Motorola popularized Six Sigma design to minimize manufacturing defects. The goal was to design processes where the probability of failure is less than 3.4 per million parts, essentially operating six standard deviations from the mean. The mindset here is proactive design: if we understand how variability behaves, we can design systems that tolerate it. -> ->Takeaway: Statistical distributions aren’t just for data analysis they guide how reliable we make our products. - -## Practical Application: Spectroscopy -### Background -Spectroscopy is the study of how matter interacts with electromagnetic radiation, including the absorption and emission of light and other forms of radiation. It examines how these interactions depend on the wavelength of the radiation, providing insight into the physical and chemical properties of materials. This is how NASA determines the composition of planetary surfaces and atmospheres. 
It's also applied in combustion and thermal analysis where spectroscopy measure plasma temperature and monitors exhaust composition in rocket engines. - -In simple terms, spectroscopy helps us understand what substances are made of and how they behave when exposed to high levels energy to help improve system performance and efficiency. These applications allow us to better understand material behavior under extreme conditions and improve system performance and efficiency. - -### Spectrometer -The instrument used to measure the spectra of light is called a spectrometer. It works on the basis of taking light, scatters it and then projecting the spectra onto a detector allowing us to capture the intensity of the light at different wavelengths. See the supplementary video of the inside of a spectrometer. -![How spectrometers work](https://www.youtube.com/watch?v=OI3pIvLhVcc) - -Once the data is collected we can compare our data with know spectra of elements to then identify their composition. The figure below, show the spectra of different elements. - - - - -## Problem: Eliminating uncertainty in Spectroscopy readings -When using spectroscopy to measure emission intensity, each reading fluctuates slightly due to sensor noise, temperature drift or electronic fluctuations. By taking multiple readings and averaging them, random errors (positive and negative) tend to cancel out, the mean converges to the true value. The standard deviation quantifies how precise the measurement is. Plot all readings of intensity as a function of wavelength on top of each other. Calculate the mean, standard deviation and variance. Then plot the intensity readings as a histogram. Comment on the distributions type. \ No newline at end of file diff --git a/tutorials/module_4/4.4_Statistical_Analysis.md b/tutorials/module_4/4.4_Statistical_Analysis.md new file mode 100644 index 0000000..f61caa9 --- /dev/null +++ b/tutorials/module_4/4.4_Statistical_Analysis.md @@ -0,0 +1,94 @@ +# Statistical Analysis +## Subtitle: Using statistics to reduce uncertainty + +**Learning Objectives:** + +- Descriptive statistics (mean, median, variance, std deviation) +- Histograms and probability distributions +- Correlation and regression +- Uncertainty, error bars, confidence intervals +--- +## Engineering Models +#### Why Do We Care? +In engineering, data is more than just a collection of numbers, it tells a story about how systems behave. By analyzing data, we can develop mathematical models that describe and predict physical behavior. These models help us answer questions such as: +- How does stress relate to strain for a given material? +- How does temperature affect efficiency in a heat engine? +- How does flow rate change with pressure in a pipe? + +When we fit equations to experimental data, we turn observations into predictive tools. This process allows engineers to forecast performance, optimize designs, and identify system limitations. +#### From Data to Models +A common way to build an engineering model is through curve fitting, finding a mathematical expression that best represents the trend in your data. + +You’ve likely done this before in Excel by adding a “trendline” to a plot. In this module, we’ll take that concept further using Python, which allows for more control, flexibility, and insight into the underlying math. + +We’ll learn to: +- Fit linear, exponential, and polynomial relationships. +- Evaluate how well our model fits the data using metrics like R². 
+- Use models to predict outcomes beyond the measured range (carefully). + +By the end of this section, you’ll understand not just how to fit data, but why certain models work better for specific engineering problems. +## Statistics Review +Let's take a second to remind ourselves of some statistical terms and how we define it mathematically + +| | Formula | Measurement | +| ------------------------ | ---------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- | +| Arithmetic Mean | $$\bar{y} = \frac{\sum y_i}{n}$$ | Average. (in Units) | +| Standard Deviation | $$s_y = \sqrt{\frac{S_t}{n - 1}}, \quad S_t = \sum (y_i - \bar{y})^2$$ | Absolute measure of spread from the average. (in Units) | +| Variance | $$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{\left(\sum y_i^2 - \frac{(\sum y_i)^2}{n}\right)}{n- 1}$$ | How spread out the data is from the average. (in Squared Units) | +| Coefficient of Variation | $$c.v. = \frac{s_y}{\bar{y}} \times 100\%$$ | Relative measure of spread from the average or the consistency of the data. (Unitless) | + +## Statistics function in python +Both Numpy and Pandas come with some useful statistical tools that we can use to analyze our data. We can use these tools when working with data, it’s important to understand the **central tendency** and **spread** of your dataset. NumPy provides several built-in functions to quickly compute common statistical metrics such as **mean**, **median**, **standard deviation**, and **variance**. These are fundamental tools for analyzing measurement consistency, uncertainty, and identifying trends in data. +```python +import numpy as np + +mean = np.mean([1, 2, 3, 4, 5]) +median = np.median([1, 2, 3, 4, 5]) +std = np.std([1, 2, 3, 4, 5]) +variance = np.var([1, 2, 3, 4, 5]) +``` + +Pandas also includes several built-in statistical tools that make it easy to analyze entire datasets directly from a DataFrame. When working with pandas we can use methods such as `.mean()`, `.std()`, `.var()`, and especially `.describe()` to generate quick summaries of your data. These tools are convenient when working with experimental or simulation data that contain multiple variables, allowing you to assess trends, variability, and potential outliers all at once. + +## Problem: Use pd.describe() to report on a dataseries + + +--- + +## Reducing uncertainty using statistics +Great, so we +## Statistical Distributions +Every engineering measurement contains some amount of variation. Whether you’re measuring the intensity of a spectral line, the pressure in a cylinder, or the thickness of a machined part, small deviations are inevitable. Statistical distributions help us quantify and visualize that variation, giving engineers a way to decide what is normal and what is error. +### Normal Distribution +Most experimental data follows a normal (Gaussian) distribution, where values cluster around the mean and taper off symmetrically toward the extremes. + + +In a normal distribution: +- About 68 % of data lies within $\pm 1 \sigma$ of the mean +- About 95 % lies within $\pm 2 \sigma$ +- About 99.7 % lies within $± 3 \sigma$ + +This helps to assess confidence in their results and identify outliers that may indicate bad readings or faulty sensors. + +>[!NOTE] Design Thinking - Reliability +>Motorola popularized Six Sigma design to minimize manufacturing defects. 
The goal was to design processes where the probability of failure is less than 3.4 per million parts, essentially operating six standard deviations from the mean. The mindset here is proactive design: if we understand how variability behaves, we can design systems that tolerate it.
+>
+>Takeaway: Statistical distributions aren't just for data analysis; they guide how reliable we make our products.
+
+## Practical Application: Spectroscopy
+### Background
+Spectroscopy is the study of how matter interacts with electromagnetic radiation, including the absorption and emission of light and other forms of radiation. It examines how these interactions depend on the wavelength of the radiation, providing insight into the physical and chemical properties of materials. This is how NASA determines the composition of planetary surfaces and atmospheres. It's also applied in combustion and thermal analysis, where spectroscopy measures plasma temperature and monitors exhaust composition in rocket engines.
+
+In simple terms, spectroscopy helps us understand what substances are made of and how they behave when exposed to high levels of energy. These applications allow us to better understand material behavior under extreme conditions and improve system performance and efficiency.
+
+### Spectrometer
+The instrument used to measure the spectra of light is called a spectrometer. It works by taking light, scattering it, and then projecting the spectrum onto a detector, allowing us to capture the intensity of the light at different wavelengths. See the supplementary video of the inside of a spectrometer.
+![How spectrometers work](https://www.youtube.com/watch?v=OI3pIvLhVcc)
+
+Once the data is collected we can compare it with the known spectra of elements to identify their composition. The figure below shows the spectra of different elements.
+
+
+
+
+## Problem: Eliminating uncertainty in Spectroscopy readings
+When using spectroscopy to measure emission intensity, each reading fluctuates slightly due to sensor noise, temperature drift or electronic fluctuations. By taking multiple readings and averaging them, random errors (positive and negative) tend to cancel out and the mean converges to the true value. The standard deviation quantifies how precise the measurement is. Plot all readings of intensity as a function of wavelength on top of each other. Calculate the mean, standard deviation and variance. Then plot the intensity readings as a histogram. Comment on the type of distribution.
\ No newline at end of file
diff --git a/tutorials/module_4/4.5 Statistical Analysis II.md b/tutorials/module_4/4.5 Statistical Analysis II.md
deleted file mode 100644
index 20805c9..0000000
--- a/tutorials/module_4/4.5 Statistical Analysis II.md
+++ /dev/null
@@ -1,179 +0,0 @@
-# 4.5 Statistical Analysis II
-## Modelling Relationships
-As mentioned in the previous tutorial. Data is what gives us the basis to create models. By now you've probably used excel to create a line of best fit. In this tutorial, we will go deeper into how this works and how we can apply this to create our own models to make our own predictions.
-
-## Least Square Regression and Line of Best Fit
-
-### What is Regression?
-Linear regression is one of the most fundamental techniques in data analysis. It models the relationship between two (or more) variables by fitting a **straight line** that best describes the trend in the data.
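Before deriving the least-squares line, it helps to see where we are headed. The short sketch below (the load-deflection numbers are made up purely for illustration) uses NumPy's `np.polyfit` to find the slope and intercept that minimize the squared error between the data and the line; the subsections that follow explain where those values come from and how to judge the quality of the fit.

```python
import numpy as np

# Made-up load-deflection data, used only to illustrate the fitting call
x = np.array([0, 1, 2, 3, 4, 5])              # load [kN]
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1, 9.9])  # deflection [mm]

# Degree-1 polyfit returns the least-squares slope and intercept of y = m*x + b
m, b = np.polyfit(x, y, 1)
y_hat = m*x + b                                # predictions from the fitted line

print(f"slope m = {m:.3f} mm/kN, intercept b = {b:.3f} mm")
```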
- -### Linear -The easiest form of regression a linear regression line. This is based on the principle of finding a straight line through our data that minimizes the error between the data and the predicted line of best fit. It is quite intuitive to do visually. However is there a way we can do this mathematically to ensure we the optimal line? Let's consider a straight line -$$ -y=mx+b\tag{} -$$ - - -### Exponential and Power functions -You may have asked yourself. "What if my data is not linear?". If the variables in your data is related to each other by exponential or power we can use a logarithm trick. We can apply a log scale to the function to linearize the function and then apply the linear least-squares method. - -### Polynomial -Least squares method can also be applied to polynomial functions. For non-linear equations function such as a polynomial, Numpy has a nice feature. - -```python -x_d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8]) -y_d = np.array([0, 0.8, 0.9, 0.1, -0.6, -0.8, -1, -0.9, -0.4]) - -plt.figure(figsize = (12, 8)) -for i in range(1, 7): - - # get the polynomial coefficients - y_est = np.polyfit(x_d, y_d, i) - plt.subplot(2,3,i) - plt.plot(x_d, y_d, 'o') - # evaluate the values for a polynomial - plt.plot(x_d, np.polyval(y_est, x_d)) - plt.title(f'Polynomial order {i}') - -plt.tight_layout() -plt.show() -``` - -### Using Scipy -You can also use scipy to -```python -# let's define the function form -def func(x, a, b): - y = a*np.exp(b*x) - return y - -alpha, beta = optimize.curve_fit(func, xdata = x, ydata = y)[0] -print(f'alpha={alpha}, beta={beta}') - -# Let's have a look of the data -plt.figure(figsize = (10,8)) -plt.plot(x, y, 'b.') -plt.plot(x, alpha*np.exp(beta*x), 'r') -plt.xlabel('x') -plt.ylabel('y') -plt.show() -``` - - -### How well did we do? -After fitting a regression model, we may ask ourselves how closely the model actually represent the data. To quantify this, we use **error metrics** that compare the predicted values from our model to the measured data. -#### Sum of Squares -We define several *sum of squares* quantities that measure total variation and error: - -$$ -\begin{aligned} -S_t &= \sum (y_i - \bar{y})^2 &\text{(total variation in data)}\\ -S_r &= \sum (y_i - \hat{y}_i)^2 &\text{(residual variation — unexplained by the model)}\\ -S_l &= S_t - S_r &\text{(variation explained by the regression line)} -\end{aligned} -$$ -Where: -* $y_i$ = observed data -* $\hat{y}_i$ = predicted data from the model -* $\bar{y}$ = mean of observed data -#### Standard Error of the Estimate -If the scatter of data about the regression line is approximately normal, the **standard error of the estimate** represents the typical deviation of a point from the fitted line: - -$$ -s_{y/x} = \sqrt{\frac{S_r}{n - 2}} -$$ -where $n$ is the number of data points. -Smaller $s_{y/x}$ means the regression line passes closer to the data points. - -#### Coefficient of Determination – ($R^2$) -The coefficient of determination, (R^2), tells us how much of the total variation in (y) is explained by the regression: -$$ -R^2 = \frac{S_l}{S_t} = 1 - \frac{S_r}{S_t} -$$ -- ($R^2$ = 1.0) → perfect fit (all points on the line) -- ($R^2$ = 0) → model explains none of the variation -In engineering terms, a high (R^2) indicates that your model captures most of the physical trend, for example, how deflection scales with load. 
- -#### Correlation Coefficient – ($r$) -For linear regression, the correlation coefficient (r) is the square root of (R^2), with sign matching the slope of the line: - -$$ -r = \pm \sqrt{R^2} -$$ -- ($r$ > 0): positive correlation (both variables increase together) -- ($r$ < 0): negative correlation (one increases, the other decreases) -## Problem 1: -Fit a linear and polynomial model to stress-strain data. Compute R^2 and discuss which model fits better. - -```python -import numpy as np - -# Example data -x = np.array([0, 1, 2, 3, 4, 5]) -y = np.array([0, 1.2, 2.3, 3.1, 3.9, 5.2]) - -# Linear fit -m, b = np.polyfit(x, y, 1) -y_pred = m*x + b - -# Calculate residuals and metrics -Sr = np.sum((y - y_pred)**2) -St = np.sum((y - np.mean(y))**2) -syx = np.sqrt(Sr / (len(y) - 2)) -R2 = 1 - Sr/St -r = np.sign(m) * np.sqrt(R2) - -print(f"s_y/x = {syx:.3f}") -print(f"R^2 = {R2:.3f}") -print(f"r = {r:.3f}") -``` - -## Extrapolation -Once we have a regression model, it’s tempting to use it to predict values beyond the range of measured data. This process is called extrapolation. - -In interpolation, the model is supported by real data on both sides of the point. In extrapolation, we’re assuming that the same physical relationship continues indefinitely and that’s often not true in engineering systems. - -Most regression equations are empirical as they describe the trend in the range of observed conditions but may not capture the true physics. Common issues may originate from nonlinear behavior outside range such as stress–strain curves. Physical limitations, such as below absolute 0 temperatures, or greater than 100% efficiencies. Another case could be where the mechanism changes in the real world making the model inapplicable such as heat transfer switching from convection to radiation at higher temperatures. - -Some guidelines of using extrapolation: -- Plot the data used for fitting -- Avoid predicting far beyond the range of your data unless supported by physical models -## Moving average -Real experimental data often contains small random fluctuations that obscure the underlying trend a.k.a. noise. Rather than fitting a complex equation, we can smooth the data using a moving average, which replaces each point with the average of its nearby neighbors. This simple method reduces random variation while preserving the overall shape of the signal. - -A moving average or rolling mean takes the average over a sliding window of data points given by the equation: -$$\bar{y}_i = \frac{1}{N} \sum_{j=i-k}^{i+k} y_j$$ -where: -- $N$ = window size (number of points averaged), -- $k = (N-1)/2$ if the window is centered, -- $y_j$​ = original data values. - -If you select a larger window you'll have a smoother curve, but you loose detail. A smaller windows retains more detail but reduces less noise. 
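The example that follows uses the rolling mean built into pandas, but the formula above can also be implemented directly: a centered moving average is just a convolution of the signal with a uniform window. Here is a minimal NumPy sketch with made-up sample values; note that `np.convolve` implicitly pads the ends of the signal with zeros, whereas the pandas `.rolling()` method used below marks incomplete windows as NaN instead.

```python
import numpy as np

# Made-up noisy samples, only to illustrate the windowed average
y = np.array([2.0, 2.3, 1.9, 2.6, 2.1, 2.4, 2.0, 2.2])

N = 3                                            # window size (odd, so the window is centered)
kernel = np.ones(N) / N                          # uniform weights of 1/N
y_smooth = np.convolve(y, kernel, mode='same')   # interior points are the centered average

print(y_smooth)
```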
-### Example: Smoothing sensor noise -```python -import numpy as np -import matplotlib.pyplot as plt -import pandas as pd - -# Generate noisy signal -x = np.linspace(0, 4*np.pi, 100) -y = np.sin(x) + 0.2*np.random.randn(100) - -# Apply moving average with different window sizes -df = pd.DataFrame({'x': x, 'y': y}) -df['y_smooth_5'] = df['y'].rolling(window=5, center=True).mean() -df['y_smooth_15'] = df['y'].rolling(window=15, center=True).mean() - -plt.plot(df['x'], df['y'], 'k.', alpha=0.4, label='Raw data') -plt.plot(df['x'], df['y_smooth_5'], 'r-', label='Window = 5') -plt.plot(df['x'], df['y_smooth_15'], 'b-', label='Window = 15') -plt.xlabel('Time (s)') -plt.ylabel('Signal') -plt.title('Effect of Moving Average Window Size') -plt.legend() -plt.show() -``` - -## Problem 2: Moving average -Apply a moving average to noisy temperature data and compare raw vs. smoothed signals - diff --git a/tutorials/module_4/4.5_Statistical_Analysis_II.md b/tutorials/module_4/4.5_Statistical_Analysis_II.md new file mode 100644 index 0000000..20805c9 --- /dev/null +++ b/tutorials/module_4/4.5_Statistical_Analysis_II.md @@ -0,0 +1,179 @@ +# 4.5 Statistical Analysis II +## Modelling Relationships +As mentioned in the previous tutorial. Data is what gives us the basis to create models. By now you've probably used excel to create a line of best fit. In this tutorial, we will go deeper into how this works and how we can apply this to create our own models to make our own predictions. + +## Least Square Regression and Line of Best Fit + +### What is Regression? +Linear regression is one of the most fundamental techniques in data analysis. It models the relationship between two (or more) variables by fitting a **straight line** that best describes the trend in the data. + +### Linear +The easiest form of regression a linear regression line. This is based on the principle of finding a straight line through our data that minimizes the error between the data and the predicted line of best fit. It is quite intuitive to do visually. However is there a way we can do this mathematically to ensure we the optimal line? Let's consider a straight line +$$ +y=mx+b\tag{} +$$ + + +### Exponential and Power functions +You may have asked yourself. "What if my data is not linear?". If the variables in your data is related to each other by exponential or power we can use a logarithm trick. We can apply a log scale to the function to linearize the function and then apply the linear least-squares method. + +### Polynomial +Least squares method can also be applied to polynomial functions. For non-linear equations function such as a polynomial, Numpy has a nice feature. 
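The feature in question is the pair `np.polyfit`, which returns the least-squares coefficients for a polynomial of any chosen degree, and `np.polyval`, which evaluates those coefficients at given points. A minimal sketch of the pattern, with made-up data, before the full example below:

```python
import numpy as np

# Made-up data points for the sketch
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 0.4, 0.9, 2.8, 6.5])

coeffs = np.polyfit(x, y, 2)    # least-squares coefficients of a degree-2 polynomial
y_fit = np.polyval(coeffs, x)   # evaluate the fitted polynomial at the data points

print(coeffs)
```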
+ +```python +x_d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8]) +y_d = np.array([0, 0.8, 0.9, 0.1, -0.6, -0.8, -1, -0.9, -0.4]) + +plt.figure(figsize = (12, 8)) +for i in range(1, 7): + + # get the polynomial coefficients + y_est = np.polyfit(x_d, y_d, i) + plt.subplot(2,3,i) + plt.plot(x_d, y_d, 'o') + # evaluate the values for a polynomial + plt.plot(x_d, np.polyval(y_est, x_d)) + plt.title(f'Polynomial order {i}') + +plt.tight_layout() +plt.show() +``` + +### Using Scipy +You can also use scipy to +```python +# let's define the function form +def func(x, a, b): + y = a*np.exp(b*x) + return y + +alpha, beta = optimize.curve_fit(func, xdata = x, ydata = y)[0] +print(f'alpha={alpha}, beta={beta}') + +# Let's have a look of the data +plt.figure(figsize = (10,8)) +plt.plot(x, y, 'b.') +plt.plot(x, alpha*np.exp(beta*x), 'r') +plt.xlabel('x') +plt.ylabel('y') +plt.show() +``` + + +### How well did we do? +After fitting a regression model, we may ask ourselves how closely the model actually represent the data. To quantify this, we use **error metrics** that compare the predicted values from our model to the measured data. +#### Sum of Squares +We define several *sum of squares* quantities that measure total variation and error: + +$$ +\begin{aligned} +S_t &= \sum (y_i - \bar{y})^2 &\text{(total variation in data)}\\ +S_r &= \sum (y_i - \hat{y}_i)^2 &\text{(residual variation — unexplained by the model)}\\ +S_l &= S_t - S_r &\text{(variation explained by the regression line)} +\end{aligned} +$$ +Where: +* $y_i$ = observed data +* $\hat{y}_i$ = predicted data from the model +* $\bar{y}$ = mean of observed data +#### Standard Error of the Estimate +If the scatter of data about the regression line is approximately normal, the **standard error of the estimate** represents the typical deviation of a point from the fitted line: + +$$ +s_{y/x} = \sqrt{\frac{S_r}{n - 2}} +$$ +where $n$ is the number of data points. +Smaller $s_{y/x}$ means the regression line passes closer to the data points. + +#### Coefficient of Determination – ($R^2$) +The coefficient of determination, (R^2), tells us how much of the total variation in (y) is explained by the regression: +$$ +R^2 = \frac{S_l}{S_t} = 1 - \frac{S_r}{S_t} +$$ +- ($R^2$ = 1.0) → perfect fit (all points on the line) +- ($R^2$ = 0) → model explains none of the variation +In engineering terms, a high (R^2) indicates that your model captures most of the physical trend, for example, how deflection scales with load. + +#### Correlation Coefficient – ($r$) +For linear regression, the correlation coefficient (r) is the square root of (R^2), with sign matching the slope of the line: + +$$ +r = \pm \sqrt{R^2} +$$ +- ($r$ > 0): positive correlation (both variables increase together) +- ($r$ < 0): negative correlation (one increases, the other decreases) +## Problem 1: +Fit a linear and polynomial model to stress-strain data. Compute R^2 and discuss which model fits better. + +```python +import numpy as np + +# Example data +x = np.array([0, 1, 2, 3, 4, 5]) +y = np.array([0, 1.2, 2.3, 3.1, 3.9, 5.2]) + +# Linear fit +m, b = np.polyfit(x, y, 1) +y_pred = m*x + b + +# Calculate residuals and metrics +Sr = np.sum((y - y_pred)**2) +St = np.sum((y - np.mean(y))**2) +syx = np.sqrt(Sr / (len(y) - 2)) +R2 = 1 - Sr/St +r = np.sign(m) * np.sqrt(R2) + +print(f"s_y/x = {syx:.3f}") +print(f"R^2 = {R2:.3f}") +print(f"r = {r:.3f}") +``` + +## Extrapolation +Once we have a regression model, it’s tempting to use it to predict values beyond the range of measured data. 
This process is called extrapolation. + +In interpolation, the model is supported by real data on both sides of the point. In extrapolation, we’re assuming that the same physical relationship continues indefinitely and that’s often not true in engineering systems. + +Most regression equations are empirical as they describe the trend in the range of observed conditions but may not capture the true physics. Common issues may originate from nonlinear behavior outside range such as stress–strain curves. Physical limitations, such as below absolute 0 temperatures, or greater than 100% efficiencies. Another case could be where the mechanism changes in the real world making the model inapplicable such as heat transfer switching from convection to radiation at higher temperatures. + +Some guidelines of using extrapolation: +- Plot the data used for fitting +- Avoid predicting far beyond the range of your data unless supported by physical models +## Moving average +Real experimental data often contains small random fluctuations that obscure the underlying trend a.k.a. noise. Rather than fitting a complex equation, we can smooth the data using a moving average, which replaces each point with the average of its nearby neighbors. This simple method reduces random variation while preserving the overall shape of the signal. + +A moving average or rolling mean takes the average over a sliding window of data points given by the equation: +$$\bar{y}_i = \frac{1}{N} \sum_{j=i-k}^{i+k} y_j$$ +where: +- $N$ = window size (number of points averaged), +- $k = (N-1)/2$ if the window is centered, +- $y_j$​ = original data values. + +If you select a larger window you'll have a smoother curve, but you loose detail. A smaller windows retains more detail but reduces less noise. +### Example: Smoothing sensor noise +```python +import numpy as np +import matplotlib.pyplot as plt +import pandas as pd + +# Generate noisy signal +x = np.linspace(0, 4*np.pi, 100) +y = np.sin(x) + 0.2*np.random.randn(100) + +# Apply moving average with different window sizes +df = pd.DataFrame({'x': x, 'y': y}) +df['y_smooth_5'] = df['y'].rolling(window=5, center=True).mean() +df['y_smooth_15'] = df['y'].rolling(window=15, center=True).mean() + +plt.plot(df['x'], df['y'], 'k.', alpha=0.4, label='Raw data') +plt.plot(df['x'], df['y_smooth_5'], 'r-', label='Window = 5') +plt.plot(df['x'], df['y_smooth_15'], 'b-', label='Window = 15') +plt.xlabel('Time (s)') +plt.ylabel('Signal') +plt.title('Effect of Moving Average Window Size') +plt.legend() +plt.show() +``` + +## Problem 2: Moving average +Apply a moving average to noisy temperature data and compare raw vs. smoothed signals + diff --git a/tutorials/module_4/4.6 Data Filtering and Signal Processing.md b/tutorials/module_4/4.6 Data Filtering and Signal Processing.md deleted file mode 100644 index 74b7cb0..0000000 --- a/tutorials/module_4/4.6 Data Filtering and Signal Processing.md +++ /dev/null @@ -1,225 +0,0 @@ -# Data Filtering and Signal Processing - -**Learning Objectives** - -- Understand the purpose of filtering in experimental and computational data -- Differentiate between noise, bias, and true signal -- Apply time-domain and frequency-domain filters to remove unwanted noise -- Introduce basic spatial (2-D) filtering for imaging or contour data -- Interpret filter performance and trade-offs (cutoff frequency, phase lag) - ---- -## What is data filtering and why does it matter? 
- -Filtering is a process in signal processing to remove unwanted parts of the signal within certain frequency range. Low-pass filters remove all signals above certain cut-off frequency; high-pass filters do the opposite. Combining low- and high-pass filters allows constructing a band-pass filter, which means we only keep the signals within a pair of frequencies. - -Measurements from sensors, test rigs, or simulations are rarely perfect. Electrical interference, vibration, quantization error, or even airflow turbulence can create random variations that obscure the trend. - -Different filtering methods are used depending on the data set type, the nature of the noise, and the desired outcome, whether it’s removing interference, detecting anomalies, or smoothing fluctuations in time-series data. Choosing the right filter ensures cleaner, more reliable data for analysis and decision-making. - -**Key Data filtering Methods** - -| **Filtering Method** | **Types of Filters** | **Purpose** | **Applications** | -| --------------------------------------------------- | ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | -| Frequency-based filters (signal processing filters) | - Low-pass
- High-pass
- Bandpass
- Bandstop (notch) | Remove or retain specific frequency components in data | Noise reduction (e.g., low-pass filtering to remove high-frequency noise), image processing, sensor data analysis | -| Smoothing filters (statistical methods) | - Median filter
- Moving average
- Gaussian filter
- Exponential smoothing | Smooth data by reducing noise and variability | Time-series smoothing, image processing, outlier removal | -| Rule-based filters (conditional filtering) | - Threshold filters (e.g., greater than, less than)
- Rule-based filters | Filter data based on predefined logical conditions | Data cleaning, outlier detection, quality control | -| Trend-based filters (time-series methods) | - Hodrick-Prescott filter
- Kalman filter
- Wavelet filter | Identify trends, remove seasonality, smooth fluctuations | Stock market analysis, climate data, sensor monitoring | -| Machine learning–based filters | - Anomaly detection algorithms
- Autoencoders
- Clustering-based filtering | Use AI and machine learning to detect and remove noisy or irrelevant data | Fraud detection, predictive maintenance, automated data cleaning | - -## Frequency domain basics -So far, we’ve looked at data in the time domain, for example, Temperature(t), Pressure(t) or Displacement(t). The frequency domain is a different way of looking at the same information. Instead of asking _“how does this signal change over time?”_, we ask *“What frequencies make up this signal?”*. Every repeating vibration, oscillation, or wave can be described by the frequencies it contains. Transforming to the frequency domain helps us see the hidden structure of a signal, especially when it’s a mix of multiple oscillations. - -Let's consider the vibration of a shaft due to two rotating components, one at 10 Hz and another at 50 Hz. In the time domain, the combined signal looks complicated, but in the frequency domain, two clear peaks appear at 10 Hz and 50 Hz, instantly revealing the underlying behavior. - -The mathematical tool that converts a time-domain signal to its frequency components is the Fourier Transform. -$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2 \pi f t}\, dt$$ -In practice, we use the Discrete Fourier Transform (DFT) or its efficient implementation, the Fast Fourier Transform (FFT). This process decomposes any signal into a sum of sine and cosine waves at different frequencies. -##### Visualization -```python -import numpy as np -import matplotlib.pyplot as plt - -# Create a time vector (1 second at 1000 Hz sampling) -fs = 1000 -t = np.linspace(0, 1, fs, endpoint=False) - -# Signal: 10 Hz + 50 Hz sine waves + noise -x = np.sin(2*np.pi*10*t) + 0.5*np.sin(2*np.pi*50*t) + 0.2*np.random.randn(len(t)) - -# Compute FFT -X = np.fft.fft(x) -freq = np.fft.fftfreq(len(x), d=1/fs) - -# Plot magnitude spectrum (only positive frequencies) -plt.figure(figsize=(10,5)) -plt.plot(freq[:fs//2], np.abs(X)[:fs//2]) -plt.title('Frequency Domain Representation') -plt.xlabel('Frequency [Hz]') -plt.ylabel('Amplitude') -plt.grid(True) -plt.show() - -``` - -## Fourier transform overiew (numpy.fft, scipy.fft) -The Fourier Transform is the mathematical bridge between the time and frequency domains. In python we can use both numpy and scipy to perform DFT on a set of data in the time domain. For a discrete signal $x[n]$ sampled at uniform intervals we write the fourier transform function as: -$$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j \frac{2\pi}{N}kn}$$ -where: -- $x[n]$= nth sample of the time-domain signal -- $X[k]$= kth frequency component (complex number) -- $N$ = total number of samples -- $e^{-j 2\pi kn / N}$ = basis functions complex exponentials representing sinusoids of different frequencies - -Both numpy and Scipy use Fast Fourier Transform (FFT) which is an algorithm that computes the DFT using the equation above. For this section we will use Scipy as it has a modern algorithm which is optimized for larger data sizes. 
- -```python -import numpy as np -import matplotlib.pyplot as plt -from scipy.fft import fft, fftfreq - -# --- Create a sample signal --- -fs = 1000 # sampling frequency [Hz] -T = 1/fs # sampling period [s] -t = np.arange(0, 1, T) # 1 second of data - -# signal: two sine waves + random noise -x = np.sin(2*np.pi*10*t) + 0.5*np.sin(2*np.pi*50*t) + 0.2*np.random.randn(len(t)) - -# --- Compute FFT --- -N = len(x) -X = fft(x) -freq = fftfreq(N, T)[:N//2] # positive frequencies only - -# --- Plot results --- -plt.figure(figsize=(12,5)) -plt.subplot(1,2,1) -plt.plot(t, x) -plt.title('Time Domain') -plt.xlabel('Time [s]') -plt.ylabel('Amplitude') - -plt.subplot(1,2,2) -plt.plot(freq, 2/N * np.abs(X[:N//2])) # scaled magnitude spectrum -plt.title('Frequency Domain') -plt.xlabel('Frequency [Hz]') -plt.ylabel('Amplitude') -plt.tight_layout() -plt.show() - -``` - - -## Low-pass and high-pass filters (scipy.singla.butter, filtfilt) -After analyzing signals in the frequency domain, we often want to keep only the frequencies that matter and remove those that don’t. You may have encountered a filters when dealing with circuitry, we can also apply a filter digitally to a signal. - -The Butterworth filter is one of the most common digital filters because it has a smooth, flat frequency response in the passband (no ripple). - -Its gain function is: -$$|H(\omega)| = \frac{1}{\sqrt{1 + \left( \frac{\omega}{\omega_c} \right)^{2n}}}$$ -where: -- $\omega_c$​ = cut-off frequency -- $n$ = filter order (higher = sharper roll-off) - -In Python, `scipy.signal.butter` designs this filter, and `scipy.signal.filtfilt` applies it _forward and backward_ to avoid phase shift. -## Example: Removing high-frequency noise from a displacement signal - -```python -import numpy as np -import matplotlib.pyplot as plt -from scipy.signal import butter, filtfilt - -# --- Create a noisy signal --- -fs = 1000 # sampling frequency [Hz] -t = np.linspace(0, 1, fs) -signal = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*50*t) # 5 Hz + 50 Hz components -noisy_signal = signal + 0.3*np.random.randn(len(t)) - -# --- Design a low-pass Butterworth filter --- -cutoff = 10 # desired cutoff frequency [Hz] -order = 4 -b, a = butter(order, cutoff/(fs/2), btype='low', analog=False) - -# --- Apply the filter --- -filtered = filtfilt(b, a, noisy_signal) - -# --- Plot results --- -plt.figure(figsize=(10,5)) -plt.plot(t, noisy_signal, 'gray', alpha=0.5, label='Noisy Signal') -plt.plot(t, filtered, 'r', linewidth=2, label='Low-pass Filtered') -plt.xlabel('Time [s]') -plt.ylabel('Amplitude') -plt.title('Low-pass Butterworth Filter') -plt.legend() -plt.show() - -``` - -```python -# --- Design a high-pass filter --- -cutoff_hp = 20 -b_hp, a_hp = butter(order, cutoff_hp/(fs/2), btype='high', analog=False) -filtered_hp = filtfilt(b_hp, a_hp, noisy_signal) - -# --- Plot comparison --- -plt.figure(figsize=(10,5)) -plt.plot(t, noisy_signal, 'gray', alpha=0.5, label='Original Signal') -plt.plot(t, filtered_hp, 'b', linewidth=2, label='High-pass Filtered') -plt.xlabel('Time [s]') -plt.ylabel('Amplitude') -plt.title('High-pass Butterworth Filter') -plt.legend() -plt.show() - -``` - -## Example: Removing noise from an image to help for further analysis (for PIV) - -```python -import numpy as np -import matplotlib.pyplot as plt -from scipy.fft import fft2, ifft2, fftshift, ifftshift -from skimage import data, img_as_float -from skimage.util import random_noise - -# --- Load and corrupt an image --- -image = img_as_float(data.camera()) # grayscale test image -noisy = 
random_noise(image, mode='s&p', amount=0.05) # add salt & pepper noise - -# --- 2D FFT --- -F = fft2(noisy) -Fshift = fftshift(F) # move zero frequency to center - -# --- Build a circular low-pass mask --- -rows, cols = noisy.shape -crow, ccol = rows//2, cols//2 -radius = 30 # cutoff radius in frequency domain -mask = np.zeros_like(noisy) -Y, X = np.ogrid[:rows, :cols] -dist = np.sqrt((X-ccol)**2 + (Y-crow)**2) -mask[dist <= radius] = 1 - -# --- Apply mask and inverse FFT --- -Fshift_filtered = Fshift * mask -F_ishift = ifftshift(Fshift_filtered) -filtered = np.real(ifft2(F_ishift)) - -# --- Plot results --- -fig, ax = plt.subplots(1, 3, figsize=(12,5)) -ax[0].imshow(noisy, cmap='gray') -ax[0].set_title('Noisy Image') -ax[1].imshow(np.log(1+np.abs(Fshift)), cmap='gray') -ax[1].set_title('FFT Magnitude Spectrum') -ax[2].imshow(filtered, cmap='gray') -ax[2].set_title('Low-pass Filtered Image') -for a in ax: a.axis('off') -plt.tight_layout() -plt.show() -``` - -## Problem 1: -Generate a synthetic signal (sum of two sine waves+random noise). Apply a moving average and FFT to show frequency components.) - - -## Problem 2: -Design a Butterworkth low-pass filter to isolate the funcamental frequency of a vibration signal (e.g. roating machinery). Plot before and after. \ No newline at end of file diff --git a/tutorials/module_4/4.6_Data_Filtering_and_Signal_Processing.md b/tutorials/module_4/4.6_Data_Filtering_and_Signal_Processing.md new file mode 100644 index 0000000..74b7cb0 --- /dev/null +++ b/tutorials/module_4/4.6_Data_Filtering_and_Signal_Processing.md @@ -0,0 +1,225 @@ +# Data Filtering and Signal Processing + +**Learning Objectives** + +- Understand the purpose of filtering in experimental and computational data +- Differentiate between noise, bias, and true signal +- Apply time-domain and frequency-domain filters to remove unwanted noise +- Introduce basic spatial (2-D) filtering for imaging or contour data +- Interpret filter performance and trade-offs (cutoff frequency, phase lag) + +--- +## What is data filtering and why does it matter? + +Filtering is a process in signal processing to remove unwanted parts of the signal within certain frequency range. Low-pass filters remove all signals above certain cut-off frequency; high-pass filters do the opposite. Combining low- and high-pass filters allows constructing a band-pass filter, which means we only keep the signals within a pair of frequencies. + +Measurements from sensors, test rigs, or simulations are rarely perfect. Electrical interference, vibration, quantization error, or even airflow turbulence can create random variations that obscure the trend. + +Different filtering methods are used depending on the data set type, the nature of the noise, and the desired outcome, whether it’s removing interference, detecting anomalies, or smoothing fluctuations in time-series data. Choosing the right filter ensures cleaner, more reliable data for analysis and decision-making. + +**Key Data filtering Methods** + +| **Filtering Method** | **Types of Filters** | **Purpose** | **Applications** | +| --------------------------------------------------- | ----------------------------------------------------------------------------------- | ------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | +| Frequency-based filters (signal processing filters) | - Low-pass
- High-pass
- Bandpass
- Bandstop (notch) | Remove or retain specific frequency components in data | Noise reduction (e.g., low-pass filtering to remove high-frequency noise), image processing, sensor data analysis | +| Smoothing filters (statistical methods) | - Median filter
- Moving average
- Gaussian filter
- Exponential smoothing | Smooth data by reducing noise and variability | Time-series smoothing, image processing, outlier removal | +| Rule-based filters (conditional filtering) | - Threshold filters (e.g., greater than, less than)
- Rule-based filters | Filter data based on predefined logical conditions | Data cleaning, outlier detection, quality control | +| Trend-based filters (time-series methods) | - Hodrick-Prescott filter
- Kalman filter
- Wavelet filter | Identify trends, remove seasonality, smooth fluctuations | Stock market analysis, climate data, sensor monitoring | +| Machine learning–based filters | - Anomaly detection algorithms
- Autoencoders
- Clustering-based filtering | Use AI and machine learning to detect and remove noisy or irrelevant data | Fraud detection, predictive maintenance, automated data cleaning | + +## Frequency domain basics +So far, we’ve looked at data in the time domain, for example, Temperature(t), Pressure(t) or Displacement(t). The frequency domain is a different way of looking at the same information. Instead of asking _“how does this signal change over time?”_, we ask *“What frequencies make up this signal?”*. Every repeating vibration, oscillation, or wave can be described by the frequencies it contains. Transforming to the frequency domain helps us see the hidden structure of a signal, especially when it’s a mix of multiple oscillations. + +Let's consider the vibration of a shaft due to two rotating components, one at 10 Hz and another at 50 Hz. In the time domain, the combined signal looks complicated, but in the frequency domain, two clear peaks appear at 10 Hz and 50 Hz, instantly revealing the underlying behavior. + +The mathematical tool that converts a time-domain signal to its frequency components is the Fourier Transform. +$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2 \pi f t}\, dt$$ +In practice, we use the Discrete Fourier Transform (DFT) or its efficient implementation, the Fast Fourier Transform (FFT). This process decomposes any signal into a sum of sine and cosine waves at different frequencies. +##### Visualization +```python +import numpy as np +import matplotlib.pyplot as plt + +# Create a time vector (1 second at 1000 Hz sampling) +fs = 1000 +t = np.linspace(0, 1, fs, endpoint=False) + +# Signal: 10 Hz + 50 Hz sine waves + noise +x = np.sin(2*np.pi*10*t) + 0.5*np.sin(2*np.pi*50*t) + 0.2*np.random.randn(len(t)) + +# Compute FFT +X = np.fft.fft(x) +freq = np.fft.fftfreq(len(x), d=1/fs) + +# Plot magnitude spectrum (only positive frequencies) +plt.figure(figsize=(10,5)) +plt.plot(freq[:fs//2], np.abs(X)[:fs//2]) +plt.title('Frequency Domain Representation') +plt.xlabel('Frequency [Hz]') +plt.ylabel('Amplitude') +plt.grid(True) +plt.show() + +``` + +## Fourier transform overiew (numpy.fft, scipy.fft) +The Fourier Transform is the mathematical bridge between the time and frequency domains. In python we can use both numpy and scipy to perform DFT on a set of data in the time domain. For a discrete signal $x[n]$ sampled at uniform intervals we write the fourier transform function as: +$$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j \frac{2\pi}{N}kn}$$ +where: +- $x[n]$= nth sample of the time-domain signal +- $X[k]$= kth frequency component (complex number) +- $N$ = total number of samples +- $e^{-j 2\pi kn / N}$ = basis functions complex exponentials representing sinusoids of different frequencies + +Both numpy and Scipy use Fast Fourier Transform (FFT) which is an algorithm that computes the DFT using the equation above. For this section we will use Scipy as it has a modern algorithm which is optimized for larger data sizes. 
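At its core the workflow needs only two calls: `scipy.fft.fft` to compute the complex spectrum and `scipy.fft.fftfreq` to build the matching frequency axis. Here is a minimal sketch using a single 5 Hz test tone (the sampling rate and signal are arbitrary choices for illustration); the full example below applies the same pattern to a noisy, multi-component signal.

```python
import numpy as np
from scipy.fft import fft, fftfreq

fs = 100                          # sampling rate [Hz] (arbitrary for this sketch)
t = np.arange(0, 1, 1/fs)         # 1 second of samples
x = np.sin(2*np.pi*5*t)           # 5 Hz test tone

X = fft(x)                        # complex spectrum
freqs = fftfreq(len(x), d=1/fs)   # frequency of each FFT bin [Hz]

# The largest peak among the positive-frequency bins should sit at ~5 Hz
half = len(x) // 2
peak = freqs[np.argmax(np.abs(X[:half]))]
print(f"dominant frequency: {peak:.1f} Hz")
```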
+ +```python +import numpy as np +import matplotlib.pyplot as plt +from scipy.fft import fft, fftfreq + +# --- Create a sample signal --- +fs = 1000 # sampling frequency [Hz] +T = 1/fs # sampling period [s] +t = np.arange(0, 1, T) # 1 second of data + +# signal: two sine waves + random noise +x = np.sin(2*np.pi*10*t) + 0.5*np.sin(2*np.pi*50*t) + 0.2*np.random.randn(len(t)) + +# --- Compute FFT --- +N = len(x) +X = fft(x) +freq = fftfreq(N, T)[:N//2] # positive frequencies only + +# --- Plot results --- +plt.figure(figsize=(12,5)) +plt.subplot(1,2,1) +plt.plot(t, x) +plt.title('Time Domain') +plt.xlabel('Time [s]') +plt.ylabel('Amplitude') + +plt.subplot(1,2,2) +plt.plot(freq, 2/N * np.abs(X[:N//2])) # scaled magnitude spectrum +plt.title('Frequency Domain') +plt.xlabel('Frequency [Hz]') +plt.ylabel('Amplitude') +plt.tight_layout() +plt.show() + +``` + + +## Low-pass and high-pass filters (scipy.singla.butter, filtfilt) +After analyzing signals in the frequency domain, we often want to keep only the frequencies that matter and remove those that don’t. You may have encountered a filters when dealing with circuitry, we can also apply a filter digitally to a signal. + +The Butterworth filter is one of the most common digital filters because it has a smooth, flat frequency response in the passband (no ripple). + +Its gain function is: +$$|H(\omega)| = \frac{1}{\sqrt{1 + \left( \frac{\omega}{\omega_c} \right)^{2n}}}$$ +where: +- $\omega_c$​ = cut-off frequency +- $n$ = filter order (higher = sharper roll-off) + +In Python, `scipy.signal.butter` designs this filter, and `scipy.signal.filtfilt` applies it _forward and backward_ to avoid phase shift. +## Example: Removing high-frequency noise from a displacement signal + +```python +import numpy as np +import matplotlib.pyplot as plt +from scipy.signal import butter, filtfilt + +# --- Create a noisy signal --- +fs = 1000 # sampling frequency [Hz] +t = np.linspace(0, 1, fs) +signal = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*50*t) # 5 Hz + 50 Hz components +noisy_signal = signal + 0.3*np.random.randn(len(t)) + +# --- Design a low-pass Butterworth filter --- +cutoff = 10 # desired cutoff frequency [Hz] +order = 4 +b, a = butter(order, cutoff/(fs/2), btype='low', analog=False) + +# --- Apply the filter --- +filtered = filtfilt(b, a, noisy_signal) + +# --- Plot results --- +plt.figure(figsize=(10,5)) +plt.plot(t, noisy_signal, 'gray', alpha=0.5, label='Noisy Signal') +plt.plot(t, filtered, 'r', linewidth=2, label='Low-pass Filtered') +plt.xlabel('Time [s]') +plt.ylabel('Amplitude') +plt.title('Low-pass Butterworth Filter') +plt.legend() +plt.show() + +``` + +```python +# --- Design a high-pass filter --- +cutoff_hp = 20 +b_hp, a_hp = butter(order, cutoff_hp/(fs/2), btype='high', analog=False) +filtered_hp = filtfilt(b_hp, a_hp, noisy_signal) + +# --- Plot comparison --- +plt.figure(figsize=(10,5)) +plt.plot(t, noisy_signal, 'gray', alpha=0.5, label='Original Signal') +plt.plot(t, filtered_hp, 'b', linewidth=2, label='High-pass Filtered') +plt.xlabel('Time [s]') +plt.ylabel('Amplitude') +plt.title('High-pass Butterworth Filter') +plt.legend() +plt.show() + +``` + +## Example: Removing noise from an image to help for further analysis (for PIV) + +```python +import numpy as np +import matplotlib.pyplot as plt +from scipy.fft import fft2, ifft2, fftshift, ifftshift +from skimage import data, img_as_float +from skimage.util import random_noise + +# --- Load and corrupt an image --- +image = img_as_float(data.camera()) # grayscale test image +noisy = 
random_noise(image, mode='s&p', amount=0.05) # add salt & pepper noise + +# --- 2D FFT --- +F = fft2(noisy) +Fshift = fftshift(F) # move zero frequency to center + +# --- Build a circular low-pass mask --- +rows, cols = noisy.shape +crow, ccol = rows//2, cols//2 +radius = 30 # cutoff radius in frequency domain +mask = np.zeros_like(noisy) +Y, X = np.ogrid[:rows, :cols] +dist = np.sqrt((X-ccol)**2 + (Y-crow)**2) +mask[dist <= radius] = 1 + +# --- Apply mask and inverse FFT --- +Fshift_filtered = Fshift * mask +F_ishift = ifftshift(Fshift_filtered) +filtered = np.real(ifft2(F_ishift)) + +# --- Plot results --- +fig, ax = plt.subplots(1, 3, figsize=(12,5)) +ax[0].imshow(noisy, cmap='gray') +ax[0].set_title('Noisy Image') +ax[1].imshow(np.log(1+np.abs(Fshift)), cmap='gray') +ax[1].set_title('FFT Magnitude Spectrum') +ax[2].imshow(filtered, cmap='gray') +ax[2].set_title('Low-pass Filtered Image') +for a in ax: a.axis('off') +plt.tight_layout() +plt.show() +``` + +## Problem 1: +Generate a synthetic signal (sum of two sine waves+random noise). Apply a moving average and FFT to show frequency components.) + + +## Problem 2: +Design a Butterworkth low-pass filter to isolate the funcamental frequency of a vibration signal (e.g. roating machinery). Plot before and after. \ No newline at end of file diff --git a/tutorials/module_4/4.7 Data Visualization and Presentation.md b/tutorials/module_4/4.7 Data Visualization and Presentation.md deleted file mode 100644 index f97721b..0000000 --- a/tutorials/module_4/4.7 Data Visualization and Presentation.md +++ /dev/null @@ -1,72 +0,0 @@ -#data #visualization -# Data Visualization and Presentation - -## How to represent data scientifically - -Remember PCC: -1. Purpose -2. Composition -3. Color - -## Plotting with Matplotlib -### Simple plot -You've probably seen the `matplotlib` package being imported at the top of the scripts. Matplotlib allows us to create static, animated and interactive visualizations in Python. It can even create publication quality plots. - -Initialize -```python -import numpy as np -import matplotlib.pyplot as plt -``` -Prepare data -```python -x = np.linspace(0,10*np.pi,1000) -y = np.sin(x) -``` -Render -```python -fig, ax = plt.subplots() -ax.plot(X,Y) -plt.show() -``` - - -### Customizing plots -subplots, twin axis, labels, annotations - -Colormaps and figure aesthetics. - - - - -### Other types of plots -- `scatter` -- `bar` -- `imshow` -- `contourf` -- `pie` -- `hist` -- `errorbar` -- `boxplot` - -## Plotting different types of data - - - -## Plotting for reports and publication quality graphs -Now that you've - - -### Saving figures -save formats - figure size - bitmap vs vector format - - - - -## Problem: -Using pandas to plot spectroscopy data from raw data - - -## Problem: -Create a muli-panel figure showing raw data, fitted curve and residuals. Format with consistent style, legend, and color scheme for publication-ready quality. \ No newline at end of file diff --git a/tutorials/module_4/4.7_Data_Visualization_and_Presentation.md b/tutorials/module_4/4.7_Data_Visualization_and_Presentation.md new file mode 100644 index 0000000..f97721b --- /dev/null +++ b/tutorials/module_4/4.7_Data_Visualization_and_Presentation.md @@ -0,0 +1,72 @@ +#data #visualization +# Data Visualization and Presentation + +## How to represent data scientifically + +Remember PCC: +1. Purpose +2. Composition +3. Color + +## Plotting with Matplotlib +### Simple plot +You've probably seen the `matplotlib` package being imported at the top of the scripts. 
Matplotlib allows us to create static, animated, and interactive visualizations in Python. It can even create publication-quality plots.
+
+Initialize
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+```
+Prepare data
+```python
+x = np.linspace(0, 10*np.pi, 1000)
+y = np.sin(x)
+```
+Render
+```python
+fig, ax = plt.subplots()
+ax.plot(x, y)
+plt.show()
+```
+
+
+### Customizing plots
+subplots, twin axis, labels, annotations
+
+Colormaps and figure aesthetics.
+
+
+
+
+### Other types of plots
+- `scatter`
+- `bar`
+- `imshow`
+- `contourf`
+- `pie`
+- `hist`
+- `errorbar`
+- `boxplot`
+
+## Plotting different types of data
+
+
+
+## Plotting for reports and publication quality graphs
+Now that you've seen how to build basic plots, let's look at how to format figures so they are clear and consistent enough for reports and publications.
+
+
+### Saving figures
+- save formats
+- figure size
+- bitmap vs vector format
+
+
+
+
+## Problem:
+Use pandas to plot the raw spectroscopy data.
+
+
+## Problem:
+Create a multi-panel figure showing raw data, fitted curve and residuals. Format with consistent style, legend, and color scheme for publication-ready quality.
\ No newline at end of file
-- 
cgit v1.2.3