path: root/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
Diffstat (limited to 'tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md')
-rw-r--r-- tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md 109
1 files changed, 67 insertions, 42 deletions
diff --git a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md b/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
index d52c33c..3ad34e4 100644
--- a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
+++ b/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
@@ -3,7 +3,6 @@
**Learning objectives:**
- Understand what makes data “scientific” (units, precision, metadata)
-- Recognize types of data: time-series, experimental, simulation, and imaging data
- Identify challenges in data processing (missing data, noise, outliers)
- Overview of the data-analysis workflow
---
@@ -14,6 +13,11 @@ We may collect this in the following ways:
- **Experiments** – temperature readings from thermocouples, strain or force from sensors, vibration accelerations, or flow velocities.
- **Simulations** – outputs from finite-element or CFD models such as pressure, stress, or temperature distributions.
- **Instrumentation and sensors** – digital or analog signals from transducers, encoders, or DAQ systems.
+## Data processing workflow
+```mermaid
+flowchart
+ A[Collecting] --> B[Cleaning & Filtering] --> C[Analysis] --> D[Visualization]
+```
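The four stages in the flowchart can be sketched in pandas as follows. This is only a minimal illustration: the temperature readings, the column names, and the 3-point rolling median are assumptions made for the example, not part of the course data.

```python
import numpy as np
import pandas as pd

# 1. Collecting: hypothetical thermocouple readings with one missing sample
raw = pd.DataFrame({
    "time_s": [0, 1, 2, 3, 4, 5],
    "temp_C": [20.0, 20.5, np.nan, 21.5, 22.0, 22.5],
})

# 2. Cleaning & filtering: fill the gap by linear interpolation,
#    then smooth with a centered 3-point rolling median
clean = raw.copy()
clean["temp_C"] = clean["temp_C"].interpolate()
clean["temp_smooth_C"] = (
    clean["temp_C"].rolling(window=3, center=True, min_periods=1).median()
)

# 3. Analysis: a simple summary statistic
mean_temp = clean["temp_C"].mean()
print(f"Mean temperature: {mean_temp:.2f} C")

# 4. Visualization would typically follow, for example
#    clean.plot(x="time_s", y="temp_C")  (requires matplotlib)
```

Whether interpolation is an appropriate way to fill a gap depends on the measurement; dropping the row with `dropna()` is the more conservative choice.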
## Introduction to pandas
`pandas` (**Pan**el **Da**ta) is a Python library designed for data analysis and manipulation, widely used in engineering, science, and data analytics. It provides two core data structures: the **Series** and the **DataFrame**.
@@ -21,45 +25,26 @@ A `Series` represents a single column or one-dimensional labeled array, while
DataFrames can be created from dictionaries, lists, NumPy arrays, or imported from external files such as CSV or Excel. Once data is loaded, you can **view and explore** it using methods like `head()`, `tail()`, and `describe()`. Data can be **selected by label** or **by position**. These indexing systems make it easy to slice, filter, and reorganize datasets efficiently.
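As a small sketch of the exploration and indexing methods mentioned above, the example below builds a DataFrame with made-up values and a hypothetical string index, then selects the same cell by label (`loc`) and by position (`iloc`).

```python
import pandas as pd

# Hypothetical data, used only to illustrate the methods
df_demo = pd.DataFrame(
    {"Time_s": [0.0, 0.5, 1.0, 1.5], "Force_N": [12.0, 48.0, 55.0, 61.0]},
    index=["a", "b", "c", "d"],
)

print(df_demo.head(2))      # first two rows
print(df_demo.tail(2))      # last two rows
print(df_demo.describe())   # summary statistics per numeric column

# Selection by label: row "b", column "Force_N"
force_b = df_demo.loc["b", "Force_N"]

# Selection by position: second row, second column (the same cell)
force_second = df_demo.iloc[1, 1]

# Label slices include the end label; position slices exclude the end index
rows_by_label = df_demo.loc["a":"c"]   # rows a, b, c
rows_by_position = df_demo.iloc[0:2]   # rows a, b
```

The difference in slice endpoints between `loc` and `iloc` is a common source of off-by-one mistakes.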
+### Problem 1: Create a dataframe from an array
+Given the data `force_N` and `time_s`, use pandas to tabulate them into a DataFrame.
-### Problem 1: Create a dataframe from a text file
-Given the the file `force_displacement_data.txt`. Use pandas to tabulate the data into a dataframe
```python
import pandas as pd
-file_path = "force_displacement_data.txt"
+force_N = [10, 20, 30, 25, 15]
+time_s = [0, 1, 2, 3, 4]
-df_txt = pd.read_csv(
- file_path,
- delim_whitespace=True,
- comment="#",
- skiprows=0,
- header=0
-)
+df = pd.DataFrame({
+ 'Time (s)': time_s,
+ 'Force (N)': force_N
+})
print("\n=== Basic Statistics ===")
-print(df_txt.describe())232
-
-if "Force_N" in df_txt.columns:
- print("\nFirst five Force readings:")
- print(df_txt["Force_N"].head())
-
-try:
- import matplotlib.pyplot as plt
-
- plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1])
- plt.xlabel(df_txt.columns[0])
- plt.ylabel(df_txt.columns[1])
- plt.title("Loaded Data from Text File")
- plt.grid(True)
- plt.show()
-
-except ImportError:
- print("\nmatplotlib not installed — skipping plot.")
-
+print(df.describe())
```
-
-## Subsetting and Conditional filtering
+Notice how the `describe()` function outputs summary statistics that we may find useful.
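The result of `describe()` is itself a DataFrame, so individual statistics can be read out with `loc`. The sketch below rebuilds the data from Problem 1 so that it runs on its own; the variable names `stats`, `mean_force`, and `max_force` are chosen for illustration.

```python
import pandas as pd

# Same data as in Problem 1
df = pd.DataFrame({
    "Time (s)": [0, 1, 2, 3, 4],
    "Force (N)": [10, 20, 30, 25, 15],
})

stats = df.describe()

# Rows of `stats` are labeled count, mean, std, min, 25%, 50%, 75%, max
mean_force = stats.loc["mean", "Force (N)"]
max_force = stats.loc["max", "Force (N)"]

print(f"Mean force: {mean_force} N, peak force: {max_force} N")
```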
+### Manipulating dataframes
+#### Subsets and Conditional filtering
You can select rows, columns, or specific conditions from a DataFrame.
```python
@@ -73,10 +58,9 @@ subset = df[["Time_s", "Force_N"]]
df_high_force = df[df["Force_N"] > 50]
```
-
![[Pasted image 20251013064718.png]]
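Conditions can also be combined. The sketch below uses made-up values and keeps only rows where the force exceeds 50 N at or after 2 s. Each condition needs its own parentheses, and pandas uses `&` and `|` rather than Python's `and` and `or`.

```python
import pandas as pd

# Hypothetical readings for illustration
df = pd.DataFrame({
    "Time_s": [0, 1, 2, 3, 4],
    "Force_N": [10, 60, 75, 40, 90],
})

# Boolean mask: True where both conditions hold
mask = (df["Force_N"] > 50) & (df["Time_s"] >= 2)

# Select matching rows and a subset of columns in one step
late_high_force = df.loc[mask, ["Time_s", "Force_N"]]
print(late_high_force)
```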
-## Combining and Merging Datasets
+#### Combining and Merging Datasets
Often, multiple sensors or experiments must be merged into one dataset for analysis.
```python
@@ -88,21 +72,62 @@ combined = pd.concat([df_run1, df_run2], axis=0)
```
-## Problem 1: Describe a dataset
-Use pandas built-in describe data to report on the statistical mean of the given experimental data.
+https://pandas.pydata.org/docs/user_guide/merging.html
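`pd.concat` stacks datasets that share the same columns. When two sensors record different quantities against a shared key, `pd.merge` aligns them on that key instead. The example below is a minimal sketch with made-up force and displacement tables joined on an assumed `Time_s` column.

```python
import pandas as pd

# Hypothetical outputs from two sensors sampled at slightly different times
force = pd.DataFrame({"Time_s": [0, 1, 2, 3], "Force_N": [10, 20, 30, 25]})
disp = pd.DataFrame({"Time_s": [0, 1, 2, 4], "Displacement_mm": [0.0, 0.1, 0.2, 0.4]})

# Inner join: keep only timestamps present in both tables
merged_inner = pd.merge(force, disp, on="Time_s", how="inner")

# Outer join: keep every timestamp; missing readings become NaN
merged_outer = pd.merge(force, disp, on="Time_s", how="outer")

print(merged_inner)
print(merged_outer)
```

An inner join avoids introducing missing values, while an outer join preserves every sample and leaves the gaps to be handled during cleaning.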
+
+#### Creating new columns based on existing ones
+<img src="image_1761156588079.png" width="600">
+Much like Excel, pandas lets you manipulate columns by referring to their header names. In this example we multiply a DataFrame column by a constant.
```python
-import matplotlib.pyplot as plt
+air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
+```
-plt.plot(df["Time_s"], df["Force_N"])
-plt.xlabel("Time (s)")
-plt.ylabel("Force (N)")
-plt.title("Force vs. Time")
-plt.show()
+We may also want to compute the new column as a function of other columns. We can do so by applying a mathematical operation as follows:
+```python
+air_quality["ratio_paris_antwerp"] = (
+ air_quality["station_paris"] / air_quality["station_antwerp"]
+ )
```
+https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html
+https://pandas.pydata.org/docs/user_guide/reshaping.html
+### Problem 2: Create a dataframe from a text file
+Given the file `force_displacement_data.txt`, use pandas to tabulate the data into a DataFrame.
+```python
+import pandas as pd
+
+file_path = "force_displacement_data.txt"
+
+df_txt = pd.read_csv(
+ file_path,
+ sep=r"\s+",  # whitespace-delimited; replaces delim_whitespace=True, deprecated since pandas 2.2
+ comment="#",
+ skiprows=0,
+ header=0
+)
+
+print("\n=== Basic Statistics ===")
+print(df_txt.describe())
+
+if "Force_N" in df_txt.columns:
+ print("\nFirst five Force readings:")
+ print(df_txt["Force_N"].head())
+
+try:
+ import matplotlib.pyplot as plt
+
+ plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1])
+ plt.xlabel(df_txt.columns[0])
+ plt.ylabel(df_txt.columns[1])
+ plt.title("Loaded Data from Text File")
+ plt.grid(True)
+ plt.show()
+
+except ImportError:
+ print("\nmatplotlib not installed — skipping plot.")
+```
**Activities & Examples:**