Diffstat (limited to 'tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md')
| -rw-r--r-- | tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md | 109 |
1 files changed, 67 insertions, 42 deletions
diff --git a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md b/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
index d52c33c..3ad34e4 100644
--- a/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
+++ b/tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
@@ -3,7 +3,6 @@
 **Learning objectives:**
 - Understand what makes data “scientific” (units, precision, metadata)
-- Recognize types of data: time-series, experimental, simulation, and imaging data
 - Identify challenges in data processing (missing data, noise, outliers)
 - Overview of the data-analysis workflow
 ---
@@ -14,6 +13,11 @@ We may collect this in the following ways:
 - **Experiments** – temperature readings from thermocouples, strain or force from sensors, vibration accelerations, or flow velocities.
 - **Simulations** – outputs from finite-element or CFD models such as pressure, stress, or temperature distributions.
 - **Instrumentation and sensors** – digital or analog signals from transducers, encoders, or DAQ systems.
+## Data Processing flow works
+```mermaid
+flowchart
+    A[Collecting] --> B[Cleaning & Filtering] --> C[Analysis] --> D[Visualization]
+```
 ## Introduction to pandas
 `pandas` (**Pan**el **Da**ta) is a Python library designed for data analysis and manipulation, widely used in engineering, science, and data analytics. It provides two core data structures: the **Series** and the **DataFrame**.
@@ -21,45 +25,26 @@ A `Series` represents a single column or one-dimensional labeled array, while
 DataFrames can be created from dictionaries, lists, NumPy arrays, or imported from external files such as CSV or Excel. Once data is loaded, you can **view and explore** it using methods like `head()`, `tail()`, and `describe()`. Data can be **selected by label** or **by position**. These indexing systems make it easy to slice, filter, and reorganize datasets efficiently.
+### Problem 1: Create a dataframe from an array
+Given the data `force_N` and `time_s`
-### Problem 1: Create a dataframe from a text file
-Given the the file `force_displacement_data.txt`. Use pandas to tabulate the data into a dataframe
 ```python
 import pandas as pd
-file_path = "force_displacement_data.txt"
+force_N = [10, 20, 30, 25, 15]
+time_s = [0, 1, 2, 3, 4]
-df_txt = pd.read_csv(
-    file_path,
-    delim_whitespace=True,
-    comment="#",
-    skiprows=0,
-    header=0
-)
+df = pd.DataFrame({
+    'Time (s)': time_s,
+    'Force (N)': force_N
+})
 print("\n=== Basic Statistics ===")
-print(df_txt.describe())232
-
-if "Force_N" in df_txt.columns:
-    print("\nFirst five Force readings:")
-    print(df_txt["Force_N"].head())
-
-try:
-    import matplotlib.pyplot as plt
-
-    plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1])
-    plt.xlabel(df_txt.columns[0])
-    plt.ylabel(df_txt.columns[1])
-    plt.title("Loaded Data from Text File")
-    plt.grid(True)
-    plt.show()
-
-except ImportError:
-    print("\nmatplotlib not installed — skipping plot.")
-
+print(df.describe())
 ```
-
-## Subsetting and Conditional filtering
+Notice how `the describe()` function outputs some statistical data the we may find useful.
+### Manipulating dataframes
+#### Subsets and Conditional filtering
 You can select rows, columns, or specific conditions from a DataFrame.
 ```python
@@ -73,10 +58,9 @@ subset = df[["Time_s", "Force_N"]]
 df_high_force = df[df["Force_N"] > 50]
 ```
-
 ![[Pasted image 20251013064718.png]]
-## Combining and Merging Datasets
+#### Combining and Merging Datasets
 Often, multiple sensors or experiments must be merged into one dataset for analysis.
 ```python
@@ -88,21 +72,62 @@ combined = pd.concat([df_run1, df_run2], axis=0)
 ```
-## Problem 1: Describe a dataset
-Use pandas built-in describe data to report on the statistical mean of the given experimental data.
+https://pandas.pydata.org/docs/user_guide/merging.html
+
+#### Creating new columns based on existing ones
+<img src="image_1761156588079.png" width="600">
+Much like excel, pandas allows you to manipulate columns using the dataframe header. In this examples we want to multiply a dataframe column by a constant.
 ```python
-import matplotlib.pyplot as plt
+air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
+```
-plt.plot(df["Time_s"], df["Force_N"])
-plt.xlabel("Time (s)")
-plt.ylabel("Force (N)")
-plt.title("Force vs. Time")
-plt.show()
+We may want to the new column as a function of other columns, we can do so by simply applying a mathematical operation as follows:
+```python
+air_quality["ratio_paris_antwerp"] = (
+    air_quality["station_paris"] / air_quality["station_antwerp"]
+    )
 ```
+https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html
+https://pandas.pydata.org/docs/user_guide/reshaping.html
+### Problem 1: Create a dataframe from data
+Given the the file `force_displacement_data.txt`. Use pandas to tabulate the data into a dataframe
+```python
+import pandas as pd
+
+file_path = "force_displacement_data.txt"
+
+df_txt = pd.read_csv(
+    file_path,
+    delim_whitespace=True,
+    comment="#",
+    skiprows=0,
+    header=0
+)
+
+print("\n=== Basic Statistics ===")
+print(df_txt.describe())
+
+if "Force_N" in df_txt.columns:
+    print("\nFirst five Force readings:")
+    print(df_txt["Force_N"].head())
+
+try:
+    import matplotlib.pyplot as plt
+
+    plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1])
+    plt.xlabel(df_txt.columns[0])
+    plt.ylabel(df_txt.columns[1])
+    plt.title("Loaded Data from Text File")
+    plt.grid(True)
+    plt.show()
+
+except ImportError:
+    print("\nmatplotlib not installed — skipping plot.")
+```
 **Activities & Examples:**
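
Reviewer's sketch (not part of the commit): a minimal, self-contained run of the steps the revised tutorial describes, namely building a DataFrame from lists, filtering on a condition, and deriving a new column. It assumes the `Time_s` / `Force_N` column names used by the filtering example rather than the `'Time (s)'` / `'Force (N)'` labels in the new Problem 1, since the two do not match in the diff. The 20 N threshold, the `Impulse_Ns` column, and the inline sample text are hypothetical and chosen only for illustration; the contents of `force_displacement_data.txt` are not shown in the diff. The file-reading step uses `sep=r"\s+"`, which recent pandas releases recommend in place of the deprecated `delim_whitespace=True`.

```python
import io

import pandas as pd

# Sample values taken from the new Problem 1 in the diff.
force_N = [10, 20, 30, 25, 15]
time_s = [0, 1, 2, 3, 4]

# Column names follow the later filtering example (Time_s, Force_N).
df = pd.DataFrame({"Time_s": time_s, "Force_N": force_N})

# Conditional filtering: keep rows where the force exceeds a threshold.
# The 20 N threshold is an assumption suited to this small data set.
df_high_force = df[df["Force_N"] > 20]

# Derived column: element-wise arithmetic on existing columns.
# "Impulse_Ns" is a hypothetical name, not one used in the tutorial.
df["Impulse_Ns"] = df["Force_N"] * df["Time_s"]

# Whitespace-delimited text. This inline sample is made up and only
# stands in for force_displacement_data.txt.
sample_text = """# hypothetical sample
Displacement_mm Force_N
0.0 0.0
0.5 12.5
1.0 24.0
"""
df_txt = pd.read_csv(
    io.StringIO(sample_text),
    sep=r"\s+",      # replaces the deprecated delim_whitespace=True
    comment="#",     # fully commented lines are skipped
    header=0,
)

print(df_high_force)
print(df[["Time_s", "Impulse_Ns"]])
print(df_txt.describe())
```

Using one set of column names throughout would let the Problem 1 DataFrame feed directly into the subsetting and filtering examples that follow it.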
