path: tutorials/module_4/4.1 Introduction to Data and Scientific Datasets.md
author: Christian Kolset <christian.kolset@gmail.com>, 2025-12-03
commit 5b9fd00087ca8594ddb716134cd210a9f9ae5876: "Made some updates to older files and renamed files. Also updated README.md"
# Introduction to Data and Scientific Datasets

**Learning objectives:**

- Understand what makes data “scientific” (units, precision, metadata)
- Identify challenges in data processing (missing data, noise, outliers)
- Describe the overall data-analysis workflow
---
### What is scientific data?
Scientific data refers to **measured or simulated information** that describes a physical phenomenon in a quantitative and reproducible way. Whether it is collected experimentally or predicted with a model, scientific data is rooted in physical laws and carries information about the system’s behavior, boundary conditions, and measurement uncertainty.

We may collect this in the following ways:
- **Experiments** – temperature readings from thermocouples, strain or force from sensors, vibration accelerations, or flow velocities.
- **Simulations** – outputs from finite-element or CFD models such as pressure, stress, or temperature distributions.
- **Instrumentation and sensors** – digital or analog signals from transducers, encoders, or DAQ systems.
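
To make this concrete, a single measurement can be modeled as a value together with the context (units, uncertainty, metadata) that makes it reproducible. A minimal sketch, where the field names and values are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    value: float        # measured quantity
    unit: str           # physical unit, e.g. "degC"
    uncertainty: float  # instrument uncertainty, in the same unit
    source: str         # metadata: where the reading came from

# A hypothetical thermocouple reading
reading = Measurement(value=21.4, unit="degC", uncertainty=0.5, source="thermocouple T1")

# The confidence range implied by the measurement uncertainty
low = reading.value - reading.uncertainty
high = reading.value + reading.uncertainty
print(f"{reading.value} {reading.unit} ± {reading.uncertainty} -> [{low}, {high}]")
```

Carrying the unit and uncertainty alongside the number is what separates a scientific dataset from a bare column of values.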
## How the data-processing flow works
```mermaid
flowchart
    A[Collecting] --> B[Cleaning & Filtering] --> C[Analysis] --> D[Visualization]
```
Data processing begins with **collection**, where measurements are recorded either manually using instruments or electronically through sensors. Regardless of the method, every measurement contains some degree of error, whether due to instrument limitations or external interference. In engineering, recognizing and quantifying this uncertainty is essential, as it defines the confidence range of our predictions.

Once the data has been collected, the next step is **cleaning and filtering**. This involves addressing missing data points, managing outliers, and reducing noise. Errors can arise from faulty readings, sensor drift, or transcription mistakes. By cleaning and filtering the data, we ensure it accurately represents the system being measured.
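
As a minimal sketch of this stage (the sensor values below are made up), pandas can interpolate a missing point and mask an obvious outlier:

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings with a dropout (NaN) and a spike (outlier)
temps = pd.Series([20.1, 20.3, np.nan, 20.4, 95.0, 20.2])

# 1. Fill the missing point by linear interpolation between its neighbors
cleaned = temps.interpolate()

# 2. Mask values far from the median; the median absolute deviation (MAD)
#    is used because, unlike the standard deviation, it is robust to the spike
median = cleaned.median()
mad = (cleaned - median).abs().median()
cleaned = cleaned.mask((cleaned - median).abs() > 10 * mad)

print(cleaned.tolist())  # the spike is replaced by NaN, the gap is filled
```

The threshold of 10 MADs here is an arbitrary illustrative choice; in practice it should be justified by the sensor's known noise characteristics.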

After the data is refined, we move into **analysis**. Here, statistical methods and computational tools are applied to model the data, uncover trends, and test hypotheses. This stage transforms raw numbers into meaningful insight.
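
As one example of such analysis, a linear trend can be fitted to noisy measurements by least squares (the displacement data here are synthetic):

```python
import numpy as np

# Synthetic measurements: displacement roughly proportional to time, plus noise
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # time, s
d = np.array([0.1, 2.05, 3.9, 6.1, 8.0])  # displacement, mm

# Fit a first-order polynomial (straight line) by least squares
slope, intercept = np.polyfit(t, d, 1)
print(f"slope = {slope:.3f} mm/s, intercept = {intercept:.3f} mm")
```

The fitted slope estimates the underlying rate, and the scatter of the residuals around the line gives a first feel for the measurement noise.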

Finally, **visualization** allows us to communicate these insights effectively. Visualization can occur alongside analysis to guide interpretation or as the concluding step to present results clearly and purposefully. Well-designed visualizations make complex findings intuitive and accessible to the intended audience.

To carry out this workflow efficiently, particularly during the cleaning, analysis, and visualization stages, we rely on powerful computational tools. In Python, one of the most versatile and widely used libraries for handling tabular data is `pandas`. It simplifies the process of managing, transforming, and analyzing datasets, allowing engineers and scientists to focus on interpreting results rather than wrestling with raw data.

## Introduction to pandas
`pandas` (**Pan**el **Da**ta) is a Python library designed for data analysis and manipulation, widely used in engineering, science, and data analytics. It provides two core data structures: the **Series** and the **DataFrame**.

A `Series` represents a single column or one-dimensional labeled array, while a `DataFrame` is a two-dimensional table of data, similar to a spreadsheet table, where each column is a `Series` and each row has a labeled index.

DataFrames can be created from dictionaries, lists, NumPy arrays, or imported from external files such as CSV or Excel. Once data is loaded, you can **view and explore** it using methods like `head()`, `tail()`, and `describe()`. Data can be **selected by label** (`.loc`) or **by position** (`.iloc`). These indexing systems make it easy to slice, filter, and reorganize datasets efficiently.
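
A short sketch of these exploration and indexing methods, using made-up force data:

```python
import pandas as pd

df = pd.DataFrame({
    "Time_s":  [0, 1, 2, 3, 4],
    "Force_N": [10, 20, 30, 25, 15],
})

print(df.head(3))      # first three rows
print(df.describe())   # count, mean, std, min, quartiles, max per column

# Selection by label vs. by position
by_label    = df.loc[2, "Force_N"]  # row with index label 2, column "Force_N"
by_position = df.iloc[2, 1]         # third row, second column
print(by_label, by_position)        # both select the same cell
```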

### Problem: Create a dataframe from lists
Given the data `force_N` and `time_s`:

```python
import pandas as pd

force_N = [10, 20, 30, 25, 15]
time_s = [0, 1, 2, 3, 4]

df = pd.DataFrame({
    'Time_s': time_s,
    'Force_N': force_N
})

print("\n=== Basic Statistics ===")
print(df.describe())
```
Notice how the `describe()` method outputs summary statistics that we may find useful.

### Manipulating dataframes
#### Subsets and Conditional filtering
You can select rows, columns, or specific conditions from a DataFrame.

```python
# Select a column (returns a Series)
force = df["Force_N"]

# Select multiple columns (returns a DataFrame)
subset = df[["Time_s", "Force_N"]]

# Conditional filtering: keep only rows where the force exceeds 20 N
df_high_force = df[df["Force_N"] > 20]
```

#### Combining and Merging Datasets
Often, multiple sensors or experiments must be merged into one dataset for analysis.

```python
# Merge on a common column (e.g., time); df_force and df_temp are
# two logs that share a "Time_s" column
merged = pd.merge(df_force, df_temp, on="Time_s")

# Stack multiple test runs (df_run1, df_run2) vertically
combined = pd.concat([df_run1, df_run2], axis=0)
```
See: https://pandas.pydata.org/docs/user_guide/merging.html
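
The snippet above assumes the frames already exist; a self-contained sketch with two small made-up sensor tables shows both operations end to end:

```python
import pandas as pd

# Hypothetical force and temperature logs sharing a time column
df_force = pd.DataFrame({"Time_s": [0, 1, 2], "Force_N": [10, 20, 30]})
df_temp  = pd.DataFrame({"Time_s": [0, 1, 2], "Temp_C":  [21.0, 21.5, 22.0]})

# Align the two logs on their common time stamps (an inner join by default)
merged = pd.merge(df_force, df_temp, on="Time_s")
print(merged)

# Stack two runs of the same experiment vertically
df_run1 = df_force
df_run2 = pd.DataFrame({"Time_s": [3, 4], "Force_N": [25, 15]})
combined = pd.concat([df_run1, df_run2], axis=0, ignore_index=True)
print(combined)
```

`ignore_index=True` renumbers the rows of the stacked result; without it, the index labels of each run are kept, producing duplicate labels.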

#### Creating new columns based on existing ones
Much like Excel, pandas allows you to manipulate columns using the dataframe header. In this example, we want to multiply a dataframe column by a constant:
```python
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
```

We may want to define the new column as a function of other columns; we can do so by simply applying a mathematical operation as follows:
```python
air_quality["ratio_paris_antwerp"] = (
    air_quality["station_paris"] / air_quality["station_antwerp"]
)
```

See also:
- https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html
- https://pandas.pydata.org/docs/user_guide/reshaping.html
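
Applying the same pattern to a small made-up force table (the newton-to-kilogram-force factor is standard, ≈ 0.10197 kgf/N; the product column is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Time_s": [0, 1, 2, 3, 4], "Force_N": [10, 20, 30, 25, 15]})

# New column from a constant: convert newtons to kilogram-force
df["Force_kgf"] = df["Force_N"] * 0.10197

# New column from other columns: element-wise product of force and time
df["Ft"] = df["Force_N"] * df["Time_s"]

print(df)
```

Both operations are vectorized: the arithmetic is applied element-wise to every row at once, with no explicit loop.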

### Problem: Create a dataframe from a text file

Given the file `force_displacement_data.txt`, use pandas to tabulate the data into a dataframe:
```python
import pandas as pd

file_path = "force_displacement_data.txt"

df_txt = pd.read_csv(
    file_path,
    sep=r"\s+",    # whitespace-delimited columns (delim_whitespace= is deprecated)
    comment="#",   # ignore comment lines in the data file
    header=0       # first non-comment row holds the column names
)

print("\n=== Basic Statistics ===")
print(df_txt.describe())

if "Force_N" in df_txt.columns:
    print("\nFirst five Force readings:")
    print(df_txt["Force_N"].head())

try:
    import matplotlib.pyplot as plt

    plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1])
    plt.xlabel(df_txt.columns[0])
    plt.ylabel(df_txt.columns[1])
    plt.title("Loaded Data from Text File")
    plt.grid(True)
    plt.show()

except ImportError:
    print("\nmatplotlib not installed — skipping plot.")
```

## **Activities & Examples:**
- Discuss real ME examples: strain-gauge data, thermocouple readings, pressure transducers