# Introduction to Data and Scientific Datasets

**Learning objectives:**

- Understand what makes data “scientific” (units, precision, metadata)
- Recognize types of data: time-series, experimental, simulation, and imaging data
- Identify challenges in data processing (missing data, noise, outliers)
- Outline the data-analysis workflow

---
### What is scientific data?
Scientific data is **measured or simulated information** that describes a physical phenomenon in a quantitative and reproducible way. Whether it is collected experimentally or predicted with a model, it is rooted in physical laws and carries information about the system’s behavior, boundary conditions, and measurement uncertainty.

Such data typically comes from the following sources:
- **Experiments** – temperature readings from thermocouples, strain or force from sensors, vibration accelerations, or flow velocities.
- **Simulations** – outputs from finite-element or CFD models such as pressure, stress, or temperature distributions.
- **Instrumentation and sensors** – digital or analog signals from transducers, encoders, or DAQ systems.
## Introduction to pandas
`pandas` (the name derives from "**pan**el **da**ta") is a Python library designed for data analysis and manipulation, widely used in engineering, science, and data analytics. It provides two core data structures: the **Series** and the **DataFrame**.

A `Series` represents a single column or one-dimensional labeled array, while a `DataFrame` is a two-dimensional table of data, similar to a spreadsheet table, where each column is a `Series` and each row has a labeled index. 

DataFrames can be created from dictionaries, lists, or NumPy arrays, or imported from external files such as CSV or Excel. Once data is loaded, you can **view and explore** it using methods like `head()`, `tail()`, and `describe()`. Data can be **selected by label** (with `.loc`) or **by position** (with `.iloc`). These indexing systems make it easy to slice, filter, and reorganize datasets efficiently.
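The following is a minimal sketch of these ideas rather than a reference implementation. The column names `Time_s` and `Force_N` and all numeric values are invented for illustration; only the pandas calls themselves (`pd.DataFrame`, `head()`, `tail()`, `describe()`, `.loc`, `.iloc`) are standard.

```python
import pandas as pd

# Hypothetical readings, invented purely for illustration
data = {
    "Time_s": [0.0, 0.5, 1.0, 1.5, 2.0],
    "Force_N": [0.0, 12.5, 25.1, 37.4, 50.2],
}
df_demo = pd.DataFrame(data)

print(df_demo.head(3))     # first three rows
print(df_demo.tail(2))     # last two rows
print(df_demo.describe())  # count, mean, std, min, quartiles, max

# Label-based selection: row with index label 2, column "Force_N"
force_by_label = df_demo.loc[2, "Force_N"]

# Position-based selection: third row, second column (zero-based positions)
force_by_position = df_demo.iloc[2, 1]

# A single column of a DataFrame is a Series
force_series = df_demo["Force_N"]
```

Here the default integer index means the label `2` and the position `2` refer to the same row; with a custom index (for example, time stamps) the two would differ.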


### Problem 1: Create a dataframe from a text file
Given the file `force_displacement_data.txt`, use pandas to load the data into a DataFrame.
```python
import pandas as pd

file_path = "force_displacement_data.txt"

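df_txt = pd.read_csv(
    file_path,
    sep=r"\s+",     # columns separated by one or more spaces or tabs
    comment="#",    # ignore anything after a '#'
    header=0,       # first non-comment line holds the column names
)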

print("\n=== Basic Statistics ===")
print(df_txt.describe())

if "Force_N" in df_txt.columns:
    print("\nFirst five Force readings:")
    print(df_txt["Force_N"].head())

try:
    import matplotlib.pyplot as plt

    plt.plot(df_txt.iloc[:, 0], df_txt.iloc[:, 1])
    plt.xlabel(df_txt.columns[0])
    plt.ylabel(df_txt.columns[1])
    plt.title("Loaded Data from Text File")
    plt.grid(True)
    plt.show()

except ImportError:
    print("\nmatplotlib not installed — skipping plot.")

```
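If the data file is not at hand, the parsing options can be tried on an in-memory stand-in. This is a sketch under assumptions: the layout below (a `#` comment header, whitespace-separated columns named `Displacement_mm` and `Force_N`) and every value in it are hypothetical, and the real `force_displacement_data.txt` may be organized differently.

```python
import io
import pandas as pd

# Hypothetical file contents; the real force_displacement_data.txt may differ
sample_text = """# Tensile test, specimen A (made-up values)
Displacement_mm  Force_N
0.00   0.0
0.10   10.2
0.20   20.1
# a comment line in the middle of the data
0.30   30.4
"""

df_sample = pd.read_csv(
    io.StringIO(sample_text),  # file-like object standing in for a path
    sep=r"\s+",                # one or more spaces or tabs between columns
    comment="#",               # fully commented lines are skipped
)
print(df_sample)
```

Because fully commented lines are ignored, the column names are taken from the first non-comment line, and the comment in the middle of the data does not produce a row.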

## Subsetting and Conditional Filtering
You can select a single column, several columns at once, or only the rows that satisfy a condition.

```python
# df is assumed to be a DataFrame with "Time_s" and "Force_N" columns

# Select a column
force = df["Force_N"]

# Select multiple columns
subset = df[["Time_s", "Force_N"]]

# Conditional filtering
df_high_force = df[df["Force_N"] > 50]
```
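Conditions can also be combined, and `.loc` or `.iloc` can filter rows and pick columns in one step. The sketch below is self-contained; the DataFrame `df_demo` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    "Time_s": [0.0, 1.0, 2.0, 3.0, 4.0],
    "Force_N": [10.0, 40.0, 55.0, 70.0, 45.0],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (df_demo["Force_N"] > 50) & (df_demo["Time_s"] < 3.0)
df_window = df_demo[mask]

# .loc filters rows and picks a column in one step
high_force_times = df_demo.loc[df_demo["Force_N"] > 50, "Time_s"]

# .iloc slices by position: first three rows, all columns
first_three = df_demo.iloc[:3]
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the bitwise operators `&` and `|` are used here.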


![[Pasted image 20251013064718.png]]

## Combining and Merging Datasets
Often, multiple sensors or experiments must be merged into one dataset for analysis.

```python
# Merge on a common column (e.g., time)
merged = pd.merge(df_force, df_temp, on="Time_s")

# Stack multiple test runs vertically
combined = pd.concat([df_run1, df_run2], axis=0)
```
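A self-contained version of the same two operations might look like the sketch below. The DataFrames and their values are hypothetical; the `how="outer"` and `ignore_index=True` options are standard pandas parameters shown here as possible variations, not as the required approach.

```python
import pandas as pd

# Hypothetical sensor logs, invented for illustration
df_force = pd.DataFrame({"Time_s": [0.0, 1.0, 2.0], "Force_N": [5.0, 15.0, 25.0]})
df_temp = pd.DataFrame({"Time_s": [1.0, 2.0, 3.0], "Temp_C": [21.0, 21.5, 22.1]})

# Default inner join keeps only the time stamps present in both tables
merged_inner = pd.merge(df_force, df_temp, on="Time_s")

# An outer join keeps every time stamp and fills the gaps with NaN
merged_outer = pd.merge(df_force, df_temp, on="Time_s", how="outer")

# Stacking runs: ignore_index=True renumbers the rows from 0
df_run1 = pd.DataFrame({"Time_s": [0.0, 1.0], "Force_N": [5.0, 15.0]})
df_run2 = pd.DataFrame({"Time_s": [0.0, 1.0], "Force_N": [6.0, 14.0]})
combined = pd.concat([df_run1, df_run2], axis=0, ignore_index=True)
```

Note that `pd.merge` matches time stamps exactly, so sensors sampled at slightly different instants will not line up without further processing.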


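### Problem 2: Describe a dataset
Use pandas' built-in `describe()` method to report the mean of the given experimental data, then plot force against time.

```python
import matplotlib.pyplot as plt

# df is assumed to hold the experimental data with "Time_s" and "Force_N" columns
summary = df.describe()      # count, mean, std, min, quartiles, max per column
print(summary.loc["mean"])   # mean of every numeric column

# Plot the force history as a visual check
plt.plot(df["Time_s"], df["Force_N"])
plt.xlabel("Time (s)")
plt.ylabel("Force (N)")
plt.title("Force vs. Time")
plt.show()
```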





**Activities & Examples:**
- Load small CSV datasets using `numpy.loadtxt()` and `pandas.read_csv()`
- Discuss real mechanical engineering (ME) examples: strain gauge data, thermocouple readings, pressure transducers
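As a starting point for the first activity, the sketch below loads the same small CSV text with both functions. The thermocouple log is hypothetical (made-up column names and values), and an in-memory `io.StringIO` object stands in for a file on disk.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical thermocouple log, values made up for illustration
csv_text = "Time_s,Temp_C\n0.0,20.1\n1.0,20.4\n2.0,20.9\n"

# numpy.loadtxt returns a plain 2-D float array; the header row must be skipped
arr = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# pandas.read_csv keeps the column names and returns a DataFrame
df_csv = pd.read_csv(io.StringIO(csv_text))

print(arr.shape)
print(df_csv.columns.tolist())
```

The NumPy array is convenient for purely numeric work, while the DataFrame retains the column labels that document what each quantity is.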