From b2e5d0f00829b3c603a846b62c5bdbea45e449db Mon Sep 17 00:00:00 2001
From: Christian Kolset
Date: Fri, 17 Oct 2025 13:57:08 -0600
Subject: Restructures module 4 to follow structure from last meeting. Renamed tutorials to have correct order.

---
 .../module_4/4.2 Importing and Managing Data.md | 142 ---------------------
 tutorials/module_4/4.2 Interpreting Data.md | 14 ++
 .../4.3 Data Cleaning and Preprocessing.md | 104 ---------------
 .../module_4/4.3 Importing and Managing Data.md | 142 +++++++++++++++++++++
 .../4.5 Data Filtering and Signal Processing.md | 47 -------
 .../4.6 Data Filtering and Signal Processing.md | 47 +++++++
 .../4.6 Data Visualization and Presentation.md | 29 -----
 .../4.7 Data Visualization and Presentation.md | 29 +++++
 .../module_4/Data Cleaning and Preprocessing.md | 104 +++++++++++++++
 9 files changed, 336 insertions(+), 322 deletions(-)
 delete mode 100644 tutorials/module_4/4.2 Importing and Managing Data.md
 create mode 100644 tutorials/module_4/4.2 Interpreting Data.md
 delete mode 100644 tutorials/module_4/4.3 Data Cleaning and Preprocessing.md
 create mode 100644 tutorials/module_4/4.3 Importing and Managing Data.md
 delete mode 100644 tutorials/module_4/4.5 Data Filtering and Signal Processing.md
 create mode 100644 tutorials/module_4/4.6 Data Filtering and Signal Processing.md
 delete mode 100644 tutorials/module_4/4.6 Data Visualization and Presentation.md
 create mode 100644 tutorials/module_4/4.7 Data Visualization and Presentation.md
 create mode 100644 tutorials/module_4/Data Cleaning and Preprocessing.md

diff --git a/tutorials/module_4/4.2 Importing and Managing Data.md b/tutorials/module_4/4.2 Importing and Managing Data.md
deleted file mode 100644
index 101d5ab..0000000
--- a/tutorials/module_4/4.2 Importing and Managing Data.md
+++ /dev/null
@@ -1,142 +0,0 @@
-# Importing and Managing Data
-
-**Learning objectives:**
-
-- Import data from CSV, Excel, and text files using Pandas
-- Handle headers, delimiters, and units
-- Combine and merge multiple datasets
-- Manage data with time or index labels
----
-## File types
-Once data is collected, the first step is importing it into a structured form that Python can interpret. The `pandas` library provides the foundation for this: it can read nearly any file format used in engineering (text files, CSV, Excel sheets, CFD results, etc.) as well as many Python structures (lists, dictionaries, NumPy arrays, etc.) and organize the data in a DataFrame, a tabular structure similar to an Excel sheet but optimized for coding.
-![](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg)
-## Importing spreadsheets using Pandas
-A Comma-Separated Values (CSV) file is a common spreadsheet-type file. It is essentially a text file in which each line is a row of the table and each comma marks the start of a new column. Saving spreadsheets in this format is a standard convention.
-
-Let's take a look at how this works in Python.
-```python
-import pandas as pd
-
-# Read a CSV file
-df = pd.read_csv("data_experiment.csv")
-
-# Optional arguments
-df_csv = pd.read_csv(
-    "data_experiment.csv",
-    delimiter=",",    # specify custom delimiter
-    header=0,         # row number to use as header
-    index_col=None,   # or specify a column as index
-    skiprows=0,       # skip metadata lines
-)
-print(df)
-```
-
-We have now created a new DataFrame with the data from our .csv file.
-
-We can also do this for **Excel files**. Pandas has a built-in function to make this easier for us.
-```python
-df_xlsx = pd.read_excel("temperature_log.xlsx", sheet_name="Sheet1")
-print(df_xlsx.head())
-```
-
-Additionally, although not a very common practice in engineering, it is very useful that Pandas can import a wide variety of other file types, such as JSON, HTML, SQL, or even your clipboard.
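-For instance, a minimal sketch of two of these readers (the JSON file name is a hypothetical placeholder, and `read_clipboard()` assumes a table has just been copied to the clipboard):
-```python
-import pandas as pd
-
-# Read a JSON file into a DataFrame (hypothetical file name)
-df_json = pd.read_json("data_experiment.json")
-
-# Parse whatever tabular text is currently on the system clipboard
-df_clip = pd.read_clipboard()
-
-print(df_json.head())
-```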
-
-### Handling Headers, Units, and Metadata
-Raw data often contains metadata or units above the table. Pandas can account for this metadata by skipping the first few rows.
-
-```python
-df = pd.read_csv("sensor_data.csv", skiprows=3)
-df.columns = ["Time_s", "Force_N", "Displacement_mm"]
-
-# Convert units
-df["Displacement_m"] = df["Displacement_mm"] / 1000
-```
-
-### Writing and Editing Data in pandas
-https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html
-
-Once data has been analyzed or cleaned, `pandas` allows you to **export results** to multiple file types for reporting or further processing. Similarly to importing, we can also export .csv files and Excel files. Pandas also makes it easy to modify individual data points directly within a DataFrame. You can locate entries either by label or by position:
-
-```python
-# By label
-df.loc[row_label, column_label]
-# or by position
-df.iloc[row_index, column_index]
-```
-
-```python
-import pandas as pd
-
-# Create DataFrame manually
-data = {
-    "Time_s": [0, 1, 2, 3],
-    "Force_N": [0.0, 5.2, 10.4, 15.5],
-    "Displacement_mm": [0.0, 0.3, 0.6, 0.9]
-}
-df = pd.DataFrame(data)
-
-# Edit a single value
-df.loc[1, "Force_N"] = 5.5
-
-# Export to CSV
-df.to_csv("edited_experiment.csv", index=False)
-```
-
-This workflow makes pandas ideal for working with tabular data: you can quickly edit or generate datasets, verify values, and save clean, structured files for later visualization or analysis.
-
-## Subsetting and Conditional filtering
-You can select rows, columns, or specific conditions from a DataFrame.
-
-```python
-# Select a column
-force = df["Force_N"]
-
-# Select multiple columns
-subset = df[["Time_s", "Force_N"]]
-
-# Conditional filtering
-df_high_force = df[df["Force_N"] > 50]
-```
-
-## Combining and Merging Datasets
-Often, multiple sensors or experiments must be merged into one dataset for analysis.
-
-```python
-# Merge on a common column (e.g., time)
-merged = pd.merge(df_force, df_temp, on="Time_s")
-
-# Stack multiple test runs vertically
-combined = pd.concat([df_run1, df_run2], axis=0)
-```
-
-## Problem 1: Describe a dataset
-Use pandas' built-in `describe()` method to report summary statistics (including the mean) of the given experimental data, then plot force versus time.
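-
-A minimal starting sketch for the first part (assumes `df` already holds the experimental data imported above):
-```python
-# Summary statistics (count, mean, std, min, quartiles, max) for each column
-print(df.describe())
-
-# Or just the column means
-print(df.mean(numeric_only=True))
-```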
Time") -plt.show() -``` - - -### Problem 2: Import time stamped data - - - -### Further Docs -[Comparison with Spreadsheets](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_spreadsheets.html#compare-with-spreadsheets) -[Intro to Reading/Writing Files](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html) -[Subsetting Data](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html) -[Adding Columns](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html) -[Reshaping Data](https://pandas.pydata.org/docs/user_guide/reshaping.html) -[Merging DataFrames](https://pandas.pydata.org/docs/user_guide/merging.html) -[Combining DataFrames](https://pandas.pydata.org/docs/getting_started/intro_tutorials/08_combine_dataframes.html) diff --git a/tutorials/module_4/4.2 Interpreting Data.md b/tutorials/module_4/4.2 Interpreting Data.md new file mode 100644 index 0000000..109a741 --- /dev/null +++ b/tutorials/module_4/4.2 Interpreting Data.md @@ -0,0 +1,14 @@ +# Interpreting Data +Philosophy of visualizing data + + + +## The meaning of your data +Similarly to the English language, when we put words together we create context. As engineers and scientists, if mathematics is our language, then the data is the context. + + + +## Audience + + + diff --git a/tutorials/module_4/4.3 Data Cleaning and Preprocessing.md b/tutorials/module_4/4.3 Data Cleaning and Preprocessing.md deleted file mode 100644 index 7ac126c..0000000 --- a/tutorials/module_4/4.3 Data Cleaning and Preprocessing.md +++ /dev/null @@ -1,104 +0,0 @@ -# Data Cleaning and Preprocessing - -**Learning objectives:** - -- Detect and handle missing or invalid data -- Identify and remove outliers -- Apply smoothing and detrending -- Unit consistency and scaling ---- -## What is data cleaning? -Data cleaning is an **iterative and adaptive process** that uses different methods depending on the characteristics of the dataset, the goals of the analysis, and the tools available. It generally includes several key tasks, such as: -- Handling or replacing missing and invalid data -- Detecting and correcting outliers -- Reducing noise through smoothing or filtering techniques - - -## Handling missing or invalid data -Missing data occurs when expected values or measurements are absent from a dataset, often appearing as `NULL`, `0`, empty strings, or `NaN` (Not a Number) entries. These gaps can arise from various sources, including sensor malfunctions during data acquisition, errors in transmission, or formatting issues during data conversion. Because missing data can distort analyses and weaken model accuracy, it must be carefully identified and treated during the data cleaning stage. - -Detecting missing data may seem simple, but selecting an appropriate way to replace those gaps is often more complex. The process typically begins by locating missing or invalid entries through visualization or value inspection. Once identified, the goal is to estimate replacements that closely approximate the true, unobserved values. The method used depends heavily on the behavior and structure of the data. - -- **Slowly changing data**, such as temperature measurements, can often be filled using the nearest valid observation. -- **Seasonal or moderately variable data**, like weather records, may benefit from statistical approaches such as moving averages, medians, or _K_-nearest neighbor imputation. 
-
-You may have data that looks like this.
-![A solar irradiance raw input data time-series plot with missing values.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns_copy_copy/725f6f68-0273-4bd3-8e6a-6a184615752a/image.adapt.full.medium.jpg/1758740047296.jpg)
-
-In Python, the `pandas` library provides several simple and powerful ways to handle missing values. Missing entries in a DataFrame appear as `NaN` (Not a Number), and you can replace or estimate these values using methods such as forward fill, backward fill, interpolation, or moving averages.
-
-- Forward fill (`ffill`) uses the last valid observation to replace missing values, which is useful for slowly changing signals.
-- Backward fill (`bfill`) propagates the next valid value backward to fill earlier gaps.
-- Interpolation estimates missing values using linear or polynomial trends between known data points.
-- Rolling mean or moving average smooths short-term fluctuations by averaging nearby samples, similar to MATLAB’s `movmean()` function.
-
-The example below demonstrates these techniques applied to a temperature dataset with missing readings:
-```python
-import pandas as pd
-
-# Example data
-data = {"Time_s": [0, 1, 2, 3, 4, 5],
-        "Temp_C": [20.1, None, 21.0, None, 22.3, 22.8]}
-df = pd.DataFrame(data)
-
-# Fill with the last valid value (forward fill)
-df["Temp_ffill"] = df["Temp_C"].ffill()
-
-# Fill with the next valid value (backward fill)
-df["Temp_bfill"] = df["Temp_C"].bfill()
-
-# Linear interpolation between missing values
-df["Temp_interp"] = df["Temp_C"].interpolate(method="linear")
-
-# Rolling mean (similar to a moving average)
-df["Temp_movmean"] = df["Temp_C"].fillna(
-    df["Temp_C"].rolling(window=3, min_periods=1).mean()
-)
-print(df)
-```
-
-
-## Identify and remove outliers
-Outliers are data points that differ greatly from the rest of the data, often appearing as unusually high or low values that don’t follow the overall pattern. They can distort analysis and lead to misleading conclusions. Outliers may result from measurement errors, data entry mistakes, normal variation, or true anomalies in the system being measured.
-
-One common statistical approach to detect and remove outliers is the **Z-score method**. A Z-score describes how far a data point is from the mean of the dataset, measured in units of standard deviation. For a normally distributed variable, most data points lie within three standard deviations of the mean (±3σ). Values that fall far outside this range are likely to be **outliers**, meaning they deviate significantly from the typical trend of the data.
-
-In practice, we calculate the Z-score for each observation, take its absolute value, and remove points whose Z-score exceeds a threshold, commonly 3 for general data, or 2.5 when the dataset is smaller or more sensitive to noise. This method works best when the data roughly follows a bell-shaped (Gaussian) distribution.
-
-The following example demonstrates how we could apply the Z-score method using the SciPy library and a small sample dataset of force measurements:
-
-```python
-import pandas as pd
-import numpy as np
-from scipy import stats
-
-# Example dataset
-df = pd.DataFrame({"Force_N": [10, 11, 10.5, 10.2, 11.1, 50, 10.8, 9.9]})
-
-# Compute Z-scores
-z = np.abs(stats.zscore(df["Force_N"]))
-
-# Keep only points where Z < 3 (within 3 standard deviations)
-df_clean = df[z < 3]
-
-print(df_clean)
-```
-
-
-### Problem 1: Cleaning datasets
-Clean up the following dataset using the methods above.
-
-## Apply smoothing and detrending
-
-
-
-
-## Units and scaling
-
-
-
-### Problem 2:
-
-
diff --git a/tutorials/module_4/4.3 Importing and Managing Data.md b/tutorials/module_4/4.3 Importing and Managing Data.md
new file mode 100644
index 0000000..101d5ab
--- /dev/null
+++ b/tutorials/module_4/4.3 Importing and Managing Data.md
@@ -0,0 +1,142 @@
+# Importing and Managing Data
+
+**Learning objectives:**
+
+- Import data from CSV, Excel, and text files using Pandas
+- Handle headers, delimiters, and units
+- Combine and merge multiple datasets
+- Manage data with time or index labels
+---
+## File types
+Once data is collected, the first step is importing it into a structured form that Python can interpret. The `pandas` library provides the foundation for this: it can read nearly any file format used in engineering (text files, CSV, Excel sheets, CFD results, etc.) as well as many Python structures (lists, dictionaries, NumPy arrays, etc.) and organize the data in a DataFrame, a tabular structure similar to an Excel sheet but optimized for coding.
+![](https://pandas.pydata.org/docs/_images/02_io_readwrite.svg)
+## Importing spreadsheets using Pandas
+A Comma-Separated Values (CSV) file is a common spreadsheet-type file. It is essentially a text file in which each line is a row of the table and each comma marks the start of a new column. Saving spreadsheets in this format is a standard convention.
+
+Let's take a look at how this works in Python.
+```python
+import pandas as pd
+
+# Read a CSV file
+df = pd.read_csv("data_experiment.csv")
+
+# Optional arguments
+df_csv = pd.read_csv(
+    "data_experiment.csv",
+    delimiter=",",    # specify custom delimiter
+    header=0,         # row number to use as header
+    index_col=None,   # or specify a column as index
+    skiprows=0,       # skip metadata lines
+)
+print(df)
+```
+
+We have now created a new DataFrame with the data from our .csv file.
+
+We can also do this for **Excel files**. Pandas has a built-in function to make this easier for us.
+```python
+df_xlsx = pd.read_excel("temperature_log.xlsx", sheet_name="Sheet1")
+print(df_xlsx.head())
+```
+
+Additionally, although not a very common practice in engineering, it is very useful that Pandas can import a wide variety of other file types, such as JSON, HTML, SQL, or even your clipboard.
+
+### Handling Headers, Units, and Metadata
+Raw data often contains metadata or units above the table. Pandas can account for this metadata by skipping the first few rows.
+
+```python
+df = pd.read_csv("sensor_data.csv", skiprows=3)
+df.columns = ["Time_s", "Force_N", "Displacement_mm"]
+
+# Convert units
+df["Displacement_m"] = df["Displacement_mm"] / 1000
+```
+
+### Writing and Editing Data in pandas
+https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html
+
+Once data has been analyzed or cleaned, `pandas` allows you to **export results** to multiple file types for reporting or further processing. Similarly to importing, we can also export .csv files and Excel files.
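+
+For example, a minimal sketch of exporting a DataFrame to an Excel sheet (assumes an Excel engine such as `openpyxl` is installed):
+```python
+# Write df to an .xlsx file; index=False drops the row labels
+df.to_excel("data_experiment.xlsx", sheet_name="Sheet1", index=False)
+```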
+
+Pandas also makes it easy to modify individual data points directly within a DataFrame. You can locate entries either by label or by position:
+
+```python
+# By label
+df.loc[row_label, column_label]
+# or by position
+df.iloc[row_index, column_index]
+```
+
+```python
+import pandas as pd
+
+# Create DataFrame manually
+data = {
+    "Time_s": [0, 1, 2, 3],
+    "Force_N": [0.0, 5.2, 10.4, 15.5],
+    "Displacement_mm": [0.0, 0.3, 0.6, 0.9]
+}
+df = pd.DataFrame(data)
+
+# Edit a single value
+df.loc[1, "Force_N"] = 5.5
+
+# Export to CSV
+df.to_csv("edited_experiment.csv", index=False)
+```
+
+This workflow makes pandas ideal for working with tabular data: you can quickly edit or generate datasets, verify values, and save clean, structured files for later visualization or analysis.
+
+## Subsetting and Conditional filtering
+You can select rows, columns, or specific conditions from a DataFrame.
+
+```python
+# Select a column
+force = df["Force_N"]
+
+# Select multiple columns
+subset = df[["Time_s", "Force_N"]]
+
+# Conditional filtering
+df_high_force = df[df["Force_N"] > 50]
+```
+
+## Combining and Merging Datasets
+Often, multiple sensors or experiments must be merged into one dataset for analysis.
+
+```python
+# Merge on a common column (e.g., time)
+merged = pd.merge(df_force, df_temp, on="Time_s")
+
+# Stack multiple test runs vertically
+combined = pd.concat([df_run1, df_run2], axis=0)
+```
+
+## Problem 1: Describe a dataset
+Use pandas' built-in `describe()` method to report summary statistics (including the mean) of the given experimental data, then plot force versus time.
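+
+A minimal starting sketch for the first part (assumes `df` already holds the experimental data imported above):
+```python
+# Summary statistics (count, mean, std, min, quartiles, max) for each column
+print(df.describe())
+
+# Or just the column means
+print(df.mean(numeric_only=True))
+```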
Time") +plt.show() +``` + + +### Problem 2: Import time stamped data + + + +### Further Docs +[Comparison with Spreadsheets](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_spreadsheets.html#compare-with-spreadsheets) +[Intro to Reading/Writing Files](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html) +[Subsetting Data](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html) +[Adding Columns](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html) +[Reshaping Data](https://pandas.pydata.org/docs/user_guide/reshaping.html) +[Merging DataFrames](https://pandas.pydata.org/docs/user_guide/merging.html) +[Combining DataFrames](https://pandas.pydata.org/docs/getting_started/intro_tutorials/08_combine_dataframes.html) diff --git a/tutorials/module_4/4.5 Data Filtering and Signal Processing.md b/tutorials/module_4/4.5 Data Filtering and Signal Processing.md deleted file mode 100644 index 112826e..0000000 --- a/tutorials/module_4/4.5 Data Filtering and Signal Processing.md +++ /dev/null @@ -1,47 +0,0 @@ -# Data Filtering and Signal Processing - -**Learning Objectives** - -- Understand the purpose of filtering in experimental and computational data -- Differentiate between noise, bias, and true signal -- Apply time-domain and frequency-domain filters to remove unwanted noise -- Introduce basic spatial (2-D) filtering for imaging or contour data -- Interpret filter performance and trade-offs (cutoff frequency, phase lag) - ---- - - -#### Topics - -- Review: what “noise” looks like statistically -- Time-domain filters - - Moving-average, Savitzky–Golay smoothing - - FIR and IIR filters (low-pass, high-pass, band-pass) -- Frequency-domain filtering - - Fast Fourier Transform (FFT) basics - - Noise removal using spectral methods -- Spatial filtering and image operations - - Gaussian smoothing, Sobel edge detection, median filters -- Comparing filtered vs. 
-#### Python Focus
-
-- `scipy.signal` for 1-D signals
-    - `butter()`, `filtfilt()`, `savgol_filter()`
-    - `freqz()` for visualizing filter response
-- `numpy.fft` for frequency-domain analysis
-- `scipy.ndimage` for 2-D spatial filters
-    - `gaussian_filter()`, `median_filter()`, `sobel()`
-- Quick visualization with `matplotlib.pyplot` and `imshow()`
-#### Applications
-
-- **Vibration analysis:** Filter accelerometer data to isolate modal frequencies
-- **Thermal measurements:** Smooth transient thermocouple data to remove spikes
-- **Fluid or heat transfer visualization:** Apply Gaussian blur or gradient filters to contour plots or infrared images
-- **Structural testing:** Remove noise from strain-gauge or displacement signals before computing stress–strain
-
-#### Problems
-
-- Filter noisy vibration or pressure data and compare spectra before/after
-- Apply a moving average and a Butterworth filter to the same dataset — evaluate differences
-- Use `ndimage.sobel()` to highlight temperature gradients in a heat-map image
-- Challenge: write a short Python function that automatically chooses an appropriate smoothing window based on noise level
\ No newline at end of file
diff --git a/tutorials/module_4/4.6 Data Filtering and Signal Processing.md b/tutorials/module_4/4.6 Data Filtering and Signal Processing.md
new file mode 100644
index 0000000..112826e
--- /dev/null
+++ b/tutorials/module_4/4.6 Data Filtering and Signal Processing.md
@@ -0,0 +1,47 @@
+# Data Filtering and Signal Processing
+
+**Learning Objectives**
+
+- Understand the purpose of filtering in experimental and computational data
+- Differentiate between noise, bias, and true signal
+- Apply time-domain and frequency-domain filters to remove unwanted noise
+- Introduce basic spatial (2-D) filtering for imaging or contour data
+- Interpret filter performance and trade-offs (cutoff frequency, phase lag)
+
+---
+
+
+#### Topics
+
+- Review: what “noise” looks like statistically
+- Time-domain filters
+    - Moving-average, Savitzky–Golay smoothing
+    - FIR and IIR filters (low-pass, high-pass, band-pass; a sketch follows this list)
+- Frequency-domain filtering
+    - Fast Fourier Transform (FFT) basics
+    - Noise removal using spectral methods
+- Spatial filtering and image operations
+    - Gaussian smoothing, Sobel edge detection, median filters
+- Comparing filtered vs. unfiltered data visually
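+
+As a taste of the time-domain tools, a minimal sketch of a zero-phase Butterworth low-pass filter (the 5 Hz signal, 50 Hz noise, sampling rate, and 10 Hz cutoff are all illustrative assumptions):
+```python
+import numpy as np
+from scipy.signal import butter, filtfilt
+
+fs = 500.0                          # sampling frequency, Hz
+t = np.arange(0.0, 1.0, 1.0 / fs)
+x = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 50 * t)
+
+# 4th-order low-pass Butterworth with a 10 Hz cutoff
+b, a = butter(4, 10.0, btype="low", fs=fs)
+
+# filtfilt runs the filter forward and backward, so it adds no phase lag
+y = filtfilt(b, a, x)
+```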
+#### Python Focus
+
+- `scipy.signal` for 1-D signals
+    - `butter()`, `filtfilt()`, `savgol_filter()`
+    - `freqz()` for visualizing filter response
+- `numpy.fft` for frequency-domain analysis
+- `scipy.ndimage` for 2-D spatial filters
+    - `gaussian_filter()`, `median_filter()`, `sobel()`
+- Quick visualization with `matplotlib.pyplot` and `imshow()`
+#### Applications
+
+- **Vibration analysis:** Filter accelerometer data to isolate modal frequencies
+- **Thermal measurements:** Smooth transient thermocouple data to remove spikes
+- **Fluid or heat transfer visualization:** Apply Gaussian blur or gradient filters to contour plots or infrared images
+- **Structural testing:** Remove noise from strain-gauge or displacement signals before computing stress–strain
+
+#### Problems
+
+- Filter noisy vibration or pressure data and compare spectra before/after
+- Apply a moving average and a Butterworth filter to the same dataset — evaluate differences
+- Use `ndimage.sobel()` to highlight temperature gradients in a heat-map image
+- Challenge: write a short Python function that automatically chooses an appropriate smoothing window based on noise level
\ No newline at end of file
diff --git a/tutorials/module_4/4.6 Data Visualization and Presentation.md b/tutorials/module_4/4.6 Data Visualization and Presentation.md
deleted file mode 100644
index b788fc7..0000000
--- a/tutorials/module_4/4.6 Data Visualization and Presentation.md
+++ /dev/null
@@ -1,29 +0,0 @@
-# Data Visualization and Presentation
-
-**Learning objectives:**
-
-- Create scientific plots using `matplotlib.pyplot`
-- Customize figures (labels, legends, styles, subplots)
-- Plot multi-dimensional and time-series data
-- Combine plots and export for reports
----
-
-**Extensions:**
-
-- Intro to `seaborn` for statistical visualization
-- Plotting uncertainty and error bars
-
-
-
-## How to represent data scientifically
-
-
-
-
-
-
-
-
-## Taking it further with R
\ No newline at end of file
diff --git a/tutorials/module_4/4.7 Data Visualization and Presentation.md b/tutorials/module_4/4.7 Data Visualization and Presentation.md
new file mode 100644
index 0000000..b788fc7
--- /dev/null
+++ b/tutorials/module_4/4.7 Data Visualization and Presentation.md
@@ -0,0 +1,29 @@
+# Data Visualization and Presentation
+
+**Learning objectives:**
+
+- Create scientific plots using `matplotlib.pyplot`
+- Customize figures (labels, legends, styles, subplots)
+- Plot multi-dimensional and time-series data
+- Combine plots and export for reports
+---
+
+**Extensions:**
+
+- Intro to `seaborn` for statistical visualization
+- Plotting uncertainty and error bars
+
+
+
+## How to represent data scientifically
+
+
+
+
+
+
+
+
+## Taking it further with R
\ No newline at end of file
diff --git a/tutorials/module_4/Data Cleaning and Preprocessing.md b/tutorials/module_4/Data Cleaning and Preprocessing.md
new file mode 100644
index 0000000..7ac126c
--- /dev/null
+++ b/tutorials/module_4/Data Cleaning and Preprocessing.md
@@ -0,0 +1,104 @@
+# Data Cleaning and Preprocessing
+
+**Learning objectives:**
+
+- Detect and handle missing or invalid data
+- Identify and remove outliers
+- Apply smoothing and detrending
+- Unit consistency and scaling
+---
+## What is data cleaning?
+Data cleaning is an **iterative and adaptive process** that uses different methods depending on the characteristics of the dataset, the goals of the analysis, and the tools available. It generally includes several key tasks, such as:
+- Handling or replacing missing and invalid data
+- Detecting and correcting outliers
+- Reducing noise through smoothing or filtering techniques
+
+
+## Handling missing or invalid data
+Missing data occurs when expected values or measurements are absent from a dataset, often appearing as `NULL`, `0`, empty strings, or `NaN` (Not a Number) entries. These gaps can arise from various sources, including sensor malfunctions during data acquisition, errors in transmission, or formatting issues during data conversion. Because missing data can distort analyses and weaken model accuracy, it must be carefully identified and treated during the data cleaning stage.
+
+Detecting missing data may seem simple, but selecting an appropriate way to replace those gaps is often more complex. The process typically begins by locating missing or invalid entries through visualization or value inspection. Once identified, the goal is to estimate replacements that closely approximate the true, unobserved values. The method used depends heavily on the behavior and structure of the data.
+
+- **Slowly changing data**, such as temperature measurements, can often be filled using the nearest valid observation.
+- **Seasonal or moderately variable data**, like weather records, may benefit from statistical approaches such as moving averages, medians, or _K_-nearest neighbor imputation.
+- **Strongly time-dependent data**, such as financial or process signals, are best handled using interpolation methods that estimate values based on surrounding data points.
+
+You may have data that looks like this.
+![A solar irradiance raw input data time-series plot with missing values.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns_copy_copy/725f6f68-0273-4bd3-8e6a-6a184615752a/image.adapt.full.medium.jpg/1758740047296.jpg)
+
+In Python, the `pandas` library provides several simple and powerful ways to handle missing values. Missing entries in a DataFrame appear as `NaN` (Not a Number), and you can replace or estimate these values using methods such as forward fill, backward fill, interpolation, or moving averages.
+
+- Forward fill (`ffill`) uses the last valid observation to replace missing values, which is useful for slowly changing signals.
+- Backward fill (`bfill`) propagates the next valid value backward to fill earlier gaps.
+- Interpolation estimates missing values using linear or polynomial trends between known data points.
+- Rolling mean or moving average smooths short-term fluctuations by averaging nearby samples, similar to MATLAB’s `movmean()` function.
+
+The example below demonstrates these techniques applied to a temperature dataset with missing readings:
+```python
+import pandas as pd
+
+# Example data
+data = {"Time_s": [0, 1, 2, 3, 4, 5],
+        "Temp_C": [20.1, None, 21.0, None, 22.3, 22.8]}
+df = pd.DataFrame(data)
+
+# Fill with the last valid value (forward fill)
+df["Temp_ffill"] = df["Temp_C"].ffill()
+
+# Fill with the next valid value (backward fill)
+df["Temp_bfill"] = df["Temp_C"].bfill()
+
+# Linear interpolation between missing values
+df["Temp_interp"] = df["Temp_C"].interpolate(method="linear")
+
+# Rolling mean (similar to a moving average)
+df["Temp_movmean"] = df["Temp_C"].fillna(
+    df["Temp_C"].rolling(window=3, min_periods=1).mean()
+)
+print(df)
+```
+
+
+## Identify and remove outliers
+Outliers are data points that differ greatly from the rest of the data, often appearing as unusually high or low values that don’t follow the overall pattern. They can distort analysis and lead to misleading conclusions. Outliers may result from measurement errors, data entry mistakes, normal variation, or true anomalies in the system being measured.
+
+One common statistical approach to detect and remove outliers is the **Z-score method**. A Z-score describes how far a data point is from the mean of the dataset, measured in units of standard deviation. For a normally distributed variable, most data points lie within three standard deviations of the mean (±3σ). Values that fall far outside this range are likely to be **outliers**, meaning they deviate significantly from the typical trend of the data.
+
+In practice, we calculate the Z-score for each observation, take its absolute value, and remove points whose Z-score exceeds a threshold, commonly 3 for general data, or 2.5 when the dataset is smaller or more sensitive to noise. This method works best when the data roughly follows a bell-shaped (Gaussian) distribution.
+
+The following example demonstrates how we could apply the Z-score method using the SciPy library and a small sample dataset of force measurements:
+
+```python
+import pandas as pd
+import numpy as np
+from scipy import stats
+
+# Example dataset
+df = pd.DataFrame({"Force_N": [10, 11, 10.5, 10.2, 11.1, 50, 10.8, 9.9]})
+
+# Compute Z-scores
+z = np.abs(stats.zscore(df["Force_N"]))
+
+# Keep only points where Z < 3 (within 3 standard deviations)
+df_clean = df[z < 3]
+
+print(df_clean)
+```
+
+
+### Problem 1: Cleaning datasets
+Clean up the following dataset using the methods above.
+
+## Apply smoothing and detrending
+
+
+
+
+## Units and scaling
+
+
+
+### Problem 2:
+
+
-- 
cgit v1.2.3