Diffstat (limited to 'tutorials/module_4/Data Cleaning and Preprocessing.md')
-rw-r--r-- tutorials/module_4/Data Cleaning and Preprocessing.md | 104
1 file changed, 104 insertions, 0 deletions
diff --git a/tutorials/module_4/Data Cleaning and Preprocessing.md b/tutorials/module_4/Data Cleaning and Preprocessing.md
new file mode 100644
index 0000000..7ac126c
--- /dev/null
+++ b/tutorials/module_4/Data Cleaning and Preprocessing.md
@@ -0,0 +1,104 @@
+# Data Cleaning and Preprocessing
+
+**Learning objectives:**
+
+- Detect and handle missing or invalid data
+- Identify and remove outliers
+- Apply smoothing and detrending
+- Ensure unit consistency and scaling
+
+---
+## What is data cleaning?
+Data cleaning is an **iterative and adaptive process** that uses different methods depending on the characteristics of the dataset, the goals of the analysis, and the tools available. It generally includes several key tasks, such as:
+- Handling or replacing missing and invalid data
+- Detecting and correcting outliers
+- Reducing noise through smoothing or filtering techniques
+
+
+## Handling missing or invalid data
+Missing data occurs when expected values or measurements are absent from a dataset, often appearing as `NULL`, `0`, empty strings, or `NaN` (Not a Number) entries. These gaps can arise from various sources, including sensor malfunctions during data acquisition, errors in transmission, or formatting issues during data conversion. Because missing data can distort analyses and weaken model accuracy, it must be carefully identified and treated during the data cleaning stage.
+
+Detecting missing data may seem simple, but selecting an appropriate way to replace those gaps is often more complex. The process typically begins by locating missing or invalid entries through visualization or value inspection. Once identified, the goal is to estimate replacements that closely approximate the true, unobserved values. The method used depends heavily on the behavior and structure of the data.
+
+- **Slowly changing data**, such as temperature measurements, can often be filled using the nearest valid observation.
+- **Seasonal or moderately variable data**, like weather records, may benefit from statistical approaches such as moving averages, medians, or _K_-nearest neighbor imputation.
+- **Strongly time-dependent data**, such as financial or process signals, are best handled using interpolation methods that estimate values based on surrounding data points.
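+
+Whichever strategy you choose, the first step described above, locating the missing or invalid entries, can be done directly in `pandas`. A minimal sketch (the column name is illustrative):
+```python
+import pandas as pd
+
+df = pd.DataFrame({"Temp_C": [20.1, None, 21.0, None, 22.3, 22.8]})
+
+# Count missing entries per column, then show the affected rows
+print(df.isna().sum())
+print(df[df["Temp_C"].isna()])
+```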
+
+You may have data that looks like this:
+![A solar irradiance raw input data time-series plot with missing values.|450](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns_copy_copy/725f6f68-0273-4bd3-8e6a-6a184615752a/image.adapt.full.medium.jpg/1758740047296.jpg)
+
+In Python, the `pandas` library provides several simple and powerful ways to handle missing values. Missing entries in a DataFrame appear as `NaN` (Not a Number), and you can replace or estimate these values using methods such as forward fill, backward fill, interpolation, or moving averages.
+
+- Forward fill (`ffill`) uses the last valid observation to replace missing values, which is useful for slowly changing signals.
+- Backward fill (`bfill`) propagates the next valid value backward to fill earlier gaps.
+- Interpolation estimates missing values using linear or polynomial trends between known data points.
+- Rolling mean or moving average smooths short-term fluctuations by averaging nearby samples, similar to MATLAB’s `movmean()` function.
+
+The example below demonstrates these techniques applied to a temperature dataset with missing readings:
+```python
+import pandas as pd
+
+# Example data
+data = {"Time_s": [0, 1, 2, 3, 4, 5],
+ "Temp_C": [20.1, None, 21.0, None, 22.3, 22.8]}
+df = pd.DataFrame(data)
+
+# Fill with the last valid value (forward fill)
+df["Temp_ffill"] = df["Temp_C"].ffill()
+
+# Fill with next valid value (backward fill)
+df["Temp_bfill"] = df["Temp_C"].bfill()
+
+# Linear interpolation between missing values
+df["Temp_interp"] = df["Temp_C"].interpolate(method="linear")
+
+# Fill missing values with a 3-sample rolling mean (moving average)
+df["Temp_movmean"] = df["Temp_C"].fillna(
+ df["Temp_C"].rolling(window=3, min_periods=1).mean()
+)
+print(df)
+
+```
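+
+The strategy list above also mentions _K_-nearest neighbor imputation. A minimal sketch, assuming scikit-learn is available (it is not used elsewhere in this tutorial):
+```python
+import pandas as pd
+from sklearn.impute import KNNImputer
+
+df = pd.DataFrame({"Time_s": [0, 1, 2, 3, 4, 5],
+                   "Temp_C": [20.1, None, 21.0, None, 22.3, 22.8]})
+
+# Each missing Temp_C is filled with the average of the k nearest rows,
+# where "nearest" is measured across the numeric columns
+imputer = KNNImputer(n_neighbors=2)
+df[["Time_s", "Temp_C"]] = imputer.fit_transform(df[["Time_s", "Temp_C"]])
+print(df)
+```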
+
+
+## Identify and remove outliers
+Outliers are data points that differ greatly from the rest of the data, often appearing as unusually high or low values that don’t follow the overall pattern. They can distort analysis and lead to misleading conclusions. Outliers may result from measurement errors, data entry mistakes, normal variation, or true anomalies in the system being measured.
+
+One common statistical approach to detect and remove outliers is the **Z-score method**. A Z-score describes how far a data point is from the mean of the dataset, measured in units of standard deviation. For a normally distributed variable, about 99.7% of values lie within three standard deviations of the mean (±3σ). Values that fall far outside this range deviate significantly from the typical trend of the data and are likely to be **outliers**.
+
+In practice, we calculate the Z-score for each observation, take its absolute value, and remove points whose Z-score exceeds a threshold, commonly 3 for general data, or 2.5 when the dataset is smaller or more sensitive to noise. This method works best when the data roughly follows a bell-shaped (Gaussian) distribution.
+
+The following example demonstrates how we could apply the Z-score method using the SciPy library and a small sample dataset of force measurements:
+
+```python
+import pandas as pd
+import numpy as np
+from scipy import stats
+
+# Example dataset
+df = pd.DataFrame({"Force_N": [10, 11, 10.5, 10.2, 11.1, 50, 10.8, 9.9]})
+
+# Compute Z-scores
+z = np.abs(stats.zscore(df["Force_N"]))
+
+# Keep only points where |Z| < 2.5 (see the note below on the threshold)
+df_clean = df[z < 2.5]
+
+print(df_clean)
+```
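+
+Note the choice of threshold: with only eight samples, the 50 N spike itself inflates the mean and standard deviation, so its Z-score is only about 2.6. The tighter 2.5 threshold mentioned above is therefore used here; with the usual threshold of 3, no points would be removed from this dataset.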
+
+
+### Problem 1: Cleaning datasets
+Clean up the following dataset using the methods above.
+
+## Apply smoothing and detrending
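+Smoothing reduces high-frequency noise by averaging or filtering nearby samples, while detrending removes a slow drift (for example, a linear trend) so the underlying behavior of the signal is easier to see. As a starting point, here is a minimal sketch of both steps, assuming pandas and SciPy and a synthetic drifting signal (the variable names are illustrative):
+```python
+import numpy as np
+import pandas as pd
+from scipy.signal import detrend
+
+# Synthetic signal: linear drift + oscillation + random noise
+t = np.arange(0, 10, 0.1)
+rng = np.random.default_rng(seed=0)
+raw = 0.5 * t + np.sin(2 * np.pi * 0.5 * t) + rng.normal(0, 0.2, t.size)
+s = pd.Series(raw, index=t)
+
+# Smoothing: a centered rolling mean attenuates high-frequency noise
+smoothed = s.rolling(window=5, center=True, min_periods=1).mean()
+
+# Detrending: subtract the least-squares line to remove the drift
+detrended = pd.Series(detrend(smoothed.to_numpy(), type="linear"), index=t)
+```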
+
+
+
+
+## Units and scaling
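+Measurements recorded in mixed units (for example, millimeters and meters) must be converted to a common base unit before analysis, and variables with very different numeric ranges are often rescaled so that no single variable dominates. A minimal sketch of both steps, assuming a pandas DataFrame with a `Unit` column (the column names are illustrative):
+```python
+import pandas as pd
+
+# Mixed-unit length readings; convert everything to meters first
+df = pd.DataFrame({"Length": [1200.0, 1.1, 950.0, 1.3],
+                   "Unit": ["mm", "m", "mm", "m"]})
+df["Length_m"] = df["Length"].where(df["Unit"] == "m", df["Length"] / 1000)
+
+# Min-max scaling to [0, 1] and Z-score standardization
+x = df["Length_m"]
+df["Length_minmax"] = (x - x.min()) / (x.max() - x.min())
+df["Length_zscore"] = (x - x.mean()) / x.std()
+print(df)
+```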
+
+
+
+### Problem 2:
+
+