author Christian Kolset <christian.kolset@gmail.com> 2025-10-06 12:46:06 -0600
committer Christian Kolset <christian.kolset@gmail.com> 2025-10-06 12:46:06 -0600
commit 2a57f6032b5717020bfaf4dccde1546a7a29ee6f
tree bdcdf431d2011541e42e2f64245c8f152b0ea44f
parent b7652c078a74ec0fd8419c4e0d8f9dc1d7b28020
Added Reference notes for module 4
-rw-r--r-- tutorials/module_4/1_importing_scientific_data.md | 2
-rw-r--r-- tutorials/module_4/2_data_processing.md | 3
-rw-r--r-- tutorials/module_4/3_linear_regression.md | 3
-rw-r--r-- tutorials/module_4/How to represent data.md | 1
-rw-r--r-- tutorials/module_4/Pandas.md | 4
-rw-r--r-- tutorials/module_4/data cleaning.md | 81
-rw-r--r-- tutorials/module_4/data filtering.md | 81
-rw-r--r-- tutorials/module_4/data_visualization.md | 2
-rw-r--r-- tutorials/module_4/importing data.md | 9
-rw-r--r-- tutorials/module_4/processing data.md | 7
-rw-r--r-- tutorials/module_4/working with irregular data.md | 8
11 files changed, 191 insertions, 10 deletions
diff --git a/tutorials/module_4/1_importing_scientific_data.md b/tutorials/module_4/1_importing_scientific_data.md
index f609b39..9cfedbe 100644
--- a/tutorials/module_4/1_importing_scientific_data.md
+++ b/tutorials/module_4/1_importing_scientific_data.md
@@ -1,7 +1,5 @@
# Importing Scientific Data using Pandas
-^8eb966
-
[Introduction text]
diff --git a/tutorials/module_4/2_data_processing.md b/tutorials/module_4/2_data_processing.md
index d75c35a..eba7a12 100644
--- a/tutorials/module_4/2_data_processing.md
+++ b/tutorials/module_4/2_data_processing.md
@@ -1,8 +1,5 @@
# Data Processing
-^7b1480
-
-
## Signal Processing - Filtering
### Low-Pass
diff --git a/tutorials/module_4/3_linear_regression.md b/tutorials/module_4/3_linear_regression.md
index d79c639..511ea1a 100644
--- a/tutorials/module_4/3_linear_regression.md
+++ b/tutorials/module_4/3_linear_regression.md
@@ -1,8 +1,5 @@
# Linear Regression
-^aab594
-
-
## Statistical tools
NumPy comes with some useful statistical tools that we can use to analyze our data.
diff --git a/tutorials/module_4/How to represent data.md b/tutorials/module_4/How to represent data.md
new file mode 100644
index 0000000..041c377
--- /dev/null
+++ b/tutorials/module_4/How to represent data.md
@@ -0,0 +1 @@
+How to represent data
\ No newline at end of file
diff --git a/tutorials/module_4/Pandas.md b/tutorials/module_4/Pandas.md
new file mode 100644
index 0000000..d62ee48
--- /dev/null
+++ b/tutorials/module_4/Pandas.md
@@ -0,0 +1,4 @@
+Pandas
+Panel Data
+
+https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_spreadsheets.html#compare-with-spreadsheets
\ No newline at end of file
diff --git a/tutorials/module_4/data cleaning.md b/tutorials/module_4/data cleaning.md
new file mode 100644
index 0000000..efd2889
--- /dev/null
+++ b/tutorials/module_4/data cleaning.md
@@ -0,0 +1,81 @@
+#### Data Cleaning
+
+###### How Data Cleaning Works
+
+Data cleaning is an iterative process that involves different techniques depending on your data set, the objectives of the final analysis, and the available tools and software. Data cleaning typically involves one or more of the following steps:
+
+### Typical Data Cleaning Steps
+
+#### Filling Missing Data
+
+Missing data refers to the absence of values or information in a data set, resulting in NULL, 0, empty strings, or invalid (NaN) data points. Values can be missing for several reasons, such as problems in data acquisition, data transmission, or data conversion. Missing data can have a significant impact on the quality and validity of data analysis and modeling; hence, it is important to address it appropriately during the data cleaning process.
+
+Missing data can be classified into three categories, and identifying the right category can help you select an appropriate fill method:
+
+1. Missing at random (MAR) — In this category, the variable with missing values is dependent on other variables in the data set. For instance, a rooftop solar installation relaying telemetry data such as irradiance level, grid voltage, frequency, etc., would have missing values at night or during rainy days because there isn’t enough solar irradiance to power up the system and thus the missing values of grid voltage or frequency are caused by poor irradiance levels.
+2. Missing completely at random (MCAR) — In this category, the underlying cause of missing values is completely unrelated to any other variable in the data set. For example, missing packets in weather telemetry could result from malfunctioning sensors or high channel noise.
+3. Missing not at random (MNAR) — This scenario applies to variables where the underlying cause of missing data is related to the variable itself. For example, if a sensor relaying temperature information has reached its measurement limits, it would result in missing values in the form of its saturated thresholds.
+
+Identifying missing data sounds straightforward, but replacing it with a suitable estimate is an involved process. You can start by spotting missing values using visualization or searching for invalid values. Replacing the missing values involves generating values that are likely to be close to the actual values. Based on the nature of the data, the technique of filling these missing values can vary. For example:
+
+- Slowly varying data, such as temperature, could simply use the nearest valid value.
+- Data sets exhibiting seasonality and reduced randomness, such as weather, could use statistical methods like moving average, median, or _K_-nearest neighbors.
+- Data sets exhibiting strong dependencies on their previous values, such as stock prices or economic indicators, are well suited to interpolation-based techniques for generating missing data.
+
+Figure 5 shows raw solar irradiance data with its missing values filled using the `fillmissing` function. In this instance, a moving median window-based technique is used to fill in the missing values.
+
+![A solar irradiance raw input data time-series plot with missing values.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns_copy_copy/725f6f68-0273-4bd3-8e6a-6a184615752a/image.adapt.full.medium.jpg/1758740047296.jpg)
+
+![A solar irradiance raw input data time-series plot with missing values filled using the fillmissing function in MATLAB.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns_copy_copy/791b019f-886d-4859-b338-895a71833079/image.adapt.full.medium.jpg/1758740047311.jpg)
+
+Figure 5. Time-series plot of a solar irradiance raw data set, with its missing values filled using the `fillmissing` function in MATLAB.
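
The moving-median fill described above can be sketched in plain Python. This is only an illustration, not the MATLAB `fillmissing` implementation: the function name `fill_missing_moving_median`, the window size, and the sample values are all assumptions made for this example. In pandas, a similar result could be built from `Series.rolling(...).median()` combined with `Series.fillna`.

```python
import math
from statistics import median


def fill_missing_moving_median(values, window=5):
    """Replace NaN entries with the median of the valid samples
    inside a centred window (hypothetical helper for illustration)."""
    half = window // 2
    filled = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue  # only gaps are touched; valid samples stay as-is
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        neighbours = [x for x in values[lo:hi] if not math.isnan(x)]
        # Leave the gap as NaN when the window holds no valid sample.
        if neighbours:
            filled[i] = median(neighbours)
    return filled


nan = float("nan")
# Made-up irradiance readings with three missing samples.
irradiance = [510.0, 520.0, nan, 540.0, 530.0, nan, nan, 500.0]
filled = fill_missing_moving_median(irradiance, window=5)
print(filled)
```

The medians are always computed from the original samples, so a value filled earlier never influences a later fill.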
+
+#### Managing Outliers
+
+Outliers are data points that deviate significantly from most observations within a data set. They can be unusually high or low values that do not seem to follow the general pattern of the data. Outliers can distort the statistical analysis and interpretation of a data set, potentially leading to misleading results. Outliers can arise due to various reasons, including measurement errors, data entry mistakes, natural variability, or genuine anomalies in the underlying process being studied.
+
+Managing outliers involves two configurable steps:
+
+1. Detection
+
+Detecting outliers involves defining a valid operating range outside of which any data point is identified as an outlier. Methods used in defining the valid operating range are related to the attribute, source, and purpose of the data set. These methods range from simple techniques like visualization-based or fixed threshold–based outlier detection to statistical methods like median absolute deviation, to distance-based methods, such as Euclidean and Mahalanobis.
+
+2. Filling outliers
+
+After identifying the outliers, they can be replaced with generated values. Generating techniques used in replacing outliers are similar to the ones used for filling missing data.
+
+Figure 6 shows input data with two outliers that are detected using median thresholding and filled using linear interpolation.
+
+![Graph shows two outliers detected using median thresholding and filled by a linear interpolation method interactively using the Clean Outlier Data Live Editor task in MATLAB.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/image_1672412617.adapt.full.medium.jpg/1758740047342.jpg)
+
+Figure 6. Clean Outlier Live Editor task used to detect and fill outliers using median thresholding and linear interpolation respectively.
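
As a rough Python sketch of the same two steps, the code below detects outliers with a scaled median absolute deviation (MAD) threshold and fills them by linear interpolation between the nearest unflagged neighbours. The function names, the threshold of 3 scaled MADs, and the sample data are assumptions for illustration; this is not the algorithm behind the Live Editor task.

```python
from statistics import median


def detect_outliers_mad(values, threshold=3.0):
    """Flag points lying more than `threshold` scaled MADs from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # 1.4826 scales the MAD to match the standard deviation of normal data.
    scaled_mad = 1.4826 * mad
    if scaled_mad == 0:
        return [False] * len(values)
    return [abs(v - med) > threshold * scaled_mad for v in values]


def fill_linear(values, mask):
    """Replace flagged points by linear interpolation between the
    nearest unflagged neighbours (nearest value at the edges)."""
    good = [i for i, bad in enumerate(mask) if not bad]
    out = list(values)
    if not good:
        return out
    for i, bad in enumerate(mask):
        if not bad:
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            out[i] = values[left] + frac * (values[right] - values[left])
    return out


# Made-up signal with two obvious outliers (50.0 and -30.0).
data = [10.0, 11.0, 10.5, 50.0, 11.5, 12.0, -30.0, 12.5, 13.0]
mask = detect_outliers_mad(data)
cleaned = fill_linear(data, mask)
print(mask)
print(cleaned)
```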
+
+#### Smoothing
+
+Smoothing is a data analysis technique used to reduce noise, variability, or irregularities in a data set to reveal underlying patterns or trends more clearly. It is commonly applied in various fields including statistics, signal processing, time-series analysis, and image processing.
+
+Like other data cleaning methods, the choice of smoothing technique depends heavily on the nature and the domain of the data. Options range from simple statistical methods, such as a moving average filter, weighted moving average filter, or moving median-based filter, to more complex techniques such as splines, Fourier transform smoothing, and Kalman filtering. The smoothing function requires the data set to be ordered and sampled at a fixed interval.
+
+![A plot showing raw noisy input data before and after applying the data cleaning technique smoothdata function in MATLAB to remove the noise from the input signal.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/image_103707558.adapt.full.medium.jpg/1758740047366.jpg)
+
+Figure 7. MATLAB plot of a noisy data set smoothed using a moving average filter with the `smoothdata` function.
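
A moving average filter is simple enough to sketch directly. The version below is a minimal illustration that assumes evenly sampled, ordered data and lets the window shrink at the edges; the function name and sample values are made up, and it is not a reimplementation of `smoothdata`. With pandas, `Series.rolling(window, center=True).mean()` covers the same idea.

```python
def moving_average(values, window=3):
    """Centred moving average; the window shrinks near the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Made-up noisy samples oscillating around a rising trend.
noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
smooth = moving_average(noisy, window=3)
print(smooth)
```

A wider window removes more noise but also flattens genuine short-lived features, so the window size is a trade-off to tune per data set.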
+
+### Data Cleaning Using Deep Learning Models
+
+Traditional data cleaning methods work well with data that can be modeled with commonly known statistical and mathematical models. But for complex data sets that do not fit standard models well, like human speech, EEG signals, etc., we can leverage deep learning models to perform data cleaning.
+
+In this [example](https://www.mathworks.com/help/deeplearning/ug/denoise-speech-using-deep-learning-networks.html) shown in Figure 8, speech signals are riddled with noise from a washing machine running in the background. Data cleaning methods, such as smoothing or outlier removal, cannot effectively remove the washing machine noise because its audio spectrum overlaps with that of the speech signal. Deep learning networks, such as [fully connected](https://www.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.fullyconnectedlayer.html) and [convolutional](https://www.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.convolution2dlayer.html) networks, are able to clean or denoise the speech signal, removing the noise and leaving the underlying signal.
+
+![Graphs showing a clean speech signal and a version of it contaminated by washing machine noise in the background and graphs comparing the output of fully connected and convolutional networks used to denoise the speech signal plotted in MATLAB.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/image_1238277247.adapt.full.medium.jpg/1758740047389.jpg)
+
+Figure 8. MATLAB plots of clean and noisy speech signals and denoised output from two deep learning networks—fully connected and convolutional.
+
+### Data Cleaning with Excel
+
+Microsoft® Excel® is a common tool for cleaning and preparing data. It offers built-in commands, such as Remove Duplicates and Find and Replace, that you can use to standardize data sets. You can also apply conditional formatting to highlight inconsistencies or use [pivot tables](https://www.mathworks.com/help/matlab/ref/pivottable.html) to identify and correct errors. However, for larger data sets, tasks such as handling missing values, merging data sets, or applying custom logic often need to be done manually. Lack of automation can increase the risk of unintended errors and inconsistencies in processing, especially when working with complex data sets.
+
+MATLAB can help with some of the more time-consuming parts of data cleaning in Excel, especially when working with larger data sets. MATLAB scripts and functions make data cleaning transformations transparent, so you can always see what steps are taken and adjust as needed. For example, instead of manually searching for missing values, you can use `fillmissing` to automatically handle gaps in data. By using MATLAB with Excel, you can handle messier data sets more consistently while keeping control over the process.
\ No newline at end of file
diff --git a/tutorials/module_4/data filtering.md b/tutorials/module_4/data filtering.md
new file mode 100644
index 0000000..2ecbca3
--- /dev/null
+++ b/tutorials/module_4/data filtering.md
@@ -0,0 +1,81 @@
+#### Data Filtering
+
+## What Is Data Filtering?
+
+Data filtering is the process of refining raw data by removing errors, reducing noise, and isolating relevant information for analysis. It helps improve accuracy, consistency, and reliability—key factors in making data truly useful.
+
+For example, an audio engineer might apply a [low-pass filter](https://www.mathworks.com/discovery/low-pass-filter.html) to remove high-frequency noise while preserving lower frequencies, ensuring clearer sound in music production and telecommunications.
+
+In practice, data filtering plays a crucial role in everything from cleaning data sets for visualization to optimizing machine learning models. By eliminating unwanted data, engineers and scientists can focus on the information that matters most.
+
+![Screenshot showing data filtering in a two-line plot of the original and filtered data. The smoothed signal removes noise but preserves key components.](https://www.mathworks.com/discovery/data-filtering/_jcr_content/mainParsys/columns/7a8517e6-3719-491a-b9c2-d63f41869458/image.adapt.full.medium.png/1759479220970.png)
+
+You can filter noisy data using a low-pass filter and then visualize the results using MATLAB. ([See code.](https://www.mathworks.com/help/signal/ug/filtering-data-with-signal-processing-toolbox.html))
+
+## Key Aspects of Data Filtering
+
+Effective data filtering helps refine data sets, improving accuracy, reliability, and usability across various applications. While techniques vary by field, three fundamental aspects define how data filtering is applied:
+
+- **Noise reduction** removes unwanted variations or distortions that can obscure meaningful information, improving data clarity and consistency.
+- **Relevance filtering** selects only the most useful data based on specific criteria, ensuring that analytics and decision-making focus on high-value information.
+- **Data smoothing and transformation** reduces abrupt fluctuations and refines raw data, making it easier to identify trends and patterns in time-series analysis and predictive modeling.
+
+### Example: Medical Imaging
+
+In [medical image processing](https://www.mathworks.com/help/medical-imaging/index.html), data filtering is essential for producing clearer scans. For example, MRI and CT scans use filters to reduce noise caused by movement or interference, making it easier for radiologists to detect abnormalities. Without filtering, critical details could be lost in background noise, potentially leading to misdiagnosis.
+
+## Best Practices for Effective Data Filtering
+
+Effective data filtering is crucial to maintaining data accuracy and reliability. The top best practices to ensure high-quality results are:
+
+- **Understanding your data:** Before applying any filters, analyze the structure and characteristics of your data set. This step includes identifying noise, missing values, and outliers to choose the most suitable filtering techniques.
+- **Choosing the right filter:** Select filters that align with your analysis goals. For example, use frequency-based filters for noise reduction, smoothing filters for trend preservation, and rule-based filters for outlier detection.
+- **Preserving data integrity:** Avoid over-filtering, which can remove important insights. Focus on improving accuracy while maintaining essential data and patterns.
+- **Evaluating filtered data:** Always assess the effectiveness of your filtering. Compare raw versus filtered data, visualize the results, and use statistical metrics to ensure the accuracy and reliability of your data.
+
+## Types of Data Filtering Methods
+
+Different filtering methods are used depending on the data set type, the nature of the noise, and the desired outcome—whether it’s removing interference, [detecting anomalies](https://www.mathworks.com/discovery/anomaly-detection.html), or smoothing fluctuations in time-series data. Choosing the right filter ensures cleaner, more reliable data for analysis and decision-making.
+
+The table below outlines several key filtering methods, their purpose, common applications, and how to implement them in MATLAB®.
+
+Key Data Filtering Methods
+
+|**Filtering Method**|**Types of Filters**|**Purpose**|**Applications**|**MATLAB Example**|
+|---|---|---|---|---|
+|Frequency-based filters (signal processing filters)|- Low-pass<br>- High-pass<br>- Bandpass<br>- Bandstop (notch)|Remove or retain specific frequency components in data|Noise reduction (e.g., low-pass filtering to remove high-frequency noise), image processing, sensor data analysis|[Filtering Data with Signal Processing Toolbox](https://www.mathworks.com/help/signal/ug/filtering-data-with-signal-processing-toolbox.html) (low-pass FIR filter)|
+|Smoothing filters (statistical methods)|- Median filter<br>- Moving average<br>- Gaussian filter<br>- Exponential smoothing|Smooth data by reducing noise and variability|Time-series smoothing, image processing, outlier removal|[Moving-Average Filter of Traffic Data](https://www.mathworks.com/help/matlab/data_analysis/filtering-data.html#bqm3i7m-4)|
+|Rule-based filters (conditional filtering)|- Threshold filters (e.g., greater than, less than)<br>- Rule-based filters|Filter data based on predefined logical conditions|Data cleaning, outlier detection, quality control|[Outlier Removal via Hampel Filter](https://www.mathworks.com/help/signal/ug/signal-smoothing.html#SignalSmoothingExample-10)|
+|Trend-based filters (time-series methods)|- Hodrick-Prescott filter<br>- Kalman filter<br>- Wavelet filter|Identify trends, remove seasonality, smooth fluctuations|Stock market analysis, climate data, sensor monitoring|[Hodrick-Prescott filter for trend and cyclical components](https://www.mathworks.com/help/econ/hpfilter.html)|
+|Machine learning–based filters|- Anomaly detection algorithms<br>- Autoencoders<br>- Clustering-based filtering|Use AI and machine learning to detect and remove noisy or irrelevant data|Fraud detection, predictive maintenance, automated data cleaning|[Anomaly Detection Using Autoencoder and Wavelets](https://www.mathworks.com/help/wavelet/ug/detect-anomalies-using-wavelet-scattering-with-autoencoders.html)|
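
To make one row of the table concrete, here is a minimal Python sketch of exponential smoothing, which behaves like a simple first-order low-pass filter. The function name, the smoothing factor `alpha`, and the sample signal are assumptions for illustration and are not taken from the linked MATLAB examples.

```python
def exponential_smoothing(values, alpha=0.5):
    """Exponential smoothing: y[0] = x[0], y[n] = alpha*x[n] + (1 - alpha)*y[n-1].

    Smaller alpha means heavier smoothing (stronger low-pass behaviour).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    if not values:
        return []
    out = [values[0]]
    for x in values[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


# Made-up signal that flips rapidly between 0 and 4 (high-frequency content).
signal = [0.0, 4.0, 0.0, 4.0, 0.0, 4.0]
smoothed = exponential_smoothing(signal, alpha=0.5)
print(smoothed)
```

The output swings far less than the input, which is the low-pass effect: rapid alternations are attenuated while the average level is kept.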
+
+## Techniques for Data Filtering with MATLAB
+
+MATLAB offers powerful tools for filtering data across various domains, including signal processing, image processing, and time-series analysis. Key techniques for data filtering in MATLAB include:
+
+- **Predefined filters:** MATLAB provides built-in functions, such as `lowpass`, `highpass`, and `movmean`, for quick and efficient filtering. These functions are ideal for noise reduction, trend smoothing, frequency component isolation, and other common tasks.
+- **Custom filter design:** For more specific filtering needs, MATLAB lets you design custom filters:
+ - **Interactive design:** Use the Filter Designer app (`filterDesigner`) or the Design Filter Live Editor to create and test filters interactively.
+ - **Programmatic design:** Create filters programmatically using `designfilt`, `butter`, `cheby1`, or other functions. This method enables you to customize key parameters such as filter order, cutoff frequency, and filter type (FIR or IIR) for more control over the filtering process.
+- **Visualizing and evaluating filters:** After applying a filter, you can visualize the filtered data using the MATLAB `plot` functions to compare the effects of the filter with the raw data. This step is essential for evaluating how well the filter has improved data quality.
+
+By leveraging these filtering techniques in MATLAB, engineers and scientists can effectively process and clean data, whether working with [time-series data](https://www.mathworks.com/discovery/time-series-data.html), sensor readings, or signal processing tasks.
+
+## Frequently Asked Questions
+
+### 1. How does data filtering differ from data cleaning?
+
+Data filtering removes irrelevant or noisy data based on criteria such as frequency or statistical properties, while data cleaning focuses on fixing errors, handling missing values, and standardizing formats to ensure data integrity. See more on [data cleaning](https://www.mathworks.com/discovery/data-cleaning.html).
+
+### 2. What are the main challenges in data filtering, and how can I overcome them?
+
+- **Over-filtering:** Fine-tune parameters to avoid losing valuable data.
+- **Slow performance:** For large data sets, use optimized techniques in MATLAB to improve filtering speed.
+- **Choosing the right filter:** Ensure you select the right filter to avoid distorting results—and always visualize before and after. See more on choosing the right filter with [Filter Designer](https://www.mathworks.com/help/signal/ref/filterdesigner-app.html).
+
+### 3. Can filters be combined for better results?
+
+Yes, combining filters often enhances results. For instance:
+
+- Apply a moving average filter before a low-pass filter for smoother data.
+- Use machine learning–based filters to detect outliers before applying noise reduction filters.
\ No newline at end of file
diff --git a/tutorials/module_4/data_visualization.md b/tutorials/module_4/data_visualization.md
index fc02427..d07c20b 100644
--- a/tutorials/module_4/data_visualization.md
+++ b/tutorials/module_4/data_visualization.md
@@ -1,4 +1,2 @@
# Data Visualization
-^1ac244
-
diff --git a/tutorials/module_4/importing data.md b/tutorials/module_4/importing data.md
new file mode 100644
index 0000000..19b5fe1
--- /dev/null
+++ b/tutorials/module_4/importing data.md
@@ -0,0 +1,9 @@
+Importing scientific data
+
+#### Importing Data
+
+**Objective:** Read text files that contain a mixture of data types, delimiters, and headers.
+
+- Import a mixture of data types from arbitrarily formatted text files
+- Import only required columns of data from a text file
+- Import and merge data from multiple files
\ No newline at end of file
diff --git a/tutorials/module_4/processing data.md b/tutorials/module_4/processing data.md
new file mode 100644
index 0000000..5a82797
--- /dev/null
+++ b/tutorials/module_4/processing data.md
@@ -0,0 +1,7 @@
+#### Processing data
+
+**Objective:** Process raw imported data by extracting, manipulating, aggregating, and counting portions of data.
+
+- Process data with missing elements
+- Create and modify categorical arrays
+- Aggregate, bin, and count groups of data
\ No newline at end of file
diff --git a/tutorials/module_4/working with irregular data.md b/tutorials/module_4/working with irregular data.md
new file mode 100644
index 0000000..092fa91
--- /dev/null
+++ b/tutorials/module_4/working with irregular data.md
@@ -0,0 +1,8 @@
+#### Working with Irregular Data
+
+**Objective:** Import and visualize scattered data from text files with irregular formatting.
+
+- Parse text files to determine formatting
+- Import data from separate sections of a text file
+- Extract data from container variables
+- Perform calculations on time series data