Diffstat (limited to 'tutorials/module_4/data cleaning.md')
-rw-r--r-- tutorials/module_4/data cleaning.md | 81
1 files changed, 81 insertions, 0 deletions
diff --git a/tutorials/module_4/data cleaning.md b/tutorials/module_4/data cleaning.md
new file mode 100644
index 0000000..efd2889
--- /dev/null
+++ b/tutorials/module_4/data cleaning.md
@@ -0,0 +1,81 @@
+## Data Cleaning
+
+### How Data Cleaning Works
+
+Data cleaning is an iterative process that involves different techniques depending on your data set, the objectives of the final analysis, and the available tools and software. It typically involves one or more of the steps described below.
+
+### Typical Data Cleaning Steps
+
+#### Filling Missing Data
+
+Missing data refers to the absence of values or information in a data set, which shows up as NULL, 0, empty strings, or invalid (NaN) data points. Values can go missing for several reasons, such as problems during data acquisition, data transmission, or data conversion. Missing data can significantly affect the quality and validity of data analysis and modeling, so it is important to address it appropriately during the data cleaning process.
+
+Missing data can be classified into three categories, and identifying the right category can help you select an appropriate fill method:
+
+1. Missing at random (MAR): The likelihood that a value is missing depends on other observed variables in the data set. For instance, a rooftop solar installation relaying telemetry such as irradiance level, grid voltage, and frequency would have missing values at night or on rainy days, because there isn’t enough solar irradiance to power up the system. The missing grid voltage and frequency values are therefore caused by poor irradiance levels.
+2. Missing completely at random (MCAR): The underlying cause of the missing values is unrelated to any variable in the data set. For example, missing packets in weather telemetry could result from malfunctioning sensors or high channel noise.
+3. Missing not at random (MNAR): The underlying cause of the missing data is related to the variable itself. For example, a temperature sensor that has reached its measurement limits cannot report the true value, so the affected readings appear as its saturation thresholds.
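+
+The three categories can be illustrated with a small simulation. This is only a sketch: the telemetry values, the `MIN_IRRADIANCE` power-down level, and the `SENSOR_MAX` limit are hypothetical, and missing samples are marked with `None` for simplicity (a real saturated sensor may report its limit value instead).
+
+```python
+import random
+
+# Hypothetical telemetry: irradiance (W/m^2), grid voltage (V), temperature (deg C).
+irradiance = [0, 5, 250, 600, 800, 300, 10, 0]
+voltage = [230.1, 229.8, 230.4, 231.0, 230.7, 230.2, 229.9, 230.0]
+temperature = [18.0, 19.5, 41.0, 52.5, 61.0, 47.0, 22.0, 18.5]
+
+MIN_IRRADIANCE = 50   # assumed level below which the system powers down
+SENSOR_MAX = 50.0     # assumed upper measurement limit of the temperature sensor
+
+# MAR: voltage is missing whenever another variable (irradiance) is too low.
+voltage_mar = [v if g >= MIN_IRRADIANCE else None
+               for v, g in zip(voltage, irradiance)]
+
+# MCAR: each sample is dropped with a fixed probability, independent of any variable.
+rng = random.Random(0)
+voltage_mcar = [None if rng.random() < 0.25 else v for v in voltage]
+
+# MNAR: temperature is missing because of its own value (above the sensor limit).
+temperature_mnar = [t if t <= SENSOR_MAX else None for t in temperature]
+
+print(voltage_mar)
+print(temperature_mnar)
+```
+
+In the MAR case the gaps line up with low irradiance, in the MNAR case they line up with high temperature itself, and in the MCAR case they follow no pattern in the data.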
+
+Identifying missing data sounds straightforward, but replacing it with a suitable estimate is an involved process. You can start by spotting missing values using visualization or searching for invalid values. Replacing the missing values involves generating values that are likely to be close to the actual values. Based on the nature of the data, the technique of filling these missing values can vary. For example:
+
+- Slowly varying data, such as temperature, can simply use the nearest valid value.
+- Data sets exhibiting seasonality and reduced randomness, such as weather data, can use statistical methods like moving average, median, or _K_-nearest neighbors.
+- Data sets whose values depend strongly on their previous values, such as stock prices or economic indicators, are well suited to interpolation-based techniques for generating missing data.
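+
+As a minimal sketch of two of these options, the following Python functions fill gaps with the nearest valid value and with linear interpolation. The function names and the sample data are hypothetical, gaps are assumed to be marked with `None`, and samples are assumed to be evenly spaced.
+
+```python
+def fill_nearest(values):
+    """Replace each None with the nearest valid value (ties go to the earlier sample)."""
+    valid = [i for i, v in enumerate(values) if v is not None]
+    if not valid:
+        return list(values)
+    filled = []
+    for i, v in enumerate(values):
+        if v is None:
+            nearest = min(valid, key=lambda j: (abs(j - i), j))
+            filled.append(values[nearest])
+        else:
+            filled.append(v)
+    return filled
+
+
+def fill_linear(values):
+    """Fill interior gaps by linear interpolation between the surrounding valid samples."""
+    filled = list(values)
+    valid = [i for i, v in enumerate(values) if v is not None]
+    for left, right in zip(valid, valid[1:]):
+        step = (values[right] - values[left]) / (right - left)
+        for i in range(left + 1, right):
+            filled[i] = values[left] + step * (i - left)
+    return filled  # leading and trailing gaps are left as None
+
+
+temperature = [20.0, None, None, 23.0, None, 25.0]
+print(fill_nearest(temperature))  # [20.0, 20.0, 23.0, 23.0, 23.0, 25.0]
+print(fill_linear(temperature))   # [20.0, 21.0, 22.0, 23.0, 24.0, 25.0]
+```
+
+The nearest-value fill produces steps, which is acceptable for slowly varying data. Interpolation follows the local trend instead, but it cannot fill gaps at the very start or end of the series.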
+
+Figure 5 shows raw solar irradiance data with its missing values filled using the `fillmissing` function. In this instance, a moving median window-based technique is used to fill in the missing values.
+
+![A solar irradiance raw input data time-series plot with missing values.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns_copy_copy/725f6f68-0273-4bd3-8e6a-6a184615752a/image.adapt.full.medium.jpg/1758740047296.jpg)
+
+![A solar irradiance raw input data time-series plot with missing values filled using the fillmissing function in MATLAB.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/columns_copy_copy/791b019f-886d-4859-b338-895a71833079/image.adapt.full.medium.jpg/1758740047311.jpg)
+
+Figure 5. Time-series plot of a solar irradiance raw data set, with its missing values filled using the `fillmissing` function in MATLAB.
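+
+The moving median idea behind Figure 5 can be sketched in a few lines of Python. This is an illustration of the technique under simple assumptions (gaps marked with `None`, a centered window that shrinks at the edges), not the implementation of `fillmissing`. The function name and the sample values are hypothetical.
+
+```python
+import statistics
+
+
+def fill_moving_median(values, window=5):
+    """Replace each None with the median of the valid samples in a centered window."""
+    half = window // 2
+    filled = list(values)
+    for i, v in enumerate(values):
+        if v is None:
+            lo, hi = max(0, i - half), min(len(values), i + half + 1)
+            neighbors = [x for x in values[lo:hi] if x is not None]
+            if neighbors:  # leave the gap if the window holds no valid sample
+                filled[i] = statistics.median(neighbors)
+    return filled
+
+
+irradiance = [510, 520, None, 540, 900, None, 560, 570]
+print(fill_moving_median(irradiance, window=5))
+# [510, 520, 530.0, 540, 900, 565.0, 560, 570]
+```
+
+Because the median ignores extreme neighbors, the spike of 900 barely influences the filled values. A moving average over the same window would be pulled toward it.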
+
+#### Managing Outliers
+
+Outliers are data points that deviate significantly from most observations within a data set. They can be unusually high or low values that do not seem to follow the general pattern of the data. Outliers can distort the statistical analysis and interpretation of a data set, potentially leading to misleading results. Outliers can arise due to various reasons, including measurement errors, data entry mistakes, natural variability, or genuine anomalies in the underlying process being studied.
+
+Managing outliers involves two configurable steps:
+
+1. Detection
+
+Detecting outliers involves defining a valid operating range; any data point outside this range is identified as an outlier. The method used to define the range depends on the attributes, source, and purpose of the data set. Options range from simple techniques, such as visualization-based or fixed threshold–based detection, through statistical methods, such as median absolute deviation, to distance-based methods, such as Euclidean and Mahalanobis distance.
+
+2. Filling outliers
+
+After outliers are identified, they can be replaced with generated values. The techniques for generating replacement values are similar to those used for filling missing data.
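+
+Both steps can be sketched together in Python. The example below detects outliers with a scaled median absolute deviation (MAD) test and fills them by linear interpolation between the nearest non-outlier neighbors. The function names, the threshold of 3 scaled MADs, and the sample data are assumptions made for illustration. The constant 1.4826 is the usual factor that makes the MAD comparable to a standard deviation for normally distributed data.
+
+```python
+import statistics
+
+
+def detect_outliers_mad(values, threshold=3.0):
+    """Flag samples more than `threshold` scaled MADs away from the median."""
+    med = statistics.median(values)
+    mad = statistics.median([abs(v - med) for v in values])
+    scaled_mad = 1.4826 * mad
+    if scaled_mad == 0:  # constant data: nothing can be flagged
+        return [False] * len(values)
+    return [abs(v - med) > threshold * scaled_mad for v in values]
+
+
+def fill_outliers_linear(values, is_outlier):
+    """Replace flagged samples by interpolating between the nearest unflagged ones."""
+    good = [i for i, flag in enumerate(is_outlier) if not flag]
+    filled = list(values)
+    for left, right in zip(good, good[1:]):
+        step = (values[right] - values[left]) / (right - left)
+        for i in range(left + 1, right):
+            filled[i] = values[left] + step * (i - left)
+    return filled  # flagged samples at either end are left unchanged
+
+
+data = [10.0, 10.2, 10.1, 25.0, 10.3, 10.4, -5.0, 10.2, 10.1]
+flags = detect_outliers_mad(data)
+cleaned = fill_outliers_linear(data, flags)
+print(flags)
+print(cleaned)
+```
+
+With this data, only the samples 25.0 and -5.0 are flagged, and they are replaced with values of approximately 10.2 and 10.3.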
+
+Figure 6 shows input data with two outliers that are detected using a median-based threshold and filled using linear interpolation.
+
+![Graph shows two outliers detected using median thresholding and filled by a linear interpolation method interactively using the Clean Outlier Data Live Editor task in MATLAB.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/image_1672412617.adapt.full.medium.jpg/1758740047342.jpg)
+
+Figure 6. Clean Outlier Data Live Editor task used to detect outliers with median thresholding and fill them with linear interpolation.
+
+#### Smoothing
+
+Smoothing is a data analysis technique used to reduce noise, variability, or irregularities in a data set to reveal underlying patterns or trends more clearly. It is commonly applied in various fields including statistics, signal processing, time-series analysis, and image processing.
+
+Like other data cleaning methods, the choice of smoothing technique depends heavily on the nature and domain of the data. Options range from simple statistical methods, such as a moving average filter, weighted moving average filter, or moving median-based filter, to more complex techniques, such as splines, Fourier transform smoothing, and Kalman filtering. These methods generally assume that the data set is ordered and sampled at a fixed interval.
+
+![A plot showing raw noisy input data before and after applying the data cleaning technique smoothdata function in MATLAB to remove the noise from the input signal.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/image_103707558.adapt.full.medium.jpg/1758740047366.jpg)
+
+Figure 7. MATLAB plot of a noisy data set smoothed using a moving average filter with the `smoothdata` function.
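+
+A moving average filter like the one in Figure 7 can be sketched in Python as follows. This is a simplified illustration, not the implementation of `smoothdata`: the function name and sample data are hypothetical, the window is centered, and it shrinks at the edges of the series.
+
+```python
+def moving_average(values, window=3):
+    """Smooth with a centered moving average; the window shrinks at the edges."""
+    half = window // 2
+    smoothed = []
+    for i in range(len(values)):
+        lo, hi = max(0, i - half), min(len(values), i + half + 1)
+        segment = values[lo:hi]
+        smoothed.append(sum(segment) / len(segment))
+    return smoothed
+
+
+noisy = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
+print(moving_average(noisy, window=3))  # [2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
+```
+
+A wider window removes more noise but also flattens genuine short-term changes, so the window length is usually tuned to the time scale of the trend you want to keep.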
+
+### Data Cleaning Using Deep Learning Models
+
+Traditional data cleaning methods work well with data that fits commonly known statistical and mathematical models. For complex data sets that do not fit standard models well, such as human speech or EEG signals, deep learning models can perform the data cleaning instead.
+
+In the [example](https://www.mathworks.com/help/deeplearning/ug/denoise-speech-using-deep-learning-networks.html) shown in Figure 8, speech signals are contaminated with noise from a washing machine running in the background. Data cleaning methods such as smoothing or outlier removal cannot effectively remove the washing machine noise because its audio spectrum overlaps with that of the speech signal. Deep learning networks, such as [fully connected](https://www.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.fullyconnectedlayer.html) and [convolutional](https://www.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.convolution2dlayer.html) networks, can denoise the speech signal, removing the noise and leaving the underlying signal.
+
+![Graphs showing a clean speech signal and a version of it contaminated by washing machine noise in the background and graphs comparing the output of fully connected and convolutional networks used to denoise the speech signal plotted in MATLAB.](https://www.mathworks.com/discovery/data-cleaning/_jcr_content/mainParsys/band/mainParsys/lockedsubnav/mainParsys/image_1238277247.adapt.full.medium.jpg/1758740047389.jpg)
+
+Figure 8. MATLAB plots of clean and noisy speech signals and denoised output from two deep learning networks—fully connected and convolutional.
+
+### Data Cleaning with Excel
+
+Microsoft® Excel® is a common tool for cleaning and preparing data. It offers built-in commands, such as Remove Duplicates and Find and Replace, that you can use to standardize data sets. You can also apply conditional formatting to highlight inconsistencies or use [pivot tables](https://www.mathworks.com/help/matlab/ref/pivottable.html) to identify and correct errors. However, for larger data sets, tasks such as handling missing values, merging data sets, or applying custom logic often need to be done manually. Lack of automation can increase the risk of unintended errors and inconsistencies in processing, especially when working with complex data sets.
+
+MATLAB can help with some of the more time-consuming parts of data cleaning in Excel, especially when working with larger data sets. MATLAB scripts and functions make data cleaning transformations transparent, so you can always see what steps are taken and adjust as needed. For example, instead of manually searching for missing values, you can use `fillmissing` to automatically handle gaps in data. By using MATLAB with Excel, you can handle messier data sets more consistently while keeping control over the process.
\ No newline at end of file