Data Preparation and Pre-processing Methods

Data can be collected in various ways, such as through surveys, experiments, or observations, or it can be extracted from existing databases. There are also numerous public datasets available for use in data science and machine learning projects. The choice of data collection method depends on the nature of the problem, the available resources, and the specific requirements of the statistical analysis or machine learning model. Data preparation and preprocessing for machine learning is the process of transforming raw data into a format that can be used by machine learning algorithms.

Sometimes, rather than being generated by real-world events, data is artificially manufactured. Synthetic data, developed algorithmically, can be used as a stand-in for test datasets or operational data, to validate mathematical models, and to train machine learning models.


The various steps of Data preparation

Step 1: Data Collection

Data sampling requires the application of sampling theory. Sampling theory is a branch of statistics concerned with the collection, analysis, and interpretation of data gathered from random samples of a population under study. Applying sampling theory involves not only the proper selection of observations from the population that will constitute the random sample; it also involves the use of probability theory, along with prior knowledge about the population parameters, to analyze the data from the random sample and develop conclusions from the analysis.
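As a small illustration of drawing a random sample programmatically, the Python sketch below uses NumPy to select a simple random sample without replacement and compare its mean with the population mean; the population of 10,000 units and the sample size of 500 are hypothetical.

import numpy as np

rng = np.random.default_rng(seed=42)

population = np.arange(10_000)                            # hypothetical population
sample = rng.choice(population, size=500, replace=False)  # simple random sample

# The sample mean should estimate the population mean reasonably well.
print("Population mean:", population.mean())
print("Sample mean:", sample.mean())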


Figure – 1

Data collection is by far the most essential first step, as it addresses common challenges, including searching for and identifying relevant data in repositories. This is sometimes called data exploration.



Figure – 2

Step 2: Data Coding
Data coding is the transformation of data into numerical variables, in a form understandable by a computer for analysis. It is any process that assigns a numerical value to a response or observation during data collection. Data coding is an analytical process in which data, in both quantitative and qualitative form, is categorized to facilitate data entry into the computer and further analysis.
Finally, the collected data needs to be stored in databases in a standardized manner: data files in the form of a series of tables, table fields, field data values, spreadsheets, audio, image, and video files, etc.
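As a minimal sketch of data coding in Python with pandas, the snippet below assigns numerical codes to survey responses; the responses and the five-point coding scheme are hypothetical examples.

import pandas as pd

df = pd.DataFrame({
    "response": ["Agree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
})

# Hypothetical coding scheme: assign a numerical value to each response category.
coding_scheme = {
    "Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
    "Agree": 4, "Strongly Agree": 5,
}
df["response_code"] = df["response"].map(coding_scheme)
print(df)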

Preprocessing of raw data
Data pre-processing for machine learning refers to the transformations applied to raw data before feeding it to a learning algorithm. Poor data quality is the most challenging problem in any machine learning application: bad data can result in wrong or misleading inferences. Source (raw) data reflects real-world situations, so it is critical to ensure the learning model is not influenced by inferior data quality. According to an IBM analytics report, data scientists generally spend about 80% of their time cleaning up data.

Step 3: Data Editing
Virtually all data collection and data preparation methods are affected by some degree of error. According to surveys, 25% of database records are inaccurate, 60% are unreliable, 15% contain duplicate data, 8% contain missing data, 11% contain invalid value ranges, and 6% contain invalid addresses (such as email and contact addresses).
This can happen due to factors affecting the data collection methods, subjective preferences, the judgments of individuals, and so on. The data must be cleaned and filtered to improve its quality. According to IBM's estimate, the annual cost of bad data quality was $3.1 trillion in 2016 (ref: hbr.org). Data quality is affected for the following reasons:

Missing values
Care should be taken when handling outliers and missing data. Similarly, take care before automatically deleting all records with a missing value, as too many deletions could skew your data set so that it no longer reflects real-world situations.
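As a rough sketch of the two common options, dropping versus imputing, the pandas snippet below counts missing values and handles them both ways; the column names and the median-imputation strategy are illustrative assumptions, not a universal recipe.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 34, 29, np.nan],
    "income": [42000, 55000, np.nan, 61000, 48000],
})

print(df.isna().sum())                                # count missing values per column

df_dropped = df.dropna()                              # option 1: delete incomplete records
df_imputed = df.fillna(df.median(numeric_only=True))  # option 2: fill with column medians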


Figure – 3


Figure – 4


Figure – 5

Noise – unexplained variations or anomalies in data



Figure – 6: Highlighted points indicate a feature dimension influenced by noise

Figure – 7: Noisy 1-D signal


                  Figure – 8: Nvidia AI removes heavy noise from image signal
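One simple way to suppress noise in a 1-D signal, in the spirit of Figure 7, is a moving-average filter. The NumPy sketch below is only illustrative; the sinusoid, the noise level, and the window length are assumptions.

import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)
clean = np.sin(2 * np.pi * 5 * t)             # 5 Hz sinusoid
noisy = clean + rng.normal(0, 0.3, t.shape)   # add Gaussian noise

window = 11                                   # odd-length smoothing window (assumed)
kernel = np.ones(window) / window
smoothed = np.convolve(noisy, kernel, mode="same")  # moving-average denoising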

Extreme values or outliers
Outliers are data points that differ significantly from other observations. They can occur due to measurement errors, experimental errors, input errors, anomalous conditions, etc. Examples of outlier data are shown in the figures.

Figure – 9: Outlier in spreadsheet.

Figure – 10: Outlier in a waveform

The reason an outlier differs from the rest of the points in the data set is crucial in determining whether to omit it. If outliers are not attributable to errors, they may reveal new information or trends that were not predicted, in which case it is better not to omit them. Although outliers can often be ignored, an outlier can be a real and meaningful result that could inform future events.
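A common textbook rule for flagging outliers is the 1.5 × IQR (interquartile range) criterion. The NumPy sketch below applies it to an invented set of measurements; flagged points should then be inspected, not deleted automatically.

import numpy as np

data = np.array([102, 98, 105, 110, 97, 101, 99, 250])   # 250 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print("Outliers:", outliers)    # inspect before deciding to omit or keep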

Incorrect, inconsistent data
Incorrect and inappropriate data arises when values are entered in the wrong field, or from poor data entry such as misspellings, typos, and inconsistent naming and formatting.

 Figure-11 Inconsistent data.

Skewed data – noise can skew the data distribution to one side; outliers can also skew data.



Figure – 12
Duplication – the same record appearing more than once in the data set.
Figure – 13: Data duplication detected by Excel

Any unseen biases, such as gender bias in the collected data.
Figure – 14: Gender Bias in Data

Therefore, once the data is collected, its quality must be assessed. The data may need to be cleaned and the above errors removed.

Step 4: Data Formatting and Profiling (fine-tuning) to make it consistent
The next step in data preparation is to ensure your data is formatted in a way that best fits your machine learning model. Consistent data formatting removes formatting errors so that the entire data set uses the same input formatting protocols.


Figure – 15: Data Formatting with Excel

In the same way, standardizing or normalizing the values in a column (e.g., state names that could be spelled out or abbreviated) will ensure that your data aggregates correctly.
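As a rough sketch of this kind of standardization, the pandas snippet below maps differently spelled state names onto one canonical form; the names and the mapping table are hypothetical.

import pandas as pd

df = pd.DataFrame({"state": ["New York", "NY", "new york", "CA", "California"]})

# Hypothetical lookup table from lowercase variants to canonical names.
canonical = {"ny": "New York", "new york": "New York",
             "ca": "California", "california": "California"}
df["state"] = df["state"].str.lower().map(canonical)
print(df["state"].value_counts())   # values now aggregate correctly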

Figure – 16: Raw Data vs Zero mean Scaled Data

Figure – 17: Raw Data vs Z-score normalized Data
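The two numeric transformations shown in Figures 16 and 17 can be written in a few lines of NumPy; the feature values below are invented for illustration.

import numpy as np

x = np.array([12.0, 15.0, 9.0, 22.0, 30.0])

centered = x - x.mean()                # zero-mean scaling (cf. Figure 16)
z_scores = (x - x.mean()) / x.std()    # z-score normalization (cf. Figure 17)

print(centered.mean())   # approximately 0
print(z_scores.std())    # approximately 1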
  
Step 5: Feature Extraction
Features are distinguishable attributes or properties related to something of interest. Feature extraction involves the art and science of transforming raw data into features that better represent a pattern to the learning algorithms. Selection of appropriate features plays an important role in determining the accuracy and performance of the learning model. Features are intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretation. Feature extraction is also related to dimensionality reduction.
Features can be numbers, distances, masses, times, areas, volumes, prices, average values, variance values, histograms, etc. For a time-series signal they can be moments, zero crossings, spectral values, cepstral values, etc.; for images, edges, colors, textures, intensity values, various transforms, and so on. These derived values or features facilitate better human interpretation and visualization, and support the implementation of learning models, training (recommendation) algorithms, and subsequent generalization.

Features are represented as feature vectors. Determining a subset of the initial features is called feature selection. The selected features are expected to contain the relevant information from the input data, so that the desired task can be performed by using this reduced representation instead of the complete initial data.
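To make this concrete, the sketch below extracts a few of the signal features mentioned above (mean, variance, zero crossings, and a simple spectral value) from a synthetic noisy sinusoid and collects them into a feature vector; the signal and the choice of features are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
signal = np.sin(np.linspace(0, 4 * np.pi, 200)) + rng.normal(0, 0.1, 200)

features = {
    "mean":           signal.mean(),
    "variance":       signal.var(),
    "zero_crossings": int(np.sum(np.diff(np.sign(signal)) != 0)),
    "peak_freq_bin":  int(np.argmax(np.abs(np.fft.rfft(signal)))),  # spectral feature
}
print(features)   # a compact feature vector representing the raw signal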

Step 6: Splitting data into training and evaluation (test) sets
In the final step, the data is partitioned or split into training and evaluation (test) sets. Usually, we split the data set in a 70:30 or 80:20 ratio for training and testing, respectively.
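An 80:20 split can be done in one call with scikit-learn's train_test_split; in the sketch below, X and y are synthetic placeholders for the prepared features and labels.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)        # 100 samples, 4 features (synthetic)
y = np.random.randint(0, 2, 100)  # binary labels (synthetic)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 80% train, 20% test

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)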


Image Credits:
Figure – 1: http://desogo.elenabetchke.com/sample-size-calculator-for-quasi-experimental-design/
Figure – 2: https://koppa.jyu.fi/avoimet/hum/menetelmapolkuja/en/methodmap/data-collection/existing-materials-and-self-produced-materials?fullscreen=1
Figure – 3: http://rstudio-pubsstatic.s3.amazonaws.com/2298_34fdfecb5b7f4669a84d938f1f0480e9.html
Figure – 4: https://stackoverflow.com/questions/43200199/how-to-find-missing-values
Figure – 5: Machine Learning A Probabilistic Perspective by K.P. Murphy
Figure – 6: http://mlwiki.org/index.php/Noise_Handling_(Data_Mining)
Figure – 7: https://www.researchgate.net/figure/Sinusoid-signal-without-noise-with-SNR10dB-and-SNR20dB_fig5_276897708
Figure – 8: https://www.extremetech.com/extreme/273121-nvidia-ai-compensates-for-your-poor-photography-skills-by-erasing-noise-from-images
Figure – 9: https://www.wikihow.com/Calculate-Outliers
Figure – 10: https://in.mathworks.com/help/matlab/ref/filloutliers.html
Figure – 11: Elite Data Science
Figure – 12: wikipedia.org
Figure – 13: https://exceljet.net/formula/highlight-duplicate-values

Figure – 14: Challenges of Gender Bias and Data Collection, Taylor Billings Russell, CARD Research Specialist

Figure – 15: https://saylordotorg.github.io/text_how-to-use-microsoft-excel-v1.1/s05-03-formatting-and-data-analysis.html
Figure – 16: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781785889622/3/ch03lvl1sec24/data-scaling-and-normalization
Figure – 17: https://developers.google.com/machine-learning/data-prep/transform/normalization
