Data Preparation and Pre-processing Methods
Data can be collected in various ways, such as through surveys, experiments or observations, or it can be extracted from existing databases. There are also numerous public datasets available for use in data science and machine learning projects. The choice of data collection method depends on the nature of the problem, the available resources, and the specific requirements of the statistical analysis or machine learning model. Data preparation and preprocessing for machine learning is the process of transforming raw data into a format that can be used by machine learning algorithms.
Sometimes, rather than being generated by real-world events, data is artificially manufactured. Such synthetic data, developed algorithmically, can be used as a stand-in for test datasets or operational data, to validate mathematical models and to train machine learning models.
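As a minimal sketch of the idea, synthetic data can be drawn from a known model plus random noise; the function name and parameters below are illustrative, not from any particular library:

```python
import random

def make_synthetic(n, slope=2.0, intercept=1.0, noise_sd=0.1, seed=42):
    """Generate n (x, y) pairs from y = slope*x + intercept + Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = slope * x + intercept + rng.gauss(0, noise_sd)
        data.append((x, y))
    return data

points = make_synthetic(100)
```

Because the generating model is known exactly, such data is useful for checking that a learning algorithm can recover the underlying relationship.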
The various steps of data preparation
Step 1: Data Collection
Data sampling requires the application of sampling theory, a branch of statistics concerned with the collection, analysis and interpretation of data gathered from random samples of a population under study. The application of sampling theory involves not only the proper selection of observations from the population that will constitute the random sample; it also involves the use of probability theory, along with prior knowledge about the population parameters, to analyze the data from the random sample and draw conclusions from the analysis.
Figure – 1
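A simple random sample can be drawn with the standard library; the population below is synthetic and purely illustrative:

```python
import random
import statistics

random.seed(0)
# A synthetic "population" with a known distribution (mean ~50, sd ~10).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Simple random sampling without replacement.
sample = random.sample(population, 200)

# The sample mean estimates the population mean.
estimate = statistics.mean(sample)
```

With a sample of 200 the estimate typically lands within about one standard error (roughly 0.7 here) of the true population mean, which is what sampling theory predicts.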
Data collection is by far the most essential first step, as it addresses common challenges, including searching for and identifying relevant data in the repositories. This is sometimes called data exploration.
Figure – 2
Step 2: Data coding
Data coding means the transformation of data into numeric variables, in a form the computer can analyze. It is any process that assigns a numerical value to a response or observation during data collection. Data coding is an analytical process in which data, in either quantitative or qualitative form, is categorized to facilitate data entry into the computer and further analysis. Finally, the collected data needs to be stored in a standardized manner: in databases as a series of tables, table fields and field data values, or as spreadsheets, audio, image and video files.
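A minimal sketch of data coding, assigning each distinct survey response a numeric code (the function and codebook names are illustrative):

```python
def code_responses(responses):
    """Assign a numeric code to each distinct response, in order of first appearance."""
    codebook = {}
    coded = []
    for r in responses:
        if r not in codebook:
            codebook[r] = len(codebook)   # next unused code
        coded.append(codebook[r])
    return coded, codebook

answers = ["yes", "no", "maybe", "yes", "no"]
coded, codebook = code_responses(answers)
# coded is [0, 1, 2, 0, 1]; codebook maps each answer to its code
```

Keeping the codebook alongside the coded values is important: it is the only way to translate model inputs and outputs back into the original categories.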
Preprocessing of raw data
Data pre-processing for machine learning refers to the transformations applied to raw data before feeding it to a learning algorithm. Poor data quality is the most challenging problem in any machine learning application: bad data can result in wrong or misleading inferences. Source (raw) data will reflect real-world situations, so it is critical to be sure the learning model is not influenced by inferior data quality. According to an IBM analytics report, data scientists generally spend about 80% of their time cleaning up data.
Step 3: Data Editing
Virtually all data collection and data preparation methods are affected by some degree of error. According to surveys, 25% of database records are inaccurate, 60% are unreliable, 15% contain duplicate data, 8% contain missing data, 11% have invalid value ranges and 6% contain invalid addresses (such as email or contact addresses).
This can happen due to factors affecting the data collection methods, subjective preferences, judgments of individuals and so on. The data must be cleaned and filtered to improve its quality. According to IBM's estimate, the annual cost of bad data quality was $3.1 trillion in 2016 (ref: hbr.org). Data quality is affected by the following:
Missing values
Care should be taken when handling outliers and missing data. Similarly, take care before automatically deleting all records with a missing value, as too many deletions could skew your data set so that it no longer reflects real-world situations.
Figure – 3
Figure – 4
Figure – 5
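The two common strategies just mentioned, dropping incomplete records versus filling gaps, can be sketched in plain Python (in practice pandas `dropna`/`fillna` do this; the records and helper below are illustrative):

```python
import statistics

records = [
    {"age": 34,   "income": 52_000},
    {"age": None, "income": 48_000},   # missing age
    {"age": 29,   "income": None},     # missing income
    {"age": 41,   "income": 61_000},
]

def impute_mean(records, field):
    """Replace missing (None) values of `field` with the mean of the observed ones."""
    observed = [r[field] for r in records if r[field] is not None]
    mean = statistics.mean(observed)
    return [dict(r, **{field: r[field] if r[field] is not None else mean})
            for r in records]

# Strategy 1: impute missing fields with the column mean.
filled = impute_mean(impute_mean(records, "age"), "income")

# Strategy 2: delete incomplete rows (risks skewing the data set).
dropped = [r for r in records if None not in r.values()]
```

Note how deletion here discards half the records, which illustrates why imputation is often preferred on small or heavily affected data sets.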
Noise - unexplained variations or anomalies in data
Figure – 6: Highlighted points indicate a feature dimension influenced by noise
Figure – 7: Noisy 1D signal
Figure – 8: Nvidia AI removes heavy noise from image signal
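For 1-D signals like the one in Figure – 7, a basic way to suppress unexplained variation is a moving-average filter; this is a minimal illustrative sketch, not the technique NVIDIA's image denoiser uses:

```python
import math
import random

random.seed(1)
# A clean sinusoid, then the same signal corrupted by Gaussian noise.
clean = [math.sin(2 * math.pi * t / 50) for t in range(200)]
noisy = [s + random.gauss(0, 0.3) for s in clean]

def moving_average(signal, window=5):
    """Smooth a 1-D signal by averaging each point with its neighbours."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

smoothed = moving_average(noisy)
```

Averaging over a window of 5 samples cuts the noise variance roughly fivefold, at the cost of slightly blurring fast changes in the underlying signal.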
Extremal values or outliers
Outliers are data points that significantly differ from other observations. They can occur due to measurement errors, experimental errors, input errors, anomalous conditions, etc. Examples of outlier data are shown in the figures.
Figure – 9: Outlier in a spreadsheet
Figure 10: Outlier in a waveform
The reason an outlier differs from the rest of the points in the data set is crucial in determining whether to omit it or not. If outliers are not attributable to errors, they may reveal new information or trends that were not predicted, in which case it is better not to omit them. Though outliers can often be ignored, one could be a real and meaningful result that informs future events.
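One common way to flag candidate outliers before deciding whether to keep them is Tukey's interquartile-range rule; the readings below are made up for illustration:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

readings = [21.1, 20.8, 21.4, 20.9, 21.2, 98.6, 21.0]  # 98.6 looks like an input error
outliers = iqr_outliers(readings)
```

The rule only flags points; as the text notes, deciding whether a flagged point is an error or a genuine discovery still requires domain judgment.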
Incorrect, inconsistent data –
Incorrect and inappropriate data arises when data is entered in the wrong field, or from poor data entry such as misspellings, typos, and naming or formatting errors.
Figure-11 Inconsistent data.
Skewed data –
Noise can skew data to one side; outliers can also skew data.
Duplication
Figure – 13: Data duplication detected by Excel
Figure – 14: Gender Bias in Data
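Duplicate records like those highlighted in Figure – 13 can be removed in one pass; this is a plain-Python sketch of what spreadsheet dedup tools or pandas `drop_duplicates` do, with illustrative rows:

```python
rows = [
    ("alice", "2021-03-01"),
    ("bob",   "2021-03-02"),
    ("alice", "2021-03-01"),   # exact duplicate of the first row
    ("carol", "2021-03-04"),
]

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

deduped = drop_duplicates(rows)
```

This only catches exact duplicates; near-duplicates caused by typos or formatting differences need the cleaning steps above first.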
Therefore, once the data is collected, its quality must be assessed. The data may need to be cleaned and cleared of the above errors.
Step 4: Data Formatting and Profiling (fine-tuning) to make it consistent
The next step in data preparation is to ensure your data is formatted in a way that best fits your machine learning model. Consistent data formatting removes such errors, so that the entire data set uses the same input formatting protocols.
Figure – 15: Data Formatting with Excel
In the same way, standardizing or normalizing the values in a column (e.g. state names that could be spelled out or abbreviated) will ensure that your data aggregates correctly.
Figure – 16: Raw Data vs Zero mean Scaled Data
Figure – 17: Raw Data vs Z-score normalized Data
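The two scalings shown in Figures – 16 and 17 can be written directly from their definitions; in practice scikit-learn's `StandardScaler` and `MinMaxScaler` are used, and the helpers below are illustrative:

```python
import statistics

def zscore(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def minmax(values):
    """Min-max scaling: rescale values linearly into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [10.0, 20.0, 30.0, 40.0, 50.0]
z = zscore(raw)   # zero mean, unit variance
m = minmax(raw)   # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Z-score scaling is preferable when the column has outliers or no natural bounds; min-max scaling is handy when the model expects inputs in a fixed range.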
Step 5: Feature Extraction.
Features are distinguishable attributes or properties related to something of interest. Feature extraction involves the art and science of transforming raw data into features that better represent a pattern to the learning algorithms. Selection of appropriate features plays an important role in determining the accuracy and performance of the learning model. Features are intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretation. Feature extraction is also related to dimensionality reduction.
Features can be numbers, distances, masses, times, areas, volumes, prices, average values, variance values, histograms, etc. For a time-series signal they can be moments, zero crossings, spectral values, cepstral values, etc.; for images, edges, colors, textures, intensity values, various transforms and so on. These derived values or features facilitate better human interpretation, visualization and implementation of learning models, training (recommendation) algorithms and subsequent generalization.
Features are represented as feature vectors. Determining a subset of the initial features is called feature selection. The selected features are expected to contain the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the complete initial data.
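Three of the time-series features named above (mean, variance, zero crossings) can be computed directly as a small feature vector; this is an illustrative sketch on a synthetic signal:

```python
import math

def extract_features(signal):
    """Feature vector for a 1-D signal: [mean, variance, zero-crossing count]."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((s - mean) ** 2 for s in signal) / n
    # Count sign changes between consecutive samples.
    zero_crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))
    return [mean, variance, zero_crossings]

# 5 full sine cycles, with a small phase offset so no sample lands exactly on zero.
wave = [math.sin(2 * math.pi * t / 20 + 0.1) for t in range(100)]
features = extract_features(wave)
```

A 100-sample signal is reduced to a 3-value representation; this is the dimensionality reduction aspect of feature extraction mentioned above.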
Step 6: Splitting data into training and evaluation (test) set
At the final step, the data is partitioned or split into training and evaluation (test) sets. Usually, we split the data set in a 70:30 or 80:20 ratio for training and testing respectively.
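In practice this is usually done with scikit-learn's `train_test_split`; a minimal plain-Python sketch of the same 80:20 shuffle-and-split is:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=0):
    """Shuffle the data, then hold out the last test_ratio fraction for evaluation."""
    rng = random.Random(seed)
    shuffled = data[:]                      # copy so the original order is untouched
    rng.shuffle(shuffled)                   # shuffle to avoid ordering bias
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

dataset = list(range(1000))
train, test = train_test_split(dataset, test_ratio=0.2)  # 800 train, 200 test
```

Shuffling before splitting matters: if the data is sorted (by time, class, etc.), an unshuffled split would give training and test sets with different distributions.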
Image Credits:
Figure – 1: http://desogo.elenabetchke.com/sample-size-calculator-for-quasi-experimental-design/
Figure – 2: https://koppa.jyu.fi/avoimet/hum/menetelmapolkuja/en/methodmap/data-collection/existing-materials-and-self-produced-materials?fullscreen=1
Figure – 3: http://rstudio-pubsstatic.s3.amazonaws.com/2298_34fdfecb5b7f4669a84d938f1f0480e9.html
Figure – 4: https://stackoverflow.com/questions/43200199/how-to-find-missing-values
Figure – 5: Machine Learning: A Probabilistic Perspective by K.P. Murphy
Figure – 6: http://mlwiki.org/index.php/Noise_Handling_(Data_Mining)
Figure – 7: https://www.researchgate.net/figure/Sinusoid-signal-without-noise-with-SNR10dB-and-SNR20dB_fig5_276897708
Figure – 8: https://www.extremetech.com/extreme/273121-nvidia-ai-compensates-for-your-poor-photography-skills-by-erasing-noise-from-images
Figure – 9: https://www.wikihow.com/Calculate-Outliers
Figure – 10: https://in.mathworks.com/help/matlab/ref/filloutliers.html
Figure – 11: Elite Data Science
Figure – 12: wikipedia.org
Figure – 13: https://exceljet.net/formula/highlight-duplicate-values
Figure – 14: Challenges of Gender Bias and Data Collection, Taylor Billings Russell, CARD Research Specialist
Figure – 15: https://saylordotorg.github.io/text_how-to-use-microsoft-excel-v1.1/s05-03-formatting-and-data-analysis.html
Figure – 16: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781785889622/3/ch03lvl1sec24/data-scaling-and-normalization
Figure – 17: https://developers.google.com/machine-learning/data-prep/transform/normalization