Data for Artificial Intelligence and Machine Learning - Part 1




Data for Machine Learning

Raw Data
  • Raw data is a term used to describe data that has been collected and stored, but has not yet been processed.
  • Data is often defined as facts or numbers, but it can also be non-numeric and non-factual.
  • It can be quantitative or qualitative.
  • Raw data can be collected from sensors, user interactions with human-machine interfaces (e.g., mobile devices, touch screens, keyboards, websites), journal entries, event recordings, reference information, medical diagnoses, recordings of human observations, experiments, communication, market behaviour, business transactions, social media and so on.
  • Raw data is sometimes called source data or atomic data.

Raw Data – Basic classifications

Raw data falls into two fundamental classes: primary and secondary. Both primary and secondary data may be quantitative or qualitative.

Primary data is original data collected by the researcher. Its collection is often undertaken after the researcher has gained some insight into the problem by reviewing existing research work or by analysing previously collected primary data.


Secondary Data
Secondary data is data that was previously collected (as primary data) by someone other than the current user and is readily available from other sources. It is often used in social and economic analysis, especially when primary data cannot be accessed.

Data Stores, Warehouse and Marts
Only a small fraction of the data that is captured, processed and stored is actually used by decision makers. Data stores, warehouses and marts provide facilities to integrate the data generated by an un-integrated environment. A data warehouse organizes and stores all available data for further analytical processing and information extraction, and it preserves the historical perspective of that data.

Metadata
Metadata is data that describes the actual data. It captures characteristics such as what the data is, where it is stored, where it originated and how it can be accessed. To provide easy access, it is necessary to maintain a form of data directory that holds information about the data. Metadata is an abstraction of the data, or high-level data, that provides a concise description. It forms an important component of a data warehouse (DW) environment.
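As a rough illustration, a data directory can be sketched as a lookup table from dataset names to metadata records. The class name `MetadataRecord`, its four fields, the `describe` helper, the dataset name and the file path are all hypothetical and chosen only to mirror the characteristics listed above; this is a minimal sketch, not a standard DW schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """Describes a dataset without holding the data itself (hypothetical layout)."""
    identity: str   # what the data is
    location: str   # where it is stored
    origin: str     # where it came from
    access: str     # how it can be reached


# Hypothetical data directory: dataset name -> metadata record.
data_directory = {
    "vehicle_sales_2019_h1": MetadataRecord(
        identity="Vehicle sales change by manufacturer, first half of 2019",
        location="warehouse/sales/vehicle_sales_2019_h1.csv",  # made-up path
        origin="Auto magazine report",
        access="read-only file",
    ),
}


def describe(name: str) -> str:
    """Look up a dataset in the directory and summarise its metadata."""
    record = data_directory[name]
    return f"{name}: {record.identity} (from {record.origin}, at {record.location})"


print(describe("vehicle_sales_2019_h1"))
```

A user can find out what a dataset is and where it lives by consulting the directory alone, without opening the data.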

Consider the following three different groups of data.
1.      23489, 33765, 27668.
2.      Auto magazine reports that vehicle sales have dipped by 25% in India during the first half of 2019.
3.      Sales figures of leading manufacturers have dipped: Maruti 20%, Hyundai 30%, Mahindra 25%.

The first group of data conveys no information, because nothing says what the numbers measure. The second group is descriptive, straightforward text. The third is concise and contains metadata: each figure is labelled with the manufacturer it describes.
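The contrast between the first and third groups can be sketched in a few lines of Python. The variable names, the dictionary layout and the `largest_dip` helper are assumptions made for illustration; only the numbers and manufacturer names come from the groups above.

```python
# Group 1: bare values with nothing to say what they measure.
raw_values = [23489, 33765, 27668]

# Group 3: the same kind of values, but paired with metadata
# (what is measured, the unit, and which manufacturer each figure belongs to).
sales_dip = {
    "measure": "dip in sales",
    "unit": "percent",
    "values": {"Maruti": 20, "Hyundai": 30, "Mahindra": 25},
}


def largest_dip(record):
    """Return the manufacturer with the largest dip and the size of that dip."""
    name = max(record["values"], key=record["values"].get)
    return name, record["values"][name]


name, dip = largest_dip(sales_dip)
print(f"{name}: {dip} {sales_dip['unit']}")
```

A question such as "which manufacturer saw the largest dip?" is answerable for the labelled record, whereas no comparable question can even be posed about `raw_values`.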

Some sample examples of metadata, beginning with the case of book publishers, are as follows:
·         Author, title, ISBN number, publisher name
·         Headlines of stories
·         Definitions of data elements in a DW.
·         Road maps
·         etc




