Data for Artificial Intelligence and Machine Learning - Part 2
Big Data
What is Big Data?
Big data is a term for massive collections of raw data that cannot be used for computation unless they are handled and processed in appropriate ways. The data sets are so large and complex that it is difficult to store and process them using available database management tools or traditional data processing applications. Big data size is a constantly moving target: in 2012 it ranged from terabytes to petabytes, and currently it is on the order of exabytes.
A Few Examples
· Walmart handles more than 1 million customer transactions every hour.
· Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
· 230+ million tweets are created every day.
· More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
· YouTube users upload 48 hours of new video every minute of the day.
· Amazon handles 15 million customer clickstream records per day to recommend products.
· 294 billion emails are sent every day; email services analyze this data to filter out spam.
· Modern cars have close to 100 sensors that monitor fuel level, tire pressure, etc., so each vehicle generates a large amount of sensor data.
· In 2020, every human on the planet produced 1.7 megabytes of information each second. By 2025 it is expected that global data output will be around 175 zettabytes.
· In only a year, the accumulated world data will grow to 44 zettabytes (that is, 44 billion terabytes).
The big data challenge includes capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
Five Vs of Big Data
Big data is characterized by 5 factors – Volume,
Velocity, Variety, Value and Veracity.
1. Volume – Big data first and foremost has to be "big," and size in this case is measured as volume. On Facebook alone there are 10 billion messages sent per day.
2. Velocity – Velocity refers to the rapidly increasing speed at which new data is being created by technological advances, and the corresponding need for that data to be digested and analyzed in near real-time. More than 3.5 billion searches per day are made on Google, and Facebook users are increasing by approximately 22% year over year.
3. Variety – This third "V" describes the huge diversity of data types, such as structured, semi-structured, quasi-structured and unstructured, that organizations see every day. Different types of data, including messages, social media conversations, photos, sensor data, and video or voice recordings, can be brought together with more traditional, structured data.
4. Veracity – Refers to the uncertainty or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable; consider, for example, Twitter posts with hashtags, abbreviations, typos and colloquial speech.
5. Value – Last but not least, big data must have value. That is, if you're going to invest in the infrastructure required to collect and interpret data on a system-wide scale, it's important to ensure that the insights that are generated are based on accurate data and lead to measurable improvements at the end of the day.
Dark Data
Data that goes unused is known as dark data. Maintaining and archiving unused data costs money, energy, space and labour. Dark data can be generated by any of the following: various sensors, such as CCTV cameras that run 24/7; diagnostic data produced by machines in various organizations; redundant data; legacy systems that accumulate large amounts of data that go unused; expired knowledge that no longer adds any value; knowledge waste, such as the large number of documents generated by knowledge workers that are never used; and personally identifiable data, such as camera footage from a swimming pool that is deleted as soon as no security issue is identified.
Data Acquisition of Big Data - Sources of Big Data
Scientific data, RFID devices, navigation systems, IoT, smart devices, smart power grids, medical data and imaging, geophysical exploration, gene sequencing, mobile devices, social media, governmental documentation, corporate data, video surveillance, cyber security, video and photo footage, and various other digital content.
Data Types
Structured
The term structured data generally refers to data or databases that have:
1. A defined length, and
2. A defined format.
Examples of structured data include numbers, dates, and groups of words and numbers called strings. Most experts agree that structured data accounts for about 20 percent of the data out there.
Name  | Year | Hostel No | Room | GPA
Peter | 2020 | Hostel-1  | 104  | 3.50
Jamie | 2019 | Hostel-2  | 220  | 4.14
Jeffy | 2019 | Hostel-2  | 235  | 3.20
Arun  | 2020 | Hostel-1  | 110  | 4.18
Figure - 2
Structured data can be categorized as quantitative data. As seen in the table above, it is highly organized and easily decipherable. Common examples of structured data are Excel files or SQL databases. IBM developed Structured Query Language (SQL) to manage structured data. Using relational SQL databases, users can easily input, search and manipulate structured data. This type of data is stored in data warehouses with rigid schemas.
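As a quick illustration, the table in Figure 2 could be loaded into a relational database and queried with SQL. The sketch below is a minimal example using Python's standard-library sqlite3 module; the table name, column names, and query are made up for this illustration, not taken from any real system.

import sqlite3

# An in-memory database with a rigid, predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE students (
    name TEXT, year INTEGER, hostel_no TEXT, room INTEGER, gpa REAL)""")

# The rows from Figure 2.
rows = [
    ("Peter", 2020, "Hostel-1", 104, 3.50),
    ("Jamie", 2019, "Hostel-2", 220, 4.14),
    ("Jeffy", 2019, "Hostel-2", 235, 3.20),
    ("Arun",  2020, "Hostel-1", 110, 4.18),
]
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)

# Structured data is easy to search and manipulate with SQL:
for name, gpa in conn.execute("SELECT name, gpa FROM students WHERE gpa > 4.0"):
    print(name, gpa)   # Jamie 4.14, Arun 4.18

Because the schema is declared before any data is inserted, every row is guaranteed to have the same defined length and format, which is exactly what makes the data "structured".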
Semi-structured
Semi-structured data is data that is neither raw data nor typed data in a conventional database system. It has structure, but it is not organized in a relational model, like a table or an object-based graph. However, it contains tags and other markers to separate semantic elements and fields in the data. CSV, XML and JSON documents are semi-structured documents, and NoSQL databases are considered semi-structured. Semi-structured data represents 5 to 10% of the data available.
Features of the semi-structured format are as follows:
1. Lacks a fixed/rigid schema,
2. No separation between data and schema,
3. Self-describing structure (tags or other markers), and
4. Does not conform to the structure associated with typical relational databases.
It can represent information in data formats that cannot be constrained by a schema.
Examples – CSV, XML, HTML, JSON.
The amount of structure used depends on the purpose.
CSV file
CSV stands for "comma-separated values". A CSV file is a delimited text file that uses a comma to separate values and stores tabular data, such as a spreadsheet or database table, in plain text. Each line of the file is a data record, and each record consists of one or more fields separated by commas. CSV files are saved with the extension .csv. They allow large amounts of detailed data to be transferred 'machine-to-machine', with little or no reformatting by the user, and are mostly used for importing and exporting important information, such as customer or order data, to and from a database. Files in the CSV format can be imported to and exported from programs that store data in tables, such as Microsoft Excel or OpenOffice Calc.
A simple example of a .csv file:
Peter,2020,Hostel-1,104,3.50
Jamie,2019,Hostel-2,220,4.14
Jeffy,2019,Hostel-2,235,3.20
Arun,2020,Hostel-1,110,4.18
A CSV file is a text file, so it can be created and edited using any text editor. More frequently, however, a CSV file is created by exporting (File menu -> Export) a spreadsheet or database from the program that created it.
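Because the format is so simple, a CSV file can also be read with a few lines of code. The sketch below uses Python's standard-library csv module and assumes the four sample records above have been saved as students.csv (a hypothetical filename); the field names are supplied by hand because the file has no header row.

import csv

fieldnames = ["name", "year", "hostel", "room", "gpa"]
with open("students.csv", newline="") as f:
    reader = csv.DictReader(f, fieldnames=fieldnames)
    for record in reader:
        # Every field arrives as a string; convert where numbers are needed.
        print(record["name"], float(record["gpa"]))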
HTML
Short for HyperText Markup Language, HTML is the authoring language used to create documents on the World Wide Web. HTML is similar to Standard Generalized Markup Language (SGML), although it is not a strict subset. HTML defines the structure and layout of a Web document by using a variety of tags and attributes. The basic structure of an HTML document starts with <HTML><HEAD> (enter here what the document is about) </HEAD><BODY> and ends with </BODY></HTML>. All the information you'd like to include in your Web page fits in between the <BODY> and </BODY> tags.
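To see this tag structure programmatically, the short sketch below feeds a minimal, made-up page to Python's standard-library html.parser module and lists the tags and text it finds.

from html.parser import HTMLParser

class TagLister(HTMLParser):
    # Report every opening tag and every run of text.
    def handle_starttag(self, tag, attrs):
        print("tag: ", tag)
    def handle_data(self, data):
        print("text:", data)

page = "<HTML><HEAD><TITLE>Demo</TITLE></HEAD><BODY>Hello, Web!</BODY></HTML>"
TagLister().feed(page)
# Prints the nested tags (html, head, title, body) and the text inside them.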
XML
Extensible Markup Language (XML) is used to describe data. The XML standard is a flexible, portable, open way to create information formats and to electronically store and share small to medium amounts of structured data via the public Internet, as well as via corporate networks. It is both human- and machine-readable. XML stores data in plain text format, which provides a software- and hardware-independent way of storing, transporting, and sharing data.
An XML file will open in your text editor. The complexity of the file depends on what it was created for. Use the tag labels to find the information you are looking for. Generally the labels will be fairly self-explanatory, allowing you to browse through the data and find the information you need.
You'll
likely see <?xml version="1.0"
encoding="UTF-8"?> at the top. This indicates that the
following content is in XML format.
XML uses custom tags to house pieces of data. Each of these tags is created for whatever program is using it, so there is no common syntax to the markup labels. For example, one XML file may have a <body></body> section, and another might have <message_body></message_body>, but both may function similarly. Tags can be nested inside other tags, creating a tree. For example, each <note></note> tag may have several tags inside, such as <title></title> and <date></date>.
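A nested document like the <note> example can be read with Python's standard-library xml.etree.ElementTree module. The note below is invented purely for illustration.

import xml.etree.ElementTree as ET

doc = """
<notes>
  <note>
    <title>Reminder</title>
    <date>2021-03-15</date>
  </note>
</notes>
"""

root = ET.fromstring(doc)          # parse the XML text into a tree
for note in root.findall("note"):  # walk the nested <note> tags
    print(note.find("title").text, note.find("date").text)
# Prints: Reminder 2021-03-15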
Some differences between HTML and XML structures
· HTML is designed to display data, emphasizing how the data looks; XML tags are not predefined like HTML tags.
· HTML is a markup language, whereas XML provides a framework for defining markup languages.
· HTML is about displaying data, hence it is static, whereas XML is about carrying information, which makes it dynamic.
· XML does not carry any information about how it should be displayed.
· In many HTML applications, XML is used to store or transport data, while HTML is used to format and display the same data.
· XML files are not displayed as HTML pages.
JSON
JSON, or JavaScript Object Notation, is a way to encode data structures that ensures that they are easily readable by machines. It is a minimal, readable format for structuring data. It is used primarily to transmit data between a server and a web application, as an alternative to XML. JSON is the primary format in which data is passed back and forth to APIs (Application Programming Interfaces), and most API servers will send their responses in JSON format. Python has great JSON support with the json package. The json package is part of the standard library, so we don't have to install anything to use it. We can both convert lists and dictionaries to JSON, and convert JSON strings back to lists and dictionaries. For a while, XML (Extensible Markup Language) was the only choice for open data interchange, but over the years there has been a lot of transformation in the world of open data sharing, and the more lightweight JSON has become a popular alternative to XML for various reasons.
When exchanging data between a browser and a server, the data can only be text. JSON is text, and we can convert any JavaScript object into JSON and send the JSON to the server. We can also convert any JSON received from the server into JavaScript objects. This way we can work with the data as JavaScript objects, with no complicated parsing and translation. JSON has a fixed syntax, rules and even data types. JSON parsers are available which give you the ability to search, sort and edit JSON data; a parser creates a representation in main memory, making these operations fast, though performance depends on the data size.
Key/Value pairs
A JSON object contains data in the form of key/value pairs. The keys are strings and the values are JSON types. Keys and values are separated by a colon, and each entry (key/value pair) is separated by a comma.
JSON supports two data structures that are widely used amongst programming languages, both of which appear in the sketch below:
· A collection of name/value pairs. Different programming languages support this data structure under different names, such as object, record, structure, dictionary, hash table, keyed list, or associative array.
· An ordered list of values. In various programming languages, it is called an array, vector, list, or sequence.
Since the data structures supported by JSON are also supported by most modern programming languages, JSON is a very useful data-interchange format.
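Here is a minimal sketch of both structures using Python's standard-library json package; the student record is a made-up example modeled on Figure 2.

import json

# A collection of name/value pairs (a dict) containing an ordered list.
student = {
    "name": "Peter",
    "year": 2020,
    "courses": ["AI", "ML", "Big Data"],   # an ordered list of values
}

encoded = json.dumps(student)    # dictionary -> JSON string
print(encoded)                   # {"name": "Peter", "year": 2020, ...}

decoded = json.loads(encoded)    # JSON string -> dictionary
print(decoded["courses"][0])     # AI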
Quasi-structured
Clickstream data is the recording of the areas of the screen that a user clicks while web browsing. As the user clicks anywhere in the web page, the action is logged. The log contains information such as the time, the URL, the user's machine, the type of browser, the type of event (for example, browsing, checking out, logging in, removing from cart, logging out with or without purchase), product information (for example, ID, category, and price), total purchase in the basket, number of items in the basket, and session duration. This information can give valuable clues about what visitors are doing on your web site, and about the visitors themselves.
Clickstream
analysis is useful for web activity analysis and market research. The
navigation path can indicate purchase interests and price range. You can
identify browsing patterns to determine the probability that the user will
place an order.
On a Web site, clickstream analysis (also called clickstream analytics) is the process of collecting, analyzing and reporting aggregate data about which pages a website visitor visits, and in what order. The path the visitor takes through a website is called the clickstream.
Figure - 4
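Clickstream logs are quasi-structured: each entry follows a loose, repeating pattern rather than a rigid schema, so they can still be parsed with a little code. The sketch below counts page views from a few invented log lines; the log format and field order are assumptions for illustration only.

from collections import Counter

# Invented log lines: timestamp, user, event type, URL.
log = [
    "2021-06-01T10:00:01 user42 view /home",
    "2021-06-01T10:00:09 user42 view /products",
    "2021-06-01T10:00:30 user42 add_to_cart /products",
    "2021-06-01T10:01:02 user17 view /home",
]

page_views = Counter()
for line in log:
    timestamp, user, event, url = line.split()
    if event == "view":
        page_views[url] += 1

print(page_views.most_common())  # [('/home', 2), ('/products', 1)]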
Unstructured
Approximately 2.5 quintillion bytes of data are created each day, and according to Gartner, an estimated 80% of all data is unstructured. Unstructured data can be categorized as qualitative data. Unlike structured databases, unstructured databases do not have a predefined data model. Unstructured data is managed by non-relational (NoSQL) databases, and data lakes are used to preserve unstructured data in its raw form.
Unstructured data is data that is not formally organized. Such data often includes books, journals, documents, metadata, health records, audio, video, analogue data, images, multimedia, maps, graphs, and unstructured text such as the body of an e-mail message, a Web page, a word-processor document, business documents, and presentations. Unstructured data owes its popularity to technologies such as NoSQL and Hadoop, and to formats such as JSON and XML.
Speech data (Figure - 6)
Image and Video data (Figure - 7)
Text data (Figure - 8)
Big Data: Conceptual Analysis and Applications, by Michael Z. Zgurovsky and Yuriy P. Zaychenko.
Figure-4: Modeling Online Browsing with Clickstream Data Using Path Analysis, published on Slideshare.com.
Figure-5: ORI News.
Figure-6: Mining a Year of Speech, Linguistic Data Consortium.
Figure-7: Duke Department of Medicine, Duke University.