Data for Artificial Intelligence and Machine Learning - Part 2


Big Data

What is Big Data?
Big data is a term for massive collections of raw data that cannot be used for computation unless captured and processed in appropriate ways. The data sets are so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. Big data size is a constantly moving target: in 2012 it ranged from terabytes to petabytes; currently it is in the order of exabytes.

A Few Examples
·         Walmart handles more than 1 million customer transactions every hour.

·         Facebook stores, accesses, and analyzes 30 + petabytes of user generated data.

·         230+ million tweets are created every day.
·         More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
·         YouTube users upload 48 hours of new video every minute of the day.
·         Amazon handles 15 million customer clickstream records per day to recommend products.
·         294 billion emails are sent every day. Email services analyze this data to filter out spam.
·         Modern cars have close to 100 sensors that monitor fuel level, tire pressure, etc.; each vehicle generates a large amount of sensor data.
·         In 2020, every human on the planet produced about 1.7 megabytes of information each second. By 2025, global data output is expected to be around 175 zettabytes.
·         In only a year, the accumulated world data will grow to 44 zettabytes (that's 44 billion terabytes).

The Big data challenge includes capturing, curating, storing, searching, sharing, transferring, analyzing and visualization of this data. 

Five Vs of Big Data
Big data is characterized by 5 factors – Volume, Velocity, Variety, Value and Veracity.

1.      Volume – Big data first and foremost has to be “big,” and size in this case is measured as volume. On Facebook alone, 10 billion messages are sent per day.

2.      Velocity – Velocity refers to the rapidly increasing speed at which new data is created by technological advances, and the corresponding need for that data to be digested and analyzed in near real time. More than 3.5 billion searches are made on Google every day, and Facebook users are increasing by roughly 22% year over year.

3.      Variety – This third “V” describes the huge diversity of data types that organizations see every day: structured, semi-structured, quasi-structured and unstructured. We can take different types of data, including messages, social media conversations, photos, sensor data, and video or voice recordings, and bring them together with more traditional, structured data.

4.      Veracity – Refers to the uncertainty or trustworthiness of the data. With many forms of big data, quality and accuracy are less controllable; for example, Twitter posts with hashtags, abbreviations, typos and colloquial speech.

5.      Value – Last but not least, big data must have value. That is, if you’re going to invest in the infrastructure required to collect and interpret data on a system-wide scale, it’s important to ensure that the insights that are generated are based on accurate data and lead to measurable improvements at the end of the day.


 Figure - 1


Dark Data
Data that goes unused is known as dark data. Maintaining and archiving unused data costs money, energy, space and labour. Dark data can be generated by any of the following: sensors such as CCTV cameras that run 24/7; diagnostic data produced by machines in various organizations; redundant data; legacy systems that accumulate large amounts of data that go unused; expired knowledge that no longer adds any value; knowledge waste, such as the large number of documents generated by knowledge workers that are never used; and personally identifiable data, such as camera footage from a swimming pool that is deleted as soon as no security issue is identified.
 
Data Acquisition of Big Data - Sources of Big Data
Scientific Data, RFID Devices, Navigation systems, IOT, Smart Devices, Smart Power Grids, Medical Data & Imaging, Geophysical Exploration, Gene Sequencing, Mobile Devices, Social Media, Governmental Documentation,  Corporate Data, Video Surveillance, Cyber security, Video and Photo footages, Various Digital Content.

Data Types

Structured
The term structured data generally refers to data/databases that have
1.      A defined length and
2.      A defined format.
Examples of structured data include numbers, dates, and groups of words and numbers called strings. Most experts agree that structured data accounts for about 20 percent of the data out there.

Name     Year    Hostel No    Room    GPA
Peter    2020    Hostel-1     104     3.50
Jamie    2019    Hostel-2     220     4.14
Jeffy    2019    Hostel-2     235     3.20
Arun     2020    Hostel-1     110     4.18
Figure - 2

Structured data can be categorized as quantitative data. As observed in the above table, it is highly organized and easily decipherable. Common examples of structured data are Excel files and SQL databases. IBM developed Structured Query Language (SQL) to manage structured data. Using relational SQL databases, users can easily input, search and manipulate structured data. This type of data is stored in data warehouses with rigid schemas.
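As a small sketch of how a rigid schema and SQL queries work together, the example below loads the Figure-2 table into an in-memory SQLite database using Python's built-in sqlite3 module (the table and column names are illustrative, not from any real system):

```python
import sqlite3

# In-memory database; the schema fixes each field's type up front
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (name TEXT, year INTEGER, hostel TEXT, room INTEGER, gpa REAL)"
)
rows = [
    ("Peter", 2020, "Hostel-1", 104, 3.50),
    ("Jamie", 2019, "Hostel-2", 220, 4.14),
    ("Jeffy", 2019, "Hostel-2", 235, 3.20),
    ("Arun", 2020, "Hostel-1", 110, 4.18),
]
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)

# Because the data is structured, searching and sorting are one query away
top = conn.execute(
    "SELECT name, gpa FROM students WHERE gpa > 4.0 ORDER BY gpa DESC"
).fetchall()
print(top)  # [('Arun', 4.18), ('Jamie', 4.14)]
```

The same query would be far harder to express over unstructured text, which is exactly the advantage rigid schemas buy.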

Semi-structured

Semi-structured data is data that is neither raw data nor typed data in a conventional database system. It has some structure, but it is not organized in a relational model, like a table or an object-based graph. However, it contains tags and other markers to separate semantic elements and fields in the data.

CSV, XML and JSON documents are semi-structured documents, and NoSQL databases are considered semi-structured. Semi-structured data represents 5 to 10% of available data.

Features of the semi-structured format are as follows:
1. Lacks a fixed/rigid schema,
2. No separation between data and schema,
3. Self-describing structure (tags or other markers), and
4. Does not conform to the structure associated with typical relational databases.

It can represent the information of data formats that cannot be constrained by schema.

Examples – CSV, XML, HTML, JSON.

The amount of structure used depends on the purpose.  

CSV file
CSV (comma-separated values) is a delimited text file format that uses a comma to separate values, saved with the extension .csv. A CSV file stores tabular data, such as a spreadsheet or database, in plain text. Each line of the file is a data record, and each record consists of one or more fields, separated by commas. CSV files allow large amounts of detailed data to be transferred 'machine-to-machine', with little or no reformatting by the user. They are mostly used for importing and exporting information, such as customer or order data, to and from a database.

Files in the CSV format can be imported to and exported from programs that store data in tables, such as Microsoft Excel or OpenOffice Calc.

A simple example of a .csv file:
  
Peter,2020,Hostel-1,104,3.50
Jamie,2019,Hostel-2,220,4.14
Jeffy,2019,Hostel-2,235,3.20
Arun,2020,Hostel-1,110,4.18
 Figure - 3

A CSV is a text file, so it can be created and edited using any text editor. More frequently, however, a CSV file is created by exporting (File menu -> Export) a spreadsheet or database in the program that created it.
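Because each line is a record and each comma separates a field, CSV is trivial to read programmatically. The sketch below parses the Figure-3 records with Python's standard csv module (io.StringIO stands in for an open file, so the example is self-contained):

```python
import csv
import io

# The Figure-3 records as they would appear in a .csv file
raw = """Peter,2020,Hostel-1,104,3.50
Jamie,2019,Hostel-2,220,4.14
Jeffy,2019,Hostel-2,235,3.20
Arun,2020,Hostel-1,110,4.18"""

# csv.reader splits each line on the comma delimiter
records = list(csv.reader(io.StringIO(raw)))
print(records[0])    # ['Peter', '2020', 'Hostel-1', '104', '3.50']
print(len(records))  # 4
```

Note that every field comes back as a string; converting the year or GPA to a number is left to the program reading the file.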


HTML
Short for HyperText Markup Language, HTML is the authoring language used to create documents on the World Wide Web. HTML is similar to Standard Generalized Markup Language (SGML), although it is not a strict subset. HTML defines the structure and layout of a Web document by using a variety of tags and attributes. The basic structure of an HTML document starts with <HTML><HEAD> (information about the document goes here) </HEAD><BODY> and ends with </BODY></HTML>. All the information you'd like to include in your Web page fits in between the <BODY> and </BODY> tags.


XML
Extensible Markup Language (XML) is used to describe data. The XML standard is a flexible, portable, open way to create information formats and to electronically store and share small and medium amounts of structured data via the public Internet, as well as via corporate networks. It is both human and machine readable. XML stores data in plain text format, which provides a software- and hardware-independent way of storing, transporting, and sharing data.

An XML file will open in your text editor. The complexity of the file depends on what it was created for. Use the tag labels to find the information you are looking for. Generally the labels will be fairly self-explanatory, allowing you to browse through the data and find the information you need.
You'll likely see <?xml version="1.0" encoding="UTF-8"?> at the top. This indicates that the following content is in XML format.

XML uses custom tags to house pieces of data. Each of these tags is created for whatever program is using it, so there is no common syntax to the markup labels. For example, one XML file may have a <body></body> section, and another might have <message_body></message_body>, but both may function similarly. Tags can be nested inside other tags, creating a tree. For example, each <note></note> tag may have several tags inside, such as <title></title> and <date></date>.
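The nested-tree structure described above can be walked programmatically. As a sketch, the example below parses a small document built around the <note> tags mentioned earlier (the document itself is invented for illustration), using Python's standard xml.etree.ElementTree module:

```python
import xml.etree.ElementTree as ET

# A small invented document using the <note> structure described above
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<notes>
  <note>
    <title>Meeting</title>
    <date>2021-03-01</date>
  </note>
  <note>
    <title>Review</title>
    <date>2021-03-15</date>
  </note>
</notes>"""

root = ET.fromstring(xml_text)  # parse the text into a tree of elements
titles = [
    (note.find("title").text, note.find("date").text)
    for note in root.findall("note")  # nested tags are children of <notes>
]
print(titles)  # [('Meeting', '2021-03-01'), ('Review', '2021-03-15')]
```

Because the markup is self-describing, the parser needs no external schema to recover the records.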

Some differences in HTML and XML structures
HTML is designed to display data, with an emphasis on how the data looks; XML tags are not predefined like HTML tags.

HTML is a markup language whereas XML provides a framework for defining markup languages. 

HTML is about displaying data, hence it is static whereas XML is about carrying information, which makes it dynamic. 

XML does not carry any information about how to be displayed. 

In many HTML applications, XML is used to store or transport data, while HTML is used to format and display the same data. 

XML files are not displayed as HTML pages.

JSON
JSON, or JavaScript Object Notation, is a way to encode data structures that ensures that they are easily readable by machines. It is a minimal, readable format for structuring data. It is used primarily to transmit data between a server and a web application, as an alternative to XML. JSON is the primary format in which data is passed back and forth to APIs (Application Programming Interfaces), and most API servers will send their responses in JSON format. Python has great JSON support with the json package. The json package is part of the standard library, so we don’t have to install anything to use it. We can both convert lists and dictionaries to JSON, and convert strings to lists and dictionaries.

For a while, XML (Extensible Markup Language) was the only choice for open data interchange. But over the years there has been a lot of transformation in the world of open data sharing, and the more lightweight JSON has become a popular alternative to XML for various reasons.

When exchanging data between a browser and a server, the data can only be text. JSON is text, and we can convert any JavaScript object into JSON and send the JSON to the server. We can also convert any JSON received from the server into JavaScript objects. This way we can work with the data as JavaScript objects, with no complicated parsing and translation. JSON has a fixed syntax, rules and even data types. There are JSON parsers available, which give you the ability to search, sort and edit JSON data. A parser creates a representation in main memory, making these operations fast; performance depends on data size.

Key/Value pairs

A JSON object contains data in the form of key/value pairs. The keys are strings and the values are JSON types. Keys and values are separated by a colon, and each entry (key/value pair) is separated by a comma.

JSON supports two widely used data structures (amongst programming languages).

·   A collection of name/value pairs. Different programming languages support this data structure under different names, such as object, record, structure, dictionary, hash table, keyed list, or associative array.

·       An ordered list of values. In various programming languages, this is called an array, vector, list, or sequence.

Since the data structures supported by JSON are also supported by most modern programming languages, JSON is a very useful data-interchange format.
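The two structures above map directly onto Python's dict and list, which is what makes the json package so convenient. The sketch below round-trips an invented record (the field names are illustrative) between a Python object and JSON text:

```python
import json

# A collection of name/value pairs (dict) containing an ordered list (list)
student = {"name": "Peter", "year": 2020, "hostel": "Hostel-1", "scores": [3.50, 3.75]}

# Serialize to a JSON string, e.g. to send to a web application
text = json.dumps(student)

# Parse the JSON text back into native data structures
decoded = json.loads(text)
print(decoded["name"])    # Peter
print(decoded["scores"])  # [3.5, 3.75]
```

Note how the key/value pairs, colons and commas in the generated text follow exactly the syntax rules described above.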


Quasi-structured

Clickstream is the recording of the areas of the screen that a user clicks while web browsing. As the user clicks anywhere in the web page, the action is logged. The log contains information such as time, URL, the user’s machine, type of browser, type of event (for example, browsing, checking out, logging in, removing from cart, logging out with or without purchase), product information (for example, ID, category, and price), total purchase in basket, number of items in basket, and session duration. This information can give valuable clues about what visitors are doing on your web site, and about the visitors themselves.

Clickstream analysis is useful for web activity analysis and market research. The navigation path can indicate purchase interests and price range. You can identify browsing patterns to determine the probability that the user will place an order.
On a Web site, clickstream analysis (also called clickstream analytics) is the process of collecting, analyzing and reporting aggregate data about which pages a website visitor visits, and in what order. The path the visitor takes through a website is called the clickstream.
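Clickstream logs are "quasi-structured" because the fields are there, but in an ad hoc layout that each program must pick apart itself. As a small illustration, the sketch below parses one hypothetical log line; the pipe-delimited layout and field names are invented for this example, since real clickstream formats vary from site to site:

```python
# A hypothetical clickstream log line (layout and field names are
# illustrative, not a standard format): time|url|browser|event|basket_total
line = "2021-06-01T10:32:05|/cart/checkout|Firefox|checkout|59.90"

timestamp, url, browser, event, basket_total = line.split("|")
record = {
    "timestamp": timestamp,
    "url": url,
    "browser": browser,
    "event": event,
    "basket_total": float(basket_total),  # only now does "59.90" become a number
}
print(record["event"], record["basket_total"])  # checkout 59.9
```

In practice, many such lines would be aggregated per session to reconstruct the navigation path described above.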

Figure - 4

Unstructured

Approximately 2.5 quintillion bytes of data are created each day, and, according to Gartner, an estimated 80% of all data is unstructured.

It can be categorized as qualitative data. Unlike structured databases, unstructured databases do not have a predefined data model. Unstructured data is managed by non-relational (NoSQL) databases. Data lakes are used to preserve unstructured data in its raw form.

Unstructured data is data that is not formally organized. Such files often include books, journals, documents, metadata, health records, audio, video, analogue data, images, multimedia, maps, graphs, and unstructured text such as the body of an e-mail message, Web pages, word-processor documents, business documents and presentations. Unstructured data owes its popularity to technologies such as NoSQL and Hadoop, and to formats such as JSON and XML.

 Figure - 5
NPS measures customer loyalty; CSAT measures customer satisfaction.

Speech data

Figure - 6


Image and Video data


Figure - 7


Text data
Figure - 8

References and Figure Credits

Big Data: Conceptual Analysis and Applications By Michael Z. Zgurovsky, Yuriy P. Zaychenko

Figure-4: Modeling Online Browsing with Clickstream Data Using Path Analysis, SlideShare
Figure-5: ORI News
Figure-6: Mining a Year of Speech, Linguistic Data Consortium
Figure-7: Duke Department of Medicine – Duke University
