35 basic definitions to understand Big Data



1. OLTP

Online Transactional Processing (OLTP) refers to the general activity of updating, querying and presenting text and number data from databases for operational purposes. In other words, OLTP encompasses the everyday transactions done on operational database systems, for example a transaction reflecting a withdrawal from a checking account or a transaction creating an airline reservation. (Burstein & Holsapple, 2008)


2. OLAP

Online Analytical Processing (OLAP) refers to the general activity of querying and presenting text and number data from data warehouses and/or data marts for analytical purposes. [...] OLAP tools are “read only”; they are used exclusively for the retrieval of data (from analytical repositories) to be used in the decision-making process. (Burstein & Holsapple, 2008)

3. Data Science

Data science is the business application of machine learning, artificial intelligence, and other quantitative fields that extracts value from data. In the context of how data science is used today, it relies heavily on machine learning and is sometimes called data mining. Some examples are recommendation engines that can recommend movies for a particular user, a fraud alert model that detects fraudulent credit card transactions, or a model that predicts revenue for the next quarter. (Kotu & Deshpande, 2019)

4. Data Lake

Data Lake is a huge repository that holds every kind of data in its raw format until it is needed by anyone in the organization for analysis. (Pasupuleti & Purra, 2015)

5. Data Pipeline

Data pipeline is an abstract way of talking about the data handling components, written in software, that are applied to data objects in sequence. The data pipeline is a useful abstraction because it helps one to think about how data are pushed in real time from sensors and instruments through processing steps towards outcomes, and how to optimize data handling while minimizing its cost. A data pipeline is an abstraction for managing and streamlining data processes throughout the data lifecycle. (Chowdhury, Apon, & Dey, 2017)
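The idea of stages applied to data objects in sequence can be sketched in a few lines of Python; the stage names and the sensor data below are made up for illustration:

```python
# A minimal data-pipeline sketch: each stage is a function applied in
# sequence to every record flowing through. Stage names are illustrative.
def parse(record):          # stage 1: raw string -> dict
    sensor, value = record.split(",")
    return {"sensor": sensor, "value": float(value)}

def validate(record):       # stage 2: drop obviously bad readings
    return record if -50.0 <= record["value"] <= 50.0 else None

def enrich(record):         # stage 3: derive a new field
    record["freezing"] = record["value"] <= 0.0
    return record

def run_pipeline(records, stages):
    """Apply each stage in order; None means the record was filtered out."""
    out = []
    for r in records:
        for stage in stages:
            r = stage(r)
            if r is None:
                break
        else:
            out.append(r)
    return out

raw = ["s1,21.5", "s2,-3.0", "s3,999.0"]   # the 999.0 reading gets dropped
result = run_pipeline(raw, [parse, validate, enrich])
```

Real pipelines add buffering, retries and parallelism, but the shape — an ordered list of small transformations — stays the same.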

6. Data warehouse

Data warehouse is a big database that holds copies of data from other systems, that is then made available for use for other applications. (Mattison, 2006)

7. Data mining

Data mining is finding useful patterns in data; it is also referred to as knowledge discovery, machine learning and predictive analytics. The act of data mining uses specialized computational methods to discover meaningful and useful structures in the data. These computational methods have been derived from the fields of statistics, machine learning and artificial intelligence. (Kotu & Deshpande, 2015)

8. Data Analytics

Data analytics is defined as the application of computer systems to the analysis of large data sets for the support of decisions. Data analytics is a very interdisciplinary field that has adopted aspects from many other scientific disciplines such as statistics, machine learning, pattern recognition, system theory, operations research or artificial intelligence. (Runkler, 2016)

9. Data visualization

Visualizing data is one of the most important techniques of data discovery and exploration. Though visualization is not considered a data science technique, terms like visual mining or pattern discovery based on visuals are increasingly used in the context of data science, particularly in the business world. The discipline of data visualization encompasses methods of expressing data that provide easy comprehension of complex data with multiple attributes and their underlying relationships. (Kotu & Deshpande, 2019)

10. Business intelligence

Business intelligence is a broad category of applications and technologies for gathering, storing, analyzing and providing access to data to help enterprise users make better decisions. Business Intelligence applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting and data mining. (Brijs, 2013)

11. Kafka

Apache Kafka is an open source, distributed, partitioned, and replicated commit-log-based publish-subscribe messaging system. [...] Kafka provides a real-time publish-subscribe solution that overcomes the challenges of consuming real-time and batch data volumes that might grow to be orders of magnitude larger than the real data. Kafka also supports parallel data loading into Hadoop systems. (Garg, 2015)

12. Spark

Apache Spark is a cluster computing platform designed to be fast and general-purpose. Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala and SQL, and rich built-in libraries. It also integrates closely with other Big Data tools. In particular, Spark can run in Hadoop clusters and access any Hadoop data source, including Cassandra. (Karau, Konwinski, Wendell, & Zaharia, 2015)

13. Hadoop

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

Hadoop allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. (Jain, n.d.)

14. Python

Python is a powerful multiparadigm computer programming language, optimized for programmers’ productivity, code readability and software quality. (Lutz, 2013)

15. Pandas

Pandas is a Python library containing high-level data structures and tools that have been created to help Python programmers to perform powerful data analysis. The ultimate purpose of Pandas is to help you quickly discover information in data, with information being defined as an underlying meaning. (Heydt, 2017)
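A few lines are enough to see the kind of "quick discovery" the definition describes; the sales figures below are made up for illustration:

```python
# A small pandas sketch: load tabular data into a DataFrame and aggregate
# it to surface the underlying meaning. The numbers are illustrative.
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "amount": [100.0, 80.0, 120.0, 60.0],
})

# Group and sum in one expression — the high-level tools Pandas provides.
totals = sales.groupby("region")["amount"].sum()
```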

16. Matplotlib

Matplotlib is a Python package for 2D plotting that generates production-quality graphs. It supports interactive and noninteractive plotting, and can save images in several output formats (PNG, PS, and others). It can use multiple window toolkits (GTK+, wxWidgets, Qt, and so on) and it provides a wide variety of plot types (lines, bars, pie charts, histograms, and many more). In addition to this, it is highly customizable, flexible and easy to use. (Tosi, 2009)
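A minimal non-interactive example, saving one of the supported output formats (PNG) to a temporary path:

```python
# Non-interactive matplotlib sketch: render a bar chart and save it as PNG.
import os
import tempfile

import matplotlib
matplotlib.use("Agg")              # non-interactive backend, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(["a", "b", "c"], [3, 7, 5])            # one of many plot types
ax.set_title("A production-quality graph in a few lines")
out_path = os.path.join(tempfile.gettempdir(), "example_plot.png")
fig.savefig(out_path)                          # PNG is one supported format
plt.close(fig)
```

The same figure object could instead be shown interactively in a GTK+, wxWidgets or Qt window, which is what the backend choice controls.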

17. Seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. (Waskom, 2018)

18. Machine learning

Machine learning, which can be considered either a sub-field or one of the tools of artificial intelligence, provides machines with the capability of learning from experience. Experience for machines comes from data. Data that is used to teach machines is called training data. For example, many organizations like social media platforms, review sites, or forums are required to moderate posts and remove abusive content. How can machines be taught to automate the removal of abusive content? The machines need to be shown examples of both abusive and non-abusive posts with a clear indication of which is abusive. The learners will generalize a pattern based on certain words or sequences of words in order to conclude whether the overall post is abusive or not. The model can take the form of a set of “if-then” rules. Once the data science rules or model is developed, machines can start categorizing the disposition of any new posts. (Kotu & Deshpande, 2019)
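A toy version of the abusive-post example above can be sketched in plain Python; the training posts and the word-count "rules" are made up for illustration, and a real system would use a proper learning algorithm:

```python
# Toy sketch of learning if-then rules from labeled training data.
# The "model" is just per-label word counts; all text is illustrative.
from collections import Counter

training = [
    ("you are an idiot", "abusive"),
    ("total idiot behaviour", "abusive"),
    ("great post thanks", "ok"),
    ("thanks for sharing", "ok"),
]

# "Learning": count how often each word appears under each label.
counts = {"abusive": Counter(), "ok": Counter()}
for text, label in training:
    counts[label].update(text.split())

def classify(post):
    """If the post's words were seen more in abusive examples, flag it."""
    score = sum(counts["abusive"][w] - counts["ok"][w] for w in post.split())
    return "abusive" if score > 0 else "ok"
```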

19. ETL

ETL is short for Extract, Transform and Load. It is the set of processes for getting data from OLTP systems, websites, flat files, e-mail databases, spreadsheets and personal databases such as Access. ETL is not only used to load a single data warehouse; it has many other use cases, like loading data marts, generating spreadsheets, scoring customers using data mining models, or even loading forecasts back into OLTP systems. (Casters, Bouman, & Dongen, 2010)
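The three steps can be sketched with only the Python standard library: extract rows from a flat file (here an inline CSV), transform them, and load them into a database (here in-memory SQLite). Table and column names are made up for illustration:

```python
# Minimal ETL sketch: CSV in, typed/derived rows, SQLite out.
import csv
import io
import sqlite3

raw_csv = "name,amount\nalice,10\nbob,20\n"

# Extract: read the flat file into dictionaries.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: cast types and derive a field.
for r in rows:
    r["amount"] = int(r["amount"])
    r["amount_with_tax"] = round(r["amount"] * 1.1, 2)

# Load: insert into the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (name TEXT, amount INT, amount_with_tax REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (:name, :amount, :amount_with_tax)", rows
)
total = conn.execute("SELECT SUM(amount_with_tax) FROM sales").fetchone()[0]
```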

20. Dataset

A group of structured data retrievable via a link or a single instruction, as a whole, from a single entity, with an updating frequency larger than once a minute. (“MELODA,” 2019)

21. Scala

Scala is a blend of object-oriented and functional programming concepts in a statically typed language. The fusion of object-oriented and functional programming shows up in many different aspects of Scala; it is probably more pervasive than in any other widely used language. Scala’s functional programming constructs make it easy to build interesting things quickly from simple parts. Its object-oriented constructs make it easy to structure larger systems and to adapt them to new demands. The combination of both styles in Scala makes it possible to express new kinds of programming patterns and component abstractions. It also leads to a legible and concise programming style. (Odersky, Spoon, & Venners, 2008)

22. R

R is a powerful programming language and environment for statistical computing, data exploration, analysis and visualization. It is free, open source, and has a strong, rapidly growing community where users and developers share their experience and actively contribute to the development of more than 7,500 packages, so that R can deal with problems in a wide range of fields. (Ren, 2016)

23. SQL

Structured Query Language (SQL) is the standard language for querying and manipulating data in relational database management systems (RDBMS), such as SQL Server, the RDBMS developed by Microsoft. (McQuillan, 2015)
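SQL's declarative style — say what you want, not how to fetch it — can be tried directly from Python, since SQLite (one of many RDBMSs that understand SQL) ships in the standard library. The table and values are made up for illustration:

```python
# Running SQL against an in-memory SQLite database via the stdlib module.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("ana", 34), ("ben", 19), ("cho", 42)])

# A declarative query: filter and sort without writing any loops.
adults = conn.execute(
    "SELECT name FROM users WHERE age >= 21 ORDER BY name"
).fetchall()
```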

24. NoSQL

NoSQL is a mechanism for storing data that doesn’t have any fixed schema. Most people assume that it means No SQL, whereas the abbreviation actually stands for Not Only SQL. It means that it does not rely only on the SQL programming language for manipulating and storing data, but can be used in conjunction with other programming languages. (Akhtar, 2018)

25. Numpy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • A powerful N-dimensional array object
  • Sophisticated (broadcasting) functions
  • Tools for integrating C/C++ and Fortran code
  • Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases. (“NumPy,” 2018)
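The N-dimensional array and broadcasting points above can be seen in a few lines; the matrix values are made up for illustration:

```python
# NumPy sketch: an N-dimensional array plus a broadcasting operation.
import numpy as np

matrix = np.arange(6).reshape(2, 3)      # 2x3 array: [[0,1,2],[3,4,5]]
row_means = matrix.mean(axis=1)          # [1.0, 4.0]

# Broadcasting: subtract each row's mean with no explicit loop —
# the (2,) vector is stretched to match the (2,3) matrix.
centered = matrix - row_means[:, np.newaxis]
```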

26. Scikit-learn

Scikit-learn is a free and open source software that helps you tackle supervised and unsupervised machine learning projects. The software is built entirely in Python and utilizes some of the most popular libraries that Python has to offer, namely Numpy and SciPy. (Jolly, 2018)
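A minimal supervised-learning sketch with scikit-learn; the tiny one-feature dataset below is made up so the two classes are trivially separable:

```python
# Fit-then-predict, the basic scikit-learn workflow, on toy data.
from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [9.0], [10.0]]   # one feature per sample
y = [0, 0, 1, 1]                    # class labels

model = LogisticRegression()
model.fit(X, y)                     # supervised: learn from labeled examples
pred = model.predict([[0.5], [9.5]])
```

The same `fit`/`predict` pattern applies across the library's estimators, whether supervised or (with `fit`/`transform`) unsupervised.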

27. MapReduce

MapReduce is a paradigm of distributed computing in which a given function is applied to smaller parts of a data set so they can be processed simultaneously or in parallel by different machines or processes, and the result of each part is combined to give a final result for the whole. (Leskovec, Rajaraman & Ullman, 2014)
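The classic word-count example sketched in plain Python; a real framework like Hadoop would run the map step on many machines, but the shape is the same. The text chunks are made up for illustration:

```python
# MapReduce sketch: map each chunk independently, then reduce the partials.
from collections import Counter
from functools import reduce

chunks = ["big data big", "data lake", "big pipeline"]

def map_chunk(chunk):
    return Counter(chunk.split())          # per-chunk partial word counts

def reduce_counts(a, b):
    return a + b                           # merge two partial results

partials = [map_chunk(c) for c in chunks]  # could run on many machines
total = reduce(reduce_counts, partials, Counter())
```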

28. Stream processing

Stream Processing is a Big Data technology. It is used to query a continuous data stream and detect conditions quickly, within a small time period from the moment the data is received. The detection time period varies from a few milliseconds to minutes. For example, with stream processing you can receive an alert when the temperature has reached the freezing point, by querying data streams coming from a temperature sensor.

It is also called by many names: real-time analytics, streaming analytics, Complex Event Processing, real-time streaming analytics, and event processing. Although some of these terms historically had differences, tools (frameworks) have now converged under the term stream processing. (Perera, 2018)
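The temperature-sensor example above fits in a few lines of Python; the generator stands in for a real, never-ending sensor feed, and the readings are made up for illustration:

```python
# Stream-processing sketch: consume a (simulated) temperature stream and
# raise an alert the moment a reading hits the freezing point.
def temperature_stream():
    # Stand-in for a sensor; a real stream would never terminate.
    for reading in [12.3, 4.0, 0.0, -1.5]:
        yield reading

alerts = []
for temp in temperature_stream():
    if temp <= 0.0:                 # the "condition" queried on the stream
        alerts.append(f"freezing: {temp}")
```

The key property is that each reading is handled as it arrives, rather than being collected first and analyzed later.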

29. Batch processing

Batch Processing is the process by which a computer completes batches of jobs, often simultaneously, in non-stop, sequential order. It’s also a command that ensures large jobs are computed in small parts for efficiency during the debugging process. (Watts, 2017)
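The "large job in small parts" idea can be sketched as splitting work into fixed-size batches; the job list and batch size are made up for illustration:

```python
# Batch-processing sketch: split a large job into fixed-size batches
# and process them one after another.
def batches(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

jobs = list(range(10))
processed = []
for batch in batches(jobs, size=4):          # batches of 4, 4, and 2 jobs
    processed.extend(x * 2 for x in batch)   # the per-job "work"
```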

30. Structured Data

Structured data is all data that can be stored in a SQL database, in tables with rows and columns. It has relational keys and can be easily mapped into pre-designed fields. Today, this is the kind of data most commonly processed in development and the simplest way to manage information. (Jain, n.d.)

31. Unstructured Data

Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in fielded form in databases or annotated in documents. (Jain, n.d.)

32. Cloud computing

Cloud computing is defined as “a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction”. Loosely speaking, cloud computing represents a new way to deploy computing technology to give users the ability to access, work on, share and store information using the internet. (Wang, Ranjan, Chen, & Benatallah, 2012)

33. Airflow

Airflow is a platform to programmatically author, schedule and monitor workflows.

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative. (“Apache Airflow Documentation,” n.d.)

34. Directed acyclic graph

In order to support the ability to push and pull changesets between multiple instances of the same repository, we need a specially designed structure for representing multiple versions of things. The structure we use is called a Directed Acyclic Graph (DAG), a design which is more expressive than a purely linear model. The history of everything in the repository is modeled as a DAG. (“Directed Acyclic Graphs (DAGs),” n.d.)
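Beyond version control, the same structure is what workflow schedulers rely on: because the graph has no cycles, its nodes can always be put in an order where every dependency comes first (a topological sort). A sketch with the standard library, using illustrative task names:

```python
# A DAG as an adjacency mapping, plus a topological sort — the property
# a scheduler relies on to order tasks. Task names are illustrative.
from graphlib import TopologicalSorter  # standard library since Python 3.9

# Each task maps to the set of tasks it depends on.
dag = {
    "load": {"transform"},
    "transform": {"extract"},
    "extract": set(),
}
order = list(TopologicalSorter(dag).static_order())
```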

35. Dark Data

Dark data is considered to be data that hasn’t been classified or associated with an analytical tool or use. It is all the unknown data within an organization. This is the data generated by individual users or by unconnected or uncataloged systems that might be sitting outside of standard storage management and protection systems. In this sense, it’s not only data that is not classified or associated with analytical functions, but also data the business may fundamentally be unaware of. (de Guise, 2017)


Akhtar, M. F. (2018). Big Data Architect’s Handbook: A guide to building proficiency in tools and systems used by leading Big Data experts. UK: Packt Publishing.

Apache Airflow Documentation. (n.d.). Retrieved March 18, 2019, from https://airflow.apache.org/

Brijs, B. (2013). Business Analysis for Business Intelligence. Taylor & Francis Group.

Burstein, F., & Holsapple, C. W. (2008). Handbook on Decision Support Systems 1: Basic Themes. Springer.

Casters, M., Bouman, R., & Dongen, J. van. (2010). Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration. Canada: Wiley Publishing Inc.

Chowdhury, M., Apon, A., & Dey, K. (2017). Data Analytics for Intelligent Transportation Systems (Elsevier).

Directed Acyclic Graphs (DAGs). (n.d.). Retrieved March 18, 2019, from https://ericsink.com/vcbe/html/directed_acyclic_graphs.html

Garg, N. (2015). Learning Apache Kafka - Second Edition. UK: Packt Publishing.

Guise, P. de. (2017). Data Protection: Ensuring Data Availability. United States: Taylor & Francis Group.

Heydt, M. (2017). Learning pandas. UK: Packt Publishing.

Jain, V. K. (n.d.). Big Data and Hadoop. New Delhi: Khanna Book Publishing Co.

Jolly, K. (2018). Machine Learning with scikit-learn Quick Start Guide: Classification, regression, and clustering techniques in Python. UK: Packt Publishing.

Karau, H., Konwinski, A., Wendell, P., & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis (Databricks). United States: O’Reilly Media, Inc.

Kotu, V., & Deshpande, B. (2015). Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner (Elsevier). United States.

Kotu, V., & Deshpande, B. (2019). Data Science: Concepts and practice (Elsevier). United States: Morgan Kaufmann.

Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press.

Lutz, M. (2013). Learning Python: Powerful Object-Oriented Programming. Canada: O’Reilly Media, Inc.

Mattison, R. (2006). The Data Warehousing Handbook. United States: XiT Press.

McQuillan, M. (2015). Introducing SQL Server (Springer). Apress.

MELODA. (2019). Retrieved March 18, 2019, from http://www.meloda.org/dataset-definition/

NumPy. (2018). Retrieved March 18, 2019, from http://www.numpy.org/

Odersky, M., Spoon, L., & Venners, B. (2008). Programming in Scala. United States: Artima Inc.

Pasupuleti, P., & Purra, B. S. (2015). Data Lake Development with Big Data. Packt Publishing.

Perera, S. (2018). Stream Processing. Retrieved March 18, 2019, from https://medium.com/stream-processing/what-is-stream-processing-1eadfca11b97

Ren, K. (2016). Learning R Programming. UK: Packt Publishing.

Runkler, T. A. (2016). Data Analytics: Models and Algorithms for Intelligent Data Analysis (Springer).

Tosi, S. (2009). Matplotlib for Python Developers. UK: Packt Publishing.

Wang, L., Ranjan, R., Chen, J., & Benatallah, B. (2012). Cloud Computing: Methodology, Systems, and Applications. United States: Taylor & Francis Group.

Waskom, M. (2018). seaborn: statistical data visualization. Retrieved March 18, 2019, from https://seaborn.pydata.org/

Watts, S. (2017). What is Batch Processing? Batch Processing Explained. Retrieved March 18, 2019, from https://www.bmc.com/blogs/what-is-batch-processing-batch-processing-explained/

