Data is organized to meet a schema. Think of tables, which organize data into rows and columns.
Data is unorganized and lacks a schema. Imagine collections of HTML documents, including text and images, not organized in any consistent way.
- Extract: data from the data source
- Transform: the data during transfer via appropriate logic
- Load: the transformed data into the target for analysis
Pro: Data can be analyzed immediately once loaded
Con: Transformations can slow down the transfer of data, making real-time analysis difficult
- Extract: data from the data source
- Load: the untransformed data into the target
- Transform: the data in the target for analysis
Pro: Can be faster for real-time data analysis
Con: Can require more storage space and compute resources, though cloud providers have largely alleviated this constraint
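The two orderings above can be sketched as a toy pipeline. The transformation (uppercasing a field) and the in-memory list standing in for the target storage are illustrative assumptions, not a real warehouse API.

```python
# Toy ETL vs ELT: the same extract/transform/load steps, in different orders.
# The "target" is just a Python list standing in for warehouse storage.

def extract():
    # Pretend this pulls raw rows from a source system.
    return [{"name": "ada"}, {"name": "grace"}]

def transform(rows):
    # Example transformation: normalize names to uppercase.
    return [{"name": r["name"].upper()} for r in rows]

def etl(target):
    # ETL: transform happens in flight; the target only sees clean data.
    target.extend(transform(extract()))

def elt(target):
    # ELT: raw rows land first; transformation runs inside the target later.
    target.extend(extract())
    target[:] = transform(target)

etl_target, elt_target = [], []
etl(etl_target)
elt(elt_target)
assert etl_target == elt_target  # same end state, different ordering
```

The end state is identical; the tradeoff is where the transformation work happens and whether raw data ever lands in the target.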
Platform for storing large amounts of structured data for analysis. Data must be structured before loading, which can make ingesting large flows of incoming data difficult and cumbersome.
DW Strategies
-
Inmon: Everything is cleaned and normalized upfront for a single source of truth that is distributed in department-level marts. The most common strategy, but queries are slow due to normalization, and sharing data between departments is difficult without creating extra marts.
-
Kimball: Start with what the data should look like to the end user (the marts), then curate the data as needed with that in mind. May involve more duplication of data and a more complex setup, but gives faster queries and more robust availability of data. Less often used because existing sources usually already tend toward Inmon.
-
Data Vault: Used for long-term historical storage because it tracks all changes, with each change keeping a reference to its source.
Characteristics of a database transaction that make it transactional: Atomicity, Consistency, Isolation, Durability (all or nothing; no partial completion creating broken data)
Platform for storing large amounts of structured and unstructured data for analysis. Since data does not have to be prepared beforehand, loading real-time data is much faster.
Data can be Schema on Read, so schemas aren't enforced on write like in traditional databases.
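Schema on Read can be sketched with plain JSON: raw records are stored untyped as they arrive, and a schema is applied only at query time. The store, field names, and types below are invented for illustration.

```python
import json

# Schema-on-read sketch: raw records are stored as untyped JSON strings;
# a schema (field names + types) is applied only when the data is read.
raw_store = [
    '{"id": "1", "price": "9.99"}',
    '{"id": "2", "price": "4.50"}',
]

# The schema lives with the reader, not the writer.
schema = {"id": int, "price": float}

def read_with_schema(store, schema):
    for line in store:
        rec = json.loads(line)
        yield {field: cast(rec[field]) for field, cast in schema.items()}

rows = list(read_with_schema(raw_store, schema))
# rows[0] == {"id": 1, "price": 9.99}
```

A traditional database would reject or coerce these values at write time; here the raw strings land untouched and typing is deferred.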
How the data is stored in the data lake is determined by a table format, which can help provide ACID compliance, quality governance, etc.
- Apache Iceberg (from Netflix)
- Apache Hudi (From Uber)
- Delta Lake (From Databricks)
- Spark Tables
- Hive Tables
Processes for Business Intelligence, often using large aggregated data with complex queries. Not backed up as regularly because the data is completely recalculated each time. Used by internal analysis professionals.
Processes for day-to-day business operations that result in small changes to data with simpler queries, requiring more frequent backups (to preserve the integrity of data like bank withdrawals). Used by customers and frontline staff.
Languages for expressing queries to a database.
- SQL (Structured Query Language): The most popular standard, used across a variety of relational databases and data platforms.
SELECT * FROM dogs;
- HQL (Hive Query Language): Used for SQL like query over Hadoop/Spark data lakes.
- CQL (Cypher Query Language): Language used for the Neo4j graph database
MATCH (tom:Person {name: 'Tom Hanks'}) RETURN tom
- CQL (Cassandra Query Language): SQL-like language used for Cassandra DB
- AQL (ArangoDB Query Language): Language used for the ArangoDB graph database
- The document database MongoDB uses a very JavaScript-like syntax for forming queries
db.collection.find({})
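The SQL example above can be run end to end with Python's built-in sqlite3 module; the `dogs` table and its rows are made up for illustration.

```python
import sqlite3

# Runnable version of the SELECT example above, using an in-memory
# SQLite database. The dogs table and its rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (name TEXT, breed TEXT)")
conn.executemany("INSERT INTO dogs VALUES (?, ?)",
                 [("Rex", "Lab"), ("Fido", "Pug")])

rows = conn.execute("SELECT * FROM dogs").fetchall()
# rows == [('Rex', 'Lab'), ('Fido', 'Pug')]
conn.close()
```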
Taking data and eliminating redundancy by breaking parts of the data into smaller related tables.
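Normalization in miniature: a table that repeats customer details on every order is split into a customers table and an orders table linked by a key. All data below is invented for illustration.

```python
# Normalization sketch: one redundant table split into two related tables.
orders = [
    {"order_id": 1, "customer": "Ada",   "customer_city": "London"},
    {"order_id": 2, "customer": "Ada",   "customer_city": "London"},
    {"order_id": 3, "customer": "Grace", "customer_city": "Arlington"},
]

# Break the repeated customer data out into its own table...
customers = {}
for o in orders:
    customers.setdefault(o["customer"],
                         {"name": o["customer"], "city": o["customer_city"]})

# ...and keep only a reference (the key) in the orders table.
normalized_orders = [{"order_id": o["order_id"], "customer": o["customer"]}
                     for o in orders]

# Ada's city is now stored once instead of once per order.
assert len(customers) == 2
```

Updating a customer's city now touches one row instead of every order that customer ever placed, which is the point of eliminating redundancy.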
Hadoop is made of three main parts
- HDFS: Allows you to store files across a cluster of computers
- YARN: (Yet Another Resource Negotiator): Helps orchestrate resources for cluster operations
- MapReduce: For processing data; relatively slow and difficult to code, writes to disk (only handles batch jobs), and can support jobs written in Java, C++, Python, and R
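The MapReduce model can be shown in miniature with a word count in plain Python. A real cluster runs the map, shuffle, and reduce phases across many machines and writes intermediate results to disk; here everything happens in one process.

```python
from collections import defaultdict

# MapReduce in miniature: word count with explicit map, shuffle,
# and reduce phases over two tiny invented documents.
docs = ["the quick fox", "the lazy dog"]

# Map: emit (word, 1) pairs from each input record.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group emitted pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine each key's values into a final result.
counts = {word: sum(values) for word, values in grouped.items()}
# counts["the"] == 2
```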
Spark is a data processing framework that can replace MapReduce and work with different storage layers like HDFS. More expensive than MapReduce, as the infrastructure needs a lot of memory. Can support jobs written in Scala, Java, Python, and R.
- Spark loads data into memory and processes it there for much faster speeds
- Can facilitate Batch, Streaming and Graph jobs
- NameNode (HDFS): the node that tracks all the segments of data and which nodes they are stored on.
- RDD (Resilient Distributed Dataset): the entry point for a Spark job, created through the SparkContext.
A binary columnar format meant to be the sweet spot between file storage (CSV/JSON) and using a database
- Databases can be much faster to query but much more expensive to maintain
- Files like CSV/JSON are cheaper but lack a schema, are slow to query, and can't use Spark's pushdown feature
Parquet is a file representation of data that includes the schema, is fast, and can take advantage of some of Spark's advanced features.
A standard format for organizing data in memory. This format organizes the data by column instead of by row for faster, more efficient querying. (For example, a query that only touches a particular column doesn't have to traverse all the data in every row, increasing speed.)
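The row-versus-column layout can be illustrated with plain Python data structures; the records below are invented. A columnar store keeps each column's values contiguous, so a single-column aggregate only reads that column.

```python
# Row layout vs columnar layout for the same records (invented data).
row_store = [
    {"name": "Rex",  "age": 3},
    {"name": "Fido", "age": 5},
]

# Columnar layout: one contiguous list per column.
col_store = {
    "name": ["Rex", "Fido"],
    "age":  [3, 5],
}

# A single-column aggregate touches only that column's values...
avg_age_columnar = sum(col_store["age"]) / len(col_store["age"])

# ...whereas the row layout walks every full record to reach one field.
avg_age_rows = sum(r["age"] for r in row_store) / len(row_store)

assert avg_age_columnar == avg_age_rows == 4.0
```

Real columnar formats add compression and vectorized execution on top of this layout, but the locality advantage is the same idea.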
Apache Arrow Flight is a client-server framework for fast transport of Arrow data, a modern alternative to antiquated JDBC/ODBC connectors
A SQL query engine that can handle JSON-style data, which allows SQL to be used on flat, complex, schema-less, and schema-defined data. SQL all the things.
A way of organizing data sources and data files within Hadoop/Spark so that SQL queries can run on top of them. In Hadoop, HQL queries get turned into MapReduce jobs.
A table format that addresses many of the pain points of Hive tables (renaming fields, supporting time travel, non-atomic changes). Instead of saving data in a tree, it saves lists of data files in a table-like structure with additional metadata: snapshots. Snapshots track changes in state from one snapshot to the next (kind of like git).
Literally git for data lakes.
A Data Lake Engine that combines the benefits of a Data Lake and a Data Warehouse. With Dremio, data doesn't have to be moved from its current location (cloud data stores, CSV files, SQL databases, and certain NoSQL databases like MongoDB). We can query and transform data from several sources and quickly create datasets that can be used for BI dashboards and other purposes.
Benefits
- Keep data where it is, in the format it's in
- Free up data engineers and empower frontline data consumers to curate data
- Control what data different consumers have access to
- Use reflections to speed up high priority queries
- Scale dynamically with Dremio Cloud
- BOTTOM LINE: decrease overall cost, increase speed of data analysis
- PDS (Physical Dataset): data that is "promoted" from a configured external data source (database, files, etc.). This data is immutable but can be used to create virtual datasets.
- VDS (Virtual Dataset): a transformable dataset created from either a PDS or other VDSs. A VDS is essentially the dataset resulting from a SQL query, which can then be the subject of another query.
- Spaces: semantic layers for organizing your VDSs (all PDSs are associated with the home space ONLY); spaces other than the home space can be shared with others.
- Folder: a unit for organizing data in a space or source; useful for setting up semantic layers
- Reflections: a query or part of a query saved as an intermediate Parquet file to speed up future queries. Reflections can be:
- Managed (based on designated queries/VDSs, Dremio determines which reflections to optimize)
- User Defined (user scripts from Spark or elsewhere, represented in SQL, help define the reflection)
- Reflections can refresh on a set schedule, incrementally, or with machine-learning patterns
With use of SQL on Dremio not only can we query the data:
- we can convert the type of a column
- pull out nested data into its own column
- rename columns
- All these changes don't modify the original raw data, but help define Virtual Data Sets that can be shared through the semantic layer to allow people to have the data they need how they need it without making copies.
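Setting Dremio's specifics aside, the idea of a virtual dataset defined as a query over untouched raw data can be mimicked with a plain SQL view. The sketch below uses Python's built-in sqlite3; the table, column names, and data are invented for illustration.

```python
import sqlite3

# The VDS idea mimicked with a SQL view: the raw table stays untouched,
# while the view converts a column's type and renames it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (id INTEGER, amount TEXT)")
conn.execute("INSERT INTO raw_sales VALUES (1, '19.99')")

# The "virtual dataset": a query saved as a view, itself queryable.
conn.execute("""
    CREATE VIEW sales AS
    SELECT id, CAST(amount AS REAL) AS amount_usd
    FROM raw_sales
""")

rows = conn.execute("SELECT amount_usd FROM sales").fetchall()
# rows == [(19.99,)]
conn.close()
```

The original `raw_sales` data is never modified; consumers query the curated view, just as VDS consumers query the semantic layer instead of copies.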
SQL Parser
Things that determine speed
- Distance (Disk, Memory, CPU Cache)
- Format Ready for Processing (Columnar vs Row Format, levels of compression)
- Relevancy (How similar is the dataset to the question being asked)
Caching Techniques
- In-memory file pinning (load data into memory, process in memory)
- Columnar disk caching (save the file to disk in a columnar format, e.g. Parquet)
- In-memory block cache (save useful data in "hot" blocks of memory)
- Near-CPU data caching (save the data in the CPU cache, faster than memory)
- Cube relational caching (keep sub-datasets that may be more relevant to the question)
- Arbitrary relational caching (keep different sub-pieces that can satisfy parts of the question faster)