Data is organized to meet a schema. Think of tables, which organize data into rows and columns.
Data is unorganized and lacks a schema. Imagine collections of HTML documents, including text and images, not organized in any consistent way.
- Extract: data from the data source
- Transform: the data during transfer via appropriate logic
- Load: the transformed data into the target for analysis
Pro: Data can be analyzed immediately once loaded
Con: Transformations can slow down the transfer of data, making real-time analysis difficult
- Extract: data from the data source
- Load: the untransformed data into the target
- Transform: the data in the target for analysis
Pro: Can be faster for real-time data analysis
Con: Can require more storage space and compute resources, though cloud providers have largely alleviated this constraint
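The two orderings above can be sketched as a toy pipeline. The transformation (uppercasing a field) and the in-memory list standing in for the target storage are illustrative assumptions, not a real warehouse API.

```python
# Toy ETL vs ELT: the same extract/transform/load steps, in different orders.
# The "target" is just a Python list standing in for warehouse storage.

def extract():
    # Pretend this pulls raw rows from a source system.
    return [{"name": "ada"}, {"name": "grace"}]

def transform(rows):
    # Example transformation: normalize names to uppercase.
    return [{"name": r["name"].upper()} for r in rows]

def etl(target):
    # ETL: transform happens in flight; the target only sees clean data.
    target.extend(transform(extract()))

def elt(target):
    # ELT: raw rows land first; transformation runs inside the target later.
    target.extend(extract())
    target[:] = transform(target)

etl_target, elt_target = [], []
etl(etl_target)
elt(elt_target)
assert etl_target == elt_target  # same end state, different ordering
```

The end state is identical; the tradeoff is where the transformation work happens and whether raw data ever lands in the target.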
Platform for storing large amounts of structured data for analysis. Data must be structured before loading, which can make ingesting large flows of incoming data difficult and cumbersome.
DW Strategies
-
Inmon: Everything is cleaned and normalized upfront for a single source of truth that is distributed in department-level marts. The most common strategy, but queries are slow due to normalization, and sharing data between departments is difficult without creating extra marts.
-
Kimball: Start with what the data should look like to the end user (the marts), then curate the data as needed with that in mind. May involve more duplication of data and a more complex setup, but gives faster queries and more robust availability of data. Less often used because existing sources usually already tend toward Inmon.
-
Data Vault: Used for long-term historical storage because it tracks all changes, with each change keeping a reference to its source.
Characteristics of a database transaction that make it transactional: Atomicity, Consistency, Isolation, Durability (all or nothing; no partial completion creating broken data)
Platform for storing large amounts of structured and unstructured data for analysis. Since data does not have to be prepared beforehand, loading real-time data is much faster.
Data can be Schema on Read, so schemas aren't enforced on write like in traditional databases.
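Schema on Read can be sketched with plain JSON: raw records are stored untyped as they arrive, and a schema is applied only at query time. The store, field names, and types below are invented for illustration.

```python
import json

# Schema-on-read sketch: raw records are stored as untyped JSON strings;
# a schema (field names + types) is applied only when the data is read.
raw_store = [
    '{"id": "1", "price": "9.99"}',
    '{"id": "2", "price": "4.50"}',
]

# The schema lives with the reader, not the writer.
schema = {"id": int, "price": float}

def read_with_schema(store, schema):
    for line in store:
        rec = json.loads(line)
        yield {field: cast(rec[field]) for field, cast in schema.items()}

rows = list(read_with_schema(raw_store, schema))
# rows[0] == {"id": 1, "price": 9.99}
```

A traditional database would reject or coerce these values at write time; here the raw strings land untouched and typing is deferred.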
How the data is stored in the data lake is determined by a table format, which can help provide ACID compliance, quality governance, etc.
- Apache Iceberg (from Netflix)
- Apache Hudi (From Uber)
- Delta Lake (From Databricks)
- Spark Tables
- Hive Tables
Processes for Business Intelligence, often using large aggregated data with complex queries. Not backed up as regularly because the data is completely recalculated each time. Used by internal analysis professionals.
Processes for day-to-day business operations that result in small changes to data with simpler queries, requiring more frequent backups (to preserve the integrity of data like bank withdrawals). Used by customers and frontline staff.
Languages for expressing queries to a database.
- SQL (Structured Query Language): The most popular standard, used across a variety of relational databases and data platforms.
SELECT * FROM dogs;
- HQL (Hive Query Language): Used for SQL like query over Hadoop/Spark data lakes.
- CQL (Cypher Query Language): Language used for the Neo4j graph database
MATCH (tom:Person {name: 'Tom Hanks'}) RETURN tom
- CQL (Cassandra Query Language): SQL-like language used for Cassandra DB
- AQL (ArangoDB Query Language): Language used for the ArangoDB graph database
- The document database MongoDB uses a very JavaScript-like syntax for forming queries
db.collection.find({})
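The SQL example above can be run end to end with Python's built-in sqlite3 module; the `dogs` table and its rows are made up for illustration.

```python
import sqlite3

# Runnable version of the SELECT example above, using an in-memory
# SQLite database. The dogs table and its rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dogs (name TEXT, breed TEXT)")
conn.executemany("INSERT INTO dogs VALUES (?, ?)",
                 [("Rex", "Lab"), ("Fido", "Pug")])

rows = conn.execute("SELECT * FROM dogs").fetchall()
# rows == [('Rex', 'Lab'), ('Fido', 'Pug')]
conn.close()
```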
Taking data and eliminating redundancy by breaking parts of the data into smaller related tables.
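Normalization in miniature: a table that repeats customer details on every order is split into a customers table and an orders table linked by a key. All data below is invented for illustration.

```python
# Normalization sketch: one redundant table split into two related tables.
orders = [
    {"order_id": 1, "customer": "Ada",   "customer_city": "London"},
    {"order_id": 2, "customer": "Ada",   "customer_city": "London"},
    {"order_id": 3, "customer": "Grace", "customer_city": "Arlington"},
]

# Break the repeated customer data out into its own table...
customers = {}
for o in orders:
    customers.setdefault(o["customer"],
                         {"name": o["customer"], "city": o["customer_city"]})

# ...and keep only a reference (the key) in the orders table.
normalized_orders = [{"order_id": o["order_id"], "customer": o["customer"]}
                     for o in orders]

# Ada's city is now stored once instead of once per order.
assert len(customers) == 2
```

Updating a customer's city now touches one row instead of every order that customer ever placed, which is the point of eliminating redundancy.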
Hadoop is made of three main parts
- HDFS: Allows you to store files across a cluster of computers
- YARN: (Yet Another Resource Negotiator): Helps orchestrate resources for cluster operations
- MapReduce: For processing data; relatively slow and difficult to code, writes to disk (only handles batch jobs), and can support jobs written in Java, C++, Python, and R
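The MapReduce model can be shown in miniature with a word count in plain Python. A real cluster runs the map, shuffle, and reduce phases across many machines and writes intermediate results to disk; here everything happens in one process.

```python
from collections import defaultdict

# MapReduce in miniature: word count with explicit map, shuffle,
# and reduce phases over two tiny invented documents.
docs = ["the quick fox", "the lazy dog"]

# Map: emit (word, 1) pairs from each input record.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group emitted pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine each key's values into a final result.
counts = {word: sum(values) for word, values in grouped.items()}
# counts["the"] == 2
```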
Spark is a data processing framework that can replace MapReduce and work with different storage layers like HDFS. More expensive than MapReduce, as the infrastructure needs a lot of memory. Can support jobs written in Scala, Java, Python, and R.
- Spark loads data into memory and processes it there for much faster speeds
- Can facilitate Batch, Streaming and Graph jobs
- NameNode (HDFS): the node that tracks all the segments of data and which nodes they are stored on.
- RDD (Resilient Distributed Dataset): the entry point for a Spark job, created through the SparkContext.
A binary columnar format meant to be the sweet spot between file storage (CSV/JSON) and using a database
- Databases can be much faster to query but much more expensive to maintain
- Files like CSV/JSON are cheaper but lack a schema, are slow to query, and can't use Spark's pushdown feature
Parquet is a file representation of data that includes the schema, is fast, and can take advantage of some of Spark's advanced features.
A standard format for organizing data in memory. This format organizes the data by column instead of by row for faster, more efficient querying. (For example, a query that only touches a particular column doesn't have to traverse all the data in every row, increasing speed.)
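The row-versus-column layout can be illustrated with plain Python data structures; the records below are invented. A columnar store keeps each column's values contiguous, so a single-column aggregate only reads that column.

```python
# Row layout vs columnar layout for the same records (invented data).
row_store = [
    {"name": "Rex",  "age": 3},
    {"name": "Fido", "age": 5},
]

# Columnar layout: one contiguous list per column.
col_store = {
    "name": ["Rex", "Fido"],
    "age":  [3, 5],
}

# A single-column aggregate touches only that column's values...
avg_age_columnar = sum(col_store["age"]) / len(col_store["age"])

# ...whereas the row layout walks every full record to reach one field.
avg_age_rows = sum(r["age"] for r in row_store) / len(row_store)

assert avg_age_columnar == avg_age_rows == 4.0
```

Real columnar formats add compression and vectorized execution on top of this layout, but the locality advantage is the same idea.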
Apache Arrow Flight is a client-server framework for fast transport of Arrow data, a modern alternative to antiquated JDBC/ODBC connectors
A SQL query engine that can handle JSON-style data, which allows SQL to be used on flat, complex, schema-less, and schema-defined data. SQL all the things.
A way of organizing data sources and data files within Hadoop/Spark so that SQL queries can run on top of them. In Hadoop, HQL queries get turned into MapReduce jobs.
A table format that addresses many of the pain points of Hive tables (renaming fields, supporting time travel, non-atomic changes). Instead of saving data in a tree, it saves lists of data files in a table-like structure with additional metadata: snapshots. Snapshots track changes in state from one snapshot to the next (kind of like git).
Literally git for data lakes.
A Data Lake Engine that combines the benefits of a Data Lake and a Data Warehouse. With Dremio, data doesn't have to be moved from its current location (cloud data stores, CSV files, SQL databases, and certain NoSQL databases like MongoDB). We can query and transform data from several sources and quickly create datasets that can be used for BI dashboards and other purposes.
Benefits
- Keep data where it is, in the format it's in
- Free up data engineers and empower frontline data consumers to curate data
- Control what data different consumers have access to
- Use reflections to speed up high priority queries
- Scale dynamically with Dremio Cloud
- BOTTOM LINE: decrease overall cost, increase speed of data analysis
- PDS (Physical Dataset): data that is "promoted" from a configured external data source (database, files, etc.). This data is immutable but can be used to create virtual datasets.
- VDS (Virtual Dataset): a transformable dataset created from either a PDS or other VDSs. A VDS is essentially the dataset resulting from a SQL query, which can then be the subject of another query.
- Spaces: semantic layers for organizing your VDSs (all PDSs are associated with the home space ONLY); spaces other than the home space can be shared with others.
- Folder: a unit for organizing data in a space or source; useful for setting up semantic layers
- Reflections: a query or part of a query saved as an intermediate Parquet file to speed up future queries. Reflections can be:
- Managed (based on designated queries/VDSs, Dremio determines which reflections to optimize)
- User Defined (user scripts from Spark or elsewhere, represented in SQL, help define the reflection)
- Reflections can refresh on a set schedule, incrementally, or with machine-learning patterns
With use of SQL on Dremio not only can we query the data:
- we can convert the type of a column
- pull out nested data into its own column
- rename columns
- All these changes don't modify the original raw data, but help define Virtual Data Sets that can be shared through the semantic layer to allow people to have the data they need how they need it without making copies.
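Setting Dremio's specifics aside, the idea of a virtual dataset defined as a query over untouched raw data can be mimicked with a plain SQL view. The sketch below uses Python's built-in sqlite3; the table, column names, and data are invented for illustration.

```python
import sqlite3

# The VDS idea mimicked with a SQL view: the raw table stays untouched,
# while the view converts a column's type and renames it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (id INTEGER, amount TEXT)")
conn.execute("INSERT INTO raw_sales VALUES (1, '19.99')")

# The "virtual dataset": a query saved as a view, itself queryable.
conn.execute("""
    CREATE VIEW sales AS
    SELECT id, CAST(amount AS REAL) AS amount_usd
    FROM raw_sales
""")

rows = conn.execute("SELECT amount_usd FROM sales").fetchall()
# rows == [(19.99,)]
conn.close()
```

The original `raw_sales` data is never modified; consumers query the curated view, just as VDS consumers query the semantic layer instead of copies.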
SQL Parser
Things that determine speed
- Distance (Disk, Memory, CPU Cache)
- Format Ready for Processing (Columnar vs Row Format, levels of compression)
- Relevancy (How similar is the dataset to the question being asked)
Caching Techniques
- In-memory file pinning (load data into memory, process in memory)
- Columnar disk caching (save the file to disk in a columnar format, e.g. Parquet)
- In-memory block cache (save useful data in "hot" blocks of memory)
- Near-CPU data caching (save the data in the CPU cache, faster than memory)
- Cube relational caching (keep sub-datasets that may be more relevant to the question)
- Arbitrary relational caching (keep different sub-pieces that can satisfy parts of the question faster)