Data Cloud
Key Ideas
- Traditionally we store our data in transactional databases
- Great for transactions, sync writes, etc.
- We run into challenges with scaling
- Data scale (e.g., event data, data from sensors, ...)
- Since data is locked into transactional systems, we have to pull it out into other systems:
- Pull the data out into analytics systems
- Pull the data into a data warehouse
- To do machine learning (ML), pull the data into data lakes
- Data leaves Salesforce for these other systems
- Security control is lost
- Business definitions made in Salesforce are lost; the semantics are lost
- We end up with a proliferation of systems that again need to be integrated back into Salesforce
With Data Cloud
- We bring the power of big data processing into Salesforce, so everything can run on the same platform
- where you can run analytics, ML, and real-time decisioning
- It is not a siloed solution; it is built into the Salesforce platform
- So we can use the power of the Salesforce metadata platform and the data-scale capabilities of Hyperforce
- All the metadata goes into Salesforce Core (users, dev experience, ...), stored in tables in Core
- All the data sits in Hyperforce
- At the storage level, we store the data in Apache Parquet format
- Parquet is an industry-standard columnar format
- It helps perform queries at scale
- We partition this data so it can handle high-scale workloads (see the sketch after the table below)
- One customer brought 3T rows of data into the system in a week; the system is capable of scaling to that level
- We have tested up to 100K requests/second per tenant
- These are the volumes and scale the system can support
- Traditional data lakehouses are built for batches
- File upload works great
- They do not work well for streaming data
- We contribute to the open-source community so we can handle streaming data as events come in and update at scale
- Thousands of requests coming in, and we update on disk live
- Apache Iceberg sits on top of the Apache Parquet layer; the Parquet files are stored in an AWS S3 bucket
- Iceberg provides the metadata: the table structure over the Parquet files
| Layer | Notes |
| --- | --- |
| Parquet | Columnar file format |
| Iceberg | Tabular view over the Parquet files |
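To make the Parquet + partitioning idea concrete, here is a minimal sketch using the open-source pyarrow library (not Data Cloud's internal code); the dataset path, columns, and partition key are made-up examples.

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical event data; in Data Cloud this would arrive via ingestion, not be built by hand.
events = pa.table({
    "event_id":  [1, 2, 3, 4],
    "region":    ["us", "us", "eu", "eu"],
    "device_id": ["a", "b", "a", "c"],
    "reading":   [0.7, 1.2, 0.9, 3.4],
})

# Write a partitioned Parquet dataset: one directory per region value.
pq.write_to_dataset(events, root_path="events_parquet", partition_cols=["region"])

# Read back only one partition and only the columns a query needs --
# partition pruning plus columnar reads are what make queries fast at scale.
dataset = ds.dataset("events_parquet", format="parquet", partitioning="hive")
subset = dataset.to_table(
    columns=["device_id", "reading"],
    filter=ds.field("region") == "eu",
)
print(subset)
```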
- Iceberg supports fast incremental (delta) updates; we also added Salesforce metadata to Iceberg (see the sketch at the end of this section)
- So partners like Snowflake can query our data live: no copy, no ETL
- All the security control stays with Salesforce platform admins
- Super powerful
- You have not lost the security
- You have not lost the semantics that you have created on the Salesforce Platform
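To illustrate what the Iceberg metadata layer buys you, here is a minimal sketch of reading an Iceberg table with the open-source pyiceberg library; the catalog endpoints, table name, and columns are assumptions for illustration, not Data Cloud's actual configuration.

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

# Hypothetical catalog configuration -- in practice this points at the catalog
# and object storage (e.g. S3) where the Iceberg metadata and Parquet files live.
catalog = load_catalog(
    "demo",
    **{
        "uri": "http://localhost:8181",          # REST catalog endpoint (assumed)
        "s3.endpoint": "http://localhost:9000",  # object store endpoint (assumed)
    },
)

# Iceberg's metadata describes the table schema, snapshots, and which Parquet
# files belong to the table -- the "tabular view" over Parquet.
table = catalog.load_table("unified.engagement_events")  # hypothetical table name
print(table.schema())

# An engine reads the data by planning a scan against the metadata and fetching
# only the Parquet files (and columns) it needs -- no copy, no ETL.
scan = table.scan(
    row_filter=EqualTo("region", "eu"),
    selected_fields=("device_id", "reading"),
)
print(scan.to_arrow())

# Incremental (delta) updates: new batches of rows can be appended as events
# stream in, without rewriting existing files (supported in recent pyiceberg releases):
# table.append(new_rows_as_arrow_table)
```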
Ways to use data in Data Cloud (JDBC or File level)
- Zero-copy way of looking at Salesforce data, with full security control for the Salesforce admin
- Via JDBC connector: analytics systems can make use of it (see the sketch below)
- File-level access at the Iceberg level: partners like Snowflake can make use of it
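As a rough illustration of the JDBC path, the sketch below opens a JDBC connection from Python with the jaydebeapi library; the driver class, JAR path, connection URL, credentials, and table name are all hypothetical placeholders, since the real Data Cloud JDBC driver and its connection properties come from the admin-controlled setup.

```python
import jaydebeapi

# All of the following values are placeholders, not real Data Cloud settings.
DRIVER_CLASS = "com.example.datacloud.jdbc.Driver"        # hypothetical driver class
JDBC_URL = "jdbc:datacloud://example.my.salesforce.com"   # hypothetical connection URL
DRIVER_JAR = "/path/to/datacloud-jdbc-driver.jar"         # hypothetical JAR location

conn = jaydebeapi.connect(
    DRIVER_CLASS,
    JDBC_URL,
    {"user": "analytics_user", "password": "********"},   # credentials governed by the Salesforce admin
    DRIVER_JAR,
)
try:
    cursor = conn.cursor()
    # The analytics system issues ordinary SQL; data is read in place (zero copy),
    # subject to the security controls configured in Salesforce.
    cursor.execute("SELECT device_id, reading FROM engagement_events LIMIT 10")
    for row in cursor.fetchall():
        print(row)
finally:
    conn.close()
```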
Reverse (Data-In)
- If you already have a data warehouse, Data Cloud will allow you to mount those tables virtually in Data Cloud (work in progress)
- REST API
- SOQL (with limitations), on a single entity
- ANSI SQL: full scale (see the sketch below)
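As a hedged sketch of the query path, the snippet below posts an ANSI SQL statement over REST using Python's requests library; the endpoint path, payload shape, and token handling are assumptions for illustration only, so check the current Data Cloud API documentation for the real contract.

```python
import requests

# Placeholders -- real values come from the org's Data Cloud setup and OAuth flow.
INSTANCE_URL = "https://example-instance.salesforce.com"
ACCESS_TOKEN = "<oauth-access-token>"

# Assumed shape: POST an ANSI SQL statement to a query endpoint and read the JSON result.
response = requests.post(
    f"{INSTANCE_URL}/api/v2/query",  # endpoint path is an assumption
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"sql": "SELECT device_id, COUNT(*) AS events FROM engagement_events GROUP BY device_id"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```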
Best Practices
- Get the canonical model right (business scenario)
- What is the velocity of the data coming in: batch or streaming?
- How much cost are you willing to spend? Pricing is consumption based
- Figuring out what you will use is important
Questions to ask:
- What is your data landscape?
- Starting fresh with Data Cloud?
- Already have a data warehouse?
- Have MDM (master data management)?
- What is your action system?
  - Sales Cloud
  - Service Cloud
  - Marketing Cloud
Terms used
What is Apache Iceberg?
Apache Iceberg is an open-source table format that adds data warehouse-level capabilities to a traditional data lake. One of the Apache Software Foundation’s open-source projects, the Iceberg table format enables more efficient petabyte-scale data processing by creating an abstracted metadata layer that describes the files in a data lake’s object storage.
What is Zero ETL?
Zero-ETL is a set of integrations that eliminates or minimizes the need to build ETL data pipelines. Extract, transform, and load (ETL) is the process of combining, cleaning, and normalizing data from different sources to get it ready for analytics, artificial intelligence (AI) and machine learning (ML) workloads. Traditional ETL processes are time-consuming and complex to develop, maintain, and scale. Instead, zero-ETL integrations facilitate point-to-point data movement without the need to create ETL data pipelines. Zero-ETL can also enable querying across data silos without the need for data movement.
Apache Parquet format
Apache Parquet is an open-source, column-oriented data file format; its columnar layout and compression make it efficient for analytical queries at scale.
Lakehouse
A lakehouse is a newer type of data platform architecture that combines the key benefits of data lakes and data warehouses. Data lakes handle both structured and unstructured data, often for advanced analytics; data warehouses are optimized for structured, queryable data. Lakehouses combine the two, offering analytics flexibility with diverse data types.
Data lakes are typically stored in file formats with variable organization or hierarchy. Built on object storage, data lakes allow for the flexibility to store data of all types, from a wide variety of sources. They provide multi-modal access: you can query them, use Spark, and so on.
Data warehouses tend to be more performant than data lakes, but they can be more expensive and limited in their ability to scale. Typically you can query with SQL.