Add Lake Loader to Quick Start (#611)
* Add Lake Loader to Quick Start
* Update feature comparison and other pages
stanch authored Sep 22, 2023
1 parent 376db4e commit 1eb84cd
Showing 9 changed files with 191 additions and 22 deletions.
15 changes: 10 additions & 5 deletions docs/feature-comparison/index.md
@@ -24,15 +24,17 @@ To find out more about the support services offered to Snowplow BDP customers see
| • Redshift ||||
| • BigQuery ||||
| • Databricks ||||
| • Synapse Analytics 🧪 ||| _coming soon_ |
| • Elasticsearch ||||
| • Postgres | ✅<br/>_(not suitable for high volumes)_ |||
| • S3 ||||
| • GCS ||||
| • ADLS 🧪 ||| _coming soon_ |
| **Real-time streams** | | | |
| • Kinesis ||||
| • Pubsub ||||
|Kafka | do-it-yourself || bolt-on |
|Azure Eventhubs | do-it-yourself || bolt-on |
| • Azure Event Hubs ||| ✅<br/>_(bolt-on)_ |
| • Kafka ||| ✅<br/>_(bolt-on)_ |
| <h3>Build more trust in your data</h3> | [Open Source](/docs/getting-started-on-snowplow-open-source/index.md) | [BDP Cloud](/docs/getting-started-on-snowplow-bdp-cloud/index.md) | [BDP Enterprise](/docs/getting-started-on-snowplow-bdp-enterprise/index.md) |
| [Failed Events](/docs/understanding-your-pipeline/failed-events/index.md) ||||
| [Data quality monitoring & API](/docs/managing-data-quality/monitoring-failed-events/index.md) ||||
@@ -50,11 +52,14 @@ To find out more about the support services offered to Snowplow BDP customers see
| [Tracking scenarios](/docs/understanding-tracking-design/tracking-plans/index.md) || ✅<br/>_(UI only)_ ||
| [Data modeling management tooling](/docs/modeling-your-data/running-data-models-via-snowplow-bdp/dbt/using-dbt/index.md) || _coming soon_ ||
| [Tracking catalog](/docs/discovering-data/tracking-catalog/index.md) ||||
| <h3>Deployment & security</h3> | [Open Source](/docs/getting-started-on-snowplow-open-source/index.md) | [BDP Cloud](/docs/getting-started-on-snowplow-bdp-cloud/index.md) | [BDP Enterprise](/docs/getting-started-on-snowplow-bdp-enterprise/index.md) |
| Deployment method | self-hosted<br/>(AWS, GCP, Azure 🧪) | Snowplow-hosted cloud | private cloud<br/>(AWS, GCP) |
| <h3>Deployment & security</h3> | [Open Source](/docs/getting-started-on-snowplow-open-source/index.md) | [BDP Cloud](/docs/getting-started-on-snowplow-bdp-cloud/index.md) | [BDP Enterprise](/docs/getting-started-on-snowplow-bdp-enterprise/index.md) |
| Deployment method | self-hosted | Snowplow-hosted cloud | private cloud |
| • AWS ||||
| • GCP ||||
| • Azure 🧪 ||| _coming soon_ |
| Single Sign-On ||||
| Fine grained user permissions (ACLs) ||| ✅<br/>_(top tier only)_ |
| Custom VPC integration ||| bolt-on |
| Custom VPC integration ||| ✅<br/>_(bolt-on)_ |
| AWS Infra security bundle ||| ✅<br/>_(top tier only)_ |
| <h3>Services</h3> | [Open Source](/docs/getting-started-on-snowplow-open-source/index.md) | [BDP Cloud](/docs/getting-started-on-snowplow-bdp-cloud/index.md) | [BDP Enterprise](/docs/getting-started-on-snowplow-bdp-enterprise/index.md) |
| Self-help support website, FAQs and educational materials ||||
19 changes: 19 additions & 0 deletions docs/first-steps/querying/index.md
@@ -90,6 +90,12 @@ To connect, you can use either Snowflake dashboard or [SnowSQL](https://docs.sno
</TabItem>
<TabItem value="databricks" label="Databricks">

:::info Azure-specific instructions

On Azure, you have created an external table in the [last step of the guide](/docs/getting-started-on-snowplow-open-source/quick-start/index.md#configure-the-destination). Use this table and ignore the text below.
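
For example, a minimal query to preview the most recent events (a sketch only, assuming the external table is named `events`, as created in the quick start guide):

```sql
-- Preview the ten most recent events loaded into the lake
SELECT collector_tstamp, app_id, event_name
FROM events
ORDER BY collector_tstamp DESC
LIMIT 10;
```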

:::

The database name and the schema name will be defined by the `databricks_database` and `databricks_schema` variables in Terraform.

There are two different ways to log in to the database:
@@ -101,6 +107,19 @@ See the [Databricks tutorial](https://docs.databricks.com/getting-started/quick-
</TabItem>
</Tabs>

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪">

In Synapse Analytics, you can connect directly to the data residing in ADLS. You will need to know the names of the storage account (set in the `storage_account_name` Terraform variable) and the storage container (it’s a fixed value: `lake-container`).

Follow [the Synapse documentation](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format) and use the `OPENROWSET` function. If you created a data source in the [last step](/docs/getting-started-on-snowplow-open-source/quick-start/index.md#configure-the-destination) of the quick start guide, your queries will be a bit simpler.

:::tip Fabric and OneLake

If you created a OneLake shortcut in the [last step](/docs/getting-started-on-snowplow-open-source/quick-start/index.md#configure-the-destination) of the quick start guide, you will be able to explore Snowplow data in Fabric, for example, using Spark SQL.

:::

</TabItem>
</Tabs>

8 changes: 4 additions & 4 deletions docs/getting-started-on-snowplow-open-source/_diagram.md
@@ -12,11 +12,11 @@ flowchart LR
iglu{{"<b>Iglu Server</b>\n(${props.compute})"}}
igludb[("<b>Iglu Database</b>\n(${props.igludb})")]
bad[["<b>Bad Stream</b>\n(${props.stream})"]]
${props.warehouse == 'Postgres' ?
`loader{{"<b>Postgres Loader</b>\n(${props.compute})"}}` :
${(props.warehouse == 'Postgres' || props.warehouse == 'Data Lake') ?
`loader{{"<b>${props.warehouse} Loader</b>\n(${props.compute})"}}` :
`loader("<b>${props.warehouse} Loader</b>\n<i>(see below)</i>")`
}
atomic[("<b>Events</b>\n(${props.warehouse})")]
atomic[("<b>Events</b>\n(${props.warehouse == 'Data Lake' ? props.bucket : props.warehouse})")]
collect---iglu %% invisible link for alignment
enrich-.-oiglu<-.->igludb
collect-->|"<b>Raw Stream</b><br/>(${props.stream})"| enrich
@@ -55,7 +55,7 @@ flowchart LR
</>
}</>

<>{props.warehouse != 'Postgres' && (<>
<>{props.warehouse != 'Postgres' && props.warehouse != 'Data Lake' && (<>
<h4>{props.warehouse} Loader</h4>
<ReactMarkdown children={`
For more information about the ${props.warehouse} Loader, see the [documentation on the loading process](/docs/storing-querying/loading-process/index.md?warehouse=${props.warehouse.toLowerCase()}&cloud=${props.cloud == 'aws' ? 'aws-micro-batching' : props.cloud}).
131 changes: 124 additions & 7 deletions docs/getting-started-on-snowplow-open-source/quick-start/index.md
@@ -93,10 +93,10 @@ The sections below will guide you through setting up your destination to receive
|:----------|:---:|:---:|:-----:|
| Postgres | :white_check_mark: | :white_check_mark: | :x: |
| Snowflake | :white_check_mark: | :x: |:white_check_mark: |
| Databricks | :white_check_mark: | :x: | _coming soon_ |
| Databricks | :white_check_mark: | :x: | :white_check_mark: |
| Redshift | :white_check_mark: |||
| BigQuery || :white_check_mark: ||
| Synapse ||| _coming soon_ |
| Synapse Analytics 🧪 ||| :white_check_mark: |

<Tabs groupId="cloud" queryString>
<TabItem value="aws" label="AWS" default>
@@ -113,7 +113,9 @@ There are two alternative storage options for you to select: Postgres and BigQuery.
</TabItem>
<TabItem value="azure" label="Azure 🧪">

There is currently only one option: Snowflake.
There are two storage options for you to select: Snowflake and a data lake (ADLS). The latter enables querying the data from Databricks and Synapse Analytics.

We recommend loading data into only a single destination (Snowflake or the data lake), but nothing prevents you from loading into both with the same pipeline (e.g. for testing purposes).

</TabItem>
</Tabs>
@@ -346,6 +348,12 @@ GRANT ROLE ${snowflake_loader_role} TO ROLE SYSADMIN;
</TabItem>
<TabItem value="databricks" label="Databricks">

:::info Azure-specific instructions

On Azure, we currently support loading data into Databricks via a data lake. You can still follow Step 1 below to create the cluster, but you should skip the rest of the steps. Instead, proceed with [deploying the pipeline](#set-up-the-pipeline) — we will return to configuring Databricks [at the end of this guide](#configure-the-destination).

:::

#### Step 1: Create a cluster

:::note
@@ -413,7 +421,7 @@ CREATE SCHEMA IF NOT EXISTS ${schema_name}

The security principal used by the loader needs a `Databricks SQL access` permission, which can be enabled in the _Admin Console_.

Databricks does not have table access enabled by default. Enable it with an
initialization script:

@@ -455,6 +463,11 @@ GRANT MODIFY, SELECT ON TABLE <catalog>.<schema>.rdb_folder_monitoring TO `<principal>`

</details>

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪">

No extra steps needed. Proceed with [deploying the pipeline](#set-up-the-pipeline) — we will return to configuring Synapse [at the end of this guide](#configure-the-destination).

</TabItem>
</Tabs>

@@ -546,13 +559,13 @@ Set the `postgres_db_authorized_networks` to a list of CIDR addresses that will
</TabItem>
<TabItem value="azure" label="Azure 🧪">

As mentioned [above](#storage-options), there is currently only one option for the pipeline’s destination database: Snowflake. Set `snowflake_enabled` to `true` and fill all the relevant configuration options (starting with `<snowflake>_`).
As mentioned [above](#storage-options), there are two options for the pipeline’s destination: Snowflake and a data lake (the latter enabling Databricks and Synapse Analytics). For each destination you’d like to configure, set the `<destination>_enabled` variable (e.g. `snowflake_enabled`) to `true` and fill in all the relevant configuration options (starting with `<destination>_`).

When in doubt, refer back to the [destination setup](#prepare-the-destination) section where you have picked values for many of the variables.

:::caution

Change the `snowflake_loader_password` setting to a value that _only you_ know.
If loading into Snowflake, change the `snowflake_loader_password` setting to a value that _only you_ know.

:::

@@ -599,7 +612,7 @@ This will output your `collector_lb_ip_address` and `collector_lb_fqdn`.
</TabItem>
</Tabs>

Make a note of the outputs: you'll need them when sending events and connecting to your database.
Make a note of the outputs: you'll need them when sending events and (in some cases) connecting to your data.

:::tip Empty outputs

@@ -615,4 +628,108 @@ For solutions to some common Terraform errors that you might encounter when running

:::

## Configure the destination

<Tabs groupId="warehouse" queryString>
<TabItem value="postgres" label="Postgres" default>

No extra steps needed.

</TabItem>
<TabItem value="redshift" label="Redshift">

No extra steps needed.

</TabItem>
<TabItem value="bigquery" label="BigQuery">

No extra steps needed.

</TabItem>
<TabItem value="snowflake" label="Snowflake">

No extra steps needed.

</TabItem>
<TabItem value="databricks" label="Databricks">

:::info Azure-specific instructions

On Azure, we currently support loading data into Databricks via a data lake. To complete the setup, you will need to configure Databricks to access your data on ADLS.

First, follow the [Databricks documentation](https://docs.databricks.com/en/storage/azure-storage.html) to set up authentication using an Azure service principal, shared access signature tokens, or account keys. _(Account keys are not recommended, but are arguably the easiest option for testing purposes.)_

You will need to know a couple of things:
* Storage account name — this is the value of the `storage_account_name` variable in the pipeline `terraform.tfvars` file
* Storage container name — `lake-container`

Once authentication is set up, you can create an external table using Spark SQL (replace `<storage-account-name>` with the corresponding value):

```sql
CREATE TABLE events
LOCATION 'abfss://lake-container@<storage-account-name>.dfs.core.windows.net/events/';
```
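
Once the table exists, a quick query can confirm that the loader is writing data. This is a minimal sketch, assuming the `events` table created above:

```sql
-- Count the events loaded so far via the external Delta table
SELECT count(*) AS event_count
FROM events;
```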

:::

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪">

Your data is loaded into ADLS. To access it, follow [the Synapse documentation](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format) and use the `OPENROWSET` function.

You will need to know a couple of things:
* Storage account name — this is the value of the `storage_account_name` variable in the pipeline `terraform.tfvars` file
* Storage container name — `lake-container`

<details>
<summary>Example query</summary>

```sql
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'https://<storage-account-name>.blob.core.windows.net/lake-container/events/',
FORMAT = 'delta'
) AS events;
```

</details>

We recommend [creating a data source](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format#data-source-usage), which simplifies future queries (note that unlike the previous URL, this one does not end with `/events/`):

```sql
CREATE EXTERNAL DATA SOURCE SnowplowData
WITH (LOCATION = 'https://<storage-account-name>.blob.core.windows.net/lake-container/');
```

<details>
<summary>Example query with data source</summary>

```sql
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'events',
DATA_SOURCE = 'SnowplowData',
FORMAT = 'delta'
) AS events;
```

</details>

:::tip Fabric and OneLake

You can also consume your ADLS data via Fabric and OneLake:

* First, [create a Lakehouse](https://learn.microsoft.com/en-us/fabric/onelake/create-lakehouse-onelake#create-a-lakehouse) or use an existing one.
* Next, [create a OneLake shortcut](https://learn.microsoft.com/en-us/fabric/onelake/create-adls-shortcut) to your storage account. In the URL field, specify `https://<storage-account-name>.blob.core.windows.net/lake-container/events/`.
* You can now [use Spark notebooks](https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-notebook-explore) to explore your Snowplow data, for example with a query like the sketch after this list.
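
For example, a minimal Spark SQL sketch, assuming the shortcut surfaces the data as a table named `events` in your Lakehouse:

```sql
-- Top event types in the Snowplow data exposed via the OneLake shortcut
SELECT event_name, count(*) AS event_count
FROM events
GROUP BY event_name
ORDER BY event_count DESC
LIMIT 10;
```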

Do note that currently not all Fabric services support nested fields present in the Snowplow data.

:::

</TabItem>
</Tabs>

---

If you are curious, here’s [what has been deployed](/docs/getting-started-on-snowplow-open-source/what-is-deployed/index.md). Now it’s time to [send your first events to your pipeline](/docs/first-steps/tracking/index.md)!
@@ -78,7 +78,17 @@ You can very easily edit the script or run each of the Terraform modules independently
<Tabs groupId="warehouse" queryString lazy>
<TabItem value="snowflake" label="Snowflake" default>

<Diagram cloud="azure" warehouse="Snowflake" compute="VMSS" stream="Event Hubs" bucket="Azure Blob" igludb="Postgres"/>
<Diagram cloud="azure" warehouse="Snowflake" compute="VMSS" stream="Event Hubs" bucket="ADLS Gen2" igludb="Postgres"/>

</TabItem>
<TabItem value="databricks" label="Databricks" default>

<Diagram cloud="azure" warehouse="Data Lake" compute="VMSS" stream="Event Hubs" bucket="ADLS Gen2" igludb="Postgres"/>

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪" default>

<Diagram cloud="azure" warehouse="Data Lake" compute="VMSS" stream="Event Hubs" bucket="ADLS Gen2" igludb="Postgres"/>

</TabItem>
</Tabs>
@@ -290,7 +300,7 @@ See the following Terraform modules for further details on the resources, default


</TabItem>
<TabItem value="databricks" label="Databricks">
<TabItem value="databricks" label="Databricks (direct)">

[RDB Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) is a set of applications that loads enriched events into Databricks.

@@ -299,5 +309,23 @@ See the following Terraform modules for further details on the resources, default
* [Databricks Loader](https://registry.terraform.io/modules/snowplow-devops/databricks-loader-ec2/aws/latest)


</TabItem>
<TabItem value="databricks-lake" label="Databricks (via lake)">

[Lake Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) is an application that loads enriched events into a data lake so that they can be queried via Databricks (or other means).

See the Lake Loader [Terraform module](https://registry.terraform.io/modules/snowplow-devops/lake-loader-vmss/azurerm/latest) for further details on the resources, default and required input variables, and outputs.

The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪">

[Lake Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) is an application that loads enriched events into a data lake so that they can be queried via Synapse Analytics (or Fabric, OneLake, etc.).

See the Lake Loader [Terraform module](https://registry.terraform.io/modules/snowplow-devops/lake-loader-vmss/azurerm/latest) for further details on the resources, default and required input variables, and outputs.

The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.

</TabItem>
</Tabs>
@@ -20,7 +20,7 @@ Currently the Lake Loader supports [Delta format](https://delta.io/) only. Future

:::

<Tabs groupId="cloud" queryString>
<Tabs groupId="cloud" queryString lazy>
<TabItem value="gcp" label="GCP" default>
<LakeLoaderDiagram stream="Pub/Sub" bucket="GCS" cloud="GCP"/>
<DeployOverview cloud="GCP"/>