Add Lake Loader to Quick Start (#611)
* Add Lake Loader to Quick Start
* Update feature comparison and other pages
stanch authored Sep 22, 2023
1 parent 376db4e commit 1eb84cd
Showing 9 changed files with 191 additions and 22 deletions.
15 changes: 10 additions & 5 deletions docs/feature-comparison/index.md
@@ -24,15 +24,17 @@ To find out more about the support services offered to Snowplow BDP customers see
| • Redshift ||||
| • BigQuery ||||
| • Databricks ||||
| • Synapse Analytics 🧪 ||| _coming soon_ |
| • Elasticsearch ||||
| • Postgres | ✅<br/>_(not suitable for high volumes)_ |||
| • S3 ||||
| • GCS ||||
| • ADLS 🧪 ||| _coming soon_ |
| **Real-time streams** | | | |
| • Kinesis ||||
| • Pubsub ||||
|Kafka | do-it-yourself || bolt-on |
|Azure Eventhubs | do-it-yourself || bolt-on |
| • Azure Event Hubs ||| ✅<br/>_(bolt-on)_ |
| • Kafka ||| ✅<br/>_(bolt-on)_ |
| <h3>Build more trust in your data</h3> | [Open Source](/docs/getting-started-on-snowplow-open-source/index.md) | [BDP Cloud](/docs/getting-started-on-snowplow-bdp-cloud/index.md) | [BDP Enterprise](/docs/getting-started-on-snowplow-bdp-enterprise/index.md) |
| [Failed Events](/docs/understanding-your-pipeline/failed-events/index.md) ||||
| [Data quality monitoring & API](/docs/managing-data-quality/monitoring-failed-events/index.md) ||||
@@ -50,11 +52,14 @@ To find out more about the support services offered to Snowplow BDP customers see
| [Tracking scenarios](/docs/understanding-tracking-design/tracking-plans/index.md) || ✅<br/>_(UI only)_ ||
| [Data modeling management tooling](/docs/modeling-your-data/running-data-models-via-snowplow-bdp/dbt/using-dbt/index.md) || _coming soon_ ||
| [Tracking catalog](/docs/discovering-data/tracking-catalog/index.md) ||||
| <h3>Deployment & security</h3> | [Open Source](/docs/getting-started-on-snowplow-open-source/index.md) | [BDP Cloud](/docs/getting-started-on-snowplow-bdp-cloud/index.md) | [BDP Enterprise](/docs/getting-started-on-snowplow-bdp-enterprise/index.md) |
| Deployment method | self-hosted<br/>(AWS, GCP, Azure 🧪) | Snowplow-hosted cloud | private cloud<br/>(AWS, GCP) |
| <h3>Deployment & security</h3> | [Open Source](/docs/getting-started-on-snowplow-open-source/index.md) | [BDP Cloud](/docs/getting-started-on-snowplow-bdp-cloud/index.md) | [BDP Enterprise](/docs/getting-started-on-snowplow-bdp-enterprise/index.md) |
| Deployment method | self-hosted | Snowplow-hosted cloud | private cloud |
| • AWS ||||
| • GCP ||||
| • Azure 🧪 ||| _coming soon_ |
| Single Sign-On ||||
| Fine grained user permissions (ACLs) ||| ✅<br/>_(top tier only)_ |
| Custom VPC integration ||| bolt-on |
| Custom VPC integration ||| ✅<br/>_(bolt-on)_ |
| AWS Infra security bundle ||| ✅<br/>_(top tier only)_ |
| <h3>Services</h3> | [Open Source](/docs/getting-started-on-snowplow-open-source/index.md) | [BDP Cloud](/docs/getting-started-on-snowplow-bdp-cloud/index.md) | [BDP Enterprise](/docs/getting-started-on-snowplow-bdp-enterprise/index.md) |
| Self-help support website, FAQs and educational materials ||||
19 changes: 19 additions & 0 deletions docs/first-steps/querying/index.md
@@ -90,6 +90,12 @@ To connect, you can use either Snowflake dashboard or [SnowSQL](https://docs.sno
</TabItem>
<TabItem value="databricks" label="Databricks">

:::info Azure-specific instructions

On Azure, you have created an external table in the [last step of the guide](/docs/getting-started-on-snowplow-open-source/quick-start/index.md#configure-the-destination). Use this table and ignore the text below.
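
For example, a minimal query to preview the most recent events (a sketch only, assuming the external table is named `events`, as created in the quick start guide):

```sql
-- Preview the ten most recent events loaded into the lake
SELECT collector_tstamp, app_id, event_name
FROM events
ORDER BY collector_tstamp DESC
LIMIT 10;
```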

:::

The database name and the schema name will be defined by the `databricks_database` and `databricks_schema` variables in Terraform.

There are two different ways to log in to the database:
@@ -101,6 +107,19 @@ See the [Databricks tutorial](https://docs.databricks.com/getting-started/quick-
</TabItem>
</Tabs>

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪">

In Synapse Analytics, you can connect directly to the data residing in ADLS. You will need to know the names of the storage account (set in the `storage_account_name` Terraform variable) and the storage container (it’s a fixed value: `lake-container`).

Follow [the Synapse documentation](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format) and use the `OPENROWSET` function. If you created a data source in the [last step](/docs/getting-started-on-snowplow-open-source/quick-start/index.md#configure-the-destination) of the quick start guide, your queries will be a bit simpler.

:::tip Fabric and OneLake

If you created a OneLake shortcut in the [last step](/docs/getting-started-on-snowplow-open-source/quick-start/index.md#configure-the-destination) of the quick start guide, you will be able to explore Snowplow data in Fabric, for example, using Spark SQL.

:::

</TabItem>
</Tabs>

8 changes: 4 additions & 4 deletions docs/getting-started-on-snowplow-open-source/_diagram.md
@@ -12,11 +12,11 @@ flowchart LR
iglu{{"<b>Iglu Server</b>\n(${props.compute})"}}
igludb[("<b>Iglu Database</b>\n(${props.igludb})")]
bad[["<b>Bad Stream</b>\n(${props.stream})"]]
${props.warehouse == 'Postgres' ?
`loader{{"<b>Postgres Loader</b>\n(${props.compute})"}}` :
${(props.warehouse == 'Postgres' || props.warehouse == 'Data Lake') ?
`loader{{"<b>${props.warehouse} Loader</b>\n(${props.compute})"}}` :
`loader("<b>${props.warehouse} Loader</b>\n<i>(see below)</i>")`
}
atomic[("<b>Events</b>\n(${props.warehouse})")]
atomic[("<b>Events</b>\n(${props.warehouse == 'Data Lake' ? props.bucket : props.warehouse})")]
collect---iglu %% invisible link for alignment
enrich-.-oiglu<-.->igludb
collect-->|"<b>Raw Stream</b><br/>(${props.stream})"| enrich
@@ -55,7 +55,7 @@ flowchart LR
</>
}</>

<>{props.warehouse != 'Postgres' && (<>
<>{props.warehouse != 'Postgres' && props.warehouse != 'Data Lake' && (<>
<h4>{props.warehouse} Loader</h4>
<ReactMarkdown children={`
For more information about the ${props.warehouse} Loader, see the [documentation on the loading process](/docs/storing-querying/loading-process/index.md?warehouse=${props.warehouse.toLowerCase()}&cloud=${props.cloud == 'aws' ? 'aws-micro-batching' : props.cloud}).
131 changes: 124 additions & 7 deletions docs/getting-started-on-snowplow-open-source/quick-start/index.md
@@ -93,10 +93,10 @@ The sections below will guide you through setting up your destination to receive
|:----------|:---:|:---:|:-----:|
| Postgres | :white_check_mark: | :white_check_mark: | :x: |
| Snowflake | :white_check_mark: | :x: |:white_check_mark: |
| Databricks | :white_check_mark: | :x: | _coming soon_ |
| Databricks | :white_check_mark: | :x: | :white_check_mark: |
| Redshift | :white_check_mark: |||
| BigQuery || :white_check_mark: ||
| Synapse ||| _coming soon_ |
| Synapse Analytics 🧪 ||| :white_check_mark: |

<Tabs groupId="cloud" queryString>
<TabItem value="aws" label="AWS" default>
@@ -113,7 +113,9 @@ There are two alternative storage options for you to select: Postgres and BigQuery.
</TabItem>
<TabItem value="azure" label="Azure 🧪">

There is currently only one option: Snowflake.
There are two storage options for you to select: Snowflake and a data lake (ADLS). The latter enables querying the data from Databricks and Synapse Analytics.

We recommend loading data into only a single destination (Snowflake or the data lake), but nothing prevents you from loading into both with the same pipeline (e.g. for testing purposes).

</TabItem>
</Tabs>
@@ -346,6 +348,12 @@ GRANT ROLE ${snowflake_loader_role} TO ROLE SYSADMIN;
</TabItem>
<TabItem value="databricks" label="Databricks">

:::info Azure-specific instructions

On Azure, we currently support loading data into Databricks via a data lake. You can still follow Step 1 below to create the cluster, but you should skip the rest of the steps. Instead, proceed with [deploying the pipeline](#set-up-the-pipeline) — we will return to configuring Databricks [at the end of this guide](#configure-the-destination).

:::

#### Step 1: Create a cluster

:::note
@@ -413,7 +421,7 @@ CREATE SCHEMA IF NOT EXISTS ${schema_name}

The security principal used by the loader needs a `Databricks SQL access` permission, which can be enabled in the _Admin Console_.

Databricks does not have table access enabled by default. Enable it with an
initialization script:

@@ -455,6 +463,11 @@ GRANT MODIFY, SELECT ON TABLE <catalog>.<schema>.rdb_folder_monitoring TO `<principal>`

</details>

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪">

No extra steps needed. Proceed with [deploying the pipeline](#set-up-the-pipeline) — we will return to configuring Synapse [at the end of this guide](#configure-the-destination).

</TabItem>
</Tabs>

@@ -546,13 +559,13 @@ Set the `postgres_db_authorized_networks` to a list of CIDR addresses that will
</TabItem>
<TabItem value="azure" label="Azure 🧪">

As mentioned [above](#storage-options), there is currently only one option for the pipeline’s destination database: Snowflake. Set `snowflake_enabled` to `true` and fill all the relevant configuration options (starting with `<snowflake>_`).
As mentioned [above](#storage-options), there are two options for the pipeline’s destination: Snowflake and a data lake (the latter enabling Databricks and Synapse Analytics). For each destination you’d like to configure, set the `<destination>_enabled` variable (e.g. `snowflake_enabled`) to `true` and fill in all the relevant configuration options (starting with `<destination>_`).

When in doubt, refer back to the [destination setup](#prepare-the-destination) section where you have picked values for many of the variables.

:::caution

Change the `snowflake_loader_password` setting to a value that _only you_ know.
If loading into Snowflake, change the `snowflake_loader_password` setting to a value that _only you_ know.

:::

@@ -599,7 +612,7 @@ This will output your `collector_lb_ip_address` and `collector_lb_fqdn`.
</TabItem>
</Tabs>

Make a note of the outputs: you'll need them when sending events and connecting to your database.
Make a note of the outputs: you'll need them when sending events and (in some cases) connecting to your data.

:::tip Empty outputs

@@ -615,4 +628,108 @@ For solutions to some common Terraform errors that you might encounter when running

:::

## Configure the destination

<Tabs groupId="warehouse" queryString>
<TabItem value="postgres" label="Postgres" default>

No extra steps needed.

</TabItem>
<TabItem value="redshift" label="Redshift">

No extra steps needed.

</TabItem>
<TabItem value="bigquery" label="BigQuery">

No extra steps needed.

</TabItem>
<TabItem value="snowflake" label="Snowflake">

No extra steps needed.

</TabItem>
<TabItem value="databricks" label="Databricks">

:::info Azure-specific instructions

On Azure, we currently support loading data into Databricks via a data lake. To complete the setup, you will need to configure Databricks to access your data on ADLS.

First, follow the [Databricks documentation](https://docs.databricks.com/en/storage/azure-storage.html) to set up authentication using an Azure service principal, shared access signature tokens, or account keys. _(Account keys are not recommended, but are arguably the easiest option for testing purposes.)_

You will need to know a couple of things:
* Storage account name — this is the value of the `storage_account_name` variable in the pipeline `terraform.tfvars` file
* Storage container name — `lake-container`

Once authentication is set up, you can create an external table using Spark SQL (replace `<storage-account-name>` with the corresponding value):

```sql
CREATE TABLE events
LOCATION 'abfss://lake-container@<storage-account-name>.dfs.core.windows.net/events/';
```
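
Once the table exists, a quick query can confirm that the loader is writing data. This is a minimal sketch, assuming the `events` table created above:

```sql
-- Count the events loaded so far via the external Delta table
SELECT count(*) AS event_count
FROM events;
```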

:::

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪">

Your data is loaded into ADLS. To access it, follow [the Synapse documentation](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format) and use the `OPENROWSET` function.

You will need to know a couple of things:
* Storage account name — this is the value of the `storage_account_name` variable in the pipeline `terraform.tfvars` file
* Storage container name — `lake-container`

<details>
<summary>Example query</summary>

```sql
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'https://<storage-account-name>.blob.core.windows.net/lake-container/events/',
FORMAT = 'delta'
) AS events;
```

</details>

We recommend [creating a data source](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format#data-source-usage), which simplifies future queries (note that unlike the previous URL, this one does not end with `/events/`):

```sql
CREATE EXTERNAL DATA SOURCE SnowplowData
WITH (LOCATION = 'https://<storage-account-name>.blob.core.windows.net/lake-container/');
```

<details>
<summary>Example query with data source</summary>

```sql
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'events',
DATA_SOURCE = 'SnowplowData',
FORMAT = 'delta'
) AS events;
```

</details>

:::tip Fabric and OneLake

You can also consume your ADLS data via Fabric and OneLake:

* First, [create a Lakehouse](https://learn.microsoft.com/en-us/fabric/onelake/create-lakehouse-onelake#create-a-lakehouse) or use an existing one.
* Next, [create a OneLake shortcut](https://learn.microsoft.com/en-us/fabric/onelake/create-adls-shortcut) to your storage account. In the URL field, specify `https://<storage-account-name>.blob.core.windows.net/lake-container/events/`.
* You can now [use Spark notebooks](https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-notebook-explore) to explore your Snowplow data, for example with a query like the sketch after this list.
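
For example, a minimal Spark SQL sketch, assuming the shortcut surfaces the data as a table named `events` in your Lakehouse:

```sql
-- Top event types in the Snowplow data exposed via the OneLake shortcut
SELECT event_name, count(*) AS event_count
FROM events
GROUP BY event_name
ORDER BY event_count DESC
LIMIT 10;
```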

Do note that currently not all Fabric services support nested fields present in the Snowplow data.

:::

</TabItem>
</Tabs>

---

If you are curious, here’s [what has been deployed](/docs/getting-started-on-snowplow-open-source/what-is-deployed/index.md). Now it’s time to [send your first events to your pipeline](/docs/first-steps/tracking/index.md)!
@@ -78,7 +78,17 @@ You can very easily edit the script or run each of the Terraform modules independently
<Tabs groupId="warehouse" queryString lazy>
<TabItem value="snowflake" label="Snowflake" default>

<Diagram cloud="azure" warehouse="Snowflake" compute="VMSS" stream="Event Hubs" bucket="Azure Blob" igludb="Postgres"/>
<Diagram cloud="azure" warehouse="Snowflake" compute="VMSS" stream="Event Hubs" bucket="ADLS Gen2" igludb="Postgres"/>

</TabItem>
<TabItem value="databricks" label="Databricks" default>

<Diagram cloud="azure" warehouse="Data Lake" compute="VMSS" stream="Event Hubs" bucket="ADLS Gen2" igludb="Postgres"/>

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪" default>

<Diagram cloud="azure" warehouse="Data Lake" compute="VMSS" stream="Event Hubs" bucket="ADLS Gen2" igludb="Postgres"/>

</TabItem>
</Tabs>
@@ -290,7 +300,7 @@ See the following Terraform modules for further details on the resources, default


</TabItem>
<TabItem value="databricks" label="Databricks">
<TabItem value="databricks" label="Databricks (direct)">

[RDB Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) is a set of applications that loads enriched events into Databricks.

@@ -299,5 +309,23 @@ See the following Terraform modules for further details on the resources, default
* [Databricks Loader](https://registry.terraform.io/modules/snowplow-devops/databricks-loader-ec2/aws/latest)


</TabItem>
<TabItem value="databricks-lake" label="Databricks (via lake)">

[Lake Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) is an application that loads enriched events into a data lake so that they can be queried via Databricks (or other means).

See the Lake Loader [Terraform module](https://registry.terraform.io/modules/snowplow-devops/lake-loader-vmss/azurerm/latest) for further details on the resources, default and required input variables, and outputs.

The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.

</TabItem>
<TabItem value="synapse" label="Synapse Analytics 🧪">

[Lake Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/snowplow-rdb-loader/index.md) is an application that loads enriched events into a data lake so that they can be queried via Synapse Analytics (or Fabric, OneLake, etc.).

See the Lake Loader [Terraform module](https://registry.terraform.io/modules/snowplow-devops/lake-loader-vmss/azurerm/latest) for further details on the resources, default and required input variables, and outputs.

The Terraform stack for the pipeline will deploy a storage account and a storage container where the loader will write the data.

</TabItem>
</Tabs>
@@ -20,7 +20,7 @@ Currently the Lake Loader supports [Delta format](https://delta.io/) only. Future

:::

<Tabs groupId="cloud" queryString>
<Tabs groupId="cloud" queryString lazy>
<TabItem value="gcp" label="GCP" default>
<LakeLoaderDiagram stream="Pub/Sub" bucket="GCS" cloud="GCP"/>
<DeployOverview cloud="GCP"/>