This solution is still functional, but it is manual and time-consuming to set up and has minimal error handling. We encourage you to check out the [Data Warehouse Ingestion](../../data-warehouse-ingestion/introduction.md) solution instead.
## Overview
There are two ways to integrate with Snowflake: using a data connector, or through direct ingestion from Snowflake.

## Using a Data Connector
To ingest events from Snowflake, you can use our Census integration. To export events to Snowflake, you can use our Fivetran integration.

## Direct ingestion from Snowflake
We also support direct data ingestion from Snowflake, through which Statsig will automatically pull data from Snowflake into your events. You will need to complete the following steps. Please follow the checklist below to avoid delays!

### 1. Set up your Snowflake data warehouse and user for Statsig integration
Insert `USER` and `PASSWORD` values into the SQL below and run it in a Snowflake worksheet on an account which has the sysadmin and securityadmin roles. This will create the table schemas and setup that Statsig's ingestion will use.

Statsig will use the user you create to access tables in the new Statsig schema. Make sure to use a unique and secure username and password, and replace the placeholder values in the first two statements.
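The exact setup script is supplied by Statsig and should be used as-is; as a rough sketch of what such a script does (MY_DB, STATSIG_ROLE, and STATSIG_USER below are illustrative placeholders, not names from the actual script), it looks something like this:

```sql
-- Illustrative sketch only; use the setup script Statsig provides.
-- Replace STATSIG_USER and the password with your own unique, secure values.
USE ROLE SECURITYADMIN;
CREATE ROLE IF NOT EXISTS STATSIG_ROLE;
CREATE USER IF NOT EXISTS STATSIG_USER
  PASSWORD = 'REPLACE_WITH_A_SECURE_PASSWORD'
  DEFAULT_ROLE = STATSIG_ROLE;
GRANT ROLE STATSIG_ROLE TO USER STATSIG_USER;

USE ROLE SYSADMIN;
-- Schema that Statsig's ingestion will read from (placeholder database name).
CREATE SCHEMA IF NOT EXISTS MY_DB.STATSIG;
GRANT USAGE ON DATABASE MY_DB TO ROLE STATSIG_ROLE;
GRANT USAGE ON SCHEMA MY_DB.STATSIG TO ROLE STATSIG_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA MY_DB.STATSIG TO ROLE STATSIG_ROLE;
```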
Provide the userID, custom fields, and the customIDs object (notably stableID) if you are using any custom identifiers. Make sure these fields are provided where they exist, and that the field names are capitalized correctly. Not providing a unit identifier will limit the utility of your events, as we won't be able to use them to build metrics like daily event users.
### Metrics (statsig_user_metrics)
| Column | Description | Rules |
|---|---|---|
| unit_id | The unique user identifier this metric is for. This is not necessarily a user_id - it could be a custom_id of some kind | |
| id_type | The id_type the unit_id represents. | Must be a valid id_type. The default Statsig types are user_id/stable_id, but you may have generated custom id_types |
| date | Date of the daily metric | |
| timeuuid | A unique timeuuid for the event | This should be a timeuuid, but any unique id will suffice. If not provided, the table defaults to generating a UUID. |
| metric_name | The name of the metric | Not null. Length < 128 characters |
| metric_value | A numeric value for the metric | Either metric_value, or both numerator and denominator, must be provided for Statsig to process the metric. See details below |
| numerator | Numerator for metric calculation | Refer to the note on metric values below |
| denominator | Denominator for metric calculation | Refer to the note on metric values below |
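Concretely, a table matching the columns above might be declared as follows (a sketch under the assumption that the schema is named STATSIG; the DDL in the actual setup script is authoritative):

```sql
-- Sketch of a statsig_user_metrics table matching the documented columns;
-- the setup script's DDL is authoritative.
CREATE TABLE IF NOT EXISTS STATSIG.STATSIG_USER_METRICS (
  unit_id      VARCHAR,                        -- user_id or a custom id
  id_type      VARCHAR,                        -- e.g. 'user_id' or 'stable_id'
  date         DATE,                           -- date of the daily metric
  timeuuid     VARCHAR DEFAULT UUID_STRING(),  -- any unique id works
  metric_name  VARCHAR(128) NOT NULL,
  metric_value DOUBLE,                         -- or provide numerator/denominator
  numerator    DOUBLE,
  denominator  DOUBLE
);
```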
#### Note on metric values
If you provide both a numerator and denominator value for any record of a metric, we'll assume that this metric is a ratio metric; we'll filter out users who do not have a denominator value from analysis, and recalculate the metric value ourselves via the numerator and denominator fields. If you only provide a metric_value, we'll use the metric_value for analysis. In this case, we'll impute 0 for users without a metric value in experiments.
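For example (illustrative rows only, assuming the sketched table above), a plain metric provides metric_value, while a ratio metric provides numerator and denominator and lets Statsig recompute the value:

```sql
-- Plain metric: Statsig uses metric_value directly and imputes 0 for
-- experiment users with no row.
INSERT INTO STATSIG.STATSIG_USER_METRICS
  (unit_id, id_type, date, metric_name, metric_value)
VALUES
  ('user_123', 'user_id', '2022-04-01', 'checkout_revenue', 24.99);

-- Ratio metric: Statsig recalculates the value from numerator/denominator
-- and drops units without a denominator from analysis.
INSERT INTO STATSIG.STATSIG_USER_METRICS
  (unit_id, id_type, date, metric_name, numerator, denominator)
VALUES
  ('user_123', 'user_id', '2022-04-01', 'click_through_rate', 3, 17);
```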
### Scheduling

Because you may be streaming events to your tables or have multiple ETLs pointing to your metrics table, Statsig relies on you signaling that your metrics/events for a given day are done. When a day is fully loaded, insert that date as a row in the appropriate signal table: statsig_user_metrics_signal for metrics, or statsig_events_signal for events. For example, once all of your metrics data is loaded into statsig_user_metrics for 2022-04-01, you would insert 2022-04-01 into statsig_user_metrics_signal.
Statsig expects you to load data in order. For example, if you have loaded up to 2022-04-01 and signal that 2022-04-03 has landed, we will wait for you to signal that 2022-04-02 has landed, and load that data before we ingest the data from 2022-04-03.
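A signal insert might look like this (assuming the signal table has a single date column named date, which is an assumption here; check the schema the setup script actually created):

```sql
-- Signal that all metrics for 2022-04-01 have landed and passed your data
-- quality checks. The column name "date" is an assumption; verify your schema.
INSERT INTO STATSIG.STATSIG_USER_METRICS_SIGNAL (date)
VALUES ('2022-04-01');
```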
NOTE: this ingestion pipeline is in beta, and does not currently support automatic backfills or updates to data once it has been ingested. Only signal that these tables are loaded after you've run data quality checks!
### Checklist
These are common errors we've run into - please go through them and make sure your setup is looking good!

- Field names are set incorrectly
  - Some of the field names here may intersect with protected keywords. You should be able to generate these by wrapping them in quotes if needed. The above SQL works in the Snowflake console.
  - Run `SELECT *` on your tables when done to confirm that there aren't any special characters in the column names! Special characters will cause our ingestion to fail.
- The `id_type` is set correctly
  - Default types are `user_id` or `stable_id`. If you have custom ids, make sure that the capitalization and spelling match, as these are case sensitive (you can find your custom ID types by going to your Project Settings in the Statsig console). The query after this list is a quick way to check.
- Your ids match the format of ids logged from SDKs
  - In some cases, your data warehouse may transform IDs. This may mean we can't join your experiment or feature flag data to your metrics to calculate Pulse or other reports. You can go to the Metrics page of your project and view the log stream to check the format of the ids being sent (either `User ID`, or a custom ID in `User Properties`) to confirm they match.
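One quick way to spot-check the last two items (assuming the sketched statsig_user_metrics table above) is to group by id_type and compare a sample id against the console's log stream:

```sql
-- List every id_type in the table with a sample unit_id; compare the values
-- and capitalization against what your SDKs log.
SELECT id_type,
       COUNT(*)           AS row_count,
       ANY_VALUE(unit_id) AS sample_unit_id
FROM STATSIG.STATSIG_USER_METRICS
GROUP BY id_type;
```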
- Monitoring is limited today, but you should be able to check your Snowflake query history for the Statsig user to understand which data is being pulled, and whether queries are not executing (no history) or are failing (see the example query after this list).
  - You should see polling queries within a few hours of setting up your integration.
  - If you have a signal date in the last 28 days, you should see a select statement for data from the earliest signal date in that window.
  - If that query fails, try running it yourself to understand if there is a schema issue.
  - If data is loading, it's likely we're just processing. For new metrics, give it a day to catch up. If data isn't loaded after a day or two, please check in with us. The most common cause of metrics catalog failures is id_type mismatches.
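To inspect the query history for the Statsig user, something like the following should work (STATSIG_USER is whatever username you created during setup; QUERY_HISTORY_BY_USER covers roughly the last seven days):

```sql
-- Recent queries run by the Statsig user, newest first. Failed queries carry
-- an execution_status of 'FAILED_WITH_ERROR' and a populated error_message.
SELECT start_time, execution_status, error_message, query_text
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_USER(USER_NAME => 'STATSIG_USER'))
ORDER BY start_time DESC
LIMIT 50;
```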