Derived datasets

In this article...

We'll introduce you to Derived datasets and how they can be used to power business insights for supercharged personalizations.

What is a Derived Dataset?

A Derived Dataset (DD) uses the power of SQL to query and aggregate any data points we've collected or ingested for your property.

By combining raw data points, a Derived Dataset can generate meaningful and useful business insights in a way that those data points by themselves cannot.

Once created, the resulting dataset can be used to power experiences, segments, and analysis.

Common uses

Derived Datasets are already being used by Qubit customers to power one-to-one personalizations and drive conversions. What's more, to help you get started, we've built several DD-based models that can be used off-the-shelf:

Stock replenishment

Focusing on the higher conversion rates of repeat purchasers, a Derived Dataset can be created to re-target users when products they have previously bought are due to run out.

This can be achieved by ingesting the number of replenishment days per product (the number of days that typically elapse between a product being purchased and then re-purchased) and then building a query to predict when a user's supply is due to run out.

The result is then used to inform the user, on return to a site, how many days are left until their products run out.
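As a rough sketch, a query for this use case might join each user's most recent purchase of a product with the ingested replenishment days. The table paths, the field names (meta_ts, product_id), and the dataset_replenishment_days import below are placeholders, not taken from a real property:

SELECT
  p.context_id,
  p.product_id,
  r.replenishment_days - DATE_DIFF(CURRENT_DATE(), DATE(p.last_purchase), DAY) AS days_until_run_out
FROM (
  -- Most recent purchase per user and product; assumes meta_ts is a TIMESTAMP
  SELECT
    context_id,
    product_id,
    MAX(meta_ts) AS last_purchase
  FROM `qubit-client-36902.kn8__v2.event_ecBasketItemTransaction`
  GROUP BY 1, 2
) p
JOIN `qubit-client-36902.kn8__v2.dataset_replenishment_days` r
  ON p.product_id = r.product_id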

Size preference

In another example, we could create a Derived Dataset to pre-select the size in each product category based on a user's previous browsing and purchasing behavior.

For each user, a SQL query is run every 6 hours to produce a dataset with a compound key of context Id and category Id, together with the value of the last purchase. This dataset is then used in an experience on product detail pages (PDP) to preselect the size for each individual user.
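A minimal sketch of such a query, assuming the purchase event exposes product_category and product_size fields (placeholder names; the table path is also illustrative), could look like this:

SELECT
  context_id,
  product_category,
  -- size from the most recent purchase in each category
  ARRAY_AGG(product_size ORDER BY meta_ts DESC LIMIT 1)[OFFSET(0)] AS preferred_size
FROM `qubit-client-36902.kn8__v2.event_ecBasketItemTransaction`
GROUP BY 1, 2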

Getting started

Select Data tools > Import from the side menu, and then switch to the Derived data tab

Opening an existing dataset

On opening, you will find a list of your Derived Datasets, including a timestamp that indicates the last time data was ingested into the import and the schedule that determines how often your SQL query is run to pull data into the dataset.

WARNING: errors indicates that issues were found the last time data was pulled into the dataset. See Error reporting for more information.

On opening a dataset, you will be presented with key information, organized into three tabs:

  • Preview - shows a preview of the data imported using the query
  • Details - shows activity associated with the dataset:

    • In Query we show the SQL query used in the dataset

    • In Import activity we show details of each data import since the creation of the dataset

    • In Info, we provide the dataset Id. When using the Import API to access imported data in an experience, you will need to pass this Id in your GET request

    • In Lookup Access, we provide the namespace and endpoint for the import. This information can be used to query the import using the Import API. You can copy the endpoint by selecting copy key and then pasting it into a browser window, replacing id=<key> with an actual key from your import

    • In Live Tap Access, we provide the project and table name for the imported data; you will need this information to explore the data in Live Tap
  • Schema - shows the import's schema, including, importantly, which fields in the schema are available for lookup

Creating a new dataset

To create the dataset, select New import in the list view and select Derived data

Developing the query

WARNING: The results from a query cannot exceed 20 columns.

To develop the query, you will use three panels. To the left, you will find quick access to the data that you can build your queries from.

This includes all the QP events emitted by your property, any imported datasets, and, if you are using Live Tap, the Live Tap Business Views. In the following example, we see that the user can choose from a number of events and an import called dataset_list_of_skus:

query options

To get details of the fields and field types in each one, simply select it:

expanded view

From this view, you can query the table, copy the table name, or select the field to use in your query.

The top panel is an editor that you use to develop your query using Standard SQL. All the usual grouping, joining, and aggregation can be done to derive insights from the data held in Qubit. You can find more information about Standard SQL in Standard SQL Functions and Operators.

Once you have done this, you can run the query. The results are shown below the editor.

In the following example, the user has selected the field basket_quantity to use in a query:

SELECT
  basket_quantity
FROM `qubit-client-36902.kn8__v2.event_ecBasketItem` LIMIT 1000

Running your query and getting results

Once you are happy with the query, you can test it by running it. To do this, select Run query.

Assuming that the query doesn't contain syntax errors, the results of your query will be shown in the Results panel.

SQL Helper Methods

We have added a few helper functions to streamline the writing of SQL. These are optional, but should make the final query more compact and succinct, and are able to handle common tricky tasks such as JSON generation.

ms

Converts a human-friendly duration into milliseconds using the parse-duration JavaScript library; useful for decorators and interval comparisons.

Example:

{{ms "5m 3 seconds"}} -> 303000

eventTable

Returns the full project, dataset, and table identifier string for a given event type, including QP namespace, if relevant.

The function will also optionally replace % with the vertical.

Example:

{{eventTable "%View"}} -> qubit-client-12345:demo.event_ecView

groupAsJson & concatForGroup

A common use case for Derived Datasets is to compute, for every user, a collection of attributes keyed by some other entity. For example, a mapping of user Ids to their most visited categories, including how many times they have viewed each one.

Since we only allow one row per primary key, user Id in this case, this Derived Dataset query requires concatenating the collection into a single field in a single row. Using the example above, you would end up with a column of user Ids and a column of arrays of categories with a view count.

This is very hard to achieve in BigQuery, even more so in legacy SQL syntax. The groupAsJson and concatForGroup functions attempt to make this easier and require less boilerplate by using a UDF to combine the collection into a JSON array of objects.

groupAsJson generates a unique UDF according to its input. It accepts 3 props:

  • groupFields - a comma-separated list of fields that should be present on the row without any concatenation (i.e. each is already a single scalar value). All groupFields will be cast to strings.
  • rowFields - a comma-separated list of fields that have been prepared with concatForGroup and now require reformatting into an array of JSON objects
  • rowsKey - the output field name for the array of JSON objects (optional, defaults to rows)

concatForGroup is used to prepare data for grouping. Under the hood, it performs a group concat with a special delimiter. It accepts 2 arguments:

  • The first is the source of the group data, for example, a field
  • The second is used for naming the concatenated column, for example, the AS clause

The example below produces a single row per context Id, where the second column is an aggregation of all the meta_ids of ecView events for that context Id:

SELECT
  context_id,
  event_ids
FROM ({{ groupAsJson groupFields="context_id" rowFields="event_id" rowsKey="event_ids" }}(
  SELECT
    context_id,
    {{concatForGroup "meta_id" "event_id"}}
  FROM [{{eventTable "ecView"}}]
  GROUP BY 1
))

Example output:

{
  "context_id": "1478101023161.686953",
  "event_ids": "[{\"event_id\":\"du3ftgry988-0iv1393cq-a59g4kg\"},{\"event_id\":\"r20oxngl51c-0iv136qoj-z8ifptc\"},{\"event_id\":\"phaf4jr0n0w-0iv148xwz-km6ah1s\"},{\"event_id\":\"h6f2qd45ugo-0iv13d6mb-lih8lcg\"},{\"event_id\":\"c23ur5k0kvs-0iv13dq7s-24rfqi8\"}]"
}

Variables

As well as helper methods, the following are available as plain variable values:

  • trackingId
  • clientId
  • vertical
  • namespace
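For example, assuming these variables are substituted as plain text before the query runs, you could reference them directly in your SQL (a sketch with an illustrative table path, not a real dataset):

SELECT
  context_id,
  '{{trackingId}}' AS tracking_id,
  '{{vertical}}' AS vertical
FROM `qubit-client-36902.kn8__v2.event_ecView`
LIMIT 10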

Saving the dataset

Once you have developed your query, and you are happy with the results, you can save the dataset. You must do this before adding a schedule.

Select Save

Add a name to identify your dataset: we recommend you use a name that identifies the purpose of the dataset

You must now choose a schedule to define how often the query is run. Your schedule can be either daily or weekly

In the following example, the user has chosen to schedule the query to run every week at 09:00 on Monday, Wednesday, Friday:

dataset schedule

INFO: You can also specify whether to make fields available for lookup. Only fields available for lookup can be used when building experiences. To do this, select the AVAILABLE FOR LOOKUP toggle for each field in Lookup and then Enable lookup in the modal window.

Finally, select Confirm. This will initiate the data import into the dataset. Once this is complete, you can start to use the data in your experiences and segments

Error reporting

As mentioned earlier, in your list of datasets, errors will display when an error was encountered the last time data was imported into the dataset.

You can get more details by opening the dataset and looking in the Details tab. In the Import activity panel you will find details of each of the data imports into the dataset.

In the following example, we see that the last two imports failed:

import fail

Typically, failure is caused by problems in one of Qubit's internal services. If you see two consecutive failures, we recommend reaching out to Customer Support.

Last updated: January 2023