
Breaking Free of GA + BigQuery and Embracing Elegance in Behavioral Data

GA data on BigQuery comes with a ton of scale (that you probably don't need), but it also comes with a ton of baggage. Is there a better schema for storing behavioral data that reduces complexity downstream?


Picture this: you’re tasked with developing an analytics tool for capturing user behavior. The big question pops up: how would you architect the data schema? If you’re thinking, “Well, it depends on how the data’s gonna be used,” you’re spot on. A fixed set of reports and metrics? Ad-hoc queries? Each would pull the design in a different direction. So what if I said the design intent is ad-hoc querying in a language with wide adoption? Why reinvent the wheel: SQL is still often the best tool for the job in data analysis. It’s terse, simple to read, and easy to grasp, without funky language constructs. So, what does an intuitively simple schema for web or, let’s go a bit broader, behavioral analytics look like?

It would probably look something like this:

[Diagram: a simple relational schema linking people and events]
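
In SQL, say Postgres-flavored DDL, a minimal sketch of that schema might look like this (table and column names are illustrative assumptions, not a prescribed design):

-- One row per human, one row per thing they did
CREATE TABLE person (
    person_id  BIGSERIAL PRIMARY KEY,
    first_seen TIMESTAMPTZ NOT NULL,
    email      TEXT                     -- identity traits live here
);

CREATE TABLE event (
    event_id    BIGSERIAL PRIMARY KEY,
    person_id   BIGINT NOT NULL REFERENCES person (person_id),
    event_type  TEXT NOT NULL,          -- 'page_view', 'add_to_cart', ...
    page_path   TEXT,                   -- populated for page views
    occurred_at TIMESTAMPTZ NOT NULL
);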

It would be helpful to introduce the artificial concept of a “visit” or session. Sessions don’t really exist the way humans and events do, but they’re a useful construct because they capture a lot about a particular long-lived interaction with a user. Sessions are useful for measuring “intent,” or for telling a more relatable, qualitative story than granular events alone can. They also matter once you consider multiple devices: it’s not unusual for customers to have multiple active sessions with interleaved events.

[Diagram: the schema extended with a session table]
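
Extending the sketch with that construct (again, illustrative names):

CREATE TABLE session (
    session_id       BIGSERIAL PRIMARY KEY,
    person_id        BIGINT NOT NULL REFERENCES person (person_id),
    started_at       TIMESTAMPTZ NOT NULL,
    ended_at         TIMESTAMPTZ,
    session_duration INT                -- seconds; derivable from the timestamps
);

-- Each event can then point at the session it belongs to
ALTER TABLE event ADD COLUMN session_id BIGINT REFERENCES session (session_id);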

Such a simple schema would be the most useful because relational data is easy to reason about. Most of what a company is interested in is the humans, not the events or sessions; those are just a means to an end, a way to measure what we’re doing and its effect on our humans (customers).

Now, I’m going to throw you a curveball: the schema must support multiple tenants. And not just a few, or a few thousand, but millions. That alone changes everything; now the schema has to handle millions of data points across millions of tenants. Will that force you to rethink how you store and manage data? Of course. Now, what if I say we also have to honor legacy data-model contracts (the way GA must)? Well, now you’d tear up your design and say, “OK, show me ALL the requirements.”

But from the POV of an analyst or data engineer, the ideal is the simplest, most naive, most intuitive model of the data: humans and events (and sessions).

Many of us who rely on GA behavioral data to power BI, ML models, dashboards, CRMs, CDPs and more are forced to reckon with its quirks and intricacies, and with its departure from our perfect relational model. It’s no wonder the GA schema, and BigQuery itself, exist in their current form: they grew out of Google’s requirements for running a free analytics tool for the entire internet. Google Analytics must handle vast amounts of data generated by millions of websites. The schema is optimized for scalability, capable of storing and processing huge volumes of event data efficiently. BigQuery’s nested and repeated fields are ideal for this because they allow a more compact data representation and can reduce the number of I/O operations required to read large datasets. BigQuery is indeed a marvel of scalability, but would your company benefit from this level of scale? Unlikely. Even at over 100k visits a day, you wouldn’t generate THAT much data (well under a terabyte). Few e-commerce brands generate the volume necessary to justify that level of scale. So you end up paying for scale you don’t need with complexity you don’t want.
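
To make that concrete, here is a simplified sketch of what nested, repeated fields look like in BigQuery DDL, loosely modeled on the GA4 export schema (heavily trimmed; the real export has many more fields):

CREATE TABLE analytics.events_sample (
  user_pseudo_id  STRING,
  event_name      STRING,
  event_timestamp INT64,
  -- A repeated field: every row carries an array of key/value structs
  event_params    ARRAY<STRUCT<
    key   STRING,
    value STRUCT<string_value STRING, int_value INT64, double_value FLOAT64>
  >>
);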

In the data world, simplicity is the ultimate sophistication. Unfortunately, when it comes to querying Google Analytics data in BigQuery, simplicity often goes out the window. What should be straightforward insights require a labyrinth of specialized SQL extension clauses, making one pine for the clarity of traditional relational databases. Let’s unravel some examples that demonstrate how something as basic as understanding user behavior becomes a Herculean task in the GA on BigQuery environment.

Consider a simple mission: counting users visiting a specific page. With a traditional relational schema:

SELECT COUNT(DISTINCT user_id) 
FROM page_visits 
WHERE page_path = '/home';

Clear and concise, mission accomplished. Now, let’s turn to BigQuery’s GA schema. The equivalent information is buried within arrays of nested records:

SELECT COUNT(DISTINCT fullVisitorId)
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
  UNNEST(hits) AS hit
WHERE hit.page.pagePath = '/home';

Notice the need to UNNEST hits—because user interactions are encapsulated in nested structures, a simple count becomes a multi-step process.

This is where the love affair with arrays, structs, and correlated subqueries takes a toll on your SQL purity.

The Cost of Complexity

BigQuery’s pricing model is based on the amount of data your queries scan. Every UNNEST, every subquery, and every analytical function adds to the bill. The irony is palpable: you pay more for the complexity you never asked for. Let’s look at a few more examples to illustrate.

Funnel Analysis in BigQuery with GA Data

A funnel analysis tracks the series of steps that users take, often leading towards a goal, like making a purchase. In GA data within BigQuery, each step might correspond to a different event, and each event can have multiple parameters.

Let’s consider a simplified funnel with three steps:

  1. User visits a product page (view_item event).
  2. User adds the product to the cart (add_to_cart event).
  3. User completes the purchase (purchase event).

BigQuery SQL for Funnel Analysis:

WITH steps AS (
  SELECT
    user_pseudo_id,
    event_name,
    event_timestamp,
    -- Dig into the event_params array to extract the product_id for each event
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'product_id') AS product_id
  FROM `your_project.your_dataset.events_*`
  WHERE event_name IN ('view_item', 'add_to_cart', 'purchase')
    AND _TABLE_SUFFIX BETWEEN '20210101' AND '20210131'
),

funnel AS (
  SELECT
    user_pseudo_id,
    -- Aggregate steps for each user to track their path through the funnel
    ARRAY_AGG(STRUCT(event_name, event_timestamp) ORDER BY event_timestamp) AS user_funnel
  FROM steps
  GROUP BY user_pseudo_id
)

SELECT
  user_pseudo_id,
  user_funnel,
  -- Check the sequence of events to confirm whether the user completed the funnel
  IF(
    ARRAY_LENGTH(user_funnel) = 3
    AND user_funnel[OFFSET(0)].event_name = 'view_item'
    AND user_funnel[OFFSET(1)].event_name = 'add_to_cart'
    AND user_funnel[OFFSET(2)].event_name = 'purchase',
    'Completed Funnel',
    'Did Not Complete Funnel'
  ) AS funnel_status
FROM funnel

In this example, the WITH clause creates a Common Table Expression (CTE) that first identifies the relevant events for each user and reaches into event_params to extract a product_id. It then aggregates these events into an array ordered by event_timestamp to capture the sequence of steps each user has taken.

The final SELECT statement checks the sequence of events for each user to determine if they completed the funnel. This check relies on the aggregated array of events, which is made possible by the nested structure of GA data in BigQuery.

Nested structures enable the representation of complex, multi-step interactions within a user’s session, which would be much more challenging to model in a flat relational table without extensive join operations or additional processing. The power of BigQuery’s nested structures shines in analyses like this, where the relationship and order of events are crucial to the analysis.
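
To be fair to the flat model, though, the same ordered three-step funnel against the person/event sketch from earlier stays a single pass with no UNNEST (a Postgres-flavored sketch, with illustrative names):

SELECT person_id
FROM (
    SELECT
        person_id,
        -- First time each step occurred; NULL if it never did
        MIN(occurred_at) FILTER (WHERE event_type = 'view_item')   AS viewed_at,
        MIN(occurred_at) FILTER (WHERE event_type = 'add_to_cart') AS carted_at,
        MIN(occurred_at) FILTER (WHERE event_type = 'purchase')    AS purchased_at
    FROM event
    GROUP BY person_id
) funnel_times
-- NULLs fail the comparisons, so incomplete funnels drop out
WHERE viewed_at < carted_at AND carted_at < purchased_at;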

Let’s look at one more example: calculating average session duration.

In BigQuery, you might find yourself writing a query like:

WITH sessions AS (
  SELECT
    user_pseudo_id,
    -- The session identifier hides inside the event_params array
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
    TIMESTAMP_MICROS(MIN(event_timestamp)) AS session_start,
    TIMESTAMP_MICROS(MAX(event_timestamp)) AS session_end
  FROM `your_project.your_dataset.events_*`
  GROUP BY user_pseudo_id, session_id
)
SELECT
  user_pseudo_id,
  AVG(TIMESTAMP_DIFF(session_end, session_start, SECOND)) AS avg_session_length
FROM sessions
GROUP BY user_pseudo_id

Whereas with our ideal relational schema it may look something like this:

SELECT person_id, AVG(session_duration) as avg_session_length
FROM session
GROUP BY person_id

So, what would an ideal schema look like? At its foundation it should be user-friendly and accurately model the real world, while also offering built-in features akin to materialized views, so that frequently needed data is already rolled up on the entity where you need it. For example, don’t make me join the event table every time I need to figure out whether a person is a customer; just give me an is_customer boolean on the person table. Better yet, give me a total_spent column. Imagine having key metrics like total spent, days since last purchase, or customer status directly accessible within your user table; this is the kind of customer-centric, efficient design that modern data teams need. In the new world of data pipelines and warehouses, analytics should be built with the analyst in mind. Today’s analysts have become data wranglers, and artificial complexity we never asked for compounds the time needed to get basic data analysis and data science work done.
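
With rollups like those in place, everyday questions become one-liners. A sketch, assuming the person table has been enriched with those hypothetical columns:

-- Top customers at risk of churning: no joins, no UNNEST
SELECT person_id, total_spent, days_since_last_purchase
FROM person
WHERE is_customer
  AND days_since_last_purchase > 90
ORDER BY total_spent DESC
LIMIT 10;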

How organizations deal with GA data when importing from BigQuery to their data warehouse

When organizations import Google Analytics (GA) data from BigQuery into their data warehouse, they generally follow one of these strategies:

1. Flatten the Data During ETL

Most organizations choose to flatten the data during the Extract, Transform, Load (ETL) process. This means they will extract the nested and repeated fields from BigQuery and transform them into a tabular format that fits a relational model. This process often involves:

Extracting nested fields: Using UNNEST() to turn nested fields into separate rows.

De-duplicating: Ensuring that the flattening process doesn’t introduce duplicate rows.

Creating new tables: Making new tables in the data warehouse to hold data that was nested in BigQuery.

Flattening the data simplifies querying and allows the use of standard SQL without complex UNNEST operations. It also makes it easier to use the data with BI tools that may not support nested schemas. This is by far my favorite option; a sketch of what it might look like follows.
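
A minimal sketch of that flattening step, assuming a GA4-style events_* export and illustrative target table names:

-- Explode each event's params into one flat, relational row apiece
CREATE TABLE warehouse.flat_event_params AS
SELECT
  e.user_pseudo_id,
  e.event_name,
  TIMESTAMP_MICROS(e.event_timestamp) AS event_time,
  param.key AS param_key,
  -- Collapse the typed value struct into a single string column
  COALESCE(
    param.value.string_value,
    CAST(param.value.int_value AS STRING),
    CAST(param.value.double_value AS STRING)
  ) AS param_value
FROM `your_project.your_dataset.events_*` AS e,
  UNNEST(e.event_params) AS param
WHERE _TABLE_SUFFIX BETWEEN '20210101' AND '20210131';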

2. Import Nested Data As-Is

Some organizations may choose to import the GA data into their data warehouse without flattening, keeping the nested and repeated fields. This strategy might be used when:

The data warehouse supports nested structures: If the target data warehouse (like Snowflake or another instance of BigQuery) supports similar data types, the data might be imported as-is to maintain the structure. But then you have to support Google’s data structure in all of your queries.

Minimal transformation requirement: If the organization wants to minimize ETL complexity and processing time, they might opt to keep the data in its original format. Generally this seems like a bad idea: you’ll pay for the complexity down the road, and it can infect a plethora of downstream systems.

3. Hybrid Approach

In some cases, organizations might adopt a hybrid strategy, where they:

Flatten critical data: Flatten only the most frequently accessed data or the data needed for specific high-priority analyses.

Keep detailed data nested: Retain the detail of less frequently accessed data in a nested format to save on processing time and costs.

4. Custom Transformations for Specific Use Cases

Organizations may also perform custom transformations tailored to their specific analytical needs. This can involve:

Creating summary tables: For high-level reporting and dashboards, where detailed event-level data isn’t necessary.

Materialized views: For databases that support them, creating materialized views to provide a flattened view of the data without physically restructuring the underlying tables. Of course these need to be maintained or “refreshed” to stay current, as sketched below.
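
For instance, a Postgres-style materialized view could maintain the person-level rollups discussed earlier (illustrative names again):

CREATE MATERIALIZED VIEW person_summary AS
SELECT
    person_id,
    COUNT(*) FILTER (WHERE event_type = 'purchase') > 0     AS is_customer,
    MAX(occurred_at) FILTER (WHERE event_type = 'purchase') AS last_purchase_at
FROM event
GROUP BY person_id;

-- The maintenance cost: it must be refreshed to stay current
REFRESH MATERIALIZED VIEW person_summary;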

In all of the above cases you’re maintaining system configs, code, or a combination of both to deal with GA-on-BigQuery complexity.

In the quest for data-driven insights, the elegance and sophistication lie in simplicity. The GA schema in BigQuery, necessitated by Google’s vast scale needs, overshoots this mark for most businesses. It’s a data Goliath, and most businesses seek a David: nimble, swift, and effective. The vast majority of companies, even those with high web traffic, will find more value in a schema that prioritizes ease of use and simplicity in querying, and that aligns with the natural structure of human interactions and events. A well-designed relational model, like the one you’d find in traditional SQL databases, respects these principles. It empowers analysts and data engineers to focus on deriving insights rather than unraveling the complexities of their tools.

As we advance into the era of big data, let’s not forget the power of simplicity and the value of intuitive data models. Those were the two driving forces behind our Postgres-first “headless analytics” solution. We started with the goal of delivering a user-friendly schema that made sense and was easy to work with. Then we added classification, feature extraction, and binning to make 80% of data use cases that much easier. When it comes to data, time-to-insight and time-to-action have a direct effect on the bottom line. I invite you to take a look at Lassoo if your business is craving first-party behavioral data that’s simple to work with.




Max Kremer, Co-founder & CTO @ Lassoo. Startup guy with multiple exits. Lover of technology and data.