GA4 πŸ”— BigQuery: Combine the Potential of Both Platforms


In this tutorial, we will take a look at the integration of Google Analytics 4 (GA4) with BigQuery. The focus will be on understanding BigQuery as a data warehouse and exploring the advantages and costs involved in this integration.

Many Web analysts use BigQuery as their primary tool for querying analytics data because they discovered the benefits that you may not be aware of.

The author

About GA4 and BigQuery #

Each Google Analytics 4 property comes with the capability to send data directly to Google BigQuery.

With this direct connection, businesses can harness the full potential of both platforms, leveraging GA4’s advanced tracking and measurement capabilities and BigQuery’s robust data processing and querying functionalities.

Why Link GA4 with BigQuery

BigQuery #

Google BigQuery is a modern and centralised data warehouse that operates serverless on the cloud. All the resources are organised in a hierarchy structure, which provides a high degree of scalability.

In Details:

BigQuery is hosted and managed by Google and its primary purpose is to serve as a data warehouse.

Data Warehouse #

A data warehouse stores structured data from various sources (like data marketing, CRM and contextual data), and organises it in a single place from which it can be queried (using SQL or other tools) for the purpose of reporting and analysis.

BigQuery Data Warehouse
Google BigQuery Data Warehouse

Centralised #

Make analytics easier by bringing together data from multiple sources into BigQuery.

Google

By consolidating all your data sources into a single repository, you create a solid, reliable and centralised data warehouse that serves as the ultimate “one point of truth” for your business.

Furthermore, imagine for example by having all your advertising data from various channels and campaigns consolidated in one accessible location, how much easier it becomes to calculate your Return On Advertising Spend (ROAS).

Organised #

Anyone who has looked for their golf clubs in a messy garage, only to find them hidden behind the holiday decorations, can appreciate the value of an organised data warehouse.

Informatica

Google BigQuery is not a cluttered storage space where data is simply stacked and piled without structure or organisation.

In BigQuery, you operate within a hierarchical structure, starting with a project that contains all your resources. Within this project, you can create sub-projects, but for now, let’s focus on data sets.

Data sets serve as a means to organise your data effectively. Each data set acts as a container for housing multiple tables and views. This arrangement allows you to manage and structure your data efficiently, making it easier to work with specific datasets without interfering with others.

⚠️ Note: For each Google Analytics 4 property that is linked to BigQuery, a single dataset named "analytics_property_id" is added to your BigQuery project.

Tables within the data sets hold the actual data, while views offer virtual representations or customized queries of that data.

Google BigQuery Structure
Google BigQuery Structure

This hierarchical organization enables users to maintain a well-structured and easily accessible data environment within Google BigQuery.

Modern #

Modern data warehouses may store many terabytes or even petabytes of data.

Google BigQuery is commonly referred to as “a petabyte-scale data warehouse”.

1 PETABYTE = 1,000 TERABYTES = 1 MLN GB

Serverless #

Google BigQuery is a serverless architecture which means it can be accessed from anywhere, and it also means that it doesn’t need extensive efforts or hardware requirements to set up.

Google BigQuery unleashing the power of web-scale analytics for regular companies without the need to build or manage their own massive data centers. … A vision to move beyond endless racks of hardware into a new world of cloud-based analytics.

Michael Manoochehri

On the cloud #

Since its launch, BigQuery has remained the market-defining platform for cloud-based analytics.

The cloud offers many benefits, as do the data warehouses that live there. They are less expensive, faster and more scalable.

Over time, the platform’s team has introduced several powerful features such as streaming ingestion, support for standard SQL, and machine learning functions.

GA4 + BigQuery #

BigQuery is a more powerful data warehouse than Google Analytics and the integration of the two platforms directly enables data ownership for Google Analytics users.

Data ownership #

When you export your Google Analytics data to BigQuery, you become the owner of the data. This means that you have full control of the collected data.

Yes, your data is still stored in the cloud on Google servers, but it’s much more like owning your data rather than just viewing it via the normal Google Analytics web experience (UI) or Looker Studio Data Connector.

Here are some of the ways that the data ownership provides control over data compared to keeping it within Google Analytics alone:

  1. Data Retention #

    In Google Analytics 4, data can be retained for a maximum period of 14 months.

    Google Analytics 4 Data Retention
    Google Analytics 4 Data Retention

    BigQuery has the capacity to retain data indefinitely, without any time limitations.

    ⚠️ Note: BigQuery starts to collect data only from the time of its configuration. So, if you want to access historical data, you should enable data export to Google BigQuery as soon as possible.

  2. Data Analysis #

    When analysing data within the Google Analytics 4 interface, you may incur in data sampled, thresholded or cardinality.

    By sending the data over to BigQuery you will have access to the data in the Raw format, where each user and their events will be displayed on separate lines.

    BigQuery Raw Data
    Google BigQuery unsampled raw events and user-level data

    This allows you to conduct more sophisticated analyses than what you are able to do within the GA interface.

    ⚠️ Note:

    Google Analytics 4 (GA4) data stored in BigQuery is not 100% raw and unmodified, but it is a good approximation of the raw data.

    In fact, before the data reaches BigQuery Google removes the IP address and user agent from the data to protect user privacy:

    • IP Address: Google removes the last octet of the IP address. This action diminishes the ability to track a user’s exact location and helps the company to comply with privacy laws such as GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act).

    • User Agent: This is a string that typically identifies the browser and operating system of a user. By removing this, Google further hinders the ability to track a user or gain insights into their browsing habits.

    But it adds geographic information and device information:

    • Geographic Information: While the original IP address is truncated to remove the last octet, the remaining data is still used to provide approximate geographic information. This allows businesses to understand the general location of their users without violating privacy rules. It offers a less precise but still valuable level of detail.

    • Device Information: Google includes additional metadata like the type of device being used (e.g., smartphone, desktop, tablet), the operating system, and the browser. This information can be crucial for understanding user behavior and preferences but is considered less sensitive in terms of individual privacy.

    Overall, the data is still valuable for understanding your users, but it is not as precise as it would be if the data was not modified.

    In summary, the modifications that Google makes to GA4 data before it is stored in BigQuery strike a balance between privacy and utility.

    ⚠️ Note: According to Google: β€œAnalytics doesn’t export data from Google signals to BigQuery”, which effectively allows BigQuery to prevent data thresholding. But, since Google Signals deduplicates user counts from individual users, you may see discrepancies in user counts and event counts per user when comparing data between Analytics and BigQuery.

    πŸš€ Tip: As mentioned, BigQuery eliminates issues related to data sampling, thresholding, and cardinality so connecting GA4 to BigQuery is advisable for both high-traffic and low-traffic websites.

  3. Data Management #

    Google Analytics 4 (GA4) offers two ways to export your data: to a spreadsheet or Excel file, or to BigQuery.

    The size of the data plays a significant role here.

    • Exporting to a spreadsheet or Excel file is a good option for small data sets.

      It’s easy to do and the files are easy to open and view.

    • Exporting to BigQuery is a better option for large data sets.

      BigQuery can quickly and efficiently handle large volumes of data.

    To put things into perspective, large companies can accumulate even several hundred GB of data in a single day within Google Analytics – no spreadsheet can accommodate a file this size, not to mention performing additional operations such as calculations, sorting or filtering.

    So, if you have a large data set, it is advisable to export your GA4 data to BigQuery.

  4. Data Visualisation #

    Looker Studio, previously known as Google Data Studio, is a favoured tool for data visualisation beyond the Google Analytics 4 interface.

    However, the tool integrates with Google Analytics 4 through an API. Google has set data limits on the Google Analytics Data API. If these limits are exceeded, users will be presented with messages indicating that the data download thresholds have been surpassed.

    On the other hand, by using the Google BigQuery connector, Looker Studio can connect directly to BigQuery, thus avoiding the Google Analytics Data API’s limitations.

    GA4 Looker Studio Documentation Quota Limit

    ⚠️ Note: BigQuery is naturally compatible with numerous top-tier data visualisation platforms, including “Tableau” and “Power BI”.

GA4 + BigQuery Costs #

Another reason that should prompt any company to link GA4 to BigQuery is related to the BigQuery pricing structure, which allows to store and query data for free or almost for free.

A bit of background #

In the era of Universal Analytics (GA3), only major players in the industry could afford to export raw data from Google Analytics by paying an annual cost of $150,000 for the premium version, Google Analytics 360.

With the introduction of Google Analytics 4 (GA4), this option is now available free! It’s finally free.

Free and Paid Tier #

Google BigQuery offers both a free tier and a paid tier.

BigQuery Pricing Structure
Google BigQuery Pricing Structure

The free tier includes:

  • 10 GB of data storage

  • 1 TB of query data processed per month

  • 100,000 API calls per month

⚠️ Note:

  • The free tier is available to all Google Cloud Platform (GCP) projects.

  • The free tier is renewed every month for each project.

  • The free tier is only available for the first 12 months after a project is created.

  • You must provide a credit card to use the free tier.

  • If you exceed the free tier, you will be charged for the additional usage.

⚠️ Note:

Do not confuse BigQuery free tear with BigQuery sandbox free of charge.

The BigQuery sandbox is a free, no-commitment way to try out BigQuery.

You can use it without providing a credit card or creating a billing account. However, it comes with some limitations:

  • All tables, views, and partitions automatically expire after 60 days.

  • You cannot use streaming data or DML statements.

  • You cannot use the BigQuery Data Transfer Service.

The key takeaway is that both BigQuery’s free tier and the BigQuery sandbox offer ways to use BigQuery without incurring costs. However, because they have distinct limitations, they serve different needs:

  • BigQuery Free Tier: is more appropriate for small and medium production environment, including GA4 BigQuery Linking.

  • BigQuery Sandbox: is most suitable for educational purposes, small-scale testing, or preliminary development. It is also a good option for people who want to learn more about BigQuery before deciding to use it in a production environment.

Undoubtedly, the free tier is quite generous and is enough for most small and medium-sized businesses. However, if you exceed the free tier limits, you will be charged for the additional usage.

The paid tier of Google BigQuery is pay-as-you-go model, and you are charged for three main things:

    • The (additional) amount of data you store
      • $0.02 for each GB of active storage in Google BigQuery.
    • The (additional) number of queries you run
      • $5 for each TB processed when running SQL queries on data.
    • The type of data transfer
      • $0.05 for each transferred GB in a real-time data stream (if this option is enabled).

Cost Breakdown #

According to the official documentation:

1 GB of data is equal to approximately 600,000 Google Analytics events.

So, a business generating 600,000 events per day will generate 18 million events per month, which is equivalent to 30 GB of data.

  • Data Storage

    The data storage will increase by 30 GB every month.

    Subtracting the 10 GB free storage allowance, the company is charged for 20 GB.

    At a rate of $0.02 per GB, the cost of storage will increase by $0.6 every month.

    The cost breakdown is:

    1st month: (30 GB – free 10 GB) x $0.02 = 20 GB x $0.02 = $0.4.

    2nd month: (60 GB – free 10 GB) x $0.02 = 50 GB x $0.02 = $1.

    3rd month: (90 GB – free 10 GB) x $0.02 = 80 GB x $0.02 = $1.6.

    By the 12th month: (360 GB – free 10 GB) x $0.02 = $7.

    ⚠️ Note: BigQuery automatically reduces the storage cost if data remains untouched for 90 days. It then considers this data as “long-term storage”.

  • Query

    For 1 GB of data and 3 hourly-updated reports in Looker Studio, data processing will be 1.5 TB per month.

    Subtracting the 1 TB free allowance, the company is charged for 0.5 TB.

    At a rate of $5 per TB, the total monthly cost for this data processing will be $2.50 monthly.

    The cost breakdown is:

    (1.5 TB – 1 TB free allowance) * $5 = 0.5 TB * $5 = $2.5 monthly

    ⚠️ Note: The query pricing model shown above is the β€œPay On-Demand”, which operates as a pay-as-you-go. However, BigQuery also offers a “Flat Rate” pricing model. In this alternative model, users pay a fixed price by buying a certain amount of “slots” in BigQuery. This approach becomes economical when your monthly spend with BigQuery reaches a point where it’s more cost-effective than the on-demand model.

    πŸš€ Tips:

    • Quality of written queries (Queries that are written inefficiently or that contain errors will be more expensive to run).

    • Only select the column you need.

    • Estimate your query costs before running a query.

    • Preview the data.

    • “LIMIT” doesn’t actually limit the query size.

    The below example demonstrates how a query can be optimised significantly by refining the SQL query.

    GA4 BigQuery Quality-Written Queries

  • Data Transfer

    Sending 30GB of data to Google BigQuery via real-time data streaming incurs a cost of $1.5.

    The cost breakdown is:

    30 GB * $0.05 = $1.5

    ⚠️ Note: Real-time data streaming is the most expensive way to transfer data to BigQuery. If you are not in need of real-time data analysis, you can save money by using batch data transfer.

Conclusions #

I highly recommend linking GA4 to BigQuery, especially if you are using GA4 for business analytics. It is a small investment that will pay off in the long run.

  • What is BigQuery? by Google: Link
  • [GA4] BigQuery Export integration by Google: Link
  • [Looker Studio] Limits in the Google Analytics Data API by Google: Link
  • [Google Cloud] BigQuery pricing by Google: Link

Leave a Reply

Your email address will not be published.

Thanks for commenting