Mauro Romanella

2 June 2023

In this tutorial, we will take a look at the concepts of data sampling, data thresholding and data cardinality within Google Analytics 4, as well as strategies to minimise their impact on the accuracy and reliability of your reports.

About GA4 Data Quality #

Data sampling and thresholding already existed to some degree in Universal Analytics (GA3), but they are implemented slightly differently in Google Analytics 4 (GA4).

You may have noticed that at the top of the GA4 interface there is an icon next to every report title. Google calls this icon “data quality”, and it changes to indicate whether sampling or thresholding has been applied to the report.

Icon	State	Message
A green checkmark	You are viewing the complete set of available data for the chosen dimensions and metrics.	Unsampled report: “This report is based on 100% of available data.”
An orange alert	You are viewing a portion of the available data for the selected dimensions and metrics.	Thresholding applied: “Google Analytics has applied thresholding to this card and will only display the data when the data meets the minimum aggregation thresholds.”
A red exclamation mark	You are viewing a limited proportion of the available data for the chosen dimensions and metrics.	Heavily sampled exploration: “This report is based on <10% of available data. A smaller sample size means that the data in this report is less accurate.”

In short, GA4 has two types of data limits: data sampling occurs when your data request exceeds the quota limits, and data thresholding (data hiding) occurs when your data request is based on so few users that it would be possible to identify individual users.
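As a rough illustration of thresholding, the sketch below hides any report row whose user count falls below a minimum. Google does not publish the actual thresholds, so `MIN_USERS`, the function name and the row layout are assumptions made for this example only.

```python
# Hypothetical minimum; GA4's real thresholds are system-defined and not published.
MIN_USERS = 50


def apply_thresholding(rows, min_users=MIN_USERS):
    """Return only the rows whose user count meets the minimum."""
    return [row for row in rows if row["users"] >= min_users]


# Made-up report rows for illustration.
report = [
    {"city": "Rome", "users": 1200},
    {"city": "Turin", "users": 12},
    {"city": "Milan", "users": 430},
]

visible = apply_thresholding(report)
print([row["city"] for row in visible])
```

The row with only a handful of users is withheld, which is why totals in a thresholded report can be lower than the true totals.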

GA4 relies on sampling to return data quickly and thresholding to keep it anonymous.
Lawrence Greenlee

Do not confuse the GA4 data limits with the GA4 row limits. When the underlying table that populates a Google Analytics 4 report reaches its row limit, any data exceeding that limit is consolidated into a separate row called “(other)”. This limit is tied to “data cardinality”, the number of unique values a dimension has.
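To make the “(other)” row more concrete, here is a minimal sketch of how rows beyond a limit could be consolidated. It assumes the largest rows are kept and the remainder is merged; the function name, the tiny row limit and the sample data are hypothetical, and this is not GA4's actual implementation.

```python
from collections import Counter

OTHER_LABEL = "(other)"


def collapse_to_other(counts, row_limit):
    """Keep the largest rows and merge everything beyond the limit into one row."""
    if len(counts) <= row_limit:
        return dict(counts)
    # Reserve one row for the consolidated "(other)" entry.
    kept = dict(Counter(counts).most_common(row_limit - 1))
    kept[OTHER_LABEL] = sum(counts.values()) - sum(kept.values())
    return kept


# Made-up page view counts for illustration.
page_views = {"/home": 500, "/pricing": 300, "/blog/a": 20, "/blog/b": 10, "/blog/c": 5}
print(collapse_to_other(page_views, row_limit=3))
```

The total is preserved, but the detail of the three smaller pages is lost, which is the practical cost of a high-cardinality dimension.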

Below is a summary table of the GA4 data limits (sampling, thresholding, cardinality) by report type (standard or explore):

	Standard reports	Exploration reports
GA4 Sampling	Yes (10 million events)	Yes (10 million events; more prone)
GA4 Thresholding	Yes (less susceptible)	Yes
GA4 Cardinality	Yes (50,000 rows)	No

ISSUE: In all three cases, the report may not be representative of the entire dataset, and may not be accurate enough to make decisions based on it.

SOLUTION: Linking GA4 to BigQuery is a good solution to all three cases.

Sampling. By linking GA4 to BigQuery, you can get access to the unsampled data in raw format. This means that you can access and analyze your data in BigQuery without having to worry about sampling.
Thresholding. The BigQuery export does not include the sensitive user data that comes from Google Signals. This means that you can avoid the thresholding applied to explore reports that use Google Signals data.
Cardinality. By linking GA4 to BigQuery, you can store all of your data in BigQuery, which does not have a cardinality limit. This means that you can avoid the row limit error and get access to all of your data, even if it contains high-cardinality dimensions.
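As a starting point for the BigQuery route, the sketch below builds a query that counts events and users per event name directly from the raw export tables. It relies on the GA4 export layout (daily `events_YYYYMMDD` tables with `event_name` and `user_pseudo_id` columns); the project ID, the dataset name and the function name are placeholders, and the code only builds the SQL text without running it.

```python
from datetime import date


def build_event_count_query(project, dataset, start, end):
    """Build a query counting events and users per event name from raw export tables."""
    return (
        "SELECT event_name, COUNT(*) AS event_count, "
        "COUNT(DISTINCT user_pseudo_id) AS users\n"
        f"FROM `{project}.{dataset}.events_*`\n"
        # _TABLE_SUFFIX restricts the wildcard to the daily tables in the range.
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'\n"
        "GROUP BY event_name\n"
        "ORDER BY event_count DESC"
    )


# Placeholder project and dataset names.
query = build_event_count_query(
    "my-project", "analytics_123456789", date(2023, 5, 1), date(2023, 5, 31)
)
print(query)
```

You could paste the printed query into the BigQuery console, or pass it to a BigQuery client of your choice once credentials are set up.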

In detail:

GA4 Data Sampling #

What is … #

Sampling is a method that allows Google Analytics to generate reports by examining only a representative portion of the data, rather than querying the entire dataset.

Do not confuse data sampling with the HyperLogLog++ (HLL++) algorithm.

Data sampling and the HyperLogLog++ (HLL++) algorithm are two different techniques used in Google Analytics 4 (GA4) to approximate results over large datasets. While both methods provide approximations, they serve distinct purposes and have varying levels of accuracy.

Data Sampling: Selects a subset of data to represent the entire population.
HyperLogLog++: Estimates cardinality (number of distinct values).

More on that later.
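To see why a sampled report is an approximation, consider this minimal sketch. It takes every n-th event (a systematic sample, chosen here only to keep the example deterministic) and scales the result back up. The function name and the synthetic data are assumptions, and GA4's actual sampling method may differ.

```python
def estimate_total_from_sample(values, step):
    """Estimate the sum of all values from every step-th value."""
    sample = values[::step]
    # Scale the sampled sum back up by the sampling interval.
    return sum(sample) * step, len(sample)


# Synthetic revenue per event: 1,000 events with values cycling through 0..9.
revenue = [i % 10 for i in range(1000)]

exact = sum(revenue)
estimate, sample_size = estimate_total_from_sample(revenue, step=7)
print(f"exact={exact} estimate={estimate} sample_size={sample_size}")
```

The estimate is close to the exact total but not identical to it, and the gap tends to widen as the sample gets smaller.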

Why it happens … #

This approach lets Google keep compute costs down, but it means a report may show an approximate value instead of the exact one.

When it happens … #

Google Analytics samples data when the volume of data needed to generate your report exceeds the processing quota limit for your GA4 property.

Note: As of November 2023, Google Analytics 4 (GA4) has reintroduced sampling in standard reports.

What is the quota limit … #

Each report type, standard and explore, has a quota limit, typically measured in events processed.

Standard and Explore reports share the same default quota of 10 million events. Even so, Explore reports are more likely to hit this limit due to:

Data Source: Standard reports leverage pre-aggregated data, significantly reducing processing needs. Explore reports, on the other hand, access raw, unaggregated data, requiring more processing power for analysis.
Query Complexity: Explore reports allow for intricate queries involving a larger volume of events or users. These intricate queries can easily exceed the quota, leading to sampling.

In Summary:

Standard reports, with their pre-aggregated data and simpler queries, can handle larger datasets before sampling becomes necessary. Explore reports, due to the flexibility and potential for complex queries, have a higher chance of triggering sampling, especially when dealing with large datasets.

Note: The quota limit of 10 million events applies to GA4 standard (free version), while it is extended to 1 billion events for GA4 360 (paid version).
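The quota figures above can be turned into a quick check. The sketch below assumes that, once the quota is exceeded, roughly one quota's worth of events is used for the report; that assumption, the function names and the dictionary keys are illustrative, not documented GA4 behaviour.

```python
# Quota limits quoted in this article, in events per query.
QUOTA_EVENTS = {"standard": 10_000_000, "360": 1_000_000_000}


def is_sampled(events_in_query, property_type="standard"):
    """Return True when the query needs more events than the quota allows."""
    return events_in_query > QUOTA_EVENTS[property_type]


def approximate_sample_share(events_in_query, property_type="standard"):
    """Approximate share of the data used, assuming the sample is capped at the quota."""
    quota = QUOTA_EVENTS[property_type]
    if events_in_query <= quota:
        return 1.0
    return quota / events_in_query


print(is_sampled(12_000_000), is_sampled(12_000_000, "360"))
print(approximate_sample_share(40_000_000))
```

Under that assumption, a 40 million event query on a free property would be built from about a quarter of the data.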

Note #2: GA4 standard reports have a limit of 10 million events that can be displayed. This means that if your report contains more than 10 million events, it will be truncated and you will not be able to see all of the data beyond this 10 million event limit.

What can you do … #

If you wish to remain inside Google Analytics:
- In GA4 Standard, you can reduce the date range. This shrinks the data set until it falls below 10 million events, which avoids sampling.
- In GA4 360, Google added a feature that lets users toggle between faster analysis (sampling at 100 million events) and more detailed results (sampling at up to 1 billion events). There is also an option to request an unsampled exploration report.
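If you know roughly how many events your property collects per day, you can estimate how far back a date range can go before it crosses the quota. This is a minimal sketch; the function name and the daily figures are made up for illustration.

```python
def longest_unsampled_window(daily_events, quota=10_000_000):
    """Return how many of the most recent days fit within the quota."""
    total = 0
    days = 0
    # Walk backwards from the most recent day until the quota would be exceeded.
    for count in reversed(daily_events):
        if total + count > quota:
            break
        total += count
        days += 1
    return days


# Hypothetical property collecting 450,000 events per day for 60 days.
daily = [450_000] * 60
print(longest_unsampled_window(daily))
```

With these figures, a date range of about three weeks stays under the 10 million event quota, while a full month does not.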

If you analyse the data outside of Google Analytics, you have other options:
- Use BigQuery to access your raw data
  
  This is particularly interesting because in Universal Analytics this functionality was only available to paid accounts.
  
  Tip: Connect BigQuery to Looker Studio through its BigQuery connector.
- Export your data
  
  You can try to export your data in shorter time frames, then assemble and aggregate it later in a spreadsheet.

  Tip: Use the GA4 API to pull the data into Google Sheets.
- Use a third-party tool
  
  If your data is growing rapidly and a Google spreadsheet can no longer store and process it, you should consider a data warehouse.
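The "export in shorter time frames and aggregate later" approach can be sketched as follows: split the full date range into smaller chunks, export each chunk separately, then sum the results. The helper names and the sample numbers are hypothetical, and the export step itself (API call or manual download) is left out.

```python
from collections import Counter
from datetime import date, timedelta


def split_date_range(start, end, chunk_days):
    """Yield (chunk_start, chunk_end) pairs covering start..end inclusive."""
    current = start
    while current <= end:
        chunk_end = min(current + timedelta(days=chunk_days - 1), end)
        yield current, chunk_end
        current = chunk_end + timedelta(days=1)


def merge_chunk_totals(chunk_results):
    """Sum per-event counts across the chunk exports."""
    totals = Counter()
    for result in chunk_results:
        totals.update(result)
    return dict(totals)


chunks = list(split_date_range(date(2023, 1, 1), date(2023, 1, 31), chunk_days=10))
print(chunks)

# Made-up per-chunk exports of event counts.
merged = merge_chunk_totals([{"page_view": 100, "purchase": 2}, {"page_view": 50}])
print(merged)
```

Note that this only works for additive metrics such as event counts. User counts cannot simply be summed across chunks, because the same user may appear in more than one time frame.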

FAQ #

Does Google Analytics sampling affect Data Studio (now Looker Studio)?

Yes.

Does Google Analytics sampling affect the API?

Yes.