When marketing decisions rely on website analytics, accuracy matters.
However, Google Analytics and other analytics platforms sample data to
generate reports, which can sometimes misrepresent the true data trends.
In this article, we’ll explore how Google Analytics 4 uses data sampling, how its quota limits have changed, and why sampled data can still be problematic for high-volume web properties.
What is data sampling?
Data sampling is a statistical analysis technique that uses a smaller subset of data to analyse and identify trends within a larger data set.
It can be helpful when collecting a complete data set is challenging, such as in political surveys, or when the data set is so large that preparing and processing all of it becomes impractical.
It’s standard practice in many industries where either of these constraints applies. For example, Gallup can’t survey the entire US population, so they use representative samples instead.
To ensure a sample is representative, the subjects that make up the sample must be selected carefully. This step is key to avoiding data selection biases (more on that later).
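To make the idea concrete, here’s a minimal Python sketch, using entirely made-up visit data rather than anything from GA, that estimates a conversion rate from random samples of different sizes and compares each estimate with the full data set:

```python
import random

random.seed(42)

# Hypothetical "full data set": one million visits, roughly 3% of which convert.
population = [1 if random.random() < 0.03 else 0 for _ in range(1_000_000)]
true_rate = sum(population) / len(population)

# Estimate the same conversion rate from progressively smaller random samples.
for sample_size in (100_000, 10_000, 1_000):
    sample = random.sample(population, sample_size)
    estimate = sum(sample) / sample_size
    relative_error = abs(estimate - true_rate) / true_rate * 100
    print(f"sample of {sample_size:>7,} visits: estimated rate {estimate:.4f} "
          f"({relative_error:.1f}% off the full-data figure of {true_rate:.4f})")
```

A well-chosen sample usually lands close to the real figure, which is exactly why the technique is so widely used; the trouble starts when the sample is small or unrepresentative.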
Why does Google Analytics use data sampling?
Google Analytics limits the amount of data it processes for reports, especially for free users. Simply put, the company does this to save cloud computing resources (which are in ever-greater demand as more and more products embrace AI).
The larger the data set, the more computing resources are required to finish the calculations for a report. As a result, Google Analytics tends to use complete data for shorter-term, less data-heavy reports but sampled data for more in-depth analysis.
For example, look at this basic report from a site with limited traffic. A checkmark indicates that the card is “unsampled” and uses 100% of the available data. (Notice the green reporting icon — for sampled reports, this is red.)
But for web properties with heavy traffic and more complicated reports like a funnel analysis or cohort analysis, the results are almost guaranteed to be sampled. Things get even worse when comparing multiple data sets, for example, two user segments against each other or against a baseline. A 12-month funnel report in GA might only use 48.3% of available data, as shown below.
The more advanced the analysis, the more likely GA and other analytics tools are churning out results that don’t show the full picture.
Changes to data sampling in Google Analytics 4
In Universal Analytics, before its sunset, reports were based on a maximum sample of 500,000 user sessions: any query covering more sessions than that was sampled. For a website receiving more than a few thousand sessions per day, longer date ranges could quickly cross that threshold.
With the change to GA4, the sampling threshold is now set to 10 million “events.” This sounds like a massive upgrade at first, but since events are essentially individual data rows, each session can represent dozens of separate events, depending on how the site is tagged.
So traffic volume isn’t the only limiting factor. Every secondary dimension or segment added makes the underlying query heavier, which often means a report that initially used unsampled data starts sampling once it’s rerun to compare segments or add more nuance.
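As a rough back-of-envelope illustration, the sketch below shows how quickly monthly traffic turns into events measured against the 10 million quota. The traffic volumes and events-per-session figures are assumptions for illustration, not GA benchmarks:

```python
# Back-of-envelope arithmetic: sessions per month x events per session,
# compared against GA4's 10 million event quota. All figures below are
# illustrative assumptions, not GA benchmarks.
QUOTA = 10_000_000

scenarios = {
    "content site, light tagging": (200_000, 15),   # sessions/month, events/session
    "mid-size ecommerce store":    (500_000, 25),
    "high-traffic ecommerce":      (2_000_000, 30),
}

for name, (sessions, events_per_session) in scenarios.items():
    monthly_events = sessions * events_per_session
    months_covered = QUOTA / monthly_events
    print(f"{name}: ~{monthly_events:,} events/month "
          f"-> the quota covers roughly {months_covered:.1f} months of data")
```

In other words, a busy ecommerce property can exhaust the quota with less than a month of data, long before an annual report comes into the picture.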
Google has stated that the 10 million event quota only applies to “standard reports,” so more complex reports can start relying on sampled data with far fewer events. Beyond this, there’s limited public information about the sampling method or how GA chooses its random sample.
Bottom line? Data sampling still affects high-traffic properties and marketers who use advanced reports. Using custom dimensions or events is another limiting factor.
Further reading: 10 Key Google Analytics Limitations You Should Be Aware Of
Why data sampling can be a problem for web analytics
Google Analytics’ official explanation of data sampling gives the example of estimating the number of trees in a large area by extrapolating from a single acre: with 800 trees per acre across 100 acres, the estimated total for the property is around 80,000 trees.
This comparison is fairly disingenuous, because a forester doesn’t need complete data accuracy to make smart decisions about the future of their plot. They can also run an aerial survey of the area first to check that the forest is reasonably uniform, which lets them pick an acre of trees that accurately represents the rest of it.
With web analytics, it’s much harder to avoid sample selection bias and to represent every traffic source and type of visitor equally.
Ultimately, this means it’s difficult to find a data sample that truly reflects the average user’s behaviour. If the analytics platform happens to select a disproportionate number of visits from a specific promotion, it can inflate or deflate sales numbers.
And that’s only a single factor that can impact data accuracy.
The typical margin of error
Okay, so the data is sampled, but how bad is it? On average, the reports are pretty accurate, but the smaller the sample, the larger the typical margin of error gets.
While the margin of error can be as low as 1%, multiple users have reported errors of up to 30% for smaller date ranges, with an average of around 5%.
This makes it challenging to generate accurate reports for a high-volume website, where 10 million events may only represent a tiny portion of the annual traffic.
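As a purely synthetic illustration of that spread (this is not a description of GA’s actual sampling algorithm), the simulation below repeatedly samples a made-up data set at different rates and records how far the estimated conversion rate drifts from the real one:

```python
import random
import statistics

random.seed(0)

# Synthetic "full data set": 200,000 visits with a true conversion rate of 3%.
N, TRUE_RATE = 200_000, 0.03
population = [1 if random.random() < TRUE_RATE else 0 for _ in range(N)]
actual = sum(population) / N

# For each sampling rate, repeat the sampling 30 times and record the
# relative error it introduces into the estimated conversion rate.
for rate in (0.5, 0.1, 0.01):
    k = int(N * rate)
    errors = []
    for _ in range(30):
        sample = random.sample(population, k)
        errors.append(abs(sum(sample) / k - actual) / actual * 100)
    print(f"{rate:>4.0%} of the data -> typical error {statistics.median(errors):.1f}%, "
          f"worst of 30 runs {max(errors):.1f}%")
```

Even in this toy setup, the error at a 1% sampling rate is an order of magnitude larger than at 50%, which is roughly the spread users report.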
This isn’t exactly great news after seeing GA sampling rates below 50% for a simple annual report. And that’s before we even think about long-term segment comparisons (which add more variables and events to the mix).
And that’s not the only reason Google Analytics may not be as accurate as some might think.
3 ways to reduce data sampling in Google Analytics
Want more accurate reporting? To reduce data sampling in Google Analytics reports for high-volume sites, use one of these three approaches:
Reduce the report’s time frame to increase accuracy
Most users can avoid sampling by focusing on a shorter period. Reducing the date range will likely bring the total number of events under the data sampling limit, allowing them to work with actual data.
This is a good idea for accurately estimating short-term trends, like the impact of a new campaign. But it’s not a permanent fix.
It makes creating reliable long-term reports very labour intensive. For example, if 100% data accuracy is only achievable on a 30-day report, data from three reports would be needed for a single quarterly report. Annual reports would be even more taxing.
Plus, any segment-based reports would need to be done manually, as they quickly start to rely on sampled data.
The key to unsampled reports is to keep them simple and the date range short. (If this sounds like the opposite of thorough analysis, it is. For detailed insights, a large data set is preferable to a small one.)
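For teams that pull numbers programmatically, one way to make this less labour intensive is to split a long period into shorter windows and request each one separately through the GA4 Data API, then combine the results. Here’s a rough sketch; the property ID, dimension, and metric are placeholders, and it assumes the google-analytics-data client library is installed and authenticated:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# Placeholder property ID; authentication is assumed to be configured
# (e.g. via GOOGLE_APPLICATION_CREDENTIALS).
PROPERTY_ID = "123456789"

client = BetaAnalyticsDataClient()

# One request per month instead of a single 12-month query, to keep each
# individual query smaller.
monthly_windows = [
    ("2024-01-01", "2024-01-31"),
    ("2024-02-01", "2024-02-29"),
    # ... remaining months ...
]

totals = {}
for start, end in monthly_windows:
    request = RunReportRequest(
        property=f"properties/{PROPERTY_ID}",
        dimensions=[Dimension(name="sessionDefaultChannelGroup")],
        metrics=[Metric(name="sessions")],
        date_ranges=[DateRange(start_date=start, end_date=end)],
    )
    response = client.run_report(request)
    for row in response.rows:
        channel = row.dimension_values[0].value
        totals[channel] = totals.get(channel, 0) + int(row.metric_values[0].value)

for channel, sessions in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{channel}: {sessions:,} sessions")
```

Whether this keeps every query comfortably within the quota still depends on the property’s traffic, so treat it as a mitigation rather than a guarantee.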
Export the data to third-party data analysis tools (limits still apply)
Exporting raw data to third-party data analysis platforms may provide full control over whether or not to use sampling when generating reports.
For example, Google Data Studio (now Looker Studio) and Google BigQuery don’t use sampling by default. (Some spreadsheet enthusiasts even use Google Sheets.)
However, data exports are also limited on the GA side, so unfortunately, this isn’t a complete fix. Once a property records more than one million events per day, the standard daily export hits its cap and the exported data is no longer complete.
Depending on the platform, there may be a workaround that gradually collects data from shorter-term requests. However, it’s not guaranteed to work, so it’s probably not the best solution.
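For reference, once the BigQuery link is set up, GA4 writes raw event data into daily tables that can be queried without sampling. Here’s a minimal sketch; the project and dataset names are placeholders, and it assumes the google-cloud-bigquery client is installed and authenticated:

```python
from google.cloud import bigquery

# Placeholder dataset: GA4's BigQuery export creates one dataset per property,
# named analytics_<property_id>, with one events_YYYYMMDD table per day.
DATASET = "my-project.analytics_123456789"

client = bigquery.Client()

query = f"""
    SELECT
      event_name,
      COUNT(*) AS events,
      COUNT(DISTINCT user_pseudo_id) AS users
    FROM `{DATASET}.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240331'
    GROUP BY event_name
    ORDER BY events DESC
"""

for row in client.query(query).result():
    print(f"{row.event_name}: {row.events:,} events from {row.users:,} users")
```

The export cap mentioned above applies to the daily export, so very high-volume properties may still need extra arrangements to capture everything.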
Upgrade to Google Analytics 360 (and select “more detailed results”)
Upgrading to Google Analytics 360, the paid version of GA, increases the data quota limit to 100 million events by default.
Selecting the “more detailed results” option (by clicking the shield icon in the upper-right corner of any report or card in the Explore view) increases the upper limit to one billion events. For free accounts, this shield icon shows a red exclamation mark instead.
Still, this means paying for GA without resolving any of its issues beyond data sampling.
Read more: GA360 vs GA4: Key Differences and Challenges
Avoid data sampling entirely by switching to a reliable alternative
Many Google Analytics alternatives avoid the issue entirely by not using any data sampling in the first place. In practice, this can open up 20-50% more data, depending on the total traffic volume and the complexity of the reports.
And this isn’t the only benefit of switching to a reliable and accurate alternative.
There are still privacy concerns after the launch of GA4
Even though Google Analytics 4 was released in response to increasingly strict privacy regulations, regulators are still far from satisfied. In particular, GA4 has had multiple issues with GDPR and is still not compliant by default.
And getting slapped with a fine isn’t the only problem. A violation report can also bring orders to change analytics tools or completely rework the GA4 setup. For a team that has spent months transitioning to GA4, that’s the last thing anyone wants, especially during an important campaign.
A more intuitive user interface and insightful reports out-of-the-box
Is decoding data in the GA4 interface a struggle? It’s one of the biggest issues with GA4 that marketers and ecommerce store owners bring up.
The crucial role of website analytics is to unlock meaningful insights. It shouldn’t feel like a puzzle or brain exercise.
With Matomo, there’s no need to work through three different types of conversion rates or create custom reports to gain insights into a sales funnel. The reports are clear, easy to navigate, and there’s no sampling. Ever.
Users can still create custom reports and dashboards, and export almost any view when more in-depth analysis is needed.
Keep using the metrics that you know and love
Matomo uses the clear metrics that most web marketers and analysts have grown used to from a decade of using Google Analytics. That includes:
- Page views
- Unique visits
- Bounce rates
- User sources
Get rid of data sampling with Matomo
Switching to Matomo eliminates data sampling and accuracy issues (as it always uses 100% of the data for reports) and provides full data ownership and compliance with the world’s strictest privacy regulations.
Not to mention, Matomo offers access to behavioural analytics features like Heatmaps, A/B Testing, and Session Recordings, all within the same platform, to improve user experiences.
This mix of features, reliability, and complete data ownership (without sampling) is why over 1 million websites rely on Matomo analytics. Reliable insights are the foundation of good digital marketing.
Sign up for a 21-day free trial today — no credit card required.