• Main
  • /
  • Blog
  • /
  • Comparing traffic sources between GA4 and session_traffic_source_last_click in BigQuery
bg-image
October 18, 2024

Comparing traffic sources between GA4 and session_traffic_source_last_click in BigQuery

  • Intermediate
  • Google BigQuery
  • GA4
  • Last Non-Direct

This article covers the following: I analyzed 15 projects, comparing the number of users by traffic source/medium from the GA4 interface with the values in the session_traffic_source_last_click.manual_campaign object in BigQuery.

The reason for the study is the appearance of the session_traffic_source_last_click itself, and the claim that it represents the session source based on GA4's last non-direct attribution model. To verify this, I collected data over a certain period and decided to investigate whether this claim holds true.

Summary: The data matched in less than half of the cases, and BigQuery (BQ) generally shows fewer google / cpc entries compared to the GA4 interface.

Now, let's break it down step by step:

Background

At the end of July, Google added the session_traffic_source_last_click (STSLC). Initially, only manual_campaign and google_ads_campaign were included. A few months later, during the research and while writing this article, Google added additional fields: cross_channel_campaign, sa360_campaign, cm360_campaign, dv360_campaign:

2.1

I decided to focus on researching the manual_campaign f fields without waiting for more cross_channel_campaign to accumulate, which could take several more months. Perhaps, one day, I will test cross_channel_campaign as well. If you're interested in those results, feel free to leave a comment at the end of the article.

As for manual_campaign and google_ads_campaign, they were tested almost immediately, and it's been claimed that they match the interface, i.e. that it is not the last click attribution that comes from the object name or even from the help, but the last non-direct from the interface.

Here's a link to one of those statements:

https://www.linkedin.com/posts/luka-cempre_simo-ahava-posted-about-link-in-comments-activity-7219831533149335553-Z3uV

I was skeptical about these claims because Google's updates are often rough around the edges . However, if they turned out to be accurate, it would simplify reporting greatly — allowing users to generate source reports based on raw data using the last non-direct model, which is one of the most popular attribution models. Previously, such a task required advanced SQL knowledge and complex data processing logic.

Given the availability of STSLC for several months, we decided it was time to examine these fields and compare them with the interface data.

Research methodology

Before presenting the results, it's essential to understand how the study was conducted.

The average data period was about 1.5 months. Most of the projects had a lot of historical data, where I took 2 months. However, there were also newly connected BQ projects, where data was collected for only 2-3 weeks.

I gathered GA4 data from the Explore block. I selected Session source/medium and Total Users for the same period as in the SQL script, having changed the Reporting Identity to Device-Based where the method was different. We decided to equalize the testing conditions for all projects because not all of them collect user_id. Sometimes there is no registration/login functionality.

But of course, where possible, it's recommended to base source identification on user_id, which is the internal user ID from your own database.

Why did I count users if I was studying session level parameters?

The help describes a lot about how the interface counts sessions. It explains that sessions are calculated using a specific algorithm that isn't available in BigQuery, leading to session count discrepancies. This, fortunately, does not apply to the usual client ID - Total Users in the Explore block with Device-based reporting identity and COUNT(DISTINCT user_pseudo_id) in BigQuery are the same. To reduce factors influencing discrepancies, I focused on counting users.

From STSLC, I took the source and medium fields from the manual_campaign object and aggregated users with the COUNT(DISTINCT user_pseudo_id) function.

Next, I uploaded the results to Power BI, where I combined all (not set) with (direct) / (none) to highlight undefined traffic only if the difference was significant.

By the way, STSLC doesn't have (direct) / (none) at all. If you make a deductive comparison with the interface number of (direct) / (none) traffic, then (not set) / (not set) is transmitted instead. But the interface (not set) seems to be simply the absence of any values in STSLC.

Result and comparison table

The results were somewhat unexpected. The data from STSLC fields differed from the interface in 60% of cases, namely in 9 projects out of 15.

Google states that data between the export to BigQuery and the interface may differ by 2-5%, so you shouldn't expect 100% accuracy. Here's a link to the help. Discrepancies of up to 5% were ignored.

Compared to the interface in BigQuery:

  1. In 7 cases (47%) there were fewer google / cpc.
  2. In 4 cases (27%) there was more direct traffic.
  3. In 4 cases (27%) there was more google / organic. And all of them had less google / cpc, which suggested that some of the cpc was defined as organic.
  4. In 4 cases (27%), instead of google / cpc(assigned to all Google advertising traffic in the interface), BQ showed different values—such as custom tags like google_pm, or subdomains of the Google advertising network.
  5. In 2 cases (13%), there was a lot of strange traffic from YouTube.
Compared to the GA4 interface in BigQuerydata between the BQ and GA4 fields are the samesomething else instead of google / cpcmore direct trafficmore google / organicless google / cpcstrange traffic from YouTube
number of projects644472
share40%27%27%27%47%13%
project 1*----+-
project 2-+----
project 3-+++++
project 4--+-+-
project 5---++-
project 6+-----
project 7+-----
project 8+-----
project 9+-----
project 10+-----
project 11-+++++
project 12*----+-
project 13-+-++-
project 14**--+---
project 15+-----

*Projects 1 and 12 had little advertising traffic in general, but the discrepancy was 70% and 98%, respectively. Other sources dominated the traffic, but their discrepancies were less than 5%, so they weren’t considered significant enough to include in the table.

**In project 14, BigQuery had 6% more (direct) / (none) and 2% less google / organic, which is why I didn't include this difference in the table. However, in terms of the absolute difference, it looked like a significant part of the difference in organic traffic was categorized as direct. After checking this hypothesis in BigQuery, I found that the situation was somewhat worse - not only google / organic was partially attributed to direct , but other sources were also misclassified as direct.

2.2

I continued testing hypotheses about why STSLC had more direct traffic and google/ organic.

As for google / organic, I found a pattern - this usually happened due to the lack of UTM tags: even when gclid or gbraid (which clearly indicate an ad click) were present, the sources were often still listed as google / organic.

2.3

At this point, I considered emphasizing the importance of UTM tags again, but the screenshot below was somewhat discouraging…

Even with UTM tags, Google might not "see" them and attribute an ad click from Google to, say, Instagram.

2.4

However, cases like the one shown in the second screenshot are much less frequent, so UTM tags can significantly improve tracking - don't forget about them!

Here is an example to confirm this - project 13, where there’s less google / cpc, and more google / organic, as if something else was recorded instead of google / cpc:

2.5

In this case, what should have been categorized as google / cpc, is written in BigQuery to the source syndicatedsearch.goog / referral. This looks like a mistake, but it is not, just like in the google / organic case. The reason is that the source logic in manual_source is based on UTM tags. Since there were no UTM tags for syndicatedsearch.goog, the source was determined by the domain of the site where the click occurred (page_referrer):

2.6

The interface uses gclid in its source determination model, so it can use gclid data from the advertising account (as long as Google Ads is connected to GA4 ) and categorize such traffic as google / cpc, even without any tags in the link.

2.7

In general, even if you count all users from google / cpc and syndicatedsearch.goog / referral, in BigQuery, the number still won't match GA4. This discrepancy is also impacted by the absence of UTM tags, as some of the google / cpc into google / organic, skewing the data in favor of organic traffic.

I hope these examples clearly demonstrate the importance of UTM tags and why relying solely on auto-tagging isn’t enough.

Some interesting cases

Let’s now go through some more interesting cases with screenshots, particularly projects 3 and 11, where the most significant discrepancies occurred.

Project 3

Here’s a screenshot from the GA4 interface. I highlighted the period and the last four digits of the resource ID. A side-by-side comparison will follow below.

2.8

Here’s a screenshot from BigQuery with the same period and resource ID.

2.9

Here’s the comparison screenshot. Only bing / organic matched... And google / cpc was distributed among organic, google_pm, doubleclick and syndycatedsearch.

2.10

In this project, the links from google_pm did not have utm_medium, so Google either categorized the traffic as organic or referral according to its processing algorithms. The point here is that simply adding UTM tags isn’t enough — you need to use them correctly.

Project 11

GA4 interface:

2.11

BigQuery results:

2.12

Here’s the comparison screenshot:

2.13

I think no additional comments are necessary here — just comparing the first lines is enough.

Final thoughts

The results of this study were surprising. I expected to see data similar to the Traffic Acquisition report in GA4. Of course, it’s not perfect, but I thought session attribution just needed minor adjustments and could be used. However, as you’ve seen, the results vary greatly. Some people were more “fortunate“, and data between session_traffic_source_last_click and the interface almost matching, while somewhere, the data differed significantly.

The main reason for these discrepancies is either the absence of UTM tags or their illogical structure.

Sure, you can use tags like utm_source=google_ads&utm_medium=ppc, but you’ll likely still see google / cpc in the interface, because this is the common and “logical” name of this source. Google processes its advertising traffic as google / cpc. Here is the help.

Long before this comparison and before the new fields appeared, our team had already created a custom SQL script for determining session sources using the last non-direct model, and we used it for our projects. When the new fields were introduced, it seemed that the script had was no longer necessary. But before writing it off, I decided to test it on projects where the data between the interface and STSLC matched. In my opinion, there is no point in comparing where the data differed because it is clear that a custom script would have corrected those discrepancies in STSLC source definitions. What was interesting was whether the custom processing logic could identify sources even more accurately than both the interface and STSLC.

At the end of June 2024, Maks and I held a webinar where we discussed defining sources in the GA4 interface and shared our BigQuery script. If you are interested, you can watch the recording on PROANALYTICS.ACADEMY.

As a result, in 5 projects out of 6, our script identified sources better and showed less direct traffic overall. Below is a typical comparison screenshot from these five projects:

  • page_referrer - the page from which a user came,
  • page_location - the page to which the user landed,
  • source_medium - values from the session_traffic_source_last_click.manual_campaign.source and session_traffic_source_last_click.manual_campaign.medium fields,
  • source_medium_proanalytics_script - source and medium values as a result of processing the same data by our script
2.14

I have no other explanation for why Google didn't identify the source when there were tags and a referral, except that it's a bug. According to the help, these sources should have been identified. Our script works based on these principles but with some improvements, as raw data gives you complete flexibility in processing. Our script not only fixes errors like this but also:

  1. ignores payment systems for the entire data history, not just from the moment they are excluded in the referral list,
  2. does not take into account internal utm tags, which is generally a bad practice,
  3. combines social media subdomains into one name (facebook/instagram) and attributes such traffic to the social medium instead of referral,
  4. does not identify as a traffic source the redirect to the site after registration / login from a Google / Microsoft / YouTube account,
  5. fixes illogical Google Ads markup to google / cpc and explicitly identifies traffic that has not been tagged with utm. By the way, such traffic will have an “(organic)” campaign in the interface.

The new fields are a good step forward. With proper tagging, you can get last non-direct attribution out of the box. However, this solution, as you’ve seen, there are still bugs and shortcomings that can only be fixed by writing your own source attribution logic.

And finally:

How to compare data on your project yourself

If you want to check if the data in STSLC matches the interface yourself, then do the following:

  1. Run this script in BigQuery, after changing “project_id.analytics_dataset” to match your project ID and the dataset with GA4 data, which follows the format analytics_{{stream_ID}}.

Also, adjust the date range as needed.

sql
SELECT
  IFNULL(IF(source_medium = '(not set) / (not set)', '(direct) / (none)', source_medium), '(not set)') AS source_medium,
  COUNT(DISTINCT user_pseudo_id) AS users
FROM (
  SELECT
    user_pseudo_id,
    session_traffic_source_last_click.manual_campaign.source || ' / ' || session_traffic_source_last_click.manual_campaign.medium AS source_medium,
  FROM
    -- change project_id.analytics_dataset to your project id and analytics dataset
    `project_id.analytics_dataset.events_20*`
  WHERE user_pseudo_id IS NOT NULL
    AND PARSE_DATE('%y%m%d', _table_suffix)
    -- start and end of the data period
    BETWEEN '2024-08-01'
        AND '2024-08-31'
  )
GROUP BY 1
ORDER BY 2 DESC

Save the results to a table in any of the following ways:

2.15

2. In the GA4 interface, go to Reporting Identity and change the method to Device-Based if it was different before. After reconciliation, you can return to the way it was before.

2.16

3. Next, go to Explore, and there build a report for the same period as in the first step, using Session source / medium as a parameter and Total Users as a metric. Export the result as a table.

2.17

4. Finally, compare the discrepancies using VLOOKUP or other usual methods. If the difference in channels is within 5%, you can use the STSLC data. If not, I recommend writing your own logic. If you need help with this complex task, feel free to contact our analytics team for assistance.

Comments