What the rollout of Google Analytics 4 means for website investigations
The good news is you can still connect sites together by an analytics ID. But there's bad news, too...
In 2011, writer and technologist Andy Baio published an article in Wired that explained how he uncovered the identity of an anonymous blogger.
“The unlucky blogger slipped up and was ratted out by an unlikely source: Google Analytics,” he wrote.
Google Analytics is a free and popular service that measures the audience of an online property. Data from BuiltWith shows it’s currently used by close to 40 million sites.
Baio’s technique relied on the fact that each Google Analytics account is assigned a unique ID that looks like this: UA-112340701-1. You can easily identify a GA ID in the source code of a webpage. Locate the same ID on different sites, and there’s a good chance they’re run by the same person or group.
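As a rough illustration of the technique, here is a minimal Python sketch that pulls UA- IDs out of page source and reports any ID that appears on more than one site. The regular expression is an assumed shape for legacy IDs, and the page snippets and site names are invented for the example; real pages would need to be fetched and saved first.

```python
import re

# Assumed shape of a legacy ID: "UA-", an account number, "-", a property number.
UA_PATTERN = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")


def extract_ua_ids(html: str) -> set[str]:
    """Return every distinct UA- style ID found in a page's source."""
    return set(UA_PATTERN.findall(html))


def shared_ids(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each ID to the sites it appears on, keeping only IDs seen on 2+ sites."""
    seen: dict[str, set[str]] = {}
    for site, html in pages.items():
        for ga_id in extract_ua_ids(html):
            seen.setdefault(ga_id, set()).add(site)
    return {ga_id: sites for ga_id, sites in seen.items() if len(sites) > 1}


# Made-up page snippets, for illustration only.
pages = {
    "site-a.example": "<script>ga('create', 'UA-112340701-1', 'auto');</script>",
    "site-b.example": "<script>gtag('config', 'UA-112340701-1');</script>",
    "site-c.example": "<script>ga('create', 'UA-99999999-1', 'auto');</script>",
}
print(shared_ids(pages))
```

A shared ID is a lead, not proof. It still needs to be corroborated with other evidence before you conclude two sites have the same operator.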
Four years later, researcher Lawrence Alexander used the same technique to reveal connections between seemingly disparate sites about Syria, Ukraine and other topics. He showed they were likely part of a Russian propaganda operation.
The technique Baio and Alexander helped popularize is today a standard approach for website investigations. I explained how to do it in this chapter for the Verification Handbook for Disinformation and Media Manipulation.
That's the background. Here's the news: as of July 1, Google deactivated the familiar "UA-" ID format as part of the launch of Google Analytics 4.
Publishers no longer have to add a UA- ID to their site to use GA. Sites that previously had a UA- ID were required to add a new ID. The migration to GA4 has been a source of concern in the OSINT and digital investigative communities. Will we be prevented from connecting sites together via an analytics ID? Will the old UA- ID be removed from sites, eliminating evidence? And will the services we used to search for sites connected by the same analytics ID still work?
I reached out to Google and others to get some answers. Here’s what I learned.
The good/neutral news:
Google does not require a site to remove its existing UA- ID as part of the migration to GA4, according to a company spokesperson. A legacy UA- ID remains on a site unless an owner chooses to remove it.
Google Analytics 4 now uses the G- ID, known as the Google tag. This was already in use prior to July 1 and has been collected/tracked by core services like DNSlytics. “We are collecting AW-, DC-, G- and GTM- IDs since Q4 2022,” Paul Schouws, who runs DNSlytics, told me.
Regular users of BuiltWith’s “relationship” tab for websites know it collects and displays data for G- IDs, among others. This means two key services for connecting sites together via analytics IDs have been collecting the relevant GA4 IDs. Good news!
DNSlytics has a cool new search interface in beta. You can search by IDs and craft boolean queries. Test it out here. Nice job, Paul!
I reached out to SpyOnWeb, another popular ID search tool, but did not hear back. As of now, the search prompt on the site does not suggest G- as a searchable ID. It’s also unclear what Microsoft Defender (formerly RiskIQ), DomainTools and other services will do. I’m hopeful many will begin to collect G- and GTM- IDs if they don’t already. (Know of a service that’s collecting G- IDs? Tell me in the comments!)
The bad news:
Google is eliminating the suffix that was part of UA- IDs. This was the number that followed the core ID. For example, the suffix in this ID is “-3”: UA-3742720-3. If the suffix was greater than one, it typically meant an ID was used on multiple sites. That was a helpful indication you might be onto a network. Now the signal is gone. Why Google, why?!
The Google spokesperson said they believe it will be “commonplace” for sites to remove defunct UA- IDs. We’ll have to wait and see how this plays out. I checked a few news sites and my very unscientific sample revealed the old UA- IDs were removed and replaced by a G- or GTM- ID.
Why am I seeing GTM- IDs? That ID format is linked to Google Tag Manager, a product used to manage various tags/IDs. Don’t be surprised if you see GTM- on a site instead of UA- or G-.
The Google spokesperson also said the GA4 ID may change in the future. It’s the Google tag by default now, but a site can also use GTM-. And there could be different tags in the future. Confusing, I know. This is the new reality.
Previously, you could perform a “find” on the page source with “UA-” and instantly see a UA- ID if one was present. But “G-” is a more common character string, which means you have to wade through false positives before locating the ID. You also can’t be sure if a site is using the G- or GTM- ID. So you have to search for both.
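Two of the points above can be sketched in Python: splitting a UA- ID into its core ID and suffix, and searching page source for G- and GTM- IDs with a pattern strict enough to skip strings that merely contain “G-”. The ID lengths in the patterns are assumptions based on commonly seen examples, not a documented guarantee from Google, and the sample markup is invented.

```python
import re

# Assumed ID shapes based on commonly seen examples; Google does not
# guarantee these lengths, and the formats may change.
GA4_ID = re.compile(r"(?<![A-Za-z0-9])G-[A-Z0-9]{8,12}(?![A-Za-z0-9])")
GTM_ID = re.compile(r"(?<![A-Za-z0-9])GTM-[A-Z0-9]{5,9}(?![A-Za-z0-9])")
UA_ID = re.compile(r"^(UA-\d+)-(\d+)$")


def find_tag_ids(html: str) -> dict[str, list[str]]:
    """Collect distinct G- and GTM- candidates from page source."""
    return {
        "G": sorted(set(GA4_ID.findall(html))),
        "GTM": sorted(set(GTM_ID.findall(html))),
    }


def split_ua_id(ua_id: str) -> tuple[str, int]:
    """Split a legacy ID into its core ID and numeric suffix."""
    match = UA_ID.match(ua_id)
    if match is None:
        raise ValueError(f"not a UA- style ID: {ua_id!r}")
    return match.group(1), int(match.group(2))


# Invented markup: a plain find for "G-" would also hit "PNG-ICON-LARGE".
sample = (
    '<img class="PNG-ICON-LARGE">'
    '<script src="https://www.googletagmanager.com/gtag/js?id=G-ABCDE12345"></script>'
    "<script>load('GTM-ABC1234')</script>"
)
print(find_tag_ids(sample))          # {'G': ['G-ABCDE12345'], 'GTM': ['GTM-ABC1234']}
print(split_ua_id("UA-3742720-3"))   # ('UA-3742720', 3)
```

A pattern like this can still miss IDs that do not fit the assumed lengths, so treat an empty result as a prompt to check the source by hand.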
Overall, the move to GA4 adds complexity and difficulty to the work of connecting sites by a Google Analytics ID. Here are a few suggestions for investigators:
Combine ID searches with domain searches to make sure you capture all of the current and historical analytics IDs. For example, search the current G-/GTM- ID in DNSlytics, and also search by domain name in DNSlytics and BuiltWith, among other services. The ID search will bring up more recent results, but you need to search by domain to see if the site previously had a UA- ID, and whether it was connected to other sites.
Rather than doing a “find” in a webpage’s source code for “G-” and “GTM-,” start with a search for “Google.” The code associated with either ID contains that word. Or you could search for “?id=G-” to find Google IDs.
Don’t just look for Google IDs! There are lots of other products and services that place a unique ID in the source code of a site. DNSlytics and BuiltWith track other IDs used by a given domain. Well-Known, which I wrote about here and here, is great for finding advertising IDs connected to a domain. Google’s products are widely used so you often have more luck with them. But don’t ignore other IDs!
Remember that you can search the source code of a page that was captured in the Wayback Machine. As part of your due diligence on a site, you should manually check archived pages for legacy UA- IDs.
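The Wayback Machine check in the last suggestion can be partly scripted. The sketch below only builds URLs and fetches nothing: one for the Internet Archive's CDX index, which lists captures of a page, and one for the raw archived copy of a single capture. The query parameters and the “id_” modifier reflect my understanding of the public Wayback interfaces, so verify them against the Internet Archive's documentation before relying on this. The function names are made up for the example.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page: str, limit: int = 50) -> str:
    """Build a CDX query listing successful captures of a page."""
    params = {
        "url": page,
        "output": "json",             # rows come back as JSON arrays
        "fl": "timestamp,original",   # only the fields needed to rebuild snapshot URLs
        "filter": "statuscode:200",   # skip redirects and error captures
        "collapse": "digest",         # drop consecutive captures with identical content
        "limit": str(limit),
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def raw_snapshot_url(timestamp: str, original: str) -> str:
    """Build the URL of a capture as archived, without the Wayback toolbar markup."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


print(cdx_query_url("example.com"))
print(raw_snapshot_url("20200101000000", "http://example.com/"))
```

Once you have the source of an archived capture, you can search it for “UA-” exactly as you would a live page.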
Do you have tips to share? Or questions I didn't answer? Share them in the comments.
Finally, here’s my latest investigation at ProPublica. It reveals how a mysterious network called AdStyle placed ads with fake endorsements from celebrities like Oprah Winfrey and Elon Musk on conservative sites in the U.S. and abroad. Here’s a sample of a scam ad placed by AdStyle that Oprah had nothing to do with:
That’s it for this edition of Digital Investigations! Thanks for reading.