Our Data Journey

Note: We are currently transitioning from Version 4 to Version 5 of WebDataStudio.com, so some of the information listed here is only relevant to Version 5. We wanted to list it out ahead of time because things are starting to move very fast.

One of the guiding philosophies we’ve had since we started WebDataStudio.com was that we always want to be radically transparent about how data is processed and stored. We comply with all relevant privacy laws and ensure that we always practice data minimization (only processing and saving data that’s essential, useful, and privacy-focused).

We’re going to go into extreme detail about what happens when you put our WebDataStudio.com script on your website and how it then protects your website visitors’ privacy.

So, let’s say you’ve just put the WebDataStudio.com Analytics script on your website or enabled one of our plugins within your CMS. The script is live and ready to collect privacy-focused website analytics. And then you (or any website visitor to your site) load your website in a browser. Here’s what happens from start to finish:

Step 1: Loading our cookie-free script

The WebDataStudio.com embed script (a JavaScript file) is loaded from our global content delivery network (CDN). This means that the file loads rapidly from a server in the city closest to you. Typical load times are around 30 milliseconds.

Once the script is loaded, a pageview request is sent to our servers in the United States. This will contain details about the page you’re on and the website that referred you. The browser will also send our servers your IP Address and User-Agent (which contains details about the browser you’re using and device type). Our technology doesn’t use cookies, so you won’t need an annoying cookie consent banner taking up half of your page.

Step 2: Our firewall

When your pageview hits our server, we can see your IP Address, User-Agent and website you visited. We then extract your IP Address and place it in our access log.

Note: In the past, we didn’t keep access logs, but after getting hit with violent DDoS attacks, we’ve had to pivot on that. Our access logs are automatically deleted after 24 hours. They contain no browsing activity – meaning we cannot tie a page view to an IP or User Agent (we do this for maximum privacy).

In addition to the access logs, we also keep track of how many requests each IP Address performs over the course of 5 minutes. Again, this is to help us prevent DDoS spam attacks.

Step 3: Our security checks

We keep counts of how many requests your IP Address makes to our system. Yes, we do this at the firewall level too, but we also have some application logic to protect ourselves. We do this to prevent spam attacks. The data we keep looks like this:

111.22.333.444 = 3 requests
777.88.999.111 = 1 request

We keep these counts for varying periods of time, which we won’t publicly disclose for security reasons, but we practice data minimization on what we keep. At most, we keep your IP count for 24 hours after your first request, similar to our access log policy.

If we detect minor abuse from an IP Address, we will block the IP Address at the application level for a brief period of time.

If we detect more serious abuse at the application level, we will permanently block the IP Address at the firewall level. This will mean that the IP Address is kept permanently to protect our systems. At the time of writing, this has only happened once.
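The counting and temporary blocking described in this step can be sketched roughly as follows. This is a minimal illustration only, not our production logic: the class name RequestCounter, the threshold MINOR_ABUSE_LIMIT and the block length TEMP_BLOCK_SECONDS are hypothetical values chosen for the example, and the 5-minute window is borrowed from Step 2.

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 5 * 60    # the 5-minute window mentioned in Step 2
MINOR_ABUSE_LIMIT = 100    # hypothetical threshold, not a real value
TEMP_BLOCK_SECONDS = 60    # hypothetical "brief period" for a block


class RequestCounter:
    """Counts requests per IP in fixed windows and blocks briefly on abuse."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._counts = defaultdict(int)  # (ip, window index) -> request count
        self._blocked_until = {}         # ip -> time at which the block ends

    def allow(self, ip: str) -> bool:
        """Record one request and return False if the IP is blocked."""
        now = self._clock()
        if self._blocked_until.get(ip, 0) > now:
            return False
        window = int(now // WINDOW_SECONDS)
        self._counts[(ip, window)] += 1
        if self._counts[(ip, window)] > MINOR_ABUSE_LIMIT:
            self._blocked_until[ip] = now + TEMP_BLOCK_SECONDS
            return False
        return True

    def count(self, ip: str) -> int:
        """Return the request count for the IP in the current window."""
        window = int(self._clock() // WINDOW_SECONDS)
        return self._counts.get((ip, window), 0)
```

A real system would also purge old windows on a schedule so that counts are not kept longer than needed.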

Step 4: Establishing if you’re a new visitor

Once you’re past our firewall, we need to establish uniques/visits to your website, and the way we do this is via a privacy-first unique visits method we invented back in 2019. Long story short, we keep SHA256 hashes (learn about hashes and salts here) that allow us to determine unique visitors over the course of 24 hours without creating any privacy risk.

Here’s an example of the data we receive about you when you load a website that uses WebDataStudio.com:

Your IP: 111.22.333.444
Your User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36
Random ID: 1234-AOPEOSWRELWPEOSDKELSAWEMFNEKRLSOROEPAWEM

The Random ID is generated on the client and will be used later on in Step 6. We keep the Random ID for the duration of the pageview.

One of our guiding development principles is that we don’t want to store your raw data (IP and User-Agent) alongside your browsing activity, as that wouldn’t be as privacy-friendly, so we need to create a “signature” that we can use to identify you on your next visit. Many analytics companies will store your raw IP Address and User Agent alongside your browsing activity, and it’s a practice we don’t agree with. We will never, ever do this. We will only ever store raw IP Addresses for security purposes, and they do not form part of our customer data exports, and they’re not shown on customer dashboards. The only time we will review IP Addresses is when we are under a DDoS attack, and our DDoS protection team needs to identify malicious actors.

  1. User Signature Hash. We use this as the base hash. This is our way of anonymously identifying a visitor (you, in this case) without knowing it’s actually you. This allows us to collect site-level uniques. This hash sits alongside your pageviews and is used to remove pageviews in the event of spam. We create this hash by combining the following data:
    1. Salt. We have a unique salt per site, which is recycled each day at midnight. This is put in place to make it impossible for a hacker to brute force the hashes. We’ll go into more information about brute-forcing below.
    2. IP Address. This is typically unique to your network and needs no explanation. There will be occasions where you’ll be on a shared network or proxy, but it’s reliable most of the time.
    3. User-Agent. The User-Agent, combined with your IP Address, often brings more uniqueness to the user signature hash.
    4. Hostname. This is the website address (e.g. http://www.milliondollarhomepage.com). This is a crucial parameter because it means we can’t collect browsing activity between websites.
  2. Page Request Signature Hash. Each time you view a page, we generate a hash that will tell us that you’ve viewed a page. We use this to collect uniques at a page level. This hash consists of the following data:
    1. User Signature. The user signature is the base of this hash.
    2. Pathname. The pathname (e.g. /blog) that was accessed.
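As a rough illustration of the two hashes above, here is a minimal sketch using SHA256 from the Python standard library. The function names, the "|" separator, the field order and the way the salt is generated are all assumptions made for the example; they are not a description of the exact production format.

```python
import hashlib
import secrets


def new_daily_salt() -> str:
    """Generate a fresh random per-site salt (assumed format: 64 hex chars)."""
    return secrets.token_hex(32)


def user_signature(salt: str, ip: str, user_agent: str, hostname: str) -> str:
    """Combine salt, IP Address, User-Agent and hostname into one SHA256 hash."""
    material = "|".join([salt, ip, user_agent, hostname])  # separator is assumed
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def page_request_signature(user_sig: str, pathname: str) -> str:
    """Derive the page-level hash from the user signature and the pathname."""
    return hashlib.sha256(f"{user_sig}|{pathname}".encode("utf-8")).hexdigest()
```

Because the hostname is part of the input, the same visitor produces a different user signature on every website, and because the salt changes daily, yesterday's signatures cannot be matched against today's.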

We then perform an existence check on the hashes. If they don’t exist in our database, it means that we’re dealing with a unique. If they do exist, it means we’re dealing with a return visitor. If the hashes don’t already exist in the database, we add them in, and they’re kept until midnight when they’re automatically deleted.
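The existence check can be pictured with a small in-memory sketch. The class DailyHashStore and its method names are hypothetical, and a Python set stands in for the database purely for illustration.

```python
class DailyHashStore:
    """Holds the hashes seen today; stands in for the real database."""

    def __init__(self):
        self._seen = set()

    def check_and_add(self, signature: str) -> bool:
        """Return True for a unique (first sighting), False for a return visit."""
        if signature in self._seen:
            return False
        self._seen.add(signature)
        return True

    def purge(self) -> None:
        """Delete every stored hash, as happens automatically at midnight."""
        self._seen.clear()
```

Calling check_and_add twice with the same hash returns True and then False; after purge, the same hash counts as a unique again.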

The beauty of this hash system is that we (or anyone else) can’t do anything with these hashes, nor can we “unravel” them to see personal details. They’re only valuable for the duration of a database existence check. Outside of that, they’re completely, beautifully useless.

  1. If you somehow broke into AWS, where our servers are hosted, and were able to get hold of all of our hashes and salt keys, there’d still be quadrillions of possible combinations to brute force, and nobody on this planet has the resources to achieve it. We have run the numbers, and your hacking budget would need to be multiple times the GWP (gross world product), which would be hundreds of trillions of dollars.
  2. We round pageview timestamps up to the nearest hour instead of keeping “per second granularity”. We do this because it means we can’t cross-reference access logs (which contain access by the second) to our database of browsing activity. We do this intentionally to ensure privacy is protected.
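The hourly rounding in the second point can be sketched as below. The function name hour_bucket and the round_up flag are hypothetical; the sketch supports both rounding up to the next hour and truncating to the start of the current hour, since either removes per-second granularity.

```python
from datetime import datetime, timedelta


def hour_bucket(ts: datetime, round_up: bool = False) -> datetime:
    """Drop minutes and seconds so a timestamp only identifies an hour."""
    floor = ts.replace(minute=0, second=0, microsecond=0)
    if round_up and ts != floor:
        return floor + timedelta(hours=1)  # move to the start of the next hour
    return floor
```

Once a timestamp has been reduced to the hour, it can no longer be lined up against a per-second access log entry.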

Step 5: Saving the pageview

Once we’ve established if you’re a unique visitor, we’re ready to store the pageview in our database. Here’s an example of everything we store in our database when we receive a pageview.

Pageviews Table

We insert the following payload into the database when receiving a pageview:

{
    id: 232332323234234,
    user_signature: 5f9b9f01f747722565af71b4e602dc6239f050616b2dfa00944db79b84804c32,
    site_id: 1234,
    hostname: http://www.milliondollarhomepage.com,
    pathname: /blog,
    is_new_visit: true,
    is_new_session: true,
    is_unique: true,
    referrer_hostname: https://bing.com,
    referrer_pathname: /about,
    timestamp: 2021-01-01 00:01:05,
    duration: 0,
    // In Version 3, we’ll be introducing various UTM data too, but it’s not there just yet
}

Notice that the above payload contains zero personal data. The user signature is technically pseudonymized data under GDPR, but that’s only a technicality. Brute forcing a hash like this would take hundreds of trillions of dollars, which is why we refer to the user_signature as practically anonymous.

Page Stats, Referrer Stats, Site Stats, Browser Stats, Device Type Stats and Country Stats

We keep aggregated data for pages, referrers, and other stats. We do this to ensure a fast dashboard experience.

Here is an example of how we track data for Page Stats:

{
    site_id: 1234,
    hostname: https://milliondollarhomepage.com,
    pathname: /blog,
    pageviews: 1,
    visits: 1,
    timestamp: 2021-01-01 00:00:00 (aggregated to current hour)
}

It works in the same way for Referrer Stats, Browser Stats, etc. But for something like Browser Stats, we’d store browser_name and browser_version instead of hostname and pathname.
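To make the aggregation concrete, here is a minimal sketch of how a Page Stats row could be incremented per hour. The function record_page_stats, the in-memory store and the choice of key are assumptions for illustration; the field names follow the example above.

```python
from collections import defaultdict
from datetime import datetime


def new_stats_store():
    """Create an in-memory stand-in for the page_stats table."""
    return defaultdict(lambda: {"pageviews": 0, "visits": 0})


def record_page_stats(store, site_id, hostname, pathname, ts, is_new_visit):
    """Increment the counters of the row for this page and hour."""
    hour = ts.replace(minute=0, second=0, microsecond=0)  # aggregate to the hour
    row = store[(site_id, hostname, pathname, hour)]
    row["pageviews"] += 1
    if is_new_visit:
        row["visits"] += 1
    return row
```

For Browser Stats, the same pattern would apply with browser_name and browser_version in the key instead of hostname and pathname.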

Step 6: Counting bounces and duration

When the user leaves the website page, we attempt to fire off a 2nd request. Due to different rules in different browsers, this request won’t always fire. However, it will fire most of the time since the most popular browsers in the world support this.

When the 2nd request fires, it sends across all the information that the 1st request had, but it also sends the time on page (in seconds). Since we’re receiving a second request, we also know that it wasn’t a bounce. So we can update the previous pageview (by User Signature and Random ID), set the duration, mark it as “not a bounce”, and then update the various aggregation tables (page_stats, browser_stats, etc.).
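The update triggered by the second request might look roughly like this. The function apply_exit_request, the dictionary standing in for the pageviews table and the is_bounce field are hypothetical names used only for this sketch.

```python
def apply_exit_request(pageviews, user_signature, random_id, seconds_on_page):
    """Set the duration and clear the bounce flag on the matching pageview.

    `pageviews` maps (user_signature, random_id) to a pageview row (a dict).
    Returns False when no matching pageview exists.
    """
    row = pageviews.get((user_signature, random_id))
    if row is None:
        return False
    row["duration"] = seconds_on_page
    row["is_bounce"] = False  # a second request arrived, so it was not a bounce
    return True
```

If the browser never sends the second request, the row simply keeps its initial duration of 0.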

Step 7: Analytics have been collected without invading your digital privacy

That’s it!

We just processed your visit to your website using WebDataStudio.com, and this is how we do it for every visit that happens when your website uses our analytics script. We were able to extract information that will be useful for your business, and we preserved the privacy of your visitor. This is how it should be. We don’t need to invade your website visitors’ privacy to provide you with data. We have spent thousands of hours thinking about and then building the most privacy-friendly methods, and your website visitors will appreciate you for it.

Now that you’ve finished reading this, you’ve got a complete insight into how we do things at WebDataStudio.com, but you may be curious about how this ties into privacy law. We’re fortunate to have a top-tier, EU-based privacy officer, access to some incredible lawyers, and an obsession with privacy law.

Here is a list of the compliance that we focus on for our privacy-first analytics service:

And as always, if you have any questions about how we process data, you can contact us at info@iroot.in.

User Behavior Analytics

Being able to properly analyze your users is one of the key success factors of a website, and it takes powerful tools.

Web Statistics

These statistics are the foundation on which to perform website audits and improve your online presence.

Heatmaps

Create heatmaps for every page of your website and see at a glance which areas experience a lot or little visitor interaction.

Campaigns Performance

Enhance your existing links with Urchin Tracking Module (UTM) parameters so that our tool can automatically process and visualize your marketing campaign data!