The Data Cape: Compliance AND completeness

Legal compliance and data completeness for a rock-solid data foundation

Summary

This document explains the Data Cape, a data architecture developed by Cape.ly based on over a decade of experience. The main goal of the framework is to solve the biggest problem related to user-related data: the conflict between regulatory compliance and data completeness. You will learn how to achieve both and put individual data initiatives and your company as a whole on the road to success.

The problem: Compliance OR completeness?

Please note that this document is still a work in progress.

1. Introduction

Companies are doing their best to avoid hefty fines and costly lawsuits. This usually leads to an internal tug-of-war because legal compliance limitations and data completeness requirements are opposing forces.

Unfortunately, the conflict between the legal department and other business units usually results in subpar data and causes huge issues for tools and use cases.

2. Tug of war between legal and business units

Privacy and data rules and regulations limit the creation and use of user-related or personally identifiable information (PII).

Typically, legal departments push for sparing use of user-related data, while business units want as much data as possible, especially for AI use cases.

This struggle ties up a lot of resources and never ends, so it should be avoided at all costs. It also negatively affects the data itself.

3. Data erosion: What’s that?

In an ideal world, companies would have data that fully represents every aspect of their entire business and everyone interacting with it, or in other words, the data would be complete.

However, there are legal, technological, and resource limitations that typically decrease the data’s completeness. The difference between the complete data and the actual data, regardless of the reason why, is data that has eroded.

The various reasons and types of data erosion are discussed below, but it is important to understand that data erosion can mean that data is completely missing, or that individual data points don’t exist or contain incomplete values.
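The definition above can be turned into a rough measurement. The following is a minimal sketch, not part of the Data Cape itself: the function name `erosion_rate`, the field names, and the idea of counting expected data points are assumptions made for illustration. It treats a data point as eroded when the record is missing entirely, the field does not exist, or the value is empty.

```python
from typing import Any, Mapping, Sequence


def erosion_rate(
    records: Sequence[Mapping[str, Any]],
    expected_fields: Sequence[str],
    expected_records: int,
) -> float:
    """Return the share of expected data points that are missing or empty.

    0.0 means the data is complete, 1.0 means everything has eroded.
    The expected record count is an input the caller has to estimate.
    """
    expected_points = expected_records * len(expected_fields)
    if expected_points == 0:
        return 0.0
    # A data point counts as present if the field exists and is not empty.
    present = sum(
        1
        for record in records
        for field in expected_fields
        if record.get(field) not in (None, "")
    )
    return 1 - min(present, expected_points) / expected_points


# Hypothetical example: 4 events expected, 2 received, one with an empty value.
events = [
    {"page": "/home", "campaign": "spring"},
    {"page": "/cart", "campaign": None},
]
rate = erosion_rate(events, ["page", "campaign"], expected_records=4)
```

In this example, 3 of the 8 expected data points are present, so the erosion rate is 0.625.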

We also distinguish between two types of data erosion:

Steady tidal erosion occurs when you end up with incomplete data due to compliance requirements.
Sudden flooding erosion occurs when you lose non-compliant data due to a sudden event like an audit.

4. What data is vulnerable to erosion?

It is important to understand what data triggers the tug of war described in section 2. Generally speaking, any data that can potentially be tied to an individual user (or the user’s device) poses a compliance risk.

Adjustments to the data could be required, leading to erosion. Some affected types of data are:

Behavioral / analytics / click stream / user event data
Campaign and conversion tracking data
Order cancellations, refunds, subscription renewals
Website interactions and mobile app interactions
Cross-device tracking
Email views and clicks, QR code scans
Chatbot interactions, form usage data
Other sources of customer journey data

5. The age of AI requires rock-solid user data

Over the past few decades, data has become very important to almost every business. However, increasing automation and the rise of AI require more data and more reliable data. Data is now more important than ever before.

Unlike its human counterparts, AI lacks much of the context outside the data that humans get, for example, from talking to their co-workers. Additionally, AI requires fresher data than ever before in order to successfully facilitate real-time interactions with users.

With incomplete data, meaning less than full coverage, AI is effectively blind to whatever is missing. This may be irrelevant for some tasks, but the more complex the task, the bigger a problem missing data becomes.

6. Data erosion due to compliance

Legal boundaries often limit the amount of data that can be gathered. Because laws are not always super clear, it’s usually on the legal department to impose a stricter or less strict interpretation.

In order to gather personally identifiable information (PII), more and more rules and regulations require the user’s consent. Only collecting data with consent means that a lot of data is not going to be collected.

However, even data that does not contain personally identifiable information can come with legal issues, for example when the user’s consent is required to execute some form of tracking code on the user’s device, even if the gathered data would strictly not contain any PII.

7. Data erosion due to non-compliance

A lot of companies gather data that they don’t have a legal basis for, usually due to a lack of internal oversight or because the legal department wasn’t successful at putting proper guardrails in place.

However, building a business on such data is like building on sand. If you have a sword of Damocles hanging over your head in the form of possible internal or external audits, that data does not provide a strong foundation.

When it’s only a matter of time until something bad happens to the data, it’s not something to build data initiatives on. The lack of reliability and of trust in the longevity of the data is already a form of erosion, even if the data itself has not eroded yet.

8. Data erosion due to rogue employees

Without the right measures in place and even if the company wants to do everything right, individual employees can still go rogue. They can gather data they are not supposed to, and they can hide this fact from coworkers and the legal department.

While most rogue employees don’t intentionally break the law and just take short-cuts to get their job done, some know very well that what they are doing is illegal. Some even go to great lengths to hide their illegal activity from their employers.

The result is, again, unreliable data that can take everything built upon it down with it once its illegal nature is discovered and the data can’t be used anymore.

9. Data erosion due to rogue technology partners

Just like the employees described in the previous section, technology partners can go rogue too. This can happen in two ways:

First, there can be rogue employees at other companies, and they can be inclined to misrepresent compliance facts, for example in order to hit certain quotas, because most are aware that the customer is ultimately legally responsible, not them.

Second, the way 3rd-party technology works can change. These changes can have huge compliance implications, often completely unintended.

Consultants, agencies, and other service providers can cause similar issues for similar reasons.

10. Direct and downstream costs of data erosion

Businesses usually make decisions based on data. When it comes to creating the foundation, they have two choices:

Spend more money upfront gathering data that is compliant and complete, i.e. not prone to erosion.
Save money gathering data that is either compliant (but not complete) or complete (and not compliant), i.e. prone to erosion.

When data powers the entire business, the first option should be chosen. However, the second option is still the sad reality at most companies and incurs much higher total costs:

Employee productivity

People work less efficiently or can’t do their jobs altogether. As a result, more workers are required, or the existing team has a lower output.

Use cases

In data, the general rule is “garbage in, garbage out” (GIGO): subpar input data produces bad outcomes. What a bad outcome looks like depends on the respective use case; in online marketing, for example, it can mean incorrect ROI analysis.

Tool performance

Software has gotten better and better, especially with the recent advancements in AI. However, tools can always only be as good as the data that is fed into them. Data erosion can cause them to malfunction completely or produce subpar results.

Developer resources

Because data has become so important, companies spend a lot of resources on fixing it. With data erosion, this can become a Sisyphean task and drain costly and scarce developer resources.

Damages

For medium-sized to large companies, these costs can easily add up to damages in the millions of dollars per year, not to mention the general frustration data erosion causes for everyone affected.

11. Data erosion puts business success at risk

Data erosion affects everything: Tools contain non-compliant or incomplete data, data initiatives and use cases are built on unreliable data, and entire departments can be forced to stop or significantly change what they are doing.

Regardless of whether data erosion occurred due to compliance or non-compliance, the implications for data-driven businesses are huge. Most businesses already rely heavily on data, and this reliance will only increase with the rise of AI. Building on unreliable data is like building on sand, and it is highly negligent.

12. Conclusion

Data erosion is a gigantic problem, and with the rise of AI it’s not going to get smaller. However, based on more than a decade of experience, we know that achieving both legal compliance and data completeness is possible, so we made it our mission to help companies achieve both and ensure their success.

Our mission: User privacy AND business success

1. Introduction

For more than a decade, we at Cape.ly have been helping companies gather user-centric data that is both legal and high-quality. During that time we have been confronted with the same set of beliefs over and over again:

Maximizing business success requires maximizing user privacy violations.
Maximizing user privacy leads to subpar data which negatively affects the business.

However, the implied assumption that user privacy and business success can’t be combined is absolutely false.

It is actually possible; it’s just not very easy to do. And because we care deeply about user privacy and want to help businesses, we have made it our mission to share and promote our approach as a framework to achieve both at the same time.

2. Most use cases work well without invasive data

The most important thing to understand is that most use cases actually don’t require personally identifiable information (PII). It is just much easier to gather and work with user-related data. Additionally, basically the entire ecosystem runs on user-related data for historical reasons.

The very few use cases that actually require PII, and even that’s debatable, are ad networks with retargeting, and anything else that involves addressing specific users or devices individually, for example for, again, advertising purposes or security reasons.

All other use cases work very well with data that is not tied to an individual user or device. Some common examples include:

Website and mobile app usage analysis
Marketing campaign ROI analysis
Funnel / navigation / user behavior analysis

These and almost all other use cases work with cohort-based data, consented PII and representative synthetic PII data.

3. Users deserve privacy, regardless of the law

Over the past decade, users have increasingly voiced concerns about privacy violations and demanded change. As a result, politicians around the world have created more and more privacy laws, with Europe being the strictest at the moment.

However, even without any of these laws in place, businesses should realize one thing: Respecting users’ privacy is an essential part of the overall respectful behavior that companies should demonstrate towards their customers.

It is important to understand that respect for the law and customer privacy is a huge competitive advantage, especially when there is little to no effect on the usability of the data.

4. Violations are risky and ultimately expensive

Most violations of privacy laws can result in fines and/or lawsuits. And while there is certainly a lack of enforcement compared to the number of violations, user-facing violations are relatively easy to detect and can therefore trigger a cascade of events at any moment.

Risking fines and the loss of all or a substantial portion of a business’ data foundation is not a winning long-term strategy.

Once non-compliant data has been identified, businesses are often required to shut down and/or redo the way they gather data. In the end, they are spending more than if they did it correctly from the start. Shortcuts are risky and ultimately expensive.

6. Conclusion

We know how complicated user-related data is, so we developed a set of tools and methods that protect both users and businesses:

Users from privacy violations
Businesses from legal risks due to non-compliant data
Use cases and tools from issues due to incomplete data

We call our approach the Data Cape and hope this framework can help businesses to fortify and future-proof their data capabilities.

The Data Cape: Compliant AND complete data

1. Introduction

Businesses have become completely data-dependent. However, an ever increasing number of rules and regulations causes a lot of erosion.

In nature, a cape forms when everything else falls victim to erosion. Because our goal is to create a rock-solid foundation immune to erosion, we chose this term for both our architecture and our company.

From a data architecture perspective, the Data Cape’s strategic location between land and water is ideal to integrate with data lakes and provide a rock-solid foundation for data lakehouses and data warehouses.

From a legal perspective, the Data Cape’s solidness stands for protection. Businesses don’t have to worry about fines and lawsuits, and users can be sure that there are no privacy violations.

2. Governance, risk, compliance AND completeness

The Data Cape incorporates the functions of a traditional GRC program but has one very important additional component that makes it different: Its primary goal is to produce data of maximum completeness. It is unique in that it is a highly integrated solution.

From a data perspective, GRC programs are often seen as a nuisance across business units that need data to operate. That’s what makes the Data Cape a perfect addition, because it provides as much data as possible and treats non-compliance and data erosion as equally important problems.

3. Completeness: Preventing data erosion

Data erosion can happen at any stage of the data lifecycle:

Data creation

User-related data is usually created on the user’s devices, e.g. in a browser on a desktop computer, tablet, or phone, in an app on a mobile device, etc. However, user-facing systems create user-related data as well, e.g. a customer support system or a CRM. This is the stage at which most issues are introduced into data. It is also the stage where companies have to ask for a user’s consent to comply with privacy regulations.

Data transmission

When data is created on the device of a user, it needs to be transmitted to the company’s data pipelines, securely and reliably.

Data processing

Because the receiving endpoints are usually unprotected, a lot of unusable data can potentially be received. There can also be a desire to filter out bot traffic, for example.

Data storage

With an increasing number of storage technologies, storing even the largest amounts of data has become a relatively easy task. However, a lot of mistakes can be made when modeling the data.

Data usage and downstream processes

Data can usually not be fixed downstream, so you need to get it right at the creation stage. While this is logical and should be common sense, the desire to get things done and show results quickly often leads to non-compliant and/or unnecessarily incomplete data.

Preventing data erosion

Please note that this document is still a work in progress. This section will explain in detail how to prevent data erosion at the creation stage.

4. Completeness: Through granular compliance

Compliance is often enforced at a very coarse level. For example, when a user doesn’t consent to the collection of PII, many companies don’t collect any data at all. But why not at least gather non-PII data?

When looking at individual data points and not an entire data object as a whole, there are many data points that are not tied to an individual but contain information that still can be helpful to the business.

The main priority always has to be compliance. But almost equally important is maximizing the amount of legal data, which can only be achieved with a data-point-level approach.
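The data-point-level idea can be sketched as follows. This is an illustration under stated assumptions, not the Data Cape implementation: the function `apply_consent`, the field names, and the set `PII_FIELDS` are hypothetical, and which fields actually count as user-related is a legal decision that this code does not make. Separate consent requirements, for example for executing tracking code on the user’s device, are also outside the scope of the sketch.

```python
from typing import Any, Dict, Set

# Hypothetical classification of data points that are tied to an individual.
# In practice this list has to come from a legal review, not from code.
PII_FIELDS: Set[str] = {"user_id", "email", "ip_address", "device_id"}


def apply_consent(event: Dict[str, Any], has_consent: bool) -> Dict[str, Any]:
    """Keep the whole event with consent, only non-PII data points without."""
    if has_consent:
        return dict(event)
    # Without consent, drop only the user-related data points
    # instead of discarding the entire event.
    return {key: value for key, value in event.items() if key not in PII_FIELDS}


event = {"user_id": "u-1", "page": "/pricing", "campaign": "spring"}
without_consent = apply_consent(event, has_consent=False)
with_consent = apply_consent(event, has_consent=True)
```

Without consent, the event still contributes its page and campaign data points. With an all-or-nothing approach, they would have been lost as well.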

5. Completeness: Synthetic data for legacy tools

Most tools working with user-related data were created long before recent privacy and data regulations. That’s why the entire ecosystem relies heavily on user-related data, even if user-level data would not be required.

The issues with “consented” data

In order to stay compliant with regulations, a lot of companies choose to work exclusively with data of users that have given their consent. This approach minimizes legal risks and ensures compatibility with consuming tools requiring user-level data.

However, any analysis, insight, or recommendation derived from partial data has to be adjusted accordingly, creating additional work and subpar results. For example, somebody creating reports and dashboards may have to adjust the numbers by 30% to reflect what actually occurred.
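To make the adjustment concrete, here is a small sketch of the kind of extrapolation a report author might apply. The function name `extrapolate_total` and the numbers are assumptions for illustration only. The calculation also assumes that users who decline consent behave like users who give it, which often does not hold and is one reason the results stay subpar.

```python
def extrapolate_total(consented_count: int, consent_rate: float) -> float:
    """Estimate the true total from a count that only covers consenting users.

    Assumes consenting and non-consenting users behave the same way.
    """
    if not 0 < consent_rate <= 1:
        raise ValueError("consent_rate must be greater than 0 and at most 1")
    return consented_count / consent_rate


# Hypothetical example: 700 conversions were recorded and 70% of users consented.
estimated_conversions = extrapolate_total(700, 0.7)
```

Here the estimate is roughly 1,000 conversions, so about 300 of them are inferred rather than measured.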

How synthetic data can overcome these issues

In order to deliver on its promises of data compliance and completeness, the Data Cape must anonymize data that lacks user consent and replace user-related data points with synthetic data that mimics user-related data without being user-related.

This approach maximizes compatibility with the existing tool landscape. 3rd-party tools consume both the actual user-related data and the synthetic user-related data. The results are analyses, insights, reports, dashboards, etc. of maximum quality.
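One possible shape of this replacement step is sketched below. It is not the Data Cape’s actual method, which this document does not describe in detail. The function `to_tool_ready`, the field names in `USER_RELATED_FIELDS`, the `syn-` prefix, and the `synthetic` marker are all hypothetical. The sketch swaps each user-related value for a random one, so events without consent keep the structure that legacy tools expect. Whether a particular replacement strategy is sufficient anonymization is a legal question that the code cannot answer.

```python
import uuid
from typing import Any, Dict, Tuple

# Hypothetical list of user-related data points that consuming tools expect.
USER_RELATED_FIELDS: Tuple[str, ...] = ("user_id", "device_id")


def to_tool_ready(event: Dict[str, Any], has_consent: bool) -> Dict[str, Any]:
    """Return an event that user-level tools can consume.

    With consent the event is passed through unchanged. Without consent,
    user-related values are replaced by random values that carry no link
    to the real user.
    """
    if has_consent:
        return dict(event)
    result = dict(event)
    for field in USER_RELATED_FIELDS:
        if field in result:
            # A fresh random identifier per event, unrelated to the original.
            result[field] = f"syn-{uuid.uuid4()}"
    # Assumed design choice: mark synthetic events so they stay distinguishable.
    result["synthetic"] = True
    return result


raw = {"user_id": "u-1", "page": "/checkout"}
synthetic_event = to_tool_ready(raw, has_consent=False)
```

Because a new identifier is generated for every event, this variant cannot connect several events of the same user. A real implementation would need a more careful design to mimic user journeys without becoming user-related again.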

6. Compliance: For different types of data

Another common mistake is that all data is treated equally. However, different types of data need different treatment to ensure compliance and completeness. In order to keep this document relatively concise, there are more specific documents for different types of data:

7. Compliance: Across a multitude of tools

Please note that this document is still a work in progress. This paragraph will later explain how to ensure compliance across a multitude of tools.

8. Compliance: Across different regulations

Creating data that is compliant with one specific regulation is challenging. It becomes even more challenging if more than one jurisdiction is involved, especially because a lot of laws are kept rather vague and broad.

However, there is also some good news: Privacy and data rules and regulations around the world greatly overlap because lawmakers tend to prefer not to reinvent the wheel and cooperate internationally. This is another reason to solve compliance in a central place.

9. Risk: Fines and lawsuits

Please note that this document is still a work in progress. This paragraph will later explain the risk of fines and lawsuits.

10. Risk: Reputational damage

Please note that this document is still a work in progress. This paragraph will later explain the risk of reputational damage.

11. Governance: Centralized oversight

Please note that this document is still a work in progress. This paragraph will later explain centralized oversight.

12. Additional benefit: Only one implementation

Please note that this document is still a work in progress. This paragraph will later explain the benefit of having only one implementation.

13. Additional benefit: Single Source of Truth

Please note that this document is still a work in progress. This paragraph will later explain how this approach automatically establishes a Single Source of Truth.

14. Additional benefit: Data quality

Please note that this document is still a work in progress. This paragraph will later explain why this approach produces significantly higher data quality.

15. Additional benefit: Tool agnosticism

Please note that this document is still a work in progress. This paragraph will later explain why it is very beneficial to be tool and technology agnostic.

16. Additional benefit: Central monitoring and alerting

Please note that this document is still a work in progress. This paragraph will later explain why it’s better to monitor data in a central place.

17. Additional benefit: Faster time to market

Please note that this document is still a work in progress. This paragraph will later explain how this approach leads to faster time to market.

18. Additional benefit: Consolidation

Please note that this document is still a work in progress. This paragraph will later explain how this approach usually leads to tool and data consolidation.

19. Additional benefit: Cost efficiency

Please note that this document is still a work in progress. This paragraph will later explain how this approach saves money.

20. Additional benefit: Matching numbers across tools

Please note that this document is still a work in progress. This paragraph will later explain how this approach leads to matching numbers across different tools.

21. Additional benefit: Easier maintenance

Please note that this document is still a work in progress. This paragraph will later explain why this approach is much easier to maintain.

22. Additional benefit: Minimal development efforts

Please note that this document is still a work in progress. This paragraph will later explain how this approach minimizes developer efforts.

23. Additional benefit: Data reusability

Please note that this document is still a work in progress. This paragraph will later explain data reusability.

24. Conclusion

Please note that this document is still a work in progress. This paragraph will later give a conclusion.

The result: Optimal user data to feed everywhere

1. Introduction

One of the main advantages of the Data Cape is that data is only optimized for completeness once, and compliance is only done once, not multiple times or even once per tool / data consumer.

For businesses, it’s increasingly important to execute and adapt quickly to ever-changing environments and use cases. Automation, machine learning, and artificial intelligence in particular require fresh, high-quality data that is available quickly.
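The "once, not once per tool" principle can be sketched as a central processing step followed by a fan-out to all consumers. The names `process_once` and `fan_out`, the destination callables, and the placeholder rule of dropping `user_id` without consent are assumptions made for illustration, not the Data Cape’s actual interfaces.

```python
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]
Destination = Callable[[Event], None]


def process_once(event: Event, has_consent: bool) -> Event:
    """Central compliance step (placeholder rule for this sketch)."""
    if has_consent:
        return dict(event)
    return {key: value for key, value in event.items() if key != "user_id"}


def fan_out(event: Event, has_consent: bool, destinations: Iterable[Destination]) -> Event:
    """Apply compliance a single time, then deliver the same result everywhere."""
    compliant = process_once(event, has_consent)
    for send in destinations:
        # Each consumer gets its own copy so it cannot alter what others receive.
        send(dict(compliant))
    return compliant


# Hypothetical consumers, represented as in-memory lists.
analytics_inbox: List[Event] = []
warehouse_inbox: List[Event] = []

fan_out(
    {"user_id": "u-1", "page": "/home"},
    has_consent=False,
    destinations=[analytics_inbox.append, warehouse_inbox.append],
)
```

Because every destination receives the output of the same processing step, the consumers start from identical data instead of each applying its own compliance logic.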

2. Tools are only as good as the data fed into them

There are more software tools than ever, and most of them depend on data just as much as businesses depend on the tools themselves. Problems arise when the data that these tools work with is incomplete.

That can easily lead to situations where, figuratively speaking, the tools are blind in one eye. For most tools, this doesn’t mean that they stop working. However, it can quickly result in partially or completely incorrect results, which further emphasizes how much more companies should focus on the input data.

3. Self-hosted data pipelines

Please note that this document is still a work in progress. This paragraph will later explain how to send data to self-hosted data pipelines, like for example:

Elasticsearch

4. Cloud platform data pipelines

Please note that this document is still a work in progress. This paragraph will later explain how to send data to cloud platform data pipelines, like for example:

5. Customer data platforms (CDP)

Please note that this document is still a work in progress. This paragraph will later explain how to send data to customer data platforms, like for example:

Rudderstack CDP

6. Tag managers

Please note that this document is still a work in progress. This paragraph will later explain how to send data to server-side tag managers, like for example:

7. Analytics tools

Please note that this document is still a work in progress. This paragraph will later explain how to send data to analytics tools, like for example:

8. Marketing technologies (MarTech)

Please note that this document is still a work in progress. This paragraph will later explain how to send data to the thousands of MarTech tools, like for example:

Google Ads

Meta Ads

9. Conclusion

Our long-term vision is that all tools can be powered from a single, central data source so that all resources can be spent on maximizing legal compliance and data completeness instead of being spread out. We at Cape.ly want everyone to be able to focus on their job without worrying about data at all.
