Why comply with data protection laws when you can just anonymize personally identifiable data? The situation is not so clear-cut…

Amid ongoing global concern over data privacy, anonymizing data where appropriate is one of the many techniques that organizations can deploy for compliance.

In fact, at the Centre for Development of Advanced Computing (C-DAC), India, it is an axiom that “personal data that has been properly treated this way helps organizations avoid the need for regulation.”

According to the center’s Principal Technical Officer, Shilpa Oswal: “You don’t need to abide by the privacy laws when you’re using truly anonymized data. Once data is truly anonymized and individuals are no longer identifiable, the data will not fall within the scope of the GDPR.”

By definition, data anonymization is the process of protecting private or sensitive information by erasing, encrypting, or masking identifiers that connect an individual to stored data. Examples of personally identifiable data include names, social security numbers, and mobile numbers.

“Truly anonymized data sets, which do not relate to an identified or identifiable natural person, can be published or shared with any party without legal obligations,” continued Shilpa. “We don’t need user consent to share it or use it.”

Techniques to anonymize data

According to Shilpa, database providers are cognizant of the need for anonymization. As an example, she cited the open source object-relational database PostgreSQL, which has an extension available for implementing anonymization.

However, anonymization is not as easy as flicking a switch. Care must be taken over how it is done, depending on what the data is needed for.

For example, organizations need to decide whether the anonymization will be static or dynamic.

  • Static anonymization means that the data is changed permanently in the database (or, more usually, in a copy of it).
  • Dynamic anonymization means that the change is applied only to the results of a query, leaving the stored data set unchanged.
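
The difference between the two modes can be illustrated with a minimal Python sketch. It is not how any particular database or extension implements anonymization; the `customers` table, the field names, and the masking rule in `mask_record` are all hypothetical and chosen only for illustration.

```python
# Hypothetical in-memory "table"; names and fields are illustrative only.
customers = [
    {"id": 1, "name": "Alice Tan", "phone": "91234567", "city": "Pune"},
    {"id": 2, "name": "Bob Lee", "phone": "98765432", "city": "Mumbai"},
]


def mask_record(record):
    """Return a copy of the record with direct identifiers masked."""
    masked = dict(record)
    masked["name"] = "REDACTED"
    # Keep only the last two digits of the phone number.
    masked["phone"] = "*" * (len(record["phone"]) - 2) + record["phone"][-2:]
    return masked


def anonymize_static(table):
    """Static: build a permanently masked copy that never holds the originals."""
    return [mask_record(row) for row in table]


def query_dynamic(table, city):
    """Dynamic: stored rows stay untouched; masking is applied to query results."""
    return [mask_record(row) for row in table if row["city"] == city]


static_copy = anonymize_static(customers)         # safe to hand over as a whole
dynamic_result = query_dynamic(customers, "Pune")  # masked at read time only
```

In the static case the masked copy can be shared without the original ever leaving the organization. In the dynamic case the original rows still exist, so access to the underlying table has to be controlled separately.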

Shilpa felt that most industries use static anonymization because it is ‘once-and-done’, with the added benefit that once data is anonymized, it does not matter what happens to the data, even if it is stolen, “whereas dynamic anonymization is a less mature technology for the moment, and there are very few customer success stories for this.”

Another consideration is which technique is used to anonymize the data. There are several, each with its own trade-offs:

  • Attribute or record suppression means deleting the attribute or record directly from the data set. The deleted values cannot be used for re-identification, but the data loss is permanent.
  • Pseudonymization is the use of fake or pseudo identifiers. Pseudo identifiers are created with a one-to-one mapping to the original identifiers, which means the pseudo data can be ‘translated’ back to the original.
  • Generalization makes the data more generic by grouping values into broader ranges. For example, although Bob is 28 years old, his age is recorded only as between 20 and 30. However, the more the data is generalized, the less useful it becomes.
  • Synthetic data uses completely artificial data to replace the original. It is suitable for testing purposes, and there is no risk of re-identification. However, large datasets may require high computing resources, so cost may become a factor.
  • Data perturbation modifies the data by adding random noise. It is mostly suitable for numeric values.
  • Data swapping rearranges attribute values across records, essentially reshuffling the data. However, it may create implausible records (e.g., if the genders of male and female patients are swapped in a medical database).
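
Several of the techniques listed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production approach: the `records` data, the `Pseudonymizer` class, the `ID-0001` token format, and the other helper names are all hypothetical. Synthetic data generation is left out because it depends heavily on the data set being modeled.

```python
import random

# Hypothetical records; all values are made up for illustration.
records = [
    {"name": "Bob", "age": 28, "salary": 52000, "city": "Pune"},
    {"name": "Asha", "age": 41, "salary": 61000, "city": "Delhi"},
    {"name": "Ravi", "age": 35, "salary": 48000, "city": "Mumbai"},
]


def suppress(rows, attribute):
    """Attribute suppression: drop the column entirely (permanent loss)."""
    return [{k: v for k, v in row.items() if k != attribute} for row in rows]


class Pseudonymizer:
    """Pseudonymization: one-to-one mapping between originals and tokens.

    Whoever holds the mapping can translate tokens back to the originals,
    so the mapping itself must be protected.
    """

    def __init__(self):
        self._forward = {}
        self._reverse = {}

    def pseudonym(self, value):
        if value not in self._forward:
            # Sequential tokens are a simplification for this sketch.
            token = f"ID-{len(self._forward) + 1:04d}"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def original(self, token):
        return self._reverse[token]


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a band such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb(value, spread, rng):
    """Perturbation: add bounded random integer noise to a numeric value."""
    return value + rng.randint(-spread, spread)


def swap_column(rows, attribute, rng):
    """Swapping: shuffle one attribute's values across the records."""
    values = [row[attribute] for row in rows]
    rng.shuffle(values)
    return [{**row, attribute: value} for row, value in zip(rows, values)]


rng = random.Random(42)  # seeded only so the sketch is repeatable
pseudonymizer = Pseudonymizer()
anonymized = [
    {
        "person": pseudonymizer.pseudonym(row["name"]),
        "age_band": generalize_age(row["age"]),
        "salary": perturb(row["salary"], 500, rng),
    }
    for row in suppress(records, "city")
]
```

Each function trades utility for privacy differently: suppression and generalization discard detail, perturbation and swapping keep the overall distribution but distort individual rows, and pseudonymization stays reversible for anyone who holds the mapping.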

All pros and no cons?

Despite the benefits that anonymization brings, Shilpa noted that it can also bring disadvantages, especially as websites strive to deliver a more personalized experience to visitors. “Many websites are doing that, and it is not possible if we have only anonymized data. We cannot use this anonymized data for marketing efforts.”

Nevertheless, as far as Shilpa is concerned, anonymization is a key tool in protecting data privacy: in her opinion, organizations of all types and sizes should endeavor to implement data privacy as best they can to protect the digital identities of their users.

If organizations do not double down on improving data privacy, digital identities of individuals will inevitably become compromised. “When this happens, the consequences would have serious implications for the individuals whose identities are stolen, and for the organizations suffering the breach: these include negative brand exposure, loss of customer trust, and potential litigation due to non-compliance with data privacy regulation,” Shilpa reiterated.

So, does anonymization absolve every organization of its compliance obligations? What are the ethical and legal considerations if an organization chooses anonymization techniques that end up allowing cybercriminals to piece together disparate fragments of poorly anonymized data and identify individuals?

Current data privacy and protection laws have also not addressed possible loopholes in how organizations interpret the term “without disproportionate effort,” which describes the level of anonymization needed to foil casual attempts at piecing fragments of data into a fuller picture. Once anonymization comes into widespread use, such grey zones could encourage unethical corporate behavior.
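
One rough way to gauge the piecing-together risk is to count how many records share the same combination of indirectly identifying attributes. A group of size one means that combination singles out a single record. The Python sketch below is only an illustration of that idea (the smallest group size is the "k" in what is commonly called k-anonymity); the `released` data set and its field names are hypothetical, and it says nothing about what any law considers "disproportionate effort."

```python
from collections import Counter

# Hypothetical released data set with names already removed.
released = [
    {"age_range": "20-29", "postcode": "411001", "gender": "M"},
    {"age_range": "20-29", "postcode": "411001", "gender": "M"},
    {"age_range": "30-39", "postcode": "411001", "gender": "F"},
    {"age_range": "30-39", "postcode": "400001", "gender": "F"},
]


def group_sizes(rows, quasi_identifiers):
    """Count the rows that share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def smallest_group(rows, quasi_identifiers):
    """Size of the smallest group; 1 means at least one row is unique."""
    return min(group_sizes(rows, quasi_identifiers).values())


def unique_rows(rows, quasi_identifiers):
    """Rows whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[q] for q in quasi_identifiers)] == 1
    ]


# Each extra attribute an attacker can combine tends to shrink the groups.
k_two_attributes = smallest_group(released, ["age_range", "gender"])
k_three_attributes = smallest_group(released, ["age_range", "postcode", "gender"])
exposed = unique_rows(released, ["age_range", "postcode", "gender"])
```

In this made-up data set, adding the postcode to the other two attributes is enough to isolate two of the four records, even though no name appears anywhere.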

Ultimately, while anonymization is both a good buffer and a catalyst for data management compliance, regulatory authorities will need to keep a tight rein on organizations, ensuring that they use these liberties responsibly and take full responsibility for anything that goes awry.