What does data anonymization mean?
Data anonymization is the process by which identifying information, such as name, age, or gender, is changed or removed from a dataset so that it is impossible, or nearly impossible, to determine which individual the data belongs to. It is also commonly referred to as “data sanitization” or “data masking.”
Any industry that relies on the collection of sensitive, personal information must practice some form of data anonymization. The level of anonymization varies depending on the nature of the business, the type of data collected, and whether the data is shared publicly or privately as a controlled release to a designated group of recipients. While certain elements of data must remain intact to provide value, data must be anonymized enough so that, if a breach does occur, the hackers cannot reap any benefits from the information. Properly anonymized data has no direct personal identifiers, such as names, addresses, social security numbers, or telephone numbers. It also contains no indirect identifiers, including place of work, salary, or diagnosis, that can be linked together to identify an individual.
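Similar to anonymization, pseudonymization is used to protect sensitive personal information. Pseudonymization, however, does not remove all identifying information. Instead, it replaces direct identifiers with pseudonyms, such as codes or tokens, so that an individual cannot be identified without additional information that is kept separately. Both concepts matter under the GDPR: pseudonymized data can still be re-identified with that additional information, so it is still treated as personal data, whereas truly anonymized data falls outside the regulation’s scope.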
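As an illustration, one possible way to pseudonymize a direct identifier is to replace it with a keyed hash. The sketch below is a minimal example under stated assumptions, not a prescribed method: the key value, the `pseudonymize` function, and the sample records are all hypothetical, and the key is assumed to be stored apart from the data.

```python
import hashlib
import hmac

# Hypothetical secret key (an assumption for this sketch); in practice it
# would be stored separately from the dataset it protects.
SECRET_KEY = b"example-key-kept-apart-from-the-dataset"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed-hash pseudonym (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated only for readability


# Made-up sample records.
records = [
    {"name": "Daniel", "diagnosis": "asthma"},
    {"name": "Jacqueline", "diagnosis": "diabetes"},
    {"name": "Daniel", "diagnosis": "flu"},
]

# The name is dropped and replaced by a stable pseudonym.
pseudonymized = [
    {"patient_id": pseudonymize(r["name"]), "diagnosis": r["diagnosis"]}
    for r in records
]
```

The same name always maps to the same pseudonym, so records for one person can still be linked together, which is why data treated this way is pseudonymized rather than anonymized.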
Common examples of data anonymization include:
- Medical research: Healthcare professionals and researchers looking to examine data pertaining to the prevalence of a particular disease among a specific population would perform data anonymization. This ensures they protect patients’ privacy and remain compliant with HIPAA standards.
- Marketing enhancements: Many online retailers want to improve how, and when, they communicate with their customers through emails, social media, digital advertisements, and their website. To improve their services and meet rising demand for custom or unique user experiences, digital agencies rely on insights gleaned from consumer data. To reap relevant information while remaining compliant, these marketers and analysts must leverage data anonymization.
- Software and product development: Developers often rely heavily on realistic data to develop new tools that can improve efficiencies, solve new challenges, and enhance service offerings. This data must be anonymized so that if a data breach does occur, highly personal information isn’t jeopardized.
- Business performance: Many large corporations gather employee-related data to optimize performance, increase productivity, and improve employee safety. Through data anonymization and aggregation, these companies can get the valuable information they need without making employees feel judged, monitored, or exploited.
How to anonymize data
There are a number of data anonymization techniques out there. While many of these methods are designed to sufficiently mask data, some may need to be used in conjunction with others to ensure both direct and indirect identifiers are anonymized. A few of the most common data anonymization methods include:
- Character masking: In character masking, or “masking out,” the format of the data is maintained, but select characters are replaced with a mask character, such as “x” or “#.” An example of character masking would be changing the birthdate 3/24/1955 to ##/##/19##.
- Data shuffling: Also known as data swapping, this technique involves rearranging data so that attribute values remain present but no longer correspond with their original records. Data shuffling is often compared to shuffling a deck of cards. This method is effective when there is no need to evaluate data based on relationships between the information contained within each record.
- Data substitution: With substitution, data from a column is completely replaced with random values from a list of fake, but similar-looking, data. For example, last names may be swapped out for other, unrelated last names, or credit card numbers may be replaced by a random string of 16 digits. To properly leverage this method, users need a substitution list that contains at least as many entries as the values they are trying to anonymize.
- Generalization: Generalization works by eliminating the specificity of data and replacing it with more general, yet still relevant, information. This is often achieved through the use of ranges. Instead of saying 33 years old, generalized data might say an individual is between 30 and 40 years old. For addresses, only road names may be listed.
- Number and date variance: Algorithms can be used to shift numeric values by a random percentage, or dates by a random number of days, within a set range. Applied appropriately, this obscures each exact value while keeping the data as a whole useful for analysis.
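- Scrambling: In scrambling, letters are mixed and rearranged so that the original value is no longer recognizable. A simplified example of this is turning the name “Daniel” into “Leniad,” “Jacqueline” into “Qcaelneiju,” and so on. Because all of the original characters are still present, simple scrambling can sometimes be reversed, so it is often used in conjunction with other methods.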
- Numeric blurring: Rather than being entirely masked, blurred values are shifted just far enough from their actual value that the individual’s identity is protected. Numeric blurring can be performed in a number of ways, including reporting rounded values or group averages.
- Suppression: In certain cases, a dataset may contain columns and/or records that do not aid the data evaluator in any way, but do contain identifying information. In these cases, it’s best to suppress, or remove, the columns and/or records. It is important that the data is completely removed from the dataset, rather than simply hidden.
- Synthetic data: Unlike other data anonymization techniques, synthetic datasets are imitation versions of actual data rather than modified data. These synthetic datasets have many things in common with the actual data, such as format and relationships between data attributes, and are leveraged when a large amount of data is needed for system testing and actual data can’t be used.
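Several of the techniques above can be sketched in a few lines of Python. The example below is a minimal illustration using only the standard library. The function names, the sample rows, and the 5 percent variance limit are assumptions made for this sketch, not part of any particular tool.

```python
import random


def mask_date(date_str: str, mask: str = "#") -> str:
    """Character masking: keep the format and century, mask the other digits."""
    _month, _day, year = date_str.split("/")
    return f"{mask * 2}/{mask * 2}/{year[:2]}{mask * 2}"


def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with a non-overlapping range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def shuffle_column(rows, column, rng):
    """Data shuffling: rearrange one column's values across the records."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def vary_number(value, max_pct, rng):
    """Number variance: shift a value by a random percentage within +/- max_pct."""
    return value * (1 + rng.uniform(-max_pct, max_pct) / 100)


def suppress(rows, columns):
    """Suppression: drop the named columns entirely rather than hiding them."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


# Seeded only so this sketch is repeatable; the random module is not
# cryptographically secure, and a fixed seed would not be used in practice.
rng = random.Random(0)

# Made-up sample rows.
rows = [
    {"name": "Daniel", "age": 33, "salary": 52000},
    {"name": "Jacqueline", "age": 47, "salary": 61000},
    {"name": "Maria", "age": 29, "salary": 58000},
]

print(mask_date("3/24/1955"))   # ##/##/19##
print(generalize_age(33))       # 30-39

shuffled = shuffle_column(rows, "salary", rng)   # salaries no longer match names
varied = [{**r, "salary": round(vary_number(r["salary"], 5, rng))} for r in rows]
suppressed = suppress(rows, {"name"})            # name column removed outright
```

Each function addresses a single column or value. As noted above, a real dataset would likely need several of these applied together so that both direct and indirect identifiers are covered.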
Risks associated with data anonymization
While data anonymization can lead to great progress for companies across sectors, it is not without its limitations and risks. If executed incorrectly or with weak algorithms, poor anonymization can result in:
- Identity disclosure: Also known as singling out, identity disclosure is the term used to describe situations in which it is possible to identify all or some of the individuals within a dataset.
- Attribute disclosure: Attribute disclosure is the ability to determine if an attribute within a dataset is held by a specific individual. For example, an anonymized dataset could show that all employees within the sales department of a particular office arrive after 10 a.m. If it is known that a particular employee works in the sales department of this office, you know that they arrive after 10 a.m., even though their specific identity is masked within the dataset.
- Linkability: Linkability refers to when it is possible to connect multiple data points, whether in the same dataset or separate datasets, to create a more cohesive picture of a specific individual.
- Inference disclosure: Inference disclosure occurs when you are able to confidently make an inference about the value of an attribute based on other attributes.
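Two of these risks can be checked for before a dataset is released. The sketch below is a rough illustration, loosely modeled on the idea of k-anonymity: it groups records by their indirect identifiers, flags groups too small to hide an individual (identity disclosure), and flags groups in which everyone shares the same sensitive value (attribute disclosure, as in the sales department example above). The function names, the threshold `k`, and the sample rows, including the office name, are hypothetical.

```python
from collections import defaultdict


def group_by(rows, quasi_identifiers):
    """Group records by their combination of indirect identifiers."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups


def singled_out(rows, quasi_identifiers, k=2):
    """Identity disclosure risk: combinations shared by fewer than k records."""
    groups = group_by(rows, quasi_identifiers)
    return [key for key, members in groups.items() if len(members) < k]


def homogeneous_groups(rows, quasi_identifiers, sensitive):
    """Attribute disclosure risk: groups of two or more records in which
    every record has the same value for the sensitive attribute."""
    groups = group_by(rows, quasi_identifiers)
    return [
        key
        for key, members in groups.items()
        if len(members) > 1 and len({m[sensitive] for m in members}) == 1
    ]


# Made-up sample rows with names already removed.
rows = [
    {"department": "sales", "office": "Austin", "arrival": "after 10 a.m."},
    {"department": "sales", "office": "Austin", "arrival": "after 10 a.m."},
    {"department": "sales", "office": "Austin", "arrival": "after 10 a.m."},
    {"department": "engineering", "office": "Austin", "arrival": "before 10 a.m."},
    {"department": "support", "office": "Austin", "arrival": "before 10 a.m."},
    {"department": "support", "office": "Austin", "arrival": "after 10 a.m."},
]

identifiers = ["department", "office"]
at_risk_identity = singled_out(rows, identifiers)
at_risk_attribute = homogeneous_groups(rows, identifiers, "arrival")
```

Here the lone engineering record is flagged as singled out, and the sales group is flagged because every member has the same arrival time. A check like this covers only the columns it is given, so it does not rule out linkage with other datasets.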
Proper data anonymization can be time-consuming and challenging, but it’s imperative that these techniques are performed accurately, by experienced professionals, if companies want to maintain regulatory compliance and ward off attackers. A number of data anonymization tools are available that can help companies overcome these hurdles and securely reap the many benefits data collection has to offer.