Data Deletion: The Best Strategy for Data Defense

If data is the new oil, you need to handle it safely to avoid data leaks, which are the equivalent of toxic spills. It all starts with a strong data deletion strategy.

After the personal information of more than 650,000 customers was leaked, the pub chain Wetherspoon decided to delete almost all of its stored customer information in order to reduce risk. After all, if you don’t hold the data, you don’t need to check it for compliance, you don’t need to disclose it in response to a GDPR “subject access request”, and you won’t have to apologize for a data breach.

In fact, data is so toxic that Joshua de Larios-Heiman, chairman of the California Bar Association’s Internet and Privacy Law Committee, suggests comparing it to uranium rather than oil. He said: “What about the spent uranium rods? They become toxic assets, and they are difficult to dispose of. If you don’t handle them properly, people will sue you.”

If you start to think about risk in these terms, what data is your company storing that it would be better off without?

Do not collect unwanted data

Much of the data that people generate offers you no value, and retaining it may only increase your risk. Julia White, Microsoft’s vice president of Azure and Enterprise Security, commented: “What shocks me is that people don’t seem to go looking for the data they don’t want, or the data that should be cleaned up for GDPR reasons.”

Jon Callas, senior technical researcher at the ACLU, warns against being fooled by falling storage costs into thinking that keeping data is cheap.

“The cost of keeping data is higher than expected, and the benefit is low. It may be useful and contribute to analysis, but it is more likely to be harmful: you could lose it in a breach, or be subpoenaed for it,” he said. “As time goes by, its useful value declines, but its hazard value stays the same. If you lose someone’s address from five years ago, the EU does not care that it was data you no longer wanted. Nor does it care how the data might help your business. If you lose it, you are responsible. At some point, the declining value and the risk cross over. You should throw the data away before that point.”

Callas pointed out that “the cost of subpoenas and subject access requests is higher than the cost of the storage media. Bad things can happen, and some data can get you into even more trouble, at a cost far higher than the value of the data. When you can say ‘I only keep data that there is a reason to retain,’ the process you have to go through puts you in a completely different situation.”

High-risk data

In an interview, Veritas senior director Jasmit Sagoo said that one-third of the data stored in data centers is dispensable, outdated or even redundant.

He said: “This data has little business value and should be proactively deleted, especially when you consider data breaches and risk levels. For example, data on former employees and former customers carries very high risk. It includes personally identifiable information, so it is only worth keeping for legal reasons. Such records are particularly vulnerable to hackers, and they are a concrete example of sensitive data that needs to be carefully managed.”

How do you find data that you don’t need and should delete? Sagoo said: “As a starting point, companies should be able to identify the specific details in their data and pinpoint both the scope of the risk and the potential value. It is also important to understand what is stored, who is accessing it, and how often it is accessed. Only then can you know what data you have and categorize it according to a customized data retention policy. Files that are no longer needed should then be deleted at least quarterly.”
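
The inventory step Sagoo describes can be sketched in a few lines of Python. This is only an illustration: the `StoredItem` fields, the one-year idle threshold and the sample paths are all assumptions made for the example, not anything Veritas prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StoredItem:
    # Hypothetical inventory entry: what is stored and when it was last used.
    path: str
    contains_pii: bool
    last_accessed: datetime


def deletion_candidates(items, now, max_idle=timedelta(days=365)):
    """Return items idle longer than max_idle, PII-bearing items first."""
    stale = [item for item in items if now - item.last_accessed > max_idle]
    # False sorts before True, so negate the flag to list PII first,
    # then order by oldest access.
    return sorted(stale, key=lambda i: (not i.contains_pii, i.last_accessed))


now = datetime(2019, 6, 1)
inventory = [
    StoredItem("crm/export_2014.csv", True, datetime(2015, 1, 10)),
    StoredItem("reports/q1_2019.pdf", False, datetime(2019, 5, 20)),
    StoredItem("logs/old_site.tar", False, datetime(2016, 3, 2)),
]
candidates = deletion_candidates(inventory, now)
for item in candidates:
    print(item.path)
```

A list like this is only a starting point for review; whether an item can actually be deleted still depends on the retention policy and any legal hold.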

According to Blair Hanley Frank, principal analyst at ISG, “Some data should never be stored for analysis. Any company that still stores user passwords in plain text in 2019 is asking for trouble.”
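
As one common alternative to plain-text storage, a password can be stored as a salted, deliberately slow hash. The sketch below uses PBKDF2 from Python’s standard library; the iteration count and function names are assumptions chosen for illustration, and a dedicated password-hashing library may be a better fit in production.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune for your own hardware


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived_key); store these instead of the password."""
    salt = os.urandom(16)
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password: str, salt: bytes, key: bytes) -> bool:
    """Recompute the key from the attempt and compare in constant time."""
    attempt = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(attempt, key)


salt, key = hash_password("correct horse")
print(verify_password("correct horse", salt, key))
print(verify_password("wrong guess", salt, key))
```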

Delete data associated with production systems that are no longer in use. For example, the user data leaked by Wetherspoon came from an old website, and the data should not have been there. The password data leaked by Adobe also came from an old non-production system. Frank said: “Enterprises can’t ignore these outdated or rarely used systems just because they are part of the old IT infrastructure.”

Pay particular attention to tracking copies of customer databases that have been extracted (usually XLS or CSV files) and handed over to developers for use as sample data.

For this purpose, you should mask the data. Masking lets you retain the relevant statistical distribution of the data for use in testing, without the risk of disclosing it.
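
A minimal sketch of that kind of masking, assuming a CSV extract with hypothetical `name` and `email` columns: the identifying fields are replaced with synthetic placeholders, while the numeric columns a tester might care about are left untouched.

```python
import csv
import io

# Hypothetical column names; adjust to the extract being masked.
IDENTIFYING_COLUMNS = ("name", "email")


def mask_rows(rows):
    """Replace identifying fields with synthetic values and keep the rest."""
    masked = []
    for n, row in enumerate(rows, start=1):
        clean = dict(row)
        for column in IDENTIFYING_COLUMNS:
            if column in clean:
                # Sequential placeholder, not derived from the real value.
                clean[column] = f"{column}_{n:05d}"
        masked.append(clean)
    return masked


extract = io.StringIO(
    "name,email,age,order_total\n"
    "Ann Lee,ann@example.com,34,120.50\n"
    "Bob Roy,bob@example.com,51,15.00\n"
)
sample = mask_rows(csv.DictReader(extract))
for row in sample:
    print(row)
```

The placeholders are deliberately not hashes of the original values. The remaining columns can still act as identifiers in combination, so a masked extract should be treated with care rather than as fully anonymous.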

Delphix director Benjamin Ross pointed out: “Non-production development and testing environments are very important, but they bring a lot of risk and are often the weak point in GDPR compliance.”

Don’t “de-identify”, just delete

Keep data only for current business needs, rather than in the vague hope that a machine learning system will find something useful in it. Callas pointed out that even Andreessen Horowitz, an investor in artificial intelligence startups, is questioning whether collecting large amounts of data is valuable. Callas said: “There is a mystical belief that having such a ‘data moat’ confers a sustainable competitive advantage, but as investors, their experience tells them this is not the case. You might think it will make your business better, but in reality that is unlikely.”

According to Mary L. Gray, a senior researcher at Microsoft Research, this applies especially to the personally identifiable information (PII) in the data sets you are considering for training machine learning models. She said: “Since GDPR, we should very strictly limit what PII can be collected, who can access it, and what audit measures are used to account for where, when and how PII is repurposed and sold to entities other than the company that collected it, and for how long those entities can keep it.”

Nor is “de-identified” data guaranteed to be safe, because with enough data individuals can still be re-identified, even unintentionally. She warned: “It is really nonsense to think that it is possible to permanently de-identify collected data.”

She continues, “The data-centric technology industry has yet to find a way to completely erase data, let alone stop collecting it altogether. The industry finally agreed to hash PII: this is the equivalent of running a black marker over it. But they can still collect everything we do. If you can predict what someone is doing and where they are doing it, then they still have a digital footprint, which is no different from having the PII in the picture.”
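
To illustrate why hashing alone offers limited protection, here is a small sketch assuming PII (an email address) was hashed with plain, unsalted SHA-256. The addresses and the record are invented for the example. Anyone who holds a list of candidate addresses can hash them too and match the results.

```python
import hashlib


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# A "de-identified" record keyed by a hashed email address.
hashed_record = {"user": sha256_hex("ann@example.com"), "searches": 17}

# An attacker's list of candidate addresses, e.g. from another leak.
candidates = ["bob@example.com", "ann@example.com", "cy@example.com"]
lookup = {sha256_hex(address): address for address in candidates}

# The hash is matched back to the original address.
recovered = lookup.get(hashed_record["user"])
print(recovered)
```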

She added that although removing obvious identifiers (e.g., name and date of birth) is simple, PII can still end up in “de-identified” data, for example when a user enters their full name into a field that is not marked as a name field.

Gray explains: “That’s why it’s hard to guard against data breaches. You can take a set of email addresses, another set of geolocation metadata, and a third set of search queries, and run enough combinations of this data to produce a search string that yields a name, date of birth and location, identifying the person associated with an email address.”
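
The combination Gray describes can be sketched as a simple join. The example assumes, purely for illustration, that three separately harmless-looking data sets share a pseudonymous user ID; all of the values are invented.

```python
# Three hypothetical data sets keyed by the same pseudonymous ID.
emails = {"u1": "ann@example.com", "u2": "bob@example.com"}
locations = {"u1": "Leeds", "u2": "York"}
searches = {"u1": ["ann lee born 1985"], "u2": ["weather york"]}


def build_profile(user_id):
    """Join the three data sets on their shared identifier."""
    return {
        "email": emails.get(user_id),
        "location": locations.get(user_id),
        "searches": searches.get(user_id, []),
    }


profile = build_profile("u1")
print(profile)
```

None of the three sets names a person on its own, yet the joined profile ties an email address to a location and a search string containing a name and birth year.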

Frank warned that this potentially harmful data could even slow down a company’s data strategy. “Having a lot of basically useless information increases the amount of time people spend building and testing models, making it more difficult to analyze useful data,” he said. “To solve this problem, companies should proactively judge the value of information and test the data to see if it has predictive value.”

Scott Guthrie, executive vice president of Microsoft’s Cloud and Artificial Intelligence Division, recommends reducing the data you store and anonymizing it as far as possible. He said: “If you collect telemetry on web searches, do you need to store the exact location of the person doing the search? Or can you anonymize it to the street level or some other level, so that whether or not you have a data leak, privacy is not violated?”
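
One simple way to store location at a coarser level is to round the coordinates before they are written anywhere. The sketch below is an assumption about how this might be done, not Microsoft’s method: three decimal places correspond to roughly 110 metres of latitude.

```python
def coarsen(lat: float, lon: float, decimals: int = 3) -> tuple[float, float]:
    """Round coordinates so that only an approximate position is kept."""
    return round(lat, decimals), round(lon, decimals)


precise = (51.50735, -0.12776)  # invented example coordinates
stored = coarsen(*precise)
print(stored)
```

Rounding alone does not guarantee anonymity, particularly in sparsely populated areas where a single household may be the only one in a grid cell, so the level of precision should be chosen with the population density in mind.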

If you don’t have the data, no one can abuse it

Callas said: “Don’t ask, ‘Why should I throw this data away?’ Instead, ask, ‘Why should I keep it?’ Unless you know why you want to keep the data, you should get rid of it, because in the current environment we can collect more, and more up-to-date, data at lower cost. This can be done by offering an opt-in on your website, rewarding people for filling out a questionnaire, or collecting telemetry from test software.”

He pointed out that once you have thrown away the PII, you may well find that what is left is what you wanted anyway.

Callas said: “If a bus authority is running a survey because it wants to know what people are doing, then it really does need accurate data, and it makes sense to pay for it. But you should run the data through a data grinder, throw away the raw data, and then dispose of the data completely after a year. For example, if you want to figure out which roads to repair, you don’t need data on the roads you have just repaired, especially data showing that you have repaired them. Every piece of data on a newly repaired road is toxic: no benefit, only harm.”
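
A “data grinder” of this kind might look like the following sketch, which reduces raw survey rows to counts and then discards the rows. The field layout and sample values are hypothetical.

```python
from collections import Counter

# Hypothetical raw survey rows: (rider_id, route, hour_of_day)
raw_trips = [
    ("rider-1", "route-7", 8),
    ("rider-2", "route-7", 8),
    ("rider-1", "route-3", 17),
]


def grind(trips):
    """Reduce raw trips to counts per (route, hour), dropping rider IDs."""
    return Counter((route, hour) for _rider, route, hour in trips)


summary = grind(raw_trips)
raw_trips.clear()  # discard the raw data once the aggregate exists
print(summary)
```

Very small counts can still point to individuals, so an aggregate like this is less risky than the raw data rather than risk-free.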

There must also be clear policies on how long data is retained, for example that log files are kept for less than one week (except for debugging). Callas recommends establishing “forcing functions” to ensure these decisions get made. “If I say, ‘I’ll put everything in my data warehouse and delete it after ten years unless you tell me why you want to keep it,’ then you have to work out why you want to put the data into the data warehouse.”
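
A retention policy with a built-in forcing function could be sketched as follows. The one-week log limit and the ten-year warehouse limit echo the figures above; the record layout and the `keep_reason` field are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumed policy table: data kind -> maximum age before deletion.
RETENTION = {
    "log": timedelta(days=7),
    "warehouse": timedelta(days=3650),
}


def expired(records, now):
    """Return records past their limit that have no stated reason to keep."""
    result = []
    for record in records:
        limit = RETENTION.get(record["kind"])
        too_old = limit is not None and now - record["created"] > limit
        # Forcing function: without an explicit reason, old data is deleted.
        if too_old and not record.get("keep_reason"):
            result.append(record)
    return result


now = datetime(2019, 6, 1)
records = [
    {"id": 1, "kind": "log", "created": datetime(2019, 5, 1)},
    {"id": 2, "kind": "log", "created": datetime(2019, 5, 30)},
    {"id": 3, "kind": "log", "created": datetime(2019, 4, 1),
     "keep_reason": "debugging incident"},
]
to_delete = expired(records, now)
print([record["id"] for record in to_delete])
```

The default here is deletion: a record survives past its limit only if someone has written down why it should.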