Last fall, Adblock Plus founder Wladimir Palant analyzed the Avast Online Security, AVG Online Security, Avast SafePrice and AVG SafePrice browser extensions and concluded that Avast uses its popular antivirus software to collect user data and then sell it. The uproar quickly died down after Avast chief executive Ondrej Vlcek assured users that the collected data was anonymized as thoroughly as possible and stripped of any link to a particular person's identity.

“Our company does not allow advertisers or any third party to gain access through Avast, or to any data that would allow a third party to target a specific person,” he said.

However, a study conducted by students at Harvard University shows that anonymizing collected information is far from a guarantee against “deanonymization”, that is, revealing a person’s identity from the data available in a database. The students built a tool that combs through huge consumer data sets that have ended up in the open as a result of negligence, hacking, or some other kind of leak.

The program was fed every database that has leaked online since 2015, including MyHeritage account data and user data from Equifax, Experian and others. Although many of these databases contain “anonymized” information, the students say that identifying real users was not especially difficult.

The principle of operation is quite simple. The program takes a list of identifying information (an e-mail address or a person’s name) and then scans all the leaked databases for records that match it. Each match gives the students more information about the person, and sometimes this information is enough to identify that person unambiguously.
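
The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the students' actual tool: the leak names, records and the `build_profile` helper are all invented, and each "leak" is assumed to be a list of records that carry an `email` field.

```python
from collections import defaultdict

# Hypothetical toy "leaks": every source name, record and field is invented.
LEAKS = {
    "genealogy_site": [
        {"email": "alice@example.com", "name": "Alice Example"},
        {"email": "bob@example.com", "name": "Bob Sample"},
    ],
    "credit_bureau": [
        {"email": "alice@example.com", "zip": "02138", "birth_year": 1990},
    ],
    "forum": [
        {"email": "ALICE@example.com ", "username": "al1ce", "password": "123456"},
    ],
}


def normalize(email):
    """Make e-mail comparison insensitive to case and stray whitespace."""
    return email.strip().lower()


def build_profile(email, leaks):
    """Collect every attribute that any leak holds for one e-mail address."""
    target = normalize(email)
    profile = defaultdict(set)
    sources = []
    for source, records in leaks.items():
        for record in records:
            if normalize(record.get("email", "")) != target:
                continue
            sources.append(source)
            for field, value in record.items():
                if field != "email":
                    profile[field].add(value)
    return dict(profile), sources


profile, sources = build_profile("alice@example.com", LEAKS)
print(sources)
print(profile)
```

In this toy data no single source says much about the person, but the merged profile holds a name, a postal code, a birth year, a username and a password.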

Gathering the pieces of your identity

An individual leak is like a puzzle piece. By itself it is not particularly useful, but when numerous leaks are combined into a single database, they can give a surprisingly clear picture of who a person is. People may forget about these leaks, but hackers can still use the data long afterwards. All they need is to collect a few more puzzle pieces.

Imagine that one company stores only user names, passwords, email addresses and other basic account information, while another stores your browsing history and search queries or your location data. Neither set of information alone is enough to identify you, but together they may reveal numerous personal details that even your closest friends and family may not know about.

The purpose of the students’ research is to show that such data collection, no matter how anonymized, still poses a potential threat to users. A dataset from one source can easily be linked to another through a value that is present in both sets. So you should not assume that your personal information is safe just because the company collecting and storing it assures you that it has been completely anonymized.
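
To make the linking step concrete, here is a minimal sketch that assumes two data sets share one field. The `user_hash` key, the sample rows and the `link` function are hypothetical and chosen only for illustration; in practice the shared value could be any field that appears in both sets.

```python
# Hypothetical data: one set carries names, the other is "anonymized"
# (no names) but still includes the same user_hash value.
accounts = [
    {"user_hash": "a1f3", "name": "Alice Example"},
    {"user_hash": "9c2e", "name": "Bob Sample"},
]
browsing = [
    {"user_hash": "a1f3", "query": "flights to Lisbon"},
    {"user_hash": "a1f3", "query": "pharmacy near me"},
    {"user_hash": "ffff", "query": "weather"},
]


def link(named, anonymous, key):
    """Attach the named record to every anonymous row sharing the same key."""
    index = {row[key]: row for row in named}
    linked = []
    for row in anonymous:
        match = index.get(row[key])
        if match is not None:
            linked.append({**match, **row})
    return linked


linked = link(accounts, browsing, "user_hash")
for row in linked:
    print(row["name"], "searched for", row["query"])
```

The rows with no counterpart in the named set stay anonymous; every other row now has a name attached, even though the browsing data never contained one.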

There is other evidence for this. For example, in one British study, machine-learning researchers built a model that could correctly re-identify 99.98% of Americans in any anonymized dataset using only 15 characteristics. Another study, by researchers at the Massachusetts Institute of Technology, showed that users can be identified in 90% of cases from just four basic data points.

In other words, each leak is painful enough on its own, but taken together they become a real nightmare.

Companies are not the only problem

But the companies are not the only ones to blame. Despite the many scandals over leaks of confidential data, which have become almost a weekly occurrence, the public greatly underestimates the impact of these leaks and hacks on personal security and therefore ignores basic security measures. For instance, after analyzing one of the data sets the program produced, the Harvard students found that of the 96,000 passwords it contained, only 26,000 were unique.
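
A count like this is easy to reproduce on any password list. The sketch below uses an invented six-entry sample and a hypothetical `password_stats` helper, and it assumes that "unique" means distinct values; by that reading, 26,000 distinct passwords among 96,000 is roughly 27%.

```python
from collections import Counter


def password_stats(passwords):
    """Count total entries, distinct values and the most reused passwords."""
    counts = Counter(passwords)
    return {
        "total": len(passwords),
        "distinct": len(counts),
        "most_common": counts.most_common(2),
    }


# Invented sample for illustration only.
sample = ["123456", "12345", "123456", "hunter2", "12345", "123456"]
stats = password_stats(sample)
print(stats)
print(f"{stats['distinct'] / stats['total']:.0%} of entries are distinct")
```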

People are simply too lazy to come up with anything complicated and fall back on template passwords. The most common are “12345” and “123456”. With protection like that, no technology can save you from being hacked. It is difficult to protect people’s data if they make no effort to protect it themselves.

What else can I add? So much has been said about the importance of unique passwords that there is no point in repeating it. And companies will continue to collect data, reassuring us with promises to anonymize everything as much as possible. But, as you can see, these promises cannot always be trusted.