An Approach For Privacy-Protecting Big Data

Posted on February 6, 2017 | Written By

Anne Zieger is a veteran healthcare editor and analyst with 25 years of industry experience. Zieger formerly served as editor-in-chief of FierceHealthcare.com, and her commentaries have appeared in dozens of international business publications, including Forbes, Business Week and Information Week. She has also contributed content to hundreds of healthcare and health IT organizations, including several Fortune 500 companies. She can be reached at @ziegerhealth or www.ziegerhealthcare.com.

There’s little doubt that the healthcare industry is zeroing in on some important discoveries as providers and researchers mine collections of clinical and research data. Big data does come with some risks, however, with some observers fearing that aggregated and shared information may breach patient privacy. However, at least one study suggests that patients can be protected without interrupting data collection.

In what it calls a first, a new study appearing in the Journal of the American Medical Informatics Association has demonstrated that protecting the privacy of patients can be done without too much fuss, even when the patient data is pulled into big data stores used for research.

According to the study, a single patient anonymization algorithm can offer a standard level of privacy protection across multiple institutions, even when they are sharing clinical data back and forth. Researchers say that larger clinical datasets can protect patient anonymity without generalizing or suppressing data in a manner which would undermine its use.

To conduct the study, researchers pitted a simulated privacy adversary against the system. This adversary, who had obtained a patient's diagnoses from a single unspecified clinic visit, was asked to match them to a record in a de-identified research dataset known to include that patient. The researchers drew their data from Vanderbilt University Medical Center, Northwestern Memorial Hospital in Chicago and Marshfield Clinic.
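The attack described above is, at its core, a record-linkage test. The sketch below is purely illustrative (the record IDs, diagnosis codes, and function names are hypothetical, not drawn from the study): the adversary knows the diagnosis codes from one visit and looks for the de-identified records consistent with them; a single surviving match re-identifies the patient.

```python
# Illustrative sketch of the adversary's matching attack; all data and
# names here are hypothetical, not taken from the JAMIA study.

def matching_records(known_codes, dataset):
    """Return records whose diagnosis codes include every code the
    adversary observed. Exactly one match means re-identification."""
    known = set(known_codes)
    return [rec for rec in dataset if known <= set(rec["codes"])]

# Toy de-identified dataset: record IDs with sets of ICD-10 codes.
dataset = [
    {"id": "rec-001", "codes": {"E11.9", "I10"}},             # diabetes, hypertension
    {"id": "rec-002", "codes": {"E11.9", "I10", "J45.909"}},  # plus asthma
    {"id": "rec-003", "codes": {"I10"}},
]

# The adversary saw diabetes plus asthma at a single clinic visit.
hits = matching_records({"E11.9", "J45.909"}, dataset)
print([r["id"] for r in hits])  # prints ['rec-002'] — a unique, identifying match
```

Note that the common code alone (`I10`) matches all three records, so by itself it identifies no one; it is the unusual combination that makes a record stick out.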

The researchers knew that according to prior studies, the more data associated with each de-identified record, and the more complex and diverse the patient’s problems, the more likely it was that their information would stick out from the crowd. And that would typically force managers to generalize or suppress data to protect patient anonymity.
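To make the two protections concrete, here is a minimal sketch of what generalization and suppression can look like in practice. This is not the study's algorithm; it is a simple k-anonymity-style illustration with hypothetical codes, in which rare diagnosis codes are rolled up to a broader ICD-10 category, and any code still shared by too few records is dropped.

```python
from collections import Counter

def generalize(code):
    """Roll an ICD-10 code up to its 3-character category, e.g. 'E11.9' -> 'E11'."""
    return code.split(".")[0]

def protect(records, k=2):
    """Generalize codes held by fewer than k records; suppress any that
    remain rare even after generalization."""
    counts = Counter(c for rec in records for c in set(rec))
    generalized = []
    for rec in records:
        out = set()
        for code in rec:
            if counts[code] >= k:
                out.add(code)               # common enough: keep as-is
            else:
                out.add(generalize(code))   # rare: blur to the category
        generalized.append(out)
    # Suppress any generalized code that is still shared by fewer than k records.
    gcounts = Counter(c for rec in generalized for c in rec)
    return [{c for c in rec if gcounts[c] >= k} for rec in generalized]

records = [{"E11.9", "I10"}, {"E11.21", "I10"}, {"I10", "Z99.89"}]
print(protect(records))
# The two distinct diabetes codes blur into 'E11'; the lone 'Z99.89'
# cannot be saved by generalization and is suppressed entirely.
```

The trade-off the researchers measured follows directly from this: every generalized or suppressed code costs the downstream researcher some detail, which is why the question of how much blurring is needed matters.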

In this case, the team hoped to find out how much generalization and suppression would be necessary to protect identities found within the three institutions' data, and afterward, whether the protected data would still be of use to future researchers.

The team processed relatively small datasets from each institution representing patients in a multi-site genotype-disease association study; larger datasets representing patients in the three institutions' banks of de-identified DNA samples; and large sets standing in for each institution's EMR population.

Using the algorithm they developed, the team found that most of the data’s value was preserved despite the occasional need for generalization and suppression. On average, 12.8% of diagnosis codes needed generalization; the medium-sized biobank models saw only 4% of codes needing generalization; and among the large databases representing EMR populations, only 0.4% needed generalization and no codes required suppression.

More work like this is clearly needed as the demand for large-scale clinical, genomic and transactional datasets grows. But in the meantime, this seems to be good news for budding big data research efforts.