Researchers say genetic genealogy databases can be leveraged to unlock more sensitive genetic information.
Researchers have shown that it's possible to link your identity to supposedly secret genetic information about your predisposition to diseases, merely by analyzing family-tree databases and other publicly available information.
"It was quite surprising," said Yaniv Erlich, a genetic researcher at the Whitehead Institute for Biomedical Research. "When we got the first family, I was surprised. ... It's as if you opened a box that for a long time was locked."
Erlich led the research team whose work is being published in this week's issue of the journal Science. The team's study already has led to a tightening of security measures for federally sponsored genetic databases.
The security-cracking trick relies on the availability of genetic information linked to surnames in a variety of public family-tree databases. DNA samples from males can be tested to look at dozens of genetic markers on the Y-chromosome that change only rarely from generation to generation. If the markers from two individuals with the same surname are a close match, that's a tip-off that the two are closely related, even if they don't know each other.
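To make the matching idea concrete, here is a minimal sketch of how two sets of Y-chromosome markers might be compared. The article does not say how the databases score a match, so the scoring rule (summing the differences in repeat counts), the one-step threshold and the function names are all assumptions made for illustration. The marker values are invented and do not belong to any real person.

```python
from typing import Dict

# A haplotype here is a mapping of marker name -> repeat count.
Haplotype = Dict[str, int]


def genetic_distance(a: Haplotype, b: Haplotype) -> int:
    """Sum of repeat-count differences over the markers both people tested.

    This is an assumed scoring rule for illustration, not the method
    used by any particular database.
    """
    shared = a.keys() & b.keys()
    if not shared:
        raise ValueError("no markers in common")
    return sum(abs(a[m] - b[m]) for m in shared)


def is_close_match(a: Haplotype, b: Haplotype, max_distance: int = 1) -> bool:
    """Treat two haplotypes as a close match if they differ by at most
    max_distance steps (a hypothetical threshold)."""
    return genetic_distance(a, b) <= max_distance


# Invented values for three men who share a surname.
person_a = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
person_b = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 11}
person_c = {"DYS393": 12, "DYS390": 22, "DYS19": 15, "DYS391": 10}

print(genetic_distance(person_a, person_b), is_close_match(person_a, person_b))
print(genetic_distance(person_a, person_c), is_close_match(person_a, person_c))
```

In this toy example, the first two men differ at a single marker and would be flagged as likely relatives, while the third differs at every marker and would not.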
Tens of thousands of people (including yours truly) make that information public in hopes that someone else will match up with their results. The genealogical markers aren't linked to disease or other specific traits. But under the right circumstances, they could provide an opening for links with other, more sensitive genetic information.
How the secrets were revealed
Erlich and his colleagues conducted a three-step process to see how easy it'd be to use that opening. First, they analyzed anonymous Y-chromosome data from a public database for the 1000 Genomes Project, to come up with the DNA coding for markers that are used for genealogical purposes. Then they compared those markers against entries in the two largest family-tree databases, Ysearch and the Sorenson Molecular Genealogy Foundation.
The researchers said their analysis projected a success rate of 12 percent for recovering the surnames of U.S. Caucasian males. Another 5 percent would theoretically be linked up with the wrong surnames. They said upper- to middle-class Caucasian males were easier to identify, presumably because they're more likely to participate in the family-tree databases.
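Those two percentages can be combined into a rough back-of-the-envelope figure. The short calculation below uses only the 12 percent and 5 percent rates quoted above. It assumes, as the article implies but does not state, that the remaining cases produce no surname at all.

```python
# Projected rates quoted by the researchers for U.S. Caucasian males.
correct_rate = 0.12  # surname recovered correctly
wrong_rate = 0.05    # linked to the wrong surname

# Assumption: every other case yields no surname guess at all.
returned_rate = correct_rate + wrong_rate
share_correct = correct_rate / returned_rate

print(f"Cases that yield some surname: {returned_rate:.0%}")
print(f"Share of those guesses that are right: {share_correct:.0%}")
```

Under that assumption, roughly seven out of 10 surnames that the method returns would be the right one.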
Once the surnames were identified, the third step was to look at other publicly available sources to go from the surname to a specific individual: Some genetic databases, for example, include information about the age and the state of residence of an anonymous participant, and even the number of children and their birth order. Those clues were added to information gleaned from other sources, ranging from public-record search engines to obituaries.
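The third step amounts to filtering a pool of public records by the few clues that travel with an anonymous sample. The sketch below shows the general idea only. The record layout, the function name, the one-year tolerance on birth year and every entry in the list are hypothetical, and the article does not describe the researchers' actual search tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PublicRecord:
    """A made-up stand-in for one entry in a public-record search."""
    name: str
    surname: str
    birth_year: int
    state: str


def narrow_candidates(
    records: List[PublicRecord],
    surname: str,
    age: int,
    state: str,
    sample_year: int,
    tolerance: int = 1,
) -> List[PublicRecord]:
    """Keep records that match the inferred surname, the state of residence
    and, within a small tolerance, the birth year implied by the age."""
    target_birth_year = sample_year - age
    return [
        r
        for r in records
        if r.surname.lower() == surname.lower()
        and r.state == state
        and abs(r.birth_year - target_birth_year) <= tolerance
    ]


# Entirely fictional records.
records = [
    PublicRecord("Person One", "Example", 1941, "Utah"),
    PublicRecord("Person Two", "Example", 1975, "Utah"),
    PublicRecord("Person Three", "Example", 1940, "Ohio"),
    PublicRecord("Person Four", "Sample", 1940, "Utah"),
]

matches = narrow_candidates(
    records, surname="Example", age=70, state="Utah", sample_year=2010
)
print([r.name for r in matches])
```

Each added clue shrinks the pool. In this toy list, the surname alone leaves three candidates, and adding state and approximate age leaves one.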
The researchers linked five specific individuals in three separate families with supposedly anonymous genetic records. The process took three to seven hours for each family pedigree, the scientists said. Then they traced those three family-tree pedigrees to find other connections between relatives and sensitive genetic data. "In total, surname inference breached the privacy of nearly 50 individuals from these three pedigrees," the researchers wrote.
"We show that if, for example, your Uncle Dave submitted his DNA to a genetic genealogy database, you could be identified," Melissa Gymrek, a member of the Erlich Lab and the Science paper's principal author, explained in a news release. "In fact, even your fourth cousin Patrick, whom you've never met, could identify you if his DNA is in the database, as long as he's paternally related to you."
What is to be done with data?
Erlich and his colleagues made a point not to reveal the identities of those individuals, and said they were not advocating a clampdown on the availability of genetic information.
"Quite the opposite," Erlich said. "We found the gene for two devastating pediatric disorders by analyzing the data in public databases. Using these databases, we gave hope to these families and to other parents. We don't want to take away these databases. ... What we really want to do here is to have this really mature conversation about privacy — to tell people we cannot completely protect the privacy, but also to tell them about the benefits."
For years, experts have worried that sensitive genetic data could be used to discriminate against patients, potential employees or would-be insurance customers. Such discrimination is illegal when it comes to employment or health insurance, but the law doesn't cover life insurance, disability insurance or long-term care insurance. Theoretically, an insurer could search through genetic records and turn you down because you have a genetic predisposition to, say, Alzheimer's disease.
In a Science policy paper, representatives of the National Human Genome Research Institute and the National Institute of General Medical Sciences at the National Institutes of Health said it was time to "re-examine how to balance the protection of research participants ... with the societal benefits likely to be gained through the enhanced research that broad data sharing facilitates."
They said NIH "acted swiftly to mitigate future risks" by working with the NIGMS' genetic repository to shift the data about the age of study participants out of public view and into a controlled-access area of the database.
"That reduces the risk," Erlich said. "It creates another fence."
And what about the genealogical genetic data? Max Blankfeld, vice president for operations and marketing at Family Tree DNA, said his company has been dealing with privacy issues for more than a decade — and doesn't expect the latest research to lead to policy changes. Family Tree DNA has been running the Ysearch database as a free public resource for a decade, but does not force any of its more than 400,000 participants to use it.
"People voluntarily post their information in that database, and therefore it has nothing really to do with the vast majority of the people who take the test and choose to have it protected by Family Tree DNA," Blankfeld said. "This data, we don't share with anyone."
In addition to Erlich and Gymrek, the authors of "Identifying Personal Genomes by Surname Inference" include Amy McGuire, David Golan and Eran Halperin. The work was supported by the National Defense Science and Engineering Graduate Fellowship, the Edmond J. Safra Center for Bioinformatics at Tel Aviv University, and a gift from James and Cathleen Stone.
The authors of the Science policy paper, "The Complexities of Genomic Identifiability," include Laura Rodriguez, Lisa Brooks and Eric Green of NHGRI and Judith Greenberg of NIGMS.
Alan Boyle is NBCNews.com's science editor, and also the administrator of the Boyle Surname Project at Family Tree DNA.