The New Media Lawyer: The Problem of Anonymized Data

Over the past three or so years, both OkCupid and Facebook have published a great deal of aggregated, anonymized data for, let's say, the enlightenment and mirth of society at large. Facebook shared information on how "happy" users seemed on certain days, and OkCupid shared (as it continues to do) some interesting information about the dating preferences of certain groups.

The LA Times blog has been one of the few voices to point out the privacy implications of these data shares, saying, "Despite its silly name, the Gross National Happiness indicator is creepy. We're in there."

Several events - most notably the promotional (and seemingly innocuous) publication by Netflix of data regarding the movie preferences of its customers - have revealed fundamental security problems with so-called anonymized data (Netflix was sued in 2009 in connection with this and settled out of court). By some accounts, the widely used concept of "personally identifiable information" in web site privacy policies - which by implication suggests the existence of information about you that is somehow not "personally identifiable" - is misleading. There's been a spate of studies finding that people with PhDs have the capacity to "re-identify" so-called anonymized data, that is, to tie it back to specific individuals. The underlying principle is that ostensibly innocuous data points (such as a birth date and a ZIP code) can be tied to a specific person if coupled with some other set of information (publicly available, readily mined, etc.). It's not clear that data anonymized in accordance with best practices is easy (cheap) to re-identify, but it is clear that it can be done.
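To make the principle concrete, here is a minimal sketch of what such a linkage might look like. Everything in it is hypothetical: the records, the names, the field names and the reidentify function are invented for illustration and are not drawn from the Netflix data or from any particular study. It assumes an attacker holds a second data set that still carries names alongside the same birth date and ZIP code fields.

```python
from collections import defaultdict

# Hypothetical "anonymized" release: names removed, but birth date and
# ZIP code are left in alongside the sensitive field.
released = [
    {"birth_date": "1975-03-14", "zip": "90210", "rating": "Movie A: 5 stars"},
    {"birth_date": "1982-11-02", "zip": "10001", "rating": "Movie B: 1 star"},
    {"birth_date": "1975-03-14", "zip": "10001", "rating": "Movie C: 4 stars"},
]

# Hypothetical auxiliary data that is public or easily mined and that
# still carries names (a voter roll or a social profile, for example).
public = [
    {"name": "Alice Example", "birth_date": "1975-03-14", "zip": "90210"},
    {"name": "Bob Example", "birth_date": "1982-11-02", "zip": "10001"},
    {"name": "Carol Example", "birth_date": "1990-07-30", "zip": "60614"},
]


def reidentify(released_rows, public_rows, keys=("birth_date", "zip")):
    """Return (name, rating) pairs for released rows whose combination
    of key fields matches exactly one person in the auxiliary data."""
    # Index the named records by the shared fields.
    index = defaultdict(list)
    for person in public_rows:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for row in released_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        # Only an unambiguous match counts as a re-identification.
        if len(candidates) == 1:
            matches.append((candidates[0], row["rating"]))
    return matches


matches = reidentify(released, public)
for name, rating in matches:
    print(name, "->", rating)
```

In this toy data, two of the three "anonymous" rows link to a single named person, while the third matches no one and stays unidentified. Real attacks deal with far larger and messier data, which is part of why the cost of re-identification remains an open question.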

Thus, a quandary for businesses that want to sell or otherwise profit from anonymized data. Federal and state consumer data privacy laws don't forbid it. In fact, there are hardly any restrictions whatsoever on properly de-identified data, even if the data is sourced from financial or health-related data or data concerning children (categories for which the laws are otherwise more restrictive). Further, as most privacy policies on web sites inform users that "anonymized" data may be shared with third parties, in most circumstances users have a reasonable expectation (due to a privacy policy or otherwise - provided they had proper notice of, and consented to, the privacy policy) that their data might be released in such a form, and hence would not have contractual cause to sue in the event it ever was. Relatedly, one would not expect the FTC (or an analogous state agency) to prosecute a company for sharing anonymized data (on grounds of misleading consumers about data collection and sharing practices) so long as it took precautions to ensure the data couldn't be easily re-identified.

But this is uncertain. The lack of case law at this point makes it difficult to clearly define what constitutes a) proper precautions (if any) for anonymizing data and b) reasonable consumer expectations as to the same. By way of example, the California Supreme Court recently held, for the first time, that a ZIP code - standing alone - qualifies as "personal identification information" under California's Song-Beverly Credit Card Act. See Pineda v. Williams-Sonoma Stores, Inc., No. S178241, slip op. (Cal. Feb. 10, 2011).

Tuesday, August 2, 2011