My Research

Since the Cambridge Analytica scandal, skepticism about academic research that collects digital data has been heightened. This is an important conversation, and a recent article in Fast Company focused it on a study I am conducting. Researchers who want to understand today’s digital world have two options for gathering data: asking users directly through informed consent, or obtaining data from corporations (e.g., Facebook or research panel companies) that use opaque terms of service or recruitment methods. Partnering with users is more transparent and open to scrutiny, and I welcome this opportunity to explain how I am conducting this study.

I have made the study fully transparent for participants who choose to volunteer. In addition to the standard IRB process at American University:

  • I developed and posted open source software that provides interactive visualizations of the web browsing data on participants’ own computers, so they can review it, delete anything they choose, and then decide whether to consent to its collection, which is optional.
  • I also conducted an experimental study to determine whether these data visualizations actually inform users about what web browsing data contains and how it can be used. The visualizations performed significantly better than an informed consent form alone. I will be releasing the results of this study online next week.
  • I am also carefully safeguarding the data I have been entrusted with. It is encrypted during transmission and, once downloaded, stored on an offline hard disk inaccessible from the Internet.

There is nothing deceptive about this study, unlike the Cambridge Analytica data collection, and the data will be used only for this research project. So far our ads have reached 126,476 Facebook users and received 51 comments, 15 of which expressed mistrust related to the Cambridge Analytica scandal. We responded promptly and, I believe, appropriately.

Doing this work involves developing trust with potential participants, and this is a difficult time to do so. Nevertheless, I believe it is vitally important that research about digital life continues and that data collection is possible for academic researchers concerned with the impact of these technologies on social life, rather than restricted to corporate use.

Source code for the research extension: https://github.com/WebHistorian/community
Public version of the extension, free to use with no data collection: https://chrome.google.com/webstore/detail/web-historian-edu-history/chpcblajbmmlbhecpnnadmjmlbhkloji
The website about this extension: http://webhistorian.org/

Privacy & Open Science: Universal Numerical Fingerprint

There is a tension between open, transparent science and privacy concerns. I have worked, and will continue to work, with real-world web history data, which, even when participants contribute it with fully informed consent, can contain quite a bit of private information. Because of its level of detail, it would be difficult to anonymize in a way that did not strip away its utility.

What’s an open science advocate to do? Enter the universal numerical fingerprint (UNF):

The universal numerical fingerprint begins with “UNF”. Four features make the UNF especially useful:

  • The UNF algorithm’s cryptographic technology ensures that the alphanumeric identifier will change when any portion of the data set changes. Not only does this assure future researchers that they can use the same data set referenced in a years-old journal article, it enables the data set’s owner to track each iteration of the owner’s research. When an original data set is updated or incorporated into a new, related data set, the algorithm generates a unique UNF each time.
  • The UNF is determined by the content of the data, not the format in which it is stored. For example, you create a data set in SPSS, Stata or R, and five years later you need to look at your data set again, but the data was converted to the next big thing (NBT). You can use NBT, recompute the UNF, and verify for certain that the data set you’re downloading is the same one you created originally. That is, the UNF will not change.
  • Knowing only the UNF, journal editors can be confident that they are referencing a specific data set that never can be changed, even if they do not have permission to see the data. In a sense, the UNF is the ultimate summary statistic.
  • The UNF’s noninvertible, cryptographic properties guarantee that acquiring the UNF of a data set conveys no information about the content of the data. Authors can take advantage of this property to distribute the full citation of a data set, including the UNF, even if the data is proprietary or highly confidential, all without the risk of disclosure.

Source: http://best-practices.dataverse.org/data-citation/#data-citation-standard
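The properties described above, a fingerprint determined by content rather than storage format, and noninvertible so it reveals nothing about the data, can be illustrated with a toy example. This is a loose sketch inspired by the UNF idea, not the actual UNF v6 algorithm; the normalization rules and digest truncation here are simplified assumptions:

```python
import base64
import hashlib

def normalize(value, digits=7):
    """Canonicalize a value so equivalent data yields identical bytes.

    Numbers are rounded to `digits` significant figures and rendered in a
    canonical exponential form (so 1, 1.0, and "1.00" read from different
    file formats all normalize alike if parsed as numbers); other values
    are UTF-8 encoded as strings. Each value is terminated with a newline
    and a null byte to unambiguously separate fields.
    """
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        text = f"{float(value):+.{digits - 1}e}"
    else:
        text = str(value)
    return text.encode("utf-8") + b"\n\x00"

def fingerprint(values):
    """Hash the normalized values and return a short base64 digest."""
    h = hashlib.sha256()
    for v in values:
        h.update(normalize(v))
    return base64.b64encode(h.digest()[:16]).decode("ascii")
```

Because the hash runs over normalized content, converting the data set between file formats leaves the fingerprint unchanged, while editing a single value changes it; and since SHA-256 is a one-way function, publishing the fingerprint discloses nothing about the underlying data.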

Data fraud is not particular to graduate students

In the wake of the data fabrication scandal involving Michael LaCour (a political science graduate student at UCLA) that erupted last week (evidence, article, hashtag), I’ve heard several professor friends worry that their own students could have faked data, since they didn’t have procedures in place to catch fraud. Advisor-student relationships are often family-like, such that your advisor’s advisor is often half-jokingly referred to as your grand-advisor. Advisors, like parents, vary widely in the trust they place in their ‘children.’ However, data fraud is not a particular penchant of graduate students.

Take the case of Diederik Stapel, a social psychologist in the Netherlands who faked studies for many years, including the data for studies on which his students based their dissertations. A more powerful supervisor is much more likely to harm a graduate student than the other way around. While I am absolutely in favor of common-sense, transparent procedures to protect data integrity, like those Thomas Leeper describes, I hope this incident doesn’t inspire paranoia among graduate advisors, or anyone else. I suspect it is quite rare that people are willing to risk their career and reputation forever by fabricating data.

This makes such cases quite interesting, as my web browsing history visualization from last Friday shows.

[Image: laCourDay web browsing history visualization]

Web page refresh

With a blog post coming out on The Policy and Internet Blog tomorrow, it felt like time to refresh my personal website and start from a clean WordPress installation. I’ll be doing a lot of “reinstallation” over the next few months, as I’ll be moving back to the US after a wonderful nearly three years in the Netherlands at Erasmus University Rotterdam, to a position in the School of Communication at American University in Washington, D.C. I also just created another new website for a Chrome browser extension I’m working on called Web Historian, which showed me how easy WordPress is to work with these days. This website is currently pretty simple and I hope to keep it that way :)
