Gathering a User’s Posts from Twitter with R

Getting started

This tutorial will give you the information you need to go from wanting to analyze Twitter data to getting your own spreadsheet of tweets. It is aimed at beginners who want to start analyzing real, current social media data. See this page for basic info about what Twitter data is available.

  1. Sign up for a normal Twitter account if you don’t already have one.
  2. Get your API credentials by registering an app here. (Yes, you are a developer.)
  3. Download R and/or RStudio if you don’t already have them. You may need to do a basic tutorial to learn how to use R or RStudio, but you can probably follow this tutorial without much understanding of R beyond how to run commands.

Install required packages (one time only)

install.packages("twitteR")
install.packages("rio")

If you haven’t done so already, set your working directory (this is where the output file will go).

Windows example: setwd("C:/Users/User Name/Documents/FOLDER")
Mac example: setwd("/Users/User Name/Documents/FOLDER")

setwd("YourWorkingDirectory")

Downloading the data

Load the required packages (previously installed)

library("twitteR")
library("rio")

Authenticate to Twitter – fill in the info you got during Getting Started Step 2 above

setup_twitter_oauth(consumer_key='YourConsumerKey', consumer_secret='YourConsumerSecret', access_token='YourAccessToken', access_secret='YourAccessSecret')

If you are analyzing a Twitter user’s timeline:

Fill in the name of the user whose timeline you want to download. Do not include the @. It is not case sensitive; that is, hillaryclinton is the same as HillaryClinton. This must be the Twitter user name, not the person’s real name.

userName = "YourUserNameToDownload"

Download the timeline – this takes about 2-3 minutes

userTl = userTimeline(userName,n=3200, includeRts = TRUE)
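
Twitter returns at most roughly the 3,200 most recent tweets from a timeline, and you may get back fewer than you asked for, so a quick sanity check can be reassuring (this check is my addition, not part of the original walkthrough):

#how many statuses came back?
length(userTl)
#peek at the text of the most recent tweet
userTl[[1]]$text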

If you are analyzing a Twitter search:

See the query operators on this page for what can be searched, and try your search on Twitter first to see whether you are getting what you want. You will probably want to increase the number of tweets, and you may or may not want to limit results to one language; if not, just delete , lang = "en". A possible variant is sketched after the code below.

searchTerm = "Your search term"
tweets = searchTwitter(searchTerm, n=1000, lang = "en")
dfT = twListToDF(tweets)
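
To illustrate the query operators mentioned above, here is one possible variant: a hashtag search that asks for more tweets and drops the language restriction (the hashtag and tweet count are made-up example values, not from the original):

#example: hashtag search, more tweets, no language filter
searchTerm = "#rstats"
tweets = searchTwitter(searchTerm, n=5000)
dfT = twListToDF(tweets)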

All types of analysis:

Flatten the data structure twitteR provides down to a simple spreadsheet-like data frame. (If you ran the search branch above, twListToDF already did this, so you can skip this step.)

dfT = twListToDF(userTl)
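
If you want to see what the flattened data frame looks like before going further, here is an optional inspection (not required for the rest of the tutorial):

#column names twitteR provides (text, created, screenName, etc.)
names(dfT)
#first few rows of the tweet text and timestamp
head(dfT[, c("text", "created")])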

Strip out any tabs or newline characters in the tweet text, storing the cleaned text in a new tweetC column, so that your TSV file will not be corrupted.

dfT$tweetC = gsub("\t"," ",dfT$text)
dfT$tweetC = gsub("\n"," ",dfT$tweetC)

Name the output TSV file YourUserNameToDownload-tweets.tsv

fileName = paste(userName,"-tweets.tsv",sep="")

Send the data frame to a TSV file on your computer, suitable for uploading to Google Sheets (or many other programs). This will OVERWRITE any existing file with the same name. You can have it append instead by adding ,append=TRUE; if you do, also add ,col.names = FALSE so the header row isn’t duplicated, and check for duplicates when you analyze the data (a sketch of this variant follows the export command below). Note that dfT still contains the original text column; the full code block at the end shows how to subset to just the cleaned columns.

export(dfT, fileName)
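
If you would rather append to an existing file than overwrite it, here is a sketch based on the options mentioned above; whether append and col.names pass through depends on the writer rio uses for .tsv files, so treat this as something to verify rather than a guarantee:

#append to an existing TSV instead of overwriting it
#col.names = FALSE avoids writing a second header row
#remember to check for duplicate tweets later if you append
export(dfT, fileName, append = TRUE, col.names = FALSE)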

Once you run the export command your file is in your working directory with the name YourUserNameToDownload-tweets.tsv. This file is now ready to import into a spreadsheet. For ‘real’ data science / research you shouldn’t use a spreadsheet for analysis, but until you’re ready to learn R and/or Python you can learn quite a lot about working with data just starting with a spreadsheet.

Here is the code in one block:

#one time only
install.packages("twitteR")
install.packages("rio")
setwd("YourWorkingDirectory")

#download data (fill in your info)
library("twitteR")
library("rio")

setup_twitter_oauth(consumer_key='YourConsumerKey', consumer_secret='YourConsumerSecret', access_token='YourAccessToken', access_secret='YourAccessSecret')

#if you are analyzing a user time line
userName = "YourUserNameToDownload"
userTl = userTimeline(userName,n=3200, includeRts = TRUE)
dfT = twListToDF(userTl)

#if you are analyzing a search
searchTerm = "Your search term"
tweets = searchTwitter(searchTerm, n=1000, lang = "en")
dfT = twListToDF(tweets)
#export
#strip out all tabs and new lines from the tweet field
dfT$tweetC = gsub("\t"," ",dfT$text) 
dfT$tweetC = gsub("\n"," ",dfT$tweetC) 
#add a link to the tweet
dfT$linkToTweet = paste("http://twitter.com/",dfT$screenName,"/status/",dfT$id,sep="")
#strip out the link from the source field 
dfT$sourceC = sub("<a href=\".*\">","",dfT$statusSource) 
dfT$sourceC = sub("</a>","",dfT$sourceC) 
#subset only fields to export 
myvars = c("tweetC","created","linkToTweet","retweetCount","isRetweet","favoriteCount" ,"id","sourceC", "replyToSN", "truncated","replyToSID","replyToUID","screenName","longitude", "latitude") 
dataExport = dfT[myvars]
fileName = paste(userName,"-tweets.tsv",sep="")
export(dataExport,fileName)

Privacy & Open Science: Universal Numerical Fingerprint

There is a tension between open and transparent science and privacy concerns. I have worked, and will continue to work, with real-world web history data, which, even when participants contribute it with fully informed consent, potentially contains quite a bit of private information. Because of the level of detail it contains, it would be difficult to anonymize in a way that didn’t strip away its utility.

What’s an open science advocate to do? Enter the universal numerical fingerprint (UNF):

The universal numerical fingerprint begins with “UNF”. Four features make the UNF especially useful:

  1. The UNF algorithm’s cryptographic technology ensures that the alphanumeric identifier will change when any portion of the data set changes. Not only does this assure future researchers that they can use the same data set referenced in a years-old journal article, it enables the data set’s owner to track each iteration of the owner’s research. When an original data set is updated or incorporated into a new, related data set, the algorithm generates a unique UNF each time.
  2. The UNF is determined by the content of the data, not the format in which it is stored. For example, you create a data set in SPSS, Stata or R, and five years later, you need to look at your data set again, but the data was converted to the next big thing (NBT). You can use NBT, recompute the UNF, and verify for certain that the data set you’re downloading is the same one you created originally. That is, the UNF will not change.
  3. Knowing only the UNF, journal editors can be confident that they are referencing a specific data set that never can be changed, even if they do not have permission to see the data. In a sense, the UNF is the ultimate summary statistic.
  4. The UNF’s noninvertible, cryptographic properties guarantee that acquiring the UNF of a data set conveys no information about the content of the data. Authors can take advantage of this property to distribute the full citation of a data set, including the UNF, even if the data is proprietary or highly confidential, all without the risk of disclosure.

Source: http://best-practices.dataverse.org/data-citation/#data-citation-standard
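
If you want to try this from R, there is a CRAN package, UNF (maintained by Thomas Leeper), that computes these fingerprints for data frames. Here is a minimal sketch, assuming that package and its unf() function; check its documentation for the exact interface:

#one time only
install.packages("UNF")
library("UNF")

#compute the fingerprint of a data frame (iris is just a built-in stand-in)
fingerprint = unf(iris)
print(fingerprint)

#recomputing on an unchanged copy of the data gives the identical fingerprint
identical(unf(iris), fingerprint)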


Data fraud is not particular to graduate students

In the wake of the Michael LaCour (a political science graduate student at UCLA) data fabrication scandal that erupted last week (evidence, article, hashtag), I’ve heard several professor friends worry that their own students could have faked data, since they didn’t have procedures in place to catch fraud. Advisor-student relationships are often family-like, such that your advisor’s advisor would often half-jokingly be referred to as your grand-advisor. Advisors, like parents, range widely in the trust they place in their ‘children.’ However, data fraud is not particular to graduate students.

Take the case of Diederik Stapel, a social psychologist in the Netherlands who faked studies for many years, including the data for studies on which his students based their dissertations. The more powerful supervisor is much more likely to harm the graduate student than the other way around. While I am absolutely in favor of common-sense transparent procedures to protect data integrity, like those Thomas Leeper describes, I hope this incident doesn’t inspire paranoia on the part of graduate advisors, or anyone else. I suspect it is quite rare that people are willing to risk their career and reputation forever by fabricating data.

This makes such cases quite interesting, as my web browsing history visualization from last Friday shows.

[Image: laCourDay web browsing history visualization]


Web page refresh

With a blog post coming out on The Policy and Internet Blog tomorrow, it felt like time to refresh my personal website and start from a clean WordPress installation. I’ll be doing a lot of “reinstallation” over the next few months, as I’ll be moving back to the US after a wonderful nearly three years in the Netherlands at Erasmus University Rotterdam, to a position in the School of Communication at American University in Washington, D.C. I also just created another new website for Web Historian, a Chrome browser extension I’m working on, so I’ve seen how easy WordPress is to work with these days. This website is currently pretty simple and I hope to keep it that way :)
