Twitter data with R

Collecting Twitter data with R

This is a quick how-to on collecting Twitter data with the rtweet package. I found the vignettes and website for this package incredibly helpful.

This tutorial was prompted by a request from a colleague:

use any program or method of your choosing, retrieve just a small sample of tweets with latitude and longitude info and use at least one or more of the keywords.

The “keywords” came in an Excel file. I read this file into RStudio and examined the data structure with the following lines of code.

library(tidyverse) # provides %>%, glimpse(), and str_to_lower() used below
# fs::dir_ls("Data/Raw")
KeyWords <- readxl::read_xlsx("Data/Raw/Keywords racial slurs and neutral terms 041718.xlsx")
KeyWords %>% glimpse(60)

## Observations: 424
## Variables: 3
## $ Items    <chr> "afghanistan", "afghanistani", "afghan...
## $ Negative <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,...
## $ Race     <chr> "Middle Eastern", "Middle Eastern", "M...

I normalized the variable names to make life a little easier.

KeyWords <- KeyWords %>% 
    magrittr::set_names(str_to_lower(names(KeyWords)))
KeyWords %>% head()

## # A tibble: 6 x 3
##   items             negative race          
##   <chr>                <dbl> <chr>         
## 1 afghanistan              0 Middle Eastern
## 2 afghanistani             0 Middle Eastern
## 3 afghans                  0 Middle Eastern
## 4 african american         0 Black         
## 5 african americans        0 Black         
## 6 african't                1 Black

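Now I take a random sample of 100 keywords to search on (the code is shown here so you can reproduce the step).

set.seed(1234) # fix the seed so the same rows are drawn each run (the value itself is arbitrary)
KWSample <- KeyWords %>% dplyr::sample_n(size = 100) # sample rows, not columns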

Set up the Twitter app (rtweet)

Load the package

library(rtweet)

Setting up a Twitter app is the first step for collecting tweets based on location. See the vignette here. I’ve outlined this process in the link below.

rtweet_setup

Collecting Twitter longitude and latitude data using keywords in KWSample

I need the latitude and longitude of tweets that contain the items in the KWSample data frame. First I will put these items into their own vector, sample_items.

sample_items <- KWSample$items

The first function I want to use is rtweet::search_tweets(). It takes a query q (our search terms) and the number of tweets I want returned (for example, n = 250).

I’ll eventually be using all the terms in the sample_items vector, but first I have to do a little wrangling to make sure the text string is appropriate to be used in this function (it can’t exceed 500 characters).

# sample_terms <- str_replace_all(toString(sample_items), 
#                                 pattern = ",", 
#                                 replacement = "")
# nchar(sample_terms) # too big
# sample_terms <- substr(sample_terms, 1, 498)
# nchar(sample_terms) # 498
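Truncating with substr() can cut a term in half, so here is a minimal sketch of another way to stay under the limit: add whole terms, joined with OR, until the next one would push the query past the cap. The helper name build_query() and its max_chars argument are my own inventions for illustration, not part of rtweet, and wrapping multi-word terms in double quotes assumes Twitter's exact-phrase search syntax.

```r
# Hypothetical helper (not part of rtweet): join search terms with " OR "
# and stop before the query grows past `max_chars`.
build_query <- function(terms, max_chars = 500) {
  # Assumption: quoting a multi-word term searches it as an exact phrase
  terms <- ifelse(grepl(" ", terms, fixed = TRUE),
                  paste0('"', terms, '"'),
                  terms)
  kept <- character(0)
  for (term in terms) {
    candidate <- paste(c(kept, term), collapse = " OR ")
    if (nchar(candidate) > max_chars) break # next term would not fit
    kept <- c(kept, term)
  }
  paste(kept, collapse = " OR ")
}

toy_terms <- c("israelis", "islamic", "resettlement", "deport")
build_query(toy_terms)                 # all four terms fit
build_query(toy_terms, max_chars = 20) # only the first two terms fit
```

With the real data this would be called as build_query(sample_items), and terms that do not fit would need a second query.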

Given the news this morning, I hand-picked a few terms.

sample_terms <- "israelis OR islamic OR resettlement OR deport"

Now I can search on the hand-picked terms I’ve stored in sample_terms.

## search for 10000 tweets containing the words in sample_terms
KWSampleTweets <- search_tweets(q = sample_terms,
                                n = 10000,
                                retryonratelimit = TRUE)
KWSampleTweets %>% glimpse(78)

## Observations: 5,600
## Variables: 87
## $ user_id                 <chr> "100146431", "1002378390", "1003630033", ...
## $ status_id               <chr> "996431309231869952", "996433266176487424...
## $ created_at              <dttm> 2018-05-15 16:45:11, 2018-05-15 16:52:58...
## $ screen_name             <chr> "AmericanMOM01", "jkryan1", "Bulamwa", "s...
## $ text                    <chr> "@CBSNews 2)Capitol of Israel doesn't mak...
## $ source                  <chr> "Twitter for Android", "Twitter for Andro...
## $ display_text_width      <dbl> 222, 70, 140, 139, 258, 140, 139, 59, 278...
## $ reply_to_status_id      <chr> "996422219646554114", "996430032011640833...
## $ reply_to_user_id        <chr> "100146431", "27706099", NA, NA, "9104920...
## $ reply_to_screen_name    <chr> "AmericanMOM01", "FOXBaltimore", NA, NA, ...
## $ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ is_retweet              <lgl> FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TR...
## $ favorite_count          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ retweet_count           <int> 0, 0, 707, 2784, 0, 90, 10, 58, 0, 5183, ...
## $ hashtags                <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ symbols                 <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ urls_url                <list> [NA, NA, NA, NA, NA, NA, NA, NA, "timeso...
## $ urls_t.co               <list> [NA, NA, NA, NA, NA, NA, NA, NA, "https:...
## $ urls_expanded_url       <list> [NA, NA, NA, NA, NA, NA, NA, NA, "https:...
## $ media_url               <list> [NA, NA, NA, NA, NA, NA, NA, "http://pbs...
## $ media_t.co              <list> [NA, NA, NA, NA, NA, NA, NA, "https://t....
## $ media_expanded_url      <list> [NA, NA, NA, NA, NA, NA, NA, "https://tw...
## $ media_type              <list> [NA, NA, NA, NA, NA, NA, NA, "photo", NA...
## $ ext_media_url           <list> [NA, NA, NA, NA, NA, NA, NA, <"http://pb...
## $ ext_media_t.co          <list> [NA, NA, NA, NA, NA, NA, NA, <"https://t...
## $ ext_media_expanded_url  <list> [NA, NA, NA, NA, NA, NA, NA, <"https://t...
## $ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ mentions_user_id        <list> ["15012486", "27706099", "18426419", "77...
## $ mentions_screen_name    <list> ["CBSNews", "FOXBaltimore", "rafsanchez"...
## $ lang                    <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ quoted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_text             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_created_at       <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_favorite_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_retweet_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_user_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_screen_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_followers_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_friends_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_statuses_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_description      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ quoted_verified         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ retweet_status_id       <chr> NA, NA, "996391187694211072", "9962645210...
## $ retweet_text            <chr> NA, NA, "The Israelis keep dropping tear ...
## $ retweet_created_at      <dttm> NA, NA, 2018-05-15 14:05:45, 2018-05-15 ...
## $ retweet_source          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ retweet_favorite_count  <int> NA, NA, 446, 3210, NA, 117, 9, 97, NA, 86...
## $ retweet_retweet_count   <int> NA, NA, 707, 2784, NA, 90, 10, 58, NA, 51...
## $ retweet_user_id         <chr> NA, NA, "18426419", "774174489458331649",...
## $ retweet_screen_name     <chr> NA, NA, "rafsanchez", "TheMossadIL", NA, ...
## $ retweet_name            <chr> NA, NA, "Raf Sanchez", "The Mossad", NA, ...
## $ retweet_followers_count <int> NA, NA, 15185, 64948, NA, 2105955, 6695, ...
## $ retweet_friends_count   <int> NA, NA, 4097, 1, NA, 572, 704, 359, NA, 7...
## $ retweet_statuses_count  <int> NA, NA, 18205, 1608, NA, 132476, 100568, ...
## $ retweet_location        <chr> NA, NA, "Middle East", "Israel", NA, "", ...
## $ retweet_description     <chr> NA, NA, "Middle East correspondent | Tele...
## $ retweet_verified        <lgl> NA, NA, TRUE, FALSE, NA, TRUE, FALSE, FAL...
## $ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ geo_coords              <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>,...
## $ coords_coords           <list> [<NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>,...
## $ bbox_coords             <list> [<NA, NA, NA, NA, NA, NA, NA, NA>, <NA, ...
## $ name                    <chr> "Cathie", "J K Ryan", "Bunge la Mwananchi...
## $ location                <chr> "Tishomingo, Oklahoma", "LV now. L.A. / N...
## $ description             <chr> "", "", "Main People's Assembly page in K...
## $ url                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, "http://t...
## $ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ followers_count         <int> 1686, 34, 1315, 2465, 6, 1930, 2063, 2627...
## $ friends_count           <int> 1946, 77, 534, 3196, 35, 953, 333, 688, 4...
## $ listed_count            <int> 25, 2, 7, 121, 0, 133, 152, 22, 18, 36, 9...
## $ statuses_count          <int> 173307, 7029, 2564, 174476, 451, 225898, ...
## $ favourites_count        <int> 207225, 982, 37, 46038, 177, 295138, 9800...
## $ account_created_at      <dttm> 2009-12-29 05:38:22, 2012-12-10 20:08:14...
## $ verified                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ profile_url             <chr> NA, NA, NA, NA, NA, NA, NA, NA, "http://t...
## $ profile_expanded_url    <chr> NA, NA, NA, NA, NA, NA, NA, NA, "https://...
## $ account_lang            <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/10...
## $ profile_background_url  <chr> "http://abs.twimg.com/images/themes/theme...
## $ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/1632...

This search can be narrowed to the San Francisco area using lookup_coords().

## get coordinates associated with the following addresses/components
sf_coords <- lookup_coords(address = "san francisco")
oak_coords <- lookup_coords(address = "oakland")

Passing these coordinates to the geocode argument of search_tweets() gives me the two data frames below, with tweets from San Francisco and Oakland.

## pass a returned coords object to search_tweets
KWSampleTweetsSF <- search_tweets(
                        q = sample_terms, 
                        n = 250, 
                        geocode = sf_coords, 
                        retryonratelimit = TRUE)

KWSampleTweetsOak <- search_tweets(
                        q = sample_terms, 
                        n = 250, 
                        geocode = oak_coords, 
                        retryonratelimit = TRUE)
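As far as I can tell from the rtweet documentation, the geocode argument also accepts a plain "latitude,longitude,radius" string, which is handy if lookup_coords() is not an option. The sketch below only builds that string. The helper name make_geocode() is hypothetical, and the San Francisco coordinates and 10 mile radius are approximate values I picked for illustration.

```r
# Hypothetical helper: format a "latitude,longitude,radius" geocode string
make_geocode <- function(lat, lng, radius_mi) {
  stopifnot(lat >= -90, lat <= 90,
            lng >= -180, lng <= 180,
            radius_mi > 0)
  sprintf("%.2f,%.2f,%smi", lat, lng, format(radius_mi))
}

# Approximate centre of San Francisco with a 10 mile radius (assumed values)
sf_geocode <- make_geocode(37.78, -122.40, 10)
sf_geocode

# The string would then be passed to the search, e.g.:
# search_tweets(q = sample_terms, n = 250, geocode = sf_geocode)
```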

The geo_coords variable in KWSampleTweetsSF and KWSampleTweetsOak holds the latitude and longitude values requested above. It is NA for tweets that were not geotagged.
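Because geo_coords is a list-column, the values still need to be pulled out into ordinary columns before they can be mapped or exported. Here is a small sketch on made-up data: the toy_tweets data frame and its coordinates are invented, and I am assuming each element is ordered latitude first, longitude second. rtweet also has a lat_lng() helper for this job, which is not shown here.

```r
# Toy stand-in for the search_tweets() output: a list-column of
# c(latitude, longitude) pairs, NA where the tweet is not geotagged.
toy_tweets <- data.frame(status_id = c("1", "2", "3"),
                         stringsAsFactors = FALSE)
toy_tweets$geo_coords <- list(c(NA_real_, NA_real_),
                              c(37.77, -122.42),
                              c(37.80, -122.27))

# Pull each element of the pair into its own numeric column
# (assumes the order is latitude, then longitude)
toy_tweets$lat <- vapply(toy_tweets$geo_coords, function(x) x[1], numeric(1))
toy_tweets$lng <- vapply(toy_tweets$geo_coords, function(x) x[2], numeric(1))

# Keep only the tweets that carry coordinates
geotagged <- toy_tweets[!is.na(toy_tweets$lat) & !is.na(toy_tweets$lng), ]
geotagged[, c("status_id", "lat", "lng")]
```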

Visualizing tweets

Visualization is simplified using the ts_plot() function.

KWSampleTweets %>%
    dplyr::group_by(is_retweet) %>%
    rtweet::ts_plot(by = "mins") +
    ggplot2::labs(
        title = "Tweets or Retweets for keywords israelis, islamic, resettlement, or deport",
        subtitle = "Tweets collected, parsed, and plotted using `rtweet`"
    )

KWSampleTweets_is_retweet_plot

We can see that the vast majority of these tweets are retweets.
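The plot gives a visual impression, but the share of retweets can also be checked directly from the is_retweet column. Here is a rough sketch on an invented logical vector. The toy values are not the real data, so the number it prints says nothing about the actual sample.

```r
# Toy stand-in for KWSampleTweets$is_retweet (invented values)
toy_is_retweet <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Count original tweets (FALSE) and retweets (TRUE)
retweet_counts <- table(toy_is_retweet)
retweet_counts

# The mean of a logical vector is the proportion of TRUE values
retweet_share <- mean(toy_is_retweet)
retweet_share
```

On the real data the same two lines would use KWSampleTweets$is_retweet in place of the toy vector.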

KWSampleTweets %>%
    # keep sources whose name contains "Twitter" (e.g. "Twitter for Android")
    dplyr::filter(stringr::str_detect(source, "Twitter")) %>%
    dplyr::group_by(source) %>%
    rtweet::ts_plot(by = "mins") +
    # one panel per source
    ggplot2::facet_wrap(~ source) +
    ggplot2::labs(
        title = "Tweets for keywords israelis, islamic, resettlement, or deport",
        subtitle = "Tweets collected, parsed, and plotted using `rtweet`"
    )

KWSampleTweets_source_plot

The data and R code for this tutorial are available here.



More from Martin J Frigaard