Tuesday, May 06, 2014

Daily Google Trends data using R

This blog post is a work in progress and has been updated several times. Last updated: 11.5.2016.

For an explanation of how to combine weekly and daily Google Trends data into long daily time series of search volume data, please refer to this blog post. You can find a working example in R here.

This R script weaves daily Google Trends data together into a continuous time series. To see how the daily Google Trends data differs from the weekly data, take a look at this blog post. The graph below illustrates the daily search data for "FTSE 100".


Since I had a lot of problems with authentication in the available tools for downloading Google Trends data, I decided to circumvent the authentication issue entirely by downloading the files through the browser. The file downloads are automated with R. To execute the example below, you will need to add these functions from my Github account to R.
Make sure that you are signed in to your Google account before you run the script. Since the script downloads almost 50 individual csv files (one per quarter for twelve years, plus one weekly file), it will take several minutes to complete the run.

The function URL_GT creates the URL for the Google Trends file export, downloadGT triggers the download through the browser, and readGT reads the downloaded files and returns the time series data. When the functions are available to R, running the code below should return a time series that looks like the one above. Let me know in the comments if it isn't working.
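
To preview how the pieces fit together, here is a minimal end-to-end sketch for a single quarter, assuming the functions below have been loaded and that ~/Downloads is your browser's download folder:

# Minimal sketch: build the URL, trigger the browser download, parse the CSV
downloadDir = "~/Downloads"                   # adjust to your browser's download folder
url = URL_GT("FTSE 100", year=2014, month=1)  # daily data for January-March 2014
file = downloadGT(url, downloadDir)
gt = readGT(paste(downloadDir, file, sep="/"))
head(gt)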


library(Rmisc)
library(ggplot2)
library(dplyr)

# The Google Trends formatting functions ----------------------------------

#This script automates the downloading of Google Trends data.
#It works best with Firefox in combination with the Tab Mix Plus add-on, which is used to automate tab closing.
#Set Firefox not to prompt for new downloads and this script should run automatically.
#Google Trends restricts the number of downloads to roughly 400 at a time.

URL_GT=function(keyword="", country=NA, region=NA, year=NA, month=1, length=3){
  
  start="http://www.google.com/trends/trendsReport?hl=en-US&q="
  end="&cmpt=q&content=1&export=1"
  geo=""
  date=""
  
  #Geographic restrictions
  if(!is.na(country)) {
    geo="&geo="
    geo=paste(geo, country, sep="")
    if(!is.na(region)) geo=paste(geo, "-", region, sep="")
  }
  
  queries=keyword[1]
  if(length(keyword)>1) {
    for(i in 2:length(keyword)){
      queries=paste(queries, "%2C ", keyword[i], sep="")
    }
  }
  
  #Dates
  if(!is.na(year)){
    date="&date="
    date=paste(date, month, "%2F", year, "%20", length, "m", sep="")
  }
  
  URL=paste(start, queries, geo, date, end, sep="")
  URL <- gsub(" ", "%20", URL)
  return(URL)
}
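
For example, a call for daily "FTSE 100" data covering January through March 2014 produces the following export URL:

URL_GT("FTSE 100", year=2014, month=1, length=3)
# "http://www.google.com/trends/trendsReport?hl=en-US&q=FTSE%20100&date=1%2F2014%203m&cmpt=q&content=1&export=1"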

downloadGT=function(URL, downloadDir){
  
  #Detect a completed download by comparing the contents of the download
  #directory to a snapshot taken before the download started
  startingFiles=list.files(downloadDir)
  browseURL(URL)
  endingFiles=list.files(downloadDir)
  
  #Poll the directory until a new file appears
  while(length(setdiff(endingFiles,startingFiles))==0) {
    Sys.sleep(3)
    endingFiles=list.files(downloadDir)
  }
  
  #Note: some browsers write a temporary file while downloading (e.g. Chrome's
  #.crdownload files); if the temporary file gets picked up instead of the
  #finished CSV, add a short Sys.sleep() here before re-listing the directory
  filePath=setdiff(endingFiles,startingFiles)
  return(filePath)
}


readGT=function(filePath){
  rawFiles=list()
  
  for(i in 1:length(filePath)){
    rawFiles[[i]]=read.csv(filePath[i], header=F, blank.lines.skip=F)
  }
  
  output=data.frame()
  name=vector()
  
  for(i in 1:length(rawFiles)){
    data=rawFiles[[i]]
    name=as.character(t(data[5,-1]))
    
    #Skip to the next file if it is empty
    if(ncol(data)<2) next
    if(is.na(which(data[,1]=="")[2])) next
    
    #Select the time series rows (between the first two blank lines)
    start=which(data[,1]=="")[1]+3
    stop=which(data[,1]=="")[2]-2
    
    data=data[start:stop,]
    data[,1]=as.character(data[,1])
    
    #Convert all columns except the date column into numeric
    for(j in 2:ncol(data)) data[,j]=as.numeric(as.character(data[,j]))
    
    #FORMAT DATE
    len=nchar(data[1,1])
    
    #Monthly data
    if(len==7) {
      data[,1]=as.Date(paste(data[,1], "-1", sep=""), "%Y-%m-%d")
      data[,1]=sapply(data[,1], seq, length=2, by="1 month")[2,]-1
      data[,1]=as.Date(data[,1], "%Y-%m-%d", origin="1970-01-01")
    }
    
    #Weekly data
    if(len==23){
      data[,1]=sapply(data[,1], substr, start=14, stop=30)
      data[,1]=as.Date(data[,1], "%Y-%m-%d")
    }
    
    #Daily data
    if(len==10) data[,1]=as.Date(data[,1], "%Y-%m-%d")
    
    #Structure into panel data format
    panelData=data[1:2]
    panelData[3]=name[1]
    names(panelData)=c("Date", "SVI", "Keyword")
    if(ncol(data)>2) {
      
      for(j in 3:ncol(data)) {
        appendData=data[c(1,j)]
        appendData[3]=name[j-1]
        names(appendData)=c("Date", "SVI", "Keyword")
        panelData=rbind(panelData, appendData)
      }
    }
    
    #Add file name  
    panelData[ncol(panelData)+1]=filePath[i]
    
    #Add path to filename
    names(panelData)[4]="Path"
    
    #Merge several files into one
    if(i==1) output=panelData
    if(i>1) output=rbind(output, panelData)
  }
  return(output)
}
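
readGT returns the data in long (panel) format, one row per date and keyword, which makes it easy to stack many files. A quick way to inspect the result ("Report.csv" is the default name Google Trends gives to exported files):

gt = readGT("Report.csv")
str(gt)  # columns: Date, SVI, Keyword, Path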

readGeoGT=function(filePath){
  output=data.frame()
  rawFiles=list()
  for(i in 1:length(filePath)){
    rawFiles[[i]]=read.csv(filePath[i], header=F, blank.lines.skip=F)
  }
  
  for(i in 1:length(rawFiles)){
    data=rawFiles[[i]]
    start=which(data[,1]=="")[3]+3
    stop=which(data[,1]=="")[4]-1
    names=data[start-1,]
    
    for(j in 1:ncol(names)) names(data)[j]=as.character(names[1,j])
    data=data[start:stop,]
    data[,1]=as.character(data[,1])
    #Convert all columns except the first into numeric, column by column
    #(as.numeric cannot be applied to a data frame directly)
    for(j in 2:ncol(data)) data[,j]=as.numeric(as.character(data[,j]))
    data[ncol(data)+1]=filePath[i]
    
    output=rbind(output, data)
  }
  return(output)
}


# Downloading the data ----------------------------------------------------


downloadDir = "~/Downloads" #Set this to your browser's download directory

search_terms = c("bull market", "bear market", "recession")

years = 2005:2016
months = c(1,4,7,10)
res.daily=list()
counter=1
for(year in years){
  for(month in months){
    url=URL_GT(search_terms, year=year, month=month)
    GT_dir = downloadGT(url, downloadDir)
    GT_dir = paste(downloadDir, GT_dir, sep='/')
    res.daily[[counter]] = readGT(GT_dir)
    counter=counter+1
  }
}

df.daily <- do.call("rbind", res.daily)

url = URL_GT(search_terms)
GT_dir = downloadGT(url, downloadDir)
GT_dir = paste(downloadDir, GT_dir, sep='/')
df.weekly = readGT(GT_dir)


# Formatting the data ------------------------------------------------------


#Match each daily observation to the weekly observation covering the same date
df.merged = merge(df.daily, df.weekly, by=c('Date', 'Keyword'), all.x=T)

#The adjustment factor rescales each quarterly daily series onto the scale of the continuous weekly series
df.merged$adjustment_factor = df.merged$SVI.y / df.merged$SVI.x

#Carry the last finite adjustment factor forward to days without a weekly observation
for(i in search_terms){
  r=which(df.merged$Keyword==i)
  for(j in 2:length(r)){
    if(!is.finite(df.merged$adjustment_factor[r][j])){
      df.merged$adjustment_factor[r][j] = df.merged$adjustment_factor[r][j-1]
    }
  }
}
df.merged$daily = df.merged$adjustment_factor * df.merged$SVI.x
df.merged$weekly = df.merged$SVI.y

#Fill the weekly series forward so it is defined on every day
for(i in search_terms){
  r=which(df.merged$Keyword==i)
  for(j in 2:length(r)){
    if(is.na(df.merged$weekly[r][j])){
      df.merged$weekly[r][j] = df.merged$weekly[r][j-1]
    }
  }
}
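
To make the rescaling concrete, here is a toy illustration with made-up numbers (not real Trends data): if the weekly SVI on an overlapping date is 50 while the daily SVI on the same date is 80, the factor 50/80 puts the entire quarterly daily series on the scale of the weekly series.

#Toy illustration of the adjustment factor (made-up numbers)
daily_svi = c(80, 75, 90)           #daily values from one quarterly download
weekly_svi = 50                     #weekly value on the first overlapping date
factor = weekly_svi / daily_svi[1]  #0.625
factor * daily_svi                  #50.000 46.875 56.250, now on the weekly scale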


# Plotting the data -------------------------------------------------------

df.merged$daily[which(is.infinite(df.merged$daily))] = NA

p1 = df.merged %>%
  ggplot(aes(Date, daily, color=Keyword))+geom_line()

p2 = df.merged %>%
  ggplot(aes(Date, weekly, color=Keyword))+geom_line()

multiplot(p1,p2)


# Saving the data ---------------------------------------------------------


write.csv(df.merged,'df.merged.csv')

Comments:

Unknown said...

Hi,

Thanks for the code. However, I don't know how to use R. Could you please tell me what I have to do to automate downloading data for a set of firms' names? Thank you so much!

ErikJJ said...

Hi Amal, just download R from their web site. You can then paste this code into the console for a working example: https://github.com/321k/Google-Trends/edit/master/Working%20example

Unknown said...

Hello Erik,

I have an (un)related question with regard to Google Trends' normalization. I understand that for a certain search term, the highest search volume is always scaled to 100. However, for a category (try e.g. Law & Government/Social Services/Welfare & Unemployment for the US, which is used in Choi & Varian (2009)), the index is given in percentage change. If you play with the time span, the first observation is always 0%, but other observations may change.

I am guessing it is the underlying normalization that causes this change, but I want to know exactly how this index in percentage change is calculated, or at least how this transformation is made. I am using Trends data for forecasting, so it is important for me to understand how data get updated each week.

Thank you in advance!

Unknown said...

Hello Erik,
I am trying to open a URL in a browser on a Mac in order to save a file. For example, the following link, when opened in a browser, will download a csv file from Google Trends. However, if I use the function browseURL, it does not open or save anything.

URL="http://www.google.com/trends/trendsReport?hl=en-US&q=FTSE 100&date=1%2F2015 1m&cmpt=q&content=1&export=1"
browseURL(URL)

It works for web pages, e.g. URL="http://www.google.com"; browseURL(URL) opens the page in the browser,
but it doesn't work for the Google Trends csv file.


Also, I would really appreciate it if someone could tell me how many downloads Google Trends allows before blocking the IP. How long should I wait before trying again? Thanks

ErikJJ said...

You can download around 400 CSVs, possibly more. The limit seems to be per day. Sorry, I don't know how to get the browseURL function to work on a Mac.

Artful Dodger said...

Hi Erik,

I have been able to get this function to work when I restrict the downloads through 2014, but I run into issues when I include the year 2015 in the selections. The last download (for an incomplete month) is generating an error message.

Is there some way to modify the code to set a fixed end date for extraction (at a more granular level than calendar year)? Or just to fix the end date so that the last extraction doesn't result in an error?

Thanks

R makes the world a better place... said...

Hi Erik,

Thanks for your code.
As I change the following search keyword:

query="$TWTR"

R stops and shows the following error message:

Error in `[.data.frame`(gt_data, , 2) : undefined columns selected

Please let me know what need to be modified to be able to query the csv data.

Thanks,

Unknown said...

Dear Erik,

Thanks so much for the code. I have been trying to repeat your exercise using "S&P 500" instead of
"FTSE 100".

I think that the presence of "&" in the word "S&P 500" messes up the query and I get the search results of "S" instead of "S&P 500". Have you encountered this problem before?

Thanks!

Unknown said...

Dear Erik,
I figured it out. Just needed to replace "S&P 500" with "S%26P 500". Thanks!

ErikJJ said...

Good to hear, share your results when you're done!

Unknown said...

FYI, I had to add a pause in the download function to get it to work properly with Chrome. On occasion it was confused by Chrome's temporary download file.

It was just a 5-second pause before
endingFiles=list.files(downloadDir)

Unknown said...

Hi Eric,

When I run the script, all the csv files get named Report(#). So I wondered, is there a way to rename the files?

ErikJJ said...

What browser are you using? Try using Firefox, that works for me. Chrome also works fine.

Unknown said...

I'm using Firefox. Is there a renaming function in the script already?

Unknown said...

Hi Erik,

Thanks for the wonderful code. Could you please suggest what change would be needed to get the Trends data at the minute level? It would be a great help.

Thanks.
Prakash Anand

ErikJJ said...

There's no renaming function unfortunately. You can rename the files after they have been downloaded with the file.rename function in R. More here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/files.html
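
For example, with illustrative file names:

file.rename("Report.csv", "FTSE100_2014_01.csv")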

ErikJJ said...

Here's a working example in R in case you get stuck. It should work with copy and paste if you have the libraries installed. https://github.com/321k/Google-Trends/blob/master/Daily%20data%20example.R

Unknown said...

Hi Erik,

Thanks for your code from me as well. Very useful indeed! I did get it running, but the daily re-indexed data I get in the "summary" dataframe seems to subtract one day from each month, i.e. I get 30 days of data for January instead of 31, 27 days for February instead of 28, etc.

Any help on what I might be doing wrong here would be highly appreciated!

Many thanks,
Paul

Unknown said...

Dear Erik,

Thank you for your code.
I tested this code using the same inputs (just copy and paste),
but I only get 3 months of data (January to March 2005).
The program keeps running without any further output (and without errors).
The process gets stuck in the downloading loop...

> for(year in years){
+ for(month in months){
+ url=URL_GT(search_terms, year=year, month=month)
+ GT_dir = downloadGT(url, downloadDir)
+ GT_dir = paste(downloadDir, GT_dir, sep='/')
+ res.daily[[counter]] = readGT(GT_dir)
+ counter=counter+1
+ }
+ }

After several hours, I had to press stop in order to end the process.

Could you please advise?
How can I solve this issue?

Thanks,
Phasin

LuckyBao said...

Hi Erik,

Thanks for providing this method to download the Google Trends data. However, when I try to run the code, the browser always shows the error "404 The requested URL was not found on this server". Can you help me solve this problem? Thank you.
