Monday, May 19, 2014

Google Trends mass import using R

I was looking for a tool to mass download daily Google Trends data for my thesis and couldn't find anything that worked for my needs, so I built my own. The tool downloads Google Trends data in monthly snippets for a given search term and a given year or years.

I thought I would share my work here in case someone else has stumbled on the same issues. The file can be found on GitHub: https://gist.github.com/321k/823cce9769e58bc14214.

Make sure you have the quantmod library installed before you run it.

Getting the data

Since the code works by downloading the files through the browser, it is not affected by changes to Google's authentication policy. Simply make sure that you are signed in to your Google account. This does, however, mean that the data download is slow. I recommend using Firefox with a tab manager to close the tabs after a download has been completed; Tab Mix Plus 0.4.1.3.1 works great for me. You will also need to check the box in the download prompt that lets Firefox download without prompting you. Finally, you need to set the download directory to an empty folder.

There are four functions: downloadGT, importGT, formatGT, and mergeGT. downloadGT takes two inputs: the years you want to download and the search query you want to fetch. To run it for multiple queries, simply add a loop:
queries=c("MSFT", "AAPL")
years=c("2012", "2013")
for(i in 1:length(queries)) {downloadGT(years, queries[i])}

Formatting the data

Once we have downloaded the files, we need to import them into R and put them in a usable format. importGT reads the data, formatGT extracts the time series, and mergeGT puts the individual months together into complete time series by company.
path="C:/data"
rawData=importGT(path)
formatedData=formatGT(rawData)
mergedData=mergeGT(formatedData)
And there you have it. I'm currently downloading the Google Trends data for the FTSE 100 constituents between 2004 and 2013. Google's quota limit allows me to download about ten companies per day, or 1,200 files. Feel free to leave a comment or suggestion.

Tuesday, May 06, 2014

Daily Google Trends data using R

This blog post is a work in progress and has been updated several times. Last updated May 11, 2016.

For an explanation of how to combine weekly and daily Google Trends data into long daily time series of search volume data, please refer to this blog post. You can find a working example in R here.
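
As a toy illustration of the rescaling idea (the numbers are made up, not real Google Trends values): within each download window, the daily series is multiplied by the ratio of the weekly value to the daily value on an overlapping date, which is exactly what the adjustment factor in the script further down does.
daily = data.frame(Date=as.Date("2013-01-06")+0:6, SVI=c(40, 55, 60, 52, 48, 70, 65))
weekly = data.frame(Date=as.Date("2013-01-06"), SVI=80)
#Ratio of the weekly to the daily value on the overlapping date
adjustment = weekly$SVI / daily$SVI[daily$Date == weekly$Date]  #80 / 40 = 2
daily$rescaled = daily$SVI * adjustment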

This R script weaves daily Google Trends data together into a continuous time series. For a look at how daily Google Trends data differs from the weekly data, take a look at this blog post. The graph below illustrates the daily search data for "FTSE 100".


Since I had a lot of problems with authentication in the available tools for downloading Google Trends data, I decided to circumvent the whole authentication issue by downloading the files through the browser. The file downloads are automated using R. To execute the example below, you will need to load these functions from my GitHub account into R.
Make sure that you are signed in to your Google account before you run the script. Since you will download 120 individual CSV files, the run will take several minutes to complete.

The function URL_GT creates the URL for the Google Trends file export, downloadGT opens it in the browser and waits for the download to finish, and readGT reads the downloaded file and returns the time series data. When the functions are available in R, running the code below should return a time series that looks like the one above. Let me know in the comments if it isn't working.


library(Rmisc)
library(ggplot2)
library(dplyr)

# The Google Trends formatting functions -----------------------------------

#This script automates the downloading of Google Trends data.
#It works best with Firefox in combination with the Tab Mix Plus add-on, which is used to automate tab closing.
#Ask Firefox not to prompt for new downloads and this script should run automatically.
#Google Trends restricts the number of downloads to roughly 400 at a time.

URL_GT=function(keyword="", country=NA, region=NA, year=NA, month=1, length=3){
  
  start="http://www.google.com/trends/trendsReport?hl=en-US&q="
  end="&cmpt=q&content=1&export=1"
  geo=""
  date=""
  
  #Geographic restrictions
  if(!is.na(country)) {
    geo="&geo="
    geo=paste(geo, country, sep="")
    if(!is.na(region)) geo=paste(geo, "-", region, sep="")
  }
  
  queries=keyword[1]
  if(length(keyword)>1) {
    for(i in 2:length(keyword)){
      queries=paste(queries, "%2C ", keyword[i], sep="")
    }
  }
  
  #Dates
  if(!is.na(year)){
    date="&date="
    date=paste(date, month, "%2F", year, "%20", length, "m", sep="")
  }
  
  URL=paste(start, queries, geo, date, end, sep="")
  URL <- gsub(" ", "%20", URL)
  return(URL)
}
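
#Example: a single keyword with a three-month window starting January 2013
#URL_GT("FTSE 100", year=2013, month=1) returns
#"http://www.google.com/trends/trendsReport?hl=en-US&q=FTSE%20100&date=1%2F2013%203m&cmpt=q&content=1&export=1"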

downloadGT=function(URL, downloadDir){
  
  #Determine if download has been completed by comparing the number of files in the download directory to the starting number
  startingFiles=list.files(downloadDir)
  browseURL(URL)
  endingFiles=list.files(downloadDir)
  
  while(length(setdiff(endingFiles,startingFiles))==0) {
    Sys.sleep(3)
    endingFiles=list.files(downloadDir)
  }
  filePath=setdiff(endingFiles,startingFiles)
  return(filePath)
}
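
#Example usage (the folder is illustrative; use the empty directory Firefox downloads to):
#newFile=downloadGT(URL_GT("FTSE 100", year=2013, month=1), "C:/GT_downloads")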


readGT=function(filePath){
  rawFiles=list()
  
  for(i in 1:length(filePath)){
    if(length(filePath)==1) rawFiles[[1]]=read.csv(filePath, header=F, blank.lines.skip=F)
    if(length(filePath)>1) rawFiles[[i]]=read.csv(filePath[i], header=F, blank.lines.skip=F)
  }
  
  output=data.frame()
  name=vector()
  
  for(i in 1:length(rawFiles)){
    data=rawFiles[[i]]
    name=as.character(t(data[5,-1]))  #keyword names from the export header
    
    #Select the time series
    start=which(data[,1]=="")[1]+3
    stop=which(data[,1]=="")[2]-2
    
    #Skip to next if file is empty
    if(ncol(data)<2) next
    if(is.na(which(data[,1]=="")[2]-2)) next
    
    data=data[start:stop,]
    data[,1]=as.character(data[,1])
    
    #Convert all columns except date column into numeric
    for(j in 2:ncol(data)) data[,j]=as.numeric(as.character(data[,j]))
    
    #FORMAT DATE
    len=nchar(data[1,1])
    
    #Monthly data
    if(len==7) {
      data[,1]=as.Date(paste(data[,1], "-1", sep=""), "%Y-%m-%d")
      data[,1]=sapply(data[,1], seq, length=2, by="1 month")[2,]-1
      data[,1]=as.Date(data[,1], "%Y-%m-%d", origin="1970-01-01")
    }
    
    #Weekly data
    if(len==23){
      data[,1]=sapply(data[,1], substr, start=14, stop=30)
      data[,1]=as.Date(data[,1], "%Y-%m-%d")
    }
    
    #Daily data
    if(len==10) data[,1]=as.Date(data[,1], "%Y-%m-%d")
    
    #Structure into panel data format
    panelData=data[1:2]
    panelData[3]=name[1]
    names(panelData)=c("Date", "SVI", "Keyword")
    if(ncol(data)>2) {
      
      for(j in 3:ncol(data)) {
        appendData=data[c(1,j)]
        appendData[3]=name[j-1]
        names(appendData)=c("Date", "SVI", "Keyword")
        panelData=rbind(panelData, appendData)
      }
    }
    
    #Add file name  
    panelData[ncol(panelData)+1]=filePath[i]
    
    #Add path to filename
    names(panelData)[4]="Path"
    
    #Merge the individual files into one data frame
    if(i==1) output=panelData
    if(i>1) output=rbind(output, panelData)
  }
  return(output)
}
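
#Example: read one downloaded export back in (the path is illustrative)
#gt=readGT("C:/GT_downloads/report.csv")
#head(gt)  #returns columns Date, SVI, Keyword, Path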

readGeoGT=function(filePath){
  output=data.frame()
  rawFiles=list()
  for(i in 1:length(filePath)){
    if(length(filePath)==1) rawFiles[[1]]=read.csv(filePath, header=F, blank.lines.skip=F)
    if(length(filePath)>1) rawFiles[[i]]=read.csv(filePath[i], header=F, blank.lines.skip=F)
  }
  
  for(i in 1:length(rawFiles)){
    data=rawFiles[[i]]
    start=which(data[,1]=="")[3]+3
    stop=which(data[,1]=="")[4]-1
    names=data[start-1,]
    
    for(j in 1:ncol(names)) names(data)[j]=as.character(names[1,j])
    data=data[start:stop,]
    data[,1]=as.character(data[,1])
    for(j in 2:ncol(data)) data[,j]=as.numeric(as.character(data[,j]))  #convert value columns one by one
    data[ncol(data)+1]=filePath[i]
    
    output=rbind(output, data)
  }
  return(output)
}
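
#readGeoGT parses the regional breakdown section of the same export files.
#It follows the same pattern as readGT but is not used in the example run below.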


# Downloading the data ----------------------------------------------------


downloadDir = "C:/GT_downloads"  #an empty folder set as the browser's download directory; adjust to your setup
search_terms = c("bull market", "bear market", "recession")

years = c(2005,2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016)
months = c(1,4,7,10)
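#Quarterly windows (months 1, 4, 7, 10, three months each) tile each year
#without gaps; the full-history weekly series downloaded below provides the
#common scale across windows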
res.daily=list()
counter=1
for(year in years){
  for(month in months){
    url=URL_GT(search_terms, year=year, month=month)
    GT_dir = downloadGT(url, downloadDir)
    GT_dir = paste(downloadDir, GT_dir, sep='/')
    res.daily[[counter]] = readGT(GT_dir)
    counter=counter+1
  }
}

df.daily <- do.call("rbind", res.daily)

#No date restriction: the export covers the full history, giving the weekly series used for rescaling
url = URL_GT(search_terms)
GT_dir = downloadGT(url, downloadDir)
GT_dir = paste(downloadDir, GT_dir, sep='/')
df.weekly = readGT(GT_dir)


# Formatting the data ------------------------------------------------------


#Join the daily windows (SVI.x) to the weekly series (SVI.y) by date and keyword;
#the adjustment factor rescales each daily window onto the weekly scale
df.merged = merge(df.daily, df.weekly, by=c('Date', 'Keyword'), all.x=T)
df.merged$adjustment_factor = df.merged$SVI.y / df.merged$SVI.x

#The weekly value exists on only one day per week, so carry the last
#finite adjustment factor forward within each keyword
for(i in search_terms){
  r=which(df.merged$Keyword==i)
  for(j in 2:length(r)){
    if(!is.finite(df.merged$adjustment_factor[r][j])){
      df.merged$adjustment_factor[r][j] = df.merged$adjustment_factor[r][j-1]
    }
  }
}
df.merged$daily = df.merged$adjustment_factor * df.merged$SVI.x
df.merged$weekly = df.merged$SVI.y
#Carry the weekly value forward in the same way so it is defined on every day
for(i in search_terms){
  r=which(df.merged$Keyword==i)
  for(j in 2:length(r)){
    if(is.na(df.merged$weekly[r][j])){
      df.merged$weekly[r][j] = df.merged$weekly[r][j-1]
    }
  }
}


# Plotting the data -------------------------------------------------------

#Remove infinite values (from division by zero in the adjustment factor) before plotting
df.merged$daily[which(is.infinite(df.merged$daily))] = NA

p1 = df.merged %>%
  ggplot(aes(Date, daily, color=Keyword))+geom_line()

p2 = df.merged %>%
  ggplot(aes(Date, weekly, color=Keyword))+geom_line()

multiplot(p1,p2)


# Saving the data ---------------------------------------------------------


write.csv(df.merged,'df.merged.csv')