For an explanation of how to combine weekly and daily Google Trends data into a long daily time series of search volume data, please refer to this blog post. You can find a working example in R here.
This R script weaves daily Google Trends data together into a continuous time series. To see how daily Google Trends data differs from the weekly data, take a look at this blog post. The graph below illustrates the daily search data for "FTSE 100".
Since I had a lot of problems with authentication in the available tools for downloading Google Trends data, I decided to circumvent the whole authentication issue by downloading the files through the browser, with R automating the downloads. To run the example below, you will need to load these functions from my GitHub account into R.
The function URL_GT creates the URL for the Google Trends file export, downloadGT fetches the file through the browser, and readGT reads the downloaded CSV and returns the time series data. Because each daily export covers only a few months and is rescaled to its own maximum of 100, the script uses the overlapping weekly series to compute an adjustment factor that puts the daily chunks on a common scale. Once the functions are loaded in R, running the code below should return a time series that looks like the one above. Let me know in the comments if it isn't working.
library(Rmisc)
library(ggplot2)
library(dplyr)
# The Google Trends formatting functions -----------------------------------
#This script automates the downloading of Google Trends data.
#It works best with Firefox in combination with the Tab Mix Plus add-on, which is used to automate tab closing.
#Set Firefox not to prompt for new downloads and this script should run automatically.
#Google Trends restricts the number of downloads to roughly 400 at a time.
URL_GT = function(keyword="", country=NA, region=NA, year=NA, month=1, length=3){
  start = "http://www.google.com/trends/trendsReport?hl=en-US&q="
  end = "&cmpt=q&content=1&export=1"
  geo = ""
  date = ""
  #Geographic restrictions
  if(!is.na(country)) {
    geo = paste("&geo=", country, sep="")
    if(!is.na(region)) geo = paste(geo, "-", region, sep="")
  }
  #Combine multiple keywords into one comma-separated query
  queries = keyword[1]
  if(length(keyword) > 1) {
    for(i in 2:length(keyword)){
      queries = paste(queries, "%2C ", keyword[i], sep="")
    }
  }
  #Dates: the start month/year plus the window length in months
  if(!is.na(year)){
    date = paste("&date=", month, "%2F", year, "%20", length, "m", sep="")
  }
  URL = paste(start, queries, geo, date, end, sep="")
  URL = gsub(" ", "%20", URL)
  return(URL)
}
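#For reference, URL_GT("FTSE 100", year=2015, month=1, length=1) produces:
#"http://www.google.com/trends/trendsReport?hl=en-US&q=FTSE%20100&date=1%2F2015%201m&cmpt=q&content=1&export=1"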
downloadGT = function(URL, downloadDir){
  #Determine when the download has completed by comparing the files in the
  #download directory to the starting set
  startingFiles = list.files(downloadDir)
  browseURL(URL)
  endingFiles = list.files(downloadDir)
  #Poll the download directory every three seconds until a new file appears
  while(length(setdiff(endingFiles, startingFiles)) == 0) {
    Sys.sleep(3)
    endingFiles = list.files(downloadDir)
  }
  filePath = setdiff(endingFiles, startingFiles)
  return(filePath)
}
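#Note: some browsers write a temporary file while downloading (e.g. Chrome's
#.crdownload, Firefox's .part), which the check above can pick up before the
#download has finished. A short Sys.sleep() before reading endingFiles is a
#possible workaround (one reader's fix is described in the comments below).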
readGT = function(filePath){
  rawFiles = list()
  for(i in 1:length(filePath)){
    if(length(filePath) == 1) rawFiles[[1]] = read.csv(filePath, header=F, blank.lines.skip=F)
    if(length(filePath) > 1) rawFiles[[i]] = read.csv(filePath[i], header=F, blank.lines.skip=F)
  }
  output = data.frame()
  name = vector()
  for(i in 1:length(rawFiles)){
    data = rawFiles[[i]]
    #The keyword names are stored in row 5 of the export
    name = as.character(t(data[5, -1]))
    #Select the time series: it sits between the first two blank rows
    start = which(data[,1] == "")[1] + 3
    stop = which(data[,1] == "")[2] - 2
    #Skip to the next file if this one is empty
    if(ncol(data) < 2) next
    if(is.na(stop)) next
    data = data[start:stop, ]
    data[,1] = as.character(data[,1])
    #Convert all columns except the date column into numeric
    for(j in 2:ncol(data)) data[,j] = as.numeric(as.character(data[,j]))
    #FORMAT DATE
    len = nchar(data[1,1])
    #Monthly data ("YYYY-MM"): use the last day of each month
    if(len == 7) {
      data[,1] = as.Date(paste(data[,1], "-1", sep=""), "%Y-%m-%d")
      data[,1] = sapply(data[,1], seq, length=2, by="1 month")[2,] - 1
      data[,1] = as.Date(data[,1], origin="1970-01-01")
    }
    #Weekly data ("YYYY-MM-DD - YYYY-MM-DD"): use the end date of each week
    if(len == 23){
      data[,1] = sapply(data[,1], substr, start=14, stop=23)
      data[,1] = as.Date(data[,1], "%Y-%m-%d")
    }
    #Daily data ("YYYY-MM-DD")
    if(len == 10) data[,1] = as.Date(data[,1], "%Y-%m-%d")
    #Structure into panel (long) data format
    panelData = data[1:2]
    panelData[3] = name[1]
    names(panelData) = c("Date", "SVI", "Keyword")
    if(ncol(data) > 2) {
      for(j in 3:ncol(data)) {
        appendData = data[c(1, j)]
        appendData[3] = name[j-1]
        names(appendData) = c("Date", "SVI", "Keyword")
        panelData = rbind(panelData, appendData)
      }
    }
    #Add the source file path
    panelData[ncol(panelData)+1] = filePath[i]
    names(panelData)[4] = "Path"
    #Merge several files into one
    if(i == 1) output = panelData
    if(i > 1) output = rbind(output, panelData)
  }
  return(output)
}
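#readGT returns a long-format data frame with one row per date and keyword:
#Date, SVI (the search volume index), Keyword, and Path (the source file)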
readGeoGT = function(filePath){
  output = data.frame()
  rawFiles = list()
  for(i in 1:length(filePath)){
    if(length(filePath) == 1) rawFiles[[1]] = read.csv(filePath, header=F, blank.lines.skip=F)
    if(length(filePath) > 1) rawFiles[[i]] = read.csv(filePath[i], header=F, blank.lines.skip=F)
  }
  for(i in 1:length(rawFiles)){
    data = rawFiles[[i]]
    #The regional breakdown sits between the third and fourth blank rows
    start = which(data[,1] == "")[3] + 3
    stop = which(data[,1] == "")[4] - 1
    names = data[start-1, ]
    for(j in 1:ncol(names)) names(data)[j] = as.character(names[1, j])
    data = data[start:stop, ]
    data[,1] = as.character(data[,1])
    #Convert all columns except the region column into numeric
    for(j in 2:ncol(data)) data[,j] = as.numeric(as.character(data[,j]))
    data[ncol(data)+1] = filePath[i]
    output = rbind(output, data)
  }
  return(output)
}
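#readGeoGT parses the regional breakdown section of the same export files;
#it is not used in the example below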
# Downloading the data ----------------------------------------------------
downloadDir = "~/Downloads"  #Set this to your browser's download folder
search_terms = c("bull market", "bear market", "recession")
years = 2005:2016
months = c(1, 4, 7, 10)  #Quarterly starts to match the default three-month window
res.daily = list()
counter = 1
for(year in years){
  for(month in months){
    url = URL_GT(search_terms, year=year, month=month)
    GT_dir = downloadGT(url, downloadDir)
    GT_dir = paste(downloadDir, GT_dir, sep='/')
    res.daily[[counter]] = readGT(GT_dir)
    counter = counter + 1
  }
}
df.daily = do.call("rbind", res.daily)

#Download the full-history weekly series for the same terms
url = URL_GT(search_terms)
GT_dir = downloadGT(url, downloadDir)
GT_dir = paste(downloadDir, GT_dir, sep='/')
df.weekly = readGT(GT_dir)
# Formatting the data ------------------------------------------------------
#Join the weekly SVI onto the daily data; only dates that close a week match
df.merged = merge(df.daily, df.weekly, by=c('Date', 'Keyword'), all.x=T)
#The weekly-to-daily ratio rescales each three-month daily chunk to the weekly level
df.merged$adjustment_factor = df.merged$SVI.y / df.merged$SVI.x
#Carry the last finite adjustment factor forward over the days between weekly observations
for(i in search_terms){
  r = which(df.merged$Keyword == i)
  for(j in 2:length(r)){
    if(!is.finite(df.merged$adjustment_factor[r][j])){
      df.merged$adjustment_factor[r][j] = df.merged$adjustment_factor[r][j-1]
    }
  }
}
#Rescale the daily series; keep the raw weekly series for comparison
df.merged$daily = df.merged$adjustment_factor * df.merged$SVI.x
df.merged$weekly = df.merged$SVI.y

#Fill the weekly series forward so it is defined on every day
for(i in search_terms){
  r = which(df.merged$Keyword == i)
  for(j in 2:length(r)){
    if(is.na(df.merged$weekly[r][j])){
      df.merged$weekly[r][j] = df.merged$weekly[r][j-1]
    }
  }
}
# Plotting the data -------------------------------------------------------
df.merged$daily[which(is.infinite(df.merged$daily))] = NA
p1 = df.merged %>%
  ggplot(aes(Date, daily, color=Keyword)) + geom_line()
p2 = df.merged %>%
  ggplot(aes(Date, weekly, color=Keyword)) + geom_line()
multiplot(p1, p2)
# Saving the data ---------------------------------------------------------
write.csv(df.merged,'df.merged.csv')
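If you want to continue working with a single adjusted series, here is a short usage sketch (assuming the script above has run and the keyword labels match the search terms):
bull = subset(df.merged, Keyword == "bull market", select = c("Date", "daily"))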
Comments:
Hi,
Thanks for the code. However, I don't know how to use R. Could you please tell me what I have to do to automate downloading data for a set of firm names? Thank you so much!
Hi Amal, just download R from the R website. You can then paste this code into the console for a working example: https://github.com/321k/Google-Trends/edit/master/Working%20example.
Hello Erik,
I have a (un)related question with regard to Google Trends data's normalization. I understand that for a certain search term, the highest search volume is always scaled to 100. However, for a category (try e.g. Law & Government/Social Services/Welfare & Unemployment for the US, which is in Choi & Varian (2009)), the index is given in percentage change. If you play with the time span, the first observation is always 0%, but other observations may change.
I am guessing it is the underlying normalization that causes this change, but I want to know exactly how this index in percentage change is calculated, or at least how this transformation is made. I am using Trends data for forecasting, so it is important for me to understand how data get updated each week.
Thank you in advance!
Hello Erik,
I am trying to open a URL in a browser on a Mac in order to save a file. For example, the following link, when opened in a browser, will download a CSV file from Google Trends. However, if I use the function browseURL, it does not open or save anything.
URL="http://www.google.com/trends/trendsReport?hl=en-US&q=FTSE 100&date=1%2F2015 1m&cmpt=q&content=1&export=1"
browseURL(URL)
It works for normal web pages (it opens the page in the browser), e.g. URL="http://www.google.com"; browseURL(URL), but it doesn't work for the Google Trends CSV file.
Also, I would really appreciate it if someone could tell me how many downloads Google Trends allows before blocking the IP. How long should I wait to try again? Thanks
You can download around 400 CSVs, possibly more. The limit seems to be per day. Sorry, I don't know how to get the browseURL function to work on a Mac.
Hi Erik,
I have been able to get this function to work when I restrict the downloads through 2014, but I run into issues when I include the year 2015 in the selections. The last download (for an incomplete month) is generating an error message.
Is there some way to modify the code to set a fixed end date for extraction (at a more granular level than calendar year)? Or just to fix the end date so that the last extraction doesn't result in an error?
Thanks
Hi Erik,
Thanks for your code.
As I change the following search keyword:
query="$TWTR"
R stops and shows the following error message:
Error in `[.data.frame`(gt_data, , 2) : undefined columns selected
Please let me know what needs to be modified to be able to query the CSV data.
Thanks,
Dear Erik,
Thanks so much for the code. I have been trying to repeat your exercise using "S&P 500" instead of "FTSE 100".
I think that the presence of "&" in the word "S&P 500" messes up the query and I get the search results of "S" instead of "S&P 500". Have you encountered this problem before?
Thanks!
Dear Erik,
I figured it out. Just needed to replace "S&P 500" with "S%26P 500". Thanks!
Good to hear, share your results when you're done!
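For anyone hitting the same issue: base R can produce this escaping automatically. A minimal sketch, using the standard utils::URLencode (not part of the original script):
keyword = utils::URLencode("S&P 500", reserved=TRUE)  #gives "S%26P%20500"
url = URL_GT(keyword, year=2015, month=1, length=1)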
FYI, I had to add a pause in the download function to get it to work properly with Chrome. On occasion it was confusing Chrome's temporary download with the finished file.
It was just a 5-second pause before the
endingFiles=list.files(downloadDir)
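That is, a sketch of the change inside downloadGT (placement based on the commenter's description):
browseURL(URL)
Sys.sleep(5)  #give the browser time to finish writing its temporary file
endingFiles=list.files(downloadDir)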
Hi Erik,
When I run the script, all the CSV files get named Report(#). So I wondered if there is a way to rename the files?
What browser are you using? Try using Firefox, that works for me. Chrome also works fine.
I'm using Firefox. Is there a renaming function in the script already?
Hi Erik,
Thanks for the wonderful code. Could you please suggest what would have to change to get the Trends data at the minute level? It would be a great help.
Thanks.
Prakash Anand
There's no renaming function unfortunately. You can rename the files after they have been downloaded with the file.rename function in R. More here: https://stat.ethz.ch/R-manual/R-devel/library/base/html/files.html
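For example (hypothetical file names):
file.rename(file.path(downloadDir, "Report(1).csv"), file.path(downloadDir, "bull market 2005-1.csv"))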
Here's a working example in R in case you get stuck. It should work with copy and paste if you have the libraries installed. https://github.com/321k/Google-Trends/blob/master/Daily%20data%20example.R
Hi Erik,
Thanks for your code from me as well. Very useful indeed! I did get it running, but the daily re-indexed data I get in the "summary" data frame seems to be missing one day from each month, i.e. I get 30 days of data for January instead of 31, 27 days for February instead of 28, etc.
Any help on what I might be doing wrong here would be highly appreciated!
Many thanks,
Paul
Dear Erik,
Thank you for your code.
I tested this code using the same inputs (just copy and paste), but I only get three months of data (January to March 2005).
The program keeps running without any further output (and without errors).
The process gets stuck in the downloading loop:
> for(year in years){
+ for(month in months){
+ url=URL_GT(search_terms, year=year, month=month)
+ GT_dir = downloadGT(url, downloadDir)
+ GT_dir = paste(downloadDir, GT_dir, sep='/')
+ res.daily[[counter]] = readGT(GT_dir)
+ counter=counter+1
+ }
+ }
After several hours, I had to press stop to end the process.
Could you please advise how I can solve this issue?
Thanks,
Phasin
Hi Erik,
Thanks for providing this method to download Google Trends data. However, when I try to run the code, it always gives me an error message in the browser saying "404 The requested URL was not found on this server". Can you help me solve this problem? Thank you.