
Saturday, December 26, 2015

Problems with the BTYD walk-through fixed

If you're going through the Buy 'Til You Die package's walk-through, you are bound to get stuck in a couple of places. Here are fixes to some of those problems.

Page 5
Warning message:
In cbind(f, r, T) : number of rows of result is not a multiple of vector length (arg 2)

This warning appears because the walk-through specifies tot.cbt as tot.cbt = dc.CreateFreqCBT(elog). This is incorrect and should be tot.cbt = dc.CreateFreqCBT(elog.cal). After making that change, the warning goes away.
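For copy and paste, here is the corrected line; elog.cal is the calibration-period event log created earlier in the walk-through:

tot.cbt = dc.CreateFreqCBT(elog.cal)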

Page 6
Error: could not find function "pnbd.PlotDropoutHeterogeneity"

pnbd.PlotDropoutHeterogeneity(params) doesn't work because the function name has changed. Replace it with pnbd.PlotDropoutRateHeterogeneity(params) and it works fine.
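The renamed call, with params being the fitted model parameters from the walk-through:

pnbd.PlotDropoutRateHeterogeneity(params)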

Page 8
Error in pnbd.PAlive(params, x, t.x, T.cal) : could not find function "hyperg_2F1"

If you haven't loaded the "gsl" package, functions like pnbd.PAlive and pnbd.ConditionalExpectedTransactions will throw this error, since they rely on the hypergeometric function hyperg_2F1 from gsl. It's easily fixed by loading the gsl library.
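A minimal fix; install the package first if you don't already have it:

#install.packages("gsl")
library(gsl)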

Friday, October 16, 2015

Three tips when using a Random Forest in R

1. Make sure to have either factors or numeric variables in the regression. No strings allowed!
2. Make sure that you have a reasonable number of factors. About six should do the trick.
3. Reduce your sample size and the number of trees when testing. You only need a large number of trees to avoid overfitting. If your model is underperforming with a small number of trees, your problem isn't overfitting. A quick sketch putting these tips into practice follows below.
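A minimal sketch, assuming a data frame df with a numeric target column y (both placeholders), using the randomForest package:

library(randomForest)

#Tip 1: randomForest rejects character columns, so convert them to factors first
df[]=lapply(df, function(col) if(is.character(col)) factor(col) else col)

#Tip 3: test on a small sample with few trees; scale up once everything runs
test=df[sample(nrow(df), min(nrow(df), 1000)), ]
fit=randomForest(y~., data=test, ntree=50)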

Monday, October 12, 2015

Faster BTYD

The R-library "Buy 'Til You Die" for customer equity forecasting looked promising, but I quickly realised that the implementation relies heavily on loops and is not suited for any data set bigger than a couple of thousand rows.

To speed things up, I switched out a couple of the loops for some data.table magic. Here it is, about 10000X faster than the original, but there is still some room for improvement.
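The gist has the full code, but the core idea deserves a quick sketch. Instead of looping over customers, data.table can compute each customer's frequency, recency, and observation time in one grouped pass. The snippet below is a hedged illustration of that approach rather than the gist itself; it assumes an event log elog with columns cust and date, plus a calibration end date end.of.cal.period:

library(data.table)

#One grouped pass over the event log instead of a per-customer loop
elog=data.table(elog)
cbs=elog[, list(x=.N-1,                                         #repeat transactions
                t.x=as.numeric(max(date)-min(date)),            #recency
                T.cal=as.numeric(end.of.cal.period-min(date))), #time observed
         by=cust]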

If you're having trouble getting the BTYD package to run, take a look at this post for fixes.

Tuesday, September 08, 2015

How to speed up csv import in R

If you're storing a lot of data in csv files, reading it into R with read.csv can be painfully slow. Using the fread function from the data.table library can increase read speed by around 100x, so definitely check it out.

library(data.table)
data <- fread('data.csv')

If you prefer to work with a data.frame, it's easy to tell fread to return one instead.

data <- fread('data.csv', data.table=FALSE)

Thursday, May 21, 2015

TransferWise's community visualized


From the TransferWise blog:

As TransferWise grows we notice something pretty special – our members love sharing the service with friends.  To say thanks, the TransferWise referral programme was born. Now, you’re rewarded if you refer a friend to TransferWise.

Created with R and Gephi.

Monday, December 08, 2014

Creating daily search volume data from weekly and daily data using R

In my previous post, I explained the general principle behind using Google Trends' weekly and daily data to create daily time series longer than 90 days. Here, I provide the steps to take in R to achieve the same results.


#Start by copying these functions into R.
#Then run the following code:
#NB! In order for the code to run properly, you will have to specify the download directory of your default browser (downloadDir)

downloadDir="C:/downloads"

url=vector()
filePath=vector()
adjustedWeekly=data.frame()
keyword="google trends"



#Create URLs to daily data
for(i in 1:12){
    url[i]=URL_GT(keyword, year=2013, month=i, length=1)
}

#Download
for(i in 1:length(url)){
    filePath[i]=downloadGT(url[i], downloadDir)
}

dailyData=readGT(filePath)
dailyData=dailyData[order(dailyData$Date),]

#Get weekly data
url=URL_GT(keyword, year=2013, month=1, length=12)
filePath=downloadGT(url, downloadDir)
weeklyData=readGT(filePath)

adjustedDaily=dailyData[1:2]
adjustedDaily=merge(adjustedDaily, weeklyData[1:2], by="Date", all=T)
adjustedDaily[4:5]=NA
names(adjustedDaily)=c("Date", "Daily", "Weekly", "Adjustment_factor", "Adjusted_daily")

#Carry the last daily value forward where the merge left gaps (the weekly and daily dates don't always line up)
for(i in 2:nrow(adjustedDaily)){
    if(is.na(adjustedDaily$Daily[i])) adjustedDaily$Daily[i]=adjustedDaily$Daily[i-1]
}

#Create adjustment factor
adjustedDaily$Adjustment_factor=adjustedDaily$Weekly/adjustedDaily$Daily

#Remove data before first available adjustment factor
start=which(is.finite(adjustedDaily$Adjustment_factor))[1]
stop=nrow(adjustedDaily)
adjustedDaily=adjustedDaily[start:stop,]

#Fill in missing adjustment factors
for(i in 1:nrow(adjustedDaily)){
    if(is.na(adjustedDaily$Adjustment_factor[i])) adjustedDaily$Adjustment_factor[i]=adjustedDaily$Adjustment_factor[i-1]
}

#Calculate adjusted daily values
adjustedDaily$Adjusted_daily=adjustedDaily$Daily*adjustedDaily$Adjustment_factor


#Plot the results
library(ggplot2)
ggplot(adjustedDaily, aes(x=Date, y=Adjusted_daily))+geom_line(col="blue")+ggtitle("SVI for Google Trends")

Friday, December 05, 2014

Scraping Google Trends with R

These R functions will allow you to programmatically download Google Trends data and import it into R.

Step 1: Install the Google Trends functions from my Github account.
Step 2: Sign in to Google Trends in your main browser.
Step 3: Define the keywords you need.

keywords=c("Samsung", "Apple", "Xiaomi")

Step 4: Create a list of URLs (in this example, we'll have only one URL).

url=URL_GT(keywords)

Step 5: Specify your browser's download directory and set it as your working directory.

downloadDir="C:/downloads"
setwd(downloadDir)

Step 6: Download the CSV files. The function returns the file name.

filePath=downloadGT(url, downloadDir)

Step 7: Import the CSV into R.

googletrends_data=readGT(filePath)

In this post, I write about how to merge daily data from Google Trends into longer time series using R.

Wednesday, December 03, 2014

Converting Google Trends weekly data into a regular date in R

Getting dates into the right format is always a big headache when preparing your data for analysis. Google Trends provides data at three levels of granularity: monthly, weekly, and daily. Here, I explain how to convert Google Trends' weekly dates into R's Date class.

The weekly date format used by Google Trends looks like this:

2004-01-04 - 2004-01-10
 

This might pose a problem for instance if we want to plot the data. We will need to convert the date interval provided by Google Trends into a single date. We can do so using the following code:

#First, import the data 
data=read.csv(filePath, header=F, blank.lines.skip=F)
 
#Then select the ending date of the date interval
data[,1]=sapply(data[,1], substr, start=14, stop=30)
 
#And convert it into a date 
data[,1]=as.Date(data[,1], "%Y-%m-%d")
 
And then you have a date format that you can use for plots or data analysis.


 

Creating your own Google Trends plots in R

With this post, I want to demonstrate how you can use the R functions I've built to create your own Google Trends graphs.



First, install the functions by pasting this code into R.

You now have the following functions at your disposal:
  • URL_GT(keywords)
  • downloadGT(url, downloadDir)
  • readGT(filePath)
Running the following code will produce the chart above.

#downloadDir=Where you save the csv from Google Trends
downloadDir="C:/downloads"
setwd(downloadDir)
keywords=c("Samsung", "Apple", "Nokia")
url=URL_GT(keywords)
filePath=downloadGT(url, downloadDir)
smartphones=readGT(filePath)

library(ggplot2)

ggplot(smartphones, aes(x=Date, y=SVI, color=Keyword))+geom_line()

Monday, May 19, 2014

Google Trends mass import using R

I was looking for a tool for mass download of daily Google Trends data for my thesis and couldn't find anything that worked for my needs, so I built my own. The tool downloads Google Trends data in monthly snippets for a given search word and a given year or years.

I thought I would share my work here in case someone else has stumbled on the same issues. The file can be found here on github: https://gist.github.com/321k/823cce9769e58bc14214.

Make sure you have the quantmod library installed before you run it.

Getting the data

Since the code works by downloading the file through the browser, it is not affected by changes to Google's authentication policy. Simply make sure that you are signed in to your Google account. This does, however, mean that the data download is slow. I recommend using Firefox with a tab manager to close the tabs after each download has completed. Tab Mix Plus 0.4.1.3.1 works great for me. You will also need to check the box in the download prompt that lets Firefox download without prompting you. Finally, you need to specify the download directory as an empty folder.

There are four functions: downloadGT, importGT, formatGT, and mergeGT. downloadGT takes two inputs: the years you want to download and the search query you want to get. To run multiple queries, simply add a loop:

queries=c("MSFT", "AAPL")
years=c("2012", "2013")
for(i in 1:length(queries)) {downloadGT(years, queries[i])}

Formatting the data

Once we have downloaded the files, we need to import them into R and put them in a usable format. importGT gets the data, formatGT extracts the time series data, and mergeGT puts the individual months together into complete time series by company.
path="C:/data"
rawData=importGT(path)
formatedData=formatGT(rawData)
mergedData=mergeGT(formatedData)
And there you have it. I'm currently in the process of downloading the Google Trends data for the FTSE 100 between 2004 and 2013. Google's quota limit allows me to download about ten companies per day, or 1200 files. Feel free to leave a comment or suggestion.

Tuesday, May 06, 2014

Daily Google Trends data using R

This blog post is a work in progress and has been updated several times. Last update 11.5.2016.

For an explanation of how to combine weekly and daily Google Trends data to create long daily time series with Search Volume Data, please refer to this blog post. You can find a working example in R here.

This R script weaves daily Google Trends data together into a continuous time series. For a look at how daily Google Trends data differs from the weekly data, take a look at this blog post. The graph below illustrates the daily search data for "FTSE 100".


Since I had a lot of problems with authentication in the available tools for downloading Google Trends data, I decided to circumvent the whole authentication issue by downloading the files through the browser. The file downloads are automated using R. To execute the example below, you will need to add these functions from my Github account to R.
Make sure that you are signed in before you run the script. Since you will download 120 individual csv files, it will take several minutes to complete the run.

The function URL_GT creates the URL for the Google Trends file export, downloadGT opens that URL in the browser to download the file, and readGT imports the downloaded file and returns the time series data. When the functions are available in R, running the code below should return a time series that looks like the one above. Let me know in the comments if it isn't working.


library(Rmisc)
library(ggplot2)
library(dplyr)

# The Google Trends formatting functions ----------------------------------

#This script automates the downloading of Google Trends data.
#It works best with Firefox in combination with the Tab Mix Plus add-on, which is used to automate tab closing.
#Ask Firefox not to prompt for new downloads and this script should run automatically.
#Google Trends restricts the number of downloads to roughly 400 at a time.

URL_GT=function(keyword="", country=NA, region=NA, year=NA, month=1, length=3){
  
  start="http://www.google.com/trends/trendsReport?hl=en-US&q="
  end="&cmpt=q&content=1&export=1"
  geo=""
  date=""
  
  #Geographic restrictions
  if(!is.na(country)) {
    geo="&geo="
    geo=paste(geo, country, sep="")
    if(!is.na(region)) geo=paste(geo, "-", region, sep="")
  }
  
  queries=keyword[1]
  if(length(keyword)>1) {
    for(i in 2:length(keyword)){
      queries=paste(queries, "%2C ", keyword[i], sep="")
    }
  }
  
  #Dates
  if(!is.na(year)){
    date="&date="
    date=paste(date, month, "%2F", year, "%20", length, "m", sep="")
  }
  
  URL=paste(start, queries, geo, date, end, sep="")
  URL <- gsub(" ", "%20", URL)
  return(URL)
}

downloadGT=function(URL, downloadDir){
  
  #Determine if download has been completed by comparing the number of files in the download directory to the starting number
  startingFiles=list.files(downloadDir)
  browseURL(URL)
  endingFiles=list.files(downloadDir)
  
  while(length(setdiff(endingFiles,startingFiles))==0) {
    Sys.sleep(3)
    endingFiles=list.files(downloadDir)
  }
  filePath=setdiff(endingFiles,startingFiles)
  return(filePath)
}


readGT=function(filePath){
  rawFiles=list()
  
  for(i in 1:length(filePath)){
    if(length(filePath)==1) rawFiles[[1]]=read.csv(filePath, header=F, blank.lines.skip=F)
    if(length(filePath)>1) rawFiles[[i]]=read.csv(filePath[i], header=F, blank.lines.skip=F)
  }
  
  output=data.frame()
  name=vector()
  
  for(i in 1:length(rawFiles)){
    data=rawFiles[[i]]
    name=as.character(t(data[5,-1]))
    
    #Select the time series
    start=which(data[,1]=="")[1]+3
    stop=which(data[,1]=="")[2]-2
    
    #Skip to next if file is empty
    if(ncol(data)<2) next
    if(is.na(which(data[,1]=="")[2]-2)) next
    
    data=data[start:stop,]
    data[,1]=as.character(data[,1])
    
    #Convert all columns except date column into numeric
    for(j in 2:ncol(data)) data[,j]=as.numeric(as.character(data[,j]))
    
    #FORMAT DATE
    len=nchar(data[1,1])
    
    #Monthly data
    if(len==7) {
      data[,1]=as.Date(paste(data[,1], "-1", sep=""), "%Y-%m-%d")
      data[,1]=sapply(data[,1], seq, length=2, by="1 month")[2,]-1
      data[,1]=as.Date(data[,1], "%Y-%m-%d", origin="1970-01-01")
    }
    
    #Weekly data
    if(len==23){
      data[,1]=sapply(data[,1], substr, start=14, stop=30)
      data[,1]=as.Date(data[,1], "%Y-%m-%d")
    }
    
    #Daily data
    if(len==10) data[,1]=as.Date(data[,1], "%Y-%m-%d")
    
    #Structure into panel data format
    panelData=data[1:2]
    panelData[3]=name[1]
    names(panelData)=c("Date", "SVI", "Keyword")
    if(ncol(data)>2) {
      
      for(j in 3:ncol(data)) {
        appendData=data[c(1,j)]
        appendData[3]=name[j-1]
        names(appendData)=c("Date", "SVI", "Keyword")
        panelData=rbind(panelData, appendData)
      }
    }
    
    #Add file name  
    panelData[ncol(panelData)+1]=filePath[i]
    
    #Add path to filename
    names(panelData)[4]="Path"
    
    #Merge the individual files into one
    if(i==1) output=panelData
    if(i>1) output=rbind(output, panelData)
  }
  return(output)
}

readGeoGT=function(filePath){
  output=data.frame()
  rawFiles=list()
  for(i in 1:length(filePath)){
    if(length(filePath)==1) rawFiles[[1]]=read.csv(filePath, header=F, blank.lines.skip=F)
    if(length(filePath)>1) rawFiles[[i]]=read.csv(filePath[i], header=F, blank.lines.skip=F)
  }
  
  for(i in 1:length(rawFiles)){
    data=rawFiles[[i]]
    start=which(data[,1]=="")[3]+3
    stop=which(data[,1]=="")[4]-1
    names=data[start-1,]
    
    for(j in 1:ncol(names)) names(data)[j]=as.character(names[1,j])
    data=data[start:stop,]
    data[,1]=as.character(data[,1])
    #Convert all columns except the first into numeric, column by column
    for(j in 2:ncol(data)) data[,j]=as.numeric(as.character(data[,j]))
    data[ncol(data)+1]=filePath[i]
    
    output=rbind(output, data)
  }
  return(output)
}


# Downloading the data ----------------------------------------------------


#Specify your browser's download directory before running
downloadDir="C:/downloads"

search_terms = c("bull market", "bear market", "recession")

years = c(2005,2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016)
months = c(1,4,7,10)
res.daily=list()
counter=1
for(year in years){
  for(month in months){
    url=URL_GT(search_terms, year=year, month=month)
    GT_dir = downloadGT(url, downloadDir)
    GT_dir = paste(downloadDir, GT_dir, sep='/')
    res.daily[[counter]] = readGT(GT_dir)
    counter=counter+1
  }
}

df.daily <- do.call("rbind", res.daily)

url = URL_GT(search_terms)
GT_dir = downloadGT(url, downloadDir)
GT_dir = paste(downloadDir, GT_dir, sep='/')
df.weekly = readGT(GT_dir)


# Formatting the data -----------------------------------------------------


df.merged = merge(df.daily, df.weekly, by=c('Date', 'Keyword'), all.x=T)
df.merged$adjustment_factor = df.merged$SVI.y /df.merged$SVI.x

for(i in search_terms){
  r=which(df.merged$Keyword==i)
  for(j in 2:length(r)){
    if(!is.finite(df.merged$adjustment_factor[r][j])){
      df.merged$adjustment_factor[r][j] = df.merged$adjustment_factor[r][j-1]
    }
  }
}
df.merged$daily = df.merged$adjustment_factor * df.merged$SVI.x
df.merged$weekly = df.merged$SVI.y
for(i in search_terms){
  r=which(df.merged$Keyword==i)
  for(j in 2:length(r)){
    if(is.na(df.merged$weekly[r][j])){
      df.merged$weekly[r][j] = df.merged$weekly[r][j-1]
    }
  }
}


# Plotting the data -------------------------------------------------------

df.merged$daily[which(is.infinite(df.merged$daily))] = NA

p1 = df.merged %>%
  ggplot(aes(Date, daily, color=Keyword))+geom_line()

p2 = df.merged %>%
  ggplot(aes(Date, weekly, color=Keyword))+geom_line()

multiplot(p1,p2)


# Saving the data ---------------------------------------------------------


write.csv(df.merged,'df.merged.csv')