Wednesday, December 02, 2015

Text mining sparse data sets with R and tm

If you've been playing with DocumentTermMatrix from the tm package in R, you might have run into this error when converting the result to an ordinary matrix:

Error in vector(typeof(x$v), nr * nc) : vector size cannot be NA
In addition: Warning message:
In nr * nc : NAs produced by integer overflow

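The object containing the data is too large to be converted to a matrix: as.matrix() allocates one cell for every document-term pair, and that count overflows R's integer type. How do we get around this? We need to remove sparse terms, meaning words that appear in only a handful of documents.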
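To see where the message comes from, here is a minimal base R sketch of the overflow. The dimensions are made-up numbers for illustration, not taken from a real corpus.

```r
# Hypothetical dimensions of a document-term matrix
nr <- 50000L   # number of documents (assumed)
nc <- 60000L   # number of terms (assumed)

# Integer multiplication overflows and yields NA (normally with a warning)
cells <- suppressWarnings(nr * nc)
is.na(cells)

# The true cell count, computed in double precision, exceeds the integer limit
as.numeric(nr) * as.numeric(nc)
.Machine$integer.max
```

Since the cell count is rows times columns, dropping rarely used terms shrinks the column count and brings the product back under the limit.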

library(tm)         # Corpus, tm_map, DocumentTermMatrix
library(magrittr)   # provides %>%
library(wordcloud)

# x is a character vector with one document per element
corp <- Corpus(VectorSource(x)) %>% tm_map(content_transformer(tolower)) %>% tm_map(stripWhitespace) %>% tm_map(stemDocument) %>% tm_map(removePunctuation)
dtm <- DocumentTermMatrix(corp)

# Count how many documents each term appears in
density <- vector(length=ncol(dtm))
for(i in 1:ncol(dtm))
  density[i] <- length(dtm[,i]$j)

# Keep only the terms found in more than 10 documents, then convert
r <- which(density > 10)
m <- as.matrix(dtm[,r])
v <- sort(colSums(m), decreasing=TRUE)
d <- data.frame(word=names(v),freq=v)
wordcloud(words = d$word, freq = d$freq)
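The per-column loop can be slow when there are many terms. As a sketch of one alternative, the document counts can be computed in a single pass over the j field of the matrix, which holds the column index of every nonzero cell. The toy documents and the threshold of 2 below are assumptions chosen only to keep the example small.

```r
library(tm)

# Toy documents (made up for illustration)
docs <- c("the cat sat", "the dog sat", "the bird flew")
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Each nonzero cell contributes one entry to dtm$j, so tabulating it
# gives the number of documents each term occurs in
doc_freq <- tabulate(dtm$j, nbins = ncol(dtm))
names(doc_freq) <- colnames(dtm)

# Keep terms that occur in at least 2 documents, then convert
keep <- which(doc_freq >= 2)
small <- as.matrix(dtm[, keep])
```

tm also provides removeSparseTerms(dtm, sparse), which filters by a maximum sparsity ratio rather than a document count; which of the two is more convenient depends on how you prefer to state the cutoff.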
