Frequently Asked Questions
This document contains answers to some of the most frequently asked questions about tm.
- How should I cite tm?
- Where can I find the tools to read in a PDF file?
- What is the easiest way to handle custom file formats?
- What about error messages indicating invalid multibyte strings?
- Can I use bigrams instead of single tokens in a term-document matrix?
- How can I plot a term-document matrix?
How should I cite tm?
Please have a look at the output of citation("tm") in R. A BibTeX representation can be obtained via toBibtex(citation("tm")).
The preferred way for journal and conference papers is to cite the JSS article.
I want to read in a PDF file using the readPDF reader. However, the manual says I need the tool pdftotext installed and accessible on my system. Where can I find and how can I install this tool?
Windows users might find an R-help thread on this topic useful.
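pdftotext is a command-line tool distributed with Xpdf and with the Poppler utilities. Before calling readPDF, it can help to check whether R can actually see the executable. The following is a minimal base R sketch; the variable names are illustrative only, and it only tests whether the tool is on the search path, not whether it works.

```r
# Sys.which() returns the full path of an executable found on the PATH,
# or an empty string if none is found
pdftotext_path <- unname(Sys.which("pdftotext"))
has_pdftotext <- nzchar(pdftotext_path)

if (has_pdftotext) {
  message("pdftotext found at: ", pdftotext_path)
} else {
  message("pdftotext not found on the PATH; install it or adjust PATH before using readPDF")
}
```

If the tool is installed but not found, the directory containing it probably needs to be added to the PATH environment variable.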
My documents are stored in file format XYZ. How do I get the material into tm and construct a corpus from it?
Please have a look at the vignette Extensions: How to Handle Custom File Formats.
What about error messages indicating invalid multibyte strings?
Ensure that all your datasets and documents are encoded in UTF-8. If you still have problems, tm_map(yourCorpus, function(x) iconv(enc2utf8(x), sub = "byte")) will replace non-convertible bytes in yourCorpus with strings showing their hex codes.
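To see what sub = "byte" does, here is a minimal base R sketch that applies iconv directly to a character vector rather than to a corpus. The sample string is made up for illustration, and the explicit from/to arguments are an assumption for the demonstration; the call above relies on the defaults instead.

```r
# Hypothetical sample: "\xE7" is c-cedilla in Latin-1, but on its own it is
# not a valid UTF-8 byte sequence
x <- "fa\xE7ile"

# If the true source encoding is known, convert explicitly to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(y)

# If it is not known, sub = "byte" replaces each non-convertible byte with
# its hex code in the form <xx> instead of returning NA for the whole string
z <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
print(z)
```

Converting from the correct source encoding keeps the original character, whereas sub = "byte" only makes the string valid, so it is best treated as a fallback.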
Can I use bigrams instead of single tokens in a term-document matrix?
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x)
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345, 1:10])
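If RWeka (and the Java installation it requires) is not available, a similar tokenizer can be sketched with ngrams() and words() from package NLP. This is an alternative to the approach above, not a replacement for it. It assumes a tm version that depends on NLP, so that loading tm makes both functions available. The row range passed to inspect() is arbitrary.

```r
library("tm")   # assumed to attach NLP, which provides ngrams() and words()
data("crude")

# Join each pair of adjacent words into a single "w1 w2" token
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "), use.names = FALSE)

tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[1:5, 1:10])
```

The two tokenizers split text into words differently, so the resulting terms and row positions will generally not match those produced by the RWeka version.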
How can I plot a term-document matrix like Figure 6 in the JSS article on tm?
Please check the manual accessible via ?plot.TermDocumentMatrix for available arguments to the plot function. A plot similar to Figure 6 can be produced e.g. with:
library("tm")
data("crude")
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE))
plot(tdm, terms = findFreqTerms(tdm, lowfreq = 6)[1:25], corThreshold = 0.5)