Frequently Asked Questions
This document contains answers to some of the most frequently asked questions about tm.
- How should I cite tm?
- Where can I find the tools to read in a PDF file?
- What is the easiest way to handle custom file formats?
- What about error messages indicating invalid multibyte strings?
- Can I use bigrams instead of single tokens in a term-document matrix?
- How can I plot a term-document matrix?
How should I cite tm?
Please have a look at the output of citation("tm") in R. A BibTeX representation can be obtained via toBibtex(citation("tm")).
The preferred way for journal and conference papers is to cite the JSS article.
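As a minimal sketch, the two calls above can be combined to write the citation to a BibTeX file. The file name tm.bib is only an illustrative choice.

```r
library("tm")

# citation() returns a bibentry-based object describing how to cite the package
cit <- citation("tm")
print(cit)

# toBibtex() (from utils) converts it to BibTeX lines, which are plain character data
bib <- toBibtex(cit)
writeLines(bib, "tm.bib")   # "tm.bib" is a hypothetical output file name
```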
I want to read in a PDF file using the readPDF reader. However, the manual says I need the tool pdftotext installed and accessible on my system. Where can I find this tool, and how can I install it?
pdftotext is distributed with the Xpdf and Poppler command-line utilities. Windows users might find an R-help thread on this topic useful.
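Before building a corpus, it can help to check from within R whether the tools are visible on the search path. The sketch below assumes a recent tm version in which readPDF() takes an engine argument and the "xpdf" engine calls both pdftotext and pdfinfo. Older versions have a different interface, so treat the readPDF() call as an assumption.

```r
library("tm")

# Sys.which() returns "" for every executable it cannot find on the PATH
tools <- Sys.which(c("pdftotext", "pdfinfo"))
print(tools)

if (all(nzchar(tools))) {
  # Assumption: readPDF() accepts engine = "xpdf" in the installed tm version
  pdf_reader <- readPDF(engine = "xpdf")
  message("pdftotext and pdfinfo found; the xpdf engine should be usable.")
} else {
  message("Missing tool(s): ", paste(names(tools)[!nzchar(tools)], collapse = ", "))
}
```

If a tool is reported missing even though it is installed, the usual cause is that its directory is not part of the PATH seen by the R session.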
My documents are stored in file format XYZ. How do I get the material into tm and construct a corpus from it?
Please have a look at the vignette Extensions: How to Handle Custom File Formats.
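The vignette remains the reference for writing proper sources and readers. As a simpler fallback that is not taken from the vignette, one can extract the text with ordinary R code and hand the resulting character vector to VectorSource(). In the sketch below, parse_xyz() and the "XYZ:" prefix are hypothetical stand-ins for whatever parsing the real format requires.

```r
library("tm")

# Hypothetical parser: strips a made-up "XYZ:" prefix from one raw record
parse_xyz <- function(record) sub("^XYZ:", "", record)

# Hypothetical raw records; in practice these would be read from the XYZ files
raw <- c("XYZ:first document text", "XYZ:second document text")

# One plain-text string per document
texts <- vapply(raw, parse_xyz, character(1), USE.NAMES = FALSE)

# Each element of the character vector becomes one document in the corpus
corpus <- VCorpus(VectorSource(texts))
print(corpus)
```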
What about error messages indicating invalid multibyte strings?
Ensure that all your datasets and documents are encoded in UTF-8. If you still have problems,
    tm_map(yourCorpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
will replace non-convertible bytes in yourCorpus with strings showing their hex codes.
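To see what the substitution does, here is a small base R sketch on a single string. Unlike the call above, it names the source and target encodings explicitly (an assumption made so that the result does not depend on the session locale). The byte 0xe9 is a Latin-1 "é" and is not valid UTF-8 on its own.

```r
# Build a string containing the bytes c, a, f, 0xe9 (invalid as UTF-8)
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
validUTF8(bad)      # FALSE

# sub = "byte" replaces each non-convertible byte with its hex code in angle brackets
fixed <- iconv(bad, from = "UTF-8", to = "UTF-8", sub = "byte")
validUTF8(fixed)    # TRUE
print(fixed)
```

The repaired string keeps the readable part and marks the offending byte, which makes it easy to locate the affected documents afterwards.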
Can I use bigrams instead of single tokens in a term-document matrix?
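Yes. The ngrams() function from package NLP can be used to build a corresponding tokenizer, e.g.:
    library("tm")
    data("crude")

    # Split a document into words, form all pairs of adjacent words,
    # and paste each pair into a single "word1 word2" term
    BigramTokenizer <- function(x)
        unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

    tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
    inspect(removeSparseTerms(tdm[, 1:10], 0.7))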
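The core of the tokenizer is the n-gram step, which is easier to follow on a plain character vector. The sketch below uses four made-up tokens rather than a real document.

```r
library("NLP")

# Illustrative tokens, not taken from the crude corpus
tokens <- c("crude", "oil", "prices", "fell")

# ngrams() returns a list of length-2 character vectors, in order of position
pairs <- ngrams(tokens, 2L)

# Collapse each pair into one space-separated term
bigrams <- vapply(pairs, paste, character(1), collapse = " ")
print(bigrams)
# "crude oil" "oil prices" "prices fell"
```

A sequence of k tokens yields k - 1 bigrams, so the bigram matrix typically has many more, and much sparser, rows than the single-token one.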
How can I plot a term-document matrix like Figure 6 in the JSS article on tm?
Please check the manual accessible via ?plot.TermDocumentMatrix for available arguments to the plot function. A plot similar to Figure 6 can be produced e.g. with:
    library("tm")
    data("crude")

    tdm <- TermDocumentMatrix(crude,
                              control = list(removePunctuation = TRUE,
                                             removeNumbers = TRUE,
                                             stopwords = TRUE))
    plot(tdm,
         terms = findFreqTerms(tdm, lowfreq = 6)[1:25],
         corThreshold = 0.5)
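It can be useful to inspect the selected terms before plotting. The sketch below is a slightly more defensive variant: it uses head() instead of [1:25], which avoids NA entries should fewer than 25 terms qualify, and it only plots when the Rgraphviz package, which the plot method relies on, is installed.

```r
library("tm")
data("crude")

tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE))

# Terms that occur at least 6 times across the corpus
freq_terms <- findFreqTerms(tdm, lowfreq = 6)
selected <- head(freq_terms, 25)
print(selected)

# Plot only if Rgraphviz is available; otherwise just report it
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  plot(tdm, terms = selected, corThreshold = 0.5)
} else {
  message("Rgraphviz is not installed; skipping the plot.")
}
```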