Frequently Asked Questions

This document contains answers to some of the most frequently asked questions about tm.


  1. How should I cite tm?
  2. Where can I find the tools to read in a PDF file?
  3. What is the easiest way to handle custom file formats?
  4. Error: Row indices are not sorted within columns
  5. Can I use bigrams instead of single tokens in a term-document matrix?

  1. How should I cite tm?

    Please have a look at the output of citation("tm") in R. A BibTeX representation can be obtained via toBibtex(citation("tm")). The output is also available via the CRAN package citation info.

    For journal and conference papers, the preferred way is to cite the JSS (Journal of Statistical Software) article.
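
    The two calls mentioned above can be run as follows. This is a minimal sketch that assumes tm is already installed; the variable names cit and bib are only illustrative.

    ```r
    # Retrieve the citation entries that the tm package ships with.
    # The package only needs to be installed, not loaded.
    cit <- citation("tm")
    print(cit)

    # Convert the same entries to BibTeX, e.g. to paste into a .bib file.
    bib <- toBibtex(cit)
    print(bib)
    ```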

  2. I want to read in a PDF file using the readPDF reader. However, the manual says I need both the tools pdftotext and pdfinfo installed and accessible on my system. Where can I find these tools, and how can I install them?

    Many Linux distributions provide pre-built packages: poppler-utils, xpdf-utils, or similar. Windows users need to download and install Xpdf. Ensure that both programs are on your PATH.

    Windows users might find a recent R-help thread on this topic useful.
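
    To check whether both tools are visible from within R, something like the following sketch may help. It uses only base R; the variable names tools and not_found are illustrative.

    ```r
    # Look up both command-line tools on the PATH that R sees.
    tools <- Sys.which(c("pdftotext", "pdfinfo"))
    print(tools)

    # Sys.which() returns an empty string for a tool it cannot find.
    not_found <- names(tools)[tools == ""]
    if (length(not_found) > 0) {
      message("Not found on PATH: ", paste(not_found, collapse = ", "))
    } else {
      message("Both tools are available.")
    }
    ```

    If a tool is reported as not found after installation, the PATH seen by R probably differs from the one in your shell; restarting R after changing PATH is usually necessary.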

  3. My documents are stored in file format XYZ. How do I get the material into tm and construct a corpus from it?

    Please have a look at the vignette Extensions: How to Handle Custom File Formats.

  4. When I create a term-document matrix (TermDocMatrix), the following message is displayed: invalid class "dgCMatrix" object: row indices are not sorted within columns

    This was a bug in the term-document matrix construction code, triggered by a change in the Matrix package. Updating the tm package to the latest version available from CRAN fixes the issue.
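
    A minimal sketch of the update, assuming tm is already installed and a CRAN mirror is reachable from your R session (the variable name installed is illustrative):

    ```r
    # Version of tm currently installed (this errors if tm is missing).
    installed <- packageVersion("tm")
    print(installed)

    # Install the latest release from CRAN, replacing the installed one.
    install.packages("tm")
    ```

    Restart R afterwards so that the new version is the one that gets loaded.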

  5. Can I use bigrams instead of single tokens in a term-document matrix?

    Yes. RWeka provides a tokenizer for arbitrary n-grams which can be passed directly to the term-document matrix constructor. For example:

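      library("RWeka")
      library("tm")

      # Example corpus of 20 Reuters news articles shipped with tm
      data("crude")

      # Tokenizer that emits only bigrams (min = max = 2 words)
      BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
      tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

      # Show a small slice of the matrix: terms 340 to 345, documents 1 to 10
      inspect(tdm[340:345, 1:10])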
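
    If installing RWeka (and the Java runtime it needs) is not an option, the same idea can be sketched with the ngrams() and words() functions from the NLP package. This is an alternative to the RWeka tokenizer, not a replacement for it, and it assumes a tm version that builds on NLP; older releases may not support it.

    ```r
    library("NLP")
    library("tm")

    data("crude")

    # Split a document into words, form all consecutive word pairs,
    # and join each pair into a single "word1 word2" term.
    BigramTokenizer <- function(x) {
      unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
             use.names = FALSE)
    }

    tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

    # Look at the first few bigram terms.
    head(Terms(tdm))
    ```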