tm - Text Mining Package

tm (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

The tm package offers functionality for managing text documents, abstracts the process of document manipulation and eases the usage of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. An advanced meta data management is implemented for collections of text documents to alleviate the usage of large and with meta data enriched document sets.

The package provides native support for reading in several classic file formats (e.g. plain text, PDFs, or XML files). There is also a plug-in mechanism to handle additional file formats.

The data structures and algorithms can be extended to fit custom demands, since the package is designed in a modular way to enable easy integration of new file formats, readers, transformations and filter operations.

tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, or stopword deletion. Further a generic filter architecture is available in order to filter documents for certain criteria, or perform full text search. The package supports the export from document collections to term-document matrices.

tm is freely available under the GNU General Public License (GPL).