Text Processing with Java

Weka: Data Mining with Java

Weka is a collection of machine learning algorithms for data mining tasks. Weka includes tools for data pre-processing, classification, regression, clustering, association rules and visualization.

MinorThird

MinorThird is an open-source collection of Java classes for storing, categorizing and annotating text, and for learning to extract entities.

MinorThird offers a toolkit of learning methods that are tightly integrated with other tools for annotating text, both manually and programmatically. It also provides tools for visualizing both the training data and the performance of the various classifiers.

SecondString

SecondString is another open-source package from CMU Professor William W. Cohen. It provides a collection of approximate string matching techniques.
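
To give a feel for what approximate string matching means in practice, here is a minimal sketch of one classic technique, Levenshtein edit distance, written in plain Java. It does not use SecondString's own classes or API; the class name `EditDistanceSketch` and its methods are hypothetical names chosen for illustration, and the normalized similarity score is just one simple way to turn a distance into a 0-to-1 value.

```java
// Illustrative only: a hand-written edit distance, NOT SecondString's API.
// The class and method names here are hypothetical.
class EditDistanceSketch {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insert = curr[j - 1] + 1;
                int delete = prev[j] + 1;
                int substitute = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insert, delete), substitute);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(similarity("William W. Cohen", "William Cohen"));
    }
}
```

A score like this could be used to decide whether two slightly different spellings of a name refer to the same entity, for example by treating pairs above some chosen threshold as matches. Packages such as SecondString offer a range of such measures, and the right one depends on the data.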