Websites

Below is a list of websites with information relevant to my Master's project.

For a more dynamic collection of websites that I find useful, try my del.icio.us links page. Note that this page also contains links that are completely irrelevant to this work.


Site: The Enron Email Corpus
Affiliation: MIT/Stanford/Carnegie Mellon
Description: This website hosts the Enron Email Corpus, a collection of email data from about 150 users, mostly senior management of Enron, organized into folders. The data represents the only known substantial corpus of 'real' email (as opposed to synthesized or specifically elicited corpora) that is publicly available.

Site: Professor William W Cohen's site
Affiliation: Carnegie Mellon University
Description: Contains references to a number of annotated email corpora relevant to my proposed topic. It also links to several pieces of software used for statistical classification, machine learning and approximate matching.

Site: MinorThird
Affiliation: SourceForge Project
Description: MinorThird is an open-source collection of Java classes for storing, categorizing and annotating text, and for learning to extract entities. It offers a toolkit of learning methods that are tightly integrated with other tools for annotating text, both manually and programmatically. It also offers tools for visualizing both the training data and the performance of the various classifiers.

Site: Weka
Affiliation: University of Waikato, New Zealand
Description: Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.