Carnegie Mellon University

Preliminary Corpus Analysis

Bryan Klimt and Yiming Yang presented an introductory paper on the contents of the Enron corpus. In it, they describe preliminary data cleansing that yields a cleaned corpus of 200,399 email messages belonging to 158 different users, with an average of 757 messages per user.

A preliminary analysis of the data gives some insight into the characteristics of the dataset:

  • Distribution of messages: The high average number of messages per user (757) is the result of a small number of users having a large number of messages. The dataset does, however, contain users across the full range of mailbox sizes.
  • Folder Categorization: Most users use multiple folders to organize their email. The upper bound on the number of folders per user appears to grow logarithmically with that user's number of messages.
  • Threads: Just over 30,000 threads were detected in the cleaned corpus, consisting of 123,501 email messages. The average thread size is 4.10 messages, while the median is 2 messages, which means the corpus contains many small threads and a few large ones.



    Thread size     2      3     4     5     6    7    8    9    10   (10-20]  20+
    # of threads    16736  4782  3049  1282  879  903  378  214  178  1260     430
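
The thread figures quoted above can be cross-checked against the table with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis. The function names are hypothetical, and the bucket counts and the 123,501 message total are copied from the text. Because the two largest buckets are ranges, the mean is derived from the stated message total rather than from the table itself.

```python
# Thread counts per size bucket, copied from the table above.
# Dict insertion order (Python 3.7+) keeps the buckets sorted by size.
THREAD_COUNTS = {
    "2": 16736, "3": 4782, "4": 3049, "5": 1282, "6": 879, "7": 903,
    "8": 378, "9": 214, "10": 178, "(10-20]": 1260, "20+": 430,
}
MESSAGES_IN_THREADS = 123_501  # total messages in threads, from the text


def total_threads(counts):
    """Sum the thread counts over all size buckets."""
    return sum(counts.values())


def median_bucket(counts):
    """Return the label of the bucket that contains the median thread."""
    midpoint = (total_threads(counts) + 1) / 2
    running = 0
    for label, n in counts.items():
        running += n
        if running >= midpoint:
            return label
    raise ValueError("empty distribution")


def mean_thread_size(total_messages, counts):
    """Average messages per thread, using the stated message total."""
    return total_messages / total_threads(counts)


if __name__ == "__main__":
    print("threads:", total_threads(THREAD_COUNTS))
    print("median bucket:", median_bucket(THREAD_COUNTS))
    print("mean size: %.2f" % mean_thread_size(MESSAGES_IN_THREADS, THREAD_COUNTS))
```

Two-message threads alone account for more than half of all threads, which is why the median sits at 2 while the mean is pulled up to about 4.10 by the long tail.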