Carnegie Mellon University
Preliminary Corpus Analysis
Bryan Klimt and Yiming Yang presented an introductory paper about the
contents of the Enron corpus. In the paper, they present some preliminary work
on data cleansing, which results in a cleaned corpus of 200,399 email messages
belonging to 158 different users, with an average of 757 messages per user.
A preliminary analysis of the data reveals several characteristics of the dataset:
- Distribution of messages: The high average number of messages per
user (757) is the result of a small number of users having a large number of
messages. The dataset does, however, contain users at every level of email
volume.
- Folder Categorization: Most users use multiple folders to organize
their email. The upper bound on the number of folders per user appears to
be logarithmic in the number of messages that user has.
- Threads: Just over 30,000 threads were detected in the cleaned
corpus, consisting of 123,501 email messages. The average thread size is 4.10
messages, while the median is only 2 messages, which indicates many small
threads and a few large ones.
| Thread size (messages) | Number of threads |
| --- | --- |
| 2 | 16736 |
| 3 | 4782 |
| 4 | 3049 |
| 5 | 1282 |
| 6 | 879 |
| 7 | 903 |
| 8 | 378 |
| 9 | 214 |
| 10 | 178 |
| (10-20] | 1260 |
| 20+ | 430 |
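The thread statistics quoted above can be cross-checked against the thread-size counts with a short sketch. This is only an illustrative check, not part of the original analysis: the names `thread_counts`, `total_threads`, `median_bucket`, and `MESSAGES_IN_THREADS` are hypothetical, and the figures are copied from the text and table.

```python
# Thread-size histogram copied from the table above.
# The last two buckets are ranges, so sizes are kept as string labels.
thread_counts = [
    ("2", 16736),
    ("3", 4782),
    ("4", 3049),
    ("5", 1282),
    ("6", 879),
    ("7", 903),
    ("8", 378),
    ("9", 214),
    ("10", 178),
    ("(10-20]", 1260),
    ("20+", 430),
]

# Number of messages that belong to threads, as stated in the text.
MESSAGES_IN_THREADS = 123_501


def total_threads(counts):
    """Sum the number of threads over all size buckets."""
    return sum(n for _, n in counts)


def median_bucket(counts):
    """Return the label of the bucket that contains the median thread."""
    total = total_threads(counts)
    running = 0
    for label, n in counts:
        running += n
        # The median lies in the first bucket whose cumulative count
        # reaches half of all threads.
        if running * 2 >= total:
            return label
    raise ValueError("empty histogram")


total = total_threads(thread_counts)
print(total)                                   # 30091
print(median_bucket(thread_counts))            # 2
print(round(MESSAGES_IN_THREADS / total, 2))   # 4.1
```

The total of 30,091 matches "just over 30,000 threads", the mean matches the reported 4.10 messages per thread, and the median falls in the 2-message bucket because more than half of all threads have exactly two messages.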