MediaDefender Email Corpus: 6600 email messages released
Tuesday September 18th 2007, 11:56 am
Filed under: email,language technology,research,search,technology
Posted by: Andrew Lampert

The internet is buzzing with conversations about the huge email leak from MediaDefender, a company which makes its living selling services and software to prevent illegal content sharing in peer-to-peer networks. I was made aware of this hugely exciting opportunity thanks to the excellent Death By Email blog which provides a good summary of the unfolding drama.

Given its business, MediaDefender is of course not a popular company within the file-sharing community. It thus shouldn’t be surprising that people have been very eager to jump on the more than 6600 company email messages from MediaDefender employees and begin dissecting their content. The emails appear to date from the period between April 2007 and September 2007.

According to Ars Technica, the email was leaked to the public by a group that calls itself MediaDefender-Defenders. In a text file distributed with the email data, the group claims that MediaDefender employee Jay Mairs forwarded all of his company emails to a Gmail account, from where the email data was leaked. “A special thanks to Jay Maris, for circumventing there entire email-security by forwarding all your emails to your gmail account, and using the really highly secure password: blahbob”.

The group’s motivation for releasing the email is also made clear: “By releasing these emails we hope to secure the privacy and personal integrity of all peer-to-peer users. The emails contains information about the various tactics and technical solutions for tracking p2p users, and disrupt p2p services. So here it is; we hope this is enough to create a viable defense to the tactics used by these companies …”

As someone whose first use of BitTorrent was to download this email corpus, my interest in the data is purely academic – is this another corpus we could use for email research? Conveniently, the MediaDefender email data is released in mbox format, which is a welcome change from the image-based PDF files (created by scanning printed email messages!) that have been released in recent US court cases. Because mbox stores each message in full, the data retains all of the header information, which makes it well suited to research purposes.
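To give a sense of how little work mbox data needs before you can start exploring it, here is a minimal sketch using the mailbox module from Python's standard library. It counts the messages in an mbox file and tallies the From headers. The function name summarize_mbox and the file name mediadefender.mbox are my own placeholders for illustration, not names taken from the release.

```python
import mailbox
from collections import Counter


def summarize_mbox(path):
    """Return (message_count, Counter of From headers) for an mbox file.

    `path` is a hypothetical location of an mbox file on disk.
    """
    # create=False makes a missing file an error instead of silently creating it
    box = mailbox.mbox(path, create=False)
    try:
        senders = Counter()
        count = 0
        for msg in box:
            count += 1
            # Headers are read by name; str() guards against non-string header objects
            senders[str(msg.get("From", "(unknown)"))] += 1
        return count, senders
    finally:
        box.close()


if __name__ == "__main__":
    import sys

    # Usage (file name is an assumption): python summarize.py mediadefender.mbox
    if len(sys.argv) > 1:
        total, by_sender = summarize_mbox(sys.argv[1])
        print(total, "messages")
        for sender, n in by_sender.most_common(10):
            print(n, sender)
```

The same loop gives access to any other header (Date, To, Subject, Message-ID and so on) via msg.get, which is exactly what the scanned PDF releases make so painful.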

The (insurmountable?) problem with using this data for research is, of course, the fact that the email was not legally obtained. So, is there any way we could get ethics approval for publishing experiments using this data? It seems very doubtful to me, but I’d be curious to hear your thoughts.


1 Comment so far

Nice post. Wanted to ask you about your Zebra email segmentation tool http://zebra.thoughtlets.org/zebra.php. How is it doing? Tried to contact you via email but the message couldn’t be delivered.

Comment by Sergey 12.03.11 @ 9:18 pm


