oooh … first attempted comment spam
Wednesday June 29th 2005, 3:09 pm
Filed under: technology
Posted by: Andrew Lampert

I’m sure I just passed another of my (online) life’s milestones: I just moderated my first attempted comment spam for this blog. Don’t understand at all what the point of the posts was, as they just contained random URLs (which didn’t even exist). I can only assume they were testing the waters before posting something with real live spammy links?



Another Short Paddle
Sunday June 26th 2005, 5:37 pm
Filed under: kayak, outdoor, uni
Posted by: Andrew Lampert

Made sure we had another short paddle on the Lane Cove River this morning. We intended to paddle as far as we could upstream; unfortunately, this was trivially easy, since we couldn’t get any further than the Steakhouse at Fullers Bridge due to the Parramatta Epping to Chatswood Rail Link tunnel works, which have blocked the river. Bugger.

So we ended up paddling up and back between Fullers Bridge and Magdala Park. We both felt pretty sluggish this morning, so we only paddled for about an hour. It rained for much of the paddle, but that didn’t make it any less enjoyable. In fact the sound of the rain on the river was quite soothing.

Have been trying to focus on writing my dynamic time warping speech/speaker recognizer in R this afternoon. Hrm… it’s turning out to be trickier than I thought, but that’s largely because I don’t know R as well as I should. I know exactly what I want to do conceptually – it’s just a matter of working out a smart way to do it in R (like finding out how kmeans clustering works, to avoid having to reinvent the wheel). Back to the grind-stone now. Wish me luck!
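
For anyone curious what the dynamic time warping part boils down to, here is a rough sketch. It's in Python rather than R, and it is not my actual recognizer: the function names `dtw_distance` and `classify` are made up for illustration, and each "utterance" is assumed to be a plain list of numbers, whereas a real system would compare frames of acoustic feature vectors. The idea is simply to fill in a table of cheapest alignment costs and then pick the stored template closest to the new sample.

```python
import math


def dtw_distance(a, b):
    """Return the dynamic time warping cost between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between the two frames (absolute difference here;
            # an assumption, since real features would be vectors).
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(sample, templates):
    """Return the label of the template with the smallest DTW cost to sample."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))
```

Calling `classify(sample, {"up": [1, 2, 3], "down": [3, 2, 1]})` picks whichever stored template warps onto the sample most cheaply, so the same word spoken a bit faster or slower should still land on the right label.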



Information Engineering Strategy
Tuesday June 21st 2005, 7:34 pm
Filed under: csiro
Posted by: Andrew Lampert

Yet another tiring day in Canberra. Another 6am start and early morning breakfast in the Qantas Club (ahh, the perks of travelling with Cécile ;-) ). A worthwhile meeting though – setting strategic direction for our Information Engineering Lab. The really nice thing from my point of view is that our future direction aligns very nicely with where I want to focus for my Masters thesis and PhD. I’m not sure that the details are supposed to be public yet, so I won’t go into what our excellent plans are!! Suffice to say that there are plenty of real problems to solve and lots of exciting research work in store for us all.



Kayaking the Lane Cove River
Sunday June 19th 2005, 4:38 pm
Filed under: kayak, outdoor
Posted by: Andrew Lampert

Just got back (ok it was actually a few hours ago!) from 2 hours of solid paddling on the Lane Cove River. We had a nice tail-wind for most of the paddle from Chatswood/North Ryde through to Riverview, from where we had an unexpectedly great view of the Harbour Bridge and city skyline. Of course, given the downstream tail-wind, we had a monster head-wind paddling back upstream, but surprisingly we managed to make the return trip in almost exactly the same time as the downstream journey. I think at least part of that was because we refrained from stopping too often, since we started drifting backwards every time we stopped paddling to rest.

I’m not sure exactly how far the paddle was, but a quick guesstimate would be something like 12-15km for the whole trip. Not bad for a Sunday morning! Fear our tanky kayaking muscles!!



Enron Email Corpus – Are you using it?
Thursday June 09th 2005, 5:34 pm
Filed under: email, language technology
Posted by: Andrew Lampert

As I just mentioned in my last post, I’m trying to set up a useful resource site for people using the Enron Email Corpus.

The Enron corpus is completely unparalleled in terms of email datasets that can be used for research purposes. It is more extensive than any other research-friendly email corpus (that I know of) by several orders of magnitude. Many people in Natural Language Processing, Machine Learning and a bunch of other fields have realised this, and have started to analyse the corpus as the basis of a number of different research programs. These range from investigations into social networks and organisational communication to data mining and text classification tasks. Quite a range of research has already been published, though most of it is fairly preliminary at this stage.

Unfortunately, despite such widespread interest, the community using the Enron corpus seems to be very fragmented, with many researchers seemingly unaware of how others are using the corpus. This could result in much wasted effort if different research groups duplicate each other’s work, especially in terms of data markup and cleansing, which are both huge tasks given the size and inconsistencies of the corpus.

The main motivation for me to create yet another website is to pull together all the known work happening with the Enron Corpus, and to encourage users to share data and knowledge about the corpus. I have also set up an Enron Corpus discussion list for exactly this purpose.

If you’re working with (or thinking of using) the Enron dataset, why not join the discussion list? If you know anyone who is using the Enron corpus, point them over to the Enron Corpus Mailing List and encourage them to join.



Finally got my website published
Thursday June 09th 2005, 5:29 pm
Filed under: email, java, language technology, uni
Posted by: Andrew Lampert

So it’s far from finished, but I’ve finally bitten the bullet and published my new site to the web. It’s been sitting on my staging/development server for more than two years now, although the current incarnation bears very little resemblance to that old site!

I’ve focussed on two sections at the moment:
- Trying to create a collective resource that documents how people are using the Enron Email Corpus. This is a massive collection of real-world email from Enron that is available for research purposes. (If you’re interested, head on over to the Enron Email Corpus pages)
- Documenting relevant resources for my Masters Project (which will hopefully lead into a PhD), looking at discourse structures and intention in email communication. This is based around email classification (at least partially), and will hopefully make use of the Enron corpus, both for investigating patterns of communication, and for ensuring that the tools produced work on real-world, noisy data.

Still got a mountain of uni work to do (about 15,000 words of essays for Speech Recognition, as well as constructing a speaker/speech recognition system in R; and a formal literature review and research proposal for my research project). One day I’ll feel like I’m actually making progress!