Friday, August 11, 2006

Data Mining

Earlier this week, D came to me and said, "Wanna see something scary? It's in the computer room." Worried that I might have been scooped, I followed him to his big screen and saw many names next to social security numbers. That was scary! Identity theft had already happened to a friend of mine, and the people whose SSNs were sitting out there were clearly in danger. How did D get all this?

D explained that he was only extracting data from the "accidentally" released AOL dataset of 20 million web queries, collected from more than 650 thousand users over a period of three months this spring. Who would have known that people sometimes search for their own SSNs online to make sure nothing bad has happened to their identities, and that such queries would one day be released for academic research and end up in the public domain?

Poking around a little bit, we found more interesting behavior by individual users. For example, a lot of people had to query "google" to find the Google site, and a lot of people queried their own names or those of their family and friends. This was very interesting! In the last few days, mining this 2-gigabyte dataset has become the favorite pastime of our computer gurus D and M. They can pretty much reconstruct a user's personal life from the searches that user did, and figure out what the person does in real life. For example, today they found

1) a cancer patient on a special diet who also had some special "physical" needs
2) a high school student who had to search online to complete his/her biology homework on DNA replication (Oh, my biggest dream when I was a high school student was to have a home phone so that I could discuss homework with my best friend! How technology has changed our lives! To GOOGLE the answers!)
3) more weirdos (I can't remember the rest now)

How shocking that a person's privacy can be so easily revealed by a few search strings. As I said in one of my previous posts, the Internet is not as private as we had thought. You simply leave too many traces.

Looking back at my most recent Google searches, I am glad to see that I am such a simple person. Still, you can see what has been on my mind from the following queries:

aol data release
google analytics
Metropolitan Atlanta Rapid Transit Authority
google suggest
窦唯 摇滚世界杯 (Dou Wei, Rock World Cup)
duke pharmacology

What's going on in your google searches?

1 comment:

allegro said...

Grabbing the first comment on my own post.

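From my searches you can see that I keep up with both technology and gossip.

Also, now you know where I'm going next week, right?

And one more thing: I'm actually not all that interested in Dou Wei's songs. Poor guy!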