CemWEB Research Project
[info]cemcom
Culturally Embedded Computing Group LiveJournal Account
CEmCom has a LiveJournal presence!

Thanks to Phoebe for providing the funds for a 6-month paid membership.

Our work-in-progress website: http://www.infosci.cornell.edu/cemcom/ ... The website contains biographical information about CEmCom members, as well as updates on current projects.

 
[info]cemcom
CemWEB
This semester, I (Eugene Medynskiy) will be working with Phoebe Sengers and Daniel Huttenlocher on the CemWEB project. A brief description of the project, as provided on the LIFE Research funding application, follows. It's a bit formal and uptight, but such are funding applications. As the project gains form and momentum, I will be posting updates, results, and discussions here.

---
During the first leg of the research, we will be concerned with creating software tools to efficiently collect basic information about LiveJournal communities and their members, as well as how membership in communities changes with time. We will also develop mathematical models and algorithms to represent and analyze community structure, relationships, membership, and how these change over time.
Later, we will use these tools to examine three separate topics:
(a) dynamics of community affiliation: when and why do communities experience a significant flux in numbers of people joining or leaving? Can these bursts of activity be tied to "real-world" events?
(b) dynamics of a given community's 'agenda', as tracked by changes to its "Interests" (a list of subjects selected by the community's moderators), posts, and the "Interests" of its members.
(c) metrics of community distance: how can we talk about the relationship between two or more given communities, even if they don't share any members or "Interests"? Is it possible to group communities into common cliques? We will try to apply these findings to create a "community browser" that LiveJournal users may use to find new communities that they might want to join, or conversely to find and watch communities with, say, opposing political views.
---
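
One possible reading of topic (c), sketched in Python: treat two communities as linked whenever they share at least one member, then measure how many such links separate any two communities. This is only an illustration under stated assumptions, not the metric the project settled on. The data shape (a dict of community name to member set), the function names, and the sample communities are all hypothetical.

```python
from collections import defaultdict, deque


def build_community_graph(memberships):
    """Link two communities whenever at least one user belongs to both.

    `memberships` maps community name -> set of member names
    (a hypothetical data shape chosen for this sketch).
    """
    communities_by_user = defaultdict(set)
    for community, members in memberships.items():
        for user in members:
            communities_by_user[user].add(community)

    graph = {community: set() for community in memberships}
    for communities in communities_by_user.values():
        for community in communities:
            graph[community].update(communities - {community})
    return graph


def hop_distance(graph, start, goal):
    """Breadth-first search: number of co-membership links between two communities."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # no chain of shared members connects the two


# Made-up sample data: "knitting" and "woodworking" share no members,
# but both overlap with "crafts".
memberships = {
    "knitting": {"ann", "bob"},
    "crafts": {"bob", "cat"},
    "woodworking": {"cat", "dan"},
    "chess": {"eve"},
}
graph = build_community_graph(memberships)
```

Here `hop_distance(graph, "knitting", "woodworking")` finds a path through "crafts" even though the two communities have no members in common, while "chess" is unreachable and yields `None`. The same idea could be applied to shared "Interests" instead of shared members.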

 
[info]cemcom
LIFE Funding
Dan Huttenlocher and I have submitted a proposal to LIFE Research Funding (a Cornell College of Engineering research funding program) for $1,000. This money would go towards buying a dedicated system to run the LiveJournal crawling/polling code, store the data, and analyze it.

 
[info]cemcom
Request to LiveJournal
A few days ago, I emailed the LiveJournal webmaster about our project, asking for suggestions, help, and easier access to the data we would need.


Dear LJ Webmaster,

My name is Eugene, and I work with the Culturally Embedded Computing
Group at Cornell University. This semester, as an undergraduate
independent research project, I would like to examine the structure of
LJ communities, and how it changes over time. In particular, three broad
topics I would like to look at are:
(a) dynamics of community affiliation: when and why do communities
experience a significant flux in numbers of people joining or leaving?
Can these bursts of activity be tied to "real-world" events?
(b) dynamics of a given community's 'agenda', as tracked by changes to
its "Interests" and the "Interests" of its members.
(c) metrics of community distance: how can we talk about the
relationship between two or more given communities, even if they don't
share any members or "Interests"? How is it possible to automatically
group communities into common cliques?

To help me gather data for this research, it would be great if you could provide me with the following information/services:
(1) a list of all current LJ communities
(2a) access to the full membership lists for all communities (I've been looking at FOAF data for communities, and it seems that for the larger ones, only about 1000 members are listed in the FOAF file, even if the community actually has more members)
(2b) daily changes in community membership -- either through some service you provide, if it is not too much trouble, or just permission to poll every community's membership list daily (I have read the Bot Policy page, and will do my best to minimize impact on your servers).
(3) A daily list of all the newly created communities.
(4) If you guys keep this sort of thing, a log of when members of all current communities joined and left.

I hope I am not asking for too much, and would be grateful for any and all help and suggestions. Also, I can program in Perl, Java, and C, and I know SQL, so if I can help in any way to set these services up, it would be a pleasure to do so.

I have a paid account right now ([info]'cemcom'), and will be using
it to post periodic updates, preliminary results, discussions, etc. All
of the code and algorithms developed for the project will be shared with
the community (there will probably be a wiki for this project at some
point, too).

Just FYI, my two research advisors on this project are Dan Huttenlocher
(dph@cs.cornell.edu ; http://www.cs.cornell.edu/~dph/) and Phoebe
Sengers (sengers@cs.cornell.edu ;
http://www.cs.cornell.edu/people/sengers/).

Thanks much,
Eugene


So far the only response I've gotten is from someone promising to forward the email to parties who might be able to help.

 
[info]cemcom
Possible Cornell Theory Center Resources
[PLEASE NOTE THAT THIS MESSAGE IS PRIVATE AND THUS ONLY VISIBLE TO THIS ACCOUNT, WHEN IT IS LOGGED IN]

I was recently forwarded an email by Phoebe from Bill Arms, a Cornell CS Professor / IS Co-Director.

-------- Original Message --------
Subject: Web Research
Date: Wed, 15 Sep 2004 15:34:39 -0400
From: William Y. Arms


The Computer Science department has recently been awarded an NSF
Research Infrastructure grant to build a petabyte data store for
scientific research, in conjunction with the Theory Center. The
database group has led the project and the PI is Al Demers.

One of the planned areas for research is web data, which we will
obtain from the Internet Archive in San Francisco. We will have the
capacity to think in terms of hundreds of millions or even billions
of pages online, with large memory computers that could store a
complete web graph in memory.

Are you interested in doing research that uses large volumes of web
data online on machines in the Theory Center? If so, we need to know
what data is important to you: large or small snapshots of the web,
which formats, etc. Do you want a single snapshot, or several that
show change over time? How valuable is the web graph without complete
pages?

If you have interest let me know.

Bill
----------------------------------

I have sent Bill a brief description of this project and asked if we could get in on some of their action. Waiting for a reply, or a request for more information.

 
[info]cemcom
LiveJournal (non-)responses
It's been a week since I initially emailed webmaster@livejournal.com asking for help with data collection. Since then I have sent a follow-up email to see if anything has been progressing... So far, the two responses I have gotten are both in the "This message has been forwarded to someone who can help. Sit tight." vein... Hopefully, said helpful person will respond soon, as crawling LiveJournal community structure using only the publicly available resources would be slow, needlessly harsh on the LJ servers, and would probably give incomplete data (e.g. AFAIK it is not possible to get membership lists for communities with more than 1000 members).

The two replies I have received are:

(September 8, 2004): I will forward this to someone who may be able to help.
(September 16, 2004): I show that your last request was forwarded on to someone and they should be contacting you shortly.

For the text of my original email, see this post.

 
[info]cemcom
Related Reading
I recently looked through Preferred Placement: Knowledge Politics on the Web (ed. Richard Rogers, 2000), a book Phoebe let me borrow, as well as Phoebe's big "Embedded World" binder, which holds a huge number of papers on analyzing web structure for meaningful relationships (authorities, hubs, etc.), critiques of current methods and search engines, and some other loosely related material. Below is a list of the relevant sections of the book and interesting papers, broken down by how they might be applied to this research.


Analyzing for related groups of communities or community themes/topics
Bharat, K. & M. Henzinger. "Improved Algorithms for Topic Distillation in a Hyperlinked Environment"

Flake, G. W., S. Lawrence, et al. "Self-Organization and Identification of Web Communities"
Algorithm for identifying web communities by connectivity alone (no textual analysis)

Gibson, D., J. Kleinberg, & P. Raghavan. "Inferring Web Communities from Link Topology" (also cites "Authoritative Sources in a Hyperlinked Environment" by J. Kleinberg)
"Authoritative Sources in a Hyperlinked Environment" is a description of the HITS (hubs and authorities) algorithm. This might be useful for community authority structure analysis too.

Kumar, R., P. Raghavan, et al. "Trawling the web for emerging cyber-communities"

Morris, I. & R. Rogers. "Operating the Internet with Socio-Epistemological Logics" (from Preferred Placement)
Analysis of the Internet as a "Rumour Mill". "Hit economy" vs. "Link economy". Analysis for distilling the 'core' of a debate carried out online. Case study with GM food debate using UK websites.

Pennock, D. M., G. W. Flake, et al. "Winners don't take all: Characterizing the competition for links on the web"
Link distribution and related phenomena among subcategories of pages.

Tajima, K., K. Hatano, et al. "Discovery & Retrieval of Logical Information Units in Web"
Most of the listed algorithms use a single page as a node, but this paper tries to find a better logical unit for web-graph analysis.

Blog/LiveJournal Sociology
Gorny, E. "Russian LiveJournal: National Specifics in the Development of a Virtual Community"
An analysis of the politics and culture behind LiveJournal's Russian community. Includes some interesting comparisons between the community structures of Russian LJers and their Western counterparts.

Community authority structure
Adar, E., L. Zhang, et al. "Implicit Structure & the Dynamics of Blogspace"
Creates an algorithm (iRank) for computing blog authority based on who first published a certain piece of information, and who copied what from whom.

Hindman, M., K. Tsioutsiouliklis, & J. A. Johnson. "'Googlearchy': How a Few Heavily-Linked Sites Dominate Politics on the Web"
Shows that in-link counts of political websites follow a power law (and examines the repercussions this has for political discourse). Authors also use SVMs to separate sites into political and non-political.

Marres, N. & R. Rogers. "Depluralising the Web" (from Preferred Placement)
Analysis of relevance and authority in WWW structure. Authority as in-links from a "thematic community". Closed or open interlinking "issue networks".

Rafiei, D. & A. O. Mendelzon. "What is this webpage known for? Computing web page reputations"
An algorithm for generating a list of topics a particular page is an 'authority' on.

Rogers, R., & N. Marres. "Landscaping Climate Change: A Mapping Technique for Understanding Science and Technology Debates on the World Wide Web"
Analyzes in- and out-link structure between and within .com, .org, and .gov domains, and how this structure affects online political debate.

Temporal structure
Kleinberg, J. "Bursty and Hierarchical Structure in Streams"

Kumar, R., P. Raghavan, et al. "Extracting large-scale knowledge bases from the web"
Develop a model for the evolution of a web graph.

Venolia, G. "A Matter of Life or Death: Modeling Blog Mortality"
Fits a simple model of blog "life and death" to some raw data from LJ stats. Paper provides some nice graphs of LJ new-blog-per-day and new-post-per-day activity.

Visualizing community link structure
Brandes, U. & S. Cornelsen. "Visual Ranking of Link Structures"
Visualization of link structure computed in sync with popular link-structure+ranking algorithms.

Dodge, M. "Mapping the World Wide Web" (from Preferred Placement)
Showcase of various 2D maps used to map information spaces. Topographical (categorical & hierarchical organization by Kohonen self-organizing maps) and News Maps (shaded elevation maps with key phrases/words as peaks).

Freeman, L. "Visualizing Social Networks"
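
To make the hub/authority idea from Kleinberg's "Authoritative Sources in a Hyperlinked Environment" (cited under the Gibson et al. entry above) more concrete, here is a minimal Python sketch of the iterative hubs-and-authorities (HITS) update. It is a simplified illustration, not code from any of the listed papers: the function name, the fixed iteration count, and the toy link graph are assumptions made for this example. Nodes could stand for journals or communities, with an edge meaning "links to" or "lists as a friend".

```python
def hits(links, iterations=50):
    """Compute hub and authority scores by repeated mutual reinforcement.

    `links` maps each node to the set of nodes it points to
    (a hypothetical input format chosen for this sketch).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    hubs = {node: 1.0 for node in nodes}
    auths = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # A node's authority is the sum of the hub scores of nodes pointing at it.
        new_auths = {node: 0.0 for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new_auths[target] += hubs[source]

        # A node's hub score is the sum of the authority scores it points to.
        new_hubs = {
            node: sum(new_auths[target] for target in links.get(node, ()))
            for node in nodes
        }

        # Normalize so the scores do not grow without bound.
        auth_norm = sum(v * v for v in new_auths.values()) ** 0.5 or 1.0
        hub_norm = sum(v * v for v in new_hubs.values()) ** 0.5 or 1.0
        auths = {node: v / auth_norm for node, v in new_auths.items()}
        hubs = {node: v / hub_norm for node, v in new_hubs.items()}

    return hubs, auths


# Toy graph: "a" and "b" both point to "c", and "c" points to "d".
toy_links = {"a": {"c"}, "b": {"c"}, "c": {"d"}}
toy_hubs, toy_auths = hits(toy_links)
```

In the toy graph, "c" ends up with the highest authority score because two nodes point to it, and "a" and "b" end up as the strongest hubs because they point to that authority.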

 
[info]cemcom
No Support From LiveJournal Developers
A "Brad" (Brad Fitzpatrick himself?) has sent an official response to my original request for help from LJ developers on this project. Basically, I was refused any help, for lack of manpower and resources... This is unfortunate, but I'm going to go ahead and mine/crawl LiveJournal myself. We may have to settle on smaller data sets, especially in the beginning, but hopefully this revelation won't be too much of a problem. I'll also keep an eye on for direction.

I have also found this stats dump. It lists interesting data like new members per day since the start of LJ, popular interests and their counts, number of posts per day (this data, unfortunately, ends on 03/11/2003). The number of new members per day is especially interesting, since it can potentially be compared to individual community membership.

 
[info]cemcom
LIFE Funding!
Our LIFE Funding Proposal was accepted, and we were granted $800 to use for this research! Tomorrow I will meet with Dan (and possibly Phoebe) to discuss ways this money can be spent.

 
[info]cemcom
New Related Reading
I have added two new papers to the Related Reading post:

  • Hindman, M., K. Tsioutsiouliklis, & J. A. Johnson. "'Googlearchy': How a Few Heavily-Linked Sites Dominate Politics on the Web"

  • Rogers, R., & N. Marres. "Landscaping Climate Change: A Mapping Technique for Understanding Science and Technology Debates on the World Wide Web"


Both papers discuss the politics of link structure and how it affects online political discourse/debate. See their entries in the Related Reading post for more information.

The next paper I will most likely read is Eugene Gorny's Russian LiveJournal: National specifics in the development of a virtual community.

I also need to get to coding an LJ crawler and setting up a MySQL or PostgreSQL database/schema to store the mined data...

 
[info]cemcom
Russian LiveJournal Community (RLJ)
Added the Russian LiveJournal: National specifics in the development of a virtual community paper by Eugene Gorny to the Related Readings post. There is now a "Blog/LiveJournal sociology" section in that post. The paper is mostly a sociocultural piece, but highlights some interesting structural differences between RLJ and Western LJ.

 
[info]cemcom
Updated LJ Style
I have changed the style of this LiveJournal and added a 'Quick Links' navigation bar to the left.

 
[info]cemcom
LJ Crawl In Progress
At this moment, my laptop at home is crawling LJ for community memberships. This is a test run to see how the software performs, if it has bugs, etc. Unless it crashes, it should run all night, after which I will also be able to tell how much of LJ I can crawl in a reasonable amount of time (in my last correspondence with LJ devs, I was told there are about 203k community journals).

When the bugs are ironed out, I will post the script (a heavy, heavy modification of [info]'emmastrange''s ljbot.pl script found here) and a breakdown of the logic behind the crawl.

Right now all data is being saved in comma-separated text files, and I'm caching all friend-data and user/community-info pages I get. Once the new Dell system arrives, I will import the text files directly into a MySQL database for easy access.
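
The CSV-to-database import described above might look roughly like the following Python sketch. It uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server. The file layout (one `community,member` pair per line), the table name, and the column names are assumptions, since the crawler's actual output format has not been posted.

```python
import csv
import io
import sqlite3

# Hypothetical crawl output: one "community,member" pair per line.
SAMPLE_CSV = """community,member
knitting,ann
knitting,bob
crafts,bob
"""


def import_memberships(conn, csv_file):
    """Load community/member pairs from a CSV file object into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS membership ("
        " community TEXT NOT NULL,"
        " member TEXT NOT NULL,"
        " PRIMARY KEY (community, member))"
    )
    reader = csv.DictReader(csv_file)
    rows = [(row["community"], row["member"]) for row in reader]
    with conn:  # commit on success, roll back on error
        # "INSERT OR IGNORE" is SQLite syntax; it skips pairs already stored.
        conn.executemany(
            "INSERT OR IGNORE INTO membership (community, member) VALUES (?, ?)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = import_memberships(conn, io.StringIO(SAMPLE_CSV))
community_sizes = dict(
    conn.execute("SELECT community, COUNT(*) FROM membership GROUP BY community")
)
```

Keying the table on the (community, member) pair means that re-importing an overlapping crawl does not create duplicate rows. A MySQL version would need the same schema with that server's own connection library and duplicate-handling syntax.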

 
[info]cemcom
Data Collection Update
This weekend my LJ spider gathered data on 4239 communities. In particular, the following information was collected:
(1) Number of members of every checked community
(2) For any checked community with fewer than 500 members, the members of that community (including other communities)
(3) The interests of every checked community
(4) The 'member of' relations of every checked community
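
The collection rules in items (1) through (4) can be sketched as a small breadth-first crawl in Python. This is not the actual spider (which is a modified Perl script); it is a simplified illustration in which `fetch_info` is a hypothetical callable standing in for the page fetching and parsing, and the choice to discover new communities by following 'member of' links is an assumption. The in-memory `FAKE_SITE` data is made up.

```python
from collections import deque

MEMBER_LIST_LIMIT = 500  # threshold from item (2) above


def crawl(seed, fetch_info, max_communities=1000):
    """Breadth-first crawl that records the four kinds of data listed above.

    `fetch_info(name)` is a hypothetical callable returning a dict with the
    keys "member_count", "members", "interests", and "member_of".
    """
    records = {}
    queue = deque([seed])
    queued = {seed}
    while queue and len(records) < max_communities:
        name = queue.popleft()
        info = fetch_info(name)
        record = {
            "member_count": info["member_count"],  # item (1)
            "members": None,                       # item (2), filled in below
            "interests": info["interests"],        # item (3)
            "member_of": info["member_of"],        # item (4)
        }
        # Only keep the member list for communities under the size threshold.
        if info["member_count"] < MEMBER_LIST_LIMIT:
            record["members"] = info["members"]
        records[name] = record

        # Assumption: follow 'member of' links to find more communities.
        for other in info["member_of"]:
            if other not in queued:
                queued.add(other)
                queue.append(other)
    return records


# Made-up data standing in for real LiveJournal pages.
FAKE_SITE = {
    "knitting": {"member_count": 2, "members": ["ann", "bob"],
                 "interests": ["yarn"], "member_of": ["crafts"]},
    "crafts": {"member_count": 1200, "members": [],
               "interests": ["diy"], "member_of": ["knitting"]},
}
records = crawl("knitting", FAKE_SITE.__getitem__)
```

A real crawl would also need to pause between requests and cache fetched pages to limit load on the LJ servers. Because the crawl grows outward from a single seed, the choice of starting community influences which communities are reached first.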

Interestingly, I found that there were only 211075 unique personal-account members in the 4239 communities.

The script I used to gather this data is available from me if you email the address listed in the userinfo for this account... I'll be uploading it somewhere at some future time, but not before I deal with some issues that cropped up during the crawl, such as getting it to be better at saving state if the user interrupts execution... It's pretty usable as is, however.

The data is available by request.

 
[info]cemcom
Using this research to fulfill the College of Engineering's Technical Writing Requirement
I'm going to try to have this research count as my Technical Writing Requirement in the College of Engineering at Cornell, as part of a new program offered to students conducting LIFE-sponsored independent research. Though this may not sound very relevant to the specifics of the research, it is going to dictate how and where I will present results, and thus directions I may focus on.

The idea behind the program is for me to do four different "communication" projects, two dictated by the program and two of my own choosing. After those are completed, my Technical Writing Requirement is met. The great thing is that these projects don't have to be completed by the end of the semester.

A proposal I submitted lists the following projects I would like to undertake:

  • Final Paper -- Dictated by the program. Basically what a lot of my grade for this class will be based on, and will also be sent to the LIFE people (with possible modifications) to justify the execution of this research.

  • Poster -- Dictated by the program. I will create a poster to be exhibited at the 2005 BOOM (Bits On Our Minds) poster session here at Cornell. This poster may later be modified if I want to submit it to other sessions.

  • Electronic Research Journal -- Proposed by me. This page! If this proposal is accepted, the updates on this page will become more comprehensive and I will talk more about the social implications of this research, further directions Phoebe, Dan, and I are working on, my experiences programming and crawling LiveJournal, etc. Basically, I would convert this page into a useful resource for researchers interested in projects similar to this one (whether they involve LiveJournal, online community formation, or some other facet of this research). I would also post script/code updates and copies of the data I mined (which I fully intend to do anyway).

  • Academic Paper OR Verbal Presentation -- Depending on how interesting my results and critical designs are, I will either try to write and submit a paper to some academic conference or journal (maybe CHI2006 or DIS2006 if I'm really good...). If this doesn't quite go through, I will instead opt for giving some sort of verbal presentation on this project. This might be in the form of a basic technical-intro presentation for interested sociology/communication students, a presentation focusing on theory application for CS/IS students, or a general interest presentation for either interested faculty/students or maybe for local high school students interested in cross-disciplinary research.

 
[info]cemcom
Another Crawl
Two days ago I conducted another, much shorter, crawl of LiveJournal, this time starting with the , since last time I started with , which led me towards more liberal communities, at least initially.