cemcom
This semester, I (Eugene Medynskiy) will be working with Phoebe Sengers and Daniel Huttenlocher on the CemWEB project. A brief description of the project, as provided on the LIFE Research funding application, follows. It's a bit formal and uptight, but such are funding applications. As the project gains form and momentum, I will be posting updates, results, and discussions here.
---
During the first leg of the research, we will be concerned with creating software tools to efficiently collect basic information about LiveJournal communities and their members, as well as how membership in communities changes with time. We will also develop mathematical models and algorithms to represent and analyze community structure, relationships, membership, and how these change over time.
Later, we will use these tools to examine three separate topics:
(a) dynamics of community affiliation: when and why do communities experience a significant flux in numbers of people joining or leaving? Can these bursts of activity be tied to "real-world" events?
(b) dynamics of a given community's 'agenda', as tracked by changes to its "Interests" (a list of subjects selected by the community's moderators), posts, and the "Interests" of its members.
(c) metrics of community distance: how can we talk about the relationship between two or more given communities, even if they don't share any members or "Interests"? Is it possible to group communities into common cliques? We will try to apply these findings to create a "community browser" that LiveJournal users may use to find new communities that they might want to join, or conversely to find and watch communities with, say, opposing political views.
---
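To make topic (c) a bit more concrete, here is a minimal sketch of one possible distance metric, not necessarily the one this project will end up using. It assumes two communities are "linked" whenever they share at least one member, and measures distance as the number of hops between them, so communities with no members in common can still be related through intermediaries. The community names, member names, and function names are all made up for illustration.

```python
from collections import deque


def shared_member_graph(memberships):
    """Link two communities whenever they have at least one member in common."""
    names = list(memberships)
    graph = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if memberships[a] & memberships[b]:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def hop_distance(graph, start, goal):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical communities and members, purely for illustration.
memberships = {
    "knitting": {"ann", "bob"},
    "crafts": {"bob", "cat"},
    "woodworking": {"cat", "dan"},
    "politics": {"eve"},
}
graph = shared_member_graph(memberships)
print(hop_distance(graph, "knitting", "woodworking"))
print(hop_distance(graph, "knitting", "politics"))
```

Here "knitting" and "woodworking" share no members but are two hops apart via "crafts", while "politics" is unreachable. A real metric would probably also weight links by how many members overlap.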




A few days ago, I emailed the LiveJournal webmaster about our project, asking for suggestions, help, and easier access to the data we would need.
Dear LJ Webmaster,
My name is Eugene, and I work with the Culturally Embedded Computing
Group at Cornell University. This semester, as an undergraduate
independent research project, I would like to examine the structure of
LJ communities, and how it changes over time. In particular, three broad
topics I would like to look at are:
(a) dynamics of community affiliation: when and why do communities
experience a significant flux in numbers of people joining or leaving?
Can these bursts of activity be tied to "real-world" events?
(b) dynamics of a given community's 'agenda', as tracked by changes to
its "Interests" and the "Interests" of its members.
(c) metrics of community distance: how can we talk about the
relationship between two or more given communities, even if they don't
share any members or "Interests"? How is it possible to automatically
group communities into common cliques?
To help me gather data for this research, it would be great if you could provide me with the following information/services:
(1) a list of all current LJ communities
(2a) access to the full membership lists for all communities (I've been looking at FOAF data for communities, and it seems that for the larger ones, only about 1000 members are listed in the FOAF file, even if the community actually has more members)
(2b) daily changes in community membership -- either through some service you provide, if it is not too much trouble, or just permission to poll every community's membership list daily (I have read the Bot Policy page, and will do my best to minimize impact on your servers).
(3) A daily list of all the newly created communities.
(4) If you guys keep this sort of thing, a log of when members of all current communities joined and left.
I hope I am not asking for too much, and would be grateful for any and all help and suggestions. Also, I can program in PERL, Java, and C, and I know SQL, so if I can help in any way to set these services up, it would be a pleasure to do so.
I have a paid account right now ('cemcom'), and will be using
it to post periodic updates, preliminary results, discussions, etc. All
of the code and algorithms developed for the project will be shared with
the community (there will probably be a wiki for this project at some
point, too).
Just FYI, my two research advisors on this project are Dan Huttenlocher
(dph@cs.cornell.edu ; http://www.cs.cornell.edu/~dph/) and Phoebe
Sengers (sengers@cs.cornell.edu ;
http://www.cs.cornell.edu/people/sengers/).
Thanks much,
Eugene
So far the only response I've gotten is from someone promising to forward the email to parties who might be able to help.
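If daily polling of membership lists (item 2b in the email) turns out to be the only option, the joins and leaves can be recovered by diffing consecutive snapshots. Below is a minimal sketch of that idea, assuming each day's membership list has already been fetched and stored as a set of usernames keyed by an ISO date string. The function name, the data layout, and the sample members are assumptions made for illustration.

```python
def membership_changes(snapshots):
    """Given {iso_date: set_of_members}, return joins and leaves per day."""
    days = sorted(snapshots)  # ISO dates sort chronologically as strings
    changes = []
    for previous, current in zip(days, days[1:]):
        before, after = snapshots[previous], snapshots[current]
        changes.append({
            "date": current,
            "joined": sorted(after - before),
            "left": sorted(before - after),
        })
    return changes


# Hypothetical snapshots of one community's membership list.
snapshots = {
    "2004-09-20": {"ann", "bob", "cat"},
    "2004-09-21": {"ann", "cat", "dan"},
    "2004-09-22": {"ann", "cat", "dan", "eve"},
}
for change in membership_changes(snapshots):
    print(change)
```

One limitation worth noting: anyone who joins and leaves between two polls never shows up in the diff, which is part of why a join/leave log from LJ (item 4) would be preferable.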


[PLEASE NOTE THAT THIS MESSAGE IS PRIVATE AND THUS ONLY VISIBLE TO THIS ACCOUNT, WHEN IT IS LOGGED IN]
I was recently forwarded an email by Phoebe from Bill Arms, a Cornell CS Professor / IS Co-Director.
-------- Original Message --------
Subject: Web Research
Date: Wed, 15 Sep 2004 15:34:39 -0400
From: William Y. Arms
The Computer Science department has recently been awarded an NSF
Research Infrastructure grant to build a petabyte data store for
scientific research, in conjunction with the Theory Center. The
database group has led the project and the PI is Al Demers.
One of the planned areas for research is web data, which we will
obtain from the Internet Archive in San Francisco. We will have the
capacity to think in terms of hundreds of millions or even billions
of pages online, with large memory computers that could store a
complete web graph in memory.
Are you interested in doing research that uses large volumes of web
data online on machines in the Theory Center? If so, we need to know
what data is important to you: large or small snapshots of the web,
which formats, etc. Do you want a single snapshot, or several that
show change over time? How valuable is the web graph without complete
pages?
If you have interest let me know.
Bill
----------------------------------
I have sent Bill a brief description of this project and asked if we could get in on some of their action. Waiting for a reply, or a request for more information.




I recently looked through Preferred Placement: Knowledge Politics on the Web (ed. Richard Rogers, 2000), a book Phoebe let me borrow, and through Phoebe's big "Embedded World" binder. The binder holds a large number of papers on analyzing web structure for meaningful relationships (authorities, hubs, etc.), critiques of current methods and search engines, and some other loosely related material. Below is a list of the relevant sections of the book and the interesting papers, grouped by how they might be applied to this research.
Analyzing for related groups of communities or community themes/topics
Bharat, K. & M. Henzinger. "Improved Algorithms for Topic Distillation in a Hyperlinked Environment"
Flake, G. W., S. Lawrence, et al. "Self-Organization and Identification of Web Communities"
Algorithm for identifying web communities by connectivity alone (no textual analysis)
Gibson, D., J. Kleinberg, & P. Raghavan. "Inferring Web Communities from Link Topology" (also cites "Authoritative Sources in a Hyperlinked Environment" by J. Kleinberg)
"Authoritative Sources in a Hyperlinked Environment" describes the HITS (hubs-and-authorities) algorithm. This might be useful for community authority structure analysis too.
Kumar, R., P. Raghavan, et al. "Trawling the web for emerging cyber-communities"
Morris, I. & R. Rogers. "Operating the Internet with Socio-Epistemological Logics" (from Preferred Placement)
Analysis of the Internet as a "Rumour Mill". "Hit economy" vs. "Link economy" Analysis for distilling the 'core' of a debate carried out online. Case study with GM food debate using UK websites.
Pennock, D. M., G. W. Flake, et al. "Winners don't take all: Characterizing the competition for links on the web"
Link distribution and related phenomena among subcategories of pages.
Tajima, K., K. Hatano, et al. "Discovery & Retrieval of Logical Information Units in Web"
Most of the listed algorithms use a single page as a node, but this paper tries to find a better logical unit for web-graph analysis.
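Since several of these papers build on hubs and authorities, here is a minimal sketch of the iteration at the core of Kleinberg's "Authoritative Sources in a Hyperlinked Environment". A good authority is pointed to by good hubs, and a good hub points to good authorities; the two scores are updated in turn and normalised until they settle. The node names, the fixed iteration count, and the idea of treating LJ journals or communities as nodes are assumptions for illustration, not something taken from the papers above.

```python
import math


def hits(links, iterations=50):
    """links maps each node to the set of nodes it points to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of nodes linking in.
        auths = {n: 0.0 for n in nodes}
        for source, targets in links.items():
            for target in targets:
                auths[target] += hubs[source]
        # Hub score: sum of the authority scores of nodes linked to.
        hubs = {n: sum(auths[t] for t in links.get(n, ())) for n in nodes}
        # Normalise both score vectors to unit length.
        for scores in (auths, hubs):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hubs, auths


# Hypothetical link graph: two journals both point at one community.
links = {
    "journal_a": {"community_x"},
    "journal_b": {"community_x"},
    "community_x": set(),
}
hubs, auths = hits(links)
print(max(auths, key=auths.get))
```

In this toy graph "community_x" ends up as the only authority and the two journals are equally good hubs. A fixed iteration count is used for simplicity; a more careful version would stop once the scores change by less than some tolerance.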
Blog/LiveJournal Sociology
Gorny, E. "Russian LiveJournal: National Specifics in the Development of a Virtual Community"
An analysis of the politics and culture behind LiveJournal's Russian community. Includes some interesting comparisons between the community structures of Russian LJers and those of their Western counterparts.
Community authority structure
Adar, E., L. Zhang, et al. "Implicit Structure & the Dynamics of Blogspace"
Creates an algorithm (iRank) for computing blog authority based on who first published a certain piece of information, and who copied what from whom.
Hindman, M., K. Tsioutsiouliklis, & J. A. Johnson. "'Googlearchy': How a Few Heavily-Linked Sites Dominate Politics on the Web"
Shows that in-link counts of political websites follow a power law (and examines the repercussions this has for political discourse). Authors also use SVMs to separate sites into political and non-political.
Marres, N. & R. Rogers. "Depluralising the Web" (from Preferred Placement)
Analysis of relevance and authority in WWW structure. Authority as in-links from a "thematic community". Closed or open interlinking "issue networks".
Rafiei, D. & A. O. Mendelzon. "What is this webpage known for? Computing web page reputations"
An algorithm for generating a list of topics a particular page is an 'authority' on.
Rogers, R. & N. Marres. "Landscaping Climate Change: A Mapping Technique for Understanding Science and Technology Debates on the World Wide Web"
Analyzes in- and out-link structure between and within .com, .org, and .gov domains, and how this structure affects online political debate.
Temporal structure
Kleinberg, J. "Bursty and Hierarchical Structure in Streams"
Kumar, R., P. Raghavan, et al. "Extracting large-scale knowledge bases from the web"
Develops a model for the evolution of a web graph.
Venolia, G. "A Matter of Life or Death: Modeling Blog Mortality"
Fits a simple model of blog "life and death" to some raw data from LJ stats. Paper provides some nice graphs of LJ new-blog-per-day and new-post-per-day activity.
Visualizing community link structure
Brandes, U. & S. Cornelsen. "Visual Ranking of Link Structures"
Visualization of link structure computed in sync with popular link-structure+ranking algorithms.
Dodge, M. "Mapping the World Wide Web" (from Preferred Placement)
Showcase of various 2D maps used to map information spaces. Topographical (categorical & hierarchical organization by Kohonen self-organizing maps) and News Maps (shaded elevation maps with key phrases/words as peaks).
Freeman, L. "Visualizing Social Networks"


A "Brad" (Brad Fitzpatrick himself?) has sent an official response to my original request for help from LJ developers on this project. Basically, I was refused any help, for lack of manpower and resources... This is unfortunate, but I'm going to go ahead and mine/crawl LiveJournal myself. We may have to settle for smaller data sets, especially in the beginning, but hopefully this setback won't be too much of a problem. I'll also keep an eye on for direction.
I have also found this stats dump. It lists interesting data like new members per day since the start of LJ, popular interests and their counts, number of posts per day (this data, unfortunately, ends on 03/11/2003). The number of new members per day is especially interesting, since it can potentially be compared to individual community membership.
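One way the site-wide numbers might be compared to an individual community, sketched under some assumptions: divide the community's daily joins by the site-wide new members for that day, and flag days when that share is well above the community's typical (median) share. That way a day when all of LJ is growing does not look like a burst for the community. The function name, the factor of 3, and all the numbers below are made up for illustration and are not taken from the stats dump.

```python
from statistics import median


def unusual_days(site_new_members, community_joins, factor=3.0):
    """Flag days when a community's share of new LJ members spikes."""
    shares = {}
    for day, joins in community_joins.items():
        total = site_new_members.get(day)
        if total:  # skip days with missing or zero site-wide counts
            shares[day] = joins / total
    if not shares:
        return []
    baseline = median(shares.values())
    return sorted(
        day for day, share in shares.items()
        if baseline > 0 and share > factor * baseline
    )


# Hypothetical daily counts, purely for illustration.
site_new_members = {
    "2004-09-01": 1000,
    "2004-09-02": 1000,
    "2004-09-03": 2000,
    "2004-09-04": 1000,
}
community_joins = {
    "2004-09-01": 5,
    "2004-09-02": 5,
    "2004-09-03": 10,
    "2004-09-04": 40,
}
print(unusual_days(site_new_members, community_joins))
```

On 2004-09-03 the community's joins double, but so does the whole site, so only 2004-09-04 is flagged.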














I'm going to try to have this research count as my Technical Writing Requirement in the College of Engineering at Cornell, as part of a new program offered to students conducting LIFE-sponsored independent research. Though this may not sound very relevant to the specifics of the research, it is going to dictate how and where I will present results, and thus directions I may focus on.
The idea behind the program is for me to do four different "communication" projects, two dictated by the program and two of my own choosing. After those are completed, my Technical Writing Requirement is met. The great thing is that these projects don't have to be completed by the end of the semester.
A proposal I submitted lists the following projects I would like to undertake:
- Final Paper -- Dictated by the program. Basically what a lot of my grade for this class will be based on, and will also be sent to the LIFE people (with possible modifications) to justify the execution of this research.
- Poster -- Dictated by the program. I will create a poster to be exhibited at the 2005 BOOM (Bits On Our Minds) poster session here at Cornell. This poster may later be modified if I want to submit it to other sessions.
- Electronic Research Journal -- Proposed by me. This page! If this proposal is accepted, the updates on this page will become more comprehensive and I will talk more about the social implications of this research, further directions Phoebe, Dan, and I are working on, my experiences programming and crawling LiveJournal, etc. Basically, I would convert this page into a useful resource for researchers interested in projects similar to this one (whether they involve LiveJournal, online community formation, or some other facet of this research). I would also post script/code updates and copies of the data I mined (which I fully intend to do anyway).
- Academic Paper OR Verbal Presentation -- Depending on how interesting my results and critical designs are, I will either try to write and submit a paper to some academic conference or journal (maybe CHI2006 or DIS2006 if I'm really good...). If this doesn't quite go through, I will instead opt for giving some sort of verbal presentation on this project. This might be in the form of a basic technical-intro presentation for interested sociology/communication students, a presentation focusing on theory application for CS/IS students, or a general interest presentation for either interested faculty/students or maybe for local high school students interested in cross-disciplinary research.

