Methodology
Definitions
For the purposes of this study, a blog is considered to be a regularly updated personal website with posts that appear in reverse chronological order. Community sites like Slashdot and MetaFilter are included in the crawl; however, sites that serve as the home page for a major blogging tool (MovableType.org, Blogger.com) are not included.
For the purposes of our study, an active weblog is a site that has over 500 bytes of textual content (about a hundred words) and has been updated within the last 90 days.
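As a rough illustration, the two conditions can be written as a single check. This is a minimal sketch in Python, not the crawler's actual code (which is written in Perl). It assumes the text has already been extracted from the HTML, that bytes are counted in UTF-8, and that the last update time is known; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

MIN_TEXT_BYTES = 500            # "over 500 bytes of textual content"
MAX_AGE = timedelta(days=90)    # "updated within the last 90 days"


def is_active(text: str, last_updated: datetime, now: datetime) -> bool:
    """Return True if a site meets both conditions for an active weblog."""
    # Assumption: size is measured on the UTF-8 encoding of the extracted text.
    enough_text = len(text.encode("utf-8")) > MIN_TEXT_BYTES
    recently_updated = now - last_updated <= MAX_AGE
    return enough_text and recently_updated
```

A site with exactly 500 bytes of text fails the check, since the definition asks for over 500 bytes.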
Finding weblogs
We find weblogs by crawling the web: starting at one site and following all of its outbound links to see whether any of the linked sites are weblogs. We also seed our crawl queue with lists of known weblog URLs. These known URLs come to us from a variety of sources:
- Private lists from established databases
Several maintainers of large projects have been kind enough to provide us with lists of known weblog URLs. You can find their names on the acknowledgements page.
- Publicly available URL lists
Several blogging tools maintain the equivalent of a donor's page or comprehensive directory of known sites using the tool. We mine these directories to seed the queue. Here is a list of directory sites pillaged to date:
- Blog.pl search engine
- Movable Type donor list
- Greymatter donor list
- Cafelog 'powered by' page
- pMachine users list
- Pivot user list
- MonBlogue user list
- Swiss Blogs list
- Antville site list
- Skyblog site list
- Malaysia Central blog list
- Nucleus user list
- Blogalia directory
- Tenbit.pl home page
- G-Blog members list
- Seznam Czech topic blogs, personal blogs
- Weblog Recently Updated Sites
There are several weblog update sites now in existence. These sites typically provide an XML feed of recently changed weblogs. We monitor the following update sites for changes, typically on an hourly basis:
- Weblogs.com
- Blogger update page
- Blogdir updates page
- Weblogues.com (French)
- Weblogchecker.de (German)
- Skyblog
- Individual submissions
Anyone can submit a blog URL using the form on our homepage. These submissions are added to the crawl queue on a daily basis.
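The seeded crawl described in this section can be sketched as a queue of URLs that grows as outbound links are discovered. The Python below is an illustration only: it assumes a breadth-first visiting order, which the text does not specify, and it takes a hypothetical `fetch_links` function as a parameter instead of doing real network requests.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 1000) -> List[str]:
    """Visit URLs starting from the seeds and return them in visit order."""
    queue = deque()
    seen = set()
    # Seed the queue with known URLs, skipping duplicates.
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        # Follow every outbound link that has not been queued before.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Seeds from directories, update sites, and user submissions would all enter through the same `seeds` argument; each visited page would then be passed to the blog determination step.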
Blog Determination
How do we know that a site is a weblog? Our policy is to err on the side of false negatives - that is, we'd rather miss some real weblogs than improperly include a non-blog site. All sites in the database are marked with a 'certainty level', depending on how sure we are of their status.
A site will be marked as a weblog if it meets any of the following criteria, in order of precedence:
- The URL came from a recognized weblog update site, weblog database, or a user submission.
- The URL matches a known blog hosting provider (BlogSpot, pitas, weblogger.com.br, joueb.com...)
- The page HTML contains a "generator" META tag identifying a blogging tool
- The page HTML contains a button, logo, or "Powered by" link associated with a blogging engine
- The page HTML contains some idiosyncratic code that can serve to positively ID the blogging tool (for example, default comment JavaScript for Movable Type)
- The site contains at least five uses of the word "blog"
- The site has an RSS feed
Our blog identification code is open source and available on the Comprehensive Perl Archive Network (CPAN) as WWW::Blog::Identify.
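To make the order of precedence concrete, here is a simplified Python sketch of such a check. It is not the code of WWW::Blog::Identify and does not reflect that module's interface. The host list and regular expressions are illustrative assumptions (for instance, `pitas.com` is assumed as the domain for pitas), the generator test accepts any generator tag without checking which tool it names, and the tool-specific "idiosyncratic code" test is omitted.

```python
import re
from typing import Optional

# Illustrative values only; a real list would be much longer.
KNOWN_HOSTS = ("blogspot.com", "pitas.com", "weblogger.com.br", "joueb.com")
GENERATOR_RE = re.compile(r'<meta[^>]+name=["\']generator["\']', re.I)
POWERED_BY_RE = re.compile(r'powered\s+by', re.I)
RSS_RE = re.compile(r'<link[^>]+type=["\']application/rss\+xml["\']', re.I)


def identify(url: str, html: str,
             from_trusted_source: bool = False) -> Optional[str]:
    """Return the name of the first criterion that matches, or None."""
    # 1. URL came from an update site, weblog database, or user submission.
    if from_trusted_source:
        return "trusted source"
    # 2. URL belongs to a known blog hosting provider.
    host = re.sub(r'^https?://', '', url.lower()).split('/')[0]
    if any(host == h or host.endswith('.' + h) for h in KNOWN_HOSTS):
        return "hosting provider"
    # 3. A generator META tag is present (tool name not checked here).
    if GENERATOR_RE.search(html):
        return "generator meta tag"
    # 4. A "Powered by" marker is present.
    if POWERED_BY_RE.search(html):
        return "powered-by link"
    # 5. The word "blog" appears at least five times.
    if len(re.findall(r'blog', html, re.I)) >= 5:
        return "keyword count"
    # 6. The page advertises an RSS feed.
    if RSS_RE.search(html):
        return "rss feed"
    return None
```

Because the checks run in a fixed order and return on the first match, the returned label also records which criterion was responsible, which is useful when assigning a certainty level.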
Blogs are stored with a confidence value attached. The highest value is for blogs confirmed as such by a human user (i.e., myself). The next highest is for user-submitted blogs, then blogs from update sites, then blogs detected by the Perl module, and finally sites rejected by all methods.
If a site is marked as a blog because of one set of criteria, and subsequently appears in a set of URLs with a higher confidence level, its status is upgraded. For example, if the crawler guesses that a site is a Manila weblog based on a GIF in the HTML, and that site then appears on the weblogs.com update list, its confidence level is bumped up accordingly. So the best way to make sure you are included in our crawl, short of filling out the online form, is to ping weblogs.com.
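The upgrade rule amounts to keeping the highest confidence level ever seen for a URL. The Python sketch below shows one way to express that; the level names, their numeric values, and the dictionary standing in for the database are all hypothetical, chosen only to mirror the ordering given above.

```python
from enum import IntEnum


class Confidence(IntEnum):
    REJECTED = 0          # rejected by all methods
    MODULE_DETECTED = 1   # detected by the identification module
    UPDATE_SITE = 2       # seen on a weblog update site
    USER_SUBMITTED = 3    # submitted through the online form
    HUMAN_CONFIRMED = 4   # confirmed by a human


def record(db: dict, url: str, level: Confidence) -> Confidence:
    """Store a site's confidence level, never lowering an existing value."""
    current = db.get(url, Confidence.REJECTED)
    db[url] = max(current, level)
    return db[url]
```

With this rule, a site first detected by the module and later seen on an update list moves up to the update-site level, and a later module detection does not move it back down.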
Language Identification
We do language identification using a program adapted from TextCat, which analyzes trigram (three-letter) patterns in site text. Blogs with fewer than 500 bytes of text (as determined by HTML::TreeBuilder) are ignored.
Please note that we ignore any language metadata in the HTML markup itself, since many non-English templates claim to be English anyway.
Bilingual blogs are likely to be misidentified, since the analysis assigns a single language to each site. I'm thinking about how to fix this.
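For readers unfamiliar with trigram-based identification, the general idea can be sketched as follows. This Python example is a simplified illustration and is not the adapted TextCat program used here: it builds a ranked list of the most frequent three-character sequences for each language and picks the language whose ranking is closest to the document's. The profile size of 300 and the rank-difference distance are assumptions made for the sketch.

```python
from collections import Counter
from typing import Dict, List


def trigram_profile(text: str, size: int = 300) -> List[str]:
    """Return the text's most frequent character trigrams, most common first."""
    cleaned = " ".join(text.lower().split())
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:size]]


def distance(doc: List[str], lang: List[str]) -> int:
    """Sum of rank differences; unseen trigrams get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang)}
    max_penalty = len(lang)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc))


def guess_language(text: str, profiles: Dict[str, List[str]]) -> str:
    """Return the name of the language profile closest to the text."""
    doc = trigram_profile(text)
    return min(profiles, key=lambda name: distance(doc, profiles[name]))
```

In practice each language profile would be built from a large sample of text in that language. Since the function returns exactly one label per document, a page that mixes two languages ends up with whichever profile happens to be closer overall.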
Software
Crawl data is stored in a MySQL database. The crawler is written in Perl. Everything runs on a Linux server.
Contact
Please address all questions and comments to Aaron Coburn.
Updated 06-13-2003
