daypop weblog
Yes, I realize most of the links on the Top 40 are for the Blogging Network. All blogs from there must have had a bunch of common links introduced on their pages. This is much like what happens when Blogspot creates a common link on all their pages. As soon as I get home from work, this will get fixed.
I used Daypop to search for Daypop in RSS Headlines for the first time today and came across this page:
http://www.apmforum.com/news-feeds/iraq-conflict.php
This is the first page that I know of that's using Daypop's RSS output to create a custom news feed. Daypop has been able to output RSS for any of its search pages for a while now and it's good to see someone using it and publishing the results.
Here's a little known secret: append &sum=desc to any search URL and you'll get the first paragraph of the article as the summary text.
http://www.daypop.com/search?q=iraq&t=n&sum=desc
One step further, append &o=rss to output RSS and you'll get your own personalized news headlines in your favorite news aggregator!
http://www.daypop.com/search?q=iraq&sum=desc&t=n&o=rss
To test out the RSS output, Daypop has an RSS Viewer.
http://www.daypop.com/rssview?url=http%3A%2F%2Fwww.daypop.com%2Fsearch%3Fq%3Diraq%26sum%3Ddesc%26t%3Dn%26o%3Drss&rss=View+RSS
The easiest way to get at the RSS for search results is to click on the XML button at the top of every search page.
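For anyone who'd rather build these URLs in a script than by hand, here's a minimal sketch in Python. The parameter names (q, t, sum, o, url, rss) are the ones in the URLs above; the function names and defaults are made up for illustration, and reading t=n as the news setting is an assumption.

```python
from urllib.parse import urlencode

SEARCH_URL = "http://www.daypop.com/search"
RSS_VIEWER_URL = "http://www.daypop.com/rssview"


def search_url(query, kind="n", summaries=False, rss=False):
    """Build a search URL using the query-string parameters shown above.

    kind is passed through as t= (assumed: "n" for news, "w" for weblogs).
    """
    params = {"q": query, "t": kind}
    if summaries:
        params["sum"] = "desc"  # first paragraph of the article as summary
    if rss:
        params["o"] = "rss"     # output RSS instead of HTML
    return SEARCH_URL + "?" + urlencode(params)


def rss_viewer_url(feed_url):
    """Wrap a feed URL for the RSS Viewer; urlencode percent-escapes it."""
    return RSS_VIEWER_URL + "?" + urlencode({"url": feed_url, "rss": "View RSS"})


print(search_url("iraq", summaries=True))
print(rss_viewer_url(search_url("iraq", summaries=True, rss=True)))
```

The parameters may come out in a different order than in the hand-typed URLs, which shouldn't matter for a query string.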
Use this in your news aggregator to monitor what webloggers are saying about the War in Iraq:
http://www.daypop.com/search?q=iraq&t=w&o=rss
The Top Weblogs page just got recalculated and it shows Salam Pax, The Agonist and Command-Post in the Top 100. Salam Pax is #1 in both rankings, but it'll be interesting to see how long he stays up there, as he's probably not on many permanent blogrolls.
mod_gzip update: My inability to load the Top 40 page seems to have been an anomaly, because it works now. I've reactivated mod_gzip. We'll see how it goes for today...
I read dive into mark's comments on Speed Up Your Site and found a little gem about using mod_gzip to compress data going out from your server. Since most clients support gzip encoding, this means Daypop can serve fewer bytes per page.
I looked around the net for mentions of mod_gzip and the information out there is pretty sparse. I installed it anyway and tested it out. The length of the Top 40 page using gzip is 5.5K! That's about an 85-90% reduction. Or rather, it takes about 1/10 the bandwidth that it used to.
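To get a feel for why a link list compresses so well, here's a rough sketch using Python's standard gzip module. This is not mod_gzip itself, just the same gzip format applied to a made-up page; the sample markup is invented, and real pages are less repetitive, so the savings will be lower than this toy shows.

```python
import gzip


def gzip_savings(body: bytes):
    """Return (original size, gzipped size, fraction of bytes saved)."""
    compressed = gzip.compress(body)
    saved = 1 - len(compressed) / len(body)
    return len(body), len(compressed), saved


# Hypothetical stand-in for a page made of many similar table rows.
row = '<tr><td class="rank">1.</td><td><a href="http://example.com/">Example link</a></td></tr>\n'
page = (row * 400).encode("utf-8")

original, compressed, saved = gzip_savings(page)
print(f"{original} bytes -> {compressed} bytes ({saved:.0%} saved)")
```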
Where's the catch?
Well, I've encountered one problem so far. My IE6 won't load the Top 40 page when gzipped!
So the jury is still out on this. More testing is definitely needed.
Update: I can't get my browser to load the Top 40 even after deactivating mod_gzip, and deleting my local cache! I can access the page through the filesystem. The gzip test above still works (it says it loaded the page). What's going on?
Anyone else having problems loading the Top 40?
The donation drive is getting a great response so far from the blogging community with a lot of people offering spare bandwidth, in addition to PayPal or Amazon contributions. It's really good to know that there are users out there willing to help!
I've started to standardize all the pages in Daypop so that they all have a common header and footer, those Nav bars that are everywhere. I had to move the static information pages like About and Search Tips into their own folder so that I could turn on Server Side Includes for that specific folder.
I've also done a little bit of CSS for the Nav bars to shrink the size of the pages. I think it's pretty safe to use CSS now. When Daypop first started, it used super common HTML, to work on the lowest common denominator. I suspect everyone's up to a 5.0 browser by now.
Anyway, the standardization is pretty much complete and I'm back to working hard on new features again.
The News Burst page was out of control. It was just massive and sucked down bandwidth like you wouldn't believe. 10 headlines and news site links per word times 20 words... it was a monster to load and contributed, I bet, to latency when accessing Daypop. If someone started loading the News Burst page, forget about anyone else loading a page for a while.
I've shortened it to 6 headlines per word and no links to news site front pages and I'm hoping that helps.
First, Daypop is exceeding the bandwidth that I've got. I've noticed it's been slow when I access Daypop at work.
Second, there was a bug in the RSS spider that I introduced last week, when I tried to change it. As a result, there are no headline searches right now.
Third, what happened to the word subhonker? Why isn't it in the database? There's most likely some sort of corruption in the database that I'll have to sort out tonight. Update: Ahh, it's actually subhonker7, and the burst algorithm only reads in words composed entirely of letters.
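Here's a small sketch of how a letters-only rule ends up dropping a token like subhonker7. The regular expression and the lowercasing are assumptions for illustration, not the actual Daypop tokenizer.

```python
import re

# A word is a run of letters with a word boundary on both sides, so a
# token containing a digit is skipped whole rather than truncated.
LETTERS_ONLY = re.compile(r"\b[A-Za-z]+\b")


def burst_words(text):
    """Return the lowercased letters-only words found in text."""
    return [word.lower() for word in LETTERS_ONLY.findall(text)]


print(burst_words("Why is subhonker7 bursting"))
```

With this rule neither subhonker7 nor subhonker makes it into the word list, which would explain a lookup coming back empty.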
There doesn't seem to be enough activity on the Top Weblog Posts page. A lot of the bottom of the list is too lightly linked to warrant inclusion on the page. For now, I'm shortening the list to the Top 15 Weblog Posts.
All the pages are a little messed up right now.
I'm letting the Top 40 and Top Posts pages update themselves at their proper time. The next scheduled update is at 12:00 for the Top 40 and 12:30 for the Top News and Top Posts.
The Word Bursts and News Bursts won't be updated until 3:00 and 3:30.
zipcodeworld dot com is spamming the Top 40 by commenting on everyone's blogs and including their URL. Their URL then shows up on the front page of a blog and gets indexed and ranked by Daypop.
The obvious solution is to make all webloggers either blogroll their links or use meta-data to differentiate links that they attribute authority to. I don't like either solution, really, because it would be impossible to get everyone to do it.
Any ideas?
In the early days of the Top 40, a lot of the top links happened to be weblog posts. That was one reason for creating the Top News page -- to filter out anything other than news articles, which didn't have the scores to make the Top 40 page.
It seems like in the last year, the tide has turned and a lot of the Top 40 links have become news articles. How do we keep up with the weblogging conversation that's occurring? I've made a page that filters out everything but weblog posts.
Top 25 Weblog Posts
They're scored the same as Top 40 or Top News links.
I answered a couple questions for Pamela O'Connell about Daypop Word Bursts and it got written up at:
http://nytimes.com/2003/03/13/technology/circuits/13diar.html
Bursting Out
Sites that track what people are linking to or searching for online provide a peek at the Web's collective consciousness. But do those statistics capture all the hot topics?
The study of "word bursts" in Weblogs or e-mail could prove a sort of early-detection system for online trends, according to research by Jon Kleinberg, associate professor of computer science at Cornell University (www.cs.cornell.edu/home/kleinber). Professor Kleinberg's algorithm tracks words that occur with high intensity over a limited period of time - those that are "bursty," not necessarily those that are most common. The goal is to identify "the relationship of topics and time on the Web," he said.
At a recent conference, Professor Kleinberg conjectured that word bursts could be used to track what people were discussing in personal Weblogs. Daypop, a search engine focused on news sites and Weblogs, quickly added a list of top word bursts (www.daypop.com/burst). By comparing this list with Daypop's Top 40 links, one finds topics that are popular but may not depend on a link to a specific site. For example, news articles about the recent suicide of the French chef Bernard Loiseau did not make the list of top links, but his death was a high-ranking topic on the burst list.
The next step, said Daypop's founder, Daniel Chan, is to focus more on phrases and context, not just individual words. Context was sorely lacking, he noted, when the algorithm grouped WABI, the call letters of a television station, with wabi, a Japanese term that connotes a refined simplicity. And I am still trying to figure out what was behind last week's "pancake" burst.
First, the pancake burst. Apparently March 4 is International Pancake Day.
The coverage wasn't as good as I expected. I talked about memeless word bursts and I went into more detail about the need for context and exactly what it meant in terms of polysemy and synonymy.
So, I'm going to reprint some of my answers to her questions here. Italics are her questions.
I've read your weblog entries on why you created this section. Do you think that, so far at least, it is picking up on anchorless memes (to use your phrase) and can you cite any in particular?
I could be wrong on some (or even all of these), but I don’t remember seeing “Horsemen of the Ablogalypse”, the Lexmark suit, French chef Bernard Loiseau’s suicide, or Fred Durst using the word “Agreeance”, on the Top 40. Khalid Shaik, on the other hand, ended up on the Top 40 after showing up on the Word Bursts page. Of course, there were plenty of bursts that were memeless. I think these words are just as important as indicators of our collective consciousness – words like “perilous” and “concessions” are important because at this point in time the possibility of war looms large. #9 and #18 from:
http://www.daypop.com/burst/archive/2003/02/20030228090001.htm
Word bursts seem to represent the first attempt to catalog not just what links are being featured most in weblogs, but what people are really talking about. I tend to think of this as the missing piece (the piece that wasn't covered necessarily by search terms and links) in terms of tracking the collective consciousness on the web -- do you agree? Or are there other pieces still missing?
Good question. I think that what people are _really_ talking about is determined by context. See answer below.
What was your thinking in applying the concept to news -- and how many newspapers does daypop cover?
Daypop covers approximately 1,000 news sites.
I applied the concept to news because it was relatively simple to implement (extension of Word Bursts) and it gives a picture of what news items are on the rise.
The first statistical analysis that I did on news sites was back when Daypop had just launched. That analysis caught phrases that were big at _that moment in time_. It provided a _snapshot_ of what was happening around the world. That algorithm would catch Iraq as an important news story.
Bursts, on the other hand, wouldn’t catch Iraq because Iraq is already so prevalent in the news.
It's funny -- Jon [Kleinberg] was just saying that he thought the next step in word bursts is to combine terms and then I see that you had already started implementing that. What's next, phrases?
Phrases are possibly the next evolution of the burst algorithm. Combining terms right now is done at the simplest level, based on a high percentage of shared pages that the terms occur on. To create phrases, the word clustering algorithm needs to take into account word order.
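As a rough illustration of combining terms by shared pages, here's a minimal Python sketch. The overlap measure (shared pages divided by the smaller page set), the 0.8 threshold, and the greedy grouping are all assumptions made for the example, not the exact method Daypop uses.

```python
def shared_fraction(pages_a, pages_b):
    """Fraction of the smaller page set that also appears in the other."""
    smaller = min(len(pages_a), len(pages_b))
    if smaller == 0:
        return 0.0
    return len(pages_a & pages_b) / smaller


def group_terms(term_pages, threshold=0.8):
    """Greedily group terms whose page sets overlap heavily.

    term_pages maps each term to the set of page ids it occurs on and is
    assumed to be in rank order; each term joins the first group whose
    lead term it overlaps with, else it starts a new group.
    """
    groups = []
    for term, pages in term_pages.items():
        for group in groups:
            if shared_fraction(pages, term_pages[group[0]]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups


example = {
    "oprah": {1, 2, 3, 4},
    "winfrey": {1, 2, 3, 4, 5},
    "pancake": {9, 10},
}
print(group_terms(example))
```

Note that nothing here looks at word order or meaning; it only counts pages in common.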
What I find more exciting is the use of semantic information.
What’s currently missing from the word burst groupings is context. This was evident from the algorithm’s grouping of WABI, a television station call sign, and wabi, a Japanese term. See #9:
http://www.daypop.com/burst/archive/2003/02/20030227150001.htm
If we were to be able to determine the _context_ of any particular word burst, then we’d be able to _separate_ “OPRAH makes the Forbes Billionaires list” from “OPRAH gets a new hair style” (just a made up example). We’d also be able to _group_ “OPRAH makes the Billionaires list” with “WINFREY makes the Billionaires list”.
All Daypop Scores are normalized now such that 1.00 is the average score. I'm working on a method for webloggers to query for their score.
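A minimal sketch of that kind of normalization in Python, assuming it's a plain division by the mean (the raw scores and blog names below are made up):

```python
def normalize_scores(raw_scores):
    """Scale scores so that their average is 1.00."""
    if not raw_scores:
        raise ValueError("need at least one score")
    mean = sum(raw_scores.values()) / len(raw_scores)
    if mean == 0:
        raise ValueError("average score is zero, cannot normalize")
    return {name: score / mean for name, score in raw_scores.items()}


# Hypothetical raw scores for three weblogs.
print(normalize_scores({"blog-a": 2.0, "blog-b": 4.0, "blog-c": 6.0}))
```

A score of 1.50 then reads directly as "50% above the average weblog".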
Dave Aiello at CTDATA wrote about Unearthing Dirt in Weblogs Still a Black Art by Mark Glaser. Mark wants to search weblogs. How do you search weblogs? He wanted to search for Martin Sheen and he didn't want mainstream news articles. So he got on Daypop and searched for "Martin Sheen blog" in order to filter out the news sites! Maybe the pull-down menu on the front page was not a good design choice... There's also the link in the blue box at the right side of the screen to narrow your search! Narrow it to weblogs, news sites, headlines, narrow by language, sort by date and search only titles. There's the advanced search page for everything else. The point is: you can restrict searches to weblogs.
Dave then goes on to say an RSS based search engine for weblogs is needed. Daypop does this too. Check out the pull-down menu! The fourth option is RSS Headlines. These headlines are spidered according to a NewsIsFree changes.xml file.
The only problem with searching RSS headlines is you can't restrict your search to weblog RSS. But that shouldn't even matter considering you can search the full text of weblogs on Daypop (12,500 of them).
How many people out there are still confused about Daypop's function in the weblog world?
Several people suggested that words on the word bursts page be combined if they are related. I must admit, it's something I should have done before even launching the page. I've changed it. The list may not number up to 20 anymore, because some spots have more than 1 word. There are still 20 words total on the list.
I've also read some comments on word bursts written by webloggers and there are some out there who believe there are too many proper nouns, that the algorithm seems to favor proper nouns, while others out there think the list should be all proper nouns, a better indicator of true "memes". All the algorithm does is rank words based on increased usage during the last few days. What words show up on the list is entirely determined by the weblogging community.
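To make "increased usage" concrete, here's a toy Python sketch that ranks words by how much their share of recent text has grown over an earlier baseline. It's a simple ratio with smoothing, chosen for illustration; it is not the actual scoring function behind the page, and the function name, windows and sample counts are invented.

```python
from collections import Counter


def burst_ranking(recent_words, baseline_words, top=20, smoothing=1.0):
    """Rank words by recent usage rate relative to baseline usage rate.

    smoothing keeps words that are absent from the baseline from
    dividing by zero.
    """
    recent = Counter(recent_words)
    baseline = Counter(baseline_words)
    recent_total = sum(recent.values()) or 1
    baseline_total = sum(baseline.values()) or 1

    def score(word):
        recent_rate = recent[word] / recent_total
        baseline_rate = (baseline[word] + smoothing) / (baseline_total + smoothing)
        return recent_rate / baseline_rate

    return sorted(recent, key=score, reverse=True)[:top]


baseline = ["iraq"] * 50 + ["pancake"] * 1 + ["the"] * 49
recent = ["iraq"] * 50 + ["pancake"] * 20 + ["the"] * 30
print(burst_ranking(recent, baseline))
```

A word that is always heavily used scores close to 1 no matter how often it appears, while a rarely used word that suddenly shows up everywhere jumps to the top, proper noun or not.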
I adapted the weblogging word burst feature to construct a Top News Burst page. The algorithm detects word bursts in headlines from all the front pages of English news sites in the Daypop index. The list of articles accompanying each word burst is limited to 10. Headlines that are exact (case sensitive) duplicates are not shown, so it's possible to have very few articles in certain lists, if news sites all carry the same headline.
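The duplicate filtering and the cap of 10 can be sketched like this in Python. The function name is hypothetical, and whether the cap is applied before or after removing duplicates isn't spelled out above, so doing the filtering first is an assumption.

```python
def headlines_for_burst(headlines, limit=10):
    """Keep the first occurrence of each headline, up to limit entries.

    The comparison is exact and case sensitive, so "War looms" and
    "war looms" count as different headlines.
    """
    seen = set()
    unique = []
    for headline in headlines:
        if headline in seen:
            continue
        seen.add(headline)
        unique.append(headline)
        if len(unique) == limit:
            break
    return unique


print(headlines_for_burst(["War looms", "War looms", "war looms", "Talks stall"]))
```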
I list the number of occurrences of each word, but it's important to note that the list is not ranked by occurrences, but rather by a function that also takes into account time of discovery.
The News Burst page gives a good indicator of what the media is just beginning to feed us. You'll notice there aren't any stories about Iraq, which probably commands most of the headlines at any point in time, because there hasn't been an increase in usage.
I'm also thinking about standardizing on the "navigation bar" that's on both word bursts pages.
The first one is actually from Reuters:
http://www.reuters.com/newsArticle.jhtml?type=topNews&storyID=2293758 NEW YORK (Reuters) - Visitors to Daypop, an index of personal journalism sites known as Weblogs, were treated Wednesday to a new feature called "word bursts," an automated attempt to identify the hottest words at the moment.
"They are indicators of what Webloggers are writing about right now," boasts the site, at http://www.daypop.com/burst/.
The "word burst" concept was borrowed from a New Scientist magazine article about a Cornell mathematician who came up with the idea. It has taken on a life of its own, making the featured words popular if only because the Daypop site said so and major Web sites were all pointing to the site for the latest buzz.
It's just the latest example of the power of Weblogs to shape perception among a growing audience of online readers.
http://www.rhetorica.net/ I'm a big fan of Daypop, the news and weblog search engine. They have a new feature I find fascinating. It's called "Top Word Bursts," and it tracks popular words across the blogosphere.
http://www.abe1x.org/blog/ Daypop Top Word Bursts is a pretty cool social indicator.
http://blog.networkcomputing.com/ Although google.com can help you find popular pages, it really is not able to capture the most important memes that are floating around the net from moment to moment. Is such a goal even achievable? I'm not sure. But given the speed at which blogging moves, I have faith in a project put together by the folks at Daypop, called Word Bursts.
This tool attempts to "bubble-up" popular words and phrases that appear throughout blogdom over the past few days. A casual peek reveals some very strange results, such as the word "pans" (as in bread pans). But its ability to see the importance of a single the word is amazing. Take the word "Tristan", as in Tristan Taormino from the Village Voice, who wrote what is apparently a very popular story about recently deceased Great White guitarist Ty Longley. Word Bursts was able to quickly gather together the disparate posts concerning Tristan and his story. Very nice indeed.
http://www.kennethhunt.com/ Yes! All the daypop goodness with word bursty flavor!
http://www.buzzmachine.com/ Very nice work from Daypop on its new Word Burst feature: Here is what blogs are talking about now; here's where the buzz is.
http://www.gamalei.net/syaffolee/ This is the first time that I actually consider some piece of metadata actually interesting. Witness the collective (un)conscious!
http://jdmx.blogspot.com/ Word burst implemented: Last week there was theoretical discussion about measuring frequency changes of particular words in blogs to reveal hot trends. This week Dan Chan of Daypop has already implemented it. I think this is signficant work, because one of the big tasks businesses have these days is extracting info from data, of pulling signal from noise. This wordburst implementation, like the general Daypop and Blogdex link-popularity engines, works on weblogs-as-a-whole... useful because you can scan a range of rated news from left to right (both links are from today's Top 40). But we also need to implement this on a filtered-source basis, so you can tell discussion trends in a given community or business... both the server and the client have roles to play in this type of work. Still, I'm amazed that Dan pulled off this existence proof so quickly, two thumbs up.
http://www.observer-reporter.com/weblog/ Daypop Top Word Bursts tells you what's hot in the world of blogging by listing frequently-used words among blogs. It's a good indicator of what webloggers are writing about right now, which provides a good idea of what current events are of interest.
http://www.memeufacture.com/ Further solidifying its position as the best weblog aggregator, Daypop implements a Word Bursts feature (first.) And a damn good one at that. Daypop is, of course, what inspired the creation of Memeufacture.
Hats off. I dabbled in this with limited success, and the Dan's implementation is as close to perfect as I could have imagined. The issue at hand is what time delta should be used in determining word bursts, Memeufacture's engine doesn't really lend itself to this (my fault.)
http://www.ctdata.com/ It's amazing that Daypop produced this function so quickly. But they had the infrastructure to do it quickly, took a shot at it, and will undoubtedly improve it.