Lucene

I’ve recently been working with Lucene, an open source full text search engine. I hadn’t been looking at search engine technology for a while, but all of a sudden I keep hearing about it in all sorts of different contexts.

It’s a really nice index. It’s amazingly simple to use, and it appears to be blazingly fast, with reasonable performance for both writes and reads of the index.
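
To give a flavor of how simple it is, here’s a minimal sketch of indexing a document and then searching for it. (Illustrative only – class names and signatures vary between Lucene versions, so check the docs for the release you’re using.)

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class LuceneDemo {
    public static void main(String[] args) throws Exception {
        // Index: one Document per item, each with named fields.
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("body", "Lucene is an open source full text search engine."));
        writer.addDocument(doc);
        writer.close();

        // Search: parse a query against the "body" field and count the hits.
        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        Query query = QueryParser.parse("search engine", "body", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " hit(s)");
        searcher.close();
    }
}
```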

My Assistant

OK. Well, this is really silly, but it was kind of fun. I ran across the Microsoft Agent Wizard today. And so, I had to create my own wizard. You can find him here; click on the link which says “Have my 24×7 assistant guide you through this page”. It’s cute. You’ll need to be running IE for it to work. It may take a minute to load. But, hopefully, it is worth the wait!

Building blocks for RDF

If you were going to create the RDF classifieds, there are some RDF building blocks you’d like to have.

Each for-sale item will have:
– A price.
This is a semi-complex item. What is the actual price? What currency is it in?
– Shipping terms
Paid for by seller? How much? Paid for by buyer?
– Category
Presumably, robots will want to know how to categorize this. There is an RDF Taxonomy module which leverages the DMOZ categorization scheme. I hope DMOZ promises never to change the taxonomy? 🙂
– Contact info for seller
Contact him by phone? By email? FOAF is probably the answer for this one.
– Location
Where is the item?

And, these aren’t new ideas.

I wish there were standard building blocks for things like prices and such though.
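
Since none exist yet, here’s a rough sketch of what a structured price block could look like, built with an RDF toolkit like Jena. The vocabulary (namespace and property names) is entirely invented for illustration – the point is just that a price should be a node with an amount and a currency, not a bare string like “$25”.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class PriceBlock {
    // Hypothetical vocabulary for the sketch; no such standard exists today.
    static final String NS = "http://example.org/forsale#";

    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        Property price    = m.createProperty(NS, "price");
        Property amount   = m.createProperty(NS, "amount");
        Property currency = m.createProperty(NS, "currency");
        Property shipping = m.createProperty(NS, "shippingPaidBy");

        // The price is a structured node, not a bare string.
        Resource structuredPrice = m.createResource()   // anonymous node
                .addProperty(amount, "25.00")
                .addProperty(currency, "USD");

        m.createResource("http://example.org/items/1")
                .addProperty(price, structuredPrice)
                .addProperty(shipping, "seller");

        m.write(System.out, "RDF/XML-ABBREV");          // serialize for the feed
    }
}
```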

RSS Classifieds

I’ve been discussing the “RSS classifieds” idea (for lack of a better name) more with a few colleagues. I think the idea has merit.

The basic concept:

A rich RDF format is developed for specifying “I have something to sell”.
People who are selling can create this once, and register with search engines. Search engines look at the item, determine if it’s worth “accepting” into their system, and then post it for sale.

Why do I want this? Can’t I do it on eBay? No, you can’t. eBay, for all its greatness, is a closed system. Fortunately for most sellers, eBay is currently the largest online marketplace, so it’s a safe bet. But if you want to advertise your for-sale item elsewhere, you have to do that manually – reposting your data into each system separately. You probably have to register on each system separately, etc., etc. It’s a real pain in the neck. And, in exchange for you locking in exclusively to eBay, eBay also takes a small fee from you!

OK. So, moving on. What we need to have such a system:

1. A format to specify for-sale items in RDF and RSS
2. Search engines that recognize the format, and a way to make sure that search engines “stay fresh” with the current status of each item (sketched below)
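
For the “stay fresh” part, plain HTTP already gives us most of what we need: the search engine can re-poll each registered feed with a conditional GET and only re-process it when it has actually changed. A sketch, with the feed URL invented:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class FreshnessCheck {
    public static void main(String[] args) throws Exception {
        URL feed = new URL("http://example.org/forsale.rdf");  // hypothetical feed
        long lastFetch = 0L;  // timestamp of our previous successful crawl

        HttpURLConnection conn = (HttpURLConnection) feed.openConnection();
        conn.setIfModifiedSince(lastFetch);  // "only send it if it changed"

        if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
            System.out.println("Item unchanged; keep the cached listing.");
        } else {
            System.out.println("Feed changed; re-parse it and update the listing.");
            // ... read conn.getInputStream() and refresh the stored item ...
        }
    }
}
```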

Well, that’s all it takes to get to phase I.

But, there is more. One really handy thing about eBay is its rating service for users. With RSS, I think this is reproducible in a distributed way. Let’s say Bob and Charlie are about to engage in a transaction, where Bob is selling to Charlie. After the transaction, Bob puts a review of Charlie into his feed, which says “Good”. Charlie puts a review of Bob into his feed, which says “Bad – late with payment”. As Bob and Charlie enter into many transactions over time, each will be reviewed by others many times. These reviews can be found by search engines, and an overall composite score can be generated for each user – in effect replicating what eBay has done.

Of course, as with everything on the web, we’ll have to take some time building anti-spam features. We don’t want Bob to be able to boost his ratings by just creating lots of fake reviews of himself. That is probably solvable with a few heuristics, much like what search engines use today. The harder case is the one where Bob wants to maliciously accuse Charlie of being “Bad”. That may be solvable by using anti-spam techniques and also by allowing Charlie to post his own “review rebuttal” within his own feed.

Lastly, this mechanism has problems with individuals changing their reviews over time. Sure, Bob initially gave Charlie a good review. But after Charlie gives Bob a bad review, Bob goes and changes his review of Charlie to say “Bad”. It may be that a web service is in order here for verifying the authenticity of reviews.
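
To sketch the composite-score piece: a crawler collects every review it finds of a given user across feeds and folds them into one number. The data shape and scoring rule here are invented for illustration – a real system would need the anti-spam weighting described above.

```java
import java.util.List;

public class Reputation {
    // One review found in somebody's feed: who said it, about whom, good or bad.
    record Review(String reviewer, String subject, boolean positive) {}

    // Naive composite: fraction of positive reviews, counting each reviewer
    // at most once so Bob can't stuff the ballot from a single feed.
    static double score(String subject, List<Review> reviews) {
        long total = reviews.stream()
                .filter(r -> r.subject().equals(subject))
                .map(Review::reviewer).distinct().count();
        long positive = reviews.stream()
                .filter(r -> r.subject().equals(subject) && r.positive())
                .map(Review::reviewer).distinct().count();
        return total == 0 ? 0.0 : (double) positive / total;
    }

    public static void main(String[] args) {
        List<Review> found = List.of(
                new Review("bob",   "charlie", true),
                new Review("alice", "charlie", true),
                new Review("dave",  "charlie", false));
        System.out.printf("charlie: %.2f%n", score("charlie", found));  // 0.67
    }
}
```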

One other interesting point is anonymous email. I don’t think blogs will be using open email addresses forever. Once the spam bots figure out how to parse these little gems, we’ll all be spammed in our mailboxes. Plus, we really don’t want the general public to browse and see that Bill Lee is selling his collection of fancy dolls. So each for-sale item that gets posted in RDF/RSS format may want to include contact info which goes through a one-way email anonymizer service. Many (most?) of the classifieds services online today already provide this. Craigslist is a great example.
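
A one-way anonymizer boils down to publishing an alias in the feed while keeping the alias-to-real-address mapping private on the relay. A sketch, with the relay domain and mapping store invented:

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class EmailAnonymizer {
    private final Map<String, String> aliasToReal = new HashMap<>();  // relay-private

    // Derive a stable, opaque alias for a seller+item pair, and remember the
    // mapping so the relay can forward replies to the real address.
    String aliasFor(String realAddress, String itemId) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest((realAddress + "|" + itemId).getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 6; i++) hex.append(String.format("%02x", hash[i]));
        String alias = "sell-" + hex + "@relay.example";  // this goes in the feed
        aliasToReal.put(alias, realAddress);              // this stays on the relay
        return alias;
    }

    public static void main(String[] args) throws Exception {
        EmailAnonymizer relay = new EmailAnonymizer();
        System.out.println(relay.aliasFor("bill.lee@example.com", "fancy-dolls-42"));
    }
}
```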

From what I’ve seen, nobody has really done this so far. (Let me know if you’ve seen otherwise.) I’m not sure why. Maybe there is no money in it. Also, I think the RDF required for this type of thing is substantially more complex than anything in any of the RSS specifications today. Today’s RSS is about as simple as it gets.

Bots

A friend pointed me at a Scientific American article today, titled “Baffling the Bots”. It’s a fun read, I suppose. It sort of credits Yahoo with having pioneered this stuff in 2000. But we totally did this at Remarq in 1998/1999.

The reason we had to do it was that we had tons of images on our site, and people were sending bots to go find them all. This ate up a fair amount of bandwidth. So, we just required users to type in the number shown in the picture before they could view pictures, and they had to redo this every 50 pictures or so. Interestingly, we just displayed a simple 3-digit number. The implementation was cheap – we pre-generated the numbers and actually only ever displayed about 50 different ones. As far as we knew, though, it worked 🙂
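
In code, the scheme really was about this simple. A sketch of the idea (names invented): pick one of a small pool of pre-rendered number images, remember which one the user got, and compare what they type back.

```java
import java.util.Random;

public class NumberChallenge {
    // Pre-generated pool: image file i shows the 3-digit number POOL[i].
    // We only ever rendered a few dozen distinct images, which kept it cheap.
    static final int[] POOL = {417, 902, 138, 555, 271 /* ... ~50 entries ... */};
    static final Random RNG = new Random();

    // Served with the page; the index is stashed in the user's session.
    static int issueChallenge() {
        return RNG.nextInt(POOL.length);  // which image to show
    }

    static boolean verify(int challengeIndex, String typed) {
        return String.valueOf(POOL[challengeIndex]).equals(typed.trim());
    }

    public static void main(String[] args) {
        int idx = issueChallenge();
        System.out.println("Show image #" + idx);
        System.out.println("Correct answer accepted? "
                + verify(idx, String.valueOf(POOL[idx])));
    }
}
```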

RSS Book

I just finished reading the O’Reilly RSS book by Ben Hammersley. It’s a decent book. Published in March of this year, so it’s pretty up to date. I have to admit it’s got me excited about RSS modules.

As I mentioned before, who wants to read this?

I do have an interesting application. I want to sell stuff. I want people to read my for-sale items via my RSS feed. Why? Because I want Google and all the other search engines to help me disseminate my goods. I don’t want to pay eBay 2% and then pay another 2.5% to PayPal (oh wait, that’s eBay too, isn’t it?). Why can’t we all just share this info via RSS? Indexers and robots of the world can parse it out and become our delivery vehicle.

Are Weblogs the same as Usenet?

Usenet never fully grabbed me. I was what they called a ‘lurker’. I would read some things, but never got too involved in any groups. That was the status of most of us out there. As I read more blogs and get to know them better, I realize that it’s the same application, with a new face. The question really is, what’s the difference between Usenet and blogs? The technology is a little different, but the concept is very similar.

Right now, I’m writing to my own moderated newsgroup called alt.belshe.mike. I’ve set it up so that only I can create new threads, and anyone can follow up my threads. And, of course, every Joe in the universe may want to create his own alt.focker.joe newsgroup so he can control his own corner of the web too.

From the User’s perspective
From the user’s perspective, Usenet and blogs are pretty similar. Blogs are a little more ‘open’ than Usenet was, in that anyone can create a new group or start cross-linking to another group. Usenet was more closed, in that there was a process for creating new groups. Weblogs are also a bit more “free”, in that users can post whatever they want to. Usenet newbies in the ’90s were often scolded for putting that awful HTML stuff into Usenet posts. There are no references to “Netiquette” when you start to write your own blog.

But they are a little different too. Usenet had many-to-many authorship: each group could have many authors. Sure, you can do this with blogs, but it isn’t really common practice. Blogs take the approach of each user being more of a broadcaster or publisher. I create the topics, you read them.

Technology
Usenet was a distributed system. Each server decided which newsgroups to pick up and disseminate. Each server could be an origin server for a post, so each server had the capability to create a unique ID for a post (e.g. the Message-ID header). Each server then pushed its content to its peer servers at some interval set up by the system administrator. Thus, each posting made it to many servers around the globe – distribution.
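
The heart of that scheme is deduplication by Message-ID: a server hands each article to its peers, and a peer keeps it only if it hasn’t seen that ID before, so the flood dies out on its own. A toy sketch (not real NNTP, which batches transfers on a schedule):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NewsServer {
    final String name;
    final Set<String> seenIds = new HashSet<>();    // Message-IDs already stored
    final List<NewsServer> peers = new ArrayList<>();

    NewsServer(String name) { this.name = name; }

    // A post arrives, from a user or from a peer. If it's new here, store it
    // and offer it to every peer; duplicates are dropped, so the flood dies out.
    void receive(String messageId, String body) {
        if (!seenIds.add(messageId)) return;        // seen it before; stop
        System.out.println(name + " stored " + messageId);
        for (NewsServer peer : peers) peer.receive(messageId, body);
    }

    public static void main(String[] args) {
        NewsServer a = new NewsServer("A"), b = new NewsServer("B"), c = new NewsServer("C");
        a.peers.add(b);  b.peers.add(a);  b.peers.add(c);  c.peers.add(b);
        a.receive("<1@A>", "hello world");          // reaches A, B, and C exactly once
    }
}
```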

A blog is not a distributed system by itself. It’s just a posting of my content, whatever I want, in whatever format I want it to be in. I can link to other blogs, and other blogs can link to mine, but it’s still not distributed. There is only one copy of my content anywhere. However, the interesting part is that aggregation and search technologies are starting to emerge which create the ‘distributed’ part of blogs. Imagine a world where everyone has their own blog crawler roaming around the net finding interesting blogs. In essence, you’ve created a polling-based distributed system.
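
Here’s the polling counterpart in the same spirit: instead of origin servers pushing, every reader or crawler pulls each blog on a timer. A minimal sketch, with the feed URLs invented:

```java
import java.io.InputStream;
import java.net.URL;
import java.util.List;

public class PollingAggregator {
    // Hypothetical blogs this reader is watching.
    static final List<String> FEEDS = List.of(
            "http://example.org/mike.rdf",
            "http://example.org/joe.rdf");

    public static void main(String[] args) throws Exception {
        while (true) {
            for (String feed : FEEDS) {
                try (InputStream in = new URL(feed).openStream()) {
                    // ... parse the feed and diff it against what we saw last time ...
                    System.out.println("polled " + feed);
                }
            }
            Thread.sleep(60 * 60 * 1000);  // come back in an hour
        }
    }
}
```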

It’s interesting. If I were setting out to design a distributed system, a polling-based system is not one I would create. But is that where this stuff is headed? The web already works this way, more or less. You can post HTML content on the web, and then how is it found? Well, you get linked into a well-known place like DMOZ and wait, or you manually submit your site to a bunch of robots. Then those robots come and crawl your site. In my own little world, I find that robots hit my site more than people do. All so that if someone might *want* to find my site, they will find it when they search for it.

But search engines still don’t really create a distributed system. Sure, Google may cache my content and allow searchers to read my pages without ever visiting my site. But that’s not really distributed; that’s a single replica. Should these crawlers running about be collecting copies of our blogs and regurgitating them in new formats? On one hand, it makes the overall system more robust. It creates copies for everyone to have. But on the other hand, did I just lose control over my content?

So, in this regard, the two systems are different.

If this is so much like Usenet, where is the porn?
This is the real question, of course. But the answer lies in the fact that blogs are more of a publishing system than a conversation/messaging system. With Usenet, the servers were hard to administer and maintain, so only schools and companies could afford to run their own servers. As such, we schemers and scammers out there discovered we had a virtually “free” pipe to share our much-needed porn. With weblogs, if I create a porn-of-the-day blog, it’s my own bandwidth that gets usurped by the hordes of one-handers out there looking for that stuff. And there is one other reason too. Blogs aren’t yet very discoverable. It was all too easy to discover alt.binaries.images.XXX on a news server. The mechanisms for finding a good blog about porn are still very limited.

Blog strengths and weaknesses
Some strengths:

  • Freedom of formatting/creative choice
  • Ease of topic-creation
  • Simple for neophytes
  • Integrates well with web technologies
  • I can control my content more tightly
  • For individual blog reading, no need for a complex server.

Some weaknesses:

  • No central repository for lookups or searches
  • Users broadcast rather than converse on topics
  • Client applications are weak – I just want to know which topics are new since yesterday. There is no inherent way to do this today (a sketch of one approach follows below).
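
That last weakness is fixable once feeds carry dates on their items (RSS 1.0 can do this with Dublin Core’s dc:date, for instance). A sketch of a “new since yesterday” filter, with the parsed-item shape invented:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;

public class WhatsNew {
    record Item(String title, Instant published) {}  // invented parsed-item shape

    public static void main(String[] args) {
        // Pretend these came out of a parsed feed (e.g. from dc:date fields).
        List<Item> items = List.of(
                new Item("RSS Classifieds", Instant.now().minus(2, ChronoUnit.HOURS)),
                new Item("Bots",            Instant.now().minus(3, ChronoUnit.DAYS)));

        Instant yesterday = Instant.now().minus(1, ChronoUnit.DAYS);
        items.stream()
             .filter(i -> i.published().isAfter(yesterday))
             .forEach(i -> System.out.println("NEW: " + i.title()));
    }
}
```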

I ate at Stoddard’s today.