Garbage In/Garbage Out

Lots of programmers have moved from languages that primarily don’t do Garbage Collection to languages that primarily do. In fact, I’m probably a latecomer to using it seriously. Sure, I’ve used some Java on the side over the past few years – enough to be dangerous, at least. But I haven’t used it enough to really care how the GC was working, or even to notice bugs where the GC was masking things for me.

In the good old C++ days, every major programming effort I was involved with employed lots of memory-allocator debugging techniques. We’d wrap malloc/free in macros, override new/delete, run Purify, zeroify memory when it’s deallocated, create safety zones on each side of buffers, and so on. After you’d done it for a while, these techniques served you pretty well, and with very little effort you could debug all your memory usage patterns.

Now, fast forward to the land of Garbage Collection. With the language figuring out for you what you intended to free and not free, you shouldn’t need any of these tools, right? Well, sort of. So far, in my short experience with GC’d languages, it seems pretty common that you need to reference *something* that isn’t written in the GC’d language – for example, Java calling out to C++. In this case, you are passing objects back and forth. Sometimes pointers, sometimes not. But either way, you’ve got references to objects that are not going to be GC’d held by objects that are GC’d. Unless you have a perfectly neat little program that can be 100% Java, you may run into this. And debugging it is a pain!
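Here’s a minimal sketch of the shape of the problem – a GC’d Java wrapper holding a pointer to a C++ object through JNI. All the names here (NativeImage, imagelib, nativeOpen, nativeFree) are hypothetical:

```java
// Hypothetical JNI wrapper: the Java object is garbage collected,
// but the C++ object behind 'handle' is not.
public class NativeImage {
    static {
        System.loadLibrary("imagelib");   // hypothetical native library
    }

    private long handle;                  // pointer to a C++ object, stored as a long

    public NativeImage(String path) {
        handle = nativeOpen(path);        // allocates on the C++ side
    }

    public void close() {
        if (handle != 0) {
            nativeFree(handle);           // must be called explicitly; the GC won't do it
            handle = 0;
        }
    }

    // If the last Java reference is dropped without close(), the GC reclaims this
    // wrapper at some unpredictable time, and the C++ memory simply leaks
    // (or, worse, a stale copy of 'handle' held elsewhere now dangles).

    private native long nativeOpen(String path);
    private native void nativeFree(long handle);
}
```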

Why is it hard to debug? Well, in C and C++, you can employ all sorts of tricks to allocate and deallocate memory differently. But in the GC’d world, once you drop your references to an object, it’s going to get cleaned up eventually. And – you don’t know when! When does the GC run? When does it not run? There’s not much you can do.

Finally, I found one trick which helped a bit. That was to create a simple thread that sits in the background (development mode only) and initiates the GC collection process every second or so. This way, if I’ve got some dangling reference somewhere, the GC will collect the object, and I’ll notice the bug a *lot* sooner than I would have otherwise.
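Roughly, the helper looks like this (a sketch; the GcPoker name is mine, and how it gets wired into the app is up to you). System.gc() is only a hint to the VM, but in practice it’s enough to shake loose objects whose references were dropped:

```java
// Development-only helper: nag the GC every second so that objects whose
// references were dropped get collected (and bugs surface) right away.
public class GcPoker {
    public static void start() {
        Thread poker = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    System.gc();                 // a hint to the VM, not a guarantee
                    try {
                        Thread.sleep(1000);      // once a second is plenty
                    } catch (InterruptedException e) {
                        return;                  // shut down quietly
                    }
                }
            }
        });
        poker.setDaemon(true);   // don't keep the JVM alive just for this
        poker.start();
    }
}
```

Call GcPoker.start() once at startup, and only when running in development mode.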

Anyway, this probably isn’t interesting to most folks, but I found it an interesting problem. I like the benefits of GC – not having to worry about freeing memory. But my stodgy old C++ side really likes understanding exactly when my objects are coming and going. Maybe I’m a control freak.

Lucene

I’ve recently been working with Lucene, an open source full-text search engine. I hadn’t been looking at search engine technology for a while, but all of a sudden I keep hearing about it in all sorts of different contexts.

It’s a really nice index. It’s amazingly simple to use, and it appears to be blazingly fast – reasonable performance for both writes and reads of the index.
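To give a sense of how simple it is, here’s a minimal index-and-search sketch. The field names and file path are made up, and class names have shifted across Lucene releases, so treat this as an approximation against a recent org.apache.lucene API rather than gospel:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/classifieds-index"));

        // Write one document into the index.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();
        doc.add(new TextField("title", "Vintage road bike for sale", Field.Store.YES));
        doc.add(new TextField("body", "Lightly used, new tires, pickup only.", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Search it back out.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query query = new QueryParser("body", new StandardAnalyzer()).parse("tires");
        TopDocs hits = searcher.search(query, 10);
        System.out.println("matches: " + hits.totalHits);
    }
}
```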

My Assistant

OK. Well, this is really silly, but it was kind of fun. I ran across the Microsoft Agent Wizard today, and so I had to create my own wizard. You can find him here; click on the link that says “Have my 24×7 assistant guide you through this page”. It’s cute. You’ll need to be running IE for it to work. It may take a minute to load. But, hopefully, it is worth the wait!

Building blocks for RDF

If you were going to create the RDF classifieds, there are some RDF building blocks you’d like to have.

Each for-sale item will have (see the sketch after this list):
– A price.
This is a semi-complex item. What is the actual price? What currency is it in?
– Shipping terms
Paid for by seller? How much? Paid for by buyer?
– Category
Presumably, robots will want to know how to categorize this. There is an RDF Taxonomy module which leverages the DMOZ categorization scheme. I hope DMOZ promises never to change the taxonomy? 🙂
– Contact info for seller
Contact him by phone? By email? FOAF is probably the answer for this one.
– Location
Where is the item?
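Here is one way those blocks might be stitched together, sketched with the Apache Jena RDF library. The classifieds namespace and property names are entirely invented, and a real format would presumably reuse existing vocabularies (FOAF for the seller, the RDF Taxonomy module for the category) rather than mint new ones:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class ForSaleItemSketch {
    // Invented namespace, purely for the sake of the example.
    static final String CL = "http://example.org/classifieds#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("cl", CL);

        Property price    = model.createProperty(CL, "price");
        Property currency = model.createProperty(CL, "currency");
        Property shipping = model.createProperty(CL, "shippingPaidBy");
        Property category = model.createProperty(CL, "category");
        Property seller   = model.createProperty(CL, "seller");
        Property location = model.createProperty(CL, "location");

        Resource item = model.createResource("http://example.org/items/1234")
            .addProperty(price, "450.00")
            .addProperty(currency, "USD")
            .addProperty(shipping, "buyer")
            .addProperty(category, "Recreation/Cycling")   // would really point into the DMOZ taxonomy
            .addProperty(location, "Seattle, WA");

        // The seller would really be a FOAF description; a bare resource stands in here.
        item.addProperty(seller, model.createResource("http://example.org/people/bob"));

        model.write(System.out, "TURTLE");   // or "RDF/XML"
    }
}
```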

And, these aren’t new ideas.

I wish there were standard building blocks for things like prices, though.

RSS Classifieds

I’ve been discussing the “RSS classifieds” idea (for lack of a better name) some more with a few colleagues. I think the idea has merit.

The basic concept:

A rich RDF format is developed for specifying “I have something to sell”.
People who are selling can create this once and register it with search engines. Search engines look at the item, determine if it’s worth “accepting” into their system, and then post it for sale.

Why do I want this? Can’t I do it on eBay? No, you can’t. eBay, for all its greatness, is a closed system. Fortunately for most sellers, eBay is currently the largest online marketplace, so it’s a safe bet. But if you want to advertise your for-sale item elsewhere, you have to do that manually – reposting your data into each system separately. You probably have to register on each system separately, etc. It’s a real pain in the neck. And, in exchange for locking in exclusively to eBay, eBay also takes a small fee from you!

OK. So, moving on. What we’d need to build such a system:

1. A format to specify for-sale items in RDF and RSS
2. Search engines that recognize the format, plus a way to make sure that search engines “stay fresh” with the current status of each item

Well, that’s all it takes to get to phase I.

But there is more. One really handy thing about eBay is its rating service for users. With RSS, I think this is reproducible in a distributed way. Let’s say Bob and Charlie are about to engage in a transaction, where Bob is selling to Charlie. After the transaction, Bob puts a review of Charlie into his feed which says “Good”. Charlie puts a review into his feed which says Bob was “Bad – late with payment”. As Bob and Charlie enter into many transactions over time, each will be reviewed by others many times. These reviews can be found by search engines, and an overall composite score can be generated for each user – in effect replicating what eBay has done.

Of course, as with everything on the web, we’ll have to take some time building anti-spam features. We don’t want Bob to be able to boost his ratings by just creating lots of fake reviews of himself. That is probably solvable with a few heuristics, much like what search engines use today. The harder case is where Bob wants to maliciously accuse Charlie of being “Bad”. That may be solvable by using anti-spam techniques and also by allowing Charlie to post his own “review rebuttal” within his own feed. Lastly, this mechanism has problems with individuals changing their reviews over time. Sure, Bob initially gave Charlie a good review. But after Charlie gives Bob a bad review, Bob goes and changes his review of Charlie to say “Bad”. It may be that a web service is in order here for verifying the authenticity of reviews.
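To make the composite score concrete, here’s a toy sketch of what an aggregator might compute from reviews it has crawled out of feeds. Everything in it – the Review shape, the scoring rule, skipping self-reviews as a stand-in for real anti-spam heuristics – is invented for illustration:

```java
import java.util.List;

// Toy aggregator: turn reviews crawled out of people's feeds into a
// composite score for one person. All names and rules are illustrative.
public class ReputationSketch {

    static class Review {
        String reviewer;   // URI of the feed the review came from
        String subject;    // URI of the person being reviewed
        boolean positive;  // "Good" vs. "Bad"

        Review(String reviewer, String subject, boolean positive) {
            this.reviewer = reviewer;
            this.subject = subject;
            this.positive = positive;
        }
    }

    // Fraction of positive reviews, skipping self-reviews (a crude stand-in
    // for the anti-spam heuristics a real search engine would need).
    static double compositeScore(String subject, List<Review> crawled) {
        int good = 0, total = 0;
        for (Review r : crawled) {
            if (!r.subject.equals(subject)) continue;     // not about this person
            if (r.reviewer.equals(r.subject)) continue;   // ignore self-promotion
            total++;
            if (r.positive) good++;
        }
        return total == 0 ? 0.0 : (double) good / total;
    }

    public static void main(String[] args) {
        List<Review> crawled = List.of(
            new Review("http://example.org/bob", "http://example.org/charlie", true),
            new Review("http://example.org/dave", "http://example.org/charlie", false),
            new Review("http://example.org/charlie", "http://example.org/charlie", true)  // dropped
        );
        System.out.println(compositeScore("http://example.org/charlie", crawled));  // prints 0.5
    }
}
```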

One other interesting point is anonymous email. I don’t think blogs will be using open email addresses forever. Once the spam bots figure out how to parse these little gems, we’ll all be spammed in our mailboxes. Plus, we really don’t want the general public to browse and see that Bill Lee is selling his collection of fancy dolls. So each for-sale item that gets posted in RDF/RSS format may want to include contact info which goes through a one-way email anonymizer service. Many (most?) of the classifieds services online today already provide this. Craigslist is a great example.

From what I’ve seen, nobody has really done this so far. (Let me know if you’ve seen otherwise.) I’m not sure why. Maybe there is no money in it. Also, I think the RDF for this type of thing is substantially more complex than anything in any of the RSS specifications today. Today’s RSS is about as simple as it gets.