Google Safe Browsing Implementation Requirements

Jun 22, 2007 at 7:07 AM
Edited Jun 22, 2007 at 7:08 AM
Copied from a private email discussion with Thommi and Phil:

Here is a list of requirements that I put together from the specs:
  • A local data format to store list data. We could use a simple text format very similar to what Google returns from the API, or an XML format that makes the local data easier to work with. I prefer XML because it helps us organize list items by version and add or remove them more easily. Any thoughts?
  • A local client factory class to get data from the services. There is already a client in Subkismet, but it only supports POST requests. We can extend it to handle GET requests.
  • A good local synchronizer class that synchronizes local data with the latest list updates on the server. It should also be smart enough to manage the intervals between update requests as described in the Google specs.
  • A class which performs lookups based on Google recommendations and string manipulations.
  • A class to perform RFC2396 canonicalization. Thommi and I had some discussions about this and he sent me some useful links. I still couldn’t find a good built-in solution for this in the .NET Framework (some comments from Microsoft employees on the MSDN documentation for the Uri methods say that these methods don’t work correctly for all characters). There also doesn’t seem to be any third-party component for this purpose. We will probably have to look into building a new working component under an open license in a separate project and use it in Subkismet. Any hints on this?
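
To make the "organize list items by version and add or remove them" idea concrete, here is a minimal sketch of a versioned local list. It is written in Python purely for brevity; the same shape would carry over to a C# class. The names `LocalList` and `apply_update`, the integer version numbers, and the sample entries are all hypothetical and not taken from the Google specs.

```python
from dataclasses import dataclass, field


@dataclass
class LocalList:
    """In-memory copy of one server list, tagged with the version it reflects."""
    name: str
    version: int = 0                      # assumption: versions are comparable integers
    entries: set = field(default_factory=set)

    def apply_update(self, new_version, added=(), removed=()):
        """Apply one incremental update; skip updates that are not newer."""
        if new_version <= self.version:
            return False
        self.entries.difference_update(removed)  # drop entries the server removed
        self.entries.update(added)                # then add the new ones
        self.version = new_version
        return True


# Hypothetical usage: two successive updates to one list.
blacklist = LocalList("example-list")
blacklist.apply_update(1, added={"hash-a", "hash-b"})
blacklist.apply_update(2, added={"hash-c"}, removed={"hash-a"})
```

Whether this ends up serialized as text or XML, keeping the version next to the entries is what lets the synchronizer decide if an update still needs to be applied.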
Jun 22, 2007 at 7:30 AM
Regarding the local data format, let’s consider a provider model for that, since some users won’t be able to write to a file and will want to write to a table instead (even if it’s just a table with a single column and a single row that gets overwritten). If the choice is between a simple text format and XML, I’d choose whichever is easier to program against and more robust for us.
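
A rough sketch of what such a provider model could look like, in Python for brevity rather than as actual Subkismet code. The interface `ListDataProvider`, the two providers, and the table name `list_data` are hypothetical names chosen for illustration; SQLite stands in for whatever database a host application would use.

```python
import sqlite3
from abc import ABC, abstractmethod
from pathlib import Path


class ListDataProvider(ABC):
    """Storage contract: callers never know where the list data lives."""

    @abstractmethod
    def load(self) -> str: ...

    @abstractmethod
    def save(self, data: str) -> None: ...


class FileProvider(ListDataProvider):
    """Stores the serialized list data in a single file."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        return self.path.read_text(encoding="utf-8") if self.path.exists() else ""

    def save(self, data):
        self.path.write_text(data, encoding="utf-8")


class SingleCellTableProvider(ListDataProvider):
    """Stores the data in a one-column, one-row table that gets overwritten."""

    def __init__(self, connection):
        self.conn = connection
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS list_data (payload TEXT NOT NULL)"
        )

    def load(self):
        row = self.conn.execute("SELECT payload FROM list_data").fetchone()
        return row[0] if row else ""

    def save(self, data):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute("DELETE FROM list_data")
            self.conn.execute("INSERT INTO list_data (payload) VALUES (?)", (data,))
```

The payload is an opaque string here, so the text-versus-XML decision stays independent of where the data is stored.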

RFC2396 canonicalization sounds like a pain. I thought this was something that would be solved in .NET 3.0, but of course we can’t rely on that. We live in the here and now, not in the future. ;) Can we do a simple “good enough” implementation based on the Uri class? Otherwise we can blog about it and see if someone has one up their sleeve already.
Jun 22, 2007 at 7:50 AM
A provider model makes sense, since Subkismet is still a framework. Integrating this library into a real application would require implementing a decent persistence service, but we could at least provide a plain vanilla provider backed by an XML file or a SQL database.

The link [1] I sent to Keyvan points to an article on the Mozilla dev site. Among other things, it describes how they canonicalize URLs. Dunno if we need a full-featured RFC2396-compliant component or can just fiddle around with simple string manipulation plus some regex, as described in that article.
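
For a sense of scale, the "string manipulation plus some regex" route might look roughly like the sketch below. It is written in Python for brevity and is not the procedure from the Mozilla article or the Google specs. The function name `canonicalize` and the particular set of normalization steps are my own assumptions about what "good enough" could mean.

```python
import posixpath
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """'Good enough' normalization of an absolute http(s) URL.

    NOT a full RFC 2396 implementation: user info is dropped, IPv6
    literals and relative URLs are not handled, and the query is left as-is.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").rstrip(".")  # hostname is already lower-cased
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    netloc = host if port in (None, DEFAULT_PORTS.get(scheme)) else f"{host}:{port}"

    # Collapse repeated slashes, then resolve "." and ".." segments.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    keep_slash = path.endswith(("/", "/.", "/.."))
    path = posixpath.normpath(path)
    if keep_slash and not path.endswith("/"):
        path += "/"

    # Upper-case the hex digits of percent-escapes so "%2f" and "%2F" match.
    path = re.sub(r"%[0-9a-fA-F]{2}", lambda m: m.group(0).upper(), path)

    # The fragment never reaches the server, so drop it.
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

Even this small version has to make several judgment calls (trailing slashes, default ports, escape casing), which is a fair hint at how much a fully compliant component would involve.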