A recent discussion on the GLASS mailing list presented an interesting problem. Say you have a collection of domain model objects that each have a start date and an stop date, and you want to find all the objects that are within a date range. I was first exposed to this problem 25 years ago in a healthcare system tracking patient admit and discharge dates, but it applies to many common situations (from hotels to video or car rentals).
For this discussion we will consider the following example: from the collection of [A-H] we want to get the subset from begin to end, or [D-F].
Of course, one could iterate over the entire collection and test each object, but this will be inefficient for large collections. At the other extreme would be to have an end-of-day job that create a collection for that date and add items to the day’s collection. This would give very quick lookup, but would have a cost in storage.
A typical attempt to solve this problem would be to create an index on the start date and the stop date and then build a result set of items that start before the end point and stop after the begin point. The problem with this approach is that it likely creates two intermediate results, [A-F] and [D-H], that together are larger than the original collection (before returning the intersection of those intermediate results). In GemStone, this can generally be done in such a way that the intermediate results hold the object IDs (or OOPs), but the objects are not actually read from disk. This is typically a good solution, but others are possible.
Decades ago my good friend, Carl Zimmerman, came up with another approach that uses one index and avoids the large intermediate results. The following is adapted from his design to work with GemStone’s indexing system.
Each object has an ‘indexableDateRange’ instance variable (with an equality index defined) that encodes the duration (in days) and the start point (a Date after 1900) as a SmallInteger. The value is set as follows:
indexableDateRange := start isNil ifTrue: [0] ifFalse: [stop isNil ifTrue: [start asDays] ifFalse: [(stop subtractDate: start) + 1 * 100000 + start asDays]].
Note a few things about this expression. If the value is zero, then the activity has not started. If the value is less than 100000, then the activity is ongoing. If the value is greater than 100000, then the activity has finished. Note that this design handles dates from 1901 to 2173. Handling dates outside that range or other units, such as minutes, is left as an exercise for the reader (or an opportunity for consulting!).
Let’s start by creating some test data. The following creates 1000 objects that have a start date from 2000 through most of 2009 and durations of up to 10 days (or still in progress).
| set random origin | set := UserGlobals at: #'James' put: IdentitySet new. set createEqualityIndexOn: #'key' withLastElementClass: SmallInteger. System commitTransaction. random := Random new. origin := Date newDay: 1 monthNumber: 1 year: 2000. 1000 timesRepeat: [ | start stop key | start := origin addDays: (random integerBetween: 1 and: 3650). stop := start addDays: (random integerBetween: 0 and: 10). stop = start ifTrue: [stop := nil]. key := start isNil ifTrue: [0] ifFalse: [stop isNil ifTrue: [start asDays] ifFalse: [(stop subtractDate: start) + 1 * 100000 + start asDays]]. set add: key -> (Array with: start with: stop). ]. System commitTransaction.
The following demonstrates a “brute-force” search that tests every object:
| start stop set resultSet | set := UserGlobals at: #'James'. start := (Date newDay: 1 monthNumber: 1 year: 2005) asDays. stop := start + 6. resultSet := set select: [:each | each value first asDays <= stop and: [each value last isNil or: [start <= each value last asDays]]. ]. resultSet
The following code demonstrates the Zimmerman search algorithm:
| start stop set resultSet duration | set := UserGlobals at: #'James'. start := (Date newDay: 1 monthNumber: 1 year: 2005) asDays. stop := start + 6. resultSet := IdentitySet new. resultSet addAll: (set select: {:each | each.key <= stop}). "Items that are in-progress" duration := 100000. "start at one-day duration" [true] whileTrue: [ | low high stream | low := duration + start - (duration // 100000). high := duration + stop. resultSet addAll: (set select: {:each | (each.key >= low) & (each.key <= high)}). duration := duration + 100000. stream := set selectAsStream: {:each | each.key >= duration}. stream atEnd ifTrue: [^resultSet]. duration := stream next key // 100000 * 100000. ]. resultSet
This approach starts with all the events that started before the stop point and are still in progress. Then it iterates over the completed items starting with those that have a one-day duration. A range of keys is computed based on the start, stop, and current duration, and the equality index is used to get the matching items for that duration. For domains in which there are a limited number of durations (such as video rentals), there are relatively few intermediate result sets and none have any unneeded objects. Given GemStone’s efficient handling of IdentitySets and intersection operations, I suspect that this is not worth the trouble, but it is an interesting approach. This approach does fault in one extra object for each duration after 1, but since the needed information is actually in the indexing data structures, this could be avoided.
Anyway, it is something to think about (and measure).
1 comment
Comments feed for this article
January 12, 2012 at 12:18 am
Paul DeBruicker
Hi James,
There’s also the Centered Interval Tree. (https://en.wikipedia.org/wiki/Interval_tree). I put one up on squeaksource.com here: http://squeaksource.com/CenteredIntervalTree. Here’s a mailing list message about it: http://forum.world.st/red-black-and-other-trees-tp3899519p3902647.html