I’ve been working with Google App Engine a little lately and thinking about how I might randomly select records (entities) from a larger logical group of related records (a table in RDBMS terminology, or a set of entities of the same kind in datastore terminology). The datastore is not really structured to enable this easily or efficiently out of the box. To be fair, neither are relational databases. Yet there is a range of reasons why you might want to get records at random from a stored set of data. For instance, you might want to take a representative sample in order to:
- perform some statistical modelling;
- allocate records to some testing groups (e.g., for split testing); or
- process changes to the set in chunks that fit within some processing or quota limit.
One of the mantras of App Engine data modelling is to ‘stop worrying about disk space and denormalise’ (yes, there are other reasons to worry about denormalisation, but you are forced to get over those too if you are developing on BigTable).
So, rather than attempt to deal with random selection down the line when I actually need the random records, my approach is to support this functionality up-front in the design of my data models. How? By allocating a set of random numbers to every entity created. Specifically, I’m setting up the following properties on all models (entity kinds) I might conceivably want to sample from in future:
randomnum = db.FloatProperty()
randomnum1000 = db.IntegerProperty() # entities will be randomly allocated to 1 of 1000 bins in this set
randomnum10000 = db.IntegerProperty() # entities will be randomly allocated to 1 of 10000 bins in this set
…And then allocating the random number and associated bins when the entity is first saved to the datastore. Note, you’ll need to import floor from the standard math module, and import the random module.
self.randomnum = random.random()
self.randomnum1000 = int(floor(self.randomnum*1000))
self.randomnum10000 = int(floor(self.randomnum*10000))
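A minimal, self-contained sketch of this allocation step follows. It uses a plain Python class rather than a real datastore model so that it runs anywhere; the class name `SampledRecord` and the method `assign_random_bins` are hypothetical names chosen for illustration, not part of the App Engine API.

```python
import random
from math import floor


class SampledRecord(object):
    """Hypothetical stand-in for a datastore entity (not the App Engine API)."""

    def __init__(self, name):
        self.name = name
        self.randomnum = None
        self.randomnum1000 = None
        self.randomnum10000 = None

    def assign_random_bins(self, rng=random):
        """Allocate the random number and its bins once, on first save."""
        if self.randomnum is not None:
            return  # already allocated; keep the original values
        self.randomnum = rng.random()  # float in [0.0, 1.0)
        self.randomnum1000 = int(floor(self.randomnum * 1000))    # 0..999
        self.randomnum10000 = int(floor(self.randomnum * 10000))  # 0..9999


record = SampledRecord("example")
record.assign_random_bins()
print(record.randomnum, record.randomnum1000, record.randomnum10000)
```

The guard at the top of `assign_random_bins` is an assumption about what "first saved" should mean: re-saving an entity leaves its bins untouched, so it never migrates between bins.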
This provides simple 1/1000 or 1/10000 random selection from entities of the same kind in my datastore: for 1/1000, pick a random integer between 0 and 999 (inclusive) and select all records that have that number in the randomnum1000 property. It should also scale fine, and it will tolerate the deletion of entities, since deletes should be random with respect to the bins. This means each bin will stay roughly the same size relative to the other bins in its set over time.
I’ve kept the full random number in case I ever want to create more bin sets, but other sample sizes can also be accommodated within the bin sets I have. For instance, if I want to select one entity at random, I could first select a bin from the 1/10000 bin set and, once I have its entities back from the datastore, randomly select one entity from the returned bin.
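The selection step might look roughly like the sketch below. It is only an illustration: an in-memory list of dicts stands in for the datastore, `fetch_bin` stands in for an equality filter on the bin property, and all function names (`make_entities`, `fetch_bin`, `sample_bin`, `pick_one`) are hypothetical. Retrying when a chosen bin happens to be empty is also an assumption.

```python
import random
from math import floor


def make_entities(count, rng):
    """Build in-memory stand-ins for entities, each with its bin numbers."""
    entities = []
    for i in range(count):
        r = rng.random()
        entities.append({
            "id": i,
            "randomnum": r,
            "randomnum1000": int(floor(r * 1000)),
            "randomnum10000": int(floor(r * 10000)),
        })
    return entities


def fetch_bin(entities, prop, bin_number):
    """Stand-in for a datastore query with an equality filter on prop."""
    return [e for e in entities if e[prop] == bin_number]


def sample_bin(entities, prop, bin_count, rng):
    """Choose one bin uniformly at random and return everything in it."""
    return fetch_bin(entities, prop, rng.randrange(bin_count))


def pick_one(entities, rng, max_tries=100):
    """Pick a 1/10000 bin, then one entity in it; retry on an empty bin."""
    for _ in range(max_tries):
        members = sample_bin(entities, "randomnum10000", 10000, rng)
        if members:
            return rng.choice(members)
    return None


rng = random.Random(42)
entities = make_entities(50000, rng)
sample = sample_bin(entities, "randomnum1000", 1000, rng)
print(len(sample))  # size of the chosen bin; varies from bin to bin
print(pick_one(entities, rng)["id"])
```

Against the real datastore, only `fetch_bin` would change: it would become a query on the kind, filtered on the bin property.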
Of course, this technique won’t generate perfectly random selections: the random number generator is only pseudo-random, and because bins will differ slightly in size, the bin that an entity is initially allocated to affects its chances of being individually selected from the whole. Nevertheless, it will be close enough for what I can imagine I might want to do with the data.
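To get a feel for the size of the bin effect, here is a small sketch that computes each entity's chance of being picked by "choose a bin, then choose a member". It assumes that empty bins are skipped, so every non-empty bin is equally likely; the function name `selection_probabilities` and the figure of 20000 entities are made up for illustration.

```python
import random
from collections import Counter
from math import floor


def selection_probabilities(bin_numbers):
    """Chance of each entity under 'random non-empty bin, then random member'."""
    sizes = Counter(bin_numbers)   # how many entities landed in each bin
    nonempty = len(sizes)          # bins that hold at least one entity
    return [1.0 / (nonempty * sizes[b]) for b in bin_numbers]


rng = random.Random(0)
bins = [int(floor(rng.random() * 1000)) for _ in range(20000)]
probs = selection_probabilities(bins)

print("uniform chance: %.6f" % (1.0 / len(bins)))
print("lowest chance:  %.6f" % min(probs))
print("highest chance: %.6f" % max(probs))
```

An entity in a sparsely populated bin is more likely to be picked than one in a crowded bin, which is the deviation from uniform selection described above.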
If anyone reading has an alternative solution to random selection from the datastore I’d be really interested to hear it.