Bringing Classifiers Alive in PySAL

I’ve talked a lot to fellow developers about making PySAL objects more than containers for the results of a statistical procedure.

One way I think we can do this is to focus on methods like predict, find, update, or reclassify.

So, here, I’ll show the way I’ve implemented a simple API to update map classifiers by defining their __call__ method.

In [2]:
import pysal as ps

The patch I applied to mapclassify should be in this GitHub branch. To get it, you’ll need to git fetch my repository and check out the reclassify branch. Alternatively, what I added to Map_Classifier is so small that it’s easy to show:

First, I added a call method:

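def __call__(self, *args, **kwargs):
    """
    This will allow the classifier to be called like a
    function *after* instantiation
    """
    # needs `import copy` at the top of the module
    inplace = kwargs.pop('inplace', False)  # copy by default
    new_data = args[0] if args else None  # new data is the optional first argument
    if inplace:
        self._update(new_data, **kwargs)
    else:
        new = copy.deepcopy(self)
        new._update(new_data, **kwargs)
        return new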

This will allow us to do something like:

classifier = ps.Quantiles(data)
classifier(k=4)
classifier(k=9)
classifier(new_data, inplace=True)

and proceed to interact with the classifier object over and over again. Since there’s an inplace toggle (False by default), users can choose when to mutate or when to copy.
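To make the copy-versus-mutate behaviour concrete without needing the patched branch, here is a minimal, standard-library-only sketch of the same pattern. `ToyQuantiles` and `quantile_breaks` are hypothetical names invented for this illustration; they are not PySAL classes, and the break computation is a plain linear interpolation rather than whatever PySAL does internally.

```python
import copy


def quantile_breaks(values, k):
    """Upper bounds of k quantile classes, using linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    breaks = []
    for i in range(1, k + 1):
        pos = (n - 1) * i / k
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        breaks.append(ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo))
    return breaks


class ToyQuantiles:
    """Hypothetical stand-in for a map classifier; not the PySAL class."""

    def __init__(self, y, k=5):
        self.y = list(y)
        self.k = k
        self.bins = quantile_breaks(self.y, k)

    def _update(self, data, **kwargs):
        # Naive updater: pool old and new observations, then re-run __init__.
        # Options that are not passed again fall back to the constructor defaults.
        pooled = self.y if data is None else list(data) + self.y
        self.__init__(pooled, **kwargs)

    def __call__(self, new_data=None, inplace=False, **kwargs):
        if inplace:
            self._update(new_data, **kwargs)
            return None
        new = copy.deepcopy(self)
        new._update(new_data, **kwargs)
        return new


clf = ToyQuantiles(range(1, 101))
thirds = clf(k=3)                         # returns a reclassified copy
print(clf.k, thirds.k)                    # the original still has k=5
clf(range(101, 201), k=4, inplace=True)   # mutates clf and returns None
print(len(clf.y), clf.k)
```

Returning `None` from the in-place branch mirrors the convention of `list.sort()`: a call that mutates hands nothing back, so it is hard to confuse with the copying form.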

In theory, the __call__ method can support all of the different __init__ signatures possible. I’ve defined it this way because most of the mapclassify methods I can think of take a mandatory data argument and optional keyword arguments. The only one that differs is User_Defined, which I overrode so that it is handled correctly.
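The post does not show what the User_Defined override looks like, so the following is only a guess at the shape such an override could take. It assumes a classifier whose constructor requires a second argument, `bins`, and shows a subclass filling that argument in from its current state when the caller omits it. `ToyClassifier` and `ToyUserDefined` are hypothetical names, not PySAL code.

```python
import bisect
import copy


class ToyClassifier:
    """Hypothetical base class with a naive re-run-__init__ updater."""

    def __init__(self, y, bins):
        self.y = list(y)
        self.bins = sorted(bins)
        # class index of each observation: first upper bound >= value
        self.yb = [bisect.bisect_left(self.bins, v) for v in self.y]

    def _update(self, data, *args, **kwargs):
        pooled = self.y if data is None else list(data) + self.y
        self.__init__(pooled, *args, **kwargs)

    def __call__(self, *args, **kwargs):
        new_data = args[0] if args else None
        inplace = kwargs.pop('inplace', False)
        target = self if inplace else copy.deepcopy(self)
        target._update(new_data, *args[1:], **kwargs)
        return None if inplace else target


class ToyUserDefined(ToyClassifier):
    """The constructor requires `bins`, so the updater supplies it if omitted."""

    def _update(self, data, *args, **kwargs):
        if not args and 'bins' not in kwargs:
            kwargs['bins'] = self.bins  # reuse the current breaks
        super()._update(data, *args, **kwargs)


ud = ToyUserDefined([1, 4, 6, 9], bins=[3, 7, 10])
more = ud([2, 8])              # new data, same breaks
print(more.bins, more.yb)
rebinned = ud(bins=[5, 10])    # same data, new breaks
print(rebinned.yb)
```

Without the override, calling the base class with no `bins` would fail with a `TypeError`, because the naive updater passes along only what the caller supplied.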

The main point here is that this enables users to quickly reclassify and view new classifications using the object they created! Thus, a common use case might be something like this:

In [4]:
df = ps.pdio.read_files(ps.examples.get_path('south.dbf'))
In [5]:
df.head()
Out[5]:
   FIPSNO        NAME     STATE_NAME  STATE_FIPS  CNTY_FIPS   FIPS  STFIPS  COFIPS  SOUTH      HR60      BLK90      GI59      GI69      GI79      GI89       FH60  FH70       FH80       FH90  geometry
0   54029     Hancock  West Virginia          54        029  54029      54      29      1  1.682864   2.557262  0.223645  0.295377  0.332251  0.363934   9.981297   7.8   9.785797  12.604552  <pysal.cg.shapes.Polygon object at 0x7fc5495eb…
1   54009      Brooke  West Virginia          54        009  54009      54       9      1  4.607233   0.748370  0.220407  0.318453  0.314165  0.350569  10.929337   8.0  10.214990  11.242293  <pysal.cg.shapes.Polygon object at 0x7fc5495eb…
2   54069        Ohio  West Virginia          54        069  54069      54      69      1  0.974132   3.310334  0.272398  0.358454  0.376963  0.390534  15.621643  12.9  14.716681  17.574021  <pysal.cg.shapes.Polygon object at 0x7fc5495eb…
3   54051    Marshall  West Virginia          54        051  54051      54      51      1  0.876248   0.546097  0.227647  0.319580  0.320953  0.377346  11.962834   8.8   8.803253  13.564159  <pysal.cg.shapes.Polygon object at 0x7fc549565…
4   10003  New Castle       Delaware          10        003  10003      10       3      1  4.228385  16.480294  0.256106  0.329678  0.365830  0.332703  12.035714  10.7  15.169480  16.380903  <pysal.cg.shapes.Polygon object at 0x7fc549565…

5 rows × 70 columns

In [7]:
data = df['HR60'].values
In [8]:
classifier = ps.Quantiles(data)
In [9]:
classifier
Out[9]:
           Quantiles

 Lower            Upper   Count
         x[i] <=  2.497     283
 2.497 < x[i] <=  5.104     282
 5.104 < x[i] <=  7.621     282
 7.621 < x[i] <= 10.981     282
10.981 < x[i] <= 92.937     283

Once estimated, the user can reclassify based on the same API as the constructor:

In [10]:
classifier(k=3)
Out[10]:
           Quantiles

 Lower            Upper   Count
         x[i] <=  4.265     471
 4.265 < x[i] <=  8.679     470
 8.679 < x[i] <= 92.937     471

In [11]:
classifier(k=9)
Out[11]:
           Quantiles

 Lower            Upper   Count
         x[i] <=  0.000     180
 0.000 < x[i] <=  2.836     134
 2.836 < x[i] <=  4.265     157
 4.265 < x[i] <=  5.628     157
 5.628 < x[i] <=  7.137     156
 7.137 < x[i] <=  8.679     157
 8.679 < x[i] <= 10.600     157
10.600 < x[i] <= 13.924     157
13.924 < x[i] <= 92.937     157

It doesn’t mutate the object unless inplace is provided and is true:

In [13]:
classifier
Out[13]:
           Quantiles

 Lower            Upper   Count
         x[i] <=  2.497     283
 2.497 < x[i] <=  5.104     282
 5.104 < x[i] <=  7.621     282
 7.621 < x[i] <= 10.981     282
10.981 < x[i] <= 92.937     283

In [14]:
classifier(k=6, inplace=True)
In [15]:
classifier
Out[15]:
           Quantiles

 Lower            Upper   Count
         x[i] <=  1.993     236
 1.993 < x[i] <=  4.265     235
 4.265 < x[i] <=  6.245     235
 6.245 < x[i] <=  8.679     235
 8.679 < x[i] <= 11.850     235
11.850 < x[i] <= 92.937     236

This also enables users to add new data to the classifier.

Now, I bet there are better updating equations for the different classifiers than re-estimating the entire classifier, just as there are for running-median problems. I anticipate extending this work with updaters more sophisticated than reclassifying the entire set. This is why I split the __call__ method from the method that really does the updating:

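def _update(self, data, *args, **kwargs):
    # assumes `import numpy as np` at the top of the module
    if data is not None:
        # pool the new observations with the ones already classified
        data = np.append(data.flatten(), self.y)
    else:
        data = self.y
    self.__init__(data, *args, **kwargs)  # this is the most naive updater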

As the comment notes, this is the most universally acceptable updater, hence its definition in the Map_Classifier base class. Fortunately, this means that any new classifier defined as a subclass of it gets a very naive in-place reclassification method for free.

Thus, you can do stuff like:

In [17]:
new_data = df['HR90'].values
In [19]:
classifier(new_data)
Out[19]:
           Quantiles

 Lower            Upper   Count
         x[i] <=  3.228     565
 3.228 < x[i] <=  5.912     565
 5.912 < x[i] <=  8.710     564
 8.710 < x[i] <= 12.735     565
12.735 < x[i] <= 92.937     565

In [20]:
classifier(new_data, k=14)
Out[20]:
           Quantiles

 Lower            Upper   Count
         x[i] <=  0.000     296
 0.000 < x[i] <=  2.200     108
 2.200 < x[i] <=  3.469     201
 3.469 < x[i] <=  4.483     202
 4.483 < x[i] <=  5.394     202
 5.394 < x[i] <=  6.282     201
 6.282 < x[i] <=  7.297     202
 7.297 < x[i] <=  8.266     202
 8.266 < x[i] <=  9.348     201
 9.348 < x[i] <= 10.628     202
10.628 < x[i] <= 12.217     202
12.217 < x[i] <= 14.603     201
14.603 < x[i] <= 18.544     202
18.544 < x[i] <= 92.937     202

In [21]:
classifier(new_data, k=6, inplace=True)
In [22]:
classifier
Out[22]:
           Quantiles

 Lower            Upper   Count
         x[i] <=  2.691     471
 2.691 < x[i] <=  5.069     471
 5.069 < x[i] <=  7.297     470
 7.297 < x[i] <=  9.736     471
 9.736 < x[i] <= 13.736     470
13.736 < x[i] <= 92.937     471
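The running-median comparison made earlier hints at what a less naive updater could look like. As an illustration of that general idea only (it is the textbook two-heap technique, not anything in PySAL, and `RunningMedian` is a hypothetical name), a median can absorb each new observation in logarithmic time instead of re-sorting everything:

```python
import heapq


class RunningMedian:
    """Incremental median via two heaps; a generic technique, not PySAL code."""

    def __init__(self):
        self._low = []   # max-heap (values stored negated) for the smaller half
        self._high = []  # min-heap for the larger half

    def add(self, value):
        # Push onto the lower half, then move its largest element up.
        heapq.heappush(self._low, -value)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so the lower half is never smaller than the upper half.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("no data yet")
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2


rm = RunningMedian()
for v in [5, 1, 9, 3]:
    rm.add(v)
print(rm.median())   # median of 1, 3, 5, 9
rm.add(7)
print(rm.median())   # median of 1, 3, 5, 7, 9
```

A classifier-specific `_update` could maintain similar summary structures in a subclass, while the `__call__` interface stays exactly the same.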

So, this is what I mean by “responsive” classes. They should:

  1. support updating/reuse with new data
  2. support augmentation of initial/init-time options/parameters
  3. provide __call__ methods that consistently either update the object or use it.

In map classification, I think __call__ would be better suited to find_bin than update_bins. In spatial regression, I think __call__ would be better suited to predict than something else.
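As a rough sketch of the alternative suggested here, where calling a classifier looks up the class of a value, the following uses only the standard library. `BinFinder` is a hypothetical name, and the breaks are the k=5 upper bounds from the first HR60 classification in this post.

```python
import bisect


class BinFinder:
    """Hypothetical wrapper: calling it returns the class index of a value."""

    def __init__(self, bins):
        self.bins = sorted(bins)  # upper bound of each class

    def find_bin(self, value):
        # Classes are closed on the right, so x <= bins[i] falls in class i.
        # Values above the last bound get len(bins), i.e. "outside all classes".
        return bisect.bisect_left(self.bins, value)

    __call__ = find_bin


finder = BinFinder([2.497, 5.104, 7.621, 10.981, 92.937])
print(finder(1.0), finder(5.104), finder(5.2))
```

With `__call__` reserved for lookup, reclassification would live under an explicitly named method, so each spelling does one thing.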

__call__ should never alias summary() methods, which probably belong in __repr__, anyway.
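A minimal sketch of keeping the summary in __repr__ might look like the following. `BinSummary` is a hypothetical class, and the layout only loosely imitates the tables printed in this post; the breaks and counts are the k=3 results shown earlier.

```python
class BinSummary:
    """Hypothetical classifier result whose summary lives in __repr__."""

    def __init__(self, name, bins, counts):
        self.name = name
        self.bins = list(bins)
        self.counts = list(counts)

    def __repr__(self):
        lines = [self.name]
        lower = None
        for upper, count in zip(self.bins, self.counts):
            if lower is None:
                label = "x[i] <= {:.3f}".format(upper)
            else:
                label = "{:.3f} < x[i] <= {:.3f}".format(lower, upper)
            lines.append("{:<28}{:>5}".format(label, count))
            lower = upper
        return "\n".join(lines)


summary = BinSummary("Quantiles", [4.265, 8.679, 92.937], [471, 470, 471])
print(repr(summary))
```

Because the interactive prompt already shows `repr(obj)`, typing the object’s name is enough to see the summary, which leaves `__call__` free for updating or predicting.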

Originally posted on yetanothergeographer.tumblr.com.

Last modified 2016.03.24