I’ve had to take a break from the spatial hierarchical linear modeling kick I’ve been on recently to get back to some GSOC work.
Today, on my weekly call, my mentors and I discussed a few things.
Testing & Merging of project code
Since many of the improvements I’m making to the library are module-by-module, I was advised to submit PRs when a logical unit of contribution is ready to ship. I’ve already been trying to use Test-Driven Development principles for my project: I write a test of what I want the API changes to look like, then write code to that specification. That makes this relatively simple: a module is ready to submit when its spec tests pass. So, now that much of the initial foray into the labelled array API is done, I can begin to connect the tests & submit PRs where possible.
Appropriate Targets for Labelled Array interfaces
We also discussed how to extend the labelled array interface to other
submodules. For example, I have a good idea how a consistent labelled array
interface could look for Map Classifiers in the exploratory spatial data
analysis module. Other elements of that module should also be relatively
straightforward to implement on labelled arrays, since all that’s generally
needed is input interception: dataframe+column name needs to get correctly
parsed into numpy vectors. This is quite simple. The spatial regression module
also seems like a relatively straightforward place to add a labelled array
interface, using a similar strategy to what I’ve already been doing in
pysal.weights. Defining a from_formula classmethod for spatial regression
modules would allow for specification of regressions using patsy & pandas
dataframes.
But, in other parts of the library, like region or spatial_dynamics, it’s
less clear as to what the labelled array interface should look like, so
I’ll have to gain some perspective there.
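As a rough illustration of the "input interception" idea mentioned above, here is a minimal sketch. Everything in it is hypothetical: the accepts_labelled decorator, the quantile_breaks toy classifier, and the HR90 column name are stand-ins for illustration, not PySAL API. It assumes only that the table supports table['column'] access, as a pandas dataframe does.

```python
import functools

import numpy as np


def accepts_labelled(func):
    """Hypothetical decorator: let func(y, ...) also be called as func(table, 'column', ...)."""
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        # If the first extra argument is a column name, pull that column
        # out of the table so the function only ever sees a numpy vector.
        if args and isinstance(args[0], str):
            column, args = args[0], args[1:]
            data = data[column]
        return func(np.asarray(data), *args, **kwargs)
    return wrapper


@accepts_labelled
def quantile_breaks(y, k=4):
    """Toy stand-in for a map classifier: upper bounds of k quantile classes."""
    return np.percentile(y, np.linspace(0, 100, k + 1)[1:])


# Any object supporting table['column'] works here; a pandas dataframe would
# be the usual case, but a dict of lists keeps the sketch dependency-light.
table = {"HR90": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]}

from_array = quantile_breaks(np.array(table["HR90"]), k=2)
from_label = quantile_breaks(table, "HR90", k=2)
```

Both calls reach the same numeric code path, so the array-based function body does not need to change to gain a labelled interface.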
Remaining Confusion in weights construction
I’m running into some minor confusion because I’m trying to make a call like
from_dataframe(df, idVariable='FIPS') equivalent to from_shapefile(path,
idVariable='FIPS'), and can’t figure out when PySAL considers things ordered
vs. unordered.
For background, a spatial weights object in PySAL encodes the spatial information in a geographic dataset, allowing estimation routines for various spatial statistics or spatial models. In doing this, it relates each observation to every other observation, using information about the spatial relationship between observations. In our library, these are used all over the place.
But, in building a new, abstract interface to the weights constructors, I got
quite confused. Particularly, I was expecting to be able to write a pair of
classmethods, say, Rook.from_shapefile() and Rook.from_dataframe(), that
have similar signatures and generate similar results, so that
from_dataframe(df, idVariable='FIPS') is equivalent to
from_shapefile(path, idVariable='FIPS'). Unfortunately, it’s somewhat
confusing to figure out how to make this work correctly without making the API
inconsistent. This is because PySAL handles ids in weights objects across its
various weights construction functions and classes in different ways. I think,
overall, we expose four different variables or flags at different points in the
API that deal with how observations are indexed in a spatial weights object:
ids - ostensibly, a list of the ids to use corresponding to the input data, considered in almost every weighting function.
idVariable - a column name to get ids from when constructing weights from file, used in existing from_shapefile functions to generate ids.
id_order - a list of indices used to re-index the names contained in ids in an arbitrary order, impossible to set from from_shapefile functions but used in the weights class’s __init__.
id_order_set - a boolean property of the weights object denoting whether id_order has been explicitly set.
To me, this is rather confusing, despite some conversation trying to flesh this out.
First, all lists in Python are ordered. So, when a user passes a list of ids in
as ids, it’s confusing that the order of this list is silently ignored. Second,
when we construct weights from shapefiles using an idVariable, the resulting
weights object has some peculiar properties: the id_order is set to the file
read order, but the id_order_set flag is always False. This is confusing for
a few reasons. First, shapefiles & dbf files are implicitly ordered, so a column
in the dbf should correspond exactly to the order in which shapes are read,
barring data corruption. So, if I use a column of the dataframe to index the
shapefile, this should be considered ordered. Second, our docstring below seems
to imply that either id_order_set is False and id_order defaults to
lexicographic ordering, or id_order_set is True and id_order has special
structure:
id_order : list
An ordered list of ids, defines the order of
observations when iterating over W if not set,
lexicographical ordering is used to iterate and the
id_order_set property will return False. This can be
set after creation by setting the 'id_order' property.
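To make the docstring's description concrete, here is a toy model of the semantics it seems to promise. This is a sketch under my reading of the docstring only: ToyWeights is a hypothetical class, not PySAL's actual weights class, and the neighbor relations are made up for illustration.

```python
class ToyWeights:
    """Toy model of the docstring's semantics; NOT PySAL's actual weights class."""

    def __init__(self, neighbors, id_order=None):
        self.neighbors = neighbors
        if id_order is None:
            # Not set: fall back to lexicographic ordering of the ids.
            self._id_order = sorted(neighbors)
            self._id_order_set = False
        else:
            self._id_order = list(id_order)
            self._id_order_set = True

    @property
    def id_order(self):
        return self._id_order

    @id_order.setter
    def id_order(self, ordered_ids):
        # Setting the order after creation flips the flag to True.
        if set(ordered_ids) != set(self.neighbors):
            raise ValueError("id_order must contain exactly the ids in neighbors")
        self._id_order = list(ordered_ids)
        self._id_order_set = True

    @property
    def id_order_set(self):
        return self._id_order_set

    def __iter__(self):
        # Iterate over observations in id_order.
        for i in self._id_order:
            yield i, self.neighbors[i]


# Made-up neighbor relations among three ids.
neighbors = {"54029": ["54009"], "54009": ["54029", "10003"], "10003": ["54009"]}

w = ToyWeights(neighbors)
default_order = w.id_order      # lexicographic, since nothing was set
default_flag = w.id_order_set   # False

w.id_order = ["54029", "54009", "10003"]  # e.g. a file read order
explicit_flag = w.id_order_set  # True
```

Under this reading, a non-lexicographic id_order should only ever appear together with id_order_set being True.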
But, one can easily generate an example where id_order is not lexicographically ordered and id_order_set is False:
import pysal as ps
Qref = ps.queen_from_shapefile(ps.examples.get_path('south.shp'), idVariable='FIPS')
Qref.id_order_set
False
Qref.id_order
[u'54029',
u'54009',
u'54069',
u'54051',
u'10003',
...]
Qref.id_order_set
False
This is important because, when we construct weights from dataframes, we need to
make a decision about what gets picked as an index and how to treat that index.
Right now, I’ve made the executive decision to choose consistency in behavior,
so that from_dataframe(df, idVariable='POLYGON_ID') will consider that column
to be ordered from the start. This means that the resulting weights will have
the same iteration order as weights from from_shapefile(filepath,
idVariable='POLYGON_ID'), but the dataframe call will set the id_order_set
flag, while the shapefile classmethod does not.
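In the spirit of the spec tests mentioned earlier, that decision could be written down as a small check. This is only a sketch: compare_indexing is a hypothetical helper, and the two objects are stand-ins that model just the id_order and id_order_set attributes rather than real weights objects.

```python
from types import SimpleNamespace


def compare_indexing(w_ref, w_new):
    """Report whether two weights-like objects agree on order and on the flag."""
    return {
        "same_iteration_order": list(w_ref.id_order) == list(w_new.id_order),
        "same_id_order_set": w_ref.id_order_set == w_new.id_order_set,
    }


# Stand-ins for real weights objects; only the two attributes discussed
# in this post are modeled, and the ids are copied from the output above.
file_order = ["54029", "54009", "54069", "54051", "10003"]
from_shapefile_like = SimpleNamespace(id_order=file_order, id_order_set=False)
from_dataframe_like = SimpleNamespace(id_order=list(file_order), id_order_set=True)

report = compare_indexing(from_shapefile_like, from_dataframe_like)
```

Here the report shows matching iteration order but a mismatched flag, which is exactly the remaining inconsistency between the two constructors.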
Originally posted on yetanothergeographer.tumblr.com.