Clarifying a Core Data Model for PySAL
From my perspective, I see my proposal for GSOC as containing two oblique foci. My hope in drafting this notebook for my fellow authors is to make this clear.
I think at this point, many people who use & develop PySAL would like to see better tooling for Pandas$\leftrightarrow$PySAL interaction. This was started by Luc, Dani, & myself in the pysal.pdio contrib module. At its most basic, pdio reads/writes dbfs using PySAL IO. In addition, all shapes in a shapefile are read in and stuffed into a column of the dataframe. Unlike Geopandas, this occurs on a standard Pandas dataframe, rather than constructing a special subclass.
In addition, interfaces between shapely and PySAL exist via the shapely_ext contrib module. The shapely extension provides an interface between pysal and shapely using geointerface at the individual polygon level.
My GSOC proposal’s main point was that I’d like to see PySAL gain a labelled array interface, and that a specification for this interface should make it easy to do GIS, regardless of source. How we want this done remains an open question. I think there are really two aims in my proposal that I’d like to come to a consensus on about proirities:
The directions are what I’d like to call:
- NOGR: A pure python/cython GIS
- Pandas as Idiom: A Tabular data idiom for PySAL
The NOGR concern perennially hits a wicket for PySALers. I’m sure those reading may have engaged in this no-win discussion before. The cycle is something like this:
- PySAL targets minimal dependencies from the scientific python stack.
- Spatial Analysis sometimes has tabular operations & GIS as a prerequisite.
- GeoPandas+Shapely/Pandas are the standard for GIS/tabular in Python.
The NOGR project is really providing a method of solving 2 within the constraints of 1.
The Pandas as Idiom direction leverages 3 within the constraints of 1 to provide a consistent labelled array interface.
These largely oblique to each other:
Pure Python GIS would not, by itself, define a tabular data model for PySAL
A tabular data model would make it much easier to interface with current GIS tools in python, but still means that users who need GIS would need GDAL.
Thus, I’d like the mentors’ & broader PySAL community’s perspectives on how whether & how urgently a NOGR GIS package is needed before commencing my GSOC work. This feedback will inform and direct what parts of the proposal get emphasized.
Just to be clear, though, I’d like to show exactly how useful the shapely extension is. I don’t really see it mentioned a lot, so I’d like to show what’s already possible using just it and the pdio-style PySAL + Pandas Dataframes:
import pysal as ps
from pysal.contrib import shapely_ext as psh
import shapely.geometry as sh
import shapely.ops as GIS
import numpy as np
data = ps.pdio.read_files(ps.examples.get_path('columbus.shp'))
Basic GIS:¶
We have wrappers for almost all of the GIS operations in Shapely. They all take/recieve PySAL shapes, but use shapely for the actual operations:
(set([x for x in dir(GIS) if not x.startswith('_')]).difference(dir(psh)))
We can also use most shapely methods as standalone functions:
null = geom.Polygon()
[x for x in dir(null) if (not x.startswith('_') and x in dir(psh))]
#merge
full_CO = psh.cascaded_union(data.geometry.values.tolist())
Since Shapely Polygons have a visual __repr__, I’ll be casting to shapely to quickly show the shapes. But, they’re still python shapes:
full_CO
sh.asShape(full_CO)
print(psh.contains(full_CO, geom.Point(0,0)))
print(psh.contains(full_CO, data.geometry[0]))
west_CO = (data[data.geometry.apply(lambda x: x.centroid[0] < 9)])
diff = psh.difference(full_CO, psh.unary_union(west_CO.geometry.tolist()))
sh.asShape(diff)
psh.buffer
buffer_query = psh.buffer(data.geometry[1], .6)
data[data.geometry.apply(lambda x: psh.intersects(buffer_query, x))]
Note that none of this raw shapely/pandas stuff uses any spatial indexing, which is something Geopandas uses.
US = ps.pdio.read_files(ps.examples.get_path('NAT.shp'))
states = US.groupby('STATE_NAME').geometry.apply(psh.unary_union)
states.head()
states.apply(lambda x: GIS.validate(sh.asShape(x)))
Weights¶
from functools import partial
This is a very naive way to construct spatial weights, since it doesn’t use any prefiltering or spatial query and doesn’t recognize symmetry in touches. So, don’t use this on something big, like NAT.shp, because it will take forever.
But, for the purposes of illustration, Queen Weights can be constructed using chained method calls on tabular data. This pattern is how most of the Queen Weights functions in PostGIS I’ve seen work, using array queries.
Since, in PostGIS, the ST_Touches naturally uses a GIST, it does a similar kind of binning reduction we use in PySAL, and could use if we wrapped arbitrary geometry columns in a pandas dataframe. The following does not, but still yields Queen weights:
Wt = (data
.geometry #for all geometries
.apply(lambda x:
data
.geometry #go back through geometry
.apply(partial(psh.touches, x))) # check if any touch current x, partial to ensure closure
.values.astype(int)) #cast bools to ints
Wps = ps.queen_from_shapefile(ps.examples.get_path('columbus.shp')).full()[0]
np.testing.assert_allclose(Wt, Wps)
So, in my vision, a NOGR focus would be to make the above code run without the lines:
import shapely.geom as geom
import shapely.ops as GIS
from pysal.contrib import shapely_ext as psh
In short, this means implementing a GIS module using only python/cython that can do what shapely.ops does on shapely.geometry objects. I would probably target replicating most of the functions provided in sections 8.5, 8.6, and 8.9 of the PostGIS documentation, ensuring they work when supplied with a dataframe with a column of known spatial type (i.e. containing shapely polygons (geopandas), pysal shapes (pdio), or object.__geointerface__ for all objects).
Ultimately, returning to the cycle of discussion mentioned above, this would reflect PySAL prioritizing point 2 to achieve point 1:
Need: (1) PySAL needs its own tabular/GIS Module with minimal dependencies
Reason: (2) Spatial Analysis often requires tabular/GIS operations
This is where the proposal that suggests novel python/cython implementations of required algorithms from De Berg’s Computational Geometry or Xiao’s GIS Algorithms be the focus of the project. But, implementing a fast NOGR GIS library would be the core here. If there were time at the end, the datamodel could be extended to use the GIS operations provided here.
The Pandas as Idiom direction is centrally about PySAL’s interface with Pandas dataframes with geometric columns (or Geopandas Geodataframes) as if they were another native datamodel for PySAL. That is, if I prioritize in this direction, I would be attempting to make the boundary between PySAL, Pandas, & GeoPandas/Shapely a little smoother, without introducing new hard dependencies. This focuses on leveraging against packages in 3 within the constraints of 1:
Need: (1) PySAL needs to abstract its datastructures from its data representations/IO framework
Reason: (3) It is unclear how to leverage existing Tabular & GIS tools alongside of PySAL
It may end up looking like an answer to Issue #474: Add contrib module to explore interface with geopandas. But, instead of an exploration it would be an instantiation, would support both the PySAL geometry-based pdio & the Shapely-geometry Geopandas.
Instead of focusing on pure python/cython implementations of core GIS operations, this would focus on making PySAL’s interfaces become more polymorphic. It would reduce the gap between PySAL & Geopandas/Shapely, while keeping a bright line between what uses GDAL and what does not. I imagine this happening through extensive use of optional import strategies.
In it, I see three components:
An adapter module in contrib,
geotable, that does a few things:- manages transitions between Geopandas dataframes and
pdiodataframes - minimizes the distance between GDAL-dependent libraries and PySAL saftely & clearly
- extends
shapely_extto perform common tabular operations
- manages transitions between Geopandas dataframes and
A contrib module,
pda. This would:- subsume current
pdiocode - extend the
pdiostrategy to use PySAL’s full IO suite - If datatype isn’t supported, loudly call into
geotable’s GeoPandas interface & fail safe if unavailable - contain the necessary bridge code for PySAL to use pandas/patsy for any purpose
- subsume current
In core:
- expose consistent labelled array interfaces across core modules
- enforce consistent unlabelled array rules across core modules (i.e. resolve shaping & array/list inconsistencies)
- finish & merge polymorphic weights constructors for arbitrary shape iterables
I think
- the tabular data model is more important
- GIS is best left to Shapely & GeoPandas
- If NOGR is necessary, it will be easier to use it as a drop-in replacement for the shapely hooks in
geotablethan it would to build it first.
Originally posted on yetanothergeographer.tumblr.com.