I looked into using spatial autocorrelation functions in my dissertation to characterize the “scale” at which processes operate electorally. I did an analysis of presidential vote by county, trying to identify where, exactly, clusters of votes tend to become decorrelated. The typical diameter at which the so-called “spatial autocorrelation function” goes to zero denotes how wide a typical spatial cluster might be, and the partial spatial autocorrelation function gives an anticipated order at which spatial autocorrelation may hold.

This will be published along with my dissertation when it becomes unembargoed. I also gave a talk on this at the 2017 AAG meeting. So, below is the initial exploration of what a spatial autocorrelation/partial autocorrelation function might look like.

import pysal as ps
import numpy as np
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

To talk about the spatial (partial) autocorrelation function, which is kind of like a mixture in concept between the geostatistical variogram and the (partial) autocorrelation function in time series analysis, let’s use presidential vote choice results at the county level for 2008, 2012, and 2016.

To do this, I’ll first grab the results from a GitHub repo I’ve been tracking. Thanks to user @tonmcg for making this available in plaintext, so we can read it straight from the URL with pandas without downloading it first.

votes = pd.read_csv('https://raw.githubusercontent.com/tonmcg/'
            'County_Level_Election_Results_12-16/master/'
            'US_County_Level_Presidential_Results_08-16.csv')

Since the spatial autocorrelation function (and the variogram) depend on the spatial positions of our data (or, at the least, a topological arrangement of our data), we need to merge these county-level results with the actual geometries of each county. To do this, I’ll use the example county dataset in PySAL, the Python spatial analysis library.

geoms = gpd.read_file(ps.examples.get_path('NAT.shp'))

votes.head()

	fips_code	county	total_2008	dem_2008	gop_2008	oth_2008	total_2012	dem_2012	gop_2012	oth_2012	total_2016	dem_2016	gop_2016	oth_2016
0	26041	Delta County	19064	9974	8763	327	18043	8330	9533	180	18467	6431	11112	924
1	48295	Lipscomb County	1256	155	1093	8	1168	119	1044	5	1322	135	1159	28
2	1127	Walker County	28652	7420	20722	510	28497	6551	21633	313	29243	4486	24208	549
3	48389	Reeves County	3077	1606	1445	26	2867	1649	1185	33	3184	1659	1417	108
4	56017	Hot Springs County	2546	619	1834	93	2495	523	1894	78	2535	400	1939	196

Then, to merge things up, I’ll create a common key based on the FIPS code of the county, zero-padded to five characters, and merge the data:

votes['FIPS'] = votes.fips_code.apply(lambda x: str(x).rjust(5,'0'))

votes = pd.merge(votes, geoms[['FIPS', 'STATE_NAME', 'geometry']], how='right', on='FIPS')
votes = gpd.GeoDataFrame(votes)
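The zero-padding matters because FIPS codes read in as integers lose their leading zero (Walker County is stored as 1127 in the table above). A minimal sketch of the key construction, where `pad_fips` is a hypothetical helper name used only for illustration:

```python
def pad_fips(code):
    """Return a county FIPS code as a 5-character, zero-padded string."""
    return str(code).rjust(5, '0')

# Values taken from the votes.head() output above.
print(pad_fips(1127))   # Walker County: the leading zero is restored
print(pad_fips(26041))  # Delta County: already five digits, unchanged
```

`str.zfill(5)` would do the same job; the point is only that both sides of the merge need to be strings of the same width.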

Finally, since I’m mostly interested in two-party vote shares, rather than raw votes, I’ll construct the two party vote share in each year as:

\[ tpv_{it} = \frac{d_{it}}{d_{it} + r_{it}} \]

where \(d_{it}\) is the raw vote cast in county \(i\) for the Democratic candidate at time \(t\), and \(r_{it}\) is the comparable raw vote cast for the Republican candidate. We can just do simple series operations to get this done:

votes['tpv_2008'] = votes.dem_2008 / (votes.dem_2008 + votes.gop_2008)
votes['tpv_2012'] = votes.dem_2012 / (votes.dem_2012 + votes.gop_2012)
votes['tpv_2016'] = votes.dem_2016 / (votes.dem_2016 + votes.gop_2016)
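As a quick check of the formula on a single row, here is a small sketch using the Delta County counts for 2008 from the table above. The `two_party_share` helper and its handling of an empty denominator are assumptions made for illustration, not part of the analysis itself:

```python
def two_party_share(dem, gop):
    """Democratic share of the two-party vote; None if no two-party votes were cast."""
    total = dem + gop
    return dem / total if total > 0 else None

# Delta County, 2008: dem_2008 = 9974, gop_2008 = 8763 (from votes.head()).
share = two_party_share(9974, 8763)
print(round(share, 4))
```

Third-party votes (the `oth_*` columns) drop out entirely, which is why the share can differ noticeably from the Democratic share of the total vote.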

These distributions tend to appear Gaussian, if not a slightly skewed Gaussian. But, we’re not really making any distributional analyses (like we were in my exploratory spatial regression notebook), so I’ll let this sit for now.

f,ax = plt.subplots(1,3, figsize=(2*3*1.6, 2))
for i,col in enumerate(['tpv_2008','tpv_2012','tpv_2016']):
    sns.kdeplot(votes[col].values, shade=True, color='slategrey', ax=ax[i])
    ax[i].set_title(col.split('_')[1])


png

Next, since I’ll be mapping our data, I’ll use geopandas to reproject the raw data from PySAL (in Plate Carrée projection) into a better projection for choropleth mapping, the Albers Equal Area Conic projection:

votes.crs = {'init':'epsg:4326'}
votes = votes.to_crs(epsg='5070')

Now, before we get any further, let’s make some maps of the two-party vote shares in 2008, 2012, and 2016 (alongside the vote distributions), and explore what spatial distribution dynamics might be going on:

f,ax = plt.subplots(3,2, figsize=(1.6*6 + 1,6*3), gridspec_kw=dict(width_ratios=(6,1)))
for i,col in enumerate(['tpv_2008','tpv_2012','tpv_2016']):
    votes.plot(col, linewidth=.05, cmap='RdBu', ax=ax[i,0])
    ax[i,0].set_title(col.split('_')[1] + ' Two Party Vote (% Dem)')
    ax[i,0].set_xticklabels('')
    ax[i,0].set_yticklabels('')
    sns.kdeplot(votes[col].values, ax=ax[i,1], vertical=True, shade=True, color='slategrey')
    ax[i,1].set_xticklabels('')
    ax[i,1].set_ylim(0,1)
f.tight_layout()
plt.show()


png

One thing that’s super clear when you do these maps of two-party vote is that more counties tend to vote Republican than Democrat. In the KDE plots, you see this because the mode of the vote share distribution is well below .5, even in 2012, when President Obama won reelection handily. While the best analysis might be to drill all the way down to the voter tabulation district level, that data attached to its geographies is pretty hard to find, and often too large for most to work with on a national scale. I’ve been working on putting it together in an sqlite dump, but that takes time :)

The second thing that’s clear is that the collapse of the “blue wall” (Minnesota, Wisconsin, Michigan) looks like it was actually a gradual process at the county level. Lots of marginally-blue counties flipped, resulting in a statewide flip. As a geographer, another thing that’s interesting about the electoral mosaic is how nearly indistinguishable rural Illinois is from its surrounding areas in MO and KY. I think (if I were to finish my PhD and move into some electoral modeling), I would seriously look into Markov random field models (say a hierarchical SAR/CAR model) of this process, since state-based hierarchical models will miss this type of proximity-based correlation entirely.

How counties are related in time is a reasonable first question. It’s well known that past two-party vote share tends to predict future two-party vote share quite well at an aggregate level and over all ranges of vote share. Of course, what really matters in the end is how well the wins in each state correlate over time, which is a different question. While we could address this with county-level vote, I’m using the county-level data to look at distribution dynamics, so I’ll let that slide for now.

First, we drop the counties where we’re missing data:

votes.dropna(subset=['tpv_2008', 'tpv_2012', 'tpv_2016'], inplace=True)

And, if we make a scatterplot of the past vote (on X axes) and the future vote (on the Y axes), we see that the correlation is very strong, both when comparing 2008 vs. 2012 and 2012 vs. 2016.

However, what’s also clear is that 2012 vs. 2016 has lower correlation, especially in the range of competitive counties (between ~.4 and ~.6). I’ll be looking into competitive counties (and legislative seats) later.

f,ax = plt.subplots(1,2, figsize=(4*2.1,4))
votes[['tpv_2008', 'tpv_2012']].plot.scatter('tpv_2008', 'tpv_2012', ax=ax[0])
ax[0].set_xlabel('2008 Two Party Vote (% Dem)')
ax[0].set_ylabel('2012 Two Party Vote (% Dem)')
ax[0].axis([0,1,0,1])
r = np.corrcoef(votes['tpv_2008'].values, votes['tpv_2012'].values)[0,1]
ax[0].text(.6,.2, s=r'$\rho = {:.3f}$'.format(r), fontsize=20)
votes[['tpv_2012', 'tpv_2016']].plot.scatter('tpv_2012', 'tpv_2016', ax=ax[1])
ax[1].set_xlabel('2012 Two Party Vote (% Dem)')
ax[1].set_ylabel('2016 Two Party Vote (% Dem)')
ax[1].axis([0,1,0,1])
r = np.corrcoef(votes['tpv_2012'].values, votes['tpv_2016'].values)[0,1]
ax[1].text(.6,.2, s=r'$\rho = {:.3f}$'.format(r), fontsize=20)
f.tight_layout()
plt.show()

png

Since we only have two time periods, the autocorrelation plot of this would look rather uninteresting. I’ve been working on this at the congressional district level over the 20th (and now 21st) centuries using some data I grabbed from the CLEA, mixed with a little ICPSR6311, and merged with the UCLA collection of congressional districts, and hopefully getting released through the research cluster I work with at UChicago. Again, this all takes time, but the data is ready to go, so ask me if you’re interested.

How about in space?

Of course, an interesting question also might be to look for clusters in vote. We know about rural/urban divides and regional divides in American voting, so we would expect some pretty strong correlation between neighbors at a county level.

However, what’s the order of this process? That is, how far away are counties related to one another?

This has a pretty clear analogue in time-series autocorrelation analysis. The autocorrelation function for a serially-correlated signal computes the correlation between the signal at time \(t\) and the signal at time \(t-k\), where \(k\) is some arbitrary lag. A related concept, the variogram in spatial statistics, computes the variance of the difference between locations as they get further and further apart. The partial autocorrelation function (which relates the signal at \(t\) and \(t-k\) when accounting for all lags between), is also available in a geostatistical context by conditioning the variogram on adjacent pairs below the range. But, this is incredibly computationally intensive (and the variogram is sufficient for all kinds of geostatistical models), so the partial variant is much less well used.
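To make the time-series analogue concrete, here is a minimal sketch of the sample autocorrelation at lag \(k\). The toy signal is hypothetical, and the estimator is the usual textbook one (sum of lagged cross-products over the total sum of squares), not anything specific to the analysis in this post:

```python
def autocorrelation(series, k):
    """Sample autocorrelation between the series and itself shifted by k steps."""
    n = len(series)
    m = sum(series) / n
    num = sum((series[t] - m) * (series[t - k] - m) for t in range(k, n))
    den = sum((v - m) ** 2 for v in series)
    return num / den

# Hypothetical toy signal that flips sign every period.
signal = [1, -1, 1, -1, 1, -1, 1, -1]
acf = [autocorrelation(signal, k) for k in range(4)]
print(acf)
```

For this signal the values alternate in sign and shrink slowly with \(k\), since fewer pairs are available at longer lags. The spatial versions below replace "\(k\) steps back in time" with "\(k\) steps away on the adjacency graph".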

Unfortunately, the scale of the US county system in terms of the distances between places gets much larger as we go west than it is in the east. One way this is handled in spatial econometrics is to use the adjacency matrix to define neighborhoods. In this case, adjacent counties are considered neighbors, regardless of the actual distance between counties. This allows the connectivity graph relating observations to have a similar density when the polygons being related dilate but keep the same topology. I’ll plot this graph over the counties below. Here, I use rook contiguity, which means two counties are adjacent if they share a boundary segment (an edge, not just a single corner point).

W = ps.weights.Rook.from_dataframe(votes)
f = plt.figure(figsize=(1.6*8, 8))
ax = plt.gca()
votes.plot(linewidth=.1, color='white', ax=ax)
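for idx, neighbors in W:
    # W yields (focal id, {neighbor id: weight}); .loc replaces the removed .ix indexer
    centroids = votes.loc[list(neighbors)].geometry.apply(lambda pgon: (pgon.centroid.x, pgon.centroid.y))
    centroids = np.vstack(centroids.values)
    focal = np.hstack(votes.loc[idx].geometry.centroid.xy)
    # draw one thin segment from the focal centroid to each neighbor centroid
    for neighbor in centroids:
        ax.plot(*zip(focal, neighbor), color='firebrick', linewidth=.1)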
plt.xticks([])
plt.yticks([])
plt.title('Rook Contiguity for US Counties')
plt.show()

png
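For intuition about what rook contiguity includes and excludes, here is a small sketch on a regular grid rather than real county polygons. The `grid_neighbors` function is a hypothetical helper written for this illustration; it is not how PySAL builds weights from geometries:

```python
def grid_neighbors(nrows, ncols, rule='rook'):
    """Neighbor lists for the cells of a regular grid, keyed by (row, col)."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]           # cells sharing an edge
    if rule == 'queen':
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]    # plus cells sharing a corner
    neighbors = {}
    for r in range(nrows):
        for c in range(ncols):
            neighbors[(r, c)] = [(r + dr, c + dc) for dr, dc in steps
                                 if 0 <= r + dr < nrows and 0 <= c + dc < ncols]
    return neighbors

rook = grid_neighbors(3, 3, 'rook')
queen = grid_neighbors(3, 3, 'queen')
print(len(rook[(1, 1)]), len(queen[(1, 1)]))  # center cell
print(len(rook[(0, 0)]), len(queen[(0, 0)]))  # corner cell
```

Under the rook rule the center cell of a 3x3 grid has four neighbors; allowing corner touches (queen contiguity) doubles that. For irregular county polygons the difference is usually smaller, but it shows up wherever counties meet at a single point.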

With this adjacency matrix, we can compute a few interesting spatial statistics. The first, the bivariate Moran statistic (from Wartenberg (1985), a kind of Mantel statistic), relates a set of observations to the spatial lag of another set of observations.

To be clear, the spatial lag is analogous to the temporal lag of a variate. In this case, the spatial lag refers to the average of the neighboring values around each observation. Using a row-standardized adjacency matrix \(\mathbf{W}\), the lag of \(Y\) is expressed simply as \(\mathbf{W}Y\).
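A minimal sketch of the spatial lag with a row-standardized weights structure, using hypothetical toy data (four units on a line) and plain dictionaries in place of a PySAL `W` object:

```python
def row_standardize(neighbors):
    """Turn neighbor lists into weights that sum to one for each observation."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items()}

def spatial_lag(weights, y):
    """Compute Wy: the weighted average of each observation's neighbors."""
    return {i: sum(w * y[j] for j, w in row.items()) for i, row in weights.items()}

# Hypothetical toy data: four units arranged on a line, a - b - c - d.
neighbors = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
y = {'a': 0.2, 'b': 0.4, 'c': 0.6, 'd': 0.8}

lag = spatial_lag(row_standardize(neighbors), y)
print(lag)
```

Because each row of weights sums to one, the lag for `b` is just the mean of `a` and `c`, while the end units simply inherit the value of their only neighbor.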

This means that the bivariate Moran’s I statistic is stated for centered attribute vectors \(y\) and \(x\): \[ \frac{x'\mathbf{W}y}{x'x}\]

This results in a single statistic (and accompanying \(p\)-values computed using permutation methods) that relates the values of attribute \(x\) to the lag of \(y\). We can use this statistic to relate votes between two times. In the following, we see that county vote in the previous year is a good predictor of the vote in the next year:

bvi = ps.Moran_BV(votes['tpv_2008'], votes['tpv_2012'], w=W)
bvi.I, bvi.p_sim

(0.59905458256875777, 0.001)

bvi = ps.Moran_BV(votes['tpv_2012'], votes['tpv_2016'], w=W)
bvi.I, bvi.p_sim

(0.57711482669716285, 0.001)
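To see the arithmetic behind these numbers, here is a small sketch of the ratio \(x'\mathbf{W}y / x'x\) on hypothetical toy data. I standardize both variables (rather than only centering them) so the result does not depend on units; that choice, the toy values, and the omission of the permutation-based \(p\)-value are all assumptions of this sketch rather than a description of what PySAL does internally:

```python
from statistics import mean, pstdev

def standardize(v):
    """Center a list of values and scale it to unit (population) variance."""
    m, s = mean(v), pstdev(v)
    return [(a - m) / s for a in v]

def bivariate_moran(x, y, weights):
    """x' W y / x' x for standardized x and y; weights[i] maps j -> w_ij."""
    zx, zy = standardize(x), standardize(y)
    lag_y = [sum(w * zy[j] for j, w in weights[i].items()) for i in range(len(zy))]
    return sum(a * b for a, b in zip(zx, lag_y)) / sum(a * a for a in zx)

# Hypothetical toy data: six units on a line with row-standardized weights.
n = 6
nbrs = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
W_toy = {i: {j: 1.0 / len(js) for j in js} for i, js in nbrs.items()}

x = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]        # "earlier" values
y = [0.15, 0.2, 0.35, 0.55, 0.75, 0.8]    # "later" values
smooth = bivariate_moran(x, y, W_toy)

checker = [0, 1, 0, 1, 0, 1]              # every unit is the opposite of its neighbors
opposed = bivariate_moran(checker, checker, W_toy)
print(smooth, opposed)
```

The smooth pair gives a positive statistic, while the checkerboard pattern gives a strongly negative one, which is the sense in which the statistic relates one variable to the neighborhood average of another.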

Another way to look into this might be to look for clusters of volatility in how the vote changes from year to year. To do this, we’ll be using the quadrants of the Moran scatterplot to interpret local indicators that show whether some counties are swinging together with their neighbors, or if some counties are swinging in opposition to their neighbors. Local Moran statistics, computable in PySAL, allow us to determine both the relative direction (in terms of more or less Republican) and the neighborhood dynamics (in terms of how the nearby counties move). The local Moran statistic for a vector of centered observations \(z\) is computed:

\[ I_i = \frac{z_i \sum_j w_{ij} z_j}{z'z} \]

mli = ps.Moran_Local(votes.tpv_2016 - votes.tpv_2012, w=W)
votes['mlocal_1216'] = mli.Is
votes['mlocal_p_1216'] = mli.p_sim
votes['mlocal_quad_1216'] = mli.q
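Here is a small sketch of how a local statistic and its scatterplot quadrant can be computed by hand. The toy swings are hypothetical, the statistic is left unscaled (libraries may multiply \(I_i\) by a constant), and the quadrant numbering (1 high-high, 2 low-high, 3 low-low, 4 high-low) is my assumption about the convention, chosen to match how the quadrants are read in this post:

```python
def local_moran(y, weights):
    """Return (I_i, quadrant) pairs; weights[i] maps neighbor j -> w_ij."""
    m = sum(y) / len(y)
    z = [v - m for v in y]                      # centered observations
    zz = sum(a * a for a in z)
    out = []
    for i, zi in enumerate(z):
        lag = sum(w * z[j] for j, w in weights[i].items())
        # 1: high-high, 2: low-high, 3: low-low, 4: high-low
        if zi > 0:
            quad = 1 if lag > 0 else 4
        else:
            quad = 2 if lag > 0 else 3
        out.append((zi * lag / zz, quad))
    return out

# Hypothetical swings for six units on a line, with row-standardized weights.
n = 6
nbrs = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
W_toy = {i: {j: 1.0 / len(js) for j in js} for i, js in nbrs.items()}
swing = [0.06, 0.05, 0.04, -0.06, -0.05, -0.04]

results = local_moran(swing, W_toy)
quadrants = [q for _, q in results]
print(quadrants)
```

Units in quadrants 1 and 3 get positive \(I_i\) (they move with their neighbors), while units in quadrants 2 and 4 get negative \(I_i\) (they move against them), which is why the outlier tables further down show negative statistics.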

sns.distplot(mli.y)
plt.title('Change in two-party vote between 2012 and 2016')


png

# keep only the columns needed to read the local Moran results
cols = ['county', 'STATE_NAME', 'tpv_2012', 'tpv_2016',
        'mlocal_1216', 'mlocal_p_1216', 'mlocal_quad_1216']
lmos = votes.sort_values(['mlocal_quad_1216', 'mlocal_p_1216'], ascending=False)[cols]

Then, we’ll examine the results quadrant by quadrant of the Moran scatterplot. To make everything clearer, you might want to read this alongside this plot.

Spatial Clusters in Vote Swing

Quadrant I

Observations in quadrant I are counties where both the focal county and its neighbors had large increases in Democratic vote share. These would represent counties where both the focal and its neighbors intensified in support for Democrats. Unsurprisingly, we see that the largest cluster strengths (in terms of the size of the mlocal_1216 statistic) occur in Utah & Virginia, states that were known to swing pretty strongly towards Clinton.

Surprises might be that Montgomery County, MD and those Georgia counties where Clinton was rumored to be surging in October show up here as well. Interestingly, in some cases (like the Utah counties or Georgia counties) this improvement is from a very low base (about 15% in Cache county, UT) to something much better (30% in Cache county, UT). Thus, a lot of this is probably occurring among weak partisans who might be swayed by the (supposedly hefty) respectability bias about Trump, but who might otherwise vote Republican with a clear conscience.

And, unsurprisingly, you also see some consolidation going on, where county shifts towards the Democrats intensified in states typically won by Democrats, like the California counties in the list below.

(lmos.query('mlocal_p_1216 < .01 and mlocal_quad_1216 == 1')
   .sort_values(['mlocal_1216'], ascending=False)
   .head(25))

	county	STATE_NAME	tpv_2012	tpv_2016	mlocal_1216	mlocal_p_1216	mlocal_quad_1216
2125	Salt Lake County	Utah	0.400350	0.580997	13.797329	0.001	1
538	Davis County	Utah	0.184858	0.328788	11.869800	0.001	1
101	Falls Church City	Virginia	0.700434	0.814524	10.397833	0.001	1
942	Wasatch County	Utah	0.234328	0.332265	9.125809	0.001	1
376	Arlington County	Virginia	0.702165	0.820099	9.110773	0.001	1
2078	Weber County	Utah	0.264168	0.365917	8.375549	0.001	1
2289	Cache County	Utah	0.149895	0.304994	7.994395	0.001	1
466	Morgan County	Utah	0.089851	0.153433	7.805697	0.001	1
924	Alexandria City	Virginia	0.721844	0.811886	7.589085	0.001	1
522	Charlottesville City	Virginia	0.772119	0.858234	7.025966	0.007	1
601	Fairfax County	Virginia	0.600144	0.691918	6.617549	0.001	1
1447	Utah County	Utah	0.099862	0.212752	6.593047	0.001	1
2737	Summit County	Utah	0.474938	0.588766	5.726012	0.001	1
3074	Tooele County	Utah	0.236622	0.292895	5.218250	0.001	1
2917	San Diego County	California	0.526027	0.589864	5.021068	0.001	1
1894	Montgomery County	Maryland	0.720859	0.788706	4.783064	0.001	1
1488	Forsyth County	Georgia	0.180902	0.251287	4.768546	0.001	1
731	San Mateo County	California	0.728657	0.799851	4.744300	0.003	1
2328	Gwinnett County	Georgia	0.452505	0.529853	4.712442	0.001	1
2622	Fulton County	Georgia	0.649931	0.718616	4.625660	0.001	1
2727	Orange County	California	0.457686	0.527198	4.503098	0.001	1
789	Teton County	Idaho	0.439325	0.499537	4.346220	0.001	1
487	Ventura County	California	0.527095	0.585385	4.309279	0.002	1
130	Santa Barbara County	California	0.587822	0.650300	4.283089	0.003	1
1700	Box Elder County	Utah	0.102815	0.149448	4.258581	0.001	1

Quadrant III

On the opposite side of the origin in the Moran scatterplot, quadrant III would indicate “low-low” clusters, areas where the two-party vote decreased significantly in both the focal and neighboring units. Thus, these would be clusters of intensifying Republican support.

In these, you see precipitous drops in Democrat support in Missouri, Ohio, Iowa, and a cluster centered around Calhoun county, WV. This largely comports with the narrative that Ohio dropped out of being truly contested this cycle, with eventual vote totals falling well below the expected contestable range. Notably, this was spatially correlated, so not only did this affect counties in Ohio, but these clusters indicate that there were spillovers.

For example, for Monroe, Guernsey, Noble, and Morgan counties, OH, near the WV border, this analysis indicates that the group swung hard towards the Republicans, and did so in a way that’s statistically nonrandom in terms of the spatial location of those counties in Ohio. Notably absent from this are counties in the northwest of Ohio, abutting Michigan, that might indicate nascent spillovers between those states.

It also seems that heartland areas (Henderson county, IL, as well as the counties in southern Iowa and eastern MO) also swung together in a spatial cluster for Trump.

(lmos.query('mlocal_p_1216 < .01 and mlocal_quad_1216 == 3')
   .sort_values(['mlocal_1216'], ascending=False)
   .head(25))

	county	STATE_NAME	tpv_2012	tpv_2016	mlocal_1216	mlocal_p_1216	mlocal_quad_1216
1764	Clark County	Missouri	0.446931	0.227530	5.669215	0.001	3
797	Howard County	Iowa	0.606798	0.390665	4.854837	0.001	3
315	Vinton County	Ohio	0.459846	0.259599	4.456021	0.001	3
2560	Iron County	Missouri	0.425657	0.227228	4.443903	0.001	3
2882	Lee County	Iowa	0.581437	0.414187	4.399783	0.001	3
1937	Monroe County	Ohio	0.459294	0.256223	4.358251	0.001	3
1080	Harrison County	Ohio	0.423803	0.248803	4.301950	0.001	3
3012	Pike County	Ohio	0.497959	0.309845	4.243976	0.001	3
52	Guernsey County	Ohio	0.450583	0.277411	4.211854	0.001	3
2321	Appanoose County	Iowa	0.482011	0.310233	4.145644	0.001	3
2478	Reynolds County	Missouri	0.374757	0.183549	4.143634	0.001	3
1087	Scioto County	Ohio	0.489968	0.309310	4.123989	0.001	3
533	Henderson County	Illinois	0.562447	0.348366	4.090308	0.002	3
1912	Worth County	Iowa	0.573421	0.384014	4.032455	0.002	3
2976	Adams County	Iowa	0.479962	0.288560	3.943708	0.002	3
2926	Noble County	Ohio	0.373118	0.212868	3.918989	0.001	3
1136	Monroe County	Iowa	0.460944	0.285521	3.801247	0.001	3
122	Morgan County	Ohio	0.469194	0.283936	3.775632	0.003	3
1785	Union County	Iowa	0.519643	0.352876	3.721560	0.001	3
752	Washington County	Missouri	0.404518	0.214644	3.665785	0.001	3
2113	Calhoun County	West Virginia	0.383437	0.183945	3.654698	0.002	3
431	Davis County	Iowa	0.415776	0.263998	3.581305	0.001	3
1753	Chickasaw County	Iowa	0.556043	0.377145	3.534017	0.001	3
2551	Hancock County	Illinois	0.402929	0.233874	3.505805	0.001	3
2870	Ringgold County	Iowa	0.463683	0.292654	3.498472	0.001	3

Quadrant II

(lmos.query('mlocal_p_1216 < .01 and mlocal_quad_1216 == 2')
   .sort_values(['mlocal_1216'], ascending=False)
   .head(25))

	county	STATE_NAME	tpv_2012	tpv_2016	mlocal_1216	mlocal_p_1216	mlocal_quad_1216
1425	Linn County	Oregon	0.411225	0.348152	-0.072804	0.008	2
2163	Lake County	Colorado	0.624395	0.558659	-0.150249	0.002	2
58	Carbon County	Utah	0.313157	0.248662	-0.159067	0.001	2
665	Lake County	California	0.585876	0.520347	-0.175529	0.001	2

There aren’t many observations in this quadrant. It contains counties whose swing was more Republican than average while their neighbors’ swings were more Democrat than average. Note that “average” here means the unweighted average swing at the county level, which stands in for the mean national swing. So, this captures the counties that swung more Republican than the average county while their neighbors swung more Democrat than the average county.

We see some pretty counter-intuitive counties here. Linn county, OR and Lake county, CO weren’t things that were on my radar, but it seems they’ve moved anomalously towards the Republicans while their neighbors intensified in Democratic support.

Quadrant IV

(lmos.query('mlocal_p_1216 < .01 and mlocal_quad_1216 == 4')
   .sort_values(['mlocal_1216'], ascending=False)
   .head(25))

	county	STATE_NAME	tpv_2012	tpv_2016	mlocal_1216	mlocal_p_1216	mlocal_quad_1216
1598	Nicollet County	Minnesota	0.540244	0.483152	-0.041772	0.006	4
3059	Ohio County	West Virginia	0.385058	0.329845	-0.107641	0.002	4
1240	McDonough County	Illinois	0.492355	0.437391	-0.153496	0.001	4
1713	Kanawha County	West Virginia	0.439592	0.391678	-0.188802	0.009	4
2443	Cass County	North Dakota	0.484879	0.440538	-0.308489	0.002	4
2153	Linn County	Iowa	0.589217	0.548273	-0.359697	0.009	4
2541	La Crosse County	Wisconsin	0.587588	0.551186	-0.428908	0.008	4
1230	Jackson County	Illinois	0.553239	0.516216	-0.509392	0.002	4
172	Eau Claire County	Wisconsin	0.568727	0.539251	-0.680042	0.001	4
2747	Monongalia County	West Virginia	0.450422	0.443524	-0.984685	0.008	4
2968	Olmsted County	Minnesota	0.516450	0.504157	-1.047390	0.001	4

Like the second quadrant, these are also outliers. However, these are areas whose swings are more Democrat than average while their neighbors had swings that are more Republican than average. This seems to pick up a couple of Democratic areas where the general trend towards Trump failed to spill over between counties as strongly as it would have otherwise.

So, for example, Nicollet county, MN swung towards Trump by slightly less than the unweighted average shift towards Trump at the county level. But, this indicates its neighbors swung more strongly than average.

To match this up, a map of the swing and the cluster statistics might be helpful. First, the swing map:

f,ax = plt.subplots(1,1,figsize=(1.6*8,8))
votes['swing_1216'] = votes.tpv_2016 - votes.tpv_2012
votes.plot('swing_1216', cmap='RdBu', ax=ax, linewidth=.1, alpha=.8)
plt.title('Swing in Two-Party vote in 2016', fontsize=20)
plt.show()

png

And then the cluster map:

votes['stat_quad'] = votes.mlocal_quad_1216 * (votes.mlocal_p_1216<.01)

cp = sns.crayon_palette(['White', 'Cerulean', 'Tropical Rain Forest', 
                                 'Scarlet', 'Vivid Violet'])
import matplotlib.colors as cmaps
mymap, _ = cmaps.from_levels_and_colors(np.arange(-.5,5.5, 1), cp)

f,ax = plt.subplots(1,1,figsize=(1.6*8,8))
votes.plot('stat_quad', linewidth=.1, cmap=mymap, ax=ax, alpha=.7)
ax.set_xticklabels([])
ax.set_yticklabels([])
plt.title('Local Indicators of Partisan Swing', fontsize=20)
plt.show()
sns.palplot(cp, size=1.5)
plt.show()
print('NSD from Mean   Dem Cluster   Rep Outlier   Rep Cluster   Dem Outlier')

png

NSD from Mean   Dem Cluster   Rep Outlier   Rep Cluster   Dem Outlier

More specifically for these labels:

NSD from Mean: swing is about the national average in both the neighboring counties and the focal county.
Dem Cluster: Swing is better for Democrats in both the focal and the neighboring counties than it is nationally. (Dems overperform in this spatial cluster)
Rep Outlier: Swing is better for Republicans in this county than it is nationally and better for Democrats in neighboring counties than it is nationally (Reps overperform in this county vs. nearby counties and the nation)
Rep Cluster: Swing is better for Republicans in this county and the neighboring counties than it is nationally (Reps overperform in this spatial cluster)
Dem Outlier: Swing is better for Democrats in this county than nationally, but better for Republicans in neighboring counties than nationally (Dems overperform in this county vs. nearby counties and the nation)

Gauging typical Cluster size

With this, we can try to identify the “range” at which counties are related to one another. If we can identify this, we might be able to tell the graph distance at which counties tend to become uncorrelated with one another.

We can use the (partial) spatial autocorrelation functions to identify this. In a similar manner to the (partial) temporal autocorrelation function, the (partial) spatial autocorrelation function relates each observation to its \(k\)th order neighbors. In the spatial context, the \(k\)th order neighbors of observation \(y_i\) are the observations \(y_j\) that are first reached in \(k\) steps. This means that the graph distance between observation \(y_j\) and \(y_i\), written here as the minimum path length \(\min(||y_j - y_i||)\), is exactly \(k\): \[ \{ y_j : \min(||y_j - y_i||) = k, \quad j = 1, 2, \dots, n \} \] Thus, the \(k\)th order spatial autocorrelation function is:

\[\rho_k = cor(y, \mathbf{W}^ky)\]

where \(\mathbf{W}^k\) is the adjacency matrix for \(k\)-minimal neighbors. The \(k\)th-order partial spatial autocorrelation function is:

\[ \dot{\rho}_k = cor(y, \mathbf{W}^ky | \mathbf{W}^{k-1}y, \mathbf{W}^{k-2}y, \dots, \mathbf{W}^{1}y )\]
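The \(k\)-minimal neighbor sets defined above can be found with a breadth-first search over the adjacency graph. This is a minimal sketch on a hypothetical 3x3 rook grid; `k_minimal_neighbors` is an illustrative name and not a function from `spacf` or PySAL:

```python
from collections import deque

def k_minimal_neighbors(neighbors, start, max_order):
    """Map each order k to the nodes whose shortest-path distance from start is exactly k."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_order:
            continue                      # do not expand past the requested order
        for nb in neighbors[node]:
            if nb not in dist:            # first time reached = shortest path
                dist[nb] = dist[node] + 1
                queue.append(nb)
    rings = {k: set() for k in range(1, max_order + 1)}
    for node, d in dist.items():
        if d > 0:
            rings[d].add(node)
    return rings

# Hypothetical 3x3 grid with rook (shared-edge) adjacency.
cells = [(r, c) for r in range(3) for c in range(3)]
grid = {(r, c): [(r + dr, c + dc) for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                 if (r + dr, c + dc) in cells] for r, c in cells}

rings = k_minimal_neighbors(grid, (0, 0), 4)
print(rings)
```

Each cell appears in exactly one ring, which is what distinguishes \(k\)-minimal neighbors from "everything within \(k\) steps".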

I plot these for the swing in two-party vote share between 2012 and 2016 below.

import spacf

First, the full spatial autocorrelation plot:

lags = spacf.spacf(votes[['tpv_2016']].values - votes[['tpv_2012']].values,W, order=20)
plt.figure(figsize=(1.6*4*2,4))
plt.plot(lags, linewidth=5, color='slategrey')
plt.title('Spatial ACF', fontsize=20)
plt.hlines(0,0,len(lags), linestyle=':', color='k')
plt.axis([0,25,-.2,1])
plt.show()

png

Interpreting this, we have to move around 16 counties out before the autocorrelation between counties becomes negative. Remember, this statistic considers only \(k\)-minimal neighbors, not all observations within \(k\) steps. If you instead used every county reachable in \(k\) steps or fewer, the sets of higher-order neighbors would contain the sets of lower-order neighbors.

In a time series context, this would be akin to considering both the observation from 2 periods ago and the previous observation in the set of 2nd order neighbors. In contrast, this graph shows the correlation as the set of “considered” counties radiates uniformly outwards from each focal county.
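To show how a zero crossing like this can be read off, here is a sketch that computes \(cor(y, \mathbf{W}^k y)\) on a hypothetical ring of 36 units carrying a smooth synthetic signal. It treats \(\mathbf{W}^k y\) as the plain average over the \(k\)-minimal neighbors, which is my assumption; it is not the `spacf` implementation and the data are made up:

```python
import math
from collections import deque

def ring(neighbors, start, k):
    """Nodes whose shortest-path distance from start is exactly k."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] < k:
            for nb in neighbors[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
    return [node for node, d in dist.items() if d == k]

def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

def spatial_acf(y, neighbors, max_order):
    """cor(y, W^k y) for k = 1..max_order, averaging over k-minimal neighbors."""
    acf = []
    for k in range(1, max_order + 1):
        lag = []
        for i in range(len(y)):
            members = ring(neighbors, i, k)
            lag.append(sum(y[j] for j in members) / len(members))
        acf.append(pearson(y, lag))
    return acf

# Hypothetical data: 36 units on a cycle with a smooth two-frequency signal.
n = 36
cycle = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
y = [math.sin(2 * math.pi * i / n) + 0.5 * math.sin(4 * math.pi * i / n) for i in range(n)]

acf = spatial_acf(y, cycle, 12)
first_negative = next(k for k, r in enumerate(acf, start=1) if r < 0)
print(first_negative)
```

The first order at which the correlation turns negative plays the same role as the "around 16 counties" reading from the plot above: it is a rough radius beyond which units are no longer positively related.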

Thus, a typical “cluster,” in the sense of counties being more related to each other than not, is a subgraph somewhere south of 15 counties in radius. If this seems too big to you, you’re right. We need to account for the whole neighborhood contained within the \(k\)-radius cluster:

Conditional width of cluster size

The partial correlation plot does condition on the neighbors below \(k\)-th order. So, the correlation between the \(k\)-minimal neighbors and the source observations is conditional on the 1st through \((k-1)\)-minimal neighbors. We can use this plot to identify the “order” of the spatial process, if we treat it as a spatial Markov random field.

plags = spacf.sppacf(votes[['tpv_2016']].values -votes[['tpv_2012']].values , W, order=10)
plt.figure(figsize=(1.6*4,4))
plt.plot(plags, linewidth=5, color='slategrey')
plt.title('Partial Spatial ACF')
plt.hlines(0,0,len(plags), linestyle=':', color='k')
plt.show()

png

Interpreting this, we see that the first-order neighbors are typically sufficient to capture the full extent of correlation between counties. Conditional on the first-order neighbors, higher-order neighbor correlation pretty much disappears.
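The way a raw correlation can vanish once lower-order neighbors are conditioned on is easy to see in a toy case. This sketch uses the standard formula for a partial correlation with a single conditioning variable; the three vectors are hypothetical stand-ins for \(y\), \(\mathbf{W}^1 y\), and \(\mathbf{W}^2 y\), built so that the second-order lag is related to \(y\) only through the first-order lag. It is not how `spacf.sppacf` is implemented:

```python
import math

def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))

def partial_corr(x, y, z):
    """Correlation between x and y after conditioning on a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical stand-ins: lag1 drives both y and lag2; the two noise patterns
# are orthogonal to lag1 and to each other.
lag1 = [1, 1, 1, 1, -1, -1, -1, -1]
noise_a = [1, 1, -1, -1, 1, 1, -1, -1]
noise_b = [1, -1, 1, -1, 1, -1, 1, -1]
y = [l + 0.5 * e for l, e in zip(lag1, noise_a)]
lag2 = [l + 0.5 * e for l, e in zip(lag1, noise_b)]

raw = pearson(y, lag2)
conditional = partial_corr(y, lag2, lag1)
print(raw, conditional)
```

The raw correlation between `y` and `lag2` is large, but it drops to essentially zero once `lag1` is accounted for, which mirrors the pattern in the partial spatial ACF plot.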

Spatial Autocorrelation Functions

How are counties related in time?