Back to school – part 1

The improvement of governmental services with data science is widely acknowledged, but success stories in Germany are still scarce. idalab has been involved in various projects in this regard, most recently —together with Technologiestiftung Berlin and Stiftung Neue Verantwortung — regarding the way children get assigned to primary schools.

While seemingly mundane on the surface, those problems actually revolve around some core data science issues. How are catchment areas of primary schools ideally districted in order to minimize the commuting time to school? This problem class is also of high relevance in other areas of application. What are optimal catchment areas for emergency dispatch? Where should new fire stations or supermarkets be set up? Where should sales offices be placed?

This blog post will explore, how far we can delve into this problem using only Open Data. It is the beginning of an internal project to get familiar with the available data and to learn how we can use this and similar datasets in future projects.

Open Data

Though a neglected topic in German policy for long, Open Data has recently been getting more recognition from the governing coalition. After years of standing idle, Germany finally is about to join the Open Government Partnership and the Minister of the Interior will present a draft for an Open Data law as early as this September. For idalab, Open Data has always been a source to fuel our algorithms and models. Just recently we’ve given a workshop on how to use Open Data for predictive modeling at Berlin Open Data Day.

Turning to the problem at hand, primary school districting, there is also quite some data available, although not all is rainbows and unicorns.

This post, as a starter, will be limited to exploring essential Open Data sets that would be relevant to aid the re-definition of catchment areas for primary schools. Expect to read more on this topic from us in the future.

Primary School Catchment Areas

There is a dataset of the catchment areas for primary schools in Berlin’s Data Portal at https://daten.berlin.de/datensaetze/grundschuleinzugsbereiche-geometrien-berlin-2012-mit-demographischen-merkmalen. This dataset has some problems though as we’ll see in a moment. But let’s have a look first.

wget -nv https://www.statistik-berlin-brandenburg.de/opendata/ESB2012_WGS84_EWR2012-12.zip
unzip ESB2012_WGS84_EWR2012-12.zip -d ESB2012_WGS84_EWR2012-12

To explore the datasets we are using R. We mostly use Python for analysis and machine learning at idalab but R has some great libraries for data wrangling and visualization and also has nice capabilities to work this spatial data so we opted for R in this case.

The catchment areas come in the form of a shape file which can be loaded into a SpatialPolygonsDataFrame like this:

catchment_areas = readOGR('ESB2012_WGS84_EWR2012-12/ESB2012_WGS84_EWR2012-12.shp', layer='ESB2012_WGS84_EWR2012-12')
catchment_areas = catchment_areas[,c("ID")]
crs = CRS(proj4string(catchment_areas))
row.names(catchment_areas) = as.character(catchment_areas$ID)
plot(catchment_areas)

catchment-areas

Schools

Coordinates of Schools in Berlin are also available and can be downloaded from http://www.stadtentwicklung.berlin.de/geoinformation/fis-broker/.

The easiest way to do this is using ogr2ogr from GDAL. Here we are saving the data into a GeoJSON file.

ogr2ogr -s_srs EPSG:25833 -t_srs WGS84 -f geoJSON schulen.geojson WFS:"http://fbinter.stadt-berlin.de/fb/wfs/geometry/senstadt/re_schulstand" fis:re_schulstand
Now we can read it into a SpatialPointsDataFrame the same way we did with the shapefile previously. Note that we are reprojecting the data into the same coordinate reference system as the catchment areas, so that we can later join data without problems.
schools = readOGR('schulen.geojson', layer = 'OGRGeoJSON')
schools = schools[schools$SCHULART == 'Grundschule',]
schools_df = schools %>% as.data.frame() %>%
  rename(s_long=coords.x1, s_lat=coords.x2) %>% filter(SCHULART == 'Grundschule') %>%
  mutate(school_id=factor(gsub("G", "", spatial_name))) %>%
  select(school_id, s_lat, s_long, SCHULNAME)
map = get_map('Berlin')
ggmap(map) + geom_point(aes(s_long, s_lat), data=schools_df)

schools

Mapping schools to catchment areas

To see if we can work with this data, we first wanted to understand how primary schools map to catchment areas. Since every school has an identifier called spatial_name in the original dataset that we renamed to school_id and every catchment area also has an ID attribute of almost the right shape — just one letter in the spatial_name needs to be removed — it was obvious to join on this key. A resulting plot connecting every school (black dots) with it’s matching catchment area polygon centroid (red dots) connecting through a dotted line looks like this:
catchment_areas = spTransform(catchment_areas, CRS(proj4string(schools)))
bbox_catchment_areas = bbox(catchment_areas)

catchment_area_centroids = coordinates(catchment_areas) %>% as.data.frame() %>% rename(long=V1, lat=V2)

catchment_area_centroids_df = cbind(catchment_area_centroids, catchment_areas@data) %>% rename(school_id=ID)
catchment_areas_df = fortify(catchment_areas)
ggplot() +
  geom_polygon(aes(long, lat, group=group, order=order), size=0.5, color='red', fill='white', data = catchment_areas_df) +
  geom_segment(aes(x=long, y=lat, xend=s_long, yend=s_lat), size=0.5, alpha=0.6, linetype=2, color='black', data=inner_join(catchment_area_centroids_df, schools_df)) +
  geom_point(aes(long, lat), size=0.5, color='red', data=catchment_area_centroids_df) +
  geom_point(aes(s_long, s_lat), size=0.5, color='black', data=schools_df) +
  scale_color_continuous(low='blue', high='red') +
  coord_map() +
  theme_void() +
  guides(color=F)

catchment-schools-1

But this looks a little odd especially in Berlin Mitte. There are a few dotted lines clearly crossing catchment area borders. We can visualize which schools don’t fall into their own district, to get a better grasp of what’s going on. We’ll also visualize which schools don’t have a corresponding catchment area as blue dots:

schools_df$catchment_area_id = as.character(over(schools, catchment_areas)$ID)

ggplot() +
  geom_polygon(aes(long, lat, group=group, order=order), size=0.5, color='red', fill='white', data = catchment_areas_df) +

  geom_point(aes(long, lat), size=0.5, color='red', data=catchment_area_centroids_df) +
  geom_point(aes(s_long, s_lat), size=0.5, color='black', alpha=0.4, data=schools_df %>% filter(school_id==catchment_area_id)) +
  
  # shade
  geom_rect(aes(xmin=X1,ymin=X2,xmax=X3,ymax=X4), fill='white', alpha=0.8, data=data.frame(matrix(bbox_catchment_areas, 1, 4))) +

  # lines
  geom_segment(aes(x=long, y=lat, xend=s_long, yend=s_lat), size=0.5, alpha=0.6, linetype=2, color='black', data=inner_join(catchment_area_centroids_df, schools_df) %>% filter(school_id!=catchment_area_id)) +
  
  # area centers
  geom_point(aes(long, lat), size=0.5, color='red', data=inner_join(catchment_area_centroids_df, schools_df) %>% filter(school_id!=catchment_area_id)) +

  # schools
  geom_point(aes(s_long, s_lat), size=0.5, color='black', data=left_join(schools_df, catchment_area_centroids_df) %>% filter(!is.na(long) & school_id!=catchment_area_id)) +

  # orphaned schools
  geom_point(aes(s_long, s_lat), size=0.5, color='blue', data=left_join(schools_df, catchment_area_centroids_df) %>% filter(is.na(long))) +

  scale_color_continuous(low='blue', high='red') +
  coord_map() +
  theme_void() +
  guides(color=F)

catchment-schools-2

A few schools are really on the border of their catchment area so they are caught by the algorithm as false positives. But especially in Mitte there is something rotten. A few lines just seem too long. Let’s have an even closer look:

schools_df_wrong_ca_df = left_join(schools_df, catchment_area_centroids_df) %>%
  filter(!is.na(long) & school_id!=catchment_area_id) %>%
  rowwise() %>%
  mutate(air_dist=distHaversine(c(long, lat), c(s_long, s_lat))) %>%
  filter(air_dist > 1500) %>%
  ungroup()
schools_df_wrong_ca_df %>%
  ggplot() +
    geom_polygon(aes(long, lat, group=group, order=order), size=0.5, color='red', fill='white', data = catchment_areas_df) +
  
  # shade
  geom_rect(aes(xmin=X1,ymin=X2,xmax=X3,ymax=X4), fill='white', alpha=0.8, data=data.frame(matrix(bbox_catchment_areas, 1, 4))) +

  # lines
  geom_segment(aes(x=long, y=lat, xend=s_long, yend=s_lat), size=0.5, alpha=0.6, linetype=2, color='black') +
  
  # area centers
  geom_point(aes(long, lat), size=0.5, color='red') +

  # Schools
  geom_point(aes(s_long, s_lat), size=0.5, color='black') +
  geom_text_repel(aes(s_long, s_lat, label=SCHULNAME), size=2.6, color='black') +
  coord_map(xlim=c(min(schools_df_wrong_ca_df$long)-0.01, max(schools_df_wrong_ca_df$long)+0.01), ylim=c(min(schools_df_wrong_ca_df$lat)-0.01, max(schools_df_wrong_ca_df$lat)+0.01)) +
  theme_void() +
  guides(color=F)
catchment-schools-3

 

Does Open Data get updated if reality changes?

One thing that could have happened here is, that the world kept turning while the catchment areas dataset just started aging and not in a graceful way. But the locations of schools are also quite old. The last update happened in April 2013. Maybe there is something else going on. Also how should we handle all those schools where the identifiers don’t match to catchment areas? This can only be solved by talking to people from the respective administration. As often, just having the data is rarely sufficient. Access to domain knowledge is crucial.

Is all hope lost?

This of course doesn’t stop us. We’re trying to get an updated data set from the “Amt für Statistik Berlin-Brandenburg” right now and there is another source of relevant data available. The city of Berlin provides a form for parents to enter their address into and get a list of schools in their proximity. Data returned from that form should be correct.

Can we use this for our purposes? Sure we can! In fact we have already submitted this form 306019 times. This will be the starting point for the next blogpost in this series. Stay tuned!

scraped

Contact the author
Daniel Kirsch
+49 (30) 814 513-16
Subscribe
Share

Leave a Comment

Your email address will not be published. Required fields are marked *