Introduction

GeoSpark is a cluster computing system for processing large-scale spatial data. GeoSpark extends Apache Spark / SparkSQL with a set of out-of-the-box Spatial Resilient Distributed Datasets (SRDDs) / SpatialSQL that efficiently load, process, and analyze large-scale spatial data across machines.

Source code

I implemented GeoSpark on top of Apache Spark and SparkSQL. The source code is hosted on GitHub: Source code, Project website

Reputation

  • GeoSpark is the de facto spatial data processing framework on top of Apache Spark.

  • GeoSpark was listed on the official Apache Spark Third Party Projects list from Sept. 2016. The link was removed in Aug. 2018 due to a conflict with the Spark trademark (see this commit).

  • GeoSpark has > 200K overall website visits and > 10K monthly downloads.

  • Users and contributors include Facebook, Apple, Uber, MoBike, and numerous startups

  • GeoSpark in production (video), from Gyana, a British Location Intelligence company

  • GeoSpark was evaluated in the PVLDB 2018 paper How Good Are Modern Spatial Analytics Systems?, written by Varun Pandey, Andreas Kipf, Thomas Neumann, and Alfons Kemper (Technical University of Munich), which states:

    GeoSpark comes close to a complete spatial analytics system. It also exhibits the best performance in most cases.

Publications

I published 8 papers under this project.

  • Demonstrating GeoSparkSim: A Scalable Microscopic Road Network Traffic Simulator Based on Apache Spark (Demo paper)
    • Zishan Fu, Jia Yu, Mohamed Sarwat. SSTD, 2019
  • Building a Large-Scale Microscopic Road Network Traffic Simulator in Apache Spark (Research paper)
    • Zishan Fu, Jia Yu, Mohamed Sarwat. MDM, 2019
  • GeoSparkViz in Action: A Data System with built-in support for Geospatial Visualization (Demo paper)
    • Jia Yu, Anique Tahir, Mohamed Sarwat. ICDE, 2019
  • Geospatial Data Management in Apache Spark (Tutorial)
    • Jia Yu, Mohamed Sarwat. ICDE, 2019
  • Spatial Data Management in Apache Spark: The GeoSpark Perspective and Beyond (Research paper)
    • Jia Yu, Zongsi Zhang, Mohamed Sarwat. Geoinformatica Journal, 2018
  • GeoSparkViz: A Scalable Geospatial Data Visualization Framework in the Apache Spark Ecosystem (Research paper)
    • Jia Yu, Zongsi Zhang, Mohamed Sarwat. In Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM 2018
  • A Demonstration of GeoSpark: A Cluster Computing Framework for Processing Big Spatial Data (Demo paper)
    • Jia Yu, Jinxuan Wu, Mohamed Sarwat. In Proceedings of the IEEE International Conference on Data Engineering, ICDE 2016
  • GeoSpark: A Cluster Computing Framework for Processing Large-Scale Spatial Data (Short paper)
    • Jia Yu, Jinxuan Wu, Mohamed Sarwat. In Proceedings of ACM International Conference on Advances in Geographic Information Systems, ACM SIGSPATIAL GIS 2015