Geospatial Data Management in Apache Spark: A Tutorial


This paper introduces GeoSpark an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: Apache Spark Layer, Spatial RDD Layer and Spatial Query Processing Layer. Apache Spark Layer provides basic Spark functionalities that include loading / storing data to disk as well as regular RDD operations. Spatial RDD Layer consists of three novel Spatial Resilient Distributed Datasets (SRDDs) which extend regular Apache Spark RDDs to support geometrical and spatial objects. GeoSpark provides a geometrical operations library that accesses Spatial RDDs to perform basic geometrical operations (e.g., Overlap, Intersect). System users can leverage the newly defined SRDDs to effectively develop spatial data processing programs in Spark. The Spatial Query Processing Layer efficiently executes spatial query processing algorithms (e.g., Spatial Range, Join, KNN query) on SRDDs. GeoSpark also allows users to create a spatial index (e.g., R-tree, Quad-tree ) that boosts spatial data processing performance in each SRDD partition. Preliminary experiments show that GeoSpark achieves better run time performance than its Hadoop-based counterparts (e.g., SpatialHadoop).

In IEEE International Conference on Data Engineering, ICDE
Jia Yu
Jia Yu

Jia Yu is a co-founder of Wherobots Inc. and leads its engineering team. Jia is the creator of Apache Sedona and was a Tenure-Track Assistant Professor of Computer Science at Washington State University from 2020 to 2023. Jia’s research interests include database systems, distributed data systems and geospatial data management.

Mohamed Sarwat
Mohamed Sarwat
Assistant Professor

Mohamed Sarwat is an assistant professor of computer science at Arizona State University. His general research interest lies in developing robust and scalable data systems for spatial and spatiotemporal applications.