Research | Jia Yu (于嘉)

Background

The volume of geospatial data has increased tremendously. Such data includes, but is not limited to, weather maps, Internet-of-Things sensor readings, and geo-tagged social media posts. Many data-intensive geospatial analytics applications, such as Machine Learning algorithms, rely heavily on underlying data infrastructures such as database management systems (DBMS) to efficiently manipulate, retrieve, and manage data. Unfortunately, classic data management systems, such as MySQL, PostgreSQL, PostGIS, and ArcGIS, suffer a significant performance drop when handling large-scale geospatial data.

Agenda

My research focuses on crafting database systems to accelerate large-scale geospatial data analytics. In particular, I am interested in

building large-scale / distributed data systems for geospatial data and data streams. This will involve substantial changes to existing big data systems such as Apache Hadoop, Spark, Flink, Storm, and Kafka.
designing Machine Learning-enhanced spatial data structures, such as indices or new physical data layouts, to facilitate spatial query processing. As a result, users can obtain analysis results faster and at a lower storage cost.
creating visualization techniques for geospatial data and data streams. Interactive visualization interfaces such as Google Maps will then be able to refresh every minute or even every second to reflect the actual movement of millions of spatial objects.

Philosophy

System-oriented research. Building data systems that really work benefits both academia and industry. My open-source system Apache Sedona is one of the most popular spatial data systems on top of Apache Spark and has helped many companies.
Research collaboration. Seeking knowledge from, and collaborating with, experts in different places is the way to recognize and solve challenging problems. In the past, I have collaborated with or worked at Microsoft Research, IBM Almaden Research Center, and Apple.
Diversity of research areas. Working in several research areas gives a broader view of interdisciplinary opportunities and inspires more practical research ideas. My current interdisciplinary research, which connects database systems and GIS, contributes to a range of related disciplines such as geography and urban planning.

The “ecosystem” of my research

I have worked on several projects in two research streams: large-scale geospatial data management and lightweight database indexes (for regular data and spatial data). Over time, these research projects have grown into two “ecosystems”. I am proud of my contributions to the community!

A full-fledged big geospatial data analytics system that provides
- Data generation (GeoSparkSim, MDM 2019)
- Data management and query processing (Apache Sedona, formerly GeoSpark, GeoInformatica 2019)
- Visualization (GeoSparkViz, SSDBM 2018)
- Middleware for interactive analytics front-end (Tabula, ICDE 2020)
Lightweight data indexing techniques for different real-world scenarios including
- Sparse index for disk-oriented databases (Hippo, VLDB 2016; Hippo-Spatial, SSTD 2017)
- Machine Learning-based index for clustered attributes with intensive data updates (ALEX, SIGMOD 2020, with MSR and MIT)
- Machine Learning-based index for non-clustered attributes (Hermit, SIGMOD 2019, with IBM Almaden)