Biography

Jia Yu is an Assistant Professor in the School of Electrical Engineering and Computer Science at Washington State University. He obtained his Ph.D. in Computer Science from Arizona State University (advisor: Mohamed Sarwat) in Summer 2020. His research focuses on large-scale database systems and geospatial data management. In particular, he has worked on distributed geospatial data management systems, database indexing, and geospatial data visualization. Jia’s research has appeared in the most prestigious database and GIS conferences and journals, including SIGMOD, VLDB, ICDE, SSTD, and the VLDB Journal. He is the main contributor to several open-source research projects such as Apache Sedona (incubating), a cluster computing framework for processing big spatial data, which receives 200,000 downloads per month and has users and contributors from major companies (e.g., Facebook, Uber, AT&T, and MoBike).

Here is a one-page summary of my research.

I am actively looking for Computer Science PhD students to join me in Spring 2021. Please read this page.

Call For Papers and Call For Participation

News

Interests

  • Database systems
  • Distributed data systems
  • Geospatial data management

Education

  • Ph.D. in Computer Science, 2020

    Arizona State University

  • BEng in Software Engineering, Outstanding Graduate, 2013

    Northwest Agriculture and Forestry University, China (西北农林科技大学)

Experience

Research Intern

Microsoft Research, Database group

Jun 2019 – Aug 2019 Redmond, Washington
– Microsoft is the birthplace of Microsoft SQL Server
– Mentor / Collaborators: Umar Farooq Minhas, David Lomet, Jaeyoung Do, Yinan Li, Chi Wang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann
– I worked on a realistic design of updatable learned indices
SIGMOD 2020 research paper ALEX: An Updatable Adaptive Learned Index

Research Intern

IBM Almaden Research Center, Database group

May 2018 – Aug 2018 San Jose, California
– IBM Almaden is the birthplace of the relational model, SQL, and the DB2 DBMS
– Mentor / Collaborators: Vijayshankar Raman, Yingjun Wu, Yuanyuan Tian, Ronald Barber, Richard Sidle
– I participated in the Hermit project to design a succinct secondary index. I also explored code generation issues on compressed database tables and implemented a preliminary code generator with JIT execution using LLVM for IBM's HTAP system
SIGMOD 2019 research paper Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations
VLDB 2019 demo paper HERMIT in action: Succinct secondary indexing mechanism via correlation exploration

Software Development Intern

Apple, Maps team

Jun 2016 – Aug 2016 Cupertino, California
– Apple is the birthplace of Apple Maps
– Mentor: Huang-Hsiang Cheng; Manager: Alex Radeski
– I deployed and improved distributed computing frameworks and resource management systems such as Apache Spark and Apache Mesos. I also developed internal evaluation tools to assist large-scale geospatial analysis

Projects

ALEX

ALEX is a new class of learned indexes that addresses issues arising when implementing dynamic and updatable learned indexes.

Tabula

Tabula is a middleware that runs on top of a SQL data system with the purpose of increasing the interactivity of geospatial visualization dashboards.

GeoSparkSim

GeoSparkSim is a scalable traffic simulator that extends Apache Spark to generate large-scale road network traffic datasets with microscopic traffic simulation.

Hermit

Hermit is a succinct secondary indexing mechanism for modern RDBMSs. It judiciously leverages the rich soft functional dependencies hidden among columns to prune out redundant structures for indexed key access.

GeoSparkViz

GeoSparkViz is a large-scale geospatial map visualization framework. GeoSparkViz extends Apache Spark to provide native support for general cartographic design.

Hippo

Hippo is a fast yet scalable database indexing approach. It significantly shrinks index storage and reduces maintenance overhead without compromising much on query execution performance.

GeoSpark

GeoSpark is a cluster computing system for processing large-scale spatial data. GeoSpark extends Apache Spark / SparkSQL to efficiently load, process, and analyze large-scale spatial data across machines.

Awards

Third Place of Student Research Competition

Student Travel Grant

IEEE ICDE (3 times), ACM SIGSPATIAL (5 times: 4 from NSF, 1 from Microsoft)

Outstanding graduate

Only 200 out of 5600 students were selected

First-class Scholarship, Merit Student

Twice; only the top 10% of students (by GPA) were selected

Services

Program Committee member

ACM SIGSPATIAL 2020

Invited reviewer

VLDB Journal (VLDBJ)
ACM Transactions on Spatial Algorithms and Systems (TSAS)
International Journal of Geographical Information Science (IJGIS)
Geoinformatica Journal
IEEE Transactions on Cloud Computing (TCC)
Computers and Geosciences (CAGEOS)
IEEE Transactions on Parallel and Distributed Systems (TPDS)
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Frontiers in Big Data
See certificate

External reviewer

SIGMOD: 2017, 2018, 2019
SIGMOD demo: 2016, 2018
PVLDB: 2016, 2017, 2018, 2019, 2020
ICDE: 2020
ICDE demo: 2017, 2018
SIGSPATIAL: 2016, 2017, 2018
SSTD: 2017
MDM: 2016

Teaching

CptS 415 Big Data

Instructor, Senior undergraduate level, Computer Science, Washington State University

CSE 511 Data Processing at Scale

Instructor, Graduate level, Computer Science, Arizona State University

ASU Online Master of Computer Science - Data Systems

Designer, Graduate level, Computer Science, Coursera (over 10,000 learners)

Recent & Upcoming Talks

Slides of my talks are usually available unless restricted by non-disclosure agreements

Spatial Data Wrangling With GeoSpark - A Step-by-Step Tutorial
GeoSpark and Geospatial Data Management in Apache Spark
ALEX - An Updatable Learned Index
Designing Succinct Secondary Indexes by Exploiting Column Correlations

Contact