Researchers and practitioners have widely studied road network traffic data in areas such as urban planning, traffic prediction, and spatial-temporal databases. For instance, researchers use such data to evaluate the impact of road network changes. Unfortunately, collecting large-scale, high-quality urban traffic data requires tremendous effort because participating vehicles must install Global Positioning System (GPS) receivers and administrators must continuously monitor these devices. Several urban traffic simulators attempt to generate such data with different features. However, they suffer from two critical issues: (1) Scalability: most of them offer only a single-machine solution, which is not adequate for producing large-scale data. Some simulators can generate traffic in parallel but do not balance the load well among the machines in a cluster. (2) Granularity: many simulators do not consider microscopic traffic situations, including traffic lights, lane changing, and car following. This paper proposes GeoSparkSim, a scalable traffic simulator that extends Apache Spark to generate large-scale road network traffic datasets with microscopic traffic simulation. The proposed system seamlessly integrates with a Spark-based spatial data management system, GeoSpark, to deliver a holistic approach that allows data scientists to simulate, analyze, and visualize large-scale urban traffic data. To implement microscopic traffic models, GeoSparkSim employs a simulation-aware vehicle partitioning method to partition vehicles among different machines such that each machine has a balanced workload. The experimental analysis shows that GeoSparkSim can simulate the movements of 300 thousand vehicles over a very large road network (250 thousand road junctions and 300 thousand road segments) and outperforms existing competitors.