TP5.2: Geospatial Big Data Exploration

Subproject manager
Prof. Alfons Kemper, Ph.D.
Varun Pandey

In the past few years, massive amounts of location-based data have been captured, and numerous datasets containing user location information are readily available to the public. Analyzing such datasets can yield insights into the mobility patterns and behaviors of users. Moreover, a number of geospatial data-driven companies such as Uber, Lyft, and Foursquare have emerged in recent years. Real-time analysis of geospatial data is essential and enables an emerging class of applications.

Research areas such as machine learning and data mining have advanced rapidly, which can be attributed to the growth of the database industry and to advances in data analysis research. This has created a need for systems that can extract useful information and knowledge from data. Data scientists use various data mining tools on top of databases for this purpose. To achieve lower latencies and minimize transmission costs between the database and external tools, computation has to move closer to the data. The current trend in database research is therefore to integrate the analytical functionality that is useful for knowledge discovery into the database kernel. The goal is a full-fledged general-purpose database that supports big data analysis alongside conventional transaction processing.

The aim of this subproject is to integrate analytical functionality for geospatial data into a main-memory database system to facilitate big data analysis. We have developed a prototype, HyPerSpace [1], a geospatial extension for HyPer, which was presented at SIGMOD 2016. In addition to the geospatial extension, we developed a web-based user interface called HyPerMaps. Together, HyPerSpace and HyPerMaps allow a user to interactively explore the New York Taxi dataset.

Figure 1: HyPerMaps

Following recent research trends [2, 3, 4, 5, 6], our aim is to introduce a full geospatial extension to HyPer that will include range queries, distance queries, geospatial joins, k-nearest-neighbor (kNN) queries, and kNN joins.
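To make the planned query types concrete, the following Python sketch spells out their semantics on a small in-memory point set. It is only an illustration under stated assumptions, not how HyPer or HyPerSpace implements them: the function names, the (id, x, y) point layout, and the use of planar Euclidean distance are all hypothetical, and every query is evaluated by brute force, without the spatial indexes a real system would use.

```python
import heapq
import math
from typing import Dict, List, Tuple

# Hypothetical layout: a point is (id, x, y) in a planar coordinate system.
Point = Tuple[int, float, float]
# Hypothetical layout: a region is (id, min_x, min_y, max_x, max_y).
Region = Tuple[int, float, float, float, float]


def dist(p: Point, qx: float, qy: float) -> float:
    """Planar Euclidean distance (an assumption; real data may need geodesic distance)."""
    return math.hypot(p[1] - qx, p[2] - qy)


def range_query(points: List[Point], min_x: float, min_y: float,
                max_x: float, max_y: float) -> List[Point]:
    """Range query: all points inside an axis-aligned rectangle (borders included)."""
    return [p for p in points
            if min_x <= p[1] <= max_x and min_y <= p[2] <= max_y]


def distance_query(points: List[Point], qx: float, qy: float,
                   radius: float) -> List[Point]:
    """Distance query: all points within `radius` of the query location."""
    return [p for p in points if dist(p, qx, qy) <= radius]


def spatial_join(points: List[Point], regions: List[Region]) -> List[Tuple[int, int]]:
    """Geospatial join: (point id, region id) pairs where the point lies in the region."""
    return [(p[0], r[0]) for p in points for r in regions
            if r[1] <= p[1] <= r[3] and r[2] <= p[2] <= r[4]]


def knn_query(points: List[Point], qx: float, qy: float, k: int) -> List[Point]:
    """kNN query: the k points closest to the query location, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: dist(p, qx, qy))


def knn_join(left: List[Point], right: List[Point], k: int) -> Dict[int, List[int]]:
    """kNN join: for every point in `left`, the ids of its k nearest points in `right`."""
    return {l[0]: [r[0] for r in knn_query(right, l[1], l[2], k)] for l in left}
```

Each function scans all points, so a join costs time proportional to the product of the two input sizes. That cost is what makes index structures and in-kernel execution attractive for large datasets such as the taxi data mentioned above.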

[1] Pandey, V., Kipf, A., Vorona, D., Mühlbauer, T., Neumann, T. and Kemper, A., 2016, June. High-Performance Geospatial Analytics in HyPerSpace. In Proceedings of the 2016 International Conference on Management of Data (pp. 2145-2148). ACM.
[2] Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X. and Saltz, J., 2013. Hadoop GIS: A high performance spatial data warehousing system over MapReduce. Proceedings of the VLDB Endowment, 6(11), pp.1009-1020.
[3] Eldawy, A. and Mokbel, M.F., 2015. SpatialHadoop: A MapReduce framework for spatial data. In ICDE.
[4] You, S., Zhang, J. and Gruenwald, L., 2015, September. Spatial join query processing in cloud: Analyzing design choices and performance comparisons. In Parallel Processing Workshops (ICPPW), 2015 44th International Conference on (pp. 90-97). IEEE.
[5] Yu, J., Wu, J. and Sarwat, M., 2015, November. GeoSpark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (p. 70). ACM.
[6] Xie, D., Li, F., Yao, B., Li, G., Zhou, L. and Guo, M., 2016, June. Simba: Efficient in-memory spatial analytics. In Proceedings of the 2016 International Conference on Management of Data (pp. 1071-1085). ACM.