The VLDB Journal (IF 2.904), Pub Date: 2019-12-20, DOI: 10.1007/s00778-019-00594-5
Silu Huang, Liqi Xu, Jialin Liu, Aaron J. Elmore, Aditya Parameswaran
Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that “bolts on” versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database “for free.” We develop and evaluate multiple data models for representing versioned data, as well as a lightweight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average \(10^3\times\) faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to \(20\times\) relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by \(10\times\) on average.
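To make the idea of bolting versioning onto a relational database more concrete, the following is a minimal sketch, not OrpheusDB's actual schema or API. It assumes one possible data model in which records are stored once in a shared table and a separate index table maps each version to the records it contains. The table names (`records`, `version_records`), the columns, and the functions `make_store`, `commit_version`, and `checkout` are all hypothetical and chosen for illustration only. It uses SQLite through Python's standard `sqlite3` module, and it does not implement LyreSplit or any partitioning.

```python
import sqlite3


def make_store():
    """Create an in-memory database with a shared record table and a
    version-to-record index (an assumed, illustrative data model)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE records (
            rid   INTEGER PRIMARY KEY,  -- record id, assigned by SQLite
            name  TEXT,
            score REAL
        );
        CREATE TABLE version_records (
            vid INTEGER,                -- version id
            rid INTEGER,                -- record belonging to that version
            PRIMARY KEY (vid, rid)
        );
        """
    )
    return conn


def commit_version(conn, vid, new_rows, inherited_rids=()):
    """Register version `vid`: reuse inherited records by id and insert
    only the rows that are new. Returns the record ids of the version."""
    rids = list(inherited_rids)
    for name, score in new_rows:
        cur = conn.execute(
            "INSERT INTO records (name, score) VALUES (?, ?)", (name, score)
        )
        rids.append(cur.lastrowid)
    conn.executemany(
        "INSERT INTO version_records (vid, rid) VALUES (?, ?)",
        [(vid, rid) for rid in rids],
    )
    return rids


def checkout(conn, vid):
    """Retrieve all rows of version `vid` with an ordinary SQL join."""
    return conn.execute(
        """
        SELECT r.name, r.score
        FROM records AS r
        JOIN version_records AS v ON r.rid = v.rid
        WHERE v.vid = ?
        ORDER BY r.rid
        """,
        (vid,),
    ).fetchall()


if __name__ == "__main__":
    conn = make_store()
    v1 = commit_version(conn, 1, [("a", 1.0), ("b", 2.0)])
    # Version 2 keeps the first record of version 1 and adds one new row.
    commit_version(conn, 2, [("c", 3.0)], inherited_rids=v1[:1])
    print(checkout(conn, 1))
    print(checkout(conn, 2))
```

In this sketch, records shared between versions are stored only once, and because the versioned data lives in ordinary tables, any SQL the database supports can be run across versions. Version retrieval costs a join against the index table, which is the kind of latency that the abstract says partitioning is meant to reduce.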