This interview is part of the Decibel OSS Spotlight series, where we showcase founders of fast-growing, community-led projects that solve unique problems and enjoy strong community adoption.
Sudip Chakrabarti spoke to Sarah Wooders and Paras Jain, co-creators of Skyplane, the open source project that enables blazingly fast bulk data transfers between cloud object stores. Skyplane provisions a fleet of VMs in the cloud to transfer data in parallel, using compression and bandwidth tiering to reduce cost.
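The mechanism described above lends itself to a quick back-of-the-envelope model. The sketch below is purely illustrative (the throughput and compression figures are invented, and this is not Skyplane's actual planner), but it shows why striping a transfer across a fleet of VMs and compressing the payload compound into a large speedup:

```python
# Back-of-the-envelope model of a parallel, compressed bulk transfer.
# All numbers are illustrative assumptions, not Skyplane measurements.

def transfer_time_s(data_gb, num_vms, per_vm_gbps, compression_ratio):
    """Seconds to move `data_gb` gigabytes when each of `num_vms`
    relay VMs sustains `per_vm_gbps` gigabits/s and compression
    shrinks the payload by `compression_ratio` (2.0 = half the bytes)."""
    effective_gb = data_gb / compression_ratio
    aggregate_gbps = num_vms * per_vm_gbps
    return (effective_gb * 8) / aggregate_gbps  # gigabytes -> gigabits

# 1 TB over a single 5 Gbps stream vs. 8 VMs with 2x compression.
baseline = transfer_time_s(1000, num_vms=1, per_vm_gbps=5, compression_ratio=1.0)
parallel = transfer_time_s(1000, num_vms=8, per_vm_gbps=5, compression_ratio=2.0)
print(f"baseline: {baseline:.0f}s, parallel+compressed: {parallel:.0f}s")
```

Under these toy assumptions the fleet finishes in 100 seconds versus roughly 27 minutes for the single stream, a 16x gain from parallelism and compression alone.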
Sarah and Paras shared with us their inspiration behind creating Skyplane and their vision to make it a widely adopted project.
Sarah: I am a third-year PhD student at the Sky Computing Lab at UC Berkeley, where I work with Professors Ion Stoica and Joey Gonzalez. The previous incarnations of the Sky Computing Lab were the RISELab and the AMPLab, where projects like Ray and Spark were developed. Prior to attending Berkeley, I studied at MIT. After finishing my undergraduate studies, I started a YC company called GlistenAI, which did automated product tagging for e-commerce, using computer vision and NLP to categorize products by color and product type. That experience sparked my interest in data management, as well as in how to leverage best-of-breed services across clouds for ML applications.
Paras: I am also a PhD student at the Sky Computing Lab, working with Ion and Joey. Before my PhD, I was a founding engineer at DeepScale, a startup building autonomous vehicle perception systems. In 2018, DeepScale was acquired by Tesla, after which I started my PhD at Berkeley. My research is at the intersection of ML/AI and systems - I work on Large Language Model systems and how to efficiently train and serve these models at scale. This involves working with big datasets and understanding the robustness challenges of building such large models. However, I found the tools for working with these datasets painful to use and inefficient, which led me to focus on making data infrastructure more efficient. To address these real-world problems, I started working on the Skyplane project with Sarah. Our aim is to make the transfer of big datasets much faster and cheaper.
Sarah: We started the Skyplane project about a year and a half ago. Besides Paras and myself, two other graduate students are on the core team for the project - Shu Liu and Simon Mo. We also collaborate with a few other graduate students and get help from several undergraduate students.
The goal of the Sky Computing Lab is to remove the barriers of vendor lock-in so that multi-cloud computing becomes a reality. This mission resonated with me because in my previous work in applied ML, I often had to switch between different cloud providers like AWS and GCP. Each provider had its own unique services, hardware and offerings, making it difficult to decide which one to use. I was drawn to the idea of being able to easily combine the best-of-breed services from different clouds without constraints.
Within this problem space, the biggest challenge we identified was data gravity. While it's relatively easy to launch instances or call applications across clouds, accessing and moving data between different cloud providers and regions becomes a complex problem. To enable true multi-cloud computing, we realized that addressing this data gravity issue was crucial. This led us to develop Skyplane, initially a simple tool for data synchronization across clouds, but with the larger vision of providing a unified data layer that goes beyond bulk transfers. Our goal is to create a shared data layer and a unified storage interface for the Sky Computing ecosystem.
Sarah: Skyplane provides fast and cost-effective data transfers across different cloud providers. Whether you need to move a terabyte of data from AWS to GCP or between multiple AWS regions, Skyplane offers a simple CLI tool and Python API. Using Skyplane is significantly faster and more affordable than cloud-specific tools like AWS DataSync. Most importantly, Skyplane is really fast - blazing fast! In our benchmarks, Skyplane has been up to 110x faster than AWS DataSync.
Currently, the only functionality we have released is the point-to-point transfer feature, which supports transferring data between a single source and destination region. As a result, most of the use cases we have seen so far revolve around data migrations. Additionally, some users have found value in using Skyplane when they need to process data stored in one provider (e.g., an S3 bucket) using resources on a different provider (e.g., VMs on GCP). For instance, we had a user who regularly needed to move data from S3 to GCP for processing because their permissions for creating VMs were on GCP. Since they had a large volume of data, Skyplane was the perfect solution for them. We are working on releasing multicast functionality soon, which will significantly expand the use cases.
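To see why multicast is a meaningful step beyond point-to-point transfers, consider a toy cost comparison between replicating an object directly to every destination and fanning out through a relay region. The region names and per-GB egress prices below are made-up placeholders, not real cloud rates:

```python
# Toy comparison of two ways to replicate one object to many regions.
# Per-GB egress prices are invented placeholders, not real cloud pricing.

EGRESS_USD_PER_GB = {
    ("us-east", "eu-west"): 0.09,
    ("us-east", "ap-south"): 0.09,
    ("us-east", "ap-east"): 0.09,
    ("eu-west", "ap-south"): 0.05,
    ("eu-west", "ap-east"): 0.05,
}

def direct_cost(src, dests, gb):
    """Source sends a full copy straight to every destination."""
    return sum(EGRESS_USD_PER_GB[(src, d)] * gb for d in dests)

def relay_cost(src, relay, dests, gb):
    """Source sends one copy to `relay`, which forwards it to the rest."""
    cost = EGRESS_USD_PER_GB[(src, relay)] * gb
    cost += sum(EGRESS_USD_PER_GB[(relay, d)] * gb for d in dests if d != relay)
    return cost

dests = ["eu-west", "ap-south", "ap-east"]
print(round(direct_cost("us-east", dests, 100), 2))
print(round(relay_cost("us-east", "eu-west", dests, 100), 2))
```

With these placeholder prices, replicating 100 GB directly costs $27 while fanning out through the cheaper relay costs $19; with more destinations or a deeper distribution tree the gap widens.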
Overall, our goal is to allow data to be accessible from any provider or region to enable multi-cloud and multi-region applications. We’re doing this by first solving the problem of data gravity, and next solving the problem of data management across clouds.
Paras: Yes, indeed. While generic LLMs are great, people are quickly realizing that, to accomplish a specific task they need to connect and communicate with other data sources within their company. This involves gathering data from diverse platforms like Snowflake, S3, GitHub, Google Docs, and Notion, and using LLMs to embed and store these features in a vector store or feature store, enabling convenient querying in the future. This requires a robust data platform spanning multiple clouds and cloud services, something that Skyplane offers.
In addition, as models continue to grow in size, we are all noticing a significant challenge in their mobility. Deploying these large models across multiple regions becomes slow and costly due to high cross-region egress fees. We are also seeing an increasing trend of cross-region training, where high-throughput, cost-effective data transfer is essential. Finding efficient ways to move and distribute these models is crucial to overcoming these hurdles.
Finally, ChatGPT has transformed the way people interact with models. With users now expecting real-time, word-by-word responses, the demand for low latency has become more stringent. This presents a unique challenge of serving models closer to users to ensure a high-quality user experience. Looking ahead, we envision Skyplane being utilized for global deployments of these products. While training the model might occur in one region, distributing it to various continents becomes crucial to achieve low latency for users in India, China, the EU, and beyond. It's like creating a CDN for models, ensuring consistent, high-quality user experiences regardless of geographical location.
Sarah: Skyplane uses overlay networking, a technique that isn't new and has long been used to enhance network resilience. In the cloud, however, we have the flexibility to create virtual machines wherever we need them - AWS, GCP, Azure, or on-premises. This allows us to establish overlay networks that span multiple clouds. And once I have that overlay network, I can configure the routing of the data to optimize both throughput and cost. That is really the key breakthrough behind Skyplane. In addition, to achieve efficient data routing, we use algorithms for unicast and multicast transmissions, aiming for the lowest cost and highest performance possible.
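The routing idea Sarah describes can be sketched as a tiny "widest path" search over a toy throughput grid: a path's effective speed is its slowest hop, so a two-hop route through a relay region can beat the direct link. The regions, link speeds, and brute-force search below are illustrative assumptions, not Skyplane's actual profiling data or algorithm:

```python
# Toy overlay routing: pick the path whose bottleneck link is widest.
# Region names and Gbps figures are illustrative, not measured data.

THROUGHPUT_GBPS = {  # direct link throughput between region pairs
    ("aws:us-east", "gcp:eu-west"): 4.0,
    ("aws:us-east", "azure:us-east"): 9.0,
    ("azure:us-east", "gcp:eu-west"): 8.0,
}

REGIONS = ["aws:us-east", "azure:us-east", "gcp:eu-west"]

def bottleneck(path):
    """A path's throughput is limited by its slowest hop."""
    return min(THROUGHPUT_GBPS[(a, b)] for a, b in zip(path, path[1:]))

def best_path(src, dst):
    """Brute-force over direct and one-relay routes (fine for a toy)."""
    candidates = [(src, dst)]
    for relay in REGIONS:
        if relay not in (src, dst):
            candidates.append((src, relay, dst))
    return max(candidates, key=bottleneck)

path = best_path("aws:us-east", "gcp:eu-west")
print(path, bottleneck(path))
```

Here the direct link runs at 4 Gbps, but relaying through the third region yields an 8 Gbps bottleneck, so the overlay route wins; a real planner must also weigh the extra egress cost the relay hop introduces.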
Paras: As for why this approach hasn't been tried before, when we started working on this, we were similarly surprised. We couldn’t believe that the state of the art was so slow, the user experience so poor and the costs so high. When we dug deeper, we found that existing tools were built only to move data into a cloud, not out of it. I imagine that it is not in AWS’s interest to make it easy to get data out of AWS. Also, the tools were built for point-to-point transfer by specific vendors. No one had really looked at building a source/destination-agnostic data transfer tool by looking at the problem holistically and applying a data-driven approach to optimize for both performance and cost.
Sarah: We open sourced Skyplane with the intention of understanding why people are moving data across regions and clouds. This allows us to gain insights into various workloads and data requirements. Additionally, our benchmarks, known as throughput grids, help us profile the performance between different cloud region pairs. By analyzing the throughput, we can route data through the highest performing paths and avoid slower links between regions. Having more users of Skyplane provides the advantage of collecting anonymous metrics, such as transfer speeds and region pairs. This data enables us to improve transfer performance for everyone using the tool.
Paras: One thing I've learned is the importance of gathering user feedback to continually optimize our product. It's amazing to see how open sourcing has fostered collaboration and innovation in the development of new file systems and distributed storage solutions. Our goal is to make Skyplane universally accessible, high-performance, and cost-effective across all clouds. Being at Berkeley, we have the unique advantage of being in a neutral and open environment and welcome contributions from all major cloud providers, including IBM. This level of collaboration and ecosystem building wouldn't be possible if we were a closed, proprietary system. Open sourcing has truly facilitated remarkable progress and engagement from cloud providers themselves, contributing valuable connectors and advancements.
Paras: Right from day one, we all had a strong interest in making Skyplane a project that welcomes contributors. However, as a PhD project, the first few releases were certainly a bit rough around the edges and I admit we made some mistakes. Looking back, I feel that we had developed the project with a structure and architecture that worked well for us, but that didn't necessarily make it accessible for new contributors to come in and contribute new connectors, clouds, or components. That's something really easy to fix early on, but it gets much, much harder to do so as the project matures.
Paras: I'm truly fascinated by this incredible trend of generative AI unfolding before us. It's like a whole new world where massive models, which require immense capital investment, are being made available for application developers like me. The best part is, I don't have to spend a single penny upfront. It's reminiscent of how Intel pours billions into developing CPUs, and then everyone else benefits from using off-the-shelf x86 CPUs. This paradigm shift is opening up boundless opportunities, with open applications, efficient interfaces, and powerful abstractions built on top of these foundation models. It feels like we're on the cusp of a new Moore's Law moment, ushering in a transformative era for AI. Whether the foundation models are closed or open source, a massive new ecosystem is already opening up on top of them, which expands the market opportunity for everyone. And that ecosystem, in my view, will be open to a large extent.
Paras: I feel that open sourcing our project and releasing it into the real world was an essential step for personal growth for both of us. It can be intimidating to expose our work to real users and witness how they push the boundaries and uncover unforeseen issues. However, these experiences are invaluable for improving and refining our research. Seeing firsthand how people interact with our systems, break them, and even use them in crazy, unexpected ways provides valuable insights and motivates us to make our work stronger. Sarah's paper on multicast work is a prime example of how real people's experiences directly shape our research, reminding us of the profound impact of engaging with users in the wild. So, my only advice to anyone sitting on the fence about starting an open source project would be: just do it!