LineaPy: Move Fast from Data Science Prototype to Pipeline

This interview is part of the Decibel OSS Spotlight series where we showcase founders of fast-growing open-source projects that are solving really unique problems and experiencing strong community adoption.

Sudip Chakrabarti spoke to Doris Xin, creator of LineaPy, an open-source tool that helps data scientists rapidly move from development to production by automatically translating development code into data pipelines ready for deployment in production.

Doris shared with us her inspiration behind creating LineaPy and how she is keeping up with the project's fast growth.

Hey Doris, let me start with a question I have to get off of my chest first - what’s with the name LineaPy? Where does it come from?

Doris: Glad you asked! The name LineaPy combines two concepts - lineage and Python - central to the project. Everything in data science is about figuring out how things are connected and understanding lineage is key to that - what data was used to train the model, what process was followed to experiment, which iterations were tried before selecting the final model, etc. All of those combined represent the lineage of a model which encapsulates pretty much everything that happens in the data world. And, the Py in LineaPy is a nod to the fact that LineaPy is a low-code solution and requires only two lines of Python code to use it. Finally, linea also means line in Spanish and Italian which is a happy coincidence!

What was the motivation behind creating LineaPy?

Doris: LineaPy has been in the making for a decade, if not more. The initial inspiration had come from my very first job out of college. I was an ML engineer at LinkedIn circa 2012 on a high-profile team of over thirty ML engineers with PhDs. My role on this team was that of data plumbing. While others got to build models - which is the fun side of data science - I was building pipelines and dashboards so that my colleagues could operationalize their models. My job was critical, but also very mechanical with a lot of repeating parts, reusable components and the same process followed over and over again. That is really when I started to think about building better abstractions, better infrastructure and better automation to support the work I was doing - critical, but something that would greatly benefit from automation, if done right.

My work experience subsequently inspired the topic of my PhD dissertation at UC Berkeley which was usable and efficient systems for Machine Learning. My research focused on developer experience - building systems to help developers iterate faster and get their code to production quickly.

With the LineaPy project, I am carrying forward my decade-long mission of helping data scientists automate the mechanical (but critical) parts of their workflow and getting them to production much faster than ever before.

Doris’s labmates from UC Berkeley with their advisors Prof. Joe Hellerstein and Prof. Aditya Parameswaran; most have since started companies in the data space

So, what problem does LineaPy solve? Why does one need LineaPy?

Doris: LineaPy helps data scientists rapidly move from development to production by automatically translating development code into clean data pipelines for deployment in production. In data science, going from development to production is full of friction, with only one in ten projects making it to production. A proliferation of libraries, tools, and technologies means data teams spend countless hours building and managing production pipelines, and this drastically reduces the team’s ability to deliver actionable insights in real time. LineaPy automates code translation and rapidly creates analytics pipelines with a simple API - no refactoring or new tools are needed and you could go from your messy Jupyter notebook to an Airflow pipeline in minutes and with two lines of Python code!

The LineaPy tool lives in the background of a data science development environment, capturing everything while a data scientist iterates through hundreds of different models and data sets. Once the data scientist settles on the final model, LineaPy analyzes all that messy development code and extracts the essential operations with one simple API call. It then automatically refactors the development code and translates it into data pipelines - the Airflow DAG with all required dependencies - needed to run in production.

LineaPy enables data scientists to move at the speed of their thoughts while allowing data engineering teams to work with production-grade pipelines and not messy development code, and it does all of that without forcing anyone to learn new frameworks.

Have you brought forward any innovation from your PhD research into LineaPy?

Doris: The LineaPy open-source project has benefitted from a collection of insights from my graduate research instead of a specific innovation, per se. The chief among those is the idea to use compiler techniques to analyze data science programs. Data scientists need to constantly rewrite their workflows - to experiment with different models - which they subsequently convert into data pipelines. The data pipeline conversion process is onerous but redundant in theory since the workflow logic remains the same. So, if you could figure out how to use compiler techniques to analyze a data science program and then distill what the core workflow logic looks like, you could do a lot of interesting things like code cleanup, code refactoring, automatic pipeline generation, etc. That key insight from my PhD research is a major construct in the design of the LineaPy open-source project.
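To make the idea of using compiler techniques to distill the core workflow logic more concrete, here is a minimal sketch of backward program slicing built on Python's standard `ast` module. It is an illustration only, not LineaPy's actual implementation: the function name `backward_slice` and the sample `notebook` code are hypothetical, and the sketch assumes straight-line, top-level statements and ignores in-place mutation such as `list.append`.

```python
import ast


def backward_slice(source: str, target: str) -> str:
    """Return only the top-level statements that `target` depends on.

    Hypothetical helper for illustration. It walks the statements from
    last to first and keeps each one that defines a name still needed.
    The analysis is conservative: it may keep an extra statement, but it
    only reasons about names, not about mutated objects.
    """
    tree = ast.parse(source)
    needed = {target}
    kept = []
    for stmt in reversed(tree.body):
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        defined = {n.id for n in names if isinstance(n.ctx, ast.Store)}
        used = {n.id for n in names if isinstance(n.ctx, ast.Load)}
        if isinstance(stmt, (ast.Import, ast.ImportFrom)):
            # `import a.b` binds `a`; `import a as b` binds `b`.
            defined |= {(a.asname or a.name).split(".")[0] for a in stmt.names}
        if isinstance(stmt, ast.AugAssign):
            # `x += 1` reads the previous value of x as well as writing it.
            used |= defined
        if defined & needed:
            needed = (needed - defined) | used
            kept.append(stmt)
    return "\n".join(ast.unparse(s) for s in reversed(kept))


# A made-up "messy notebook": two lines are exploratory dead ends.
notebook = """
import math
raw = [1, 2, 3, 4]
debug = sum(raw)
scaled = [math.sqrt(x) for x in raw]
plot_data = [x * 2 for x in raw]
model = sum(scaled) / len(scaled)
"""

pipeline_code = backward_slice(notebook, "model")
print(pipeline_code)
```

Slicing on `model` keeps the import, `raw`, `scaled`, and `model`, and drops the `debug` and `plot_data` lines because nothing that `model` needs depends on them. A production tool would also have to account for side effects, control flow, and library calls that mutate their arguments, which this sketch deliberately leaves out.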

What was your experience of starting an open-source project out of academia like?

Doris: We are actually a bit different from some of our open-source peers who started their projects while in grad school and hence have had several years to build the community. We have taken a different path - some would say a harder path - because we started to build after I had graduated. So, our challenge is that we are doing three things - building an open-source project, building an open-source community, and building a company - all at the same time. But, there are some benefits too. As graduate students, people usually have limited industry experience and are measured by their academic output, i.e., research papers. In contrast, we have had the luxury of being extremely customer focused, which helped us quickly cycle through two prototypes before we built something that started resonating with a wider audience. The fact that we can focus only on how to best serve our users without any distraction is a huge advantage when it comes to building a great open-source project.

Are there other projects out there that are solving a problem similar to the one LineaPy solves?

Doris: The two projects closest to LineaPy are Ploomber and ZenML. Those projects, however, have a different philosophy than LineaPy - they treat notebooks (like Jupyter) as both the development environment and the production pipeline. At LineaPy, we believe in separation between the data science development environment and the production stack because asking data scientists to be mindful of engineering concerns during development substantially reduces workflow productivity. In addition, we do not require any change whatsoever in workflows and meet data scientists where they are - this significantly reduces friction in adoption and is a major differentiator for us.

What other OSS projects (besides the usual suspects) do you admire most and why?

Doris: I really admire Airbyte, which is an open-source EL(T) platform, because of the phenomenal awareness and thriving community they have created. The Airbyte team has really mastered the content play in addition to building an easy-to-use product that solves a large pain point.

What advice would you have for someone who is thinking of starting a new open-source project?

Doris: I have two pieces of advice and, in hindsight, I wish we had done better at following both. First, it is really important to have a strong, opinionated view on the project roadmap right out of the gate. Yes, it is important to listen to the community and build features that the community wants. But, it is equally important to set a clear direction for the project, especially if it is breaking new ground. Otherwise, it is easy to be stuck with a giant wish list and no clear priority for any item on it. The other advice I have is that it is never too early to invest in content marketing. Without awareness there is no community; so, I’d highly encourage every open source founder to work on creating awareness through content from day one.