
Do we really need data pipelines?

This past year yours truly volunteered to teach at a high school in Costa Rica. The subject was, of course, computer science, or more simply, the basics of programming. Now, let me tell ya, engaging these young minds is like trying to tame a flock of wild flamingos with marshmallows. It's an adventure, I'll give you that!

Now, I assure you, my friends, it was a learning experience for me. I’d never taught in a classroom setting before, nor had I worked with grade 9s. BTW, for all the teachers out there, much respect! Putting together a curriculum and getting students to actually learn is hard work. You are all grossly underpaid. Anyway, back to my programming students: their engagement was lackluster from the beginning, and it wasn’t until I stepped back and asked myself what I was missing that it hit me like a freight train. In the midst of my pedagogical endeavors I had taken a core concept for granted: abstraction.

As programmers we live in the abstract: we operate in the realm of objects and functions, variables and I/O. Thinking this way lets us assemble multiple moving parts into solutions without worrying too much about the parts themselves and their internals. So, my amigos, I set out explaining the concept of abstraction to my 9ers, showing how variables, objects and functions are themselves abstractions, and how we make their implementations concrete through code. I then had them design Rube Goldberg-style machines to perform a given task, like taking a pineapple and outputting edible bits. The students realized they needed to put abstract notions together, a peeler, a cutter, a mover of things, etc., sketch out implementations, and assemble them on paper. After this exercise we went back to the world of coding and, lo and behold, things began to click!
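To make that concrete, here is a rough Python sketch of what the pineapple machine looks like once you translate it into code. The function names and details are mine, invented for illustration, but the point is the students' point: each part is an abstraction, and the machine is just the parts composed.

```python
# A sketch of roughly what the classroom exercise boils down to in code:
# each stage of the "machine" is its own function (an abstraction), and the
# machine is just their composition. The peeler/cutter/mover names mirror the
# students' parts; the internals are invented for illustration.
def peel(pineapple):
    return pineapple.replace("unpeeled ", "")

def cut(peeled):
    return [f"{peeled} chunk {i}" for i in range(1, 5)]

def move_to_plate(chunks):
    return {"plate": chunks}

def pineapple_machine(pineapple):
    # The whole machine: assemble the parts without caring how each works inside.
    return move_to_plate(cut(peel(pineapple)))

print(pineapple_machine("unpeeled pineapple"))
```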

Abstraction is so powerful a concept that it permeates our language and almost everything we set out to build and do. We take it for granted because it’s hardwired into our brains; we rarely give it a second thought. But getting stuck with one abstraction or another can sometimes be a limiting trap.

This, my tech veterans, finally brings me to the idea of pipelines. Data pipelines are all the rage right now: ETL, Kafka, queues, transformations. In the connected enterprise, where data is produced at astonishing rates, pipelines play an important role in turning all that data into useful information and ultimately into insights and actions.
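For anyone who hasn't built one, here is a minimal sketch of what such a pipeline boils down to in Python. The source, the cleaning rule and the destination are made-up stand-ins rather than any particular tool's API; the shape (extract, transform, load) is the part that matters.

```python
# A stripped-to-the-bones pipeline: extract -> transform -> load.
# Everything here (the inline "source", the cleaning rule, the destination
# name) is a hypothetical stand-in, not a real system's interface.
import json

RAW_EVENTS = """\
{"user_id": 1, "ts": "1700000000", "event": "signup"}
{"user_id": 2, "event": "click"}
{"user_id": 3, "ts": "1700000060", "event": "purchase"}
"""

def extract():
    """Pull raw events from some upstream source (stubbed as inline JSON lines)."""
    return [json.loads(line) for line in RAW_EVENTS.splitlines()]

def transform(events):
    """Clean and enrich: drop rows missing fields, coerce the timestamp to an int."""
    return [
        {**e, "ts": int(e["ts"])}
        for e in events
        if "ts" in e and "user_id" in e
    ]

def load(rows, destination):
    """Hand the cleaned rows to the warehouse/lake (stubbed as a print)."""
    print(f"loading {len(rows)} rows into {destination}")

if __name__ == "__main__":
    load(transform(extract()), "warehouse.events")
```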

The notion of a pipeline is a natural one when we want to move data from here to there. Verbs like flow and ingest make sense in this context. These abstractions help us visualize and build a mental model. We can even borrow language from physical pipelines, like “flow control” and “backpressure”.
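To see why the borrowed language fits, here is a toy Python illustration of backpressure: a bounded queue between a fast producer and a slow consumer. The sizes and delays are arbitrary, but the behavior, the producer being forced to wait when downstream can't keep up, is exactly what the plumbing term describes.

```python
# Toy backpressure demo: when the bounded buffer is full, put() blocks and
# the producer is forced to slow down instead of flooding downstream.
import queue
import threading
import time

buffer = queue.Queue(maxsize=5)  # small capacity, so backpressure kicks in quickly

def producer():
    for i in range(20):
        buffer.put(i)            # blocks while the buffer is full
        print(f"produced {i} (queue size {buffer.qsize()})")

def consumer():
    while True:
        item = buffer.get()
        time.sleep(0.1)          # downstream is slower than upstream
        print(f"consumed {item}")
        buffer.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
buffer.join()                    # wait until everything has been drained
```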

The problem with pipelines, in both the physical world and the data world, is complexity. Pipelines need operational controls, orchestration and management. These things are not trivial, especially when talking about data at scale and the demands of modern data-science applications. Also consider that modern data pipelines increasingly push data out for actioning in other systems, a pattern called “reverse ETL” (hate that term). Data pipelines don’t just feed records into lakes and warehouses anymore.
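As a rough sketch of that outbound direction, here is what a tiny reverse-ETL step might look like in Python: pull scored rows out of the warehouse and push them to an operational tool for follow-up. The query, the CRM endpoint and the payload shape are hypothetical; only the requests calls are real.

```python
# Hypothetical reverse-ETL step: warehouse rows flow *out* to an operational
# system. The endpoint URL and record shapes are invented for illustration.
import requests

def fetch_churn_risks():
    """Stand-in for a warehouse query, e.g. SELECT email, score FROM churn_scores."""
    return [
        {"email": "ana@example.com", "score": 0.91},
        {"email": "luis@example.com", "score": 0.87},
    ]

def push_to_crm(rows):
    """Push each scored user to a (hypothetical) CRM endpoint for actioning."""
    for row in rows:
        resp = requests.post(
            "https://crm.example.com/api/contacts",   # hypothetical endpoint
            json={"email": row["email"], "churn_risk": row["score"]},
            timeout=10,
        )
        resp.raise_for_status()

if __name__ == "__main__":
    push_to_crm(fetch_churn_risks())
```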

Really, it’s just a matter of getting data from here to there. What if we could do that without being so concerned with the plumbing (plumbing, see, it’s so hard to escape the “pipelines” abstraction)? It could be that I’m descending into a pit of meaningless nihilism and oversimplifying the problem. After all, half of all IT problems revolve around getting data in and out; we call it I/O. But I can’t help thinking that the pipeline abstraction over data confines us. Which raises the question: if not pipelines, what is a better paradigm?

Think for a moment about your favorite streaming service (mine used to be Netflix, but lately I’m all about the Disney+). You’re watching a movie, and hundreds if not thousands of devices play a role in bringing you that stream. Netflix pulls it out of some cache in their infrastructure (maybe off disk, if you found something deep within the unwatched vault). Then it makes its way, via the internet and some of its nodes, to your ISP, who sends it to your home router and finally to your TV. All of this is made possible through protocols: agreements on how to send and receive data, contracts between devices and operating systems. We don’t have to orchestrate any of it ourselves; it all just works.

To my fellow compatriots over the age of 45, those who remember the first PCs: you may recall that running a game took a few steps. You had to grab a floppy disk or two, insert them, run a few commands to load the disk, then run the game exe (I tried to search for an example command sequence but alas came up empty; you can watch someone do it here https://www.youtube.com/watch?v=e6BXzBF0_uc. Yes, obsolete PCs are a thing now and they go for big money on eBay). Today you just click a button; we came up with various new abstractions to make life easier, the graphical user interface to name one. And so, dear tech denizens, the question again: can we use a better paradigm than pipes and plumbing?

As I alluded to in my previous article, I believe it is the network+protocol paradigm that can replace the data pipeline and be our salvation. Ultimately it’s better suited to what we are trying to achieve, which is, at the end of the day, the sharing of information between systems and people. The nodes of our big-data network are the engines that do the work: the computation, the enrichment, the transformation. With one or more protocols in place to standardize communication, data movement and processing become more efficient and seamless, reducing the burden of managing complex pipelines.
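To make the idea slightly less hand-wavy, here is a minimal sketch of a node in that kind of network: it serves its data over an agreed-upon contract (JSON over HTTP at a known path, in this toy version), and any peer that speaks the contract can pull what it needs, with no central pipeline to orchestrate. The path, port and record shape are my own illustrative assumptions, not a real protocol.

```python
# Minimal sketch of the "network + protocol" idea: a node exposes its data
# behind an agreed-upon contract (here, GET /records?since=<ts> returning JSON).
# The endpoint, port, and record shape are illustrative assumptions.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

RECORDS = [
    {"id": 1, "ts": 1700000000, "event": "signup"},
    {"id": 2, "ts": 1700000060, "event": "purchase"},
]

class ProtocolNode(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/records":
            self.send_error(404)
            return
        since = int(parse_qs(parsed.query).get("since", ["0"])[0])
        body = json.dumps([r for r in RECORDS if r["ts"] > since]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Any peer that speaks the contract can now pull data, e.g.:
    #   GET http://localhost:8080/records?since=1700000000
    HTTPServer(("localhost", 8080), ProtocolNode).serve_forever()
```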

Now I know, dear reader, that the devil is in the details, and a communication protocol for big-data systems is no easy feat, let alone its adoption. But I do believe that by looking beyond traditional data pipelines and exploring alternative paradigms, data engineers and scientists can open up possibilities for simplification, automation, and enhanced data-driven applications, all at lower cost. And so, my compatriots, I urge you to embrace the spirit of open-mindedness, challenge established abstractions, and seek a path less traveled where possibility knows no bounds.


Max Kremer, Co-founder & CTO @ Lassoo. Startup guy with multiple exits. Lover of technology and data.