I was first exposed to Hadoop in 2010 when I assumed responsibility for Electronic Arts’ (EA) enterprise data platform, which was a mix of Teradata and Cloudera. Having inherited this Hadoop environment, I quickly came up to speed on the capabilities it offered.
At the time, there was a ton of confusion about what Hadoop was and what it would become. After wading through the conflicting narratives and, more importantly, experiencing Hadoop in active projects, I came to the following conclusions:
Hadoop was excellent at economically harnessing data types that were constantly evolving.
We had a lot of data coming from game telemetry that changed constantly. One day the game designers would append a new data element to a game log, and the next day that element might disappear, only to be replaced by three new ones. That kind of schema volatility would break the design patterns I cut my teeth on. A new age of “big data” required new engineering, and Hadoop fit the bill.
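The schema-on-read pattern that made this workable can be sketched in a few lines. This is a minimal illustration, not EA's actual pipeline; the log lines and field names below are hypothetical, assuming telemetry arrives as JSON records whose fields appear and disappear between builds.

```python
import json

# Hypothetical telemetry: day-one logs carry one set of fields;
# day-two logs drop "map" and add three new fields.
day1 = ['{"player": "p1", "event": "match_start", "map": "alpine"}']
day2 = ['{"player": "p2", "event": "match_start", "mode": "ranked", '
        '"party_size": 3, "region": "na"}']

def read_events(lines, wanted_fields):
    """Schema-on-read: project each record onto the fields the
    analysis needs, tolerating fields that appear or disappear."""
    for line in lines:
        record = json.loads(line)
        # dict.get returns None for absent fields instead of raising,
        # so yesterday's queries keep running against today's logs.
        yield {f: record.get(f) for f in wanted_fields}

rows = list(read_events(day1 + day2, ["player", "event", "mode"]))
```

A rigid schema-on-write design would reject the day-two records or require a migration; here the new fields are simply ignored until an analysis asks for them, and missing fields come back as `None` rather than breaking the pipeline.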
Hadoop was poor at managing the core data of an enterprise.
When it comes to managing data in a way that is shared across the enterprise, nothing beats a database, and Hadoop is no database. There was no data type safety and no workload management. Performance suffered when multiple joins were introduced, and only a subset of ANSI SQL was supported, which restricted usability.

In the case of EA, Hadoop was the right tool for the job. We created a platform that the development team was proud of and that delighted the user community, all at a pivotal time in EA’s digital transformation.
I later joined Teradata, supporting our clients in Silicon Valley who were going through a similar transformation, figuring out the right mix of technologies. I was also tasked with being the marketing lead for Teradata’s strategic partnerships with Cloudera, Hortonworks, and MapR. We did some work together that was professionally rewarding, including a trusted-advisor webinar on “Hadoop and the Data Warehouse: When to Use Which” that drew more than 5,000 attendees. We also had valuable product collaborations, such as QueryGrid, which allowed Teradata clients to execute queries that ran across Teradata and HDFS.
It didn’t take long for these three Hadoop distribution vendors to take a path that would eventually lead to their demise. Rather than building the market for a technology that excelled at economically harnessing constantly evolving data, they all positioned themselves as a replacement for the data warehouse.
Rather than creating a market around evolving big data types, and helping enterprises learn to derive value from this new data, the Hadoop distros took a more short-sighted approach and positioned themselves as a cheaper alternative to the data warehouse. There was already recognition of the value of a data warehouse with commensurate budget behind it, so it was easy to say, “Hadoop is a cheaper and more flexible alternative to MPP databases.” This led to many failed engagements.
But the final nail for Hadoop was object storage. Most people say “the cloud” was the undoing of Hadoop, and I worry about what they really mean by that. The cloud is just a deployment option: servers and software. Some cloud databases are entirely inappropriate for managing an enterprise’s core data in a way that promotes reuse and eliminates silos. But one undeniable engineering innovation born in the cloud (and now available across multiple deployment options) is object storage. Object storage delivers what was great about Hadoop: cheap storage and support for flexible data types. Better still, object storage is roughly 3X cheaper than Hadoop, and it supports the kinds of data types needed for the age of Artificial Intelligence, such as audio, video, and image files.
Teradata will continue to plug into legacy Hadoop clusters through QueryGrid, which already lets clients seamlessly tap into object storage as well. So while this may be the end of Hadoop, the era of pervasive data intelligence, enabling any data type across any deployment option, is just getting started.