Is distributed analytics the solution to a connected world of data?
It’s hardly breaking news to suggest that monolithic approaches to data management and analytics are not fit for purpose in a connected, big data world. The interesting question is how to extend the data and content architecture, and the accompanying analytics, to encompass new and unfamiliar sources without throwing away millions of dollars of existing investment. Step forward distributed analytics: a new way of thinking about how to extend capabilities out into the data landscape by providing appropriate data management and analytics “in the moment”, while transporting back to the core only the data and insights that are necessary.
A concert of machine learning, analytical agents, and the cloud
I am not going to claim that distributed analytics is going to happen overnight, nor even that all the technology required to achieve the vision is available yet, at least not in an enterprise-friendly package. However, I believe that, as a collection of design principles and approaches, it offers some of the most solid foundations for how enterprises across industries should start planning for a much more connected, data-dependent world.
Briefly, the idea that already difficult-to-manage data warehouses and inflexible data governance processes can be adapted to incorporate much greater amounts of data, at much higher speeds, is largely discredited. While it might be theoretically possible, the costs associated with applying a legacy approach to a relatively new and growing problem would be too high. The conundrum is that legacy technology has cost enterprises a small fortune in investment, a fortune that is still far from paying its returns, and core data-related business processes such as financial management and regulatory reporting require a heavily governed approach to ensure accuracy (if not always timeliness). Some argue that it is time to throw away these technologies and start afresh with data lakes and the cloud, among other options. I argue it’s possible to have both: a locked-down core of technology matched to much more flexible technologies that augment it – something I referred to as the elastic architecture in our research agenda for this year.
But what does that mean? The – slightly disappointing – answer is that no two distributed analytics solutions will look the same, but they will share characteristics. In the short term, the data lake will become standard at most organisations. By its very nature, it should be a mix of data landing zone and longer-term data storage, as well as, increasingly, home to some pretty complex data science-led analysis. Likely best located in the cloud (public, private, or – much more likely – a hybrid deployment that bridges the gap between on- and off-premises), the data lake could be considered a buffer between a relatively well-organised internal data architecture and the more chaotic world beyond. This would, in essence, augment existing investments in information management technologies and provide the route to “bring the data home” if deemed necessary.
Longer term, I see a future in spreading analytic capabilities outside that core, beyond the warehouse and the data lake and out toward the machines and devices generating the vast amounts of data that are causing the “problem.” I talk in terms of analytical agents: packages of software loaded locally that don’t rely on heavy compute or memory. Using local resources, backed up by machine learning-driven optimisation running at the centre, these analytical agents would send back to the core only the data that is necessary – for exceptional events, for improving the machine learning algorithms, and for regulatory reporting. Immediate proximity means that optimisation of physical processes could happen in near real time, without relying on the transport of vast quantities of data to and from the edge.
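To make the idea concrete, here is a minimal sketch of what such an analytical agent might look like. All names and thresholds are illustrative assumptions, not taken from any specific product: the agent scores each sensor reading against its own local history and forwards to the core only the exceptional events, plus a compact summary the centre could use to improve its models.

```python
# Illustrative sketch of an edge "analytical agent" (all names hypothetical).
# The agent keeps a local window of readings, flags statistical outliers,
# and queues only those exceptions for transport back to the core.
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class AnalyticalAgent:
    threshold: float = 3.0                       # z-score marking a reading "exceptional"
    window: list = field(default_factory=list)   # local history, never shipped in full
    outbox: list = field(default_factory=list)   # stand-in for a link back to the core

    def observe(self, value: float) -> None:
        self.window.append(value)
        if len(self.window) < 10:
            return                               # too little history to score against
        mu, sigma = mean(self.window), pstdev(self.window)
        if sigma and abs(value - mu) / sigma > self.threshold:
            # Only the exception travels to the core, not the raw stream.
            self.outbox.append({"value": value, "mean": mu, "stdev": sigma})

    def summary(self) -> dict:
        # Small periodic payload the core could use to retrain or optimise.
        return {"n": len(self.window), "mean": mean(self.window)}

agent = AnalyticalAgent()
for v in [10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [50.0]:  # a spike amid steady readings
    agent.observe(v)
print(len(agent.outbox))  # only the spike is flagged; steady readings stay local
```

In a real deployment the outbox would be a network call and the threshold would itself be tuned by the machine learning running at the centre, but the design point is the same: compute happens next to the data, and only insight-bearing data moves.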
Orchestrating these capabilities is only the beginning, and the technology is still emerging and maturing, but as an approach that looks to the future without abandoning existing requirements and the investments that support them, distributed analytics provides practical steps forward for the enterprise.
Article by Tom Pringle, Ovum analyst