How the Savviest Companies Integrate All That Data into Data Lakes
By John Thielens, Chief Technology Officer at Cleo
The research firm Ovum predicts that the appeal of greater customer and business insights will help grow the big data market to $9.4 billion by 2020, and the need for tightly integrated and quickly accessible data flows enabling these insights will only grow with it.
Because companies manage and consume data with greater variety, volume, and velocity than in the past, existing data architecture is evolving beyond traditional databases, data stores, data warehouses, and the like into a more unfiltered repository known as the data lake.
The reality is that modern big data projects are massive undertakings spanning an ever-growing network of complex systems and applications. Driving the data lake movement is the demand to connect source technologies in a highly efficient and time-sensitive manner for increased information analysis, especially to accommodate unstructured information.
But organizations leveraging big data are discovering that the incumbent integration tools they're counting on to support the file types, volumes, connectivity, and other tools needed for such analytics and business intelligence projects are coming up short. That's why traditional repositories like SQL databases and enterprise data warehouses will be augmented by data lakes.
Data lakes, when well-integrated, deliver the modern flexibility required for organizations to aggregate all of their internal and external sources to extract value from the right data. At this scale, however, traditional big data ingestion approaches often take months before anyone can review results on the information.
Modern enterprises will need a secure, reliable method to seamlessly integrate and pipe all of the data into a data lake, but this "big data gateway" must be more advanced than ever before, facilitating absolute integration with all of its data sources, setting policy and control over who has access and how, and scaling to grow with the company.
So. Much. Data.
Today's explosion of data can feel like an unmanageable reality for many organizations, and the persistent data challenges organizations face - variety, volume, and velocity - aren't going away. If anything, these challenges will escalate.
But it's important to take a breath and understand where all of this data comes from and know that it can be controlled. Consider many of the ways data currently flows through the enterprise:
- B2B: Digital communication with partners, suppliers, customers, and other external parties - which includes invoices, advance shipping notices, and other EDI transactions - happens across a variety of protocols and data formats.
- Applications: ERP and other back-end systems generate data with every transaction, which also requires integration to and from internal databases, and even across internal departments.
- Cloud: Modern business necessitates linking B2B flows into cloud applications like Salesforce and Eloqua, and enabling the same B2B and traditional MFT flows into cloud storage, including Amazon S3.
- Big data: All of these file transfers supply the big data-centric flow, and organizations are ingesting data through DBMS gateways and APIs, synchronizing all of the data between data storage repositories, and piping partner data and other large data sets into storage and then back out for analysis.
These four points don't even mention all of the ad-hoc ways employees create, share, and store information, something IT teams are desperate to reduce via comprehensive file transfer solutions for every department's needs.
No matter the source, though, enterprises require governance to enforce rules and access, conduct audits, and maintain customer and industry mandates. That's why for all the benefits of data lakes, enterprises must start with a B2B-centric data gateway to integrate, aggregate, and funnel this information where they can access and analyze it.
Advantages of a Data Lake
Managing file ingestion from a data lake into Hadoop and other big data ecosystems - and centrally governing the movement and format in the right sequence for the right purpose - delivers a unique competitive advantage.
By optimizing data-driven discoveries, operational and digital processes can be fully integrated and automated, eliminating wasted time and misaligned efforts. But realizing such advantages of big data, which include enhanced customer engagement, superior forecasting, and streamlined workflows, starts on the shores of the data lake.
The data lake differs from traditional enterprise storage applications in that the files, messages, events, and raw data are captured and retain their native formats in distributed data systems, like Hadoop. Data lakes can include such information as web data, server logs, social media data, geographic information system (GIS) and GPS data, weather information, RFID and machine-generated data, and various media, such as images, audio, and video files, all in their raw states.
The cool thing about the data lake, then, is that it supports "schema-on-read" functionality. Schema-on-read advocates keeping raw, untransformed data on ingestion. Without transformation on ingestion, companies can move faster and create new acquisition feeds quickly without thinking about mapping, gaining data agility now while asking the compelling data-use questions later.
IT teams are able to analyze the use case and the entire data set and make a decision on how to proceed, and this decide-the-schema-later approach makes database design a lot easier on implementation.
Additionally, transformation often results in discarding supposedly worthless information that later may turn out to be useful after all, so data lakes' schema-on-read functionality preserves value that up-front transformation would throw away.
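To make the schema-on-read idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular data lake product works: the sample records, the field names, and the helper read_with_schema are all hypothetical. The raw records are kept exactly as they arrived, and each analysis applies its own schema only at the moment it reads them.

```python
import json

# Raw events kept exactly as they arrived; no schema was enforced on ingestion.
# The records and field names are hypothetical sample data.
RAW_EVENTS = [
    '{"order_id": "A1", "amount": "19.99", "region": "EU", "coupon": "SPRING"}',
    '{"order_id": "A2", "amount": "5.00", "region": "US"}',
    '{"order_id": "A3", "amount": "12.50", "region": "EU", "device": "mobile"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields and cast them, per record.

    `schema` maps a field name to a casting function. Fields missing from a
    record come back as None; fields outside the schema are skipped for this
    read but remain untouched in the raw data.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# First question asked of the data: revenue per region.
revenue_schema = {"region": str, "amount": float}
revenue = {}
for row in read_with_schema(RAW_EVENTS, revenue_schema):
    revenue[row["region"]] = revenue.get(row["region"], 0.0) + row["amount"]

# A later question: which orders used a coupon? The field was never
# discarded on ingestion, so it is available without re-ingesting anything.
coupon_schema = {"order_id": str, "coupon": str}
coupon_orders = [
    row["order_id"]
    for row in read_with_schema(RAW_EVENTS, coupon_schema)
    if row["coupon"] is not None
]
```

The coupon field would likely have been dropped by a transform-on-ingestion pipeline built only for the revenue question; here the second question needs nothing more than a second schema.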
So in an era where business value is based largely on how quickly and how analytically you can work with your data, connecting your organization to a modern data repository via a powerful big data integration gateway facilitates lightning-quick decision-making, advanced predictive analytics, and agile data-based determinations.
The Integration Requirements
By capturing largely unstructured data for a low cost and storing various types of data in the same place for analysis when the business is ready, a data lake:
- Breaks down silos and routes information into one searchable structure. Data poured into the lake lives there until it is needed, when it flows back out again.
- Aggregates the information in one place without having to qualify whether a file is useful or not.
- Enables analysts to easily explore new data relationships. Data extraction can be performed on demand based on the current needs of the business, allowing for identifying new patterns and relationships in existing data.
- Helps deliver results faster than a traditional data approach. Data lakes provide a platform to utilize heaps of information for business benefits in near real-time.
But the security, access, and extensibility required to accommodate such varying data streams cannot be overstated, and that's why an optimized data delivery gateway is critical to ensuring a proper return on a data lake investment.
Some of the key big data gateway characteristics to support a data lake investment, then, include:
- Elastic scalability: Your enterprise's platform will need to scale to handle large volumes and content. Think zillions of little messages but also singular, massive files. An ideal solution would handle both of these extremes.
- Security and governance: You'll need to secure and track data feeds originating from within and beyond your enterprise. If your organization is to derive critical business value from the end result, it will need governance, tracking, auditing, and alerts on these data flows and feeds to ensure continuity of service.
- Community management: New technology blurs the lines between traditional data integration and human interaction, and dynamic interaction between people and information must occur more often. Collaborative management functionality enables a central platform for easy data routing and expanded workflow capabilities.
- Easy: Easy to implement. Easy to operate. Easy to scale. Much like the decide-schema-later aspects of your big data initiatives, you'll need a big data gateway solution to interoperate with any communication protocol, data format, OS platform, and more so you can decide later how things will be integrated. And it's got to be technology that's easy enough to manage internally so you can skip the expensive consulting fees, stay agile, and deliver business value faster.
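As a rough illustration of the first two characteristics above, the following Python sketch streams data into storage in fixed-size chunks, so the same code path serves a tiny message and a very large file, and appends an audit entry with a checksum for each transfer. Everything here is an assumption made for illustration: the ingest function, the 1 MiB chunk size, the audit entry fields, and the in-memory buffers standing in for real sources and storage. A production gateway would add access control, retries, and alerting on top.

```python
import hashlib
import io
import json
import time

CHUNK_SIZE = 1024 * 1024  # 1 MiB; an assumed value, tune for your environment


def ingest(source, sink, name, audit_log, chunk_size=CHUNK_SIZE):
    """Copy `source` to `sink` unchanged, in fixed-size chunks.

    Memory use is bounded by `chunk_size`, so the same path serves a tiny
    message and a very large file. Each transfer appends one audit entry.
    """
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        sink.write(chunk)
        digest.update(chunk)
        total += len(chunk)
    entry = {
        "name": name,
        "bytes": total,
        "sha256": digest.hexdigest(),
        "ingested_at": time.time(),
    }
    audit_log.append(entry)
    return entry


audit_log = []

# In-memory buffers stand in for real sources and for lake storage.
small = io.BytesIO(b'{"invoice": 42}')
large = io.BytesIO(b"x" * (3 * 1024 * 1024 + 7))
small_sink, large_sink = io.BytesIO(), io.BytesIO()

# A small message and a larger payload go through the same code path.
ingest(small, small_sink, "invoice-42.json", audit_log)
ingest(large, large_sink, "sensor-dump.bin", audit_log)

print(json.dumps(audit_log, indent=2))
```

Because the bytes are written as received, the data keeps its native format, and the recorded checksum lets a later audit confirm that what landed in storage matches what was sent.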
And modern integration systems should move away from traditional ETL (Extract, Transform, Load) architectures and support the schema-on-read principle to give organizations the flexibility to decide when and how they will use the data.
Thus, file transfer and integration systems whose core capabilities are carrier-grade scalability, secure data transfers, and the ability to connect to non-traditional storage repositories (Hadoop, NoSQL, etc.) will alleviate the security, control, and expansion challenges and deliver a better data lake experience.
Companies that deploy an integration platform with these key functions significantly increase the benefits of a data lake investment.
How to Get There
Ultimately, the predictive organization is naturally proactive. Faster insight leads to clearer foresight, and actionable intelligence drawn from a data lake provides that clear and decisive competitive edge that companies always look for.
Data lakes deliver added ability to monitor and analyze the historical performance of organizations to better achieve future results, but the promise of improved analytics and business agility is broken when the data is incomplete or not easily accessible.
Companies undoubtedly must have connected data, and strategically addressing how to get all of it into data lakes and between data lakes is incredibly important, as legacy technologies simply can't meet modern data requirements.
Organizations must focus on a centralized file transfer platform that enables B2B, cloud, application, and big data integration, supporting the advanced data initiatives of today's modern enterprise. After all, a data lake with stagnant information flows becomes a data swamp, leading to a reactive business environment and little competitive edge at all.