Data lakes!
A repository for structured, semi-structured and unstructured data. Data lakes let data rest in its most natural form without having to be transformed and analyzed up front, and in this respect they are very different from data warehouses.
In simpler terms, the different kinds of data generated by machines and humans can be loaded into a data lake and analyzed and classified at a later time.
A data warehouse, by contrast, requires properly structured data before any work can be done on it.
To understand why data lakes are the ideal candidates to house big data, it is crucial to understand how they differ from data warehouses.
The difference between a Data Warehouse and a Data Lake:
Probably the only similarity between a data warehouse and a data lake is the fact that they are both data repositories. Let’s now have a look at some of the key differences:
- In most cases, data warehouses make use of highly structured data whereas data lakes are designed in such a way that they support all types of data.
- Data lakes store all the data that may be analyzed at a future date. In a data warehouse, where limited storage is an issue, irrelevant data is eliminated.
- It follows from the points above that a data warehouse and a data lake differ vastly in scale. A data lake needs to be highly scalable because it supports all types of data and stores data even when it is not for immediate use.
- The availability of metadata (data about data) lets users who work with data lakes gain basic insights about the data very quickly. In the case of data warehouses, a member of the development team is required to access the data, which can create a bottleneck.
- Another key difference is that the intense data management required for data warehouses implies that they’re very expensive to maintain when compared to data lakes.
The use of data lakes!
The advantage of data lakes is that advanced analytics tools and mining software can take raw data and turn it into useful insights. Data warehouses depend on clean, structured data, whereas data lakes let data rest in its raw, natural form.
Now that you know the importance of data lakes, let's look at how businesses implement big data analytics to help increase their revenue.
Big Data Analytics
Big data analytics makes use of the data in a data lake to uncover patterns, customer preferences and market trends, with the objective of helping businesses make informed decisions faster. This is achieved through four different types of analysis:
- Descriptive analysis – Retrospective in nature, descriptive analysis looks at "where" a problem may have occurred. Much of big data analytics today is descriptive because these insights can be generated quickly.
- Diagnostic analysis – Also retrospective in nature, diagnostic analysis looks at "why" a specific problem occurred in the first place. It is more detailed than descriptive analysis.
- Predictive analysis – When AI and machine learning models are applied, analysis can provide an organization with models that predict when an event might occur next. Predictive analytics models are now widely adopted because of the insights they generate.
- Prescriptive analysis – This is the future of big data analytics: it not only assists with decision making but also provides a set of concrete answers. This analysis involves a high level of machine learning.
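As a rough illustration of the descriptive and predictive steps above, here is a minimal Python sketch over a made-up series of daily sales figures. The numbers and the naive least-squares trend line are illustrative assumptions, not a production model:

```python
from statistics import mean

# Hypothetical daily sales figures pulled from a data lake (made-up numbers)
sales = [120, 135, 128, 150, 162, 158, 171]

# Descriptive analysis: what happened, and where was the dip?
avg = mean(sales)
dip_day = sales.index(min(sales))

# Predictive analysis: a naive linear trend fitted by least squares
xs = list(range(len(sales)))
x_bar, y_bar = mean(xs), mean(sales)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
forecast = intercept + slope * len(sales)  # predicted sales for the next day

print(f"average={avg:.1f}, dip on day {dip_day}, next-day forecast={forecast:.1f}")
```

A real pipeline would of course use far richer models, but the distinction holds: descriptive analysis summarizes the past, while predictive analysis extrapolates from it.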
Architecture of the data lake!
The question arises: how can data lakes store such massive and diverse amounts of data? What is the underlying architecture of these massive repositories?
The data model that data lakes are built on is the schema-on-read model. A schema is essentially like a blueprint – the structure of the database outlining its model and how the data is structured within it.
When you can load your data into the lake without having to worry about its structure, that is a schema-on-read data model. This model allows for a lot more flexibility.
On the other hand, data warehouses are built on schema-on-write data models. This is the more traditional method adopted for databases.
All sets of data, with their relationships and indexes, must be clearly pre-defined. This in turn limits flexibility, especially when new data sets or new features are added, which can create gaps in the database.
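The contrast can be sketched in a few lines of Python, using an in-memory SQLite table to stand in for a warehouse's schema-on-write and raw JSON strings for a lake's schema-on-read. The table name, fields and records here are illustrative assumptions:

```python
import json
import sqlite3

# Schema-on-write (warehouse style): the structure must exist before data lands.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")  # pre-defined schema
db.execute("INSERT INTO events VALUES (?, ?)", (42, "click"))

# Schema-on-read (lake style): raw records are stored exactly as they arrive...
raw_records = [
    '{"user_id": 42, "action": "click"}',
    '{"user_id": 7, "action": "scroll", "device": "mobile"}',  # extra field is fine
]

# ...and structure is only imposed at the moment the data is read.
clicks = [json.loads(r) for r in raw_records if '"click"' in r]
print(clicks[0]["user_id"])  # → 42
```

Note how the second raw record carries a field the first one lacks: the lake accepts it without complaint, while the warehouse table would have needed a schema change first.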
The backbone of a data lake is the schema-on-read data model. The processing framework, however, is how data actually gets loaded into one.
The processing frameworks that ingest data into data lakes are explained below:
- Stream processing – Small batches of data are processed in real time. For businesses that harness real-time analytics, stream processing is very valuable.
- Batch processing – Many millions of blocks of data are processed over long periods of time. This is the least time-sensitive method of processing big data.
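A toy Python sketch of the two styles, with a small in-memory list standing in for a real ingestion source. The record format and the windowed average are illustrative assumptions, not the API of any real framework:

```python
# Hypothetical sensor readings standing in for a real ingestion source
records = [{"sensor": i % 3, "value": i * 1.5} for i in range(10)]

def stream_process(source, window=2):
    """Stream style: handle records in small windows as they arrive."""
    buf, results = [], []
    for rec in source:
        buf.append(rec)
        if len(buf) == window:
            # A per-window average is available in near real time
            results.append(sum(r["value"] for r in buf) / window)
            buf = []
    return results

def batch_process(source):
    """Batch style: collect everything first, then compute once."""
    values = [r["value"] for r in source]
    return sum(values) / len(values)

print(stream_process(records))  # several small, incremental results
print(batch_process(records))   # one result over the full data set
```

The stream version yields partial answers while data is still arriving; the batch version waits for the whole set but only computes once.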
Stream-vs-Batch Processing
Apache Spark, Apache Storm and Hadoop are some of the commonly used big data processing tools capable of both stream and batch processing.
Only a certain set of tools can process unstructured data such as internet clickstream data, social media posts and sensor activity. Other tools on the market use machine learning programs to prioritize processing by speed and usefulness.
Once the data has been processed and ingested into the data lake, it is time to make use of it.
Data Lake Challenges
The advantages of a data lake are that it is scalable, quick to load and flexible. However, these advantages come at a cost.
- Ingesting unstructured data means there is little data governance and few processes to make sure the right data is being looked at. For many businesses – especially those that haven’t adopted big data – possessing uncleaned and unorganized data isn’t an option.
- You could end up with a data swamp if the metadata and the processes that keep the data lake in check are neglected.
- Data security is always an issue which needs to be kept in check.
- Data lakes are widely used in IT today, but a few tools are still working out the security kinks. One major kink is ensuring that only the right people have access to the sensitive data loaded into the lake.
But like any new technology, these issues will resolve with time.
Big data – The role data lakes play!
Even though data lakes have a few challenges, it is no secret that 80 percent of the world’s data is unstructured. As more businesses adopt big data, the applications of data lakes are bound to rise.
Data warehouses are strong in security and structure, but big data needs to be unconfined so that it can flow freely into data lakes.