The Ed-Fi Tech Congress in Phoenix, in April 2018, was a sink-or-swim moment for me, as I had just started working for the Ed-Fi Alliance. Among the first people I met was a representative from one of the big technology companies. The conversation quickly turned to the question of how to deal with data when a vendor would not send it directly into the Ed-Fi ODS/API. He asked me, “Why not just put it in a data lake?” I had no reply. Nearly four years later, I can at last give a reasonable one.
First, what is a data lake? Amazon defines a data lake as
“… a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.”
Starting around 2015, data lakes were one of the “next big things” at the intersection of business and technology, positioned as an alternative to costly and time-consuming data warehouse maintenance. Dump all your data into files, and let your smart people sort through and discover amazing, profit-driving insights about your business.
Presumably this worked well for some companies. But it turns out that trying to create reports, dashboards, and analytics “without having to first structure the data” is no easy task. In the data lake, workload shifts from the Database Administrator (maintaining the warehouse) to the Data Analyst (creating reports). Furthermore, the data warehouse likely had a degree of quality control that would be missing with unfettered access to all the company data in the lake. Governance processes had to be tightened so that the Data Analysts would know which data were trustworthy before presenting them to decision makers. To relieve the Data Analyst of the work of preparing the data, a new role was born: that of the Data Engineer, who specializes in cleaning, transforming, and moving data.
Today, there are many companies selling data lakes and related tools. Most of them are trying to ease the data engineering burden. Furthermore, many of them are shifting their focus to hybrid solutions where data lakes become more of a commodity tool. Interestingly, the destination for all that data is… a new style of enterprise data warehouse. In other words, the tools of a data lake continue to provide value, but data analysts might be better off running most of their processes from a well-curated warehouse rather than directly from the lake. On occasion, when some data are not yet available in the warehouse, the analyst might run experimental analyses on those data in the lake. If the data prove useful, they might be loaded into the warehouse going forward.
Data lake tools can provide real value for storing and exploring data from many disconnected sources. They remain the ideal storage mechanism for machine learning workloads. However, pushing educational data files into a lake is not a silver bullet for analytics and interoperability. In many cases it will require substantial investments to retrain or hire staff, migrate to new software tools, and develop internal standards. When the business case does support use of a data lake, the Ed-Fi ODS/API platform can ease the adoption burden in at least two ways:
- It creates a well-defined and described data standard for most, if not all, of your data. A good data standard goes a long way toward enabling data analyst productivity.
- It sets a baseline standard for data integrity. The data analyst can more quickly trust the data, and less cleaning will be required.
Integrating the Ed-Fi ODS/API platform into a data lake requires a way to copy data from the Ed-Fi system into the data lake file structure. The best way to achieve this is to build a tool that calls the API to request all available resources and then writes the responses out to disk. Run this tool periodically, for example on a nightly basis.
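To make the idea concrete, here is a minimal sketch of such an export tool in Python. It is an illustration under stated assumptions, not a reference implementation: the function names (`make_http_fetcher`, `iter_pages`, `export_resources`) are hypothetical, and the sketch assumes that each resource can be paged with `offset` and `limit` query parameters, that a bearer token has already been obtained, and that one JSON Lines file per resource is an acceptable layout for your lake. Check all of these against your own deployment.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List

# A page fetcher takes (resource, offset, limit) and returns a list of records.
FetchPage = Callable[[str, int, int], List[dict]]


def make_http_fetcher(base_url: str, token: str) -> FetchPage:
    """Build a fetcher that issues authenticated GET requests.

    Assumption: offset/limit paging and bearer-token authentication.
    """
    def fetch(resource: str, offset: int, limit: int) -> List[dict]:
        query = urllib.parse.urlencode({"offset": offset, "limit": limit})
        request = urllib.request.Request(
            f"{base_url.rstrip('/')}/{resource}?{query}",
            headers={
                "Authorization": f"Bearer {token}",
                "Accept": "application/json",
            },
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return fetch


def iter_pages(fetch: FetchPage, resource: str, limit: int = 100) -> Iterator[List[dict]]:
    """Yield pages of records until the API returns a short or empty page."""
    offset = 0
    while True:
        page = fetch(resource, offset, limit)
        if not page:
            return
        yield page
        if len(page) < limit:
            return
        offset += limit


def export_resources(
    fetch: FetchPage, resources: Iterable[str], out_dir: str, limit: int = 100
) -> Dict[str, int]:
    """Write each resource to <out_dir>/<resource>.jsonl, one record per line.

    Returns the number of records written per resource.
    """
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    counts: Dict[str, int] = {}
    for resource in resources:
        # Flatten "ed-fi/students" to "ed-fi_students.jsonl" (a layout assumption).
        target = root / f"{resource.replace('/', '_')}.jsonl"
        count = 0
        with target.open("w", encoding="utf-8") as handle:
            for page in iter_pages(fetch, resource, limit):
                for record in page:
                    handle.write(json.dumps(record) + "\n")
                    count += 1
        counts[resource] = count
    return counts
```

A nightly job, scheduled with whatever tool you already use, could then call something like `export_resources(make_http_fetcher(base_url, token), ["ed-fi/students", "ed-fi/schools"], "lake/raw/2022-01-15")`, where the base URL, resource paths, and dated folder name are placeholders. Passing the fetcher in as a parameter keeps the paging and file-writing logic testable without a live API.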
Get the white paper on Tech Docs for more information on the potential benefits and techniques for integrating the Ed-Fi Data Standard and Ed-Fi ODS/API platform into a data lake.