Definitions of Data warehouse, Data lake, Data Mart, Operational Data Store
Data warehouse is also known as Enterprise Data Warehouse (EDW). Data warehouse is used as source for Business Intelligent’s reporting and analysis. Data Warehouse system collects data from multiple sources and contains historical data for trend analysis reporting. ETL tool is used mostly to build Data Warehouse and interfaces around it. Data Warehouse acts as Single Version of truth.
2. Operational Data Store (ODS)
Operational Data Store is frequently confused and definition is overlapped with Data Warehouse. Some of my clients had used word ODS instead of Data Warehouse, which got me confused on number of occasion. As per my understanding & research, ODS is used to integrate data from multiple systems and feed it to Data Warehouse. Data Warehouse consist of complete history of data, whereas ODS contains latest or recent data (short window of data). Data load frequency in ODS is mostly hourly whereas data load frequency in Data Warehouse mostly is nightly because of data volume. Most important reason to have ODS in your company is ability to run report realtime, where source system does not have required reporting capabilities.
3. Data Mart
Data warehouse can contain many Data Marts. Mostly Data mart is created per business line or system that needs data from Data Warehouse. Indirectly we can say, Data Mart is access layer used to get data out of Data Warehouse by other systems.
4. Data Lake
Term Data Lake was coined by James Dixon, CTO of Pentaho to compare with Data Mart. As per James, Data Mart have several problems mostly related to data silos. Data Lake is method of storing data from sources in its actual or raw format that could be Relational Data, XML, flat files or even binary files. Other tools like ETL, access Data Lake as per need for reporting or analysis purposes.
For this tutorial we will need two data sources. I am taking one flat file and another excel file so that new user can understand that extraction can be done from different types of datasources.
Text file contains “CustomerId, Subscription Start Date, Subscription End Date” and Excel file Contains “CustomerId, Segment, CustomerIdSpace” columns.
Observe that both data sources contains one common column named “CustomerId”. For a while don’t count the column “CustomerIdSpace” of Excel Data Source.
Create New Business Intelligent project in Visual Studio 2005 and Drag “Data Flow Task” from tools to package.
Double click on “Data Flow Task” and new tab will opened. Drag “Flat File Source” and “Excel source” from tool box.
Double click on “Flat File Source” and “Excel Source” and create new Connection for excel file and text file respectively.
Drag two “Sort” control from “Data Flow Transformation” section of the toolbox below both sources.
Double click on “Sort” control and select the column “CustomerId” for both “Sort” Control.
Now Drag the “Merge Join” control from tool box and drop green arrow from both “Sort” Control to the “Merge Join” control. Select the type of Join, in this case we have selected “inner join“, also select the columns which should be exported in Output .
Here, we are not going to write the result in file, but we will use “derived column” control after merge join and add “data Viewer” as discussed in previous article to view the output.
The final snapshot of the ETL package will look like:
As stated, Business intelligence tool is installed with SQL Server 2005 and best part is that it is integrated with Visual Studio 2005. So developer get the benefit of SSIS feature in Visual Studio power to work in disconnected environment.
In this article, we will read the csv file and get the required records and write back into another csv file. csv file is attached in the source code in this article.
To start the project, click new Project in Visual Studio and select “Business Intelligence Projects” and then “Integration Services Project” as shown in below figure.