In my initial post about doing iterative development for BI/DW projects, I mentioned that one of the challenges was developing a solid foundation while doing iterative development, especially in the first few iterations. If you are starting from scratch on a new BI initiative, there is often a lot of work to do in getting environments established, developing processes for moving data and code between environments, and exploring and validating the source data to be used. Unfortunately, most of this work does not result in deliverables that an end user would consider valuable. Since, as part of each iteration, you want to have the possibility of delivering working functionality to the stakeholders, this can present a problem.
Since most end users consider working functionality to be something that they can see in a nice user interface, you need to look at ways to minimize the development time required to present data to the end user. Some of the common time-consuming tasks in the first couple of iterations are:
- Establishing the environments
- Exploring and validating the source data
- Developing ETL processes to move and cleanse the data
There are really no quick workarounds to setting up the environments. In fact, I’ve usually found that taking shortcuts on the environments leads to much bigger problems down the road. However, what can be effective is to minimize the number of environments that you deal with in each iteration. While theoretically you should be able to deploy to production in the first iteration of a project, it’s rare that this is actually needed. So instead of creating development, QA, and production environments, consider establishing only the development and QA environments. I do think that having at least two environments is important, so that you can begin validating your deployment procedures.
Exploring and validating the source data is definitely important. In the first couple of iterations, though, it’s often necessary to limit what you explore. For example, a project I was involved in recently had some very serious issues with data quality. The source database did not enforce referential integrity, so a large percentage of the rows in the source did not relate correctly to the rest of the data. Rather than derailing the current iteration to completely research and resolve the data quality issues, the project team and the stakeholders made the decision to import only the data that was related correctly. This enabled the project team to still present a set of working reports to the stakeholders at the end of the iteration, rather than not being able to demonstrate any working functionality. The subsequent iterations were adjusted to better reflect the level of data quality.
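To make the "only import data that was related correctly" decision concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table and column names (src_orders, src_customers, dw_orders, dw_orders_rejected) are hypothetical and are not from the project described above. Parking the orphaned rows in a separate table is also my own assumption; it is one way to keep the data quality problem visible for later iterations.

```python
import sqlite3


def build_demo_source():
    """Create an in-memory database with a tiny, hypothetical source schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_customers (customer_id INTEGER, name TEXT);
        CREATE TABLE src_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
        CREATE TABLE dw_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
        CREATE TABLE dw_orders_rejected (order_id INTEGER, customer_id INTEGER, amount REAL);

        INSERT INTO src_customers VALUES (1, 'Contoso'), (2, 'Fabrikam');
        -- Order 103 points at a customer that does not exist (no enforced FK).
        INSERT INTO src_orders VALUES (101, 1, 50.0), (102, 2, 75.0), (103, 99, 20.0);
    """)
    return conn


def load_related_orders(conn):
    """Load orders with a valid customer; set the orphaned ones aside."""
    # An inner join keeps only rows whose customer_id resolves to a customer.
    conn.execute("""
        INSERT INTO dw_orders (order_id, customer_id, amount)
        SELECT o.order_id, o.customer_id, o.amount
        FROM src_orders AS o
        JOIN src_customers AS c ON c.customer_id = o.customer_id
    """)
    # A left join with a NULL check finds the rows the inner join dropped.
    conn.execute("""
        INSERT INTO dw_orders_rejected (order_id, customer_id, amount)
        SELECT o.order_id, o.customer_id, o.amount
        FROM src_orders AS o
        LEFT JOIN src_customers AS c ON c.customer_id = o.customer_id
        WHERE c.customer_id IS NULL
    """)
    conn.commit()
    loaded = conn.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
    rejected = conn.execute("SELECT COUNT(*) FROM dw_orders_rejected").fetchone()[0]
    return loaded, rejected
```

The two counts give the team and the stakeholders a simple number to discuss when deciding how much data quality work to schedule in later iterations.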
ETL processes can be time-consuming to develop, particularly if the organization does not already have a framework in place for the ETL processes. In the first couple of iterations, an alternative approach is to load data in a more manual fashion, using SQL scripts or direct manipulation to get an initial set of data populated. This has a couple of benefits. One, it allows the time for building a full ETL process to be spread across multiple iterations. Two, it allows the end users to get a look at the data (in a friendly user interface, if you follow the advice above) and validate that it is correct.
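As a rough illustration of the manual approach, the sketch below runs a hand-written SQL script to seed a couple of tables. It uses Python's standard-library sqlite3 module only so the example is self-contained; the table names (dim_product, fact_sales) and the sample rows are hypothetical, and in practice the script would target whatever database your reporting layer reads from.

```python
import sqlite3

# A hand-maintained seed script. It clears and reloads the tables so that it
# can be rerun safely each time the sample data is corrected.
SEED_SCRIPT = """
CREATE TABLE IF NOT EXISTS dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS fact_sales (
    product_key INTEGER NOT NULL,
    sale_date TEXT NOT NULL,
    amount REAL NOT NULL
);

DELETE FROM fact_sales;
DELETE FROM dim_product;

INSERT INTO dim_product (product_key, product_name) VALUES
    (1, 'Widget'),
    (2, 'Gadget');

INSERT INTO fact_sales (product_key, sale_date, amount) VALUES
    (1, '2024-01-05', 120.0),
    (1, '2024-01-06', 80.0),
    (2, '2024-01-06', 45.5);
"""


def run_manual_load(conn, script=SEED_SCRIPT):
    """Execute the seed script and return row counts per table."""
    conn.executescript(script)
    conn.commit()
    counts = {}
    for table in ("dim_product", "fact_sales"):
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts
```

Because the script deletes before it inserts, rerunning it after a correction leaves the tables in the same state rather than duplicating rows. That makes it reasonable scaffolding until the real ETL process takes over.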
A key part of developing the foundation, while still providing value in each iteration, is accepting that you can’t do it all in a single iteration. The foundation development will have to be spread across multiple iterations, and it is acceptable to build some scaffolding code in order to deliver functionality in each iteration. Clearly, you want to minimize the amount of code that can’t be reused. Generally, I’ve found that with the Microsoft toolset for BI, it’s pretty easy to build incrementally, with minimal rework. However, even if your tools don’t support this as well, in my experience the downsides of having some code that gets replaced in a later iteration are far outweighed by the benefits of being able to demonstrate tangible progress to the stakeholders in each iteration of the project.