Sunday, September 16, 2007

Data Duplication - Source of all Evil

Several disciplines in the software world have their own means of dealing with data duplication: database people use normalization, programmers use refactorings like extract-method. The goal is to keep the definition/specification in one place and so reduce the maintenance burden.

Other disciplines, like data warehousing, don't care about data duplication because they never update the data; they only append to it.

The problem in programming is that duplication is hard to detect, especially across large and multiple teams and over time. There are known algorithms for detecting duplication (IntelliJ, for example, ships with something like this). Still, I question whether we want to remove all duplication: if there is only a single implementation in a large system, won't changing that implementation become nearly impossible simply because the impact analysis takes so long?
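To make the detection problem concrete, here is a minimal sketch of the simplest possible variant: grouping methods whose bodies are textually identical after whitespace normalization. Real tools like IntelliJ's duplicate analysis work at the token or AST level and also find near-duplicates; the class and method names below are mine, not taken from any tool.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class DuplicateFinder {

    // Maps each normalized method body to the names of the methods sharing it.
    public static Map<String, List<String>> findExactDuplicates(
            Map<String, String> methodBodies) {
        Map<String, List<String>> byBody = new HashMap<String, List<String>>();
        for (Map.Entry<String, String> entry : methodBodies.entrySet()) {
            // Normalize whitespace so formatting differences don't hide duplicates.
            String normalized = entry.getValue().replaceAll("\\s+", " ").trim();
            List<String> names = byBody.get(normalized);
            if (names == null) {
                names = new ArrayList<String>();
                byBody.put(normalized, names);
            }
            names.add(entry.getKey());
        }
        // Keep only bodies shared by more than one method.
        Iterator<List<String>> it = byBody.values().iterator();
        while (it.hasNext()) {
            if (it.next().size() < 2) {
                it.remove();
            }
        }
        return byBody;
    }
}
```

Even this naive version hints at the scale of the problem: it only catches character-for-character copies, while most real duplication differs in variable names, types, or small details.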

In the case of components it is hard to prevent duplication. Take for example a simple method that replaces all double quotes with single quotes (one Java 1.5 call). Suppose this method occurs in two different components. The cost of extracting it (and thus introducing a new shared component) is higher than the cost of simply living with the duplication.
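Such a method is about as small as a method gets; presumably something like this (the method name is mine, the post only describes the behavior):

```java
// One call to String.replace (the CharSequence overload added in Java 1.5).
public static String toSingleQuotes(String s) {
    return s.replace("\"", "'");
}
```

Pulling a one-liner like this into a shared component means a new dependency, a release cycle, and versioning concerns for both consumers, all to avoid one duplicated line.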

Another killer is, of course, semantics: even when two implementations are identical, are their semantics identical? Determining this won't always be easy, especially when someone else wrote the other component.
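A hypothetical illustration (not from the post): two methods with byte-for-byte identical bodies that nevertheless mean different things, so merging them would couple two concepts that evolve independently.

```java
// Converts a fraction (e.g. 0.19) to a percentage (19.0).
public static double toPercentage(double fraction) {
    return fraction * 100;
}

// Converts an amount in euros to an amount in cents.
public static double toCents(double euros) {
    return euros * 100;
}
```

A duplicate detector will flag these as identical, but extracting one shared method would be wrong: if one concept ever changes (say, cents become mills), the other must not change with it.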

Fact: we have to live with code duplication. It is inevitable.
