Wednesday, December 19, 2007

Writing a test to verify ODT content

Sample test of the content of an OpenOffice document containing three lines ('1234', an empty line, 'description'):

import javax.xml.parsers.DocumentBuilderFactory;
...
import com.artofsolving.jodconverter.DefaultDocumentFormatRegistry;
import com.artofsolving.jodconverter.DocumentConverter;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

ByteArrayOutputStream output = templateManager.applyModelToTemplate(inputStream, map);

String content = getZipEntry(new ByteArrayInputStream(output.toByteArray()), "content.xml");
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
Document doc = dbf.newDocumentBuilder().parse(new ByteArrayInputStream(content.getBytes()));

Element docElement = doc.getDocumentElement();
assertEquals("office:document-content", docElement.getTagName());

Element bodyElement = (Element) docElement.getElementsByTagName("text:p").item(0);
assertEquals("1234", bodyElement.getTextContent());
// skip empty line
bodyElement = (Element) docElement.getElementsByTagName("text:p").item(2);
assertEquals("description", bodyElement.getTextContent());
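The parsing and assertion part can be tried stand-alone with an inline content.xml. A stripped-down sketch; a real ODT content.xml contains many more namespaces and elements:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class OdtContentSketch {

    public static void main(String[] args) throws Exception {
        // Minimal stand-in for an ODT content.xml with three paragraphs.
        String content = "<office:document-content"
                + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
                + " xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">"
                + "<office:body><office:text>"
                + "<text:p>1234</text:p><text:p/><text:p>description</text:p>"
                + "</office:text></office:body></office:document-content>";

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(content.getBytes("UTF-8")));

        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName()); // prints: office:document-content

        // getElementsByTagName matches on the qualified name, so the
        // prefixed names work even without a namespace-aware factory.
        Element first = (Element) root.getElementsByTagName("text:p").item(0);
        Element third = (Element) root.getElementsByTagName("text:p").item(2);
        System.out.println(first.getTextContent());  // prints: 1234
        System.out.println(third.getTextContent());  // prints: description
    }
}
```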

Thursday, December 13, 2007

Getting the entry from a ZipInputStream

Finally I found the code to get an entry from an InputStream that contains zipped data. For a ZipFile this is easy, but not for a regular input stream, because the size of an entry can be unknown (-1) and there is no alternative but to read sequentially through the stream.


disclaimer: code is not optimized!

// based on http://java.sun.com/developer/technicalArticles/Programming/compression/

public static String unzipEntry(InputStream zippedInputStream, String entryName) throws Exception {
    String result = null;
    final int BUFFER = 2048;
    ZipInputStream zis = new ZipInputStream(new BufferedInputStream(zippedInputStream));
    ZipEntry entry;
    while ((entry = zis.getNextEntry()) != null) {
        int count;
        byte[] data = new byte[BUFFER];
        StringOutputStream fos = new StringOutputStream();
        BufferedOutputStream dest = new BufferedOutputStream(fos, BUFFER);
        while ((count = zis.read(data, 0, BUFFER)) != -1) {
            dest.write(data, 0, count);
        }
        dest.flush();
        dest.close();
        if (entryName.equals(entry.getName())) {
            result = fos.toString();
            break; // found it, no need to read the remaining entries
        }
    }
    zis.close();
    return result;
}

public class StringOutputStream extends OutputStream {

    // This buffer will contain the stream
    protected StringBuffer buf = new StringBuffer();

    public StringOutputStream() {}

    public void close() {}

    public void flush() {}

    public void write(byte[] b) {
        buf.append(new String(b));
    }

    public void write(byte[] b, int off, int len) {
        buf.append(new String(b, off, len));
    }

    public void write(int b) {
        // append the byte as a character, not its numeric value
        buf.append((char) b);
    }

    public String toString() {
        return buf.toString();
    }

    public int contains(String string) {
        return StringUtils.countOccurrencesOf(buf.toString(), string);
    }
}
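For what it's worth, here is a shorter variant using the JDK's own ByteArrayOutputStream instead of the custom StringOutputStream; it also stops reading once the requested entry has been found. The method name is kept, the rest is a sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class UnzipSketch {

    // Return the named entry's content as a String, or null if not present.
    public static String unzipEntry(InputStream in, String entryName) throws Exception {
        ZipInputStream zis = new ZipInputStream(in);
        try {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (!entryName.equals(entry.getName())) {
                    continue; // getNextEntry() skips the rest of this entry for us
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[2048];
                int count;
                while ((count = zis.read(buffer)) != -1) {
                    out.write(buffer, 0, count);
                }
                return out.toString("UTF-8");
            }
            return null;
        } finally {
            zis.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a small zip in memory to exercise the method.
        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        ZipOutputStream zos = new ZipOutputStream(zipped);
        zos.putNextEntry(new ZipEntry("content.xml"));
        zos.write("<office:document-content/>".getBytes("UTF-8"));
        zos.closeEntry();
        zos.close();

        String content = unzipEntry(new ByteArrayInputStream(zipped.toByteArray()), "content.xml");
        System.out.println(content); // prints: <office:document-content/>
    }
}
```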



Pairs

While driving to work I realized something funny:

If I have to work on something, for example produce an article, I will write it, put it away and pick it up a couple of days later to look at it with a fresh mind.

Within our team we also (sometimes) practice pair programming. The idea is that two know and see more than one (the power of interdependence ;-).

Effectively these two practices are one and the same; in the first case you use yourself as the second person, since time will make you a different person than you were. The two practices differ in how they use the dimensions of time and resources. Trade-offs...

Monday, December 10, 2007

Red/green/refactor & spikes

I've developed a new way of coding new functionality. In the past I distinguished between spikes and production code development. The spike was meant for prototype code, throwaway code.

Today I do it like this:
I start with a new test that contains some basic code that does the basics of the job I'm after. I can very easily run this test from my IDE (IntelliJ in my case). After a while the test(s) succeed. This proves that I understand the basics of the code needed. The next step is to move some code to the main folder - to production level. I use the Extract Method refactoring to do this.

This way I move slowly but steadily from prototype to production without redoing work.

Friday, December 7, 2007

The Inner-Platform Effect

The Inner-Platform Effect anti-pattern is a nice description of what I try to avoid with document generation: the task of making a document template becomes so complicated that only an expert can use it.

In a way it reminds me of XP (extreme programming): keep it as simple as possible, and wait until the last moment to add flexibility, until it is really needed.

Another analogy is the J2EE/EJB programming model: programmers should focus on writing business logic. However, tying the whole thing together on an application server like WAS was a nightmare and required more expertise than solving the original business problem.

Thursday, December 6, 2007

RTFTemplate

RTFTemplate is the Java library that does what I was looking for; it allows replacing MS Word mail merge fields with actual data on the server side. So users can still define their document in the traditional way using mail merge, save it as RTF, and let the server use it as a template to generate new documents.

The issue, however, is how scalable this solution is in the long term; for example, how easy is it to add pieces of text to documents? Converting to PDF, which is necessary in my case, is also still not solved. Using OpenOffice as a templating engine would solve this. However, running OpenOffice as a (headless) service complicates things: it is an external process, and I'm not sure about memory leaks and/or how robust it is.

Another con of using OpenOffice is that - since Word will still remain the document editor for the user - the fields have to be typed in as regular text with some markers, for example ${lastname}. This makes it a little less robust: typos are easily made, fields can be deleted by mistake, and control characters can end up within the field.
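A minimal sketch of the marker-replacement idea (applyModel and the regular expression are my own; this is not RTFTemplate's actual API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateSketch {

    private static final Pattern FIELD = Pattern.compile("\\$\\{(\\w+)\\}");

    // Replace every ${name} marker with the corresponding model value.
    public static String applyModel(String template, Map<String, String> model) {
        Matcher m = FIELD.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = model.get(m.group(1));
            // leave unknown markers untouched so typos are easy to spot
            m.appendReplacement(sb,
                    Matcher.quoteReplacement(value != null ? value : m.group()));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> model = new HashMap<String, String>();
        model.put("lastname", "Basta");
        System.out.println(applyModel("Dear ${lastname}, re: ${caseId}", model));
        // prints: Dear Basta, re: ${caseId}
    }
}
```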

Tuesday, November 27, 2007

OpenOffice API sample

A sample of how to use the OpenOffice API is described on this blog. It gives the impression that it is not headless (perhaps my mistake). I know OpenOffice can run headless.

Monday, November 26, 2007

PDF Export in Java

This site and the Open Directory site mention a list of available Java libraries for PDF export (commercial and open source).

Alfresco uses PDFBox, POI and OpenOffice for its document management. (I suspect OpenOffice for the PDF conversion.)

Sunday, November 25, 2007

WebDav

I did not realize that you could mount a WebDAV folder as a drive on your computer. For example, WebDrive does this. According to this article even XP SP3 can do it out of the box (SP2 has some issues, see here). This makes remote documents easy to understand for users. (In my pet applications the users are used to a shared drive for all their documents. Versioning? They use backup & restore ;-)

I need a simple WebDAV implementation that redirects to a folder on the server. This way I can easily link to files from my web application. My hope is that users can also save the files they open this way (which should be the case). Perhaps I will use Tomcat's WebDAV sample application, although I need to direct the content to/from a folder other than the web application's root - which is not that easy. According to this blog it is possible using a JNDI factory.

Instead of Tomcat I could use a library like Jackrabbit or an application like Alfresco, but they are too heavy. (By the way, in my experience you should be very careful about integrating another application into your own.) What would be missing with Tomcat's implementation is a search engine - which the two above do support, although I'm not sure whether Jackrabbit would index Word documents out of the box. (I do know that JSPWiki can search through Word documents using Lucene.)

Friday, November 23, 2007

Executable UML

I do not believe in UML that can be executed one day. Perhaps for a very small, well-defined domain it might work - and then only for the first 20%; the rest, the tweaking and small UI enhancements, will require manual coding. Hopefully in a language that is powerful enough and does not require tons of code. (Java is an example of how it should not be done, with its writers, output streams, etc. You always have to pay the price for flexibility - certainly not a good trade-off.)

Alternative:

From the requirements you distill the concepts. For example, a requirement could be: "as an end-user I should be able to add a contact person". Contact person would be a concept.
The requirements would be translated into automated acceptance tests in terms of these concepts. The automated acceptance tests would talk to an API (call addContactPerson and getAllContactPersons). The next step would be to translate these tests to the UI level - probably requiring human interaction.
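A sketch of what such a concept-level acceptance test could look like; ContactService, its methods and the in-memory implementation are all made-up names:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ContactAcceptanceTest {

    // Hypothetical concept-level API distilled from the requirement.
    interface ContactService {
        void addContactPerson(String name);
        List<String> getAllContactPersons();
    }

    // Trivial in-memory implementation; any implementation must pass the same test.
    static class InMemoryContactService implements ContactService {
        private final List<String> persons = new ArrayList<String>();
        public void addContactPerson(String name) { persons.add(name); }
        public List<String> getAllContactPersons() {
            return Collections.unmodifiableList(persons);
        }
    }

    // Requirement: "as end-user I should be able to add a contact person".
    public static void main(String[] args) {
        ContactService service = new InMemoryContactService();
        service.addContactPerson("Jan Kruif");
        if (!service.getAllContactPersons().contains("Jan Kruif")) {
            throw new AssertionError("contact person was not added");
        }
        System.out.println("acceptance test passed");
    }
}
```

Swapping InMemoryContactService for a real implementation should not change the test: that is the benefit mentioned above.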

Obvious benefit would be that you could change the implementation and still verify whether the requirements are met or not.

Modern Times

Ain't it weird: we fear privacy intrusion, but everybody is writing blogs. We dislike traditions like female genital mutilation, but women these days have surgery done on their bodies everywhere.

Word 2003 docx format - mail merge


Hopefully I'm doing something wrong, but look at the screenshot to the left, which displays the document.xml of a docx when a mail merge field has been defined. Notice the use of begin and end markers instead of making use of the hierarchical nature of XML.

But it gets even weirder; if you define a document in Word (I'm using 2003 with the docx plugin) with the mail merge fields and you do a preview before saving the document, then the document.xml looks like this:

The XML hierarchy of the last one makes sense, but the first one definitely does not. Why the two are completely different, I've no clue...

After doing some research I found out that others have run into a similar issue. See the hyperlink discussion on this blog.

Tuesday, November 20, 2007

Partially copying XML using XSL

Subject: sample XSL that copies the entire XML source tree, but skips an element as specified in the XSL.


The source XML file includes a link to the XSL for display purposes (the XML and XSL are stored in the same folder):

<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="display.xsl"?>

<data>
  <employee id="100">
    <lastname fooAttr="123">Basta</lastname>
    <insertion>van</insertion>
    <firstname>Marco</firstname>
  </employee>

  <employee id="102">
    <lastname fooAttr="123">Kruif</lastname>
    <insertion></insertion>
    <firstname>Jan</firstname>
  </employee>
</data>


The XSL file (called display.xsl); the employee entry with firstname 'Marco' is skipped:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity transform: copies elements, attributes and text -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- empty template: drops the matched employee entirely -->
  <xsl:template match="/data/employee[firstname='Marco']"/>

</xsl:stylesheet>
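For completeness, such a transformation can be run from Java with the JDK's built-in javax.xml.transform API. A minimal sketch using inline strings instead of the files above:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SkipEmployee {

    // Apply an XSL stylesheet to an XML document, both given as strings.
    public static String transform(String xml, String xsl) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Identity transform plus an empty template that skips Marco's entry.
        String xsl = "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:template match=\"@*|node()\"><xsl:copy>"
                + "<xsl:apply-templates select=\"@*|node()\"/></xsl:copy></xsl:template>"
                + "<xsl:template match=\"/data/employee[firstname='Marco']\"/>"
                + "</xsl:stylesheet>";
        String xml = "<data><employee id=\"100\"><firstname>Marco</firstname></employee>"
                + "<employee id=\"102\"><firstname>Jan</firstname></employee></data>";
        String result = transform(xml, xsl);
        System.out.println(result.contains("Marco")); // prints: false (entry skipped)
        System.out.println(result.contains("Jan"));   // prints: true
    }
}
```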

Monday, November 19, 2007

Convert Word document to HTML

I've found two ways to convert a Word document in docx format to HTML, both using XSLT.
This would make it possible to do some kind of mail merge preview: replace the mail merge fields and show the output, both using XSLT.

A similar topic, but a different implementation (not open source), from Softinterface. I'm not sure how they get this working without installing any additional software. (Perhaps they use the OLE object's preview for images, but how about the entire Word document?)

Wednesday, October 31, 2007

Architectural Quality Attributes

Two quality attributes to which an application architecture should adhere:

I) It is important that an application architecture can accommodate new requirements without big changes in the implementation. Changes should be as local as possible, keeping the impact small. (Not everything can be tested in an automated manner.)

This means that the delta in requirements should be linear with respect to the delta in implementation changes. I call this an architectural quality attribute. (See also the book Software Architecture in Practice.)

II) An architecture should have one way of doing things and allow for exceptions.

For example, an architecture that has two different ways of sharing information - say through a remote service call or by storing information in a database - does not adhere to this quality attribute.

Wednesday, October 17, 2007

Using EasyMock2

The record and playback phases of EasyMock have always driven me away from it, but I got sick of implementing mocks myself, so I gave it a try.

The simplicity is appealing. The record and playback is not as annoying as I thought it would be. However, what is a bit frustrating is that I'm not so much interested in the collaboration between the class-to-test and the mock, but just in the class-to-test. If I'm not mistaken, I want to use the mock just as a stub.

Example: I've a component that uses a DAO. I want to mock the DAO, for example to return a fixed collection of items for a particular method. This is easily feasible with EasyMock.

Other example: I've a component that uses an HttpServletResponse. My component sets the header, content type and some other stuff. I'm only interested in what is written to the response, so I do not want to implement setContentType and the other methods. However, afaik you can't tell EasyMock to just ignore calls to all methods except the ones you have specified during recording.
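One stdlib-only way to get that "ignore everything else" behavior without EasyMock is a java.lang.reflect.Proxy stub. A sketch; the Response interface below is a made-up stand-in for HttpServletResponse:

```java
import java.io.StringWriter;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class IgnoringStub {

    // Stand-in for HttpServletResponse; the real interface has many more methods.
    interface Response {
        void setContentType(String type);
        void setHeader(String name, String value);
        void write(String body);
    }

    public static void main(String[] args) {
        final StringWriter written = new StringWriter();
        // Every method call is swallowed, except write(), which we record.
        Response stub = (Response) Proxy.newProxyInstance(
                Response.class.getClassLoader(),
                new Class<?>[] { Response.class },
                new InvocationHandler() {
                    public Object invoke(Object proxy, Method method, Object[] args) {
                        if (method.getName().equals("write")) {
                            written.write((String) args[0]);
                        }
                        return null; // ignore everything else
                    }
                });

        // The component under test may call whatever it likes...
        stub.setContentType("text/html");
        stub.setHeader("Cache-Control", "no-cache");
        stub.write("hello");

        // ...but we only assert on what was written.
        System.out.println(written.toString()); // prints: hello
    }
}
```

If memory serves, EasyMock's createNiceMock comes close as well - unexpected calls return default values instead of failing - but I have not verified it in this scenario.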

Sunday, September 16, 2007

Objects have identity, components don't

A couple of rules:
  • Objects have identity, components don't.
  • Components have an emphasis on behavior, objects on state.
  • Objects can expose methods to expose and validate state.
  • Components with state are considered harmful - but sometimes inevitable.
  • Events in components should always make explicit whether they are raised before or after the fact occurred, to make explicit which state the component is in and that it is not in some kind of transition.
  • Services are components designed for remote access (coarse-grained).
To me, a service is instantiated and accessible at a certain physical location; however, this does not match the situation where a service is deployed in an embedded way. Mmmm.

Data Duplication - Source of all Evil

Several disciplines in the software world have their own means of dealing with data duplication: database guys use normalization, programmers use refactorings like Extract Method. The goal is to have the definition/specification in one place to reduce the maintenance burden.

Other disciplines, like data warehousing, don't care about data duplication because they do not have to update the data, only add to it.

The problem in programming is that it is hard to detect duplication (especially when we span the dimensions of large and multiple teams plus time). There are known algorithms to detect duplication (e.g. IntelliJ provides something like this). However, I question: do we want to remove all duplication? If there is a single implementation in a large system, won't it become impossible to change that implementation just because the impact analysis will take a very long time?

In the case of components it will be hard to prevent duplication. Take for example a simple method that replaces all double quotes with single quotes (one Java 1.5 call). This method occurs in two different components. The cost of extracting it (and thus introducing a new shared component) does not weigh up against living with the duplication.
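For reference, the "one call" is presumably java.lang.String.replace (the char variant is shown here; Java 1.5 added the CharSequence variant):

```java
public class QuoteNormalizer {

    // Replace every double quote with a single quote - one call on java.lang.String.
    public static String normalize(String s) {
        return s.replace('"', '\'');
    }

    public static void main(String[] args) {
        System.out.println(normalize("say \"hello\"")); // prints: say 'hello'
    }
}
```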

Another killer is of course 'semantics'; although the implementations are identical, are the semantics identical? Determining this won't always be easy - especially when another guy wrote the other component.

Fact: we have to live with code duplication. It is inevitable.

Friday, September 14, 2007

Programming languages with built-in support for unit testing

How many times have you made a private method package-local just for unit testing? Each time you have to add a comment 'package local for testing'. Annoyance!

Java's successor must have some kind of access modifier dedicated to unit testing.

Wednesday, September 12, 2007

'Class names in Plural' smell

Just ran into a piece of code where somebody was using a class whose name was plural, for example Computers. Well, that smells. If you have a collection/list/bag/set/whatever of computers, then use the appropriate collection class. Since this was not the case, probably more is going on: this is a collection of items that share a certain characteristic, probably a bunch of computers forming a domain with additional properties that rise above the collection level. I'm arguing that you should use a class name that reflects what the items have in common - most likely in singular.
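A sketch of the idea (all names made up): instead of a vague plural Computers, name the domain concept and keep the collection behavior explicit:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class NamingSketch {

    static class Computer {
        final String hostname;
        Computer(String hostname) { this.hostname = hostname; }
    }

    // Instead of "Computers": a singular name for the domain concept,
    // implementing Iterable to keep the collection behavior explicit.
    static class ComputerCluster implements Iterable<Computer> {
        private final List<Computer> members = new ArrayList<Computer>();
        private final String name; // a property that rises above the collection level

        ComputerCluster(String name) { this.name = name; }
        void add(Computer c) { members.add(c); }
        String getName() { return name; }
        public Iterator<Computer> iterator() { return members.iterator(); }
    }

    public static void main(String[] args) {
        ComputerCluster cluster = new ComputerCluster("build-farm");
        cluster.add(new Computer("node1"));
        for (Computer c : cluster) {
            System.out.println(cluster.getName() + ": " + c.hostname);
        }
        // prints: build-farm: node1
    }
}
```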

Thursday, September 6, 2007

svn ignore for multiple folder / directories

I was struggling with the svn:ignore property to specify multiple folders. This is how I solved it:

Set the environment variable EDITOR. I'm using UltraEdit:
>set EDITOR="C:\Program Files\UltraEdit\uedit32.exe" (note the quotes)

Type in (note the period at the end, indicating that the current folder is the target):
>svn propedit svn:ignore .

This will open the editor, UltraEdit in my case.
Enter the folder names, one per line.
Save and exit the editor.

That's it. Use svn status to check that the folders are indeed ignored.

Wednesday, September 5, 2007

Import / Export Database from / to XML

I'm using Hibernate to manage my database persistence and I wanted to add export/import to/from XML capabilities. Hibernate does support XML, but the documentation is very limited. Next, it is not capable of generating a schema (XSD). XStream has the same problem. So that left only JAXB. The issue with JAXB is that it is very document (web service) oriented. The HyperJAXB project uses the web services XML as a starting point. I want it the other way around: my domain classes are the source of truth; they form the basis from which I want to persist and export/import.

Since I'm neither a JAXB nor a Hibernate expert, it has proven to be quite an exercise. Especially Hibernate's merge operation in combination with Spring's IdTransferringMergeEventListener caused a lot of headaches. I did not fix this properly, but replaced the merge() with saveOrUpdate().

Anyway the basic steps to export/import a database (aka repository) are as follows:

  • Create a container class that contains collections of the top-level domain classes. My sample is based on two domain classes, Person and FiledCase. The class Repository contains two collections of these classes. The collections are extensions of a special HibernateCollectionAdapter class, which takes care of loading from and persisting to the database for a domain class. The container class is passed to JAXB for import or export.
  • Annotate the database id field in each domain class with @XmlTransient; we do not use this field since it is not guaranteed to be unique across all classes. (If it were, you could use this field as the XML id.)
  • Add an XML id field to the domain classes that can be referred to. Annotate the get method with @XmlID. The get method should return the fully qualified class name appended with the database id (to make the id unique across all classes).
  • Annotate the classes that need to be exported with @XmlRootElement.
  • Mark the get methods that refer to other objects that are not aggregated, but have an association type of relationship, with @XmlIDREF.
  • Make sure that the classes are exported in the order of least referenced; use the annotation @XmlType with the propOrder field to specify the order of the properties. This is very important during import. JAXB will patch references of objects afterwards when they cannot be resolved immediately (because the object is not yet imported). However, when the object is already persisted it will not reflect the updated references.
  • The HibernateCollectionAdapter class implements the Collection interface. In the add() method it will store() an object and in contains() it delegates to a find() method. The store and find methods are overridden in the descendant classes. There are a couple of other methods that should also be overridden.
A sample is uploaded; see the comment below.
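The XML id convention from the steps above (fully qualified class name appended with the database id) can be sketched without any JAXB machinery; the class and method names here are my own:

```java
public class XmlIdSketch {

    static class Person {
        private Long id; // database id: @XmlTransient in the real domain class

        Person(Long id) { this.id = id; }

        // In the real class this getter would be annotated with @XmlID.
        // Prefixing with the class name makes the value unique across all classes.
        public String getXmlId() {
            return getClass().getName() + "-" + id;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Person(42L).getXmlId());
        // prints something like: XmlIdSketch$Person-42
    }
}
```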

Saturday, August 25, 2007

Controlling Change Requests

One of the challenges facing the software development process is controlling the number of change requests (CRs). You get tons of exotic enhancement requests that will never make it into the product. Next, you have enormous amounts of small bugs that should be fixed. Among those thousands of CRs a bunch are already fixed, and it is too expensive to find out which ones.

I plead that enhancement requests older than 2 years (2 major versions) be removed. Apparently they are not important enough. If they are important, they will come back.

Another recommendation: when doing triage (deciding if and when a CR will be fixed), the corresponding code should be annotated with the CR number. This will make it easier for the programmer to see which CRs are applicable to that part. He/she can decide to fix the CR as well, update the CR text, or perhaps even close the CR as already fixed.

Today's bug-tracking systems integrate with version control systems, enabling you to see which CR is fixed in which piece of code. This should be improved so you can see where future CRs reside. This will enable you to see which parts are the weakest.

Sunday, August 19, 2007

Your own framework

Your own framework that is going to solve all the problems. It is going to be faster, easier to understand, more robust and lightweight.

There is one problem: it takes time. Probably about 3 major releases to get it right. And guess what happens after 3 years of hard work? A new technology has arisen, making the existing technologies obsolete.

Take Lucene - I've been told it is a very elegant framework. There is a new major v5 on the horizon to get everything right. The problem, however, is that frameworks like GWT and J2S solve a core problem: an easy way of writing client-side code.

Lesson: as a company, don't write your own frameworks unless it is your core business. Even then, take into account that it is going to be outdated in 5 years.

Tuesday, August 14, 2007

unit testing

Unit testing aka white-box testing means you write a test that is aware of the implementation. It means that you can write fine-grained tests specific to a set of requirements. Integration tests (or whatever you wanna call them) test multiple artifacts cooperating together. Theoretically a unit test tests the implementation of one method and nothing else. However, this distinction is hard to maintain as the tests evolve.

Suppose you start with two different test folders: unit-test and integration-test. For the first couple of days everything works nicely. One day you decide to refactor a piece of code. Luckily you are disciplined enough to move the tests along with the code to the new packages. However, one of the methods that was white-box tested had a piece extracted into a new method (say, within the same class). Now suddenly the test you had written is no longer testing solely one method. Hmmm.

Thinking about this: if I write a unit test for a method and this method calls a couple of JVM routines, is it a unit test or an integration test from the beginning?

The point I'm trying to make is that if you want to be really strict about separating unit and integration tests, it takes an enormous effort to keep it sound. With large teams - people come and go - it will in time become an impossible task.

Thursday, August 9, 2007

In the last couple of years software code and architecture have moved from being porcelain to rubber, but they are still not fluid. We need fluidity; we need some kind of dynamic flow that re-establishes balance within the code. For example, some packages or classes have become too large, or there is obsolete code that needs to be removed; this is now all manual labor. Organisms have something called homeostasis, a mechanism to restore the internal equilibrium. The premise is already fulfilled: we have the metrics, we have refactorings. Only some kind of automation is needed to put it together. It might be a little odd to come into the office in the morning and find out your whole application has been turned upside down.

Monday, August 6, 2007

Unit testing

After a couple of years of unit testing I still haven't found an elegant way to make sure that a method calls another method. For example, I want to test method x, which calls method y. Method y enforces a business rule; for example, it checks that certain reserved characters are not present in a string that is passed as a parameter. I can do some litmus tests against x, indirectly testing whether it calls y. However, this gives no 100% guarantee. In fact, it introduces an extra dependency: if the business rule changes, y changes and I've to update some tests of x...
Another approach is of course to use mocks. However, this requires interfaces, factories/IoC, etc. A lot of overhead for a simple test.

A futuristic alternative is to define an aspect that checks that at least one of the methods called from x is y. Admittedly, I've no idea how to implement this, and it won't be easy/flexible to do in a TDD manner.

Another desperate attempt: if y would throw an exception in case of failure, you could check the stack trace. Obviously, still brittle.

So the question remains how to solve this. It gives me an unpleasant feeling that such a basic problem is not yet solved. Perhaps someone else has a bright idea.
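One more low-tech option, which works when y is overridable: a test-specific subclass that records the call to y. A sketch with made-up names:

```java
public class CallVerificationSketch {

    static class Validator {
        void x(String input) {
            // ... other work ...
            y(input);
        }

        // business rule: no reserved characters allowed
        void y(String input) {
            if (input.indexOf('$') >= 0) {
                throw new IllegalArgumentException("reserved character");
            }
        }
    }

    // Test-specific subclass: overrides y only to record that it was called.
    static class RecordingValidator extends Validator {
        boolean yCalled = false;
        void y(String input) {
            yCalled = true;
            super.y(input);
        }
    }

    public static void main(String[] args) {
        RecordingValidator v = new RecordingValidator();
        v.x("hello");
        System.out.println("y was called: " + v.yCalled); // prints: y was called: true
    }
}
```

It avoids the interface/factory overhead of mocks, at the cost of y not being private, so it only softens the problem rather than solving it.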