You may have seen my complaint on Twitter about how I’ve been working on nothing but documentation for the last couple of weeks (a task many of us technical folks consider the bane of our existence). I claimed that because of the documentation I had nothing cool or exciting to blog about, but then several folks suggested that I blog about “the documentation” itself. The more I thought about it, the more I realized that some of you may not have had much exposure to this kind of work, so an overview seemed like a pretty good idea.

Why do we put ourselves through this pain?

Solid documentation is really the key to the success of any major system or platform. You may be the one to build it, you may be the one to manage it, but sooner or later those keys change hands: databases go offline, patching doesn’t go according to plan, Joe Bag-of-donuts “tweaks” some code and breaks it, you name it… things happen. And it’s the documentation that has to come to the rescue (especially if the original architect, developer, or analyst isn’t around). Solid documentation can also help the customer become more self-sufficient and less reliant on the original development and integration team, though anyone in business will tell you that 90% of the time the first question is still “who built this originally?!?”

What should this look like?

Many people have their own approach to documentation. Some like to create one massive document, while others (myself included) tend to break things apart into smaller, more manageable volumes that group related information together. The volume-based approach also means that you can separate raw technical material (such as a disaster recovery procedure) from the day-to-day information written for administrators or end users. There’s not necessarily a right or wrong way, but there are certainly extremes that, in my opinion, should be avoided. If you’re presenting the customer with a 200-page document, you probably could have made it a little more pleasant for them by breaking it apart.

What do I need to document?

A solid set of documentation should include a complete disaster recovery plan and an administrative guide for your administrative users. Depending on the scope of the project or the requirements you’ve been given, it’s also sometimes necessary to provide end-user training material. The way I tend to structure things is to start with the base Disaster Recovery (DR) procedure. This should walk through recovery of the system both from the perspective that you've lost absolutely everything and have to start over from bare metal, and from the perspective that you're recovering from a backup. As you build your DR procedure, identify the steps that can stand alone as their own procedures, split each into its own document, and reference that new document within the DR procedure. This allows for easier reuse of the material in the future.

When you want to tell someone how to install a feature, for example, you can point them to document "017 - Feature Installation" as opposed to sending them to one massive document that they have to search through looking for the section on feature installation. Use this same approach with your administrative documentation. Is it really necessary for someone to skim through the sections on custom CSS styles when they're looking for information on the system's permission model?
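As an aside, a numbered volume convention like this is easy to sanity-check automatically. Here’s a minimal Python sketch (the volume titles and the quoted reference style are made up for illustration, not a standard) that flags cross-references to volumes that don’t exist:

```python
import re

def check_references(volumes):
    """volumes: dict mapping volume titles to their text content.

    Finds references of the form "NNN - Some Title" in each volume's
    text and reports any that don't match an existing volume title.
    """
    missing = []
    for title, text in volumes.items():
        for ref in re.findall(r'"(\d{3} - [^"]+)"', text):
            if ref not in volumes:
                missing.append((title, ref))
    return missing

docs = {
    "001 - Disaster Recovery": 'To add features, see "017 - Feature Installation".',
    "017 - Feature Installation": "Step-by-step feature install procedure.",
}
print(check_references(docs))  # [] -- every cross-reference resolves
```

Running something like this against your documentation set before a hand-off catches the dangling references that creep in as volumes get renamed or retired.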

How detailed do I have to be?

My general rule of thumb is that someone should be able to walk in off the street (with the right qualifications and skill set), pick up the documentation, and perform the tasks described in it. When we’re talking about disaster recovery, for example, any member of your server team should be able to completely rebuild your environment by following your DR plan. Assume “disaster,” so assume the original team isn’t there; that environment has to be restored (sometimes from bare metal) using nothing but the documentation. Also think about the day-to-day use of the platform. An administrative guide, for example, should cover all of the tasks the administrator performs. You don’t always have the luxury of a turn-over period to train a replacement; what happens if you get hit by a bus tomorrow? Is there someone else on the team who’s intimate with the permission model, the various customizations to the UI, and the integration with the business process (and those critical paths of information flow)? It’s always important to remember that just because your mind works a certain way doesn’t mean the next person to come along will think the same way.

When do I create all of this?

This depends entirely on your personality. Some people wait until the very end and then scramble through it (which often results in incomplete documentation). A better practice is to create things like your DR procedure while you’re building your development environment. You can then test the procedure by building your production environment from the documentation. If staffing permits, it’s not a bad idea to have someone other than yourself or the original server admin perform the OS configuration and platform installation/configuration. This gives you a good litmus test for potential holes or areas open to interpretation that you should clear up (remember, you have to assume you won’t be there in the event of a disaster).

How do I know if all of this works?

Any good set of documentation will have a test plan associated with it: a set of procedures that the server team and administrative team can use to test the functionality of the system. A test plan ensures that the same sequence of events is followed every time, producing the same outcome and allowing you to detect anomalies. While the steps in a test plan will become routine and second nature over time, it’s important to continue to follow them (much like a pilot going through a pre-landing checklist). Test plans are most useful during the user acceptance phase of a project and for confirming that a system performs normally after patching or some other maintenance event.
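If parts of your test plan are scriptable, even a tiny checklist runner helps enforce that fixed sequence. This is just a sketch (the checks shown are placeholders, not a real system’s tests):

```python
def run_test_plan(steps):
    """Run checklist steps in a fixed order.

    steps is a list of (description, check_function) pairs; each check
    returns truthy on success. Returns (passed, list_of_failures).
    """
    failures = []
    for description, check in steps:
        ok = False
        try:
            ok = bool(check())
        except Exception as exc:
            description = f"{description} ({exc})"
        if not ok:
            failures.append(description)
    return len(failures) == 0, failures

# Illustrative checks only -- real ones would query your actual system.
plan = [
    ("Service responds to health check", lambda: True),
    ("Nightly backup file exists",       lambda: True),
]
passed, failures = run_test_plan(plan)
print(passed, failures)  # True []
```

The point isn’t the code; it’s that the plan runs the same steps in the same order every time, so an anomaly stands out instead of hiding in an ad-hoc poke-around.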

You can often use a test plan to tie back to your other documentation volumes. For example, a test procedure may require you to install feature XYZ via the system’s feature installation procedure, which means that procedure gets tested and proven prior to the system’s go-live event. It’s also important to test a system’s disaster recovery procedure. In a lot of cases, DR procedures don’t get tested until something (ummm, disaster-ish) happens; then you find out that there are holes in the procedure, or the data wasn’t being backed up the way it was planned, or it simply didn’t restore as expected. Prior to a system’s go-live event, it’s critical that you put that disaster recovery procedure to the test.

If you’re in the service delivery business, you probably have a service level agreement (SLA) with your customer that specifies how much time you have to perform a restore from backup. If you haven’t tested your procedure, chances are the times in your SLA aren’t accurate (something that can burn you down the road). When in doubt, test the procedures. If nothing else, you’ll gain valuable experience and capture more objective evidence about the structure and performance of the system.
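A restore rehearsal is also easy to time, which gives you a real number to hold up against the SLA. A minimal sketch (the restore function here is a stub; a real rehearsal would restore an actual backup into a scratch environment):

```python
import time

def rehearse_restore(restore_fn, sla_seconds):
    """Time a restore rehearsal against the SLA budget.

    restore_fn performs the (rehearsal) restore; sla_seconds is the
    maximum restore time promised in the SLA. Returns the elapsed
    time and whether it fit within the budget.
    """
    start = time.monotonic()
    restore_fn()
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= sla_seconds

# Stand-in restore for illustration; swap in your real restore procedure.
elapsed, within_sla = rehearse_restore(lambda: time.sleep(0.1),
                                       sla_seconds=4 * 3600)
print(within_sla)  # True
```

Record the elapsed time from each rehearsal alongside the DR procedure itself; a history of measured restore times is exactly the objective evidence mentioned above.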