Our story begins with what are usually known as famous last words - "...and
I want it done quick, good and cheap!"
This is the story of a small startup (a very small one) that developed
a proprietary product with very limited resources.
In the proprietary world, it is considered possible to get only 2 of
"quick", "good" and "cheap".
FOSS changes the equation (in my opinion) - let's see how it worked for
us...
...and yet, how it also hurt us, in some ways...
The Product
In a startup, you first need to have a well-defined product description.
In our company, the buzz-wordish definition was "to support virtualizing
services from a data-center to the branch-offices of an organization".
This mostly amounts to exposing the services via a proxy that lets us
support bandwidth reduction (compression), security (which offices get
which service, and to which workstations) and centralized management
(do it all from one station, for the entire organization).
Infrastructural Needs
We needed a rather large infrastructure (communications software,
tweak-able DNS servers, 3rd-party compression libraries, configuration
software, process-control software, testing software, etc.).
Developing all this ourselves meant taking a lot of time.
Using proprietary products meant paying a lot of money. It had already
been decided to build the GUI with Microsoft-based tools, and paying for
an MSDN license gave us a small taste of what that could cost (and we're
not talking about royalties yet).
This is what eventually allowed me to introduce the idea of using
available FOSS software (to be listed shortly).
The Product's Architecture
In order to understand what was needed, let me describe the
architecture of the product:
The Product's Architecture (Cont.)
An appliance (running Linux) is set in the data-center, with access to
the various services of the organization (web servers, mail servers,
DB servers...)
In each branch office, a similar appliance (also running Linux) is
installed.
Software running on the branch-office appliances exports TCP or UDP
services, which are proxied to the appliance in the data-center,
and from there to the actual servers.
The proxy software performs adaptive data-compression (i.e. it adapts
to the data being proxied for each service).
The Product's Architecture (Cont. 2)
To configure the system (add proxied services, set access permissions),
a GUI based on Win2K's MMC is used.
This GUI stores the configuration in an Active Directory server,
and then notifies the data-center appliance to pick up the new configuration.
In addition, the GUI can show events, accounting data, etc.
Each appliance runs (a subset of) several processes - a multi-threaded
communications process, a compression adaptation process, a configuration
control process, an accounting process, and an events-handling process...
How To Select An Infrastructure
After performing a feasibility study to check the algorithmic part
of the system, and after defining the architecture, came the part of
choosing 3rd-party software tools.
The first thing to do was to write a list of the infrastructure software
that would be needed.
Such a list, of course, cannot be written if you have never developed
such a system, and are not aware of the various products that exist
on the market (whether FOSS or proprietary does not matter).
Since I had experience with implementing similar architectures, I was
given the task of writing the list.
The Grocery List
And so I wrote (with the assistance of the boss):
One communications library, to handle everything to do with bits and
sockets...
One threading library, one multi-process library.
One database server and a matching database client library.
One LDAP-aware library (to access the active directory server).
A configuration-files library (to store configuration data locally,
so the appliance could start even if the active directory server
is unreachable).
A DNS server whose source we could easily tweak (to allow exposing
the proxied services under the original server names).
A process-manager (to manage process launching, etc).
A standard general-purpose compression library (to add on top of our
adaptive compression; a minimal sketch follows this list).
(I might have missed a few minor items...)
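To make the compression item concrete, here is a minimal sketch of the
general-purpose layer that sits under the adaptive compression, assuming
the library is zlib (it is not named above, so both zlib and the helper
function are illustrative only):

  // Squeeze a block of bytes before it goes over the WAN link.
  // zlib is an assumption here; compress_block() is an invented helper.
  #include <zlib.h>
  #include <vector>

  bool compress_block(const unsigned char* src, unsigned long len,
                      std::vector<unsigned char>& out)
  {
      uLongf out_len = compressBound(len);       // worst-case output size
      out.resize(out_len);
      int rc = compress2(&out[0], &out_len, src, len,
                         Z_BEST_SPEED);          // favor latency over ratio
      if (rc != Z_OK)
          return false;
      out.resize(out_len);                       // shrink to actual size
      return true;
  }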
The Little ACE That Could
For the communications library there was no doubt - ACE.
ACE (the Adaptive Communication Environment) also happens to provide
OS-portability APIs, portable multi-threading APIs, portable
multi-processing APIs, etc.
For the higher-level communications that did not need too much control
(i.e. did not run across a WAN) - we wanted CORBA, and so TAO (The ACE
ORB) was the natural choice.
Having (good) prior experience with both of them couldn't hurt either ;)
For WAN communications, CORBA was not tweak-able enough (at least not
without a steep learning curve), so we settled for direct use of ACE
for the WAN link (as well as for the communications with workstations
in the offices, and servers in the data-center).
The MySQL Devil
We needed a database mostly to store accounting records - so we did
not need any special features - only that it be light-weight and fast.
Since MySQL has these two qualities, we went for it. It was also
easier to set up than Postgres and came by default with our Linux
distribution.
The database layer was limited to a small part of the software
(the accounting process) - so replacing the database would have been
rather easy, should the need arise.
From the management GUI we accessed the database via ODBC - which meant
the code was database-neutral. Replacing the ODBC driver would suffice
to work with a different database.
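For illustration, here is roughly what the accounting process's database
layer amounts to, sketched with the MySQL C client API; the table, its
columns and the function name are invented, since the real schema is not
described here:

  /* Hypothetical accounting insert via the MySQL C client API.
   * The "accounting" table and its columns are made up for this sketch;
   * real code would also escape or parameterize the values. */
  #include <mysql/mysql.h>
  #include <cstdio>

  int store_accounting_record(MYSQL* db, const char* service,
                              unsigned long bytes_in, unsigned long bytes_out)
  {
      char query[256];
      snprintf(query, sizeof(query),
               "INSERT INTO accounting (service, bytes_in, bytes_out) "
               "VALUES ('%s', %lu, %lu)", service, bytes_in, bytes_out);
      if (mysql_query(db, query) != 0) {
          fprintf(stderr, "accounting insert failed: %s\n", mysql_error(db));
          return -1;
      }
      return 0;
  }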
When To Develop Infrastructure Software In-House
For some parts of the infrastructure, we did not manage to find
satisfactory 3rd-party libraries.
For example, there was no process-manager software that had the
capabilities I deemed important - so it was decided to develop it
in-house.
Configuration libraries were plentiful - but all had some annoying
problems: they depended on too many extra libraries, they supported
reading but not writing, or they were too simplistic (treated everything
as strings, did not support hierarchical data).
Eventually, it was decided to go with XML (not my choice, but I did
not strongly object), and with a standard XML library (Xerces).
On Microsoft Windows we went with Microsoft's own library, since using
Xerces there proved to be a pain.
And When To Mix The Two Approaches?
Since XML parsers are very annoying (DOM, anyone?), we wrote our own
wrapper library on top, which supported a sub-set of XML that made our
lives simpler.
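To give a feel for it, here is a rough sketch of such a wrapper on top of
Xerces-C; the class name, the key-lookup interface and the (lack of)
error handling are simplified inventions, not our actual library:

  // A simplified configuration wrapper over the Xerces-C DOM parser.
  // SimpleConfig and get() are invented for this sketch.
  #include <xercesc/util/PlatformUtils.hpp>
  #include <xercesc/util/XMLString.hpp>
  #include <xercesc/parsers/XercesDOMParser.hpp>
  #include <xercesc/dom/DOM.hpp>
  #include <string>

  using namespace xercesc;

  class SimpleConfig
  {
  public:
      explicit SimpleConfig(const char* path)
      {
          XMLPlatformUtils::Initialize();
          parser_ = new XercesDOMParser;
          parser_->parse(path);          // real code would catch exceptions
      }
      ~SimpleConfig()
      {
          delete parser_;
          XMLPlatformUtils::Terminate();
      }
      // Return the text of the first <key> element, or a default value.
      std::string get(const char* key, const std::string& def = "")
      {
          DOMDocument* doc = parser_->getDocument();
          if (!doc)
              return def;
          XMLCh* tag = XMLString::transcode(key);
          DOMNodeList* nodes = doc->getElementsByTagName(tag);
          XMLString::release(&tag);
          if (nodes->getLength() == 0)
              return def;
          char* text = XMLString::transcode(nodes->item(0)->getTextContent());
          std::string value(text);
          XMLString::release(&text);
          return value;
      }
  private:
      XercesDOMParser* parser_;
  };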
In general - if it takes longer to search for a good software tool
than it would take to develop and maintain it - I prefer to develop it
and gain the extra control over the source and the features.
Sometimes a hybrid approach is best - take an existing software that
does most of what you want, and add what you need. This is how we handled
the DNS server.
The Development Environment
Software was written in C++ - so it was compiled with g++.
Makefiles were written to build the project, and CVS was used to
manage versions.
Doxygen was used for automatic API documentation generation. We hardly
read the documentation (viewing the header files was more comfortable) -
but it forced us to write better documentation.
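For example, a header entry in that spirit (the function below is made
up for illustration, not taken from the product):

  /**
   * Send a block of data to the data-center appliance over the WAN link.
   * (A made-up declaration, shown only to illustrate the Doxygen style.)
   *
   * @param buffer   the bytes to send (already compressed)
   * @param length   number of bytes in buffer
   * @param timeout  how long to wait for the link, in milliseconds
   * @return 0 on success, -1 on failure
   */
  int send_to_datacenter(const char* buffer, size_t length, int timeout);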
As a new initiate of the "automatic testing" cult, I set up the system
very early to run automatic builds every night, including automatic
unit and sub-system tests.
In addition, learning from SourceForge, the CVS server was set up to send
diffs after every check-in - and I made a habit of reading them every
day, and of using the SCSI cable to flog the criminals...
A Word About Software Licenses
As a conscientious user of FOSS, I wanted to make sure that the license
of every piece of software we used actually allowed us to use it in a
proprietary product.
Thus, we limited ourselves to libraries with licenses such as the LGPL
(and then made sure to dynamically link with them), BSD, X11 and similar.
It is easy to be tempted to use some GPL library in a manner
that is not allowed - one should read the licenses carefully and
develop an awareness of their implications (e.g. what constitutes
a "derived work").
In doing this, my best guide was "prior art" - if I knew that some piece
of software had a license that was considered well understood, or that
it was being used by large companies (which employ good lawyers especially
for these purposes) - I could generally trust it.
More on this - in our epilogue...
The Journey
Once all the players are set on the board, someone shouts "Pawn to K-4"
and the game begins.
During this part, we shall see how programmers coming from the proprietary
world adapt to the new environment.
We will also see a few examples where Linux literally rescued us from
ourselves, or from 3rd-party products...
(Automatic) system testing will be mentioned as well.
The first step - learning ACE
Since ACE was our abstraction layer, (almost) everyone had to learn
how to use ACE. ACE has a very good book that serves as an intro,
and from there on - we used the examples and test programs of ACE.
CORBA was a little tougher - so we bought what I knew to be the only
book about CORBA that actually teaches CORBA programming. Since we didn't
need it too often, one (used) copy was bought.
Initially, we were asked to "wrap ACE up, in case we want to replace
it one day". This line of thinking comes mostly from the proprietary
world (we might want to use a different product one day), or from
the portability-frightened world (what if we need to port it to
a platform where this product does not exist? hmm???).
ACE being what it is (a useful portability wrapper with a lot of extra
functionality) - it does not lend itself to wrapping (how do you wrap
a wrapper?).
It took a bit to get this extra-wrapping requirement off our backs...
ACE Communications control - The all-mighty Reactor
One of the nicest parts of ACE is its socket-programming wrapper
module - the ACE Reactor, and the ACE Acceptor-Connector.
Using them, it is extremely easy to write a select-based server that
handles as many clients as we want.
This Reactor allows adding clients and server-listeners quickly - all
you need to do is derive from the proper Acceptor (server) or Connector
(client) class, implement a few methods, and you're set.
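To show what "implement a few methods" means, here is a minimal sketch of
an acceptor-plus-handler pair. ProxyHandler and the port number are
inventions (and the real handler did much more than read a buffer);
the ACE classes and calls are the standard ones:

  // A service handler plus an acceptor registered with the reactor.
  #include "ace/Acceptor.h"
  #include "ace/SOCK_Acceptor.h"
  #include "ace/SOCK_Stream.h"
  #include "ace/Svc_Handler.h"
  #include "ace/Synch.h"
  #include "ace/Reactor.h"
  #include "ace/INET_Addr.h"

  class ProxyHandler : public ACE_Svc_Handler<ACE_SOCK_STREAM, ACE_NULL_SYNCH>
  {
  public:
      // Called by the reactor when data arrives on this connection.
      virtual int handle_input(ACE_HANDLE)
      {
          char buf[4096];
          ssize_t n = this->peer().recv(buf, sizeof(buf));
          if (n <= 0)
              return -1;        // tells the reactor to close this handler
          // ...here the real code would push 'buf' onto the WAN link...
          return 0;
      }
  };

  // The acceptor creates a ProxyHandler for every incoming connection.
  typedef ACE_Acceptor<ProxyHandler, ACE_SOCK_ACCEPTOR> ProxyAcceptor;

  int main()
  {
      ProxyAcceptor acceptor;
      if (acceptor.open(ACE_INET_Addr(8080), ACE_Reactor::instance()) == -1)
          return 1;
      ACE_Reactor::instance()->run_reactor_event_loop();   // serve forever
      return 0;
  }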
Thus, we had a Link class to handle the WAN link (where compression
and connection interleaving are performed).
We had a similar class to bring up listener sockets on the branch-office
appliance, and to connect to the data-center servers (on the data-center
side).
It took about 2 weeks to have software to proxy one service (with no
management or compression capabilities). It took a while more to make this
software configurable via an XML file.
Writing The Process Manager
In a previous project I worked on, we had a team write an
overly-sophisticated process manager.
I took some of its requirements and said "we need this, only we'll
write it in one month with one person, instead of half a year with 3+
programmers". How come? Because of lessons learned, and because we would
only do the really necessary parts.
What a process manager basically does is:
Load a list of processes in a defined order, with configurable
parameters and configurable environment (working directory,
environment variables).
Monitor these processes (both on the OS level and on the
application level).
Perform a configurable recovery operation in case a process crashes.
Allow shutting the system down in a convenient manner.
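A bare-bones sketch of that core loop, assuming plain fork()/exec() and a
blocking waitpid() instead of the SIGCHLD handling and application-level
monitoring the real process manager had; the process names are invented:

  // Launch configured children and restart whatever dies.
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>
  #include <cstdio>
  #include <map>
  #include <string>
  #include <vector>

  // Launch one child process and return its pid.
  static pid_t launch(const std::vector<std::string>& argv)
  {
      pid_t pid = fork();
      if (pid == 0) {                              // child
          std::vector<char*> args;
          for (size_t i = 0; i < argv.size(); ++i)
              args.push_back(const_cast<char*>(argv[i].c_str()));
          args.push_back(NULL);
          execvp(args[0], &args[0]);
          _exit(127);                              // exec failed
      }
      return pid;                                  // parent
  }

  int main()
  {
      // In the real system this list came from the configuration files.
      std::vector< std::vector<std::string> > processes;
      processes.push_back(std::vector<std::string>(1, "./comm_process"));
      processes.push_back(std::vector<std::string>(1, "./accounting_process"));

      std::map<pid_t, size_t> running;             // pid -> which process
      for (size_t i = 0; i < processes.size(); ++i)
          running[launch(processes[i])] = i;

      // Reap children; restart anything that dies (the "recovery operation").
      for (;;) {
          int status = 0;
          pid_t dead = waitpid(-1, &status, 0);
          if (dead <= 0)
              continue;
          size_t which = running[dead];
          running.erase(dead);
          fprintf(stderr, "process %lu died, restarting\n",
                  (unsigned long)which);
          running[launch(processes[which])] = which;
      }
  }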
Writing The Process Manager (Cont.)
Since the process manager is the parent of everything else (so that it
can catch SIGCHLD signals), it had better be very stable - if it crashes,
it brings the entire system down with it.
Since we had a limited amount of machines, it must be possible to run
two instances of the system on the same machine easily.
Some people will say "use a bunch of shell scripts" - I find that a crappy
solution: hard to maintain, and slow to respond when problems occur
(ever tried writing a signal handler in a shell script?).
Important note: the reason to split the system into several processes
was in order to increase stability - a bug that causes the compression
analyzer to crash shouldn't bring the communications code down.
The Situation With FOSS Library Documentation
Many FOSS libraries suffer from poor documentation - things keep changing,
and none of the programmers has the patience to update the documentation
and document the new interfaces.
You often see the following in a reference manual:
connect(Addr address, int timeout, Connector& c) - connect to
a server.
After reading such a thing I say "ah. now _everything_ is clear. NOT!".
(Good) proprietary products tend to have much better documentation. On
the other hand, they tend not to come with the source ;)
The Situation With FOSS Library Documentation (Cont.)
So with FOSS libraries, everything is a little harder (since you need
to read the source in order to understand the nuances of an interface).
On the other hand, with FOSS, when you have a problem, you can almost
always find the answer in the source. With closed-source software - you
are stuck with the support.
Sometimes the support of a proprietary product is really good - and
sometimes they need your source in order to help you - and you spend
hours and days in order to come up with a stand-alone demonstration of
the problem you're having.
Personally, I prefer poor documentation with the source code over good
documentation with support that has to come from across the sea (i.e. at
least one working day to get the request for extra details, and another
working day to get the answer).
The Local Source Oracle
Since the other programmers were not used to working with FOSS tools,
I served as the local oracle.
When one of the programmers ran into a problem with one of our FOSS tools,
I would start reading the source code and supply the answer.
This could take anywhere between 5 minutes to 2 days.
This means that overall, the support level they got was quite a bit better
than the support they used to get from companies with "good support".
One of them kept whining about this situation - apparently, being
dependent on people you know is harder than being dependent on people you
have only talked to over the phone (or via e-mail).
It is the same reason programmers never trust the software their own
company creates - it's easier to trust a life-support system when you
are not aware of the bugs that QA found and that were never fixed...
"Don't Install A DNS Server In My Network!"
In our first test site, the network administrators immediately said "Don't
install a DNS server in my network!"
They were probably burnt by DNS servers in the past - it was
non-negotiable.
If we cannot redirect clients via DNS - we can redirect them via iptables
transparent-proxy rules...
...provided that our appliance is "in the middle" (connected as a router),
or the network administrators can add a transparent proxy rule on their
current router.
A little more coding and a few GUI changes, and we could select between
proxying via DNS and proxying via iptables.
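The kind of rule we mean, with placeholder addresses and ports (not our
actual configuration):

  # On the appliance acting as a router: redirect web traffic headed for
  # a data-center server to the local proxy port.
  iptables -t nat -A PREROUTING -p tcp -d 10.0.0.5 --dport 80 \
           -j REDIRECT --to-ports 3128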
Testing On A WAN Without A WAN
Writing for a WAN is different from writing for a LAN. The TCP stack is
the same - the communications conditions are not.
NIST Net - a free-software patch to the Linux kernel - adds WAN simulation
capabilities (delaying packets, reducing bandwidth and adding random
packet drops).
It had very poor documentation and some annoying bugs - but for a tool
that valuable, we learned how to work around them.
Once it is set up and you've learned how to use it - changing the
configuration is very simple.
It required a dedicated host, but a single host can simulate several
WANs, so there was no need for an extra host per developer.
Automatic System Tests
When all you have is 4 developers and 0.5 testers - you need automatic
testing or else you die.
Our initial focus was on web applications. Users work with Internet
Explorer - so the tests had to use Internet Explorer, and hence run on
Windows.
Internet Explorer exposes a COM interface - which can be driven from
Perl.
The tester (who was also half a programmer) wrote a script to
automatically surf the required sites in a pattern we thought would
represent a "real user".
Several weeks of work, and he could run 3-4 Explorer instances on a
Windows machine (any more and the machine would crash).
There was something about connecting to a Windows telnet server in this
process - but I forget why ;)
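For flavor, the kind of Perl/COM glue this involves - a sketch assuming
Win32::OLE driving the InternetExplorer.Application object, not the
actual test script; the URLs and timing are placeholders:

  # Drive Internet Explorer through its COM interface with Win32::OLE.
  use strict;
  use Win32::OLE;

  my $ie = Win32::OLE->new('InternetExplorer.Application')
      or die "cannot start Internet Explorer\n";
  $ie->{Visible} = 1;

  for my $url ('http://intranet/app1', 'http://intranet/app2') {
      $ie->Navigate($url);
      sleep 1 while $ie->{Busy};   # crude wait for the page to load
      sleep 5;                     # "think time" of our pretend user
  }
  $ie->Quit();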
Epilogue - When a FOSS-Using Company Is Acquired By A Conglomerate
The company was eventually closed, and then sold to a very large
international software company.
Most of the due diligence had to do with... software licenses.
It appears that the lawyers of said company wanted to be absolutely sure
that any little thing used by the software had a license that is
acceptable for a proprietary product.
Apparently, that company was quite afraid of FOSS, and had a hard
time absorbing the idea.
On the other hand - it might have been an excuse to delay the purchase
until the next fiscal year.
Eventually, since the project was written using such "alien technologies",
the team has gained (at least so far) more freedom in its development
process than other teams inside that company.