Archives For December 2012

You deploy the application on your production environment and… surprise: it doesn’t work!

Antipattern: start from your dev environment and scale up by adding more servers, load balancers, DB replicas, etc…

While this seems a standard way to design and release a new application, it just postpones the issues you will face when you scale your infrastructure up to production.
Let me clarify: at first glance this can be confused with “over-design”, and we all know that is wrong. But you know the application will run in a production environment, serving real users, so you have to start designing from your production environment and solve the issues that arise there first (fail fast and learn fast, as always).

The recipe: scale down from production

Here is a list of design considerations that emerged from experience at my company.
One of our applications, based on PHP+MongoDB, runs on a “standard” production setup: one domain name resolves to 2 different IPs, allocated to 2 load balancers running HAProxy; the load balancers direct the traffic to 2 or more web servers running Apache+PHP, connected to a 3-node MongoDB replica set (no need for a shard yet). All the machines are virtual machines, built using RightScale templates.

Using this methodology, we can “scale down” the infrastructure simply by collapsing the 3 layers (LB, FrontEnd, BackEnd) into fewer virtual machines, while still using the same full stack: for instance, our “Integration Tests” environment runs the full stack on a single machine, but HAProxy and a MongoDB replica set are still included, even if not strictly required. The apparent complexity of this setup is simplified by using “Server Templates” and virtual machines, which can be instantiated from scratch with a single click of a button.

The development environment is a virtual machine too, running on each of our laptops; we are working to have the dev virtual machine built from a RightScale template as well, but at least we can share a “master” image with all the devs when something changes in the setup (an article on that is coming soon…).
This practice has saved our lives a few times:

  • Starting from v1.3, the MongoDB driver for PHP requires a different configuration for replica sets: we discovered that BEFORE going to the production environment (yes… we should read the change logs…)
  • We have cron jobs that must run as an “environment singleton” (just one instance of a job running in a given environment): again, we discovered this need BEFORE going to production and addressed it (a minimal sketch of the idea follows this list).
  • Because we have multiple front-end web servers in production, we had to address “session management”: the final decision was to design the application “sessionless”. If we had followed the classic approach, we would have had to address session management in the production environment in some way: storing sessions in the DB, using server affinity on the load balancer, etc… in summary: a more complex infrastructure.
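
To give an idea of what “environment singleton” means in practice, here is a minimal sketch of a guard wrapper that runs the real job only on the host designated for the current environment (the config file and its key are hypothetical placeholders, not our actual implementation):

#!/bin/bash
# run-singleton-cron.sh -- minimal sketch, not our actual implementation
# Runs the wrapped command only on the host designated as "cron master" for
# this environment; on every other host it exits silently.
# The file /etc/myapp/environment.conf and its cron_master_host key are placeholders.
cronMasterHost=$(awk -F'=' '/^cron_master_host/ {gsub(/ /, "", $2); print $2}' /etc/myapp/environment.conf)

if [ "$(hostname -s)" != "$cronMasterHost" ]; then
    # another server in this environment owns the cron jobs
    exit 0
fi

exec "$@"

Every cron entry then wraps its real command, for example: run-singleton-cron.sh /usr/bin/php /var/www/application/cron/nightly-report.php (again, a made-up example).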

Conclusions

Starting by designing your production environment and scaling it down is a practice that anticipates trouble and solves it at the “development” stage.
Using modern technologies like server virtualization and automated infrastructure provisioning simplifies this practice.


The first problem that you can have after a deployment is that… the configuration files for your software are somehow wrong: a wrong database password, a wrong connection URL, a wrong log file name… more or less every configuration parameter in your application can be wrong.

Antipattern: prepare a configuration file for the production environment, deploy the application and look at the logs.

Well, you know what happens now: the application sometimes fails to start and you have to spend your time digging through log files, while the application is down and your services are not available to users.

We found a simple solution for that: introduce a smoke test stage into the deployment pipeline.

The recipe to smoke test your deployment

First of all, what is a Smoke Test? According to Wikipedia,

a smoke test proves that “the pipes will not leak, the keys seal properly, the circuit will not burn, or the software will not crash outright”.

The idea is simple: read the configuration parameters and validate them in the actual environment. For instance:

  • search all the database connection parameters (hostname, database name, username, password, etc…) and try to make a real connection to that database.
  • search all the remote URLs your application uses (APIs, web services, etc…) and try to connect to them.
  • search all the folder and file names and check their existence and permissions.

You can just create a script in your preferred language, execute it before going live with the new release (remember Practice #3: Deploy the same way from dev to production environment) and abort the build in case of a test failure: your application stays at the current release and continues to be available to your users.

It’s very important to execute the smoke test script in the environment where your application runs, because you want to be sure that the resources (database, external APIs, etc.) are available from the same host where the application is installed; so, if you have multiple servers, run the smoke tests from all the servers in the array.
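
For instance, assuming parallel-ssh is installed (host names and the script path below are placeholders), a single command can launch the checks on every front end at once:

$ pssh -H root@web1 -H root@web2 -i 'bash /path/to/smoke-tests.sh'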

Here are some real how-tos, taken from our smoke test scripts.

How to smoke test a MongoDB connection

Here is an example script, invoked like this:

$ smoke-test-mongo-connection.sh srv1:27017,srv2:27017,srv3:27017

#!/bin/bash
# smoke-test-mongo-connection.sh
# Takes a comma-separated list of replica set members and checks that MongoDB
# answers on the first one with serverStatus().ok == 1
IFS=',' read -ra mongoConnections <<< "$1"
mongoConnection=${mongoConnections[0]}

mongoOutput=$(mongo "$mongoConnection" --eval 'status = db.serverStatus(); print(status["ok"]);' --quiet)
if [ "$mongoOutput" != "1" ]; then
    echo "MongoDB not reachable at: $mongoConnection"
    exit 1
fi

echo "Check configuration OK"

How to smoke test folder permissions

Just run a script like this:

$ smoke-test-log-permissions.sh path/to/my/logs theuser


#!/bin/bash
# smoke-test-log-permissions.sh
# Checks that the log path exists and is owned by the expected user
applicationLogPath=$1
applicationLogPathOwner=$2

# stat --format=%U prints the owner of the file (and fails if the path does not exist)
applicationLogPathOutput=$(stat --format=%U "$applicationLogPath")
if [ "$applicationLogPathOutput" != "$applicationLogPathOwner" ]; then
    echo "Owner of $applicationLogPath is wrong: expected $applicationLogPathOwner, found $applicationLogPathOutput"
    exit 1
fi

echo "Check configuration OK"

How to smoke test a URL

You can just execute a script like this:

$ smoke-test-url.sh http://api.example.com


#!/bin/bash
# smoke-test-url.sh
# Checks that the given URL answers and does not return a 4xx/5xx error
url=$1

statusOutput=$(curl --insecure --silent -o /dev/null -w '%{http_code}' "$url")
if [[ "$statusOutput" == "000" || "$statusOutput" == 4* || "$statusOutput" == 5* ]]; then
    echo "$url not reachable (HTTP status: $statusOutput)"
    exit 1
fi
echo "Check configuration OK"

Conclusions

As you can understand, I can show only a few rows of our deployment smoke test scripts here, but I think you get the idea:

  • Parse the configuration files
  • Try to open database connections, connect to URLs and write to files
  • Fail the build if any resource is not available
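
Putting it together, here is a minimal sketch of how the individual checks can be chained into one “smoke test stage” (the script names and parameters are just the placeholder examples from this post, not our real configuration):

#!/bin/bash
# run-all-smoke-tests.sh -- minimal sketch: run every check and stop at the first failure
set -e   # any check exiting with a non-zero status aborts the whole script (and the build)

./smoke-test-mongo-connection.sh srv1:27017,srv2:27017,srv3:27017
./smoke-test-log-permissions.sh /var/www/application/logs www-data
./smoke-test-url.sh http://api.example.com

echo "All smoke tests passed"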

I would like to see your comments and thoughts here.


This is what we are learning at the company I work for, where we are moving from an “svn update /var/www” anti-pattern to a deployment pipeline with 4 different environments, from Dev to Production.

Here you can find real solutions and how-tos that worked in our environment.

Practices that we follow for Continuous Deployment

As soon as we feel that we are suffering in some way, we try to find a Practice to mitigate that suffering. Here is what we have done so far:

Practice #1: Same software stack from dev to production environment
Mitigate: “it works on my PC!”

Practice #2: Dedicated Continuous Integration environment for unit test
Mitigate: “it works on my PC!”

Practice #3: Deploy the same way from dev to production environment
Mitigate: deployment to production fails

Practice #4: Smoke test the deployments
Mitigate: major configuration errors

Practice #5: Use Business Metrics (KPI) to validate deploy
Mitigate: deployment to production succeeded, but business was impacted

Practice #6: Automate Rollback
Mitigate: production downtime after a failed deploy

Practice #7: Keep deployment pipeline fast
Mitigate: too much time required to go to production

Practice #8: Scale down, not up
Mitigate: “it works on my PC!”

(to be updated soon…)



We found that this is the best way to mitigate the usual impedance mismatch between “devs” and “operations”: deployment to production WILL fail, even if you write the best documentation possible, with a full description of all the steps required, and even if the “operations” guys write some scripts to execute the procedures.

The real solution that worked in our environment was to create a single deploy script that is used to deploy the application in all the environments in the deployment pipeline, from dev to production environment.

The recipe to deploy the same way from dev to production environment

As our application stack is based on PHP running on Linux, it was natural for us to implement the deploy script using phing, scp and ssh.

The workflow is:

  • build an “artifact” (environment independent)
  • copy the artifact to the destination environment in a new folder, let’s say “application.233”
  • copy the configuration files required for the destination environment
  • set up all the required permissions on the destination environment
  • run the “migration” scripts to update the database, if required
  • move a symbolic link “application” from “application.232” to “application.233”
  • Apache will now see the new files and serve the new application
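
Stripped of phing and parallel-ssh, the per-host sequence above boils down to something like the following sketch (release number, paths and file owner are placeholders, not our real values):

#!/bin/bash
# per-host deploy sequence -- simplified sketch of the workflow above
RELEASE=233
BASE=/var/www

mkdir -p $BASE/application.$RELEASE
tar -xzf /tmp/application.tgz -C $BASE/application.$RELEASE                  # unpack the artifact
mkdir -p $BASE/application.$RELEASE/code/configs
cp /tmp/configs/production/*.ini $BASE/application.$RELEASE/code/configs/   # environment configs
chown -R www-data:www-data $BASE/application.$RELEASE                       # permissions
# ...run the database migration scripts here, if required...
# switch the symlink: Apache now serves the new release, the old one stays around for rollback
rm -f $BASE/application
ln -s $BASE/application.$RELEASE $BASE/application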

We deploy to all the environments using this same workflow, so even to deploy on the “localhost” machine we run “scp” to copy files and “ssh” to set up permissions, run migrations, etc… and because we have multiple servers in each environment, we use the “parallel-ssh” and “parallel-scp” utilities.

Let’s see some of the magic that we have in the phing build.xml file…

How to use parallel-scp to copy files to multiple servers

Here is the snippet from build.xml


<target name="exec-pscp">
 <exec
 command="pscp ${scp.host.remote} -O LogLevel=ERROR -O UserKnownHostsFile=/dev/null -O StrictHostKeyChecking=no -O IdentityFile=${ssh.privkeyfile} -t 0 ${file} '${todir}'"
 outputProperty="pscp.output"
 />
 <echo msg="pscp.output: ${pscp.output}" />
 <condition property="pscp.failed">
 <contains substring="FAILURE" string="${pscp.output}" />
 </condition>
 <fail if="pscp.failed" message="Remote copy failed: ${pscp.output}" />
 </target>


The property ${scp.host.remote} contains the list of the servers to deploy to, for instance “-H root@server1 -H root@server2”.

As you can see, we store the output of the pscp command in the property “pscp.output”, check the result of the operation and fail the build if required.
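
With those properties set, the command that actually runs looks roughly like this (host names, key path, artifact and destination folder are placeholders):

$ pscp -H root@server1 -H root@server2 \
       -O LogLevel=ERROR -O UserKnownHostsFile=/dev/null -O StrictHostKeyChecking=no \
       -O IdentityFile=/home/deploy/.ssh/id_rsa \
       -t 0 application.tgz '/var/www/application.233'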


How to use parallel-ssh to remotely execute commands on multiple servers

A similar trick is used to remotely execute commands on all the servers in the environment. Here is the snippet:


<target name="exec-pssh">
 <exec
 command="pssh ${ssh.host.remote} -O LogLevel=ERROR -O UserKnownHostsFile=/dev/null -O StrictHostKeyChecking=no -O IdentityFile=${ssh.privkeyfile} -P -i -t 0 '${ssh.command}'"
 outputProperty="pssh.output"
 />
 <echo msg="pssh.output: ${pssh.output}" />
 <condition property="pssh.failed">
 <contains substring="FAILURE" string="${pssh.output}" />
 </condition>
 <fail if="pssh.failed" message="Remote command failed: ${pssh.output}" />
 </target>


Again, the property ${ssh.host.remote} contains the list of the servers to deploy to, and we check the result of the operation by looking at the “outputProperty” of the command.

The deploy workflow

Here is an extract from the build.xml, showing the main workflow of the deploy.


<target name="deploy">
<echo msg="Copying ${artifact.tgz.path} archive to environment ${deploy.environment} hosts:${fs.host}" />
 <phingcall target="exec-pssh">
 <property name="ssh.host.remote" value="${fs.hosts.argument}" />
 <property name="ssh.command" value="rm -Rf ${fs.deploy.basedir.number}; mkdir -p ${fs.deploy.basedir.number}" />
 </phingcall>
 <phingcall target="exec-pscp">
 <property name="scp.host.remote" value="${fs.hosts.argument}" />
 <property name="todir" value="${fs.deploy.basedir.number}" />
 <property name="file"  value="${artifact.tgz.path}" />
 </phingcall>
 <echo msg="${artifact.tgz.name} copied to environment ${deploy.environment} host:${fs.host}" />
<!-- we unzip the application artifacts in a new fresh location,           -->
 <!-- set permissions and and move a symlink to this location.              -->
 <!-- in this way we have a local backup of previuos version, useful for    -->
 <!-- a fast rollback (just move the link to the previous version           -->
<echo msg="Estracting archive to ${deploy.basedir.number} on host:${fs.host}" />
 <phingcall target="exec-pssh">
 <property name="ssh.host.remote" value="${fs.hosts.argument}" />
 <property name="ssh.command"
 value="tar -xzf ${fs.deploy.basedir.number}/${artifact.tgz.name} -C ${fs.deploy.basedir.number};
 rm -f ${fs.deploy.basedir.number}/${artifact.tgz.name}" />
 </phingcall>
 <echo msg="copying configs from configs/${deploy.environment} to ${fs.deploy.basedir.number}/code/configs on host:${fs.host}" />
 <phingcall target="exec-pssh">
 <property name="ssh.host.remote" value="${fs.hosts.argument}" />
 <property name="ssh.command" value="mkdir -p ${fs.deploy.basedir.number}/code/configs" />
 </phingcall>
 <phingcall target="exec-pscp">
 <property name="scp.host.remote" value="${fs.hosts.argument}" />
 <property name="todir" value="${fs.deploy.basedir.number}/code/configs" />
 <property name="file"  value="configs/${deploy.environment}/*.ini" />
 </phingcall>
 <echo msg="done" />

…. more steps here…. and then


    <echo msg="================================================================================" />
    <echo msg="Final step: Linking ${deploy.basedir.number} to ${deploy.basedir}" />
    <echo msg="================================================================================" />
    <phingcall target="exec-pssh">
        <property name="ssh.host.remote" value="${ws.hosts.argument}" />
        <property name="ssh.command"
            value="rm -fR ${deploy.basedir};
                   ln -s ${deploy.basedir.number} ${deploy.basedir};
                   " />
    </phingcall>
    <echo msg="================================================================================" />
    <echo msg="Checking link..." />
    <phingcall target="check-link" />
    <echo msg="Linking OK!" />
    <echo msg="================================================================================" />
</target>

Conclusions

I can show only a few snippets of our build.xml here, but I think you can get the basics from what I have shown you:

  • Write a single deploy script that can be used to deploy in every environment in the same way
  • Start by solving the problems related to your production environment early (e.g.: multiple servers, minimizing downtime, easy rollback, etc.)
  • Always check the results of the operations you run and fail the build as required.

Feel free to share your comments and thoughts.
