Wednesday, June 29, 2011

Would you like to save users 1,000+ hours per year?

#SharePoint

If this grabbed your attention, then in keeping with the theme ‘Time and Energy matters’, this post looks at how important it is to performance test and optimise your SharePoint environment.

How about getting the load time for the home page of your intranet from an average 5 seconds to less than 1 second?

Introduction

Having worked with SharePoint for many years, I have done a number of performance testing engagements. After my most recent engagements, I thought it might be a good idea to share some of my thoughts and experiences, in the hope that they will help others improve the performance and reliability of their solutions for the benefit of their users.

Why would you be concerned about performance testing?

Performance testing has many aspects to consider, as you’ll see, but what happens if you don’t performance test your solutions at all?

First and foremost, I would suggest that if the users of the solution have a poor experience in terms of speed and availability, there will be a number of outcomes:

  • Poor perception of your solution (and maybe you)
  • Frustrated users
  • Potentially lower uptake and sustained usage
  • Loss of energy and time for users and your business

 

Let’s look at a very simplistic hypothetical scenario.

If I have 5000 users and on average each user opens the Intranet home page once per day to look for some information, and it takes 5 seconds to load the home page, then the total energy expended is 25,000 seconds, or 6.94 hours, per day (just for one page load!).

So if I can cut that page load time down to 1 second, the number drops to 5,000 seconds, or 1.39 hours, per day.

So that equates to a saving of around 111 hours every four weeks (assuming 20 working days) for a single page load, or well over 1,000 hours per year!

Clearly you would be aiming for far more than one page view per person per day, so you can see that there might be very significant savings to be made by investing some time and energy into making sure that the site performs as well as possible.

Of course it is not all about time saving. I suggest we should be aiming to ensure that the user perception of performance is very good. The happier the users are with the system, the more chance there is they will use it and get benefit from it, and consequently the more value your organisation will receive from its investment.

Technical issues

I have seen numerous technical issues uncovered through performance and load testing including:

  • 3rd party or custom web parts with memory leaks, causing frequent IIS app pool recycles
  • Infrastructure issues such as load balancer failures and server failures
  • Web parts that perform badly under load
  • Memory leaks in code
  • Server configuration issues

 

The path to better performance

 

MY TOOLKIT

These are the tools I use in a performance testing exercise. Everyone seems to have their favourites, so you might substitute some of them with an alternative.

  • Firefox: used with various plugins
  • Fiddler2: examining HTTP traffic to/from your workstation
  • Y-Slow: I use this in particular to look at the page payload with and without a primed cache
  • SharePoint developer dashboard: very useful for seeing how long SharePoint is taking to load a page and what it is doing under the covers
  • Selenium: used for executing UI tests; it can drive browsers through C# and other languages
  • Internet Explorer 8 or 9: the JavaScript debugger can be useful, although you can use Firefox for this
  • Visual Studio 2010 Ultimate: this edition provides the web testing tools for load testing

I would suggest that we break down the path to better performance into two sections:

1. Individual page load time: looking at the payload of the page, caching options, custom code and database queries.
2. Performance and reliability under stress: what happens to the page load time, resource utilisation and reliability when the farm is under load?

 

Individual Page Load

I recently went through two different performance testing exercises for new Intranets. In both cases, the target audience was more than 5,000 geographically dispersed people.

I undertook performance tests in a variety of ways using the tools mentioned above and saw very inconsistent results, with the page load time ranging from 2 seconds to 50 seconds. Clearly something was not well with our farm.

Page payload and caching

The first thing I did was to look at the payload of the page: what is being loaded, how long it is taking, and what is being cached.

First I used Y-Slow to determine the size of the page with and without a primed cache. This shows what the browser will load the first time and what it will load on subsequent visits once the cache is primed. I noticed that the primed cache page load size was bigger than I would have expected.

I used Fiddler to examine this further. The actual content of the page was not really a problem; I didn’t see any really large images that might cause an issue.

Two things I did notice though were:

  • whenever I pressed F5 on the browser, I saw a lot of HTTP 304 responses from the server. This is normal behaviour, but is quite different from when I just click a link on a page. Basically the browser is confirming with the server that the objects have not changed since the last time they were cached.
  • I noticed that some images that I would have expected to be cached were being served on every load and I was seeing an HTTP 200 message. When I examined these requests further, the items were being served from SharePoint libraries and for some reason had a cache expiry set in the past. I never worked out why SharePoint was doing this. However, when I enabled blob caching on the SharePoint server, this problem went away and those items were no longer being assigned the cache expiry headers. (Subsequently, this issue has re-appeared for some images so I am looking into it further.)
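
For reference, the blob cache is switched on by editing the web.config of the web application on each WFE. Here is a minimal sketch using the common default location and a trimmed file-type list, rather than whatever a particular farm would actually use:

    <!-- web.config on each web front end: enable the SharePoint BLOB cache -->
    <!-- maxSize is in GB; the path attribute lists the file types to cache -->
    <BlobCache location="C:\BlobCache\14"
               path="\.(gif|jpg|jpeg|png|css|js)$"
               maxSize="10"
               enabled="true" />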

Then I thought, perhaps I should look at the page output cache in SharePoint to try to further reduce the page load time. This allows the SharePoint server to cache the page and not reassemble it for every page load, thus cutting down on the server resources required.

One gotcha with this is personalised content on the page. On the home page for one organisation, we had personalised data from their Newsfeed; once I enabled page output caching this stopped working, so I had to switch it off again. The benefits of the page output cache are also more noticeable under load testing than for individual page loads.

I also used the developer dashboard to see what was happening server side during the page load and where the most work was being performed.
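
In case you have not come across it, the developer dashboard has to be enabled at the farm level first. This is the standard SharePoint 2010 API for switching it to OnDemand mode, run for example from a small console app with farm admin rights:

    // Enable the SharePoint 2010 developer dashboard in OnDemand mode,
    // so an icon on each page lets you toggle the timing information.
    using Microsoft.SharePoint.Administration;

    class EnableDashboard
    {
        static void Main()
        {
            SPDeveloperDashboardSettings settings =
                SPWebService.ContentService.DeveloperDashboardSettings;
            settings.DisplayLevel = SPDeveloperDashboardLevel.OnDemand;
            settings.Update();
        }
    }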

Asynchronous loading of data

One of the most important things to consider is the user’s perception of the page load time. If you have a lot of processing on a page which blocks it from loading immediately, consider whether you can load some of the content asynchronously. In other words, let the page load first and then populate some web parts and controls asynchronously, so that the user sees the page very quickly and only waits if they need the data that is loaded asynchronously.

Some examples of this might be calls to other systems for data or calls for data that is initially hidden by a tab or some other graphical element.

Some SharePoint 2010 web parts have an option to automatically load data asynchronously without any coding.
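
For custom web parts, a rough sketch of the pattern is shown below: render a lightweight placeholder straight away and fetch the real content once the page has loaded. The placeholder id and the /_layouts/NewsFeedData.aspx application page are purely hypothetical, there to illustrate the idea rather than show an actual implementation.

    // Hypothetical sketch: the web part renders instantly with a placeholder,
    // then pulls its real content from an application page once the page has loaded.
    using System.Web.UI;
    using System.Web.UI.WebControls.WebParts;

    public class AsyncNewsWebPart : WebPart
    {
        protected override void CreateChildControls()
        {
            // Only a lightweight placeholder is rendered, so the page is not blocked.
            Controls.Add(new LiteralControl("<div id=\"newsPlaceholder\">Loading news...</div>"));

            // After the page has loaded, fetch the content asynchronously and
            // drop it into the placeholder.
            const string script = @"
                function loadNewsAsync() {
                    var xhr = new XMLHttpRequest();
                    xhr.onreadystatechange = function () {
                        if (xhr.readyState === 4 && xhr.status === 200) {
                            document.getElementById('newsPlaceholder').innerHTML = xhr.responseText;
                        }
                    };
                    xhr.open('GET', '/_layouts/NewsFeedData.aspx', true);
                    xhr.send();
                }
                _spBodyOnLoadFunctionNames.push('loadNewsAsync');";

            Page.ClientScript.RegisterStartupScript(GetType(), "asyncNews", script, true);
        }
    }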

Fiddler can be very useful for helping to identify where a page load is being blocked by some call for data.

LATENCY IS THE ENEMY

So let’s say we have got our page load down to a pretty good level, and let’s also assume we are on a very fast link to the data centre. What happens for users on the other side of the world? This could easily turn into a long discussion about web acceleration and lots of other technology options, so for this post let’s just focus on the implications.

When I was testing a SharePoint solution hosted in Australia from client workstations in South America and the UK, the impact of the number of requests combined with latency was really driven home to me. For those locations, the latency was 300ms or more, which means a 0.3 second delay for EVERY request/response. If your page requires 100 requests because of all of the items on it, and those requests are made one after another, it might take a minimum of 30 seconds to load.

I started looking at a product called Aptimize which dynamically optimises pages with CSS sprites, reorganises JavaScript, compression etc. on the fly. It looks good, but I have not tried it in a production scenario yet. It does have some interesting tools and does provide a lot of insight though.

TIP: While doing some performance analysis from those international locations, I noticed that when using multi-lingual variations, you hit the root URL for the site, are then redirected to variationroot.aspx and then get redirected to the home URL for the language you are configured for. These redirects were taking up to 5 seconds to occur before the client even started to render the page. Therefore, if possible, set the default page URL in the browser at the locations to the full address such as http://sharepoint/en/pages/default.aspx rather than just http://sharepoint.

Why were we experiencing such inconsistent performance?

Even though we now had the files caching as expected, we still had very inconsistent performance. So I started to isolate individual web front end servers to rule out the load balancer. This was simply done by changing the hosts file on the client to point at a specific server rather than going through the load balancer. What I discovered was that one server worked quite well and consistently, and the other three did not.
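
As an illustration, the hosts file trick is just an entry like the one below (the IP address and host name here are made up); with this in place every request for the intranet host name goes straight to one WFE instead of through the load balancer:

    # C:\Windows\System32\drivers\etc\hosts on the test workstation
    # Send all requests for the intranet host name directly to a single WFE
    10.0.0.12    sharepoint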

With no errors being reported in any logs, we decided that perhaps it was the network. So we ran some file transfer tests between servers and monitored the throughput and performance. What we discovered was that the file transfers would ‘pause’ and restart sporadically.

It turned out that the servers had dual NICs with ‘teaming’ software installed. This was causing network issues on the servers and once we disabled the teaming, performance became consistent. I can only assume this is some sort of issue in the teaming software.

Another thing to consider is whether the network adapters are set to auto sense the speed. It is better to set them to a specific speed to ensure that you are getting maximum throughput.

Custom caching

So now we had consistent page load times of about 2.5 seconds for the home page. Not bad by my reckoning. However, given that this home page will be loaded every time a user opens a browser, I wanted to see what else we could do to improve the performance even further.

We looked at a number of the web parts on the home page and considered how often the data would need to change. Many of them would not change that frequently, but we didn’t want to only rely on the page output cache which typically refreshes every 3 minutes by default.

So we subclassed a number of web parts including the content queries and cached the information in the HttpRuntime.Cache.
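
As a minimal sketch of the caching side of that (the key name, the 30-minute duration and the BuildNewsHtml helper are illustrative, not the actual implementation):

    // Minimal sketch: wrap an expensive data call in HttpRuntime.Cache so that
    // repeated page loads on the same WFE reuse the result.
    using System;
    using System.Web;
    using System.Web.Caching;

    public static class HomePageCache
    {
        private const string NewsCacheKey = "HomePage_NewsItems";

        public static string GetNewsHtml()
        {
            // Return the cached fragment if an earlier page load already built it.
            string cached = HttpRuntime.Cache[NewsCacheKey] as string;
            if (cached != null)
            {
                return cached;
            }

            // Otherwise do the expensive work once and cache it for, say, 30 minutes.
            string html = BuildNewsHtml();
            HttpRuntime.Cache.Insert(
                NewsCacheKey,
                html,
                null,
                DateTime.UtcNow.AddMinutes(30),
                Cache.NoSlidingExpiration);

            return html;
        }

        public static void Reset()
        {
            // Used by the cache reset mechanism described below.
            HttpRuntime.Cache.Remove(NewsCacheKey);
        }

        private static string BuildNewsHtml()
        {
            // Placeholder for the real query, e.g. a content query against a list.
            return "<ul><li>News item</li></ul>";
        }
    }

The reset mechanism described next then only needs to remove these keys from the HttpRuntime cache again.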

Doing this allowed us to get the primed cache load time down to approximately 0.6 seconds!

This does of course introduce a new issue: what happens when the cached information must be updated right now?

The first answer we came up with was to intercept a query string of resetCache=1 in a control on a page and then remove the objects from the HttpRuntime cache. This worked well until we realised that in our medium farm, we have multiple WFEs and the current solution would only reset the cache on a single server.

So we needed a solution which would allow us to bypass the load balancer and go to every WFE and reset the cache.

The way we solved this was to write an Application page which would use .Net to get a list of all of the web servers in the farm and then call the resetCache on every server. This works well.
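
A rough sketch of that fan-out is below. SPFarm.Local.Servers and the WebFrontEnd role check are the standard API for enumerating the farm; the host addressing, page URL and resetCache handling are assumptions to illustrate the approach, not the actual application page:

    // Rough sketch: call the cache-reset query string on every web front end
    // directly, bypassing the load balancer. URL details are illustrative.
    using System;
    using System.Net;
    using Microsoft.SharePoint.Administration;

    public static class CacheReset
    {
        public static void ResetAllWebFrontEnds()
        {
            foreach (SPServer server in SPFarm.Local.Servers)
            {
                // Only target servers running the web front end role.
                if (server.Role != SPServerRole.WebFrontEnd)
                {
                    continue;
                }

                // Depending on your IIS bindings you may need to send the site's
                // real host header rather than relying on the server name responding.
                string url = string.Format(
                    "http://{0}/Pages/Default.aspx?resetCache=1", server.Address);

                using (WebClient client = new WebClient())
                {
                    client.UseDefaultCredentials = true;
                    client.DownloadString(url);
                }
            }
        }
    }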

Performance and reliability under stress

So now that we have the individual page load working the way we want, it is time to load test the farm to make sure that on launch day it won’t fail or perform so badly that people are disappointed with the site.

There are lots of posts and information about load testing sites out there, so I won’t try to re-write those. I will simply provide my perspective and experiences.

We used Visual Studio 2010 Ultimate which includes the web testing facilities to perform the load tests. You can download a trial edition if you want to give it a try.

I think some people get a bit confused about some terminology and what they actually want to test with a load testing exercise.

Let me suggest that initially with our test we want to put our infrastructure under as much stress as possible to identify any weak points.

Visual Studio 2010 Ultimate allows a maximum of 250 virtual users. Often the workstation you run it on may not have enough capacity to simulate that many users. You might find the CPU, RAM or network adapter simply can’t pump out enough requests to stress the environment adequately. So you might consider running multiple agents or simply run VS2010 on multiple machines at the same time.

You need to make sure that you record the performance counters for all of the servers in your infrastructure to understand where bottlenecks might occur. You should also watch for application pool recycles, which usually show up as server memory utilisation rising steadily until the app pool recycles, at which point it drops suddenly. This is often caused by a memory leak in a web part or some other code.
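
As an illustration (not the exact counter set from this exercise), typical counters worth capturing on the web and SQL servers include:

    \Processor(_Total)\% Processor Time
    \Memory\Available MBytes
    \ASP.NET\Requests Queued
    \ASP.NET Applications(__Total__)\Requests/Sec
    \Web Service(_Total)\Current Connections
    \SQLServer:Buffer Manager\Buffer cache hit ratio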

I suggest testing for at least 10 to 15 minutes at a time. Don’t use think time when you configure your load test: what we want to understand is the maximum requests per second our farm can deal with while still responding within an acceptable time frame.

Bill Baer’s post on RPS is quite useful. I used it to estimate how many requests per second our farm needed to handle to deal with the expected peak load for 5000 users. The estimate worked out to about 80 RPS in our case.

Another consideration is to use multiple user accounts during your test. You can link a CSV of user names and passwords to the test. This can be important because page and caching behaviour may differ depending on the permissions the users have on the site.
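
The CSV is just a simple list bound to the web test as a data source; something like the following, with the account names obviously made up:

    username,password
    CONTOSO\loadtest01,Pass@word1
    CONTOSO\loadtest02,Pass@word1
    CONTOSO\loadtest03,Pass@word1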

So in our first test, when we hammered the servers as hard as possible, we broke the load balancer. This was a good result; much better to do it now than on launch day!

Once that was resolved, we tweaked our load tests and then started to build up the number of ‘users’ to work out where the threshold of RPS vs. page load time was. It worked out in this case that it was about 100 RPS with a 3 second page load. This means that we should be able to cope with the expected peak load and have some additional head room.

We also reviewed the performance of our SQL environment. Two changes we made to the SQL configuration were:

  • Create one tempdb data file per processor core
  • Increase the RAM to 48GB

After doing this, we re-ran the tests and saw another increase in performance.

Client Testing (including JavaScript)

It is not necessarily obvious, but Visual Studio does NOT execute JavaScript during a web/load test; it is just sending and receiving HTTP requests.

If you want to do some UI testing where JavaScript does execute, one way of doing this is to use Selenium. This is something I have only tried recently, but it works well. There are a number of ways you can use it: I started with the Firefox plug-in, where you can record and play back a scenario, and then moved to using the .Net API to drive Firefox/IE/Chrome. It is definitely something I will continue to look into in the future.
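
To give a feel for the .Net API, here is a minimal sketch that drives Firefox to the home page and times how long it takes for a known element to appear; the URL and element id are assumptions for illustration:

    // Minimal sketch: drive Firefox with the Selenium WebDriver .Net bindings
    // and time how long the intranet home page takes to render.
    using System;
    using System.Diagnostics;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Firefox;

    class HomePageTimingTest
    {
        static void Main()
        {
            IWebDriver driver = new FirefoxDriver();
            try
            {
                Stopwatch stopwatch = Stopwatch.StartNew();
                driver.Navigate().GoToUrl("http://sharepoint/en/pages/default.aspx");

                // Wait for a known element on the home page, i.e. the page has rendered.
                driver.FindElement(By.Id("s4-bodyContainer"));
                stopwatch.Stop();

                Console.WriteLine("Home page rendered in {0} ms", stopwatch.ElapsedMilliseconds);
            }
            finally
            {
                driver.Quit();
            }
        }
    }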

I heard Chris O’Brien talking about Hammerhead on the SharePoint Pod Show which also allows you to measure page load times including running JavaScript.

Conclusion

Every project and scenario is different, so your experience and process might differ from mine, but my message to you is that it is really important to consider testing and optimising your SharePoint solution. Ensure that users’ perception of the solution’s performance does not get in the way of adoption; respect your users’ time and energy and don’t waste it unnecessarily.

Useful links:

SharePoint Pod Show episodes

1 comment:

David Marsh said...

This is not directly related to SharePoint, but these are certainly practices that should be followed if you really want performance to be a feature of your SharePoint environment. Performance is a feature. Some other great ways to boost SharePoint performance are to minimise the number of HTTP requests through sprite images, minifying scripts, and offloading static files like images, CSS and scripts to a separate sub-domain so requests can run in parallel. Base64 encoding scripts and images adds ~30% to the page size but removes the need for extra HTTP requests, which could save more time in the long run.