E-commerce and Reliable Systems

Table of Contents

  1. Presentation Objectives
  2. E-Commerce Landscape
  3. Customer Landscape
  4. Behavior Definitions
  5. System Redundancy
  6. Customer Load Balancing
  7. Transient TCP Connections
  8. Persistent TCP Connections
  9. Application Threads of Execution
  10. Application Thread Model
  11. Instrumentation
  12. Prototypical Distributed Computing Model
  13. Resource Exhaustion
  14. Prototypical Tiered Computing Model
  15. Performance and Capacity
  16. Typical User and System Histograms
  17. TPS Requirements Model
  18. Conclusions



Presentation Objectives

This lecture described a successful e-commerce architecture, including its components and its system and application behavior. The speaker stressed designing the system as a whole and discussed the stability of e-commerce systems. The key point was that a completely stable e-commerce system is difficult to achieve but well worth the effort.



E-Commerce Landscape

Individuals and businesses that participate in e-commerce are interested in buying or selling goods or services electronically. Each participant comes to the Internet marketplace with a set of objectives that they want met. When using an e-commerce system, participants tend to remember only the bad parts of their experience. For this reason, it is important that all parts of the e-commerce system function properly. As the speaker indicated, an Internet marketplace is only as good as its weakest link.

Part of what blocks participants from successful experiences with an e-commerce system is a lack of understanding of the component technologies that make up the system. Each system is composed of a collection of components. Each service of the system has a different level of stability associated with it. The speaker emphasized that the responsibility for creating a stable Internet marketplace falls on those individuals who are empowered to cause the necessary changes.



Customer Landscape

There are several factors affecting a customer's ability to interact with an Internet marketplace. The first such factor is the customer's connection to the Internet. In order to access the Internet, the customer must be able to connect to an Internet Service Provider. Once connected, the customer will need to traverse the Internet from one Internet Service Provider to others.

Other issues that the customer must deal with include firewall and proxy support (the ability to open sockets and route traffic through secure intermediaries) and Internet latency (packets sent on the Internet pass through ISP routers, which introduce some delay). Lastly, the customer must have a web browser that supports all the functionality the system requires (which may include Java applets, JavaScript, DHTML, etc.).



Behavior Definitions

The speaker introduced the following terminology and definitions:



System Redundancy

In order to ensure that a system has maximum uptime (availability) and reliability, it is very important to have redundancy in the system design. This can be achieved using these three methods:



Customer Load Balancing

Customer load balancing is very important because it achieves two effects. First, it prevents any one server from becoming over-utilized while another sits under-utilized. Second, if a system fails, the load balancer can send the customer to a different system. In the diagram, each site has 3 server clusters, each of which would consist of several machines.
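
The two effects described above can be illustrated with a minimal round-robin sketch. This is not the speaker's implementation: the class name, the cluster names, and the choice of round-robin as the balancing policy are all assumptions made for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked as failed."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        # Called when a health check (not shown) decides a server has failed.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Look at each server at most once per call.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


# Hypothetical cluster names.
balancer = RoundRobinBalancer(["cluster-a", "cluster-b", "cluster-c"])
first_three = [balancer.pick() for _ in range(3)]    # load is spread evenly
balancer.mark_down("cluster-b")                      # simulate a failure
after_failure = [balancer.pick() for _ in range(4)]  # cluster-b is skipped
```

Real load balancers usually weigh server load and run their own health checks; the sketch only shows even distribution and failover.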



Transient TCP Connections

With transient TCP connections, each connection is established and then closed immediately after the request is fulfilled. With this method, a small number of sockets can serve a large number of requests, provided the requests are short enough to keep up with the arrival rate. One problem with this type of connection is the overhead of the SYN and ACK handshake needed to initiate a TCP connection. With transient connections, this handshake must be repeated for every request. For example, an HTML page with 40 images may require 40 transient TCP connections (assuming HTTP keep-alive is not in place), which requires a great number of SYN and ACK messages. Another problem is that socket address limitations may cause unavailability.
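
The cost of repeating the handshake can be estimated with simple arithmetic. The sketch below assumes a three-segment handshake (SYN, SYN-ACK, ACK) and a four-segment orderly close; it ignores retransmissions, combined segments, and parallel browser connections, and the function name is hypothetical.

```python
HANDSHAKE_SEGMENTS = 3  # SYN, SYN-ACK, ACK
TEARDOWN_SEGMENTS = 4   # FIN, ACK, FIN, ACK (assumed orderly close)


def connection_overhead_segments(requests, requests_per_connection=1):
    """Segments spent only on opening and closing connections."""
    # Ceiling division: how many connections are needed for all requests.
    connections = -(-requests // requests_per_connection)
    return connections * (HANDSHAKE_SEGMENTS + TEARDOWN_SEGMENTS)


# The 40-image page from the text, one connection per image.
transient = connection_overhead_segments(40)
# The same 40 requests reusing a single kept-alive connection.
kept_alive = connection_overhead_segments(40, requests_per_connection=40)
```

Under these assumptions the transient case spends 280 segments on connection setup and teardown, against 7 when one connection is reused.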



Persistent TCP Connections

With persistent TCP connections, a connection is established, the request is fulfilled, and the connection stays open afterwards until the thread holding it is terminated. For this reason, the number of requests that can be handled at once is equal to the number of threaded sockets available. Thread limitation, therefore, may cause unavailability. A persistent connection would be used for something like a stock ticker, which requires constant updating (and would therefore be very inefficient with transient connections).
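
The thread limitation can be sketched as a fixed pool of connection slots. This is an illustration only: the class name and the limit of two threads are assumptions, and a real server would attach each slot to an actual socket and worker thread.

```python
import threading


class ConnectionSlots:
    """One slot per server thread; a persistent client holds its slot until it disconnects."""

    def __init__(self, max_threads):
        self._slots = threading.BoundedSemaphore(max_threads)

    def try_open(self):
        # Non-blocking: returns False when every thread is already tied up.
        return self._slots.acquire(blocking=False)

    def close(self):
        # The client disconnected, so its thread becomes available again.
        self._slots.release()


slots = ConnectionSlots(max_threads=2)           # hypothetical limit
results = [slots.try_open() for _ in range(3)]   # the third client is refused
slots.close()                                    # one client disconnects
accepted_after_close = slots.try_open()          # a new client now gets in
```

The refused third client is the unavailability the text describes: nothing is broken, but every thread is occupied by an open connection.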



Application Threads of Execution

Several issues are introduced when dealing with application threads. These include:



Application Thread Model

This diagram shows the number of threads used by different portions of a system. Notice that the diagram is split into web server, back end, etc.



Instrumentation

Dynamic debug levels - Run-time setting of server software to log different levels of detail about what is happening inside the program
Concurrency levels - The number of active connections at the server
Operating system statistics - Low-level system internals such as CPU activity, RAM use, disk activity, number of active threads
Response times - The time taken to respond to various request types within a fixed time period
Request type profiling and histograms - Charts and analysis of quantities and arrival rates of various request types
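
As an illustration of dynamic debug levels, the sketch below changes a logger's level while the program is running, using Python's standard logging module. The logger name, the helper function, and the in-memory handler are assumptions for the example; in a real server the level change would typically be triggered by an administrative command or signal.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted messages in memory so the effect is easy to see."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log = logging.getLogger("storefront.demo")  # hypothetical logger name
log.propagate = False                       # keep the demo output self-contained
capture = ListHandler()
log.addHandler(capture)


def set_debug_level(level_name):
    """Change the amount of logged detail without restarting the server."""
    log.setLevel(getattr(logging, level_name.upper()))


set_debug_level("warning")
log.debug("cart contents: %s", "sku-1")  # suppressed at WARNING level
set_debug_level("debug")
log.debug("cart contents: %s", "sku-1")  # recorded once DEBUG is enabled
```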



Prototypical Distributed Computing Model

This is a diagram of Fidelity's network. The gold boxes on the left labeled "1", "2", and "3" are the stock exchanges themselves, and the gold box just to the right of those is an enormously powerful machine that is Fidelity's link to the exchanges. The light blue boxes represent Fidelity's web servers, to which the purple boxes next to them multicast data.



Resource Exhaustion

There are many types of resource exhaustion or many events that can cause resource exhaustion. Listed below are some examples:


Too many active processes to allocate CPU time
No free TCP sockets
No free threads of execution
No free memory or swap space (swap space used to be a unique feature of Unix, but NT now implements it)
No free disk space
No free file descriptors
No free network bandwidth
Too many scheduled processes
Threads and file descriptors are the top failures
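
Since file descriptors are named as a top failure, a monitoring check might look like the following sketch. The 90 percent warning threshold and the function names are assumptions; the limit lookup uses Python's `resource` module, which is available only on Unix-like systems, so the sketch returns None elsewhere.

```python
def remaining(limit, in_use):
    """How many more of a resource can be allocated before it runs out."""
    return max(limit - in_use, 0)


def is_near_exhaustion(limit, in_use, threshold=0.9):
    """True once usage reaches the given fraction of the limit (assumed 90%)."""
    return in_use >= threshold * limit


def fd_soft_limit():
    """Soft limit on open file descriptors, or None if it cannot be read."""
    try:
        import resource  # Unix-only standard library module
    except ImportError:
        return None
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return None if soft == resource.RLIM_INFINITY else soft


# Hypothetical numbers: a 1024-descriptor limit with 950 already open.
descriptors_left = remaining(1024, 950)
should_alert = is_near_exhaustion(1024, 950)
```

The same two helper functions apply to any counted resource in the list above, such as threads or sockets.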



Prototypical Tiered Computing Model

This is an example of a tiered computing model which can be used to show which types of machines are used for certain functions within an organization. The numbers in the boxes represent the number of computers represented by that box.



Performance and Capacity

Little's Law

n = number of customers in the system (in-flight requests)
t = mean time customers are in the system (response time)
r = customer arrival rate (new requests)

Max number of queued in-flight requests
n = t * r

Max number of new requests per second
r = n / t

Moral - Given a finite set of resources (a fixed n), the slower the response time, the fewer new requests per second can be supported, since r = n / t

Choice - Use fewer resources versus provide more resources
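
A short worked example of Little's Law follows. The limit of 200 in-flight requests and the two response times are hypothetical figures chosen only to make the arithmetic visible.

```python
def in_flight(response_time_s, arrival_rate_per_s):
    """n = t * r: requests in the system at any moment."""
    return response_time_s * arrival_rate_per_s


def max_arrival_rate(max_in_flight, response_time_s):
    """r = n / t: new requests per second a fixed n can sustain."""
    return max_in_flight / response_time_s


# Assume the system can hold at most 200 in-flight requests.
fast = max_arrival_rate(200, 0.5)  # 0.5 s response time
slow = max_arrival_rate(200, 2.0)  # 2.0 s response time
```

With the same resources, the 0.5-second system sustains 400 new requests per second while the 2-second system sustains only 100, which is the moral stated above.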



Typical User and System Histograms

These graphs show the number of hits that Fidelity's website receives over the course of a day. The x axis represents the time of day. Notice that the site gets some hits even at times when most people are asleep. This is due in part to the fact that Fidelity regularly hits its own site to test it. Around 10 percent of the site's hits are from internal tests.



TPS Requirements Model

Goal: Maintain 4:1 Headroom @ 0-8 second delay

Middle numbers inside the bubbles represent total capacity in WEB equivalent TPS for all bubbles ahead of MIDDLE and MIDDLE VPS for all bubbles behind MIDDLE. Bottom numbers inside the bubbles represent the headroom for a load of 180 TPS and 60,000 users. Links connecting components show the TPS (black) or VPS (blue) for traffic flowing between components based on a load of 180 WEB TPS and 60,000 users. Underscored numbers are estimated.
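
The headroom goal can be checked with a small calculation. The sketch assumes that headroom means the ratio of a component's capacity to the load placed on it; the notes do not define the term, so this reading and the 540 TPS component below are assumptions.

```python
TARGET_HEADROOM = 4.0  # the 4:1 goal stated above
LOAD_TPS = 180         # the web load used in the diagram


def headroom(capacity_tps, load_tps):
    """Assumed definition: capacity divided by the load being carried."""
    return capacity_tps / load_tps


def required_capacity(load_tps, target=TARGET_HEADROOM):
    """Capacity a component needs in order to meet the headroom goal."""
    return load_tps * target


needed = required_capacity(LOAD_TPS)  # capacity needed at 180 TPS
example = headroom(540, LOAD_TPS)     # hypothetical 540 TPS component
meets_goal = example >= TARGET_HEADROOM
```

Under this reading, a component carrying the full 180 TPS needs 720 TPS of capacity, and a 540 TPS component has only 3:1 headroom.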



Conclusions

The entire system must be available when customers want to do business on E-Commerce sites.
The success of E-Commerce sites is affected by Performance, Capacity, Availability, Security, Supportability, Usability, Maintainability, Extensibility, Feature Set and Cost.
Do not wait until the end of the process to consider the whole problem space: do it now!
Do not sacrifice quality and stability for speed of delivery: the customer will pay the price, and will not return to your site.
There is no better qualified person than you to work through and solve E-Commerce stability problems.
