E-commerce and Reliable Systems

Table of Contents

Presentation Objectives
E-Commerce Landscape
Customer Landscape
Behavior Definitions
System Redundancy
Customer Load Balancing
Transient TCP Connections
Persistent TCP Connections
Application Threads of Execution
Application Thread Model
Instrumentation
Prototypical Distributed Computing Model
Resource Exhaustion
Prototypical Tiered Computing Model
Performance and Capacity
Typical User and System Histograms
TPS Requirements Model
Conclusions

Presentation Objectives

This lecture described a successful e-commerce architecture, including system and application behavior and components. The speaker stressed the design of the system as a whole and discussed the stability of e-commerce systems. The key point was that a completely stable e-commerce system is difficult to achieve but well worth the effort.

E-Commerce Landscape

Individuals and businesses that participate in e-commerce are interested in buying or selling goods or services electronically. Each participant comes to the Internet marketplace with a set of objectives that they want met. When using an e-commerce system, participants tend to remember only the bad parts of their experience. For this reason, it is important that all parts of the e-commerce system function properly. As the speaker indicated, an Internet marketplace is only as good as its weakest link.

Part of what blocks participants from successful experiences with an e-commerce system is a lack of understanding of the component technologies that make up the system. Each system is composed of a collection of components. Each service of the system has a different level of stability associated with it. The speaker emphasized that the responsibility for creating a stable Internet marketplace falls on those individuals who are empowered to cause the necessary changes.

Customer Landscape

There are several factors affecting a customer's ability to interact with an Internet marketplace. The first such factor is the customer's connection to the Internet. In order to access the Internet, the customer must be able to connect to an Internet Service Provider. Once connected, the customer will need to traverse the Internet from one Internet Service Provider to others.

Other issues that the customer must deal with include firewalls and proxy support (the ability to open sockets and route traffic through secure intermediaries) and Internet latency (packets sent on the Internet traverse ISP routers, which introduce some delay). Lastly, the customer must have a web browser that supports all the necessary functionality of the system (which may include Java applets, JavaScript, DHTML, etc.).

Behavior Definitions

The speaker introduced the following terminology and definitions:

Availability: the ability to always provide a response to a request
Supportability: the ability to administer, monitor and disable services
Performance: the ability to respond to a request in a timely manner
Capacity: the ability to provide responses to all simultaneous requests made to a site
Usability: the ability for customers to easily navigate to the desired target
Maintainability: the ability to install and upgrade services
Extensibility: the ability to add features as the services evolve
Feature Set: the ability to provide the services customers request

System Redundancy

In order to ensure that a system has maximum uptime (availability) and reliability, it is very important to have redundancy in the system design. This can be achieved using these three methods:

Multi-Site Design: A good system should have identical systems installed in various geographical locations, and distribute the customers among these installations.
Multi-System Design: A good system should also have multiple systems of the same type for the customer to access. This ensures that if a particular system fails at a given location, a backup system can compensate.
Data Replication: Data replication can be performed using one read-write master and any number of replicas. This introduces some latency, however, because it takes time to synchronize the systems.
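
The replication scheme described above can be illustrated with a minimal sketch. The class name ReplicatedStore and its methods are hypothetical and chosen only for illustration; real replication happens over a network, while this version keeps everything in memory to show why a replica can briefly return stale data.

```python
class ReplicatedStore:
    """One read-write master plus read-only replicas (in-memory sketch)."""

    def __init__(self, num_replicas):
        self.master = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self._pending = []  # writes accepted by the master but not yet copied

    def write(self, key, value):
        # All writes go to the single master.
        self.master[key] = value
        self._pending.append((key, value))

    def read(self, key, replica_index):
        # Reads are served by a replica and may lag behind the master.
        return self.replicas[replica_index].get(key)

    def synchronize(self):
        # Copy pending writes to every replica; the time between
        # write() and synchronize() is the replication latency.
        for key, value in self._pending:
            for replica in self.replicas:
                replica[key] = value
        self._pending.clear()
```

Until synchronize() runs, a read from any replica does not see the latest write, which is the latency the speaker referred to.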

Customer Load Balancing

Customer load balancing is very important because it achieves two effects. The first is that it prevents any given server from becoming over-utilized while another system is under-utilized. The second is that if a system fails, the load balancer can send the customer to a different system. In the diagram, each site has 3 server clusters, each of which would consist of several machines.
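
Both effects can be seen in a small round-robin sketch. The lecture did not say which balancing policy was used, so round-robin is an assumption here, and the class and method names are hypothetical.

```python
class RoundRobinBalancer:
    """Rotate requests across servers, skipping any marked as failed."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server in the rotation

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once, starting at the rotation point.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")
```

Calling pick() repeatedly spreads customers evenly, and after mark_down() the failed server is simply skipped.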

Transient TCP Connections

With transient TCP connections, each connection is established and then closed immediately after the request is fulfilled. With this method, a small number of sockets can serve a large number of requests, provided the requests are short enough to keep up with the arrival rate. One problem with this type of connection is the overhead of the SYN and ACK handshake needed to initiate a TCP connection. With transient connections, this process must be repeated for every request. For example, an HTML page with 40 images may require 40 transient TCP connections (assuming HTTP keep-alive is not in place), which requires a great number of SYN and ACK messages. Another problem is that socket address limitations may cause unavailability.
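
The handshake overhead in the 40-image example can be estimated with a short calculation. This sketch counts only the three segments of the standard TCP handshake (SYN, SYN-ACK, ACK) and ignores connection teardown and retransmissions; the function name and the assumption that requests are spread evenly over connections are illustrative.

```python
HANDSHAKE_SEGMENTS = 3  # SYN, SYN-ACK, ACK

def handshake_segments(num_requests, requests_per_connection=1):
    """Segments spent on TCP handshakes to serve num_requests requests."""
    # Ceiling division: connections needed to carry all requests.
    connections = -(-num_requests // requests_per_connection)
    return connections * HANDSHAKE_SEGMENTS

transient = handshake_segments(40)                              # one connection per image
reused = handshake_segments(40, requests_per_connection=40)     # one shared connection
print(transient, reused)
```

Under these assumptions, 40 transient connections spend 120 segments on handshakes, while a single reused connection spends 3.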

Persistent TCP Connections

With persistent TCP connections, a connection is established, the request is fulfilled, and the connection stays open afterwards. The connection stays open until the thread's connection is terminated. For this reason, the number of requests that can be handled is equal to the number of threaded sockets available. Thread limitations, therefore, may cause unavailability. A persistent connection would be used for something like a stock ticker, which requires constant updating (and would therefore be very inefficient with transient connections).

Application Threads of Execution

Several issues are introduced when dealing with application threads. These include:

Race Condition: this condition occurs when one thread changes shared resources and causes another thread to fail
Deadly Embrace: this condition occurs when two threads block and wait for each other to complete. Because each thread is waiting on the other, neither one will complete.
Exhaustion: this condition occurs when all threads are waiting to complete and no new threads are available
Starvation: this condition occurs when attempts are made to start a new thread when no new threads are available
Pooling: this condition occurs when a larger number of thread requestors successfully share a smaller number of actual threads
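
Two of the items above, pooling and race conditions, can be illustrated together. This is a minimal sketch using Python's standard thread pool, not the system described in the lecture; the names Counter and run_requests are hypothetical. Many requests share a small pool of threads, and a lock protects the shared counter so that concurrent updates do not interfere.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Counter:
    """Shared counter guarded by a lock to avoid a race condition."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # only one thread updates the value at a time
            self.value += 1

def run_requests(num_requests, pool_size):
    """Serve num_requests tasks using at most pool_size threads."""
    counter = Counter()
    worker_names = set()
    names_lock = threading.Lock()

    def handle(_request_id):
        counter.increment()
        with names_lock:
            worker_names.add(threading.current_thread().name)

    # Pooling: many requestors share a small number of actual threads.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(handle, range(num_requests)))
    return counter.value, len(worker_names)
```

For example, run_requests(200, 4) handles all 200 requests while never using more than 4 threads.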

Application Thread Model

This diagram shows the number of threads used by different portions of a system. Notice that the diagram is split into web server, back end, etc.

Instrumentation

Dynamic debug levels - Run-time setting of server software to log different levels of detail about what is happening inside the program
Concurrency levels - The number of active connections at the server
Operating system statistics - Low-level system internals such as CPU activity, RAM use, disk activity, number of active threads
Response times - The time taken to respond to various request types within a fixed time period
Request type profiling and histograms - Charts and analysis of quantities and arrival rates of various request types
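
Two of the items above, dynamic debug levels and response-time histograms, can be sketched with the Python standard library. The logger name "server", the function names and the 100 ms bucket size are assumptions made for illustration.

```python
import logging
from collections import Counter

logger = logging.getLogger("server")  # hypothetical logger name

def set_debug_level(level_name):
    """Change log verbosity at run time, e.g. 'DEBUG' or 'WARNING'."""
    logger.setLevel(level_name.upper())

def response_time_histogram(samples_ms, bucket_ms=100):
    """Count response times per bucket; keys are bucket lower bounds in ms."""
    return Counter((int(ms) // bucket_ms) * bucket_ms for ms in samples_ms)
```

An operator could call set_debug_level("debug") while investigating a problem and restore "warning" afterwards, without restarting the server.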

Prototypical Distributed Computing Model

This is a diagram of Fidelity's network. The gold boxes on the left labeled "1", "2", and "3" are the stock exchanges themselves, and the gold box just to the right of those is an enormously powerful machine that is Fidelity's link to the exchanges. The light blue boxes represent Fidelity's web servers which are multicasted by the purple boxes next to them.

Resource Exhaustion

Many kinds of events can cause resource exhaustion. Some examples are listed below:

Too many active processes to allocate CPU time
No free TCP sockets
No free threads of execution
No free memory or swap space (swap space used to be a unique feature of Unix, but NT now implements this)
No free disk space
No free file descriptors
No free network bandwidth
Too many scheduled processes
In practice, threads and file descriptors are the most common points of failure
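
Some of these limits can be inspected from a running process. The following is a minimal sketch for Unix-like systems only (the resource module is not available on Windows); the function names are hypothetical, and a real monitoring setup would sample these values periodically.

```python
import resource  # Unix-only standard library module
import shutil

def fd_limit():
    """Return the (soft, hard) limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def disk_free_fraction(path="/"):
    """Return the fraction of disk space still free at path (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total if usage.total else 0.0
```

Comparing the number of descriptors in use against the soft limit from fd_limit() gives early warning before the "no free file descriptors" condition is reached.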

Prototypical Tiered Computing Model

This is an example of a tiered computing model which can be used to show which types of machines are used for certain functions within an organization. The numbers in the boxes represent the number of computers represented by that box.

Performance and Capacity

Little's Law

n = number of customers in the system (in-flight requests)
t = mean time customers are in the system (response time)
r = customer arrival rate (new requests per second)

Max number of queued in-flight requests
n = t * r

Max number of new requests per second
r = n / t

Moral - Given a finite set of resources (a fixed n), the slower the response time, the lower the rate of new requests that can be supported

Choice - Use fewer resources versus provide more resources
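
A worked example makes the moral concrete. The figures below (200 request slots, response times of 0.5 and 2 seconds) are illustrative assumptions, not numbers from the lecture.

```python
def in_flight(response_time_s, arrival_rate_per_s):
    """Little's Law: n = t * r."""
    return response_time_s * arrival_rate_per_s

def max_arrival_rate(max_in_flight, response_time_s):
    """Rearranged: r = n / t."""
    return max_in_flight / response_time_s

# Assume the system can hold 200 in-flight requests (e.g. 200 threads).
fast = max_arrival_rate(200, 0.5)  # 400 new requests per second
slow = max_arrival_rate(200, 2.0)  # 100 new requests per second
print(fast, slow)
```

With the same 200 slots, quadrupling the response time from 0.5 s to 2 s cuts the supportable arrival rate from 400 to 100 requests per second.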

Typical User and System Histograms

These graphs show the number of hits that Fidelity's website receives over the course of a day. The x axis represents the time of day. Notice that the site gets some hits even at times when most people are asleep. This is due in part to the fact that Fidelity regularly hits its own site to test it. Around 10 percent of the site's hits come from internal tests.

TPS Requirements Model

Goal: Maintain 4:1 Headroom @ 0-8 second delay

Middle numbers inside the bubbles represent total capacity in WEB equivalent TPS for all bubbles ahead of MIDDLE and MIDDLE VPS for all bubbles behind MIDDLE. Bottom numbers inside the bubbles represent the headroom for a load of 180 TPS and 60,000 users. Links connecting components show the TPS (black) or VPS (blue) for traffic flowing between components based on a load of 180 WEB TPS and 60,000 users. Underscored numbers are estimated.
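
The headroom goal can be expressed as simple arithmetic. This sketch assumes that "headroom" means the ratio of a component's capacity to its current load; the lecture notes do not define the term precisely, so that interpretation and the function names are assumptions.

```python
def headroom(capacity_tps, load_tps):
    """Ratio of capacity to load (assumed meaning of 'headroom')."""
    return capacity_tps / load_tps

def required_capacity(load_tps, target_headroom=4.0):
    """Capacity needed to keep the target headroom at a given load."""
    return load_tps * target_headroom

print(required_capacity(180))  # capacity needed for 4:1 headroom at 180 TPS
```

Under this interpretation, a component carrying the full 180 TPS load would need a capacity of 720 TPS to meet the 4:1 goal.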

Conclusions

The entire system must be available when customers want to do business on E-Commerce sites.
The success of E-Commerce sites is impacted by Performance, Capacity, Availability, Security, Supportability, Usability, Maintainability, Extensibility, Feature Set and Cost.
Do not wait until the end of the process to consider the whole problem space: do it now!
Do not sacrifice quality and stability for speed of delivery: the customer will pay the price, and will not return to your site.
There is no better qualified person than you to work through and solve E-Commerce stability problems.