Parallel and Distributed Databases: Why?
- Performance: using several resources (CPU, disks) in parallel
significantly improves performance
- Availability: if site containing relation goes down, relation continues
to be available if a copy is maintained at another site
- Distributed access to data: although analysts may need to access
data stored at different sites, access patterns usually exhibit locality;
locality can be exploited by distributing data accordingly
- Analysis of distributed data: organizations/applications increasingly
want to examine all data available to them even if stored
across multiple sites and multiple database systems
Parallel Database system: seeks to improve performance via
parallel implementation of operations; any distribution of data governed
solely by performance considerations
Distributed Database system: data physically stored across
multiple sites, each typically managed by an independent DBMS;
distribution of data governed by factors such as local ownership,
increased availability in addition to performance issues
Architectures for Parallel Databases
basic idea: carry out evaluation steps in parallel whenever
possible, in order to improve performance
For simplicity, assume a centralized DBMS in which the
processing of operations is parallelized
three main architecture models:
- shared-memory: multiple CPUs attached to interconnection
network and can access common region of main memory
- shared-disk: each CPU has private memory and direct access
to all disks through interconnection network
- shared-nothing: each CPU has local main memory and disk space
but no two CPUs can access same storage area; all communication
between CPUs through network connection
Evaluation of Architectures
- shared memory:
- closer to conventional machine (many commercial systems ported to shared
memory platforms relatively easily)
- communication overheads are low because main memory used for this
purpose
- okay for moderate parallelism, but memory contention becomes
bottleneck as number of CPUs increases
- shared-disk: faces same problem as shared memory
- basic problem in shared disk and shared memory architectures is
interference: as more CPUs added, existing CPUs slowed down due
to increased contention for memory accesses and network bandwidth
- shared nothing: widely considered best architecture for large
parallel database systems
- requires more extensive reorganization of DBMS code
- but provides linear speedup: time taken for operations decreases
in proportion to increase in number of CPUs and disks
- provides linear scale-up: performance is sustained if the number of
CPUs and disks is increased in proportion to data size
- implies ever-more powerful databases can be built by taking advantage
of rapidly improving performance for single CPU systems and connecting
as many CPUs as possible
Parallel Query Evaluation
relational query execution plan is graph of relational algebra operators
and operators in graph can be executed in parallel
- if an operator consumes the output of a second operator we have
pipelined parallelism - the output of the second operator is worked on
by the first operator as soon as it is generated (see the generator
sketch after this list)
- pipelined parallelism limited by the presence of operators that
block, i.e., produce no output until they have consumed all their
inputs (sorting is one example)
- if two operators are independent they can be evaluated in parallel
- in addition,
can evaluate each individual operator in parallel
- key to evaluating an operator in parallel is to partition
the input data: work on each partition in parallel and then combine
results - this is called data-partitioned parallel evaluation
- with some care, existing code for sequential evaluation
can be ported easily to data-partitioned parallel evaluation
- another reason for success of shared-nothing architectures
is database query evaluation very amenable to data-partitioned
parallel evaluation
- goal is to minimize data shipping by partitioning data and structuring
algorithms to do most processing at individual processors (CPU plus disk)
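To make the two forms of parallelism concrete, here is a minimal
single-process sketch using Python generators as stand-in operators (the
operator names and data are invented for the example): select_op works on
each tuple as soon as scan_op yields it (pipelined), while sort_op blocks
until its input is exhausted.

    # Pipelined parallelism sketched with generators: a consumer pulls
    # tuples from its child operator as soon as they are produced.
    def scan_op(relation):
        for tup in relation:              # produce tuples one at a time
            yield tup

    def select_op(pred, child):
        for tup in child:                 # work on each tuple as soon as
            if pred(tup):                 # the child generates it
                yield tup

    def sort_op(key, child):
        tuples = list(child)              # blocking: consumes all input
        tuples.sort(key=key)              # before producing any output
        yield from tuples

    emps = [{"name": "b", "age": 35}, {"name": "a", "age": 20}]
    plan = sort_op(lambda t: t["age"],
                   select_op(lambda t: t["age"] >= 20, scan_op(emps)))
    print(list(plan))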
Data Partitioning
partitioning large dataset across several disks enables us
to exploit I/O bandwidth of disks by reading and writing in parallel
partitioning methods (all three are sketched in code after this list):
- round-robin partitioning: if n processors then
i-th tuple assigned to processor i mod n.
- suitable for efficiently evaluating queries that access entire relation
- not suitable if only a subset of tuples is needed (since the matching
tuples may be scattered across all disks, every disk must be searched),
e.g., age = 20
- hash-partitioning: hash function applied to (selected) fields
of tuple to determine processor.
- suitable if only subset of tuples needed that satisfy an equality
selection predicate (e.g., age = 20) - access only those disks that
contain the data
- not as efficient if selection is on a range of values (age between
10 and 20)
- hashing has the additional virtue of keeping data evenly distributed
even if data grows or shrinks over time
- range-partitioning: tuples are sorted (conceptually) and n ranges
chosen for sort key values so that each range contains roughly same
number of tuples; tuples in range i assigned to processor i.
- superior to hash for selections over range of values (age between
10 and 20)
- can lead to data skew: widely varying numbers of tuples
across partitions or disks
- skew causes processors dealing with large partitions to become
performance bottlenecks
- to reduce skew - how to choose the ranges by which tuples are distributed?
One approach: take samples from each processor, collect and sort the samples,
divide the sorted set into equally sized subsets. If tuples are partitioned
on age, the age ranges of the sampled subsets can be used as the basis for
redistributing the entire relation
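A sketch of the three partitioning methods as plain Python functions (a toy
single-process illustration; in a real system each partition would live on a
different processor's disk, and all names are invented for the example):

    import bisect

    def round_robin(tuples, n):
        parts = [[] for _ in range(n)]
        for i, t in enumerate(tuples):
            parts[i % n].append(t)             # i-th tuple to processor i mod n
        return parts

    def hash_partition(tuples, n, key):
        parts = [[] for _ in range(n)]
        for t in tuples:
            parts[hash(key(t)) % n].append(t)  # hash applied to selected field
        return parts

    def range_partition(tuples, splits, key):
        # splits is a sorted splitting vector, e.g. [20, 40] -> 3 ranges
        parts = [[] for _ in range(len(splits) + 1)]
        for t in tuples:
            parts[bisect.bisect_right(splits, key(t))].append(t)
        return parts

    emps = [{"age": a} for a in (15, 20, 20, 33, 41, 67)]
    print(range_partition(emps, [20, 40], lambda t: t["age"]))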
Parallel Sequential Operator Evaluation Code
- an elegant software architecture for parallel DBMS enables
us to readily parallelize existing code for sequentially evaluating
an operator
- basic idea is to use parallel data streams
- streams (from different disks or output of other operators) are
merged as needed to provide the inputs for the operator
- outputs are then split as needed to parallelize subsequent
processing
- parallel evaluation plan consists of dataflow network of relational,
merge, and split operators
- merge and split operators should be able to buffer some data
and should be able to halt the operators producing their input data;
thus, they can regulate the speed of execution according to the execution
speed of the operator that consumes their output (see the bounded-queue
sketch below)
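A minimal sketch of this regulation, assuming a bounded queue plays the role
of the merge operator's buffer (Python threads simulate the producing
operators; all names are illustrative): a full queue blocks put(), halting
the producers, so the consumer's speed governs the whole pipeline.

    import threading, queue

    SENTINEL = None                       # end-of-stream marker

    def producer(stream, out):
        for tup in stream:
            out.put(tup)                  # blocks when the buffer is full,
        out.put(SENTINEL)                 # halting this producing operator

    def merge(out, n_producers):
        remaining = n_producers           # drain until every producer is done
        while remaining:
            tup = out.get()
            if tup is SENTINEL:
                remaining -= 1
            else:
                yield tup

    buf = queue.Queue(maxsize=8)          # small buffer regulates producer speed
    streams = [range(0, 5), range(100, 105)]
    for s in streams:
        threading.Thread(target=producer, args=(s, buf)).start()
    print(sorted(merge(buf, len(streams))))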
Parallelizing Relational Operations
- assume each relation horizontally partitioned
across several disks in shared nothing architecture
- above scheme may or may not be best partitioning strategy
- evaluation of query must take into account initial partitioning
and repartition if necessary
- two simple operations:
- scanning: pages can be read in parallel while scanning a relation,
and retrieved tuples can be merged, if the relation is partitioned
across several disks;
idea also applies when retrieving all tuples that meet a selection condition;
if hashing or range partitioning is used, selection queries can be answered
by going to just those processors that contain relevant tuples (see the
sketch after this list)
- bulk loading: similar observation holds for this operation;
bulk loading a relation that has associated indexes can benefit even more
from parallelism because any sorting of data entries required for building
indexes can also be done in parallel
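A toy sketch of the selection case, assuming the relation was
hash-partitioned on age (data and names invented): an equality selection
such as age = 20 visits only the one partition the hash function
identifies, instead of scanning all disks.

    n = 4
    emps = [{"age": a} for a in (18, 20, 20, 33, 41)]
    parts = [[] for _ in range(n)]
    for t in emps:                        # hash-partition on the age field
        parts[hash(t["age"]) % n].append(t)

    def select_age_eq(parts, n, value):
        # go only to the processor whose partition can contain age == value
        return [t for t in parts[hash(value) % n] if t["age"] == value]

    print(select_age_eq(parts, n, 20))    # touches 1 of the 4 partitions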
Sorting
- simple idea: let each CPU sort part of relation on its local disk
and then merge sorted sets of tuples - degree of parallelism limited
by the merge phase
- better idea: first redistribute all tuples in relation using range
partitioning
- for example: to sort employee tuples by salary,
salary values ranging from 10 to 210 and we have 20 processors,
could send all tuples with salary values in range 10 to 20 to first
processor, in range 21 to 30 to second processor etc. (Prior to
redistribution we cannot assume they have been distributed according
to salary ranges)
- each processor sorts tuples assigned to it (using some sequential
algorithm)
- for example: a processor can collect tuples until its memory is full,
then sort these tuples and write out a run, until all incoming tuples
have been written to such sorted runs on the local disk. These
runs can then be merged to create the sorted version of the set of tuples
assigned to this processor
- entire sorted relation can be retrieved by visiting processors in
order corresponding to the ranges assigned to them and simply scanning
tuples
- basic challenge in parallel sorting is to do the range partitioning
so that each processor receives roughly equal number of tuples - otherwise
we have a performance bottleneck
- one good approach to range partitioning is to obtain a sample of
the entire relation by taking samples at each processor that contains
part of the relation. The small sample is sorted and used to identify
ranges with equal numbers of tuples; this set of range values,
called the splitting vector, is then distributed to all processors
and used to range partition the entire relation (sketched in code at
the end of this section)
- important application of parallel sorting is sorting the data entries
in tree structured indexes; sorting data entries can speed up the
process of bulk loading an index
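A compact sketch of the whole scheme - sampling, splitting vector,
redistribution, local sort - with processors simulated as in-memory lists
(sample size and names are arbitrary choices for the example):

    import bisect, random

    def splitting_vector(local_parts, n_procs, sample_size=10):
        sample = []
        for part in local_parts:              # take samples at each processor
            sample.extend(random.sample(part, min(sample_size, len(part))))
        sample.sort()                         # sort the small sample, then pick
        step = max(1, len(sample) // n_procs) # n-1 boundary values from it
        return [sample[i * step] for i in range(1, n_procs)]

    def parallel_sort(local_parts, n_procs):
        splits = splitting_vector(local_parts, n_procs)
        ranges = [[] for _ in range(n_procs)]
        for part in local_parts:              # redistribute by range
            for v in part:
                ranges[bisect.bisect_right(splits, v)].append(v)
        for r in ranges:                      # each processor sorts locally
            r.sort()
        return [v for r in ranges for v in r] # visit processors in range order

    data = [[random.randint(10, 210) for _ in range(50)] for _ in range(4)]
    assert parallel_sort(data, 4) == sorted(sum(data, []))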
Parallel Join Algorithms
consider join of two relations A and B on age attribute
- assume initially distributed across several disks (in some way
not useful for the join operation) - initial partitioning not based on
join attribute
- basic idea in joining A and B in parallel is to decompose join
into collection of k smaller joins
- decompose join by partitioning both A and B into k logical
partitions or buckets - union of k smaller joins computes join
- since A and B are distributed across processors, partitioning
itself can be done in parallel:
each processor sends tuples in i-th partition to processor i;
- at each processor all local tuples
are retrieved and hashed into one of k partitions (same
hash function at all sites) - leads to a hash join version
- alternatively, can partition A and B by dividing range of join attribute
into k disjoint subranges and placing A and B tuples into partitions according
to subrange to which their age values belong;
for example: with ten processors and age values from 0 to 100, assuming
a uniform distribution, A and B tuples with age 0 to 10 go to processor 1,
and so on; note: can lead to skew
- assign each partition to processor and carry out local join (using
any sequential join algorithm);
- after partitioning, each processor joins the A and B tuples assigned to it;
each join process executes sequential join code; a merge operator merges
all incoming A tuples, another merge operator merges the B tuples; depending
on how we want to distribute the result of the join, the output may have to
be split (a sketch of this partitioned join follows below)
- if range partitioning is used the output is available in sorted order
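A minimal single-process sketch of this decomposition using hash
partitioning into k buckets (names and data invented; in a real system each
bucket's local join would run on its own processor):

    def partitioned_join(A, B, k, key):
        buckets = [([], []) for _ in range(k)]
        for a in A:
            buckets[hash(key(a)) % k][0].append(a)   # same hash function is
        for b in B:                                  # used for both relations
            buckets[hash(key(b)) % k][1].append(b)
        result = []                                  # union of the k smaller
        for a_part, b_part in buckets:               # joins computes the join
            for a in a_part:
                for b in b_part:
                    if key(a) == key(b):
                        result.append((a, b))
        return result

    A = [("alice", 20), ("bob", 35)]
    B = [("cs101", 20), ("cs102", 20), ("cs201", 35)]
    print(partitioned_join(A, B, k=4, key=lambda t: t[1]))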
Parallel Hash Join
- if relations A and B are very large and the number of partitions k
is chosen to be equal to the number of processors, the size of each
partition is still large - therefore high cost for each local join
- hash-based refinement for improved performance
- execute the smaller joins one after the other, but each in parallel -
this allows utilization of the total available main memory at all processors
- at each site apply hash function h1 to partition the A and B tuples into
partitions i = 1, 2, ..., k;
let A be the smaller relation; the number of partitions k is chosen such that
each partition of A fits into the aggregate main memory of all n processors
- For i = 1, ..., k, process the join of the i-th partitions of A and B;
to compute this join, do at every site:
- apply second hash function h2 to all Ai tuples to determine where
they should be joined and send tuple t to site h2(t)
- as Ai tuples arrive to be joined, add them to an in-memory hash table
- after all Ai tuples have been distributed, apply h2 to each Bi tuple t and
send it to site h2(t)
- as Bi tuples arrive to be joined, probe in-memory hash table of Ai
tuples and output results
- use of second hash function h2 ensures tuples are uniformly distributed
across all n processors
- this approach greatly reduces the cost of each of the smaller joins
and thus the cost of the overall join (see the sketch below)
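A sketch of the refinement with sites simulated as in-memory structures;
h1 and h2 here are built from Python's hash purely for illustration:

    def parallel_hash_join(A, B, k, n, key):
        h1 = lambda v: hash(("h1", v)) % k       # k partitions -> k rounds
        h2 = lambda v: hash(("h2", v)) % n       # spreads a round over n sites
        A_parts = [[] for _ in range(k)]
        B_parts = [[] for _ in range(k)]
        for a in A:
            A_parts[h1(key(a))].append(a)        # partition both relations
        for b in B:                              # by h1
            B_parts[h1(key(b))].append(b)
        result = []
        for i in range(k):                       # rounds run one after the other
            tables = [{} for _ in range(n)]      # in-memory hash table per site
            for a in A_parts[i]:                 # distribute Ai tuples by h2 and
                tables[h2(key(a))].setdefault(key(a), []).append(a)
            for b in B_parts[i]:                 # Bi tuples probe the table at
                for a in tables[h2(key(b))].get(key(b), []):
                    result.append((a, b))        # site h2(t)
        return result

    A = [("alice", 20), ("bob", 35)]
    B = [("cs101", 20), ("cs201", 35), ("cs301", 50)]
    print(parallel_hash_join(A, B, k=3, n=4, key=lambda t: t[1]))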