Parallel and Distributed Databases: Why?
- Performance: using several resources (CPU, disks) in parallel
significantly improves performance
- Availability: if site containing relation goes down, relation continues
to be available if a copy is maintained at another site
- Distributed access to data: although analysts may need to access
data stored at different sites, access patterns usually exhibit locality;
locality can be exploited by distributing data accordingly
- Analysis of distributed data: organizations/applications increasingly
want to examine all data available to them even if stored
across multiple sites and multiple database systems
Parallel Database system: seeks to improve performance via
parallel implementation of operations; any distribution of data governed
solely by performance considerations
Distributed Database system: data physically stored across
multiple sites, each typically managed by an independent DBMS;
distribution of data governed by factors such as local ownership,
increased availability in addition to performance issues
Architectures for Parallel Databases
basic idea: carry out evaluation steps in parallel whenever
possible, in order to improve performance
For simplicity, assume a centralized DBMS in which the
processing of operations is parallelized
three main architecture models:
- shared-memory: multiple CPUs attached to interconnection
network and can access common region of main memory
- shared-disk: each CPU has private memory and direct access
to all disks through interconnection network
- shared-nothing: each CPU has local main memory and disk space
but no two CPUs can access same storage area; all communication
between CPUs through network connection
Evaluation of Architectures
- shared memory:
- closer to conventional machine (many commercial systems ported to shared
memory platforms relatively easily)
- communication overheads are low because main memory used for this
purpose
- okay for moderate parallelism, but memory contention becomes
bottleneck as number of CPUs increases
- shared-disk: faces same problem as shared memory
- basic problem in shared disk and shared memory architectures is
interference: as more CPUs added, existing CPUs slowed down due
to increased contention for memory accesses and network bandwidth
- shared nothing: widely considered best architecture for large
parallel database systems
- requires more extensive reorganization of DBMS code
- but provides linear speedup: time taken for operations decreases
in proportion to increase in number of CPUs and disks
- provides linear scale-up: performance is sustained if the number of
CPUs and disks is increased in proportion to data size
- implies ever-more powerful databases can be built by taking advantage
of rapidly improving performance for single CPU systems and connecting
as many CPUs as possible
Parallel Query Evaluation
relational query execution plan is graph of relational algebra operators
and operators in graph can be executed in parallel
- if an operator consumes the output of a second operator we have
pipelined parallelism - the output of the second operator is worked on
by the first operator as soon as it is generated (see the generator
sketch after this list)
- pipelined parallelism limited by the presence of operators that
block, i.e., produce no output until they have consumed all their
inputs (sorting is one example)
- if two operators are independent they can be evaluated in parallel
- in addition,
can evaluate each individual operator in parallel
- key to evaluating an operator in parallel is to partition
the input data: work on each partition in parallel and then combine
results - this is called data-partitioned parallel evaluation
- with some care, existing code for sequential evaluation
can be ported easily to data-partitioned parallel evaluation
- another reason for success of shared-nothing architectures
is database query evaluation very amenable to data-partitioned
parallel evaluation
- goal is to minimize data shipping by partitioning data and structuring
algorithms to do most processing at individual processors (CPU plus disk)
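To make the two forms of parallelism concrete, here is a minimal
single-process sketch using Python generators as stand-in operators (the
operator names and data are invented for the example): select_op works on
each tuple as soon as scan_op yields it (pipelined), while sort_op blocks
until its input is exhausted.

    # Pipelined parallelism sketched with generators: a consumer pulls
    # tuples from its child operator as soon as they are produced.
    def scan_op(relation):
        for tup in relation:              # produce tuples one at a time
            yield tup

    def select_op(pred, child):
        for tup in child:                 # work on each tuple as soon as
            if pred(tup):                 # the child generates it
                yield tup

    def sort_op(key, child):
        tuples = list(child)              # blocking: consumes all input
        tuples.sort(key=key)              # before producing any output
        yield from tuples

    emps = [{"name": "b", "age": 35}, {"name": "a", "age": 20}]
    plan = sort_op(lambda t: t["age"],
                   select_op(lambda t: t["age"] >= 20, scan_op(emps)))
    print(list(plan))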
Data Partitioning
partitioning large dataset across several disks enables us
to exploit I/O bandwidth of disks by reading and writing in parallel
partitioning methods (all three are sketched in code after this list):
- round-robin partitioning: if n processors then
i-th tuple assigned to processor i mod n.
- suitable for efficiently evaluating queries that access entire relation
- not suitable if only a subset of tuples is needed (since the matching
tuples may be scattered across all disks, every disk must be searched),
e.g., age = 20
- hash-partitioning: hash function applied to (selected) fields
of tuple to determine processor.
- suitable if only subset of tuples needed that satisfy an equality
selection predicate (e.g., age = 20) - access only those disks that
contain the data
- not as efficient if selection is on a range of values (age between
10 and 20)
- hashing has the additional virtue of keeping data evenly distributed
even if data grows or shrinks over time
- range-partitioning: tuples are sorted (conceptually) and n ranges
chosen for sort key values so that each range contains roughly same
number of tuples; tuples in range i assigned to processor i.
- superior to hash for selections over range of values (age between
10 and 20)
- can lead to data skew: widely varying numbers of tuples
across partitions or disks
- skew causes processors dealing with large partitions to become
performance bottlenecks
- to reduce skew - how to choose the ranges by which tuples are distributed?
One approach: take samples from each processor, collect and sort the samples,
divide the sorted set into equally sized subsets. If tuples are partitioned
on age, the age ranges of the sampled subsets can be used as the basis for
redistributing the entire relation
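A sketch of the three partitioning methods as plain Python functions (a toy
single-process illustration; in a real system each partition would live on a
different processor's disk, and all names are invented for the example):

    import bisect

    def round_robin(tuples, n):
        parts = [[] for _ in range(n)]
        for i, t in enumerate(tuples):
            parts[i % n].append(t)             # i-th tuple to processor i mod n
        return parts

    def hash_partition(tuples, n, key):
        parts = [[] for _ in range(n)]
        for t in tuples:
            parts[hash(key(t)) % n].append(t)  # hash applied to selected field
        return parts

    def range_partition(tuples, splits, key):
        # splits is a sorted splitting vector, e.g. [20, 40] -> 3 ranges
        parts = [[] for _ in range(len(splits) + 1)]
        for t in tuples:
            parts[bisect.bisect_right(splits, key(t))].append(t)
        return parts

    emps = [{"age": a} for a in (15, 20, 20, 33, 41, 67)]
    print(range_partition(emps, [20, 40], lambda t: t["age"]))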
Parallel Sequential Operator Evaluation Code
- an elegant software architecture for parallel DBMS enables
us to readily parallelize existing code for sequentially evaluating
an operator
- basic idea is to use parallel data streams
- streams (from different disks or output of other operators) are
merged as needed to provide the inputs for the operator
- outputs are then split as needed to parallelize subsequent
processing
- parallel evaluation plan consists of dataflow network of relational,
merge, and split operators
- merge and split operators should be able to buffer some data
and should be able to halt the operators producing their input data;
thus, they can regulate the speed of execution according to the execution
speed of the operator that consumes their output (see the bounded-queue
sketch below)
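A minimal sketch of this regulation, assuming a bounded queue plays the role
of the merge operator's buffer (Python threads simulate the producing
operators; all names are illustrative): a full queue blocks put(), halting
the producers, so the consumer's speed governs the whole pipeline.

    import threading, queue

    SENTINEL = None                       # end-of-stream marker

    def producer(stream, out):
        for tup in stream:
            out.put(tup)                  # blocks when the buffer is full,
        out.put(SENTINEL)                 # halting this producing operator

    def merge(out, n_producers):
        remaining = n_producers           # drain until every producer is done
        while remaining:
            tup = out.get()
            if tup is SENTINEL:
                remaining -= 1
            else:
                yield tup

    buf = queue.Queue(maxsize=8)          # small buffer regulates producer speed
    streams = [range(0, 5), range(100, 105)]
    for s in streams:
        threading.Thread(target=producer, args=(s, buf)).start()
    print(sorted(merge(buf, len(streams))))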
Parallelizing Relational Operations
- assume each relation horizontally partitioned
across several disks in shared nothing architecture
- above scheme may or may not be best partitioning strategy
- evaluation of query must take into account initial partitioning
and repartition if necessary
- two simple operations:
- scanning: pages can be read in parallel while scanning a relation,
and retrieved tuples can be merged, if the relation is partitioned
across several disks;
idea also applies when retrieving all tuples that meet a selection condition;
if hashing or range partitioning is used, selection queries can be answered
by going to just those processors that contain relevant tuples (see the
sketch after this list)
- bulk loading: similar observation holds for this operation;
bulk loading a relation that has associated indexes can benefit even more
from parallelism because any sorting of data entries required for building
indexes can also be done in parallel
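A toy sketch of the selection case, assuming the relation was
hash-partitioned on age (data and names invented): an equality selection
such as age = 20 visits only the one partition the hash function
identifies, instead of scanning all disks.

    n = 4
    emps = [{"age": a} for a in (18, 20, 20, 33, 41)]
    parts = [[] for _ in range(n)]
    for t in emps:                        # hash-partition on the age field
        parts[hash(t["age"]) % n].append(t)

    def select_age_eq(parts, n, value):
        # go only to the processor whose partition can contain age == value
        return [t for t in parts[hash(value) % n] if t["age"] == value]

    print(select_age_eq(parts, n, 20))    # touches 1 of the 4 partitions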
Sorting
- simple idea: let each CPU sort part of relation on its local disk
and then merge sorted sets of tuples - degree of parallelism limited
by the merge phase
- better idea: first redistribute all tuples in relation using range
partitioning
- for example: to sort employee tuples by salary,
salary values ranging from 10 to 210 and we have 20 processors,
could send all tuples with salary values in range 10 to 20 to first
processor, in range 21 to 30 to second processor etc. (Prior to
redistribution we cannot assume they have been distributed according
to salary ranges)
- each processor sorts tuples assigned to it (using some sequential
algorithm)
- for example: a processor can collect tuples until its memory is full,
then sort these tuples and write out a run, until all incoming tuples
have been written to such sorted runs on the local disk. These
runs can then be merged to create the sorted version of the set of tuples
assigned to this processor
- entire sorted relation can be retrieved by visiting processors in
order corresponding to the ranges assigned to them and simply scanning
tuples
- basic challenge in parallel sorting is to do the range partitioning
so that each processor receives roughly equal number of tuples - otherwise
we have a performance bottleneck
- one good approach to range partitioning is to obtain a sample of
the entire relation by taking samples at each processor that contains
part of the relation. The small sample is sorted and used to identify
ranges with equal numbers of tuples; this set of range values,
called the splitting vector, is then distributed to all processors
and used to range partition the entire relation (sketched in code at
the end of this section)
- important application of parallel sorting is sorting the data entries
in tree structured indexes; sorting data entries can speed up the
process of bulk loading an index
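A compact sketch of the whole scheme - sampling, splitting vector,
redistribution, local sort - with processors simulated as in-memory lists
(sample size and names are arbitrary choices for the example):

    import bisect, random

    def splitting_vector(local_parts, n_procs, sample_size=10):
        sample = []
        for part in local_parts:              # take samples at each processor
            sample.extend(random.sample(part, min(sample_size, len(part))))
        sample.sort()                         # sort the small sample, then pick
        step = max(1, len(sample) // n_procs) # n-1 boundary values from it
        return [sample[i * step] for i in range(1, n_procs)]

    def parallel_sort(local_parts, n_procs):
        splits = splitting_vector(local_parts, n_procs)
        ranges = [[] for _ in range(n_procs)]
        for part in local_parts:              # redistribute by range
            for v in part:
                ranges[bisect.bisect_right(splits, v)].append(v)
        for r in ranges:                      # each processor sorts locally
            r.sort()
        return [v for r in ranges for v in r] # visit processors in range order

    data = [[random.randint(10, 210) for _ in range(50)] for _ in range(4)]
    assert parallel_sort(data, 4) == sorted(sum(data, []))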
Parallel Join Algorithms
consider join of two relations A and B on age attribute
- assume initially distributed across several disks (in some way
not useful for the join operation) - initial partitioning not based on
join attribute
- basic idea in joining A and B in parallel is to decompose join
into collection of k smaller joins
- decompose join by partitioning both A and B into k logical
partitions or buckets - union of k smaller joins computes join
- since A and B are distributed across processors, partitioning
itself can be done in parallel:
each processor sends tuples in i-th partition to processor i;
- at each processor all local tuples
are retrieved and hashed into one of k partitions (same
hash function at all sites) - leads to a hash join version
- alternatively, can partition A and B by dividing range of join attribute
into k disjoint subranges and placing A and B tuples into partitions according
to subrange to which their age values belong;
for example: with ten processors and age values from 0 to 100, assuming
a uniform distribution, A and B tuples with age 0 to 10 go to processor 1,
and so on; note: can lead to skew
- assign each partition to processor and carry out local join (using
any sequential join algorithm);
- after partitioning, each processor joins the A and B tuples assigned to it;
each join process executes sequential join code; a merge operator merges
all incoming A tuples, another merge operator merges the B tuples; depending
on how we want to distribute the result of the join, the output may have to
be split (a sketch of this partitioned join follows below)
- if range partitioning is used the output is available in sorted order
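A minimal single-process sketch of this decomposition using hash
partitioning into k buckets (names and data invented; in a real system each
bucket's local join would run on its own processor):

    def partitioned_join(A, B, k, key):
        buckets = [([], []) for _ in range(k)]
        for a in A:
            buckets[hash(key(a)) % k][0].append(a)   # same hash function is
        for b in B:                                  # used for both relations
            buckets[hash(key(b)) % k][1].append(b)
        result = []                                  # union of the k smaller
        for a_part, b_part in buckets:               # joins computes the join
            for a in a_part:
                for b in b_part:
                    if key(a) == key(b):
                        result.append((a, b))
        return result

    A = [("alice", 20), ("bob", 35)]
    B = [("cs101", 20), ("cs102", 20), ("cs201", 35)]
    print(partitioned_join(A, B, k=4, key=lambda t: t[1]))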
Parallel Hash Join
- if relations A and B are very large and the number of partitions k
is chosen to be equal to the number of processors, the size of each
partition is still large - therefore high cost for each local join
- hash-based refinement for improved performance
- execute the smaller joins one after the other, but each in parallel -
this allows utilization of the total available main memory at all processors
- at each site apply hash function h1 to partition the A and B tuples into
partitions i = 1, 2, ..., k;
let A be the smaller relation; the number of partitions k is chosen such that
each partition of A fits into the aggregate main memory of all n processors
- For i = 1, ..., k, process the join of the i-th partitions of A and B;
to compute this join, do at every site:
- apply second hash function h2 to all Ai tuples to determine where
they should be joined and send tuple t to site h2(t)
- as Ai tuples arrive to be joined, add them to an in-memory hash table
- after all Ai tuples have been distributed, apply h2 to each Bi tuple t and
send it to site h2(t)
- as Bi tuples arrive to be joined, probe in-memory hash table of Ai
tuples and output results
- use of second hash function h2 ensures tuples are uniformly distributed
across all n processors
- this approach greatly reduces the cost of each of the smaller joins
and thus the cost of the overall join (see the sketch below)
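A sketch of the refinement with sites simulated as in-memory structures;
h1 and h2 here are built from Python's hash purely for illustration:

    def parallel_hash_join(A, B, k, n, key):
        h1 = lambda v: hash(("h1", v)) % k       # k partitions -> k rounds
        h2 = lambda v: hash(("h2", v)) % n       # spreads a round over n sites
        A_parts = [[] for _ in range(k)]
        B_parts = [[] for _ in range(k)]
        for a in A:
            A_parts[h1(key(a))].append(a)        # partition both relations
        for b in B:                              # by h1
            B_parts[h1(key(b))].append(b)
        result = []
        for i in range(k):                       # rounds run one after the other
            tables = [{} for _ in range(n)]      # in-memory hash table per site
            for a in A_parts[i]:                 # distribute Ai tuples by h2 and
                tables[h2(key(a))].setdefault(key(a), []).append(a)
            for b in B_parts[i]:                 # Bi tuples probe the table at
                for a in tables[h2(key(b))].get(key(b), []):
                    result.append((a, b))        # site h2(t)
        return result

    A = [("alice", 20), ("bob", 35)]
    B = [("cs101", 20), ("cs201", 35), ("cs301", 50)]
    print(parallel_hash_join(A, B, k=3, n=4, key=lambda t: t[1]))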