Database Overview
What is a database?
- A database is a collection of interrelated data.
For what purpose?
- A Database Management System (DBMS) is a collection of
interrelated data and a set of programs to create, access and manipulate the data
- Large data size
- Persistence of data: data still exists after programs complete execution
- Variety of interrelated data
Typical application
- University registrar keeps information about students, faculty, courses offered, classrooms etc...
- For each student: ID, name, major, grades
- For each course: courseID, course name, credits, department
- For each section: courseID, course name, instructor
- For each faculty: name, department, courses, office, telephone
- What does the registrar do with the data?
- Answer queries: "What grade did Smith get in CS177?", "Get all courses taught by Jones"
- Insert new info: "Smith got B+ in CS177", "Jones teaches CS123"
- Delete some info: "Remove Smith and related data"
Features of a database system
- Control data redundancy and inconsistency
- Unnecessary duplication of data can cause problems
- Space wastage: putting a student's postal address with every occurence of student's name
- Update cost: if an address changes, we have to find every occurence of old address and update it
- Some duplication is necessary.
- Data must be consistent at all times: what happens if update crashes before completion?
- Efficient data access
- Don't scan whole dbase to answer a query.
- For large amounts of data, searching can be time-consuming
- Access efficiency measured differently in DBMS:
=> Goal: minimize disk access time.
- Concurrent access
- Example: Registrar's office has many employees that need to access database simultaneously
- Need to coordinate actions of different users carefully
- User 1: Delete "Smith took CS177"
- User 2: "List Smith's classes"
- Security
- Not every user should be able to access all the data (students able to add/delete grades)
- Control of access to data is needed without having to duplicate data
- Integrity constraints
- Grades need to be in the set {A+,...,F}
- GPA's cannot be negative
- A student cannot take a non existent course
- DBMS should enforce integrity constraints
- Backup and recovery
- Data must be systematically backed up
- Recovery in system crashes
- Inserts and updates cannot occur partially (either complete it or don't do it at all)
- Metadata
- Data about the data (data types, interrelationships, integrity constraints etc).
- Directory and catalog information.
- Data and program independence
- Data should be stored in a program independent fashion
- Changes in program should not affect access to data
- Data reentry should not be needed
- User and programmer interface
- Query language
- GUI
- Programmer API
- Features that allow data to be reorganized for efficient access
What is a data model?
- Data Model: An agreed upon method for abstractly describing the logical organization of data
- Analogy: architectural blueprints
- Blueprints are standardized: any builder can work on a plan drawn
by any architect.
=> conventional symbols for windows doors plumbing etc.
- However, blueprint standards may differ across countries.
- A Data Model is like a blueprint standard:
- An agreed upon way to describe how a dbase is to be organized.
- Permits actual implementation to be independent of this description.
=> blueprint does not specify building materials (cement or brick).
- A data model can be implemented in various ways.
- Example: sets
- Sets are described using notation like
A = {17, 234, 88}
B = {45, 88, 129, 564}
- Operations are defined on sets (union, intersection etc).
- However, sets can be implemented in many ways:
- Arrays
- Linked lists
- Various algorithms for union, intersection etc
- Summary: A Data Model describes the logical organization of data along
with operations that manipulate the data
Databases: Then and Now
- Ancient data storage: stone tablets, scrolls, paper, log books etc
- Early days of computing: storage on cards, paper tape
- 1950's:
- First commercial uses of computers
- Census database
- Customer database for IBM
- 1960's:
- IBM and others develop first few DBMS
- Based on files, hierarchical and network data models
- Nascent formalization of databases
- 1970's:
- Relational database model (E.F.Codd)
- Significant research in sorting, searching, physical implementation
- Query languages: SQL, QUEL, QBE
- Several popular databases: System R (IBM),Ingres (Berkeley)
- Transaction processing
- 1980's:
- Databases for PC's
- Many additional features (GUIs, additional power)
- Research on distributed databases, OO databases
- Today's database giants: Oracle, Sybase, Informix
- Preliminary research on non-traditional databases:
=> geographic and spatial systems, image databases, scientific databases
- 1990's:
- Geographic Information Systems: GIS, spatial databases
=> maps, terrain data
- Image databases
=> Images, video
- Web interfaces
- Financial, accounting systems
- Scientific databases, bioinformatics
People associated with DBMS
- Database administrator
- Database application programmer
- End user
- DBMS system programmer
- Database researcher
Relation Data Model
- Recall: a data model is a convention for describing the logical
organization of data
- Other data models: hierarchical and network models
- Both the hierarchical and network models need a detailed understanding
of the structure of the data to answer queries
- Goals of relational model
- To describe a logical structure simple enough to
minimize structural or navigational information.
- To create a mathematically precise framework for the logical representation of data.
Definitions
- Consider this airline database example:
- The data in the DBMS is a collection of tables.
- Each table has named columns and data in the rows.
- Another word for table: relation.
- Other names for a row: a record or a tuple.
Example: "Alice, 555-55-5555, F101"
- Other names for columns: fields, attributes.
What is a key?
- A key is a group of columns in a table such that: each row has a
unique value in those columns
Example: SSN in the PASSENGER table.
- Keys are useful in searching and removing duplication.
- Primary key: opt for one key among several possible keys.
- Foreign key: a column in one table that is a primary key of
another table.
Example: FLIGHT# in PASSENGER table.
Constraints
- Domain constraints: proper typing of values (remember the example where GPA cannot be negative).
- Key constraint 1: Every relation (or table) should have a key
- Key constraint 2: Primary keys can't have null (empty) values
- Foreign key constraint: A table shouldn't have a
foreign key value that doesn't exist in the foreign relation
Operations on tables
- Update
- Insert a record in the table
Example: Insert "Mary, 999-99-9999, F123" in the first table
- Delete a record in the first table
Example: Delete all rows with "NAME=John"
- Modify an existing record in the table
Example: Modify "Alice, 555-55-5555, F313" only changes the
flight number field for Alice's record
- Retrieval
- Selection operator
- Applies to a single table and results in a new table
- Need to specify row-selection criteria
- Results in a new table
- Can specify condition as a boolean expression
Example: "Select all records from first table
where NAME=Bob" returns
- Projection operator
- Applies to a single table
- Need to specify attributes list
- Results in a new table containing only the attributes specified
Example: "Project NAME and SSN from PASSENGER
table"
- Union operator
- Applies to two tables that are union-compatible
- Results in a new table containing rows from both tables
(with duplicates removed)
- Intersection and difference operators
- Applies to two tables that are union-compatible
- Intersection: rows common to both tables
- Difference: rows in first table that are not in the second
- Join operator
- Applies to two tables
- Resulting table contains rows that are "joined" from one
row each of the two tables.
- Two rows are "joined" based on the join condition.
Example: Join PASSENGER and FLIGHT using FLIGHT#
Exercise 1:
What is the result of the query
"List the names of all the customers arriving at LAX airport".
Show how to combine join and projection to get the result.
SQL: A Language for Relational Databases
- Formal query languagues: Relational algebra and relational calculus
- Real world query languages: SQL, QUEL, QBE (among others)
- Most commonly used: SQL (pronounced either "see-kwell" or "ess-cue-ell"
- SQL: originally called SEQUEL (Structured English QUEry Language), 1974-1976, IBM
- ANSI SQL standard, 1986
- SQL2 standard, 1992 - also called SQL-92 (current work on SQL3)
- SQL:
- Lets you create and delete tables
- Lets you specify queries on existing tables
- Lets you specify domain, key and foreign constraints
- Has support for security, transaction management and remote access
Basic SQL Query
- What is an SQL "program"?
- SQL code can be typed interactively into an interpreter
- SQL code can reside in text files and be compiled (s most high level languages are)
- SQL statements can be input from within other programming languages
- Basic form of an SQL query statement:
select <attribute list>
from <relation list>
[where <condition>];
- NOTE:
- The SQL keywords above are: select, from, where
- select and from clauses are required
=> where clause is optional
- There are other optional clauses (such as group by)
- Example:
select NAME, SSN
from PASSENGER
where PASSENGER.FLIGHT# = 'F313';
- This expresses the query:
"Find the names and SSN's of all passengers whose flight number is F313"
- The result:
- The select in SQL corresponds to the project operator
- The where clause corresponds to the select operator
- The from clause specifies the tables which the query applies to.
- Another example:
- Query: "List the names and SSN's of all the passengers flying in flight F123"
- This is how a join is done in SQL:
select NAME, SSN
from PASSENGER, FLIGHT
where PASSENGER.FLIGHT# = FLIGHT.FLIGHT#
and PASSENGER.FLIGHT# = 'F123';
- How to "read" an SQL statement:
- Think of computing the cross-product of the relations in the from clause
- Then, apply the condition in the where clause to each tuple in
the cross-product to select tuples
- Finally, project the attributes in the select clause to get the result relation
- Observe: the join condition is explicitly specified in the where clause
- NOTE: in an actual database, query optimization will try to prevent
expensive cross product computations
Bioinformatics Computing
The narrow view of bioinformatics:
- Sequence comparisons (alignment) and searches.
- Phylogenetics
Computing skills required for the narrow view:
- Basic skills: programming, data structures, algorithms
- Probability (Basic probability, Markov chains).
- Discrete Optimization: combinatorial problems, dynamic
programming, string searching
- Specialized algorithms: clustering, discrimination.
The broader view of bioinformatics:
- Sequence comparisons (alignment) and searches.
- Phylogenetics
- Protein folding computations
- Graphics and visualization
- Drug design
- Chemical pathway process simulation
Additional computing skills for the broader view:
- Systems: databases, distributed computing, parallel computing
- Graphics/User Interfaces
- Scientific computing