SPJU - Select, Project, Join, Union queries
ID - Inclusion dependency
FD - Functional dependency
MVD - Multivalued Dependency

Overview

Typical DBMS functionality

Define a particular database in terms of its data types, structures and constraints
Construct or Load the initial database contents on a secondary storage device
Manipulating the database
- Retrieval: QUerying, generating reports
- Modification: Insertions, deletions and updates
- Accessing the database through Web applications
Processing and Sharing by a set of concurrent users and apps - yet, keeping data valid and consistent
Protection and security
Maintaining the db over the lifetime of the db application

Why use DBMS

Self-describing nature of database system (description called meta-data - stored in catalog itself)
Data independence - don't need to know the implementation of the database to access the data
Application insulated from how data is structured and stored
- change the order of tuples
- add or modify columns
- add or modify indexes
Query does not change when physical structure changes
Efficient access - queries are optimized
Reduced application development time - declarative expression, don't need to indicate how to execute query
Data integrity and security - some constraints enforced automatically
Concurrent access, recovery from crashes - users behave as if using a single-user system
Speed - even with huge data

Data Models

Schema - description of a particular collection of data, using a given data model
Relation model of data - a table with rows and columns, has a schema describing columns or fields

Database Users

End users
Database administrators
database developers

Relation Model

-> All SPJU and SPJ queries are safe from infinite negation, because everything in output must have occurred in input (no elements are created)

Table Name: [List of attributes]
Unique attributes within a table and unique names of tables

Relational Calculus - Rule-based queries

existential quantifiers ∃
universal quantiﬁcation 'for all' ∀ - ∀x F(x) = ¬∃x ¬F(x)
conjunction ∧
disjunction ∨ - F ∨ G = ¬(¬F ∧ ¬G)
Bound Variable - occurring in ∃ or ∀
Free variable - not bound
Queries without free variables are called Boolean queries
Safe Relational Calculus Query - if it always returns a finite result (e.g. Boolean query or SPJ/SPJU query)

SQL

SELECT
FROM Relation Variable
WHERE
AND, OR
AS (e.g. SELECT Mother AS Parent)
UNION
INTERSECT
EXCEPT
IN, NOT IN, EXISTS

Relational Algebra

Projection π
Selection σ
Cartesian Product ×
Natural Join ⋈
Renaming ρ
Union ∪
Intersection ∩
Difference -

Properties of relational algebra operators
- size(π(R)) ≤ size(R)
- 0 ≤ size(σ(R)) ≤ size(R)
- size(R × S) = size(R) × size(S)
- 0 ≤ size(R ⋈ S) ≤ size(R) × size(S)

Disjunctions OR

WHERE A.x="asd" OR A.x="qwe"
SELECT.. FROM.. WHERE
UNION
SELECT.. FROM.. WHERE
answer(y) :- A(x,y,z), x="asd"
answer(y) :- A(x,y,z), x="qwe"
Relational Algebra - ∪

Calclulus to Algebra translation

Active domain of relation: the set of all constants that occur in it: ADOM(R1,...,Rn) = ADOM(R1) ∪...∪ ADOM(Rn)
A safe query over relations R1,...,Rn cannot produce an element outside ADOM(R1,...,Rn)
∀ in calculus could be translated to M - (A×B - A⋈B) with appropriate projections

Functional Dependencies

A relation R satisfies a functional dependency A->B where A and B are attributes, if for every two tuples t1 and t2 in R:

__π__A(t1) = __π__A(t2) implies __π__B(t1) = __π__B(t2)

Let K be a set of attributes of R and U the set of all attributes of R. Then K is a key if R satisfies functional dependency K->U

__π__K(t1) = __π__K(t2) implies t1 = t2

Inclusion Dependencies - foreign keys

E.g.
Employee(EmplId, Name, Dept, Salary)
ReportsTo(Empl1, Empl2)
Expect both Empl1 and Empl2 to be found in Employee, hence:
ReportsTo[Empl1] ⊆ Employee[EmplId]
ReportsTo[Empl2] ⊆ Employee[EmplId]

SQL

Create table

CREATE TABLE Name (attribute1 type, ..., attributeN type)

Types

char(n)
varchar(n)
bit(n) - fixed length bit string of exactly n bits
bit varying(n) - variable length bit string of up to n bits
int - signed 4 bytes
smallint - signed 2 bytes
real
float(s)
date - e.g. DATA '2013-04-22'
time - e.g. TIME '17:15:00'
timestamp - e.g. TIMESTAMP '2013-04-22 17:15:00'
character, etc.

Populating tables

INSERT INTO Name VALUES (...)

Dropping tables

DROP TABLE Name

Changing tables

ALTER TABLE Name ADD COLUMN newColumnName type
ALTER TABLE Name DROP COLUMN columnName

Default values

CREATE TABLE Name (field type DEFAULT value)
e.g. CREATE TABLE Fck (U INT DEFAULT 0, B INT)

Fixed and variable length

field CHAR(10)	field VARCHAR(10)
INSERT ... "xxx"	INSERT ... "xxx"
LENGTH(field) = 10	LENGTH(field) = 3

Constraints on fields

not null
primary key
field int not null primary key
OR
field int not null
primary key (field)
unique (field1, field2)
foreign key (field1, field2) references OtherTable
foreign key (field1, field2) references OtherTable on delete cascade
foreign key (field1, field2) references OtherTable on delete set null

SELECT

SELCET DISTINCT
SUM, COUNT - still requires DISTINCT
AVG - average, keep duplicates
SELECT Director, AVG(CAST (Length AS REAL)) AS Avgl
FROM Movies
GROUP BY Director
MIN, MAX
HAVING, e.g.:
SELECT Director, AVG(Length+0.0)
FROM Movies
GROUP BY Director
HAVING MAX(Length) > 120
< or > ANY or ALL
ORDER BY field (, field,..) (DESC)

WHERE

UNION, INTERSECT and EXCEPT eliminate duplicates
UNION ALL, INTERSECT ALL and EXCEPT ALL to keep duplicates
LIKE 'string%' or LIKE 'string_'
_ - stands for any letter
% - stands for any substring, including empty
Note: LIKE 'char_' may return empty because of trailing spaces - use VARCHAR or %

UPDATE

UPDATE table
SET field = value
WHERE condition

NATURAL JOIN

Generally: R JOINT S ON c
preserve all tuples and fill with nulls
- Relation1 NATURAL LEFT OUTER JOIN Relation2 - keep non-matching tuples from R1
- Relation1 NATURAL RIGHT OUTER JOIN Relation2 - keep non-matching tuples from R2
- Relation1 NATURAL FULL OUTER JOIN Relation2 - keep all

VIEWS

CREATE VIEW Name (field1, field2, ...) AS [query]
DROP VIEW Name

WITH

Same as view, but temporary

Design

Entities

Entity - an object that is involved in the enterprise
Entity type - set of similar objects (e.g. students, courses, professors)
Attribute - describes on aspect of an entity type (e.g. name, address, Id)
- Domain - possible values (e.g. {Informatics, EE, Joint})
- Key - minimum set of attributes that uniquely identifies an entity
- Entity Schema - Type name, attributes, key

E-R diagrams

IsA - Attributes of supertype apply to subtype

Can create more concise and readable E-R diagram
common attributes need not be repeated
attributes of sibling types can be different

Participation

Constraints and Design

FDs - X -> Y
Keys - X -> U, where U is the set of all attributes in a relation
Candidate keys - a minimal key X, no subset of X is key
Inclusion dependency - R[A1, .., An] ⊆ S[B1, .., Bn] (unary if n=1)
Foreign key - R[A1, .., An] ⊆ S[B1, .., Bn] and {B1, .., Bn} is a key

Good Properties of DB schemas

no update anomalies
no redundancies
no information loss

What causes update anomalies

Functional dependencies X -> Y where X is not a key

BCNF

A relation is in BCNF if for every nontrivial FD X -> Y, X is a key.

A database is in BCNF if every relation in it is in BCNF.

Decomposition

Given a set (U, F) of attributes U and FDs F, a decomposition is:

(U1, F1), .., (Un, Fn)

where Ui ⊆ U and Fi is a set of FDs over Ui.
BCNF decomposition - if each (Ui, Fi) is in BCNF.

4NF - 4th Normal Form

Can only be produced in the presence of MVDs
4NF implies BCNF
Any BCNF schema that has one key that consists of single attribute, is in 4NF.

3NF - 3rd Normal Form

(U, F) is in 3NF if for any FD X -> A, where A ∉ X is an attribute F ⊢ X -> A implies that one of the following is true:
- either X is a key, or
- A is prime

Query Processing and Optimization

σc1(σc2(E)) = σc1∧c2(E)
πA1,..,An(σc(E)) = σc(πA1,..,An(E))
σc(E1 ⋈ E2) = σc(E1) ⋈ E2 (if c only involves attributes of E1)
σc(E1 ⋈ E2) = σc(E1) ⋈ σc(E2)

Tailored optimization

Order matters - in WHERE use more limiting conditions first, e.g.
WHERE grade=’A’
AND sex=’female’
Use a>x OR a<x instead of a<>x
Provide more join information (e.g. all equal fields in >3 relations)
Avoid UNION if OR is sufficient
Use JOIN instead of nested queries

Transaction and Concurrency Control

ACID properties

Atomicity - either all operations of a transaction are reflected properly in the db, or none are.
Consistency - execution of a transaction in isolation preserves the consistency of the db.
Isolation - for any two transactions T and T', it appears to T that either T' finished before or started execution after T.
Durability - after a transaction completes successfully, the changes it has made persist.

Swapping a pair of operations in a schedule is allowed when:

they refer to different data items
they refer to the same data item and are both read operations

A schedule is called conflict serializable if it can be transformed into a serial schedule by a sequence of such swap operations

Precedence Graphs__

With read and write
An edge from T to T' if:

T executes write(X) before T' executes read(X)
T executes read(X) before T' executes write(X)
T executes write(X) before T' executes write(X)

A schedule S is conflict serializable iff its precedence graph contains no cycles.
A topological order of a precedence graph gives a serial schedule.

With lock and unlock
An edge from Ti to Tj when operation Ti:unlock(A) and a consecutive Tj:lock(A)

Two-phase locking

A protocol which guarantees conflict-serializable schedule
Two phases:
- Growing phase - request new locks, but not release locks
- Shrinking phase - release locks, but not request locks

Deadlocks

Prevention - low data utilization
Detection - if T2 request a resource that T1 holds:
- let T2 wait
- roll back T1 and grant the lock to T2
- roll back T2
  Decision based on timestamps - older transaction gets it.
  Starvation may occur - as some transaction are rolled back constantly

Wait-fore graph

There is an edge from T to T' if T' waits for T to release a lock.
There is a lock if there is a cycle in the graph
Solution: identify a minimal set of transactions such that rolling them back (removing from graph) would result in a cycle-free graph.

XML - eXtensive Markup Language

Tags and text
User-defined structure
Tags come in pairs - must be properly nested
Has only one "basic" type: PCDATA (Parsed Character DATA)
XML elements
- Element - segment between an start and corresponding end tag
- Subelement - -the relation between an element and its component elements
- Attribute - unique per tag, cannot be nested, e.g.

Query Languages

XPath
XSLT
XQuery

Tasks

FIlter, select values - navigation, selection, extraction
Merge, integrate values from multiple XML sources - joins, aggregation
Transform XML values from one schema to another - XML construction

DTD

<!ELEMENT node_name (regex)>
<!ELEMENT node_name attribute_name TYPE #required>
<!ELEMENT leaf_name (#PCDATA)>
E.g.:
<!ELEMENT book (title, (author+ | editor+), publisher?, price?) >
<!ATTLIST book year CDATA #required >
<!ELEMENT publisher (#PCDATA) >

XPath

Downward traversal
- lib/book -- parent/child
- lib//last -- parent descendant
- lib/book/* -- wild card
- lib/book/@year -- attribute
- //book/@* -- attribute wild card
- book/(editor | author)
Filters (qualifiers)
- //book[price]/title -- titles of books with price
- //book[@year > 1991]/title -- titles of books published after 1991
- //book[title and author and not(price)]/title
- //book[author/first = "Neil"]/title -- titles of books with an author whose fist name is Neil
- //book[editor | author]/title -- titles of books with either an author or editor
Upward traversal
- //author[../title="Databases"]/last == //book[title="Databases"]/author/last
- ancestor :: book[//last="Gaiman"] -- find book ancestors with "Gaiman" as its last descendant
Sideways
- following-sibling :: book[//last="Gaiman"] -- find the next book written by Gaiman
- preceding-sibling :: book[//last="Gaiman"] -- find the previous book written by Gaiman

XQuery

I'm not revising that shite, unless it shows up in a past paper :/

DTD

<!ELEMENT E P>
- P: EMPTY|ANY|#PCDATA|E'|P1,P2|(P1|P2)|P?|P+|P*
- Recursive definition, e.g.:
  <!ELEMENT node (leaf| (node,node))>
  <!ELEMENT leaf (#PCDATA)>
<!ATTLIST element_name (attribute-name attribute-type default-declaration)+>
- attribute-type:
  CDATA -- text
  ID -- element's unique ID
  IDREF -- value must be some other element's value already in document
  IDREFS -- a set of IDREF
- default-declaration:
  #required - a value must be provided
  #implied - a value is optional

Limitations

ID unique within entire document, while a key needs only to uniquely identify a tuple within a relation
IDREF untyped: one has no control over what it points to
Keys in relations could be multi-valued, while XML IDs must be single-valued
a relation may have multiple keys, while an element can have at most one ID (primary)
ID/IDREF can only be defined in a DTD, while XML data may not come with a DTD/schema
ID/IDREF, even relational keys / foreign keys, fail to capture semantics of hierarchical data

koshov/DBS.md