- Database - a collection of data (often in a DBMS) organized for a specific application (also see [Database Section](https://github.com/ION606/learn/blob/main/Database%20Systems/Notes.md#databases))
- Database Application - a software product that uses DBMSs to store one or more databases for a specific purpose
- Database Schema
- what types of data are valid to store
- fixed model
- Hard/expensive to change once implemented
- Does NOT contain the data itself
- **attributes are just the column names**
- Database Instance -
- the actual data that satisfies the rules of the database schema
- changing facts, what is true about the data at the moment
- Relational Data Model - the most popular way to describe data schema
- Data Model
- the type of data that can be stored
- rules about the data (Database Schema)
- design so that you hopefully never have to make changes, because making changes later on is difficult
- Transaction - a program that changes data or a sequence of database operations that satisfies the ACID properties (which can be perceived as a single logical operation on the data)
- ACID - see [ACID Section](https://github.com/ION606/learn/blob/main/Database%20Systems/Notes.md#a-c-i-d)
- Relational Data Model - see [Relational Data Model](https://github.com/ION606/learn/blob/main/Database%20Systems/Notes.md#relational-data-model) section
3. query language - allow access (read/write/update) to stored data easily
4. durability - data is safe even after something like a power outage
5. concurrent access - multiple users can read/write the same data without compromising integrity
### DBMS Components
- Storage Manager
- index or file manager
- Database Language Tools
- DML - Data query or manipulation language compiler
- DDL - Data definition language
- Query Execution Engine
- Buffer Manager
- Transaction Manager
- Logging and Recovery
- Concurrency Control
- Database Admin
- responsible for designing the data model
- Database Programmer
- responsible for writing application software that uses the database
- Systems Admin
- responsible for installing and tuning the DBMS
### A C I D
a set of properties of database transactions intended to guarantee data validity despite errors, power failures, etc.
**ACID stands for:**
- Atomicity - transactions must be completed fully or leave no effect on the database
- Consistency - DBMS must not allow programmers to violate consistency rules for a database schema
- Isolation - multiple transactions executed at the same time should result in the same thing as executing them one at a time
- Durability - once a transaction completes, DBMS must record ALL its results and make sure they're not lost
::: info
Example: A transfer of funds from one bank account to another, even involving multiple changes such as debiting one account and crediting another, is a single transaction
:::
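The transfer example can be sketched with Python's built-in `sqlite3` driver to show atomicity in action: the debit and credit either both commit or both roll back. The `accounts` table and amounts below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

def transfer(src, dst, amount):
    """Debit and credit succeed together or not at all (atomicity)."""
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.commit()
    except Exception:
        conn.rollback()   # undo the partial debit
        raise

transfer(1, 2, 30.0)          # commits: balances become 70 and 80
try:
    transfer(1, 2, 1000.0)    # rolls back: balances stay 70 and 80
except ValueError:
    pass
print(conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall())  # [(70.0,), (80.0,)]
```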
### Databases
- given by data schema/model (rules regarding data) and the database instance (the data)
- more here later.....
### Data Model
- Logical Data Model
- Relations and attributes
- Constraints (what is valid data and what is not)
- relation, tuple, attribute
- Physical Data Model
- Where to store the data
- which file systems (distributed, replicated)
- How to store the data
- which indices to create
- table, row, column
- Application Logic
- Built on top of database queries
- declarative: write once and optimize on top of the logical data model
## Relational Data Model
**Definitions**
- Relations (or tables) - store information
- Attribute (or column) - a property of a specific object represented by a relation
- Domain - a set of valid input
- Simple domains are integers/strings
- Complex Domains:
- can be defined with restrictions over these domains
- example: an 8-digit integer that starts with 6
- Schema - the names/domains of/for the attributes
**Structure**
- A relation contains a set of tuples
- A valid relation instance is made of tuples containing:
- values for all attributes in the relation schema that are drawn from the domain with that attribute
**Logical vs Physical Names**
- Logical
- the mathematical definition of the relational data model
- based on a set of semantics
- Physical
- the storage/implementation of the data model
- the implementation might not be identical to the logical model
R1 and R2 are the same relation; a key would be \[title, author\]
### Projection
syntax: `project_{attributes}(relation)`
### Selection
syntax: `select_{condition}(relation)`
### Cartesian Product
R x S = { t such that t has all attributes in R and all attributes in S, and there is a tuple r in R and a tuple s in S where t is equal to r on the attributes in R and to s on the attributes in S }
A functional dependency (FD) is a database constraint that determines the relationship of one attribute to another in a [database management system (DBMS)](https://www.geeksforgeeks.org/introduction-of-dbms-database-management-system-set-1/). Functional dependencies help maintain the quality of data in the database. Functional dependence is a relationship that exists between two attributes, usually between the primary key and non-prime attributes in the table.
**Example:** **X -> Y**
In this case, the left side of the arrow is the determinant and the right of the arrow is dependent. X will be the primary attribute and Y will be a non-prime attribute of the table. It shows that column X's attributes uniquely identify column Y's attributes to satisfy this functional dependency.
***AKA each value on the left side of the arrow is associated with exactly one thing on the right side of the arrow***
#### Functional Dependency Keys
A set of keys that implies all other dependencies
**Example:**
You are given the following set F2 of functional dependencies for relation R(A,B,C,D,E,F):
F2 = {AB -> CD, D->E, CA->B}
The keys would be ABF and ACF
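Key-finding comes down to computing attribute closures. A minimal Python sketch of the X+ algorithm, run against the F2 example above (function and variable names are my own):

```python
def closure(attrs, fds):
    """Compute X+: repeatedly apply FDs whose left side is contained in the result."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# F2 = {AB -> CD, D -> E, CA -> B} over R(A, B, C, D, E, F)
F2 = [("AB", "CD"), ("D", "E"), ("CA", "B")]

print(closure("ABF", F2))  # all six attributes -> ABF is a key
print(closure("ACF", F2))  # all six attributes -> ACF is a key
print(closure("BF", F2))   # only {'B', 'F'}    -> BF is not
```

Note that A and F never appear on the right side of any FD in F2, so both must be part of every key.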
### **Inference Rules**
FDs stands for Functional Dependencies. These are the set of attributes, which are logically related to each other.
**There are 6 inference rules:**
- **Reflexive Rule:** if B is a subset of A then A logically determines B. Formally, **B ⊆ A** then **A → B**.
- Example: Let us take an example of the Address (A) of a house, which contains so many parameters like House no, Street no, City, etc. These all are the subsets of A. Thus, address (A) → House no. (B).
- **Augmentation Rule:** If A logically determines B, then adding the same extra attribute to both sides doesn't change the basic functional dependency.
- Example: **A → B**, then adding any extra attribute let's say C will give **AC → BC** and doesn’t make any change.
- **Transitive rule:** if A determines B and B determines C, then it can be said that A indirectly determines C.
- Example: If **A → B** and **B → C** then **A → C**.
- **Union Rule:** If A determines B and C, then A determines BC.
- Example: If **A → B** and **A → C** then **A → BC.**
- **Decomposition Rule:** It is perfectly the reverse of the above Union rule. If A determined BC then it can be decomposed as A → B and A → C.
- Example: If **A → BC** then **A → B** and **A → C.**
- **Pseudo Transitive Rule:** If A determines B and BC determines D, then AC determines D.
- Example: If **A → B** and **BC → D** then **AC → D**.
### Prime Attribute
An attribute is prime if it appears in at least one key of R. Given a relation R and a set F of fds, X is a superkey if X+ is all attributes in R (in other words: X -> R is in F+).
### Basis
A set of functional dependencies forms a basis, if there is only one attribute on the right-hand side of each functional dependency
### Minimal Basis:
A set of functional dependencies F is a minimal basis if we cannot remove any fd, or any attribute from the left-hand side of an fd, without changing the meaning (the closure F+)
##### Algorithm for Converting a set F to a minimal basis
1. convert F to a basis form by using the splitting rule
2. Remove all trivial dependencies
3. Suppose X --> Y is in F, create F' by removing X --> Y
1. If X+ is the same in F and F', then X --> Y can be removed
2. AKA if we attempt to remove the functional dependency and the closure is the same, then the FD was not important, as it can be reconstructed from the remaining dependencies
```
COPY THIS EXAMPLE LATER (jesus christ)
```
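The three steps above can be sketched in Python. The helper names (`closure`, `minimal_basis`) and the sample FD set are my own, and the handling of left-hand-side attributes in step 3b is simplified:

```python
def closure(attrs, fds):
    """Compute X+ under a set of FDs given as (lhs, rhs) string pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def minimal_basis(fds):
    # 1. splitting rule: one attribute on each right-hand side
    basis = [(lhs, a) for lhs, rhs in fds for a in rhs]
    # 2. drop trivial dependencies (right side contained in left side)
    basis = [(lhs, a) for lhs, a in basis if a not in lhs]
    # 3a. drop redundant FDs: X -> A goes if A is still in X+ without it
    for fd in list(basis):
        rest = [f for f in basis if f != fd]
        if fd[1] in closure(fd[0], rest):
            basis = rest
    # 3b. drop extraneous left-hand-side attributes
    result = []
    for lhs, a in basis:
        for attr in list(lhs):
            smaller = lhs.replace(attr, "")
            if smaller and a in closure(smaller, basis):
                lhs = smaller
        result.append((lhs, a))
    return result

F = [("A", "B"), ("B", "C"), ("A", "C")]
print(minimal_basis(F))  # A -> C is redundant: [('A', 'B'), ('B', 'C')]
```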
### BOYCE-CODD NORMAL FORM (BCNF)
Given a relation R and a set of fds F, R is in BCNF iff for all fds in F of the form X -> Y, one of the following is true:
1. X is a superkey of R, or
2. X -> Y is trivial.
3NF relaxes this with a third option: every attribute of Y is a prime attribute.
If a relation is in BCNF, then it is also in 3NF
NOTE\*: To formally find all keys, you must go through all subsets. Remember to get rid of superkeys once you find a minimal key
For example:
```
given R(A, B, C, D) and the fds
AB --> C, AB --> D, C --> A
which give you
AB+ = (A, B, C, D)
BC+ = (A, B, C, D)
BD+ = (B, D)
the keys would be AB and BC
Superkeys: AB, ABC, ABD, ABCD, BC, BCD
Prime Attributes: A, B, C
BCNF:
AB --> C (OK because AB is a superkey)
AB --> D (OK because AB is a superkey)
C --> A (NOT OK because C is not a superkey and C --> A is not trivial)
3NF:
AB --> C (OK because AB is a superkey)
AB --> D (OK because AB is a superkey)
C --> A (OK ONLY IN 3NF, NOT BCNF, because A is a prime attr)
A --> A (OK because trivial)
ABD --> C (OK because ABD is a superkey)
so the relation is in 3NF but not BCNF
```
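The BCNF test can be mechanized with a closure routine: a non-trivial fd whose left side is not a superkey is a violation. Below is a sketch run on the example's fds (AB --> C, AB --> D, C --> A over R(A, B, C, D)); the function names are my own:

```python
def closure(attrs, fds):
    """Compute X+ under a set of FDs given as (lhs, rhs) string pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(R, fds):
    """Return the FDs that are non-trivial and whose left side is not a superkey."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not set(rhs) <= set(lhs)           # not trivial
            and closure(lhs, fds) != set(R)]      # lhs is not a superkey

R = "ABCD"
F = [("AB", "C"), ("AB", "D"), ("C", "A")]
print(bcnf_violations(R, F))  # [('C', 'A')] -- C is not a superkey
```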
Prime attributes: appear in at least one key
### Equivalency:
Two sets of functional dependencies F1 and F2 over the same relation R are equivalent if F1+ = F2+.
Example:
F1 = { A -> C }
F2 = { A -> C, A -> A }
These are equivalent because, ignoring the trivial dependency (A -> A), they are the same
#### Decomposition
A decomposition of R into R1, R2, ...., Rn is valid if R1, R2, ..., Rn together make up all of the attributes of R and is given by
R1 = project\_{attributes of R1} (R)
R2 = project\_{attributes of R2} (R)
. . . .
Rn = project\_{attributes of Rn} (R)
a good decomposition is:
- lossless required property, all decompositions should be lossless
- a decomp is lossless IF AND ONLY IF we are guaranteed that for every possible instance of R, R = R1 \* R2 .... \* Rn (the natural join of the projections can only ever add spurious tuples, never lose them, so lossless means the join gives back exactly R)
- dependency preserving (desired property)
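The lossless test can be sketched directly from the definition: project, natural-join, and compare with R. The `project`/`natural_join` helpers and the sample instance below are illustrative only:

```python
def project(R, attrs, schema):
    """Projection: keep only the named attributes (as a set of sub-tuples)."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in R}

def natural_join(R1, s1, R2, s2):
    """Natural join on the attributes the two schemas share."""
    shared = [a for a in s1 if a in s2]
    out_schema = s1 + [a for a in s2 if a not in s1]
    out = set()
    for t1 in R1:
        for t2 in R2:
            if all(t1[s1.index(a)] == t2[s2.index(a)] for a in shared):
                row = {**dict(zip(s1, t1)), **dict(zip(s2, t2))}
                out.add(tuple(row[a] for a in out_schema))
    return out

schema = ["A", "B", "C"]
R = {(1, 2, 3), (4, 2, 5)}

# lossy: project onto AB and BC; the join grows to 4 tuples (more than R)
r1, r2 = project(R, ["A", "B"], schema), project(R, ["B", "C"], schema)
print(len(natural_join(r1, ["A", "B"], r2, ["B", "C"])))  # 4, not 2

# lossless: project onto AB and AC; the join gives back exactly R
r1, r2 = project(R, ["A", "B"], schema), project(R, ["A", "C"], schema)
print(natural_join(r1, ["A", "B"], r2, ["A", "C"]) == R)  # True
```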
### Multi-valued dependency
Represented by "->>". Means that one value on the left-hand side can be associated with multiple values on the right-hand side.
A multi-valued dependency of the form A1 ... AN ->> B1 ... Bm means that for all pairs of tuples t1 and t2 that agree on A (everything on the left), we can find a tuple v in R such that:
- v agrees with t1 and t2 on A's
- v agrees with t1 on B's
- v agrees with t2 on the remaining attributes (not A's or B's)
Ex in class:
rin ->> hobby
rin ->> phone_number
For a given rin, there can be multiple values for a hobby and/or phone_number.
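The three-condition definition can be checked mechanically on a small instance. The relation below (rin, hobby, phone) is made up, and `mvd_holds` is a naive O(n^2) sketch:

```python
def mvd_holds(R, schema, A, B):
    """Check A ->> B: for every t1, t2 agreeing on A, the 'mixed' tuple must be in R."""
    pos = {a: schema.index(a) for a in schema}
    for t1 in R:
        for t2 in R:
            if all(t1[pos[a]] == t2[pos[a]] for a in A):
                # v agrees with t1/t2 on A's, with t1 on B's, with t2 on the rest
                v = tuple(t1[pos[x]] if x in A or x in B else t2[pos[x]]
                          for x in schema)
                if v not in R:
                    return False
    return True

schema = ["rin", "hobby", "phone"]
full = {(1, "chess", "x1"), (1, "chess", "x2"),
        (1, "golf",  "x1"), (1, "golf",  "x2")}
print(mvd_holds(full, schema, ["rin"], ["hobby"]))  # True: hobbies x phones is complete
print(mvd_holds(full - {(1, "golf", "x2")}, schema, ["rin"], ["hobby"]))  # False
```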
Every FD is an MVD, but not every MVD is an FD. The rule that every FD is also an MVD is called FD promotion.
Complementation rule: If A1 ... An ->> B1 ... Bm is true and C1 ... Ck are all attributes in R that are not As or Bs, then A1 ... An ->> C1 ... Ck is also true.
#### 4NF:
A relation is in fourth normal form iff whenever A1 ... An ->> B1 ... Bm is a non-trivial MVD, then A1 ... An is a superkey. The notions of keys and superkeys depend on FDs only; adding MVDs does not change the definition of "key". To decompose a relation into fourth normal form, use an algorithm similar to the BCNF decomposition algorithm, using MVDs. Relations in 4NF ⊆ Relations in BCNF ⊆ Relations in 3NF.
# COPY EXAMPLE HERE
### Hw 1 notes from kuzmin:
min-max functions do not exist
cannot sort, select the best thing to use?
RelaX: <https://dbis-uibk.github.io/relax/calc/local/uibk/local/0> (recommended tool for checking your answer)

Database structure such that no table can express redundant info (no 2 birthdays per customer, for example)
#### Normal Forms
Sets of data safety assessments/safety guarantees
###### **First Normal Form**
**Violating FNF**
- if you're using row order to convey information because row order is not maintained in a database
- mixing data types
- repeating groups
- re-adding data to each row
- like an inventory where you add items again and again to each table like \[shield, shield, shield\]
__Rules__
1. Using row order to convey information is not permitted
2. mixing data types within the same column is not permitted
3. having a table without a primary key is not permitted
4. repeating groups are not permitted
**Solution**:
1. Add primary key
2. structure the table to avoid redundancies
1. keep the count of every item in a player inventory instead of storing duplicates
###### Second Normal Form
**Definition: each non-key attribute must depend on the ENTIRE primary key**
deletion anomaly: deleting unrelated data breaks the logic
update anomaly: changing unrelated data breaks the logic
insertion anomaly: having no data breaks the logic
###### Third Normal Form
Definition:
1. No non-key attribute may depend on another non-key attribute
2. Put another way, every non-key attribute in the table should depend on the key, the whole key, and nothing but the key (lmao)
Transitive Dependency: An attribute is dependent on an attribute that is dependent on another attribute
###### Fourth Normal Form
Definition: The only multivalued dependencies in a table MUST be dependencies on the key
Multi-value dependency:
- expressed using a double arrow (->>)
## Entity-Relationship (ER) Models
- Method for designing databases
- Helps give high-level view of the whole database, while normalization is more geared toward optimizing individual relations
- Help modularize database design
- ER models are object-oriented, not relational
#### ER Data Models
- ER Data models design a whole database using entities and relationships
- Converting ER diagrams to a relational model:
- 1\. Convert each entity into a new relation R. Map entity keys for relation R. Map all other attributes to attributes of relation R.
- 2\. Convert relationships based on cardinality
- One-to-one/one-to-many: Map the relationship by adding E2's key as an attribute of the entity E1 that relates to (at most) one E2.
- Many-to-many: Create a new relation R: Include in R the keys of all joining entities. The keys must include the keys of all entities that have an N participation.
- Lossy decomposition: representing a ternary relationship as three binary relationships does not give the exact same result
- foundational approach for database design
- focus on representing entities, their attributes, and the relationships between them
- ensure a clear and modular database structure
- play an important role in providing a high-level perspective before the database is normalized or transformed into a relational model.
- **Purpose**: ER models are used for designing databases and offer a high-level, object-oriented view of the data structure.
- **Normalization vs ER Models**: While normalization focuses on optimizing individual relations, ER models help simplify the database by modularizing it into entities.
- **Modularization**: Entities represent major components, and relationships link these entities to one another.
- **Commonality**: ER modeling is widely used but is not the only database design method.
**Key Points**:
- Focus on entities and relationships.
- Modular design helps make normalization easier.
---
### **ER Data Models**
- **Entities and Relationships**: The core of ER modeling is to define entities (objects or classes) and relationships (connections) between them.
- **Relational Model Mapping**: Once the ER model is complete, it can be mapped to a relational data model. For example, after defining entities such as "Student" and "Faculty," they can be converted to relational tables.
---
### **Entity Classes and Attributes**
- **Entities**: An entity represents a class of objects, and each entity has attributes that describe its characteristics.
- **Attributes**: Should be simple values (no sets or multi-valued attributes).
- **Key Attributes**: An entity must have a key attribute (or a combination of attributes) to ensure uniqueness.
**Example**:
- **Faculty**: `{id, name}`
- The key is `id`.
- **Students**: `{id, name}`
- The key is `id`.
**Notation**:
- Entities are represented with boxes, attributes with ellipses, and key attributes are underlined.
---
### **Relationships**
- **Linking Entities**: Relationships connect entities to one another. They represent how entities interact, such as "Students take Classes" or "Faculty work in Departments."
- **Participation Constraints**: These specify how many instances of an entity participate in the relationship. Participation can be one-to-one, one-to-many, or many-to-many.
**Example**:
- One-to-many relationship: Each department has many faculty members.
- Many-to-many relationship: Students can take multiple classes, and each class can have many students.
### **Keys in Relationships**:
- Relationships do not generally have keys, although some conventions might allow it.
---
### **Recursive Relationships**
- Sometimes, an entity can be linked to itself through a relationship.
- **Example**: A faculty can mentor other faculty members, establishing a "mentor-mentee" relationship within the same entity.
### **Ternary Relationships**
- Involve three entities but should be used carefully. Many ternary relationships can be decomposed into binary relationships.
- **Example**: A faculty advising multiple students on different majors might seem ternary, but binary relationships between faculty and students or faculty and majors may suffice.
The **subclasses** section in Entity-Relationship (ER) models discusses how entities that share common attributes can be structured in a hierarchical manner. Subclasses are used when there is a need to represent entities that are specialized versions of a more general entity class, allowing inheritance of attributes and keys.
---
### **Key Concepts in Subclasses**
#### **Generalization and Specialization**
- **Generalization**: When multiple entities share common attributes, they can be generalized into a parent (superclass) entity. The individual entities (subclasses) inherit the attributes and key of the superclass.
- **Specialization**: Subclasses represent specialized entities that have additional attributes not shared with other subclasses or the parent.
#### **Type Hierarchy**
- In the subclass hierarchy, entities are organized in a **type hierarchy**, where each subclass inherits attributes from the parent entity class.
- The key and attributes of the parent entity (superclass) are passed down to the subclasses.
---
### **Example of Subclass Structure**
- **Superclass**: `People`
- Attributes: `person_id, name`
- **Subclasses**:
1. **Students** (inherits from `People`)
- Attributes: `person_id, name, class`
2. **Staff** (inherits from `People`)
- Attributes: `person_id, name, salary`
In this example, both `Students` and `Staff` inherit the `person_id` and `name` attributes from the `People` entity, but they also have their own specific attributes such as `class` (for students) and `salary` (for staff).
---
### **Disjoint and Overlapping Subclasses**
#### **Disjoint Subclasses**
- **Disjoint** subclasses mean that an entity can belong to only one subclass at a time.
- **Example**: A person can either be a student or a staff member, but not both.
#### **Overlapping Subclasses**
- **Overlapping** subclasses mean that an entity can belong to multiple subclasses at once.
- **Example**: A person could be both a student and a staff member, such as a teaching assistant who is also enrolled in classes.
---
### **Covering and Partial Subclasses**
#### **Covering Subclasses**
- In **covering** subclasses, all instances of the superclass must belong to at least one subclass.
- **Example**: All people in the `People` entity must either be a student or staff. No person can exist that is not part of one of these two subclasses.
#### **Partial Subclasses**
- In **partial** subclasses, some instances of the superclass may not belong to any subclass.
- **Example**: There could be people in the `People` entity who are neither students nor staff, representing individuals outside the scope of these two subclasses.
---
### **Mapping Subclasses to a Relational Model**
There are three basic ways to map a subclass hierarchy to a relational model:
#### **1. Storing Only Unique Information in Each Relation**
- In this method, only the attributes unique to each subclass are stored in the subclass tables, while the common attributes are stored in the superclass table.
**Example**:
```sql
People(person_id, name)    -- Superclass
Students(person_id, class) -- Subclass
Staff(person_id, salary)   -- Subclass
```
- **Advantages**: Easy to find all people (common superclass table).
- **Disadvantages**: Joins are required to retrieve full information about a student or staff, leading to slower queries.
#### **2. Map Each Entity to a Separate Relation**
- Each subclass and the superclass are stored in separate tables, with repeated attributes included in each table.
**Example**:
```sql
People(person_id, name)          -- Superclass
Students(person_id, name, class) -- Subclass
Staff(person_id, name, salary)   -- Subclass
```
- **Advantages**: Faster queries when retrieving information about a specific subclass.
- **Disadvantages**: Requires unions when querying for all people, as the data is spread across multiple tables.
#### **3. Combine All Information in a Single Relation**
- All data, including subclass-specific attributes, are stored in a single table, with some columns left `NULL` when they don't apply to an instance.
- **Advantages**: Simplified data model, fast queries.
- **Disadvantages**: There may be many null values (e.g., `class` for staff members or `salary` for students), and the model may become harder to manage and query.
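A sketch of this single-table mapping in sqlite3, following the People/Students/Staff example (the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# one table for the whole hierarchy; subclass columns are NULL when they don't apply
conn.execute("""CREATE TABLE People (
    person_id INTEGER PRIMARY KEY,
    name      TEXT,
    class     TEXT,   -- only meaningful for students
    salary    REAL    -- only meaningful for staff
)""")
conn.executemany("INSERT INTO People VALUES (?, ?, ?, ?)",
                 [(1, "Ann", "sophomore", None), (2, "Bob", None, 50000.0)])
conn.commit()

# no joins or unions needed: one scan answers queries over the whole hierarchy
students = conn.execute("SELECT name FROM People WHERE class IS NOT NULL").fetchall()
print(students)  # [('Ann',)]
```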
---
### **Choosing a Mapping Strategy**
The choice of mapping strategy depends on factors like the class hierarchy's **disjoint** or **overlapping** nature, and whether it is **covering** or **partial**. For example:
- If the subclasses are disjoint and covering, storing all the information in a single table may be efficient.
- If the subclasses are overlapping and partial, mapping each subclass to a separate table might be the better option.
---
### **Summary of Subclasses in ER Models**
- Subclasses allow for more detailed data modeling when entities share common attributes but also have their own specialized characteristics.
- The decision on how to map subclasses to a relational model should consider factors like performance, query complexity, and data integrity.
This structure helps ensure that the database accurately models real-world entities and relationships while optimizing for performance and maintainability.
---
# SQL
- SQL is an industry standard language for relational databases.
- Almost all database management systems implement SQL, though not identically:
- Core of the SQL standard is the same across all databases
- Advanced features may vary from database to database
- It is highly advisable to write queries that are portable from system to system: no bells and whistles unless it really gets you some strong performance gains.
- We will try to distinguish between core and special features as much as possible.
- A logical/declarative query language
- Express what you want, not how to get it
- Each SQL expression can be translated to multiple equivalent relational algebra expressions
- SQL is tuple based, each statement refers to individual tuples in relations
- SQL has bag semantics
- Recall RDBMS implementations of relations as tables do not require tables to always have a key, hence allowing the possibility of duplicate tuples.
Same is true for SQL, an SQL expression may return duplicate tuples, unless they are removed explicitly.
- SQL is case insensitive (though strings are case sensitive of course)
- Syntax:
- All statements must end with a semi-colon!
- Strings are single-quoted.
### Components
- Query language:
```
SELECT ... FROM ... WHERE ...
```
allows you to write queries to find what is stored in databases.
- DML: data manipulation language
```
INSERT
UPDATE
DELETE
```
allows you to change the contents of the existing tables
- DDL: data definition language
```
CREATE DATABASE
CREATE TABLE
ALTER TABLE
DROP TABLE
```
allows you to define database objects: schema, tables, indices, etc.
### Control Flow
1. From: read relations involved in the FROM clause
2. Where: check for each tuple if it passes the where clause
3. Select:
1. for tuples that pass the where clause
2. construct the output by the projection of attributes in select
## Syntax
#### General
```SQL
SELECT
baker
FROM
bakers
WHERE
hometown = 'London'
and age < 30;
```
this is equivalent to
`project_{ baker}(select_{ hometown == 'London' and age < 30 }(Bakers))`
This will have duplicates however, so we use...
#### Duplicate Removal
```SQL
SELECT DISTINCT
baker
FROM
bakers
WHERE
hometown = 'London'
and age < 30;
```
#### SELECT
- You can rename attributes returned
- You can use expressions over the attributes
- You can return constants
- Optionally, you can remove duplicates using distinct (only one DISTINCT clause in a single query)
```SQL
SELECT
LEFT(fullname, strpos(fullname, ' ')) as firstname,
UPPER(substring(fullname from strpos(fullname, ' ')+1)) as lastname,
'baker' as position,
occupation || ' from: ' || hometown as label
FROM
bakers ;
-- position is a new column with a fixed value, constant 'baker'
-- firstname is a substring of a column
-- label is a concatenation of two strings
-- functions can be combined in complex expressions
```
#### WHERE
- WHERE statement is equivalent to the selection in relational algebra.
- It contains a Boolean expression over individual tuples
- For each tuple produced by the FROM statement, we check whether the WHERE statement is true.
#### FROM
running `SELECT * FROM bakers, technicals ;` will create a **cartesian product** from the two tables
if we want to do a **join** we MUST include a join condition
```
SELECT *
FROM bakers b, technicals t
WHERE b.baker = t.baker;
```
- The variables b and t are aliases for the table names, especially needed if the two tables have attributes with the same name
- `SELECT attributes FROM R1, R2, ..., Rn WHERE Conditions` is equivalent to `project_{attributes}(select_{Conditions}(R1 x R2 x ... x Rn))`
- An aggregate returns a single tuple (unless accompanied by other clauses like GROUP BY or FILTER)
```SQL
-- Find total number of times ‘Kim-Joy’ won star baker.
SELECT count(*) as num_wins
FROM results
WHERE baker = 'Kim-Joy';
```
**Note:**
- `count(*)` counts the total number of tuples.
- `count(attribute)` counts the total number of values for a given attribute, disregarding the NULL values.
- `count(DISTINCT attribute)` counts the total number of distinct values for a given attribute, disregarding the NULL values.
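A quick sqlite3 demonstration of the three count variants (the `results` rows, including a NULL result, are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (baker TEXT, result TEXT)")
conn.executemany("INSERT INTO results VALUES (?, ?)",
                 [("Kim-Joy", "star baker"), ("Kim-Joy", "star baker"),
                  ("Ruby", "star baker"), ("Rahul", None)])

count_star = conn.execute("SELECT count(*) FROM results").fetchone()[0]
count_attr = conn.execute("SELECT count(result) FROM results").fetchone()[0]
count_distinct = conn.execute("SELECT count(DISTINCT result) FROM results").fetchone()[0]
print(count_star, count_attr, count_distinct)  # 4 3 1 (NULL is skipped; duplicates collapse)
```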
#### GROUP BY
Instead of computing the aggregates for the whole query, it is possible to compute it for a group.
- Group by multiple attributes by finding tuples that have the same values for the grouping attributes
- For each group, produce a single tuple containing grouping attributes and any aggregates over the group.
- To return an attribute from a relation, you MUST include it in the grouping attributes.
**Example**
Find the total number of star baker wins for each baker. Return the full name and hometown of each baker.
```SQL
SELECT b.baker, b.fullname, count(*) as numwins
FROM bakers b, results r
WHERE b.baker = r.baker and r.result = 'star baker'
GROUP BY b.baker, b.fullname;
```
#### GROUP BY - HAVING
- Group by statement can be followed by an optional HAVING clause.
- You can write conditions to eliminate groups in the HAVING clause
- Aggregates over the groups.
- All other conditions should be put in the WHERE clause to reduce the size of the relation to be grouped
**Example**
Find all bakers who have used ‘chocolate’ or ‘ginger’ in the showstopper challenge in at least two different episodes and won star baker at least twice. Return their fullname
```SQL
SELECT b.baker, b.fullname
FROM bakers b, showstoppers ss, results r
WHERE
b.baker = ss.baker
and b.baker = r.baker
and r.result = 'star baker'
and (lower(ss.make) like '%ginger%' or lower(ss.make) like '%chocolate%')
GROUP BY
b.baker, b.fullname
HAVING
count(DISTINCT ss.episodeid) >= 2
and count(DISTINCT r.episodeid) >= 2;
```
#### ORDER BY
- You can order the tuples returned by the query with respect to one or more attributes.
```sql
-- Return the episodes, ordered by 7-day viewers (descending) and id (ascending).
SELECT * FROM episodes
ORDER BY viewers7day desc, id asc;
```
#### LIMIT
- You can limit the number of tuples returned
- is the **last possible statement to add**
- makes the most sense when combined with an order by
```sql
-- Find the top 3 bakers in terms of number of wins. Return their name
SELECT b.baker, b.fullname, count(*) as numwins
FROM bakers b , results r
WHERE
b.baker = r.baker
and r.result = 'star baker'
GROUP BY b.baker, b.fullname
ORDER BY numwins desc
LIMIT 3;
```
---
# Lecture Notes: Advanced SQL Query Techniques
#### Generated by ChatGPT 4-o from my insane rambling notes because I'm sick and can't be fucked
---
## Introduction
In this lecture, we'll explore advanced SQL query techniques using a practical example involving a database schema and specific querying requirements. We'll cover topics such as regular expressions in SQL, data type conversions, handling `NULL` values, debugging SQL queries, and ensuring compatibility across different SQL dialects.
---
## Comprehensive SQL Concepts and Definitions for Future Assignments
---
## Table of Contents
1. Understanding the Database Schema
2. Basic SQL Statements
3. Data Filtering Techniques
4. Joining Tables
5. Working with NULL Values
6. Data Type Conversions and Casting
7. Functions and Expressions
8. Regular Expressions in SQL
9. Extracting Numbers from Strings
10. Aggregate Functions and Grouping Data
11. Subqueries and Common Table Expressions (CTEs)
12. Sorting and Limiting Results
13. SQL Dialects and Compatibility
14. Error Handling and Debugging
15. Best Practices
16. Security Considerations
17. Conclusion
18. Appendices
- Execution Order of SQL Statements
- Common SQL Functions
- Additional Resources
---
## 1. Understanding the Database Schema
Before writing SQL queries, it's crucial to understand the database schema:
- **Tables**: Structures that store data in rows and columns.
- **Columns**: Attributes or fields in a table.
- **Relationships**: How tables are related (e.g., primary keys, foreign keys).
3. **Validate Data Types**: Use casting if necessary.
4. **Simplify the Query**: Break it down to identify the problematic part.
5. **Use Comments**: Comment out sections to isolate errors.
### Example Error and Resolution:
```sql
-- Error: function round(double precision, integer) does not exist
-- Solution: Cast the number to numeric
SELECT ROUND(viewers7day::numeric, 2) FROM episodes;
```
---
## 15. Best Practices
- **Use Aliases for Clarity**:
- Shorten table/column names.
- *Example*: `SELECT e.name FROM employees AS e;`
- **Filter Early**:
- Apply `WHERE` clauses before `GROUP BY` or `JOIN` to reduce data size.
- **Optimize Joins**:
- Ensure proper indexing on join columns.
- Use appropriate join types.
- **Handle NULLs Appropriately**:
- Be aware of NULL behavior in comparisons and functions.
- **Comment Your Code**:
- Use `--` for single-line and `/* ... */` for multi-line comments.
- **Consistent Formatting**:
- Write SQL keywords in uppercase.
- Use indentation for readability.
---
## 16. Security Considerations
- **Prevent SQL Injection**:
- Use parameterized queries.
- Avoid concatenating user input into SQL statements.
- **Limit Permissions**:
- Grant only necessary privileges to users.
- **Validate Input**:
- Sanitize user inputs.
- Use input validation to enforce data integrity.
### Handling SQL
You can also return a table of rows:
> - Return each tuple with RETURN NEXT and finish with RETURN
> - As these return a table, they are called in the FROM clause. See the loop section below for examples.
````SQL
CREATE FUNCTION sales_tax(subtotal real) RETURNS boolean AS
````
---
## 17. Conclusion
Understanding these SQL concepts equips you to handle various data querying and manipulation tasks effectively. By mastering pattern matching, data type conversions, error handling, and other advanced techniques, you can write efficient and robust SQL queries for future assignments.
- **Calculate Difference**: `ABS(e1.viewers7day - e2.viewers7day)` computes the absolute difference in viewers.
- **Aggregate Function**: `MAX()` retrieves the maximum difference.
- **Result**: The query returns a single value `maxviewerdiff`, representing the largest viewer drop or increase between two consecutive episodes.
---
## Closing Thoughts
By integrating these advanced topics into your SQL knowledge base, you enhance your ability to write complex queries, troubleshoot issues, and ensure your code is both efficient and secure. Practice regularly with different scenarios to solidify these concepts.
---
## Procedural Programming
To allow SQL to be used for costly queries while still making it possible to write procedural code on top of it, databases support a number of options.
- Server-side
- Client-side
#### Server-side
- Languages make it possible to define procedures, functions, and triggers
- These programs are compiled and stored in the database server
- They can also be called by queries
#### Client-side
- Languages allow programs to integrate querying of the database with a procedural language
- Coding in a host language with db hooks (C, C++, Java, Python, etc.) using the data structures of these languages
- Coding in frameworks with their own data models (Java, Python, etc) with similar db hooks as in above.
### Programming in SQL
**All programming paradigms support:**
- methods to execute queries/update statements
- executing any SQL statement, catching the outcome, and interpreting the errors if any
- input values from variables into queries and outputting the values from queries into variables
- loop over query results (if multiple tuples)
- raise exceptions, which results in rollbacks of transactions
- store and reuse queries in the shape of cursors
- starting and committing transactions
**Client-side programs also support:**
- opening/closing connections
- allocating/releasing database resources for queries
**Server-side language examples (Generally database-specific):**
- pl/pgsql: a generic procedural language for postgresql
- pl/python: a procedural language that is an extension of Python
**Client-side language examples:**
- libpq: a C library for postgres which uses library calls specific to psql
- OCCI: Oracle library for C
- ECPG: embedded SQL programming in C, based on the embedded SQL standard, with a PostgreSQL-specific pre-compiler and the standard C compiler
**Frameworks**
based on specific design principles for developing database backed applications
examples:
- Object-relational-mapping used by Rails, Hibernate, Django, WebObjects, .NET (different frameworks have different models)
- Note that the frameworks can be built on top of other languages (such as Java + JDBC)
### pl/pgsql
- supports the same data types as the database
- programs and functions can be compiled and used directly by the db server
- main pl/pgsql block is in this form:
```pl/pgsql
[ <<label>> ]
[DECLARE
variable declarations ]
BEGIN
statement
END [ label ] ;
```
- Variable types
```PSQL
integer
numeric(5)
varchar
tablename%ROWTYPE
tablename.columname%TYPE
RECORD
-- ROWTYPE and RECORD have subfields, i.e. x.name.
```
#### Constructs
**Conditionals**
```SQL
IF ... THEN ... ELSIF ... THEN ... ELSE ... END IF ;
```
**Loops**
```SQL
[ <<label>> ]
LOOP
    statements
END LOOP [ label ];
```
**Returning a value**
- pl/pgsql functions do not allow you to modify input variables
- RETURN will return a value. As a result, you can call it like a constant in the select statement shown below:
```postgresql
CREATE FUNCTION sales_tax(subtotal real, state varchar) RETURNS real AS $$
DECLARE
adjusted_subtotal real ;
BEGIN
IF state = 'NY' THEN
adjusted_subtotal = subtotal * 0.08 ;
ELSIF state = 'AL' THEN
adjusted_subtotal = subtotal ;
ELSE
adjusted_subtotal = subtotal * 0.06;
END IF ;
RETURN adjusted_subtotal ;
END ;
$$ LANGUAGE plpgsql ;
```
we can test it like:
```SQL
select sales_tax(100, 'NY') ;
sales_tax
-----------
8
(1 row)
```
**The whole body of the function is entered within the two dollar signs**
You can also return a table of rows:
> - Return each tuple with RETURN NEXT and finish with RETURN
> - As these return a table, they are called in the FROM clause. See the loop section below for examples.
### Handling SQL
```SQL
CREATE FUNCTION sales_tax(subtotal real) RETURNS boolean AS $$
DECLARE
adjusted_subtotal real ;
BEGIN
adjusted_subtotal = subtotal * 0.06;
BEGIN
INSERT INTO temp VALUES (adjusted_subtotal) ;
RETURN true ;
EXCEPTION WHEN unique_violation THEN
RETURN false ;
END ;
END ;
$$ LANGUAGE plpgsql ;
```
when you run this function, a row is inserted into table temp
#### Executing queries
When the query returns a single row, then we can read it directly into a variable
```SQL
SELECT * INTO myrec FROM emp WHERE empname = myname;
IF NOT FOUND THEN
RAISE EXCEPTION 'employee % not found', myname;
END IF;
-- input: myname, output: myrec
```
When the query returns multiple rows, then a loop is needed to go through them one by one.
- A query returns a stream of tuples, which needs to be processed.
- Equally important is closing the stream associated with a query if required by the programming language.
```SQL
[ <<label>> ]
FOR target IN query LOOP
    statements
END LOOP [ label ];
```
Example:
```SQL
DECLARE
    myRow RECORD ;
    lastX INT ;
    yCnt INT ;
BEGIN
    lastX = 0 ;
    yCnt = 0 ;
    FOR myRow IN
        SELECT x, y, count(*) AS num
        FROM temp GROUP BY x, y ORDER BY x, num ASC LOOP
        yCnt = yCnt + 1 ;
        IF yCnt < 4 AND lastX = myRow.x THEN
            INSERT INTO temp2 VALUES (myRow.x, myRow.y, myRow.num) ;
        ELSIF lastX <> myRow.x THEN
            lastX = myRow.x ;
            yCnt = 1 ;
            INSERT INTO temp2 VALUES (myRow.x, myRow.y, myRow.num) ;
        END IF ;
    END LOOP ;
    RETURN 1 ;
END ;
```
Example 2:
```SQL
CREATE TABLE names (name VARCHAR(255)) ;
CREATE FUNCTION allnames() RETURNS SETOF names AS $$
DECLARE
row RECORD ;
BEGIN
FOR row in SELECT DISTINCT crsname FROM courses LOOP
RETURN NEXT row ;
END LOOP ;
RETURN ;
END ;
$$ LANGUAGE plpgsql ;
```
call it like `select * from allnames();`
### Cursors
- A query with a handle, which can also take input.
- Can be defined once and used many times to read tuples.
- A cursor is optimized once, reducing the cost of optimizing the query many times.
- Functions may return reference to a cursor, allowing a program to read tuples that are returned.
- Cursors provide a more efficient implementation of queries returning many tuples.
- First, declare cursors:
```SQL
DECLARE curs2 CURSOR FOR SELECT * FROM tenk1;
```
- Then, execute the associated query by opening them:
```SQL
OPEN curs2;
```
- Then, retrieve tuples in the result using fetch:
```SQL
FETCH curs2 INTO foo, bar, baz;
```
or
> ```SQL
> FOR recordvar IN curs2 LOOP
> ```
- When finished, close the cursor to release allocated memory:
```SQL
CLOSE curs2;
```
- Cursors can also be used for update/delete when pointing to a specific tuple (similar to the notion of an updatable view): the tuple the cursor is currently pointing to is the one updated/deleted
### Exceptions
- When an SQL statement is executed, if it is not successful, it raises an error. In client-side code this error can be caught in the usual try/catch format (see Errors and Statuses below).
#### Insert/update statements
- use the `executeUpdate()` method
```SQL
statement* s1 = con->createStatement("INSERT INTO my_table (a, b) VALUES (1, 'A')");
s1->executeUpdate();
```
#### Select statements
- uses `executeQuery()` method
- To process these tuples, you need a result set object which processes tuples in a similar way to a file
- Need to open, iterate through, and close a result set to access the tuples
- this essentially returns an iterator (use `itr->next()`, which returns false when done, etc.)
- getXXX(i) means attribute i of the query should have type XXX
```SQL
statement* s1 = con->createStatement(
    "SELECT id, name FROM emp WHERE id < 1000");
ResultSet* r = s1->executeQuery();
while (r->next()) {
    varId = r->getInt(1) ;
    varName = r->getString(2) ;
}
s1->closeResultSet(r);
```
#### Errors and Statuses
```SQL
try{
... operations which throw SQLException ...
}
catch (SQLException e){
cerr << e.what();
cerr << e.getMessage();
cerr << e.getErrorCode();
}
```
- it is possible to check the status of a statement during runtime, which can be `UNPREPARED`, `PREPARED`, `RESULT_SET_AVAILABLE`, or `UPDATE_COUNT_AVAILABLE`
- you can check if the result set is `END_OF_FETCH = 0` or `DATA_AVAILABLE`, or even use `r->isNull`
- `vector<MetaData> getColumnListMetaData() const;` will return the number, types, and properties of a ResultSet's columns
#### Transactions
```C
con->commit();
con->rollback();
```
After a rollback/commit, the next query/update will start a new transaction
- as a rule of thumb, use a view only when the query cannot be written without it or when it provides savings that would otherwise be missed by the optimizer
- otherwise, the optimizer may miss some optimizations and rewritings of the query when views are used
### Views (not anonymous)
a view with a name that can then be used elsewhere
```SQL
CREATE VIEW noteliminated (baker, name, age)
AS
SELECT baker, fullname, age
FROM bakers
WHERE baker NOT IN (SELECT baker
                    FROM results
                    WHERE result = 'eliminated');
```
this can now be used like it was a table:
```SQL
SELECT *
FROM noteliminated
WHERE age > 45;
```
### Why use views?
##### compartmentalization
different users can only see the data that they have access to
EXAMPLE:
problem: faculty cannot access the financial information of students and can only access the information about the students who are currently taking the course with them
solution: create a view for students in a specific class with ONLY the relevant attributes, then build the application on top of that
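A minimal sketch of that solution (all table, column, view, and role names here are hypothetical, chosen only to illustrate the idea):
```SQL
-- hypothetical schema: students(id, name, financial_info), enrolled(sid, crsname)
CREATE VIEW csci4380_roster AS
SELECT s.id, s.name              -- only the relevant attributes, no financial_info
FROM students s, enrolled e
WHERE s.id = e.sid
  AND e.crsname = 'CSCI4380' ;

GRANT SELECT ON csci4380_roster TO faculty ;  -- faculty never touches the base tables
```
The application for faculty is then written entirely against `csci4380_roster`.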
Views can also be used to insert/update/delete tuples instead of the table they are based on.
- This builds on the philosophy of building functionality based on views
- However, this is only possible for a very restricted subset of views, called updatable views
- Updatable views are such that each tuple in the view maps to one and only one tuple in the table it is based on
- Using views to create functionality hides data complexity from developers
- If the data model changes, the application code does not have to change as long as the new model can be mapped to the same view
### Why not use views
- Writing a query using views may hide some optimizations from the database, creating sub-optimal query plans
### Updatable views
A view is updatable if:
- It has only one table T in its FROM clause
- It contains all attributes from T that cannot be null
- It does not have any DISTINCT, GROUP BY statements (one-to-one correspondence between a tuple in the view and a tuple in the table)
EXAMPLE:
```SQL
CREATE VIEW lt40 (baker, name, age)
AS
SELECT baker, fullname, age
FROM bakers
WHERE age < 40;

UPDATE lt40 SET age = 40 WHERE baker = 'Manon';
```
- lt40 does not store any tuples. This expression allows only those tuples of bakers that are accessible through view to be updated.
- **After the update, the resulting tuple may not even be in the view (unless the view is created with the CHECK OPTION):**
```SQL
UPDATE lt40 SET age = 40 WHERE baker = 'Manon';
```
Since now Manon is not younger than 40, she will not be returned by the view
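A sketch of the same view created with the check option (the view name `lt40c` is my own); now an update that would push a tuple out of the view is rejected instead of silently succeeding:
```SQL
CREATE VIEW lt40c AS
SELECT baker, fullname, age
FROM bakers
WHERE age < 40
WITH CHECK OPTION ;

UPDATE lt40c SET age = 40 WHERE baker = 'Manon' ;
-- rejected with an error: the new row would not be visible in the view
```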
### Indexing
- Views don't improve performance, and may even cause a loss of performance
- One way to improve performance is to store (cache) the results of some queries in the database
- That's an index, just a cached query result
```SQL
SELECT episodeid
FROM technicals
WHERE baker = 'Kim-Joy' AND rank = 1;
```
This requires reading the entire `technicals` table, and returns very little
#### Cost Analysis
- consider a large table X, and a query that returns a few tuples as the answer
- suppose X is stored on disk in 100 disk pages; answering this query requires reading all 100 pages
- instead, use an Index on Technicals (baker) for the above syntax (or a similar index for relation X)
- this will make the cost just reading the tuples from the index, which is much less expensive
#### Indexing as views
- Indexes are just query results stored explicitly
- They are also stored on disk, but can be cheaper to use because:
- They have fewer disk pages as they store only a subset of the attributes in the relation
- They are stored in a way to make it easy to find queries on specific values in the index
#### Index cost/benefit analysis
- Indices are good if
- they reduce the cost of frequently asked queries
- the reduction is considerable
- Indices must be kept up to date when the tables change
- Indices increase the cost of insert/update/delete operations (at least one extra disk page access for each index created)
- a good index will reduce the total number of matching tuples to 1 or a few
- almost all databases will create an index on the primary key
EXAMPLE:
an index on bakers(baker) would improve queries like `SELECT * FROM bakers WHERE baker = 'Rahul';`
**If the underlying relation is sorted with respect to some attribute, then an index on that attribute will help performance**
EXAMPLE:
- Suppose, technicals tuples are sorted by baker and rank.
- Create an index on Technicals(baker, rank)
given this query
```SQL
SELECT episodeid
FROM technicals
WHERE baker = 'Kim-Joy'
  AND rank = 1;
```
use the index to find the first tuple for `baker` 'Kim-Joy', and then scan the `technicals` relation starting from that point
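For reference, such an index could be created in PostgreSQL as follows (the index name is my choice); by default this builds a B-tree sorted on `baker` first, then `rank`:
```SQL
CREATE INDEX technicals_baker_rank ON technicals (baker, rank) ;
```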
### Access Structure
- A postgresql database cluster is organized into databases
- No data can be shared across databases
- Information in a database can be clustered into logical units called schema
#### Schema
create a schema using:
```SQL
CREATE SCHEMA schemaname;
```
access/create tables in the schema using:
```SQL
schema.table
```
delete a schema and everything inside it using:
```SQL
DROP SCHEMA schemaname CASCADE;
```
create a schema owned by someone
```SQL
CREATE SCHEMA schemaname AUTHORIZATION username;
```
#### Search path
when a table is used, the db tries to find the correct instance
search path is usually (in order):
1. `$user`: a schema with the same name as the current user
2. public: any information that is open to all users
::: info
the search path can be changed by using `set search_path to ...`
:::
### Security
* postgres allows the creation of roles
* a role is like a user, but more general
* **a role with a login privilege is considered a user**
* a role can be given the right to create databases and/or create other roles
* a role with superuser privileges can bypass all security checks
#### Role Creation and Inheritance
`INHERIT` allows a role to automatically use the privileges of the roles it is a member of
```SQL
CREATE ROLE joe LOGIN INHERIT;
CREATE ROLE admin NOINHERIT;
CREATE ROLE wheel NOINHERIT;
GRANT admin TO joe;
GRANT wheel TO admin;
```
- Joe has the privileges of admin immediately upon login, because joe inherits the privileges of the roles it is a member of. However, joe does not automatically get the privileges assigned to wheel, because admin is marked NOINHERIT and does not pass them along.
- As a role connects to the database, it has all the rights given to that role (login role). For other privileges that are not inherited, the user must explicitly set itself to that role using `SET ROLE admin ;`
### Database Objects
- all database objects (database, tables, indices, procedures, triggers, etc) have an owner (the role that created them)
- owner has all the access rights on the objects they create
- other roles can be granted explicit privileges on these objects, like `SELECT, INSERT, UPDATE, DELETE, TRUNCATE, REFERENCES, TRIGGER, CREATE, CONNECT, TEMPORARY, EXECUTE, and USAGE`
- SELECT, INSERT, DELETE, UPDATE are the privileges to query (select) and change the data of some other role.
- Can be specific: `SELECT(name)`
- `REFERENCES` is the right to refer to a relation in an integrity constraint
- `USAGE` is the right to use a schema element in relations, assertions, etc.
- `TRIGGER` is the right to define triggers
- `UNDER` is the right to create subtypes
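A few example grants in the spirit of the roles above (the role and table names follow the notes' examples; the exact statements are my own sketch):
```SQL
GRANT SELECT, INSERT ON bakers TO joe ;    -- query and add rows
GRANT SELECT (name) ON bakers TO admin ;   -- column-specific privilege
REVOKE INSERT ON bakers FROM joe ;         -- privileges can also be taken back
```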
#### Grant option
Users/roles can pass a privilege to another user/role if they have the grant option
```SQL
GRANT SELECT ON users TO uname
WITH GRANT OPTION ;
```
::: info
Only a role that has a grant option can grant the grant option to the others.
:::
#### Grant diagrams
- Nodes represent a user and a privilege
- Two different privileges of the same person should be put in two different nodes
- If one privilege for a user is the more general version of another, they should both be included.
#### Connections
- many connections can be opened in a program, but generally one connection per database is sufficient
- different databases can be used in a single program
- if you want to close the connections (do this before the program exits) use `EXEC SQL DISCONNECT [connection];`
- change between multiple open connections using `EXEC SQL SET CONNECTION connection-name;`
#### Variables in ESQL
All variables **MUST** be declared using ESQL declarations and data types
```SQL
EXEC SQL BEGIN DECLARE SECTION;
    VARCHAR e_name[30], username[30];
    INTEGER e_ssn, e_dept_id;
EXEC SQL END DECLARE SECTION;
```
::: info
You can use almost any SQL command in ESQL as long as proper input to these commands are provided in the form of program variables
:::
### Executing SQL commands
EXAMPLE
find the name of an employee given their SSN
```SQL
EXEC SQL SELECT name, dept_id INTO :e_name, :e_dept_id
    FROM employee
    WHERE ssn = :e_ssn;
```
::: info
Program variables are preceded by “:”, i.e. :e_ssn.
:::
#### Strings from Oracle to C
in Oracle, strings are stored along with the length, so no need for a terminating char
to use with C you must **MANUALLY ADD THE TERMINATING CHARACTER**
EXAMPLE:
```C
strcpy(username.arr, "Sibel Adali") ;
username.len = strlen("Sibel Adali") ;
strcpy(passwd.arr, "tweety-bird") ;
passwd.len = strlen("tweety-bird") ;
EXEC SQL CONNECT :username IDENTIFIED BY :passwd ;
scanf("%d", &e_ssn) ;
EXEC SQL SELECT name, dept_id INTO :e_name, :e_dept_id
    FROM employee WHERE ssn = :e_ssn ;
e_name.arr[e_name.len] = '\0' ; /* so we can use the string in C */
printf("%s", e_name.arr) ;
EXEC SQL COMMIT WORK ;   /* make any changes permanent */
EXEC SQL DISCONNECT ;    /* disconnect from the database */
```
#### Status Processing
**<u>SQL Communications area (sqlca)</u>** is a data structure that contains information about
- error codes (might be more detailed than SQLSTATE)
- warning flags
- event information
- processed rows' count
- diagnostics for all processed SQL statements
- include it in the program using `EXEC SQL INCLUDE SQLCA;` or `#include <sqlca.h>`
- Whenever an SQL statement is executed, its status is returned in a variable named `"SQLSTATE"`
- this variable **MUST** be defined in the variable section, but the values are populated automatically
```SQL
EXEC SQL BEGIN DECLARE SECTION;
    char SQLSTATE[6];
EXEC SQL END DECLARE SECTION;
```
::: warn
if multiple errors or warnings happen during the execution of a statement, sqlca will contain info about the last one
:::
- if no error or warning occurred in the last SQL statement, `sqlca.sqlcode` will be 0 and `sqlca.sqlstate` will be “00000”
- if an error or warning occurred, then `sqlca.sqlcode` will be negative and `sqlca.sqlstate` will not be “00000”
::: info
if the statement was successful, then `sqlca.sqlerrd[1]` will have the OID of the processed row (if applicable) and `sqlca.sqlerrd[2]` will have the number of processed or returned rows (if applicable)
:::
The code can be checked after each statement and error handling code can be written
- commit, rollback, exit program, etc
```C
if (strcmp(SQLSTATE, "00000") != 0)  /* SQLSTATE is 5 characters plus '\0' */
    EXEC SQL ROLLBACK;
```
you can use `WHENEVER` to set trap conditions that remain active throughout the program
::: info
- Because WHENEVER is a declarative statement, its scope is positional, not logical. That is, it tests all executable SQL statements that physically follow it in the source file, not in the flow of program logic.
- A WHENEVER directive stays in effect until superseded by another WHENEVER directive checking for the same condition.
:::
#### ESQL Transactions
- Transactions logically start with the first SQL statement and end with either a COMMIT or ROLLBACK statement
- It is possible to set boundaries of transactions with the SQL statement:
```SQL
BEGIN ;
SET TRANSACTION READ ONLY
ISOLATION LEVEL READ COMMITTED
DIAGNOSTICS SIZE 6 ;
```
- Diagnostics size is the number of exception conditions that can be described at one time in the status
- READ ONLY, READ/WRITE transactions allow the programmer to define the type of the transaction
#### ESQL Cursor Operations
to declare a cursor, use a normal SQL query
```SQL
EXEC SQL DECLARE emps_dept CURSOR FOR
select ssn, name from employee
where dept_id = :e_dept_id ;
```
- Open a cursor: the corresponding SQL query is executed, the results are written to a file (or a data structure) and the cursor is pointing to the first row `EXEC SQL OPEN emps_dept ;`
- read the row the cursor is pointing to using `FETCH` (this also moves the cursor to the next row) `EXEC SQL FETCH emps_dept INTO :e_ssn, :e_name ;`
- when the cursor is done, `sqlca.sqlcode == -1`
- handle errors using `EXEC SQL WHENEVER NOT FOUND {}`
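Putting FETCH and WHENEVER together, a typical read loop looks roughly like this (a sketch in the ESQL style used above; `DO BREAK` is the ECPG action that exits the enclosing C loop):
```SQL
EXEC SQL WHENEVER NOT FOUND DO BREAK;
while (1) {
    EXEC SQL FETCH emps_dept INTO :e_ssn, :e_name;
    printf("%d %s\n", e_ssn, e_name.arr);     /* process the current row */
}
EXEC SQL CLOSE emps_dept;
```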
#### Cursors and Snapshots
cursors can be declared as INSENSITIVE which means
- the contents are computed when the cursor is opened
- the contents will not change even if the database changes
::: info
This type of cursor is used for snapshots of the database
:::
```SQL
DECLARE cursor_name [ INSENSITIVE ] [ SCROLL ] CURSOR FOR
    table_expression
    [ ORDER BY order-item-comma-list ]
    [ FOR [ READ ONLY | UPDATE [ OF column-commalist ] ] ]
```
#### Cursors for Update
cursors can be declared for update which means:
- updates can be performed on the current tuple
- these updates will ONLY have an effect if the cursor is **NOT** insensitive
```SQL
DECLARE cursor-name CURSOR FOR table-expression
    FOR UPDATE OF column-list

UPDATE table-name SET assignment-list
    WHERE CURRENT OF cursor-name

DELETE FROM table-name WHERE CURRENT OF cursor-name
```
#### Constraints
- throw an `sqlerror` if violated
- when the violation is detected depends on the constraint-checking mode:
  - if constraint checking is immediate, then a violation will trigger an immediate rollback
  - if constraint checking is deferrable, then a violation will do nothing until the transaction tries to commit, at which point the error is thrown and triggers a rollback
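A sketch of deferring a constraint inside a transaction (the constraint must have been declared DEFERRABLE; all names here are hypothetical):
```SQL
BEGIN ;
SET CONSTRAINTS emp_dept_fk DEFERRED ;
-- statements may temporarily violate emp_dept_fk inside the transaction
INSERT INTO employee VALUES (1, 'Kara', 42) ;  -- department 42 does not exist yet
INSERT INTO department VALUES (42, 'HR') ;     -- now it does
COMMIT ;  -- the deferred constraint is checked here
```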
#### Dynamic SQL
- embedded SQL statements are created using strings
- strings are fed to an `EXEC` SQL statement `exec sql execute immediate :sql_string`
- statements are not known to the pre-compiler, and must be optimized at runtime
- you can use the same string to run multiple statements
EXAMPLE:
```SQL
strcpy(sqltext.arr, "delete from employee where ssn = ?") ;
sqltext.len = strlen(sqltext.arr) ;
EXEC SQL PREPARE del_emp FROM :sqltext ;
EXEC SQL EXECUTE del_emp USING :cust_id ;
```
#### SQLDA
- when a dynamic SQL statement is executed, we don't know which columns will be returned/how many
- the SQLDA descriptor definition allows us to find the number of columns/their values
EXAMPLE:
```SQL
EXEC SQL INCLUDE sqlda ;
EXEC SQL DECLARE sel_curs CURSOR FOR sel_emps ;
EXEC SQL PREPARE sel_emps FROM :sqltext ;
EXEC SQL DESCRIBE sel_emps INTO sqlda ;
EXEC SQL OPEN sel_curs ;
EXEC SQL FETCH sel_curs USING DESCRIPTOR sqlda ;
```
### SQL Object-Relational Frameworks
see [this link](https://dba-presents.com/other/my-thoughts/34-database-agnostic-applications) for more
- tight integration between application logic and the database
- describe the database model as an object-oriented class description
- write queries not in SQL but directly in the programming language
- Create tools that are DB agnostic (abstracts the database away)
- most queries are simple filter statements over single relations or relations through foreign keys
- do not require full knowledge of SQL
- most application functions are easily mapped to CRUD operations (create, read, update and delete)
::: warn
Be careful if your join is different than what the foreign key implies.
Be careful about how much data is read for each object and when: for deep nested structures, does it read the whole hierarchy?
:::
### SQL Object-Relational Extensions
- postgres (and others) have extensions that go beyond the relational data model
- these violate the relational data model
- this trades the simplicity of the data model and queries for harder optimizations
- find where an extension is using `SELECT * FROM pg_available_extensions WHERE name = 'extension_name_here';`
#### Semantic Hierarchies and Inheritance
same as `ISA` (is a) relationships in ER diagrams (i.e. A isa B, which means A has all of B's attributes and then its own)
CLASS HIERARCHIES EXAMPLE:
```SQL
CREATE TABLE cities (
    name        text
  , population  float
  , altitude    int    -- in feet
);

CREATE TABLE capitals (
    state char(2)
) INHERITS (cities);
```
if you now do
```SQL
SELECT name, altitude
FROM cities
WHERE altitude > 50;
```
the above will query all cities AND all capitals
use the **ONLY** keyword to query only cities, not capitals
```SQL
SELECT name, altitude
FROM ONLY cities
WHERE altitude > 50;
```
To find out which table a row comes from use the `relname` attribute from the `pg_class` table
```SQL
SELECT p.relname, c.name, c.altitude
FROM cities c, pg_class p
WHERE c.altitude > 50
  AND c.tableoid = p.oid;

Output:
 relname  |   name    | altitude
----------+-----------+----------
 cities   | Las Vegas |     2174
 cities   | Mariposa  |     1953
 capitals | Madison   |       84
```
#### Complex Objects/User Defined Types
::: warn
This goes against the first normal form (i.e that all values should be atomic), but it allows multiple related values to be encapsulated
:::
```SQL
CREATE TYPE phone_type AS (
    num   varchar(12)
  , type  varchar(50)
);

CREATE TABLE person (
    id     int
  , name   varchar(30)
  , phone  phone_type
);

INSERT INTO person VALUES (
    1
  , 'Kara Danvers'
  , ('555-1234','work')::phone_type
);

SELECT * FROM person WHERE (phone).type = 'work';

 id |     name     |      phone
----+--------------+-----------------
  1 | Kara Danvers | (555-1234,work)
```
::: info
you can define user-defined types as restricted domains of values and then use them in multiple places
The best way to use complex types is to write procedures/functions using pl/pgsql or a programming language like C.
:::
### Typed objects and methods
- main use is to create extensions for handling specific types of data
- Examples:
- Geographic data: points (geo locations), polygons (state, city boundaries), line segments (roads, rivers)
- Text data: vectors of words and weights for each word
- JSON
```SQL
SELECT '{"foo": {"bar": "baz"}}'::jsonb;
jsonb
-------------------------
{"foo": {"bar": "baz"}}
SELECT '{"foo": {"bar": "baz"}}'::jsonb->'foo';
?column?
----------------
{"bar": "baz"}
```
#### Geographic Data
- use PostGIS, an extension that supports geographic data
- this is an EXTERNAL PACKAGE AND MUST BE INSTALLED FIRST (e.g. `yay -S postgis`). The way to install PostGIS in the course notes is out of date; use this instead ([source](http://obsessivecoder.com/2010/02/01/installing-postgresql-8-4-postgis-1-4-1-and-pgrouting-1-0-3-on-ubuntu-9-10-karmic-koala/))
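Once installed, the extension is enabled per database; a small sketch (the coordinates are arbitrary; `ST_Distance` on `geography` values returns meters):
```SQL
CREATE EXTENSION IF NOT EXISTS postgis ;

SELECT ST_Distance(
    ST_MakePoint(-73.68, 42.73)::geography ,   -- lon/lat point
    ST_MakePoint(-73.75, 42.65)::geography ) ; -- distance in meters
```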
```SQL
...
    AND to_tsvector('english', r.review_text) @@ query
ORDER BY rank DESC
LIMIT 10;

            name            |   rank
----------------------------+-----------
 DeFazio's Pizzeria         |      0.05
 Little Bites and More      |      0.05
 Notty Pine Tavern          | 0.0366667
 Red Front Restrnt & Tavern | 0.0285714
 New York Style Pizza       |     0.025
 Milano Restaurant          | 0.0218698
 DeFazio's Pizzeria         | 0.0202986
 The Fresh Market           |      0.02
 Dante's Pizzeria           | 0.0192982
 Labella Pizza              | 0.0155556
```
## Indexing
- databases are mainly optimized for data that is too large to fit in memory
- secondary storage is crucial for understanding:
- how data is accessed to respond to queries and modify data
- how indices can help speed up queries and the performance trade-offs of using them
### Secondary Storage
::: info
the first part of this section is not sql. If you want to skip to the part about tuple storage and indices, jump to <a href="#tuple-storage-on-disk">Tuple Storage on Disk</a>
:::
#### Types of Disks
- **Magnetic Disks**:
- Cost-effective with high capacity.
- Characteristics:
- Inexpensive storage.
- Fast sequential access.
- Slower random access.
- Density increases over time without significant speed improvements.
- **Solid State Drives (SSDs)**:
- Faster access and lower power consumption.
- Characteristics:
- Rapid access for most operations.
- Higher cost (~2x per TB compared to magnetic disks).
given `n = 3`
- each leaf node will have between 2 and 3 tuples (inclusive)
- each internal node will point to between 2 and 4 nodes below (and so have between 1 and 3 key values)
given `n = 99`
- each node will have between 50 and 99 tuples (inclusive)
- each internal node will point to between 50 and 100 nodes below (and will have between 49 and 99 key values)
**note that the root can have two pointers and one key value at least**
#### Searching in B-trees
**searching for equality `A = x`**
1. start at root
2. while (not leaf node)
1. find the first key *greater* than x
2. follow the pointer just before this key
3. if the leaf:
1. contains key value x: return x's tuple id
2. does not contain x: return empty
**Searching for range**
Given an index on attribute A find all tuples in the range `x1 <= A <= x2`
1. start at root
2. while (not leaf node)
1. find the first key value that is greater than `x1`
2. follow the pointer before this key value
3. while (leaf node values < `x2`)
1. find all entries in leaf node in the given range
2. retrieve next leaf node (sibling pointer) and continue
4. return all found tuple ids
#### Index on multiple attributes A,B
an index on multiple attributes A,B will sort first by A then by B
EXAMPLE:
- `A = x AND B = y`: same as an equality search for the pair `(x, y)`, using the index sorted on A then B
- `A = x`:
  1. search for the first entry with `A = x` (ignore B)
  2. scan leaf nodes to the right (following sibling pointers) while `A = x`
- `x1 <= A <= x2 AND B = y`:
  1. search for the first entry with `x1 <= A`
  2. scan leaf nodes while `A <= x2`, following sibling pointers
  3. for every entry found, check if `B = y`
     1. if it is, put it in the output
- `B = y`:
  1. find the first leaf node, then scan all leaf nodes following sibling pointers
  2. for each tuple:
     1. if `B = y`, add it to the output
  - THIS IS AN INDEX-ONLY SCAN (the whole index is read, but the table is not)
#### Index-only Search
Given
```SQL
SELECT A FROM R WHERE A < 120 AND A > 10
```
and an index on `R.A`, scan the index for matching tuples and return the found A values
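In PostgreSQL such a query can be answered by an index-only scan once an index on A exists; a sketch (the index name is my choice):
```SQL
CREATE INDEX r_a_idx ON R (A) ;
EXPLAIN SELECT A FROM R WHERE A < 120 AND A > 10 ;
-- the plan should show an Index Only Scan using r_a_idx
-- (provided the table has been vacuumed so the visibility map is current)
```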
#### Index partial match
- Given an index on R(A,B) (index is sorted on A first and then on B)
- `select C, D from R where A > 10 and A < 100 and B = 2`
- scan the index for the range A > 10 and A < 100; for each matching entry check the B value, then read the matched tuples from disk for the C and D attributes
### B-Trees with Duplicate Values
if a B-tree is built on a key/value that contains duplicates, it's built in the same way except:
- **Key Adaptation in Non-Leaf Nodes**:
- When a **non-leaf node** points to a **leaf node**, the value stored in the non-leaf node should help distinguish between different keys.
- If there are duplicates, instead of storing just the repeated key, it stores:
- The **key value of the first unique key** (i.e., the first key in the leaf node that differs from its previous sibling node).
- This helps maintain a clear path for searching and navigating.
- **Handling No Unique Keys**:
- If **no unique key exists** in the leaf node (e.g., all entries in a particular range are duplicates), a **null value** is stored in the non-leaf node.
- This null value indicates there’s no distinguishing key in this branch, and the traversal may rely on other branches or methods to continue the search.
#### Insertion
1. start from root
2. check if the resulting operation splits the root node
3. if it does (the node would have more than n keys in it)
1. split the root node into two nodes
2. promote the middle key to become the new root
3. adjust the tree's height by one level to accommodate the new root (move new root to top, connect other nodes to the new root)
4. if it doesn’t, navigate to the appropriate child node based on the key to be inserted
5. repeat the process recursively:
1. check if the target child node (where the key should be inserted) becomes full
2. if the child node is full:
1. split the child node into two nodes
2. promote the middle key of the child node to the parent node
3. redistribute the remaining keys and pointers between the two resulting nodes
6. insert the key into the appropriate node once a non-full node is found
7. ensure all properties of the B-tree (sorted order, maximum keys per node, and balanced structure) are maintained
**Duplicate Values**: Normally, B-trees store unique keys, and non-leaf nodes store keys to guide the search. However, when keys can repeat (e.g., two entries have the same key), you need a strategy to avoid confusion during indexing, as described above.
EXAMPLE:
Imagine a B-tree storing duplicates of the key `10`:
```
Non-leaf node:
| 10 (points to leaf nodes) |
Leaf nodes:
| 10, 10, 10 | 10, 10, 11 | ...
```
- In this case, the non-leaf node:
- Points to the first key `10` in the first leaf.
- Then points to the **first non-repeating key** (`11`) for the second leaf.
- If no non-repeating key existed in any sibling, a `null` would be stored instead.
#### Deletion
like insertion, but backwards
- if a node has too few keys, borrow a key from a neighboring sibling
- if borrowing would break the tree structure:
- restructure the child nodes so that they maintain the correct key order
- if this still leaves a node with fewer than the minimum number of keys
- merge the node with a neighbor or parent
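The borrow-from-neighbor step is a key rotation through the parent's separator key. A minimal list-based sketch (toy layout, names are made up):

```python
def borrow_from_left(parent_keys, sep_idx, left_keys, node_keys):
    """The separator moves down into the underflowing node, and the
    left sibling's largest key moves up to replace it, preserving order."""
    node_keys.insert(0, parent_keys[sep_idx])   # separator joins the poor node
    parent_keys[sep_idx] = left_keys.pop()      # sibling's max becomes separator

parent, left, node = [30], [10, 20], [40]
borrow_from_left(parent, 0, left, node)
# parent == [20], left == [10], node == [30, 40]
```

Note the invariant still holds afterwards: every key in `left` < separator < every key in `node`.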
EXAMPLE:
- Given:
- disk page has capacity of 4K bytes
- each tuple address takes 6 bytes and each key value takes 2 bytes
- each node is 70% full
- need to store 1 million tuples
- Leaf node capacity:
- each (key value, tuple address) pair takes 8 bytes
- disk page capacity is 4K, so (4*1024)/8 = 512 (key value, rowid) pairs per leaf page
- *in reality there are extra headers and pointers that we will ignore*
- Hence, the minimum number of pairs per node is 256 (half of the 512 maximum)
- If all pages are 70% full, each page holds about 512 * 0.7 ≈ 359 pairs
- To store 1 million tuples requires:
1,000,000 / 359 ≈ 2786 pages at the leaf level
2786 / 359 ≈ 8 pages at the next level up
1 root page pointing to those 8 pages
- Hence, we have a B-tree with 3 levels
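The page-count arithmetic can be checked with a few lines (a sketch; flooring 70% of 512 gives 358 rather than 359, but the level count comes out the same either way):

```python
import math

PAGE_BYTES = 4 * 1024
PAIR_BYTES = 6 + 2                       # tuple address + key value
capacity = PAGE_BYTES // PAIR_BYTES      # 512 pairs per full page
effective = math.floor(capacity * 0.7)   # ~358 pairs at 70% occupancy

# count levels: divide by the fan-out until a single root page remains
n_pages = 1_000_000
levels = 0
while n_pages > 1:
    n_pages = math.ceil(n_pages / effective)
    levels += 1
# levels == 3
```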
### R-trees
- used for range searches along two axes, e.g. `x1 <= A <= x2 AND y1 <= B <= y2`
- with a one-dimensional index (such as a B-tree on A), the second range cannot be used to narrow the search
- similar to a B-tree, except each key value in an internal node is a bounding rectangle with a pointer to the values and rectangles contained within it
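The descend-into-overlapping-rectangles search can be sketched with nested `(rectangle, child)` lists standing in for nodes (a toy layout, not a real page format):

```python
def intersects(r, s):
    """Rectangles are (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2."""
    return r[0] <= s[2] and s[0] <= r[2] and r[1] <= s[3] and s[1] <= r[3]

def range_search(node, query):
    """Collect record ids whose rectangles intersect the query window."""
    hits = []
    for rect, child in node:
        if intersects(rect, query):
            if isinstance(child, list):    # internal entry: descend
                hits += range_search(child, query)
            else:                          # leaf entry: child is a record id
                hits.append(child)
    return hits

root = [
    ((0, 0, 10, 10), [((1, 1, 2, 2), "a"), ((8, 8, 9, 9), "b")]),
    ((20, 20, 30, 30), [((21, 21, 22, 22), "c")]),
]
# range_search(root, (0, 0, 3, 3)) -> ["a"]
```

Only subtrees whose bounding rectangle overlaps the query window are visited, which is what makes the structure efficient for two-axis predicates.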
### Bitmaps and Inverted Indices
- text-valued attributes must be preprocessed before indexing
- this ensures that the text fields and the words within them are indexed as well
- an inverted listing file is built for each word
```
word-> (tupleid, location within tuple), ...
-- EXAMPLE
pizza -> t1,2 t1,5 t3,4 t5,12
```
- each inverted listing is then compressed and stored
- a boolean keyword query is processed with bitmap operations (bitwise AND, bitwise OR) over these vectors
- PostgreSQL's GIN (Generalized Inverted Index) structures are used for this purpose and for text querying
- other open-source implementations of inverted files exist, such as the Apache Lucene project
- Google's main index is a distributed, replicated inverted index over web documents
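A toy inverted index over three hypothetical documents, answering a boolean AND by intersecting posting lists (real systems would operate on compressed bitmaps instead of Python sets):

```python
# build the index: word -> set of (doc_id, position) postings
docs = {
    1: "pizza with extra cheese",
    2: "cheese sandwich",
    3: "pizza delivery",
}

index = {}
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        index.setdefault(word, set()).add((doc_id, pos))

def doc_ids(word):
    """Documents containing the word (positions dropped from the postings)."""
    return {d for d, _ in index.get(word, set())}

# boolean AND = intersection of posting lists
both = doc_ids("pizza") & doc_ids("cheese")   # {1}
```

OR queries are the union of the posting sets in the same way.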
### Primary and Secondary Indices
- an index structure can be secondary
- the leaf level holds index pages containing pointers to tuples stored in separate data pages
- primary B-tree indices are also possible
- internal nodes contain pointers to lower levels
- the leaf level contains the data pages of the table itself
- THERE CAN ONLY BE A SINGLE PRIMARY INDEX per table (the data pages can only be physically ordered one way)
- in PostgreSQL, the `CLUSTER` command physically reorders a table according to an index to get this effect
### Hashing
- often a primary index method
- given a hash function h producing K bucket values and an attribute A
- allocate a number of disk blocks M to each bucket
- for each tuple t, apply `h(t.A) = x`
- store t in the blocks allocated to bucket x
- to search on attribute A (`SELECT * FROM r WHERE r.A = c`):
- apply the hash function `h(c) = y`
- read the blocks of bucket y to find value c
- will search `M / 2` pages on average and all pages in the worst case
- to search on another attribute
- hashing is useless, search all disk pages
- insertion cost:
- 1 read (find the last page in the appropriate bucket)
- 1 write (store)
- deletion/update cost:
- M/2 (search cost)
- 1 (update cost)
- if a bucket has too many tuples, then the allocated M pages may not be sufficient
- allocate additional overflow area
- if the overflow area is large, the benefit of the hash is lost
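The bucket/overflow layout above can be sketched as follows (tiny page sizes for illustration; class and parameter names are made up):

```python
class StaticHash:
    """Static hashing: k buckets of m primary pages; extra pages are overflow."""

    def __init__(self, k=8, m=2, page_cap=2):
        self.k, self.m, self.cap = k, m, page_cap
        self.buckets = [[[]] for _ in range(k)]   # each bucket is a list of pages

    def insert(self, key):
        # 1 read to find the last page of the bucket, 1 write to store
        pages = self.buckets[hash(key) % self.k]
        if len(pages[-1]) == self.cap:
            pages.append([])   # pages beyond the first m act as the overflow area
        pages[-1].append(key)

    def search(self, key):
        # scan every page of the bucket: M/2 pages on average, all in the worst case
        return any(key in page for page in self.buckets[hash(key) % self.k])
```

A query on any attribute other than the hashed one gains nothing from this structure and must scan every bucket.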
### Extensible Hashing
- a dynamic hashing technique that adjusts its structure to handle dataset growth/shrinkage efficiently.
#### Key Concepts
- **Hash function and bit representation**
- uses a hash function `h` to compute binary hash values for keys.
- the first `z` bits of the hash value determine the directory index.
- **Directory structure**
- the directory is an array of pointers to buckets containing the data.
- directory size is `2^z`, corresponding to the `z` bits used from the hash value.
- **Bucket overflow and splitting**
- if a bucket overflows:
- check if the bucket's local depth < global depth.
- split the bucket and redistribute entries.
- if local depth = global depth:
- double the directory size (increasing global depth by 1).
- redistribute entries into new directory structure.
#### Advantages
- **Dynamic directory expansion**
- grows as needed, maintaining efficient data access without performance loss.
- **Efficient space utilization**
- splits only buckets that overflow; directory grows incrementally.
#### Considerations
- **Implementation complexity**
- requires careful handling of dynamic directories and bucket splits.
- **Memory usage**
- directory may consume significant memory for rapidly growing datasets.
#### Insertion
- find the correct bucket using `h(key)`.
- if the bucket overflows:
- split the bucket and redistribute data.
- potentially expand the directory if required.
#### Deletion/Update
- deletion cost:
- locate and remove the tuple (similar to a search).
- update cost:
- search for the tuple and update its value.
#### Performance Notes
- avoids performance degradation typical of static hashing with overflow areas.
- particularly useful for database systems with unpredictable dataset sizes.
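The directory-doubling and bucket-splitting machinery above fits in a short Python sketch (low-order hash bits index the directory; all names are illustrative, and pathological inputs where every key hashes identically are ignored):

```python
class Bucket:
    def __init__(self, depth):
        self.local_depth = depth
        self.keys = []

class ExtendibleHash:
    def __init__(self, bucket_size=2):
        self.bucket_size = bucket_size
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _dir_index(self, key):
        # use the low-order global_depth bits of the hash value
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key):
        while True:
            bucket = self.directory[self._dir_index(key)]
            if len(bucket.keys) < self.bucket_size:
                bucket.keys.append(key)
                return
            self._split(bucket)      # retry after splitting

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            self.directory += self.directory   # double the directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        # repoint directory entries whose new distinguishing bit is set
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = new_bucket
        # redistribute the old bucket's keys between the two buckets
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._dir_index(k)].keys.append(k)

    def search(self, key):
        return key in self.directory[self._dir_index(key)].keys
```

Note that only the overflowing bucket is split; when its local depth is below the global depth, the directory does not grow at all.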
### Linear Hashing
- a dynamic hashing technique that grows or shrinks incrementally, one bucket at a time.
#### Key Concepts
- **Hash functions and bucket allocation**
- utilizes a family of hash functions, `h_i`, where each function determines the bucket index.
- starts with an initial hash function, `h_0`, mapping keys to a fixed number of buckets.
- as the dataset grows, switches to higher-level hash functions (`h_1`, `h_2`, etc.).
- **Dynamic expansion**
- triggered when the load factor exceeds a threshold.
- splits a bucket determined by a split pointer `s`.
- split pointer increments linearly and resets when all buckets are split.
- when the pointer resets, the hash level `l` increments, doubling addressable bucket space.
- **Dynamic contraction**
- triggered when the load factor falls below a threshold.
- merges buckets starting from the split pointer `s` with their counterparts.
- decrements the level `l` when all buckets have been merged.
#### Operations
- **Insertion**
- compute bucket index using hash function for level `l`.
- if the index < split pointer `s`, rehash with the function for level `l + 1`.
- insert record into the identified bucket.
- if the load factor exceeds the threshold, split a bucket.
- **Search**
- compute initial bucket index using the hash function for level `l`.
- if the index < split pointer `s`, rehash with the function for level `l + 1`.
- search in the identified bucket.
- **Deletion**
- locate the bucket using the search procedure.
- remove the record.
- if the load factor drops below the threshold, merge buckets.
#### Advantages
- **Gradual resizing**
- grows or shrinks one bucket at a time, avoiding large-scale rehashing.
- **Efficient space usage**
- maintains an optimal load factor, balancing storage and performance.
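The split-pointer mechanics above can be sketched as follows (`h_l` is `hash mod (n0 * 2^l)`; contraction is omitted and all names are illustrative):

```python
class LinearHash:
    def __init__(self, n0=2, bucket_size=2, max_load=0.8):
        self.n0 = n0                 # initial number of buckets
        self.level = 0               # current hash level l
        self.split = 0               # split pointer s
        self.bucket_size = bucket_size
        self.max_load = max_load
        self.buckets = [[] for _ in range(n0)]
        self.count = 0

    def _addr(self, key):
        n = self.n0 << self.level          # buckets addressable by h_l
        idx = hash(key) % n                # h_l
        if idx < self.split:               # bucket already split this round:
            idx = hash(key) % (2 * n)      # rehash with h_{l+1}
        return idx

    def insert(self, key):
        self.buckets[self._addr(key)].append(key)
        self.count += 1
        load = self.count / (len(self.buckets) * self.bucket_size)
        if load > self.max_load:
            self._split_one()              # grow by exactly one bucket

    def _split_one(self):
        n = self.n0 << self.level
        self.buckets.append([])
        old, self.buckets[self.split] = self.buckets[self.split], []
        for k in old:                      # redistribute with h_{l+1}
            self.buckets[hash(k) % (2 * n)].append(k)
        self.split += 1
        if self.split == n:                # whole round split: next level
            self.split = 0
            self.level += 1

    def search(self, key):
        return key in self.buckets[self._addr(key)]
```

Each insertion splits at most one bucket, so resizing cost is spread evenly instead of arriving as one large rehash.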