DBMS Notes - Gurukul College Kota

30 Oct 2013

DBMS Notes


GIET2013EC4SA





Database:

A database is a collection of stored operational data used by the various applications and/or users of some particular enterprise, or by a set of outside authorized applications and authorized users.


Database Management System:

A Database Management System (DBMS) is a software system that manages the execution of user applications that access and modify database data, so that data security, data integrity, and data reliability are guaranteed for each application, and each application can be written as if it were the only application active in the database.


What Is Data?

Different viewpoints:

• A sequence of characters stored in computer memory or storage
• An interpreted sequence of characters stored in computer memory or storage
• An interpreted set of objects

A database supports concurrent access to the data.


File Systems:

• A file is an uninterpreted, unstructured collection of information.
• File operations: delete, catalog, create, rename, open, close, read, write, find, …
• Access methods: algorithms that implement the operations, together with the internal file organization.
• Examples: a file of customers or a file of students; the access method is the implementation of a set of operations on such a file.



File Management System Problems:

• Data redundancy
• Data access: a new request requires a new program
• Data is not isolated from the access implementation
• Concurrent program execution on the same file
• Difficulties with security enforcement
• Integrity issues



Database Applications:

• Airline reservation systems. Data items are: single passenger reservations; information about flights and airports; information about ticket prices and ticket restrictions.
• Banking systems. Data items are accounts, customers, loans, mortgages, balances, etc. Failures are not tolerable. Concurrent access must be provided.
• Corporate records. Data items are: sales, accounts, bill-of-materials records, employees and their dependents.

ADVANTAGES OF A DBMS:

Data independence: Application programs should be as independent as possible from the details of data representation and storage. The DBMS can provide an abstract view of the data to insulate application code from such details.

Efficient data access: A DBMS utilizes a variety of sophisticated techniques to store and retrieve data efficiently. This feature is especially important if the data is stored on external storage devices.


Data integrity and security: If data is always accessed through the DBMS, the DBMS can enforce integrity constraints on the data. For example, before inserting salary information for an employee, the DBMS can check that the department budget is not exceeded. Also, the DBMS can enforce access controls that govern what data is visible to different classes of users.


Data administration: When several users share the data, centralizing the administration of data can offer significant improvements. Experienced professionals who understand the nature of the data being managed, and how different groups of users use it, can be responsible for organizing the data representation to minimize redundancy and for fine-tuning the storage of the data to make retrieval efficient.

Concurrent access and crash recovery: A DBMS schedules concurrent accesses to the data in such a manner that users can think of the data as being accessed by only one user at a time. Further, the DBMS protects users from the effects of system failures.



Reduced application development time: Clearly, the DBMS supports many important functions that are common to many applications accessing data stored in the DBMS. This, in conjunction with the high-level interface to the data, facilitates quick development of applications. Such applications are also likely to be more robust than applications developed from scratch, because many important tasks are handled by the DBMS instead of being implemented by the application.


Data Levels and their Roles:

Physical: corresponds to the first view of data. How data is stored, how it is accessed, how data is modified, whether data is ordered, how data is allocated to computer memory and/or peripheral devices, and how data items are actually represented (ASCII, EBCDIC, …). The physical schema specifies additional storage details. Essentially, the physical schema summarizes how the relations described in the conceptual schema are actually stored on secondary storage devices such as disks and tapes. We must decide what file organizations to use to store the relations, and create auxiliary data structures called indexes to speed up data retrieval operations.



Conceptual: corresponds to the second view of data. What we want the data to express, what relationships between the data we must express, what "story" the data tells, and whether all data necessary for the "story" is present are discussed here. The conceptual schema (sometimes called the logical schema) describes the stored data in terms of the data model of the DBMS. In a relational DBMS, the conceptual schema describes all relations that are stored in the database. In our sample university database, these relations contain information about entities, such as students and faculty, and about relationships, such as students' enrollment in courses. All student entities can be described using records in a Students relation, as we saw earlier. In fact, each collection of entities and each collection of relationships can be described as a relation, leading to the following conceptual schema:

Students(sid: string, name: string, login: string, age: integer, gpa: real)
Faculty(fid: string, fname: string, sal: real)
Courses(cid: string, cname: string, credits: integer)
Rooms(rno: integer, address: string, capacity: integer)
Enrolled(sid: string, cid: string, grade: string)
Teaches(fid: string, cid: string)
Meets In(cid: string, rno: integer, time: string)

The choice of relations, and the choice of fields for each relation, is not always obvious, and the process of arriving at a good conceptual schema is called conceptual database design.
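The conceptual schema above can be sketched as SQL DDL. The following is a minimal illustration using Python's built-in sqlite3 module (the notes prescribe no language, so this choice is an assumption); Meets In is renamed MeetsIn because SQL identifiers cannot contain spaces, and the types are illustrative.

```python
import sqlite3

# Minimal sketch: the university conceptual schema as SQL tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, login TEXT,
                       age INTEGER, gpa REAL);
CREATE TABLE Faculty  (fid TEXT PRIMARY KEY, fname TEXT, sal REAL);
CREATE TABLE Courses  (cid TEXT PRIMARY KEY, cname TEXT, credits INTEGER);
CREATE TABLE Rooms    (rno INTEGER PRIMARY KEY, address TEXT, capacity INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT,
                       PRIMARY KEY (sid, cid));
CREATE TABLE Teaches  (fid TEXT, cid TEXT, PRIMARY KEY (fid, cid));
CREATE TABLE MeetsIn  (cid TEXT, rno INTEGER, time TEXT);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Relationship sets such as Enrolled and Teaches become relations whose key combines the keys of the participating entity sets.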




View: corresponds to the third view of data. What part of the data is seen by a specific application. External schemas, which usually are also in terms of the data model of the DBMS, allow data access to be customized (and authorized) at the level of individual users or groups of users. The external schema design is guided by end user requirements. For example, we might want to allow students to find out the names of faculty members teaching courses, as well as course enrollments. This can be done by defining the following view:

Courseinfo(cid: string, fname: string, enrollment: integer)
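The Courseinfo external schema can be sketched directly as an SQL view. A minimal illustration using Python's sqlite3; the sample rows and names are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Faculty  (fid TEXT PRIMARY KEY, fname TEXT, sal REAL);
CREATE TABLE Courses  (cid TEXT PRIMARY KEY, cname TEXT, credits INTEGER);
CREATE TABLE Teaches  (fid TEXT, cid TEXT);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

INSERT INTO Faculty  VALUES ('F1', 'Smith', 65000);
INSERT INTO Courses  VALUES ('C1', 'Databases', 4);
INSERT INTO Teaches  VALUES ('F1', 'C1');
INSERT INTO Enrolled VALUES ('S1', 'C1', 'A'), ('S2', 'C1', 'B');

-- The view computes enrollment from the stored relations on demand.
CREATE VIEW Courseinfo AS
  SELECT c.cid, f.fname, COUNT(e.sid) AS enrollment
  FROM Courses c
  JOIN Teaches t ON t.cid = c.cid
  JOIN Faculty f ON f.fid = t.fid
  LEFT JOIN Enrolled e ON e.cid = c.cid
  GROUP BY c.cid, f.fname;
""")
row = conn.execute("SELECT cid, fname, enrollment FROM Courseinfo").fetchone()
print(row)  # ('C1', 'Smith', 2)
```

Users querying the view see only course, instructor name, and enrollment count, not salaries or grades.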



STRUCTURE OF A DBMS:

When a user issues a query, the parsed query is presented to a query optimizer, which uses information about how the data is stored to produce an efficient execution plan for evaluating the query. An execution plan is a blueprint for evaluating a query, and is usually represented as a tree of relational operators.

The code that implements relational operators sits on top of the files and access methods layer. This layer includes a variety of software for supporting the concept of a file, which, in a DBMS, is a collection of pages or a collection of records. This layer typically supports a heap file, or file of unordered pages, as well as indexes. In addition to keeping track of the pages in a file, this layer organizes the information within a page.

The files and access methods layer code sits on top of the buffer manager, which brings pages in from disk to main memory as needed in response to read requests.


The lowest layer of the DBMS software deals with management of space on disk, where the data is stored. Higher layers allocate, deallocate, read, and write pages through (routines provided by) this layer, called the disk space manager.

The DBMS supports concurrency and crash recovery by carefully scheduling user requests and maintaining a log of all changes to the database. DBMS components associated with concurrency control and recovery include the transaction manager, which ensures that transactions request and release locks according to a suitable locking protocol and schedules the execution of transactions; the lock manager, which keeps track of requests for locks and grants locks on database objects when they become available; and the recovery manager, which is responsible for maintaining a log, and restoring the system to a consistent state after a crash. The disk space manager, buffer manager, and file and access method layers must interact with these components.
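The recovery idea, restoring a consistent state when a transaction does not complete, can be illustrated with a rollback. A small sketch using Python's sqlite3 (the table is invented for the demonstration).

```python
import sqlite3

# Changes made inside a transaction that is rolled back never
# become visible, which is what the recovery manager guarantees
# after a crash.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()

conn.execute("INSERT INTO t VALUES (2)")
conn.rollback()          # simulate a failure before commit

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(count)  # 1
```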

Data Models:

A collection of tools for describing:

• Data
• Data relationships
• Data semantics
• Data constraints

• Relational model
• Entity-Relationship data model (mainly for database design)
• Object-based data models (object-oriented and object-relational)
• Semistructured data model (XML)

Other older models:

• Network model
• Hierarchical model

Database Access from Application Programs:

To access the database, DML statements need to be executed from the host language. There are two ways to do this:

• By providing an application program interface (a set of procedures) that can be used to send DML and DDL statements to the database and retrieve the results. The Open Database Connectivity (ODBC) standard, defined by Microsoft for use with the C language, is a commonly used application program interface standard. The Java Database Connectivity (JDBC) standard provides corresponding features for the Java language.

• By extending the host language syntax to embed DML calls within the host language program. Usually, a special character prefaces DML calls, and a preprocessor, called the DML precompiler, converts the DML statements to normal procedure calls in the host language.
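A brief sketch of the first approach, using Python's DB-API (sqlite3) in the same spirit as ODBC/JDBC; the table and values are invented for illustration.

```python
import sqlite3

# The program sends SQL strings through a call-level interface and
# retrieves results as ordinary host-language values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acc_no TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (?, ?)", ("A-101", 500.0))
conn.execute("INSERT INTO account VALUES (?, ?)", ("A-102", 750.0))

# Parameter markers (?) keep data values separate from the DML text.
cur = conn.execute("SELECT balance FROM account WHERE acc_no = ?", ("A-101",))
balance = cur.fetchone()[0]
print(balance)  # 500.0
```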


Database Users and Administrators:

• Naive users are unsophisticated users who interact with the system by invoking one of the application programs that have been written previously.

• Application programmers are computer professionals who write application programs.

• Sophisticated users interact with the system without writing programs. Instead, they form their requests in a database query language. They submit each such query to a query processor, whose function is to break down DML statements into instructions that the storage manager understands. Analysts who submit queries to explore data in the database fall in this category.

• Specialized users are sophisticated users who write specialized database applications that do not fit into the traditional data-processing framework.






Database Administrator:

A person who has such central control over the system is called a database administrator (DBA). The DBA's functions include:

• Schema definition. The DBA creates the original database schema by executing a set of data definition statements in the DDL.

• Storage structure and access-method definition.

• Schema and physical-organization modification. The DBA carries out changes to the schema and physical organization to reflect the changing needs of the organization, or to alter the physical organization to improve performance.

• Granting of authorization for data access.

• Routine maintenance.







Data Model:


A data model is a collection of conceptual tools for describing data, data relationships, data semantics,
and consistency constraints.

Entity:

An entity is a "thing" or "object" in the real world that is distinguishable from all other objects. For example, each person in an enterprise is an entity.

Entity set:

An entity set is a set of entities of the same type that share the same properties, or attributes. The set of all persons who are customers at a given bank, for example, can be defined as the entity set customer. Similarly, the entity set loan might represent the set of all loans awarded by a particular bank.


An entity is represented by a set of attributes. Attributes are descriptive properties possessed by each member of an entity set. The designation of an attribute for an entity set expresses that the database stores similar information concerning each entity in the entity set; however, each entity may have its own value for each attribute.


Simple and composite attributes:

Attributes that are not divided into subparts are called "simple attributes". Attributes that, on the other hand, can be divided into subparts are called "composite attributes". For example, an attribute name could be structured as a composite attribute consisting of first-name, middle-initial, and last-name.


Single-valued and multivalued attributes:

For instance, the loan-number attribute for a specific loan entity refers to only one loan number. Such attributes are said to be single-valued. There may be instances where an attribute has a set of values for a specific entity. Consider an employee entity set with the attribute phone-number. An employee may have zero, one, or several phone numbers, and different employees may have different numbers of phones. This type of attribute is said to be multivalued.


Derived attribute:

The value for this type of attribute can be derived from the values of other related attributes or entities. For instance, let us say that the customer entity set has an attribute loans-held, which represents how many loans a customer has from the bank. We can derive the value for this attribute by counting the number of loan entities associated with that customer.
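A derived attribute such as loans-held is computed on demand rather than stored. A small sketch, with invented sample data, using Python's sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE loan (loan_number TEXT PRIMARY KEY, customer_id TEXT);
INSERT INTO customer VALUES ('C1', 'Rao');
INSERT INTO loan VALUES ('L1', 'C1'), ('L2', 'C1');
""")
# loans_held is derived: counted from the loan entities, never stored.
loans_held = conn.execute(
    "SELECT COUNT(*) FROM loan WHERE customer_id = 'C1'").fetchone()[0]
print(loans_held)  # 2
```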

Relationship Sets:

A relationship is an association among several entities. A relationship set is a set of relationships of the same type.

Mapping Cardinalities:

Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set. Mapping cardinalities are most useful in describing binary relationship sets, although they can contribute to the description of relationship sets that involve more than two entity sets.



• One to one. An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A.

• One to many. An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can be associated with at most one entity in A.

• Many to one. An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with any number (zero or more) of entities in A.

• Many to many. An entity in A is associated with any number (zero or more) of entities in B, and an entity in B is associated with any number (zero or more) of entities in A.
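In a relational schema, a many-to-one mapping is typically enforced by placing a foreign key on the "many" side. A sketch using Python's sqlite3; the instructor/department names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE department (dept_name TEXT PRIMARY KEY);
-- Many instructors map to one department: the foreign key sits on the
-- "many" side, so each instructor row names at most one department.
CREATE TABLE instructor (
    inst_id   TEXT PRIMARY KEY,
    dept_name TEXT NOT NULL REFERENCES department(dept_name)
);
INSERT INTO department VALUES ('Physics');
INSERT INTO instructor VALUES ('I1', 'Physics'), ('I2', 'Physics');
""")
# An instructor naming a nonexistent department breaks the mapping
# and is rejected by the constraint.
try:
    conn.execute("INSERT INTO instructor VALUES ('I3', 'History')")
    ok = False
except sqlite3.IntegrityError:
    ok = True
print(ok)  # True
```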

Keys:

A key allows us to identify a set of attributes that suffice to distinguish entities from each other. Keys also help uniquely identify relationships, and thus distinguish relationships from each other.

Super key:

A super key is a set of one or more attributes that, taken collectively, allow us to identify uniquely an entity in the entity set. For example, the customer-id attribute of the entity set customer is sufficient to distinguish one customer entity from another. Thus, customer-id is a super key. Similarly, the combination of customer-name and customer-id is a super key for the entity set customer. The customer-name attribute of customer is not a super key, because several people might have the same name.

Candidate key:

Minimal super keys are called candidate keys. If K is a super key, then so is any superset of K. We are often interested in super keys for which no proper subset is a super key. It is possible that several distinct sets of attributes could serve as a candidate key. Suppose that a combination of customer-name and customer-street is sufficient to distinguish among members of the customer entity set. Then, both {customer-id} and {customer-name, customer-street} are candidate keys. Although the attributes customer-id and customer-name together can distinguish customer entities, their combination does not form a candidate key, since the attribute customer-id alone is a candidate key.

Primary key:

The key that denotes the unique identity of an entity is called the primary key. We use the term primary key to denote a candidate key that is chosen by the database designer as the principal means of identifying entities within an entity set. A key (primary, candidate, and super) is a property of the entity set, rather than of the individual entities. Any two individual entities in the set are prohibited from having the same value on the key attributes at the same time. The designation of a key represents a constraint in the real-world enterprise being modeled.
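The distinction shows up directly in a table definition: customer-id is declared as the primary key, while customer-name carries no uniqueness constraint. A sketch with invented rows, using Python's sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customer (
    customer_id     TEXT PRIMARY KEY,   -- the chosen candidate key
    customer_name   TEXT,               -- not a key: duplicates allowed
    customer_street TEXT
)""")
conn.execute("INSERT INTO customer VALUES ('C1', 'Jones', 'Main St')")
# A second 'Jones' is fine: name is not a super key.
conn.execute("INSERT INTO customer VALUES ('C2', 'Jones', 'North St')")

# Reusing a customer_id violates the key constraint.
try:
    conn.execute("INSERT INTO customer VALUES ('C1', 'Smith', 'Elm St')")
    duplicate_key_rejected = False
except sqlite3.IntegrityError:
    duplicate_key_rejected = True
print(duplicate_key_rejected)  # True
```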

Weak Entity Sets:

An entity set may not have sufficient attributes to form a primary key. Such an entity set is termed a weak entity set. An entity set that has a primary key is termed a strong entity set.

For a weak entity set to be meaningful, it must be associated with another entity set, called the identifying or owner entity set. Every weak entity must be associated with an identifying entity; that is, the weak entity set is said to be existence dependent on the identifying entity set. The identifying entity set is said to own the weak entity set that it identifies. The relationship associating the weak entity set with the identifying entity set is called the identifying relationship. The identifying relationship is many to one from the weak entity set to the identifying entity set, and the participation of the weak entity set in the relationship is total.


In our example, the identifying entity set for payment is loan, and a relationship loan-payment that associates payment entities with their corresponding loan entities is the identifying relationship.

Although a weak entity set does not have a primary key, we nevertheless need a means of distinguishing among all those entities in the weak entity set that depend on one particular strong entity. The discriminator of a weak entity set is a set of attributes that allows this distinction to be made. In E-R diagrams, a doubly outlined box indicates a weak entity set, and a doubly outlined diamond indicates the corresponding identifying relationship. In the figure, the weak entity set payment depends on the strong entity set loan via the relationship set loan-payment.



The figure also illustrates the use of double lines to indicate total participation: the participation of the (weak) entity set payment in the relationship loan-payment is total, meaning that every payment must be related via loan-payment to some loan. Finally, the arrow from loan-payment to loan indicates that each payment is for a single loan. The discriminator of a weak entity set also is underlined, but with a dashed, rather than a solid, line.
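A common relational mapping of this example gives payment a composite primary key: the owner's key plus the discriminator. A sketch using Python's sqlite3; the column names and amounts are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE loan (loan_number TEXT PRIMARY KEY, amount REAL);
-- payment is weak: its key borrows loan_number from the owning loan,
-- and payment_number alone is only the discriminator.
CREATE TABLE payment (
    loan_number    TEXT NOT NULL REFERENCES loan(loan_number),
    payment_number INTEGER,
    payment_amount REAL,
    PRIMARY KEY (loan_number, payment_number)
);
INSERT INTO loan VALUES ('L-17', 1000.0);
INSERT INTO payment VALUES ('L-17', 1, 125.0), ('L-17', 2, 125.0);
""")
n = conn.execute(
    "SELECT COUNT(*) FROM payment WHERE loan_number = 'L-17'").fetchone()[0]
print(n)  # 2
```

The NOT NULL foreign key captures existence dependence: a payment cannot exist without its owning loan.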


Specialization:

An entity set may include sub-groupings of entities that are distinct in some way from other entities in the set. For instance, a subset of entities within an entity set may have attributes that are not shared by all the entities in the entity set. The E-R model provides a means for representing these distinctive entity groupings. Consider an entity set person, with attributes name, street, and city. A person may be further classified as one of the following:

• customer
• employee

Each of these person types is described by a set of attributes that includes all the attributes of entity set person plus possibly additional attributes. For example, customer entities may be described further by the attribute customer-id, whereas employee entities may be described further by the attributes employee-id and salary. The process of designating sub-groupings within an entity set is called specialization. The specialization of person allows us to distinguish among persons according to whether they are employees or customers.

Generalization:

The design process may also proceed in a bottom-up manner, in which multiple entity sets are synthesized into a higher-level entity set on the basis of common features. The database designer may have first identified a customer entity set with the attributes name, street, city, and customer-id, and an employee entity set with the attributes name, street, city, employee-id, and salary. There are similarities between the customer entity set and the employee entity set in the sense that they have several attributes in common. This commonality can be expressed by generalization, which is a containment relationship that exists between a higher-level entity set and one or more lower-level entity sets. In our example, person is the higher-level entity set and customer and employee are lower-level entity sets.

Higher- and lower-level entity sets also may be designated by the terms superclass and subclass, respectively. The person entity set is the superclass of the customer and employee subclasses. For all practical purposes, generalization is a simple inversion of specialization. We will apply both processes, in combination, in the course of designing the E-R schema for an enterprise. In terms of the E-R diagram itself, we do not distinguish between specialization and generalization. New levels of entity representation will be distinguished (specialization) or synthesized (generalization) as the design schema comes to express fully the database application and the user requirements of the database. Differences in the two approaches may be characterized by their starting point and overall goal. Generalization proceeds from the recognition that a number of entity sets share some common features (namely, they are described by the same attributes and participate in the same relationship sets).
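One common way to map such a hierarchy to relations is a table for the higher-level entity set plus one table per lower-level set holding only the additional attributes. A sketch using Python's sqlite3; the key column pid and the sample row are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- person holds the shared attributes of the superclass;
-- customer and employee hold only their additional attributes
-- and link back to person via its key.
CREATE TABLE person   (pid TEXT PRIMARY KEY, name TEXT, street TEXT, city TEXT);
CREATE TABLE customer (pid TEXT PRIMARY KEY REFERENCES person(pid),
                       customer_id TEXT);
CREATE TABLE employee (pid TEXT PRIMARY KEY REFERENCES person(pid),
                       employee_id TEXT, salary REAL);
INSERT INTO person   VALUES ('P1', 'Asha', 'MG Road', 'Kota');
INSERT INTO employee VALUES ('P1', 'E-77', 42000.0);
""")
# Joining subclass to superclass reassembles the full employee entity.
row = conn.execute("""
    SELECT p.name, e.salary FROM person p JOIN employee e ON e.pid = p.pid
""").fetchone()
print(row)  # ('Asha', 42000.0)
```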


Aggregation:

Aggregation is an abstraction in which relationship sets (along with their associated entity sets) are treated as higher-level entity sets, and can participate in relationships.

Symbols used in the E-R notation:

(Figure: table of E-R diagram symbols.)

ER Model for a College DB:

Assumptions:

• A college contains many departments
• Each department can offer any number of courses
• Many instructors can work in a department
• An instructor can work only in one department
• For each department there is a Head
• An instructor can be head of only one department
• Each instructor can take any number of courses
• A course can be taken by only one instructor
• A student can enroll for any number of courses
• Each course can have any number of students

Steps in ER Modeling:

• Identify the entities
• Find relationships
• Identify the key attributes for every entity
• Identify other relevant attributes
• Draw the complete E-R diagram with all attributes, including the primary key

Step 1: Identify the Entities:

• DEPARTMENT
• STUDENT
• COURSE
• INSTRUCTOR

Step 2: Find the relationships:

• One course is enrolled by multiple students and one student enrolls for multiple courses; hence the cardinality between course and student is many to many.
• The department offers many courses and each course belongs to only one department; hence the cardinality between department and course is one to many.
• One department has multiple instructors and one instructor belongs to one and only one department; hence the cardinality between department and instructor is one to many.
• Each department has a "Head of department" and one instructor is "Head of department"; hence the cardinality is one to one.
• One course is taught by only one instructor, but an instructor teaches many courses; hence the cardinality between course and instructor is many to one.

Step 3: Identify the key attributes

• Dept-name is the key attribute for the entity "Department", as it identifies the department uniquely.
• Course# (Course Id) is the key attribute for the "Course" entity.
• Student# (Student Number) is the key attribute for the "Student" entity.
• Instructor Name is the key attribute for the "Instructor" entity.

Step 4: Identify other relevant attributes

• For the "Department" entity, the relevant attribute is location.
• For the "Course" entity, the relevant attributes are duration and prerequisite.


ER Model for Banking Business:

Assumptions:

• There are multiple banks and each bank has many branches. Each branch has multiple customers.
• Customers have various types of accounts.
• Some customers have also taken different types of loans from these bank branches.
• One customer can have multiple accounts and loans.

Step 1: Identify the Entities

• BANK
• BRANCH
• LOAN
• ACCOUNT
• CUSTOMER

Step 2: Find the relationships

• One bank has many branches and each branch belongs to only one bank; hence the cardinality between Bank and Branch is one to many.
• One branch offers many loans and each loan is associated with one branch; hence the cardinality between Branch and Loan is one to many.
• One branch maintains multiple accounts and each account is associated with one and only one branch; hence the cardinality between Branch and Account is one to many.
• One loan can be availed by multiple customers, and each customer can avail multiple loans; hence the cardinality between Loan and Customer is many to many.
• One customer can hold multiple accounts, and each account can be held by multiple customers; hence the cardinality between Customer and Account is many to many.

Step 3: Identify the key attributes

• Bank Code is the key attribute for the entity "Bank", as it identifies the bank uniquely.
• Branch# (Branch Number) is the key attribute for the "Branch" entity.
• Customer# (Customer Number) is the key attribute for the "Customer" entity.
• Loan# (Loan Number) is the key attribute for the "Loan" entity.
• Account No (Account Number) is the key attribute for the "Account" entity.

Step 4: Identify other relevant attributes

• For the "Bank" entity, the relevant attributes other than "Bank Code" would be "Name" and "Address".
• For the "Branch" entity, the relevant attributes other than "Branch#" would be "Name" and "Address".
• For the "Loan" entity, the relevant attribute other than "Loan#" would be "Loan Type".
• For the "Account" entity, the relevant attribute other than "Account No" would be "Account Type".
• For the "Customer" entity, the relevant attributes other than "Customer#" would be "Name", "Telephone#" and "Address".

E-R diagram with all attributes including Primary Key:

(Figure: complete E-R diagram for the banking example.)

Normalization

While designing a database out of an entity-relationship model, the main problem existing in that "raw" database is redundancy. Redundancy is storing the same data item in more than one place. Redundancy creates several problems, like the following:

1. Extra storage space: storing the same data in many places takes a large amount of disk space.
2. Entering the same data more than once during data insertion.
3. Deleting data from more than one place during deletion.
4. Modifying data in more than one place.
5. Anomalies may occur in the database if insertion, deletion, modification, etc. are not done properly. This creates inconsistency and unreliability in the database.

To solve this problem, the "raw" database needs to be normalized. This is a step-by-step process of removing different kinds of redundancy and anomaly at each step. At each step a specific rule is followed to remove a specific kind of impurity, in order to give the database a slim and clean look.



Un-Normalized Form (UNF)

If a table contains non-atomic values in its rows, it is said to be in UNF. An atomic value is something that cannot be further decomposed. A non-atomic value, as the name suggests, can be further decomposed and simplified. Consider the following table:

| Emp-Id | Emp-Name | Month | Sales | Bank-Id | Bank-Name |
|--------|----------|-------|-------|---------|-----------|
| E01    | AA       | Jan   | 1000  | B01     | SBI       |
|        |          | Feb   | 1200  |         |           |
|        |          | Mar   | 850   |         |           |
| E02    | BB       | Jan   | 2200  | B02     | UTI       |
|        |          | Feb   | 2500  |         |           |
| E03    | CC       | Jan   | 1700  | B01     | SBI       |
|        |          | Feb   | 1800  |         |           |
|        |          | Mar   | 1850  |         |           |
|        |          | Apr   | 1725  |         |           |

In the sample table above, there are multiple occurrences of rows under each key Emp-Id. Although considered to be the primary key, Emp-Id cannot give us the unique identification facility for any single row. Further, each primary key points to a variable-length record (3 rows for E01, 2 for E02 and 4 for E03).



First Normal Form (1NF)

A relation is said to be in 1NF if it contains no non-atomic values and each row provides a unique combination of values. The above table in UNF can be processed to create the following table in 1NF:

| Emp-Id | Emp-Name | Month | Sales | Bank-Id | Bank-Name |
|--------|----------|-------|-------|---------|-----------|
| E01    | AA       | Jan   | 1000  | B01     | SBI       |
| E01    | AA       | Feb   | 1200  | B01     | SBI       |
| E01    | AA       | Mar   | 850   | B01     | SBI       |
| E02    | BB       | Jan   | 2200  | B02     | UTI       |
| E02    | BB       | Feb   | 2500  | B02     | UTI       |
| E03    | CC       | Jan   | 1700  | B01     | SBI       |
| E03    | CC       | Feb   | 1800  | B01     | SBI       |
| E03    | CC       | Mar   | 1850  | B01     | SBI       |
| E03    | CC       | Apr   | 1725  | B01     | SBI       |

As you can see now, each row contains a unique combination of values. Unlike in UNF, this relation contains only atomic values, i.e. the rows cannot be further decomposed, so the relation is now in 1NF.



Second Normal Form (2NF)

A relation is said to be in 2NF if it is already in 1NF and each and every attribute fully depends on the primary key of the relation. Speaking inversely, if a table has some attribute which is not dependent on the primary key of that table, then it is not in 2NF.

Let us explain. Emp-Id is the primary key of the above relation. Emp-Name, Month, Sales and Bank-Name all depend upon Emp-Id. But the attribute Bank-Name depends on Bank-Id, which is not the primary key of the table. So the table is in 1NF, but not in 2NF. If this portion can be moved into another related relation, it would come to 2NF.

| Emp-Id | Emp-Name | Month | Sales | Bank-Id |
|--------|----------|-------|-------|---------|
| E01    | AA       | Jan   | 1000  | B01     |
| E01    | AA       | Feb   | 1200  | B01     |
| E01    | AA       | Mar   | 850   | B01     |
| E02    | BB       | Jan   | 2200  | B02     |
| E02    | BB       | Feb   | 2500  | B02     |
| E03    | CC       | Jan   | 1700  | B01     |
| E03    | CC       | Feb   | 1800  | B01     |
| E03    | CC       | Mar   | 1850  | B01     |
| E03    | CC       | Apr   | 1725  | B01     |

| Bank-Id | Bank-Name |
|---------|-----------|
| B01     | SBI       |
| B02     | UTI       |

After moving the portion into another relation we store a smaller amount of data in two relations without any loss of information. There is also a significant reduction in redundancy.
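The decomposition is lossless: joining the two relations on Bank-Id recovers the original rows. A small sketch using Python's sqlite3 with a subset of the rows above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (emp_id TEXT, emp_name TEXT, month TEXT,
                    sales INTEGER, bank_id TEXT);
CREATE TABLE bank  (bank_id TEXT PRIMARY KEY, bank_name TEXT);
INSERT INTO sales VALUES ('E01','AA','Jan',1000,'B01'),
                         ('E02','BB','Jan',2200,'B02');
INSERT INTO bank  VALUES ('B01','SBI'), ('B02','UTI');
""")
# Joining the two 2NF relations on bank_id reattaches Bank-Name to
# each sales row, so the split loses no information.
rows = conn.execute("""
    SELECT s.emp_id, s.bank_id, b.bank_name
    FROM sales s JOIN bank b ON b.bank_id = s.bank_id
    ORDER BY s.emp_id
""").fetchall()
print(rows)  # [('E01', 'B01', 'SBI'), ('E02', 'B02', 'UTI')]
```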



Third Normal Form (3NF)

A relation is said to be in 3NF, if it
is already in 2NF and there exists no

transitive dependency

in that
relation. Speaking inversely, if a table contains transitive dependency, then it is not in 3NF, and the
table must be split to bring it into 3NF.

What is a transitive dependency? Within a
relation if we see

A → B [B depends on A]


And


B → C [C depends on B]


Then we may derive


A → C[C depends on A]

17

DBMS Notes


GIET2013EC4SA





Such derived dependencies hold well in most situations. For example, if we have

Roll → Marks

and

Marks → Grade

then we may safely derive

Roll → Grade.

This third dependency was not originally specified, but we have derived it.

Such a derived dependency is called a transitive dependency, and it becomes a problem when the intermediate attribute, not the key, is the real determinant. For example, suppose we have been given

Roll → City

and

City → STD Code

We can derive Roll → STD Code, but this dependency is purely transitive: obviously the STD code of a city cannot genuinely depend on the roll number issued by a school or college; it depends only on the city. In such a case the relation should be broken into two, each containing one of these two dependencies:

Roll → City

and

City → STD Code
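The split can be sketched in plain Python with two mappings over hypothetical sample data. After the 3NF decomposition, each fact is stored exactly once, and the transitive dependency Roll → STD Code is recovered by composing the two lookups rather than being stored redundantly.

```python
# Hypothetical sample data for the Roll -> City -> STD Code chain.
roll_city = {101: "Kota", 102: "Jaipur", 103: "Kota"}   # Roll -> City
city_std = {"Kota": "0744", "Jaipur": "0141"}           # City -> STD Code

def std_for_roll(roll):
    """Derive the transitive dependency Roll -> STD Code via City."""
    return city_std[roll_city[roll]]

print(std_for_roll(101))  # Kota's STD code, stored only once in city_std
```

If Kota's STD code ever changes, only one entry in city_std is updated, no matter how many students live there.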



Boyce-Codd Normal Form (BCNF)

A relation is said to be in BCNF if it is already in 3NF and the left-hand side of every dependency is a candidate key. A relation which is in 3NF is almost always in BCNF. A 3NF relation may fail to be in BCNF when the following conditions are found true:

1. The candidate keys are composite.

2. There is more than one candidate key in the relation.

3. The candidate keys have some common attributes.


Professor Code   Department    Head of Dept.   Percent Time
P1               Physics       Ghosh           50
P1               Mathematics   Krishnan        50
P2               Chemistry     Rao             25
P2               Physics       Ghosh           75
P3               Mathematics   Krishnan        100

Consider, as an example, the above relation. It is assumed that:

1. A professor can work in more than one department.

2. The percentage of the time he spends in each department is given.

3. Each department has only one Head of Department.






The relation diagram for the above relation is given as the following:

[Dependency diagram not reproduced.]

The given relation is in 3NF. Observe, however, that the names of Dept. and Head of Dept. are duplicated. Further, if Professor P2 resigns, rows 3 and 4 are deleted. We lose the information that Rao is the Head of Department of Chemistry.

The normalization of the relation is done by creating a new relation for Dept. and Head of Dept. and deleting Head of Dept. from the given relation. The normalized relations are shown in the following.

Professor Code   Department    Percent Time
P1               Physics       50
P1               Mathematics   50
P2               Chemistry     25
P2               Physics       75
P3               Mathematics   100

Department    Head of Dept.
Physics       Ghosh
Mathematics   Krishnan
Chemistry     Rao

See the dependency diagrams for these new relations.
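The benefit of the decomposition can be sketched with sqlite3 (table and column names below are hypothetical stand-ins for the example's attributes). After the split, deleting all of professor P2's rows no longer destroys the fact that Rao heads Chemistry.

```python
import sqlite3

# Sketch (assumed names): the BCNF decomposition above, rebuilt in SQLite.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ProfDept (ProfCode TEXT, Department TEXT, PercentTime INTEGER)")
cur.execute("CREATE TABLE DeptHead (Department TEXT PRIMARY KEY, HeadOfDept TEXT)")
cur.executemany("INSERT INTO ProfDept VALUES (?,?,?)", [
    ("P1", "Physics", 50), ("P1", "Mathematics", 50),
    ("P2", "Chemistry", 25), ("P2", "Physics", 75), ("P3", "Mathematics", 100),
])
cur.executemany("INSERT INTO DeptHead VALUES (?,?)", [
    ("Physics", "Ghosh"), ("Mathematics", "Krishnan"), ("Chemistry", "Rao"),
])

cur.execute("DELETE FROM ProfDept WHERE ProfCode = 'P2'")  # P2 resigns
head = cur.execute(
    "SELECT HeadOfDept FROM DeptHead WHERE Department = 'Chemistry'"
).fetchone()[0]
print(head)  # 'Rao' survives the deletion
```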






Fourth Normal Form (4NF)

When attributes in a relation have multi-valued dependency, further normalization to 4NF and 5NF is required. Let us first find out what a multi-valued dependency is.

A multi-valued dependency is a typical kind of dependency in which each and every attribute within a relation depends upon the other, yet none of them is a unique primary key.

We will illustrate this with an example. Consider a vendor supplying many items to many projects in an organization. The following are the assumptions:

1. A vendor is capable of supplying many items.

2. A project uses many items.

3. A vendor supplies to many projects.

4. An item may be supplied by many vendors.

A multi-valued dependency exists here because all the attributes depend upon the other and yet none of them is a primary key having a unique value.

Vendor Code   Item Code   Project No.
V1            I1          P1
V1            I2          P1
V1            I1          P3
V1            I2          P3
V2            I2          P1
V2            I3          P1
V3            I1          P2
V3            I1          P3

The given relation has a number of problems. For example:

1. If vendor V1 has to supply to project P2, but the item is not yet decided, then a row with a blank for item code has to be introduced.

2. The information about item I1 is stored twice for vendor V3.

Observe that the relation given is in 3NF and also in BCNF. It still has the problems mentioned above. The problems are reduced by expressing this relation as two relations in the Fourth Normal Form (4NF). A relation is in 4NF if it has no more than one independent multi-valued dependency, or one independent multi-valued dependency with a functional dependency.

The table can be expressed as the two 4NF relations given as follows. The fact that vendors are capable of supplying certain items and that they are assigned to supply for some projects is independently specified in the 4NF relations.

Vendor-Supply

Vendor Code   Item Code
V1            I1
V1            I2
V2            I2
V2            I3
V3            I1

Vendor-Project

Vendor Code   Project No.
V1            P1
V1            P3
V2            P1
V3            P2
V3            P3
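The two 4NF relations are exactly the projections of the original ternary relation onto (Vendor, Item) and (Vendor, Project). This can be sketched with Python sets over the example's data (computing the projections directly also surfaces the (V3, P3) pair implied by the row V3/I1/P3).

```python
# The ternary Vendor/Item/Project relation from the example, as a set.
supply = {("V1", "I1", "P1"), ("V1", "I2", "P1"), ("V1", "I1", "P3"),
          ("V1", "I2", "P3"), ("V2", "I2", "P1"), ("V2", "I3", "P1"),
          ("V3", "I1", "P2"), ("V3", "I1", "P3")}

# The two 4NF relations are projections of the original.
vendor_item = {(v, i) for v, i, p in supply}      # Vendor-Supply
vendor_project = {(v, p) for v, i, p in supply}   # Vendor-Project

print(sorted(vendor_item))
print(sorted(vendor_project))
```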



Fifth Normal Form (5NF)

These relations still have a problem. While defining the 4NF we mentioned that all the attributes depend upon each other. While creating the two tables in the 4NF, although we have preserved the dependency between Vendor Code and Item Code in the first table and between Vendor Code and Project No. in the second table, we have lost the relationship between Item Code and Project No. If there were a primary key then this loss of dependency would not have occurred. In order to revive this relationship we must add a new table like the following. Please note that during the entire process of normalization, this is the only step where a new table is created by joining two attributes, rather than splitting them into separate tables.

Project No.   Item Code
P1            I1
P1            I2
P2            I1
P3            I1
P3            I3
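The 5NF idea can be sketched in Python: project the original ternary relation onto its three attribute pairs, then take the three-way natural join. A triple (v, i, p) survives the join only when every pairwise association is present, which reconstructs the original relation without spurious tuples.

```python
# Hypothetical sample data: the ternary relation from the 4NF example.
supply = {("V1", "I1", "P1"), ("V1", "I2", "P1"), ("V1", "I1", "P3"),
          ("V1", "I2", "P3"), ("V2", "I2", "P1"), ("V2", "I3", "P1"),
          ("V3", "I1", "P2"), ("V3", "I1", "P3")}

vi = {(v, i) for v, i, p in supply}   # Vendor-Item projection
vp = {(v, p) for v, i, p in supply}   # Vendor-Project projection
pi = {(p, i) for v, i, p in supply}   # Project-Item projection (the 5NF table)

# Three-way natural join: keep (v, i, p) only if all three pairs exist.
joined = {(v, i, p)
          for v, i in vi
          for v2, p in vp if v == v2
          if (p, i) in pi}

print(joined == supply)  # the join recovers the original relation exactly
```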



Let us finally summarize the normalization steps we have discussed so far.

Input Relation   Transformation                                                Output Relation
All relations    Eliminate variable-length records. Remove                     1NF
                 multi-attribute lines in table.
1NF relation     Remove dependency of non-key attributes on part of a          2NF
                 multi-attribute key.
2NF              Remove dependency of non-key attributes on other              3NF
                 non-key attributes.
3NF              Remove dependency of an attribute of a multi-attribute        BCNF
                 key on an attribute of another (overlapping)
                 multi-attribute key.
BCNF             Remove more than one independent multi-valued                 4NF
                 dependency from the relation by splitting the relation.
4NF              Add one relation relating attributes with multi-valued        5NF
                 dependency.


Primitive data type

A primitive data type, or basic type, is a data type provided by a programming language as a basic building block. Most languages allow more complicated composite types to be recursively constructed starting from basic types.

A built-in type is a data type for which the programming language provides built-in support.






In most programming languages, all basic data types are built-in. In addition, many languages also provide a set of composite data types. Opinions vary as to whether a built-in type that is not basic should be considered "primitive".

Depending on the language and its implementation, primitive data types may or may not have a one-to-one correspondence with objects in the computer's memory. However, one usually expects operations on basic primitive data types to be the fastest language constructs there are. Integer addition, for example, can be performed as a single machine instruction, and some processors offer specific instructions to process sequences of characters with a single instruction. In particular, the C standard mentions that "a 'plain' int object has the natural size suggested by the architecture of the execution environment". This means that int is likely to be 32 bits long on a 32-bit architecture. Basic primitive types are almost always value types.
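The fixed machine-level size of primitive values can be illustrated with Python's struct module, which packs values into C-style representations (the "&lt;" prefix requests the standard 4-byte int and 8-byte double regardless of platform).

```python
import struct

# Sketch: fixed-size machine representations of primitive values.
# "<i" is a standard-size C int (4 bytes), "<d" a C double (8 bytes).
packed_int = struct.pack("<i", 42)
packed_double = struct.pack("<d", 3.14)

print(len(packed_int), len(packed_double))          # 4 8
print(struct.unpack("<i", packed_int)[0])           # round-trips to 42
```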

Most languages do not allow the behavior or capabilities of primitive (either built-in or basic) data types to be modified by programs. Exceptions include Smalltalk, which permits all data types to be extended within a program, adding to the operations that can be performed on them or even redefining the built-in operations.


Composite types

Composite types are derived from more than one primitive type. This can be done in a number of ways. The ways they are combined are called data structures. Composing a primitive type into a compound type generally results in a new type; e.g. array-of-integer is a different type from integer.



• An array stores a number of elements of the same type in a specific order. They are accessed using an integer to specify which element is required (although the elements may be of almost any type). Arrays may be fixed-length or expandable.

• Record (also called tuple or struct). Records are among the simplest data structures. A record is a value that contains other values, typically in fixed number and sequence and typically indexed by names. The elements of records are usually called fields or members.

• Union. A union type definition will specify which of a number of permitted primitive types may be stored in its instances, e.g. "float or long integer". Contrast with a record, which could be defined to contain a float and an integer; whereas, in a union, there is only one value at a time.

• A tagged union (also called a variant, variant record, discriminated union, or disjoint union) contains an additional field indicating its current type, for enhanced type safety.

• A set is an abstract data structure that can store certain values, without any particular order, and no repeated values. Values themselves are not retrieved from sets; rather, one tests a value for membership to obtain a boolean "in" or "not in".

• An object contains a number of data fields, like a record, and also a number of program code fragments for accessing or modifying them. Data structures not containing code, like those above, are called plain old data structures.






Many others are possible, but they tend to be further variations and compounds of the above.
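Most of the composite types listed above have direct Python counterparts; the sketch below (with hypothetical names) lines them up for comparison.

```python
from dataclasses import dataclass
from typing import Union

scores = [90, 85, 77]              # array: same-type elements, integer-indexed

@dataclass
class Student:                     # record/struct: fixed, named fields
    roll: int
    name: str

Number = Union[int, float]         # union: one of several permitted types

enrolled = {101, 102, 103}         # set: unordered, no repeated values

s = Student(101, "AA")             # object: data fields plus methods
print(scores[0], s.name, 101 in enrolled)
```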

Logical and Physical Database Requirements

The requirements for a logical and physical database vary by size and design parameters. A logical database must be able to access and identify all files within the storage system to operate correctly, whereas a physical database manages a much smaller field of information. Sometimes, a physical database stores only a single file with one value or word in it.

Logical Database Definition

A logical database is the collected information stored on multiple physical disk files and hard drives within a computer. This database provides a structure to house all the accumulated information within the device and determines the relationships between different types of files and programs. A logical database determines these relationships through a series of highly structured tables designed to categorize information into groups for easier accessibility. Without this categorization, accessing different files within a computer would take additional time as the system searched each file for the appropriate match.

Logical Database Requirements

A logical database can stretch over multiple physical hard disks and information files. The data storage unit is still a single database for information retrieval purposes. To have a logical database, all given hard disks and information files must be accessible from a single source. An example would be a personal computer able to access its information files stored on multiple hard drives from a single user interface. According to Microsoft, when a logical database is successful, the user sees a coherent list of information from a central location that draws from the many file sources tied into the storage system.


Physical Database Definition

A physical database is both the actual device housing the information files and the search paths used to access information between each source. According to Microsoft, the term "database" refers only to the logical database controlling information files for the entire system. A physical database is technically a smaller unit of storage referred to as either a company, field, record or table, depending on how much information the physical storage device contains. A field is the smallest unit of storage, housing only a single file. A company is the largest -- next to a database -- housing separate, large groups of data.

Physical Storage Requirements

The requirements for a physical database vary by the parameters of the storage device in question. For example, a flash drive designed to hold up to 2 gigabytes of information needs a personal computer or another USB-connected device to allow access to the information stored on the equipment. A physical database also needs a power source to access information. A computer hard drive cannot function without electricity. A flash drive cannot operate without a device with an adequate power source.











Two types of data modeling are as follows:

• Logical modeling

• Physical modeling

If you are going to be working with databases, then it is important to understand the difference between logical and physical modeling, and how they relate to one another. Logical and physical modeling are described in more detail in the following subsections.

Logical Modeling

Logical modeling deals with gathering business requirements and converting those requirements into a model. The logical model revolves around the needs of the business, not the database, although the needs of the business are used to establish the needs of the database. Logical modeling involves gathering information about business processes, business entities (categories of data), and organizational units. After this information is gathered, diagrams and reports are produced, including entity relationship diagrams, business process diagrams, and eventually process flow diagrams. The diagrams produced should show the processes and data that exist, as well as the relationships between business processes and data. Logical modeling should accurately render a visual representation of the activities and data relevant to a particular business.

The diagrams and documentation generated during logical modeling are used to determine whether the requirements of the business have been completely gathered. Management, developers, and end users alike review these diagrams and documentation to determine if more work is required before physical modeling commences.

Typical deliverables of logical modeling include:



• Entity relationship diagrams

An Entity Relationship Diagram is also referred to as an analysis ERD. The point of the initial ERD is to provide the development team with a picture of the different categories of data for the business, as well as how these categories of data are related to one another.

• Business process diagrams

The process model illustrates all the parent and child processes that are performed by individuals within a company. The process model gives the development team an idea of how data moves within the organization. Because process models illustrate the activities of individuals in the company, the process model can be used to determine how a database application interface is designed.

• User feedback documentation

Physical Modeling

Physical modeling involves the actual design of a database according to the requirements that were established during logical modeling. Logical modeling mainly involves gathering the requirements of the business, with the latter part of logical modeling directed toward the goals and requirements of the database. Physical modeling deals with the conversion of the logical, or business, model into a relational database model. When physical modeling occurs, objects are being defined at the schema level. A schema is a group of related objects in a database. A database design effort is normally associated with one schema.

During physical modeling, objects such as tables and columns are created based on entities and attributes that were defined during logical modeling. Constraints are also defined, including primary keys, foreign keys, other unique keys, and check constraints. Views can be created from database tables to summarize data or to simply provide the user with another perspective of certain data. Other objects such as indexes and snapshots can also be defined during physical modeling. Physical modeling is when all the pieces come together to complete the process of defining a database for a business.

Physical modeling is database-software specific, meaning that the objects defined during physical modeling can vary depending on the relational database software being used. For example, most relational database systems have variations in the way data types are represented and the way data is stored, although basic data types are conceptually the same among different implementations. Additionally, some database systems have objects that are not available in other database systems.

Typical deliverables of physical modeling include the following:

• Server model diagrams

The server model diagram shows tables, columns, and relationships within a database.

• User feedback documentation

• Database design documentation

Conclusion

Understanding the difference between logical and physical modeling will help you build better organized and more effective database systems.


Data independence

Data independence is the type of data transparency that matters for a centralized DBMS. It refers to the immunity of user applications to changes in the definition and organization of data.

Physical data independence deals with hiding the details of the storage structure from user applications. The application should not be involved with these issues, since there is no difference in the operations carried out against the data.

Data independence and operation independence together give the feature of data abstraction. There are two levels of data independence.

First level






The logical structure of the data is known as the schema definition. In general, if a user application operates on a subset of the attributes of a relation, it should not be affected later when new attributes are added to the same relation. Logical data independence indicates that the conceptual schema can be changed without affecting the existing schemas.

Second level

The physical structure of the data is referred to as the "physical data description". Physical data independence deals with hiding the details of the storage structure from user applications. The application should not be involved with these issues since, conceptually, there is no difference in the operations carried out against the data. There are two types of data independence:

1. Logical data independence: The ability to change the logical (conceptual) schema without changing the external schema (user view) is called logical data independence. For example, the addition or removal of new entities, attributes, or relationships to the conceptual schema should be possible without having to change existing external schemas or having to rewrite existing application programs.

2. Physical data independence: The ability to change the physical schema without changing the logical schema is called physical data independence. For example, a change to the internal schema, such as using a different file organization or storage structure, storage device, or indexing strategy, should be possible without having to change the conceptual or external schemas.

3. View-level data independence: always independent; no effect, because there doesn't exist any other level above the view level.
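Physical data independence can be sketched in miniature: application code below talks only to a stable accessor, while the storage representation changes from a scanned list of pairs to an indexed dict without the caller being touched. All names here are hypothetical.

```python
class ListStore:
    """'File of records': lookup by sequential scan."""
    def __init__(self, pairs):
        self._pairs = list(pairs)
    def lookup(self, key):
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)

class DictStore:
    """'Indexed storage': same interface, different physical layout."""
    def __init__(self, pairs):
        self._index = dict(pairs)
    def lookup(self, key):
        return self._index[key]

def get_balance(store, account):
    # The application-level operation: unchanged whichever store is used.
    return store.lookup(account)

data = [("A1", 500), ("A2", 750)]
print(get_balance(ListStore(data), "A2"), get_balance(DictStore(data), "A2"))
```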

Data Independence Types

Data independence has two types. They are:

1. Physical independence

2. Logical independence

Data independence can be explained as follows: each higher level of the data architecture is immune to changes of the next lower level of the architecture.

Physical independence: The logical scheme stays unchanged even though the storage space or type of some data is changed for reasons of optimization or reorganization. Here the external schema does not change, while changes to the internal schema may be required because the physical schema was reorganized. Physical data independence is present in most database and file environments, in which hardware storage encoding, the exact location of data on disk, the merging of records, and so on are hidden from the user.

Logical independence: The external scheme may stay unchanged for most changes of the logical scheme. This is especially desirable as the application software does not need to be modified or newly translated.


Data abstraction

In computer science, abstraction is the process by which data and programs are defined with a representation similar in form to their meaning (semantics), while hiding away the implementation details. Abstraction tries to reduce and factor out details so that the programmer can focus on a few concepts at a time. A system can have several abstraction layers whereby different meanings and amounts of detail are exposed to the programmer. For example, low-level abstraction layers expose details of the computer hardware where the program is run, while high-level layers deal with the business logic of the program.

The following English definition of abstraction helps to understand how this term applies to computer science, IT, and objects:

abstraction - a concept or idea not associated with any specific instance

Abstraction captures only those details about an object that are relevant to the current perspective. The concept originated by analogy with abstraction in mathematics. The mathematical technique of abstraction begins with mathematical definitions, making it a more technical approach than the general concept of abstraction in philosophy. For example, in both computing and mathematics, numbers are concepts in programming languages, as founded in mathematics. Implementation details depend on the hardware and software, but this is not a restriction because the computing concept of number is still based on the mathematical concept.

In computer programming, abstraction can apply to control or to data: control abstraction is the abstraction of actions, while data abstraction is that of data structures.

• Control abstraction involves the use of subprograms and related concepts of control flows.

• Data abstraction allows handling data bits in meaningful ways. For example, it is the basic motivation behind the datatype.

One can regard the notion of an object (from object-oriented programming) as an attempt to combine abstractions of data and code.

The same abstract definition can be used as a common interface for a family of objects with different implementations and behaviors which nevertheless share the same meaning. The inheritance mechanism in object-oriented programming can be used to define an abstract class as the common interface.






The recommendation that programmers use abstractions whenever suitable in order to avoid duplication (usually of code) is known as the abstraction principle. The requirement that a programming language provide suitable abstractions is also called the abstraction principle.


Data abstraction enforces a clear separation between the abstract properties of a data type and the concrete details of its implementation. The abstract properties are those that are visible to client code that makes use of the data type (the interface to the data type), while the concrete implementation is kept entirely private, and indeed can change, for example to incorporate efficiency improvements over time. The idea is that such changes are not supposed to have any impact on client code, since they involve no difference in the abstract behaviour.

For example, one could define an abstract data type called lookup table, which uniquely associates keys with values, and in which values may be retrieved by specifying their corresponding keys. Such a lookup table may be implemented in various ways: as a hash table, a binary search tree, or even a simple linear list of (key:value) pairs. As far as client code is concerned, the abstract properties of the type are the same in each case.
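The lookup-table example can be sketched in Python: two implementations (hash-backed and linear-list-backed, both with hypothetical names) expose the same put/get interface, and client code cannot tell them apart.

```python
class HashLookup:
    """Lookup table backed by a hash table (Python dict)."""
    def __init__(self):
        self._table = {}
    def put(self, key, value):
        self._table[key] = value
    def get(self, key):
        return self._table[key]

class ListLookup:
    """Same abstract type, backed by a linear list of (key, value) pairs."""
    def __init__(self):
        self._pairs = []
    def put(self, key, value):
        # Keys are unique: drop any old pair for this key, then append.
        self._pairs = [(k, v) for k, v in self._pairs if k != key]
        self._pairs.append((key, value))
    def get(self, key):
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)

# Client code sees identical abstract behaviour from either implementation.
for impl in (HashLookup(), ListLookup()):
    impl.put("x", 1)
    impl.put("x", 2)          # re-putting a key overwrites the old value
    print(impl.get("x"))      # 2 from both
```

Swapping ListLookup for HashLookup (say, for efficiency) requires no change to the client loop.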

Of course, this all relies on getting the details of the interface right in the first place, since any changes there can have major impacts on client code. One way to look at this: the interface forms a contract on agreed behaviour between the data type and client code; anything not spelled out in the contract is subject to change without notice.

Languages that implement data abstraction include Ada and Modula-2. Object-oriented languages are commonly claimed to offer data abstraction; however, their inheritance concept tends to put information in the interface that more properly belongs in the implementation; thus, changes to such information end up impacting client code, leading directly to the fragile binary interface problem.

SQL

SQL is a standard language for accessing databases.

Our SQL tutorial will teach you how to use SQL to access and manipulate data in: MySQL, SQL Server, Access, Oracle, Sybase, DB2, and other database systems.

SQL is a standard language for accessing and manipulating databases.

What is SQL?

• SQL stands for Structured Query Language

• SQL lets you access and manipulate databases

• SQL is an ANSI (American National Standards Institute) standard

What Can SQL do?

• SQL can execute queries against a database

• SQL can retrieve data from a database

• SQL can insert records in a database

• SQL can update records in a database

• SQL can delete records from a database

• SQL can create new databases

• SQL can create new tables in a database

• SQL can create stored procedures in a database

• SQL can create views in a database

• SQL can set permissions on tables, procedures, and views


SQL is a Standard - BUT....

Although SQL is an ANSI (American National Standards Institute) standard, there are different versions of the SQL language.

However, to be compliant with the ANSI standard, they all support at least the major commands (such as SELECT, UPDATE, DELETE, INSERT, WHERE) in a similar manner.

Note: Most of the SQL database programs also have their own proprietary extensions in addition to the SQL standard!


Using SQL in Your Web Site

To build a web site that shows data from a database, you will need:

• An RDBMS database program (i.e. MS Access, SQL Server, MySQL)

• To use a server-side scripting language, like PHP or ASP

• To use SQL to get the data you want

• To use HTML / CSS

RDBMS

RDBMS stands for Relational Database Management System.

RDBMS is the basis for SQL, and for all modern database systems such as MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access.

The data in an RDBMS is stored in database objects called tables.

A table is a collection of related data entries and it consists of columns and rows.






Database Tables

A database most often contains one or more tables. Each table is identified by a name (e.g. "Customers" or "Orders"). Tables contain records (rows) with data.

In this tutorial we will use the well-known Northwind sample database (included in MS Access and MS SQL Server).

Below is a selection from the "Customers" table:

CustomerID   CustomerName                         ContactName          Address                         City          PostalCode   Country
1            Alfreds Futterkiste                  Maria Anders         Obere Str. 57                   Berlin        12209        Germany
2            Ana Trujillo Emparedados y helados   Ana Trujillo         Avda. de la Constitución 2222   México D.F.   05021        Mexico
3            Antonio Moreno Taquería              Antonio Moreno       Mataderos 2312                  México D.F.   05023        Mexico
4            Around the Horn                      Thomas Hardy         120 Hanover Sq.                 London        WA1 1DP      UK
5            Berglunds snabbköp                   Christina Berglund   Berguvsvägen 8                  Luleå         S-958 22     Sweden

The table above contains five records (one for each customer) and seven columns (CustomerID, CustomerName, ContactName, Address, City, PostalCode, and Country).


SQL Statements

Most of the actions you need to perform on a database are done with SQL statements.

The following SQL statement selects all the records in the "Customers" table:

Example

SELECT * FROM Customers;

In this tutorial we will teach you all about the different SQL statements.
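The SELECT above can be tried out with Python's built-in sqlite3 module against a tiny stand-in for the Customers table (only a few of the columns, as an illustrative sketch).

```python
import sqlite3

# Sketch: a small stand-in for the Customers table, queried with the
# same SELECT statement shown above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Customers (CustomerID INTEGER, CustomerName TEXT, City TEXT)")
cur.executemany("INSERT INTO Customers VALUES (?,?,?)", [
    (1, "Alfreds Futterkiste", "Berlin"),
    (2, "Ana Trujillo Emparedados y helados", "México D.F."),
])

rows = cur.execute("SELECT * FROM Customers;").fetchall()
print(rows)  # every record, every column
```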


Keep in Mind That...

• SQL is NOT case sensitive: SELECT is the same as select

Semicolon after SQL Statements?

Some database systems require a semicolon at the end of each SQL statement.

Semicolon is the standard way to separate each SQL statement in database systems that allow more than one SQL statement to be executed in the same call to the server.

In this tutorial, we will use semicolon at the end of each SQL statement.


Some of The Most Important SQL Commands

• SELECT - extracts data from a database

• UPDATE - updates data in a database

• DELETE - deletes data from a database

• INSERT INTO - inserts new data into a database

• CREATE DATABASE - creates a new database

• ALTER DATABASE - modifies a database

• CREATE TABLE - creates a new table

• ALTER TABLE - modifies a table

• DROP TABLE - deletes a table

• CREATE INDEX - creates an index (search key)

• DROP INDEX - deletes an index
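Several of these commands can be exercised in sequence with sqlite3 (the Orders table and its columns are hypothetical; SQLite creates the database itself, so CREATE DATABASE is not needed here).

```python
import sqlite3

# Sketch: CREATE TABLE, INSERT INTO, UPDATE, DELETE and CREATE INDEX
# run in order against an in-memory database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, Amount INTEGER)")
cur.execute("INSERT INTO Orders VALUES (1, 100)")
cur.execute("INSERT INTO Orders VALUES (2, 250)")
cur.execute("UPDATE Orders SET Amount = 300 WHERE OrderID = 2")
cur.execute("DELETE FROM Orders WHERE OrderID = 1")
cur.execute("CREATE INDEX idx_amount ON Orders (Amount)")

rows = cur.execute("SELECT * FROM Orders").fetchall()
print(rows)  # [(2, 300)]
```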

File organization

File organization is the methodology which is applied to structured computer files. Files contain computer records which can be documents or information which is stored in a certain way for later retrieval. File organization refers primarily to the logical arrangement of data (which can itself be organized in a system of records with correlation between the fields/columns) in a file system. It should not be confused with the physical storage of the file in some types of storage media. There are certain basic types of computer file, which can include files stored as blocks of data and streams of data, where the information streams out of the file while it is being read until the end of the file is encountered.

We will look at two components of file organization here:

1. The way the internal file structure is arranged, and

2. The external file as it is presented to the O/S or program that calls it. Here we will also examine the concept of file extensions.

We will examine various ways that files can be stored and organized. Files are presented to the application as a stream of bytes and then an EOF (end of file) condition.






A program that uses a file needs to know the structure of the file and needs to interpret its contents.

Internal File Structure - Methods and Design Paradigm

It is a high-level design decision to specify a system of file organization for a computer software program or a computer system designed for a particular purpose. Performance is high on the list of priorities for this design process, depending on how the file is being used. The design of the file organization usually depends mainly on the system environment. Relevant factors include whether the file is going to be used for transaction-oriented processes like OLTP or data warehousing; whether the file is shared among various processes, like those found in a typical distributed system, or is standalone; whether the file is on a network and used by a number of users; whether it may be accessed internally or remotely; and how often it is accessed.

However, all things considered, the most important considerations might be:

1. Rapid access to a record or a number of records which are related to each other.

2. The adding, modification, or deletion of records.

3. Efficiency of storage and retrieval of records.

4. Redundancy, being the method of ensuring data integrity.

A file should be organized in such a way that the records are always available for processing with no delay. This should be done in line with the activity and volatility of the information.

Types of File Organization

Organizing a file depends on what kind of file it happens to be: a file in the simplest form can be a text file (in other words, a file which is composed of ASCII (American Standard Code for Information Interchange) text). Files can also be created as binary or executable types (containing elements other than plain text). Also, files are keyed with attributes which help determine their use by the host operating system.

Techniques of File Organization

The three techniques of file organization are:

1. Heap (unordered)

2. Sorted

   1. Sequential (SAM)

   2. Line Sequential (LSAM)

   3. Indexed Sequential (ISAM)

3. Hashed or Direct

In addition to the three techniques, there are four methods of organizing files. They are sequential, line-sequential, indexed-sequential, inverted list and direct or hashed access organization.






Sequential Organization

A sequential file contains records organized in the order they were entered. The order of the records is fixed. The records are stored and sorted in physical, contiguous blocks; within each block the records are in sequence.

Records in these files can only be read or written sequentially.

Once stored in the file, a record cannot be made shorter or longer, or deleted. However, the record can be updated if its length does not change. (This is done by replacing the records, i.e. by creating a new file.) New records will always appear at the end of the file.

If the order of the records in a file is not important, sequential organization will suffice, no matter how many records you may have. Sequential output is also useful for report printing or sequential reads, which some programs prefer to do.
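The append-only, fixed-length behaviour described above can be sketched in Python. This is an illustrative model only, not any real file system's format; the record layout (a 10-byte name plus an integer balance) is an assumption chosen for the example.

```python
import struct

# Fixed-length record: 10-byte name field followed by a native int.
RECORD = struct.Struct("10si")

def append_record(path, name, balance):
    # New records always go at the end of a sequential file.
    with open(path, "ab") as f:
        f.write(RECORD.pack(name.encode(), balance))  # struct pads name to 10 bytes

def read_all(path):
    # Records can only be read back in the order they were written.
    out = []
    with open(path, "rb") as f:
        while chunk := f.read(RECORD.size):
            name, balance = RECORD.unpack(chunk)
            out.append((name.rstrip(b"\0").decode(), balance))
    return out
```

Updating a record in place is possible only if its length is unchanged; anything else means rewriting the file, exactly as the notes describe.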

Line-Sequential Organization

Line-sequential files are like sequential files, except that the records can contain only characters as data. Line-sequential files are maintained by the native byte-stream files of the operating system.

In the COBOL environment, line-sequential files that are created with WRITE statements with the ADVANCING phrase can be directed to a printer as well as to a disk.
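Outside the COBOL environment, the same idea, records held as lines of character data in the operating system's native byte-stream text format, can be sketched in Python:

```python
def write_records(path, records):
    # Line-sequential: each record is one line of character data,
    # delimited by the platform's native newline convention.
    with open(path, "w") as f:
        for rec in records:
            f.write(rec + "\n")

def read_records(path):
    # Reading back strips the delimiter, leaving only the character data.
    with open(path) as f:
        return [line.rstrip("\n") for line in f]
```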

Indexed-Sequential Organization

Key searches are improved by this system too. The single-level indexing structure is the simplest one: the index is a file whose records are (key, pointer) pairs, where the pointer gives the position in the data file of the record with the given key. A subset of the records, evenly spaced along the data file, is indexed, in order to mark intervals of data records.

This is how a key search is performed: the search key is compared with the index keys to find the highest index key that precedes the search key; a linear search is then performed from the record that the index key points to, until the search key is matched or until the record pointed to by the next index entry is reached. Despite the double file access (index + data) required by this sort of search, the reduction in access time is significant compared with sequential file searches.

Let's examine, for the sake of example, a simple linear search on a 1,000-record sequentially organized file. An average of 500 key comparisons is needed (assuming the search keys are uniformly distributed among the data keys). However, using an evenly spaced index with 100 entries, each index entry covers an interval of only 10 data records, so a search takes on average about 50 comparisons in the index file plus 5 in the data file: roughly a nine-to-one reduction in the operations count!
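The two-stage search can be sketched in Python, counting key comparisons to make the saving visible. The data layout and index spacing below are assumptions chosen to mirror the example above, not any particular ISAM implementation.

```python
def indexed_search(data, index, key):
    """Search a sorted list `data` of (key, value) records using a sparse
    index of (key, position) pairs. Returns the value found (or None) and
    the number of key comparisons performed."""
    comparisons = 0
    start = 0
    # Stage 1: linear scan of the index for the highest key <= search key.
    for k, pos in index:
        comparisons += 1
        if k > key:
            break
        start = pos
    # Stage 2: linear scan of the data file from the indexed position.
    for k, v in data[start:]:
        comparisons += 1
        if k == key:
            return v, comparisons
        if k > key:
            break
    return None, comparisons

# A 1,000-record file with a 100-entry index (one entry per 10 records).
data = [(i, f"rec{i}") for i in range(1000)]
index = [(data[p][0], p) for p in range(0, 1000, 10)]
```

For instance, `indexed_search(data, index, 123)` locates its record in a few dozen comparisons, against the roughly 500 a pure linear scan averages.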

Hierarchical extension of this scheme is possible, since an index is a sequential file in itself, capable of being indexed in turn by a second-level index, and so forth. Exploiting this hierarchical decomposition of the searches further and further decreases the access time and pays increasing dividends in reduced processing time. There is, however, a point where this advantage starts to be outweighed by the increased cost of storage, which in turn increases the index access time.

Hardware for indexed-sequential organization is usually disk-based, rather than tape. Records are physically ordered by primary key, and the index gives the physical location of each record. Records can be accessed sequentially or directly, via the index. The index is stored in a file and read into memory when the file is opened. Also, indexes must be maintained.

Like sequential organization, the data is stored in physically contiguous blocks. However, the difference is in the use of indexes. There are three areas in the disc storage:

• Primary Area: Contains file records stored by key or ID numbers.

• Overflow Area: Contains records that cannot be placed in the primary area.

• Index Area: Contains the keys of records and their locations on the disc.
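The three areas can be modelled with a toy Python class. The block size and in-memory structures are illustrative assumptions; a real ISAM implementation manages physical disc blocks directly.

```python
class ISAMFile:
    """Toy model of the three ISAM disc areas."""

    def __init__(self, keys, block_size=4):
        keys = sorted(keys)
        self.block_size = block_size
        # Primary area: records in key order, in fixed-size blocks.
        self.primary = [keys[i:i + block_size]
                        for i in range(0, len(keys), block_size)]
        # Overflow area: records that no longer fit their primary block.
        self.overflow = []
        # Index area: first key of each primary block (its "location").
        self.index = [blk[0] for blk in self.primary]

    def _block_for(self, key):
        # Highest index key that does not exceed the search key.
        b = 0
        for i, first in enumerate(self.index):
            if first <= key:
                b = i
        return b

    def insert(self, key):
        b = self._block_for(key)
        if len(self.primary[b]) < self.block_size:
            self.primary[b].append(key)
            self.primary[b].sort()
        else:
            self.overflow.append(key)  # block full: spill to overflow area

    def lookup(self, key):
        b = self._block_for(key)
        return key in self.primary[b] or key in self.overflow
```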

Inverted List

In file organization, this is a file that is indexed on many of the attributes of the data itself. The inverted list method has a single index for each key type. The records are not necessarily stored in a sequence: they are placed in the data storage area, but indexes are updated for the record keys and locations.

Here's an example: in a company file, an index could be maintained for all products, and another one might be maintained for product types. Thus, it is faster to search the indexes than every record. These types of files are also known as "inverted indexes."

Nevertheless, inverted list files use more media space, and storage devices fill quickly with this type of organization. The benefit is immediately apparent because searching is fast. However, updating is much slower.
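The company-file example, one index per key type mapping attribute values to record locations, might look like this in Python (the record fields are illustrative assumptions):

```python
from collections import defaultdict

def build_inverted_index(records, attribute):
    """One index per key type: map each value of `attribute` to the
    positions of the records holding that value."""
    index = defaultdict(list)
    for pos, rec in enumerate(records):
        index[rec[attribute]].append(pos)
    return index

products = [
    {"product": "bolt", "type": "fastener"},
    {"product": "nut",  "type": "fastener"},
    {"product": "saw",  "type": "tool"},
]

# Separate indexes can be built for products and for product types.
by_type = build_inverted_index(products, "type")
by_name = build_inverted_index(products, "product")
```

Querying `by_type` answers "all fasteners" without scanning every record, but note that every insert or update must also maintain each index, which is why updates are slower.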

Content-based queries in text retrieval systems use inverted indexes as their preferred mechanism. Data items in these systems are usually stored compressed, which would normally slow the retrieval process, but the compression algorithm is chosen to support this technique.

When querying a file, there are certain circumstances when the query is designed to be modal, which means that rules are set which require that different information be held in the index. Here's an example of this modality: when phrase querying is undertaken, the particular algorithm requires that offsets to word classifications are held in addition to document numbers.

Direct or Hashed Access

With direct or hashed access, a portion of disk space is reserved and a "hashing" algorithm computes the record address, so there is additional space required for this kind of file in the store. Records are placed randomly throughout the file and are accessed by addresses that specify their disc location. Also, this type of file organization requires disk storage rather than tape. It has excellent search and retrieval performance, but care must be taken to maintain the indexes. If the indexes become corrupt, what is left may as well go to the bit-bucket, so it is well to keep regular backups of this kind of file, just as for all stored valuable data!
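A minimal in-memory sketch of hashed (direct) access, assuming linear probing to resolve collisions; a real implementation would address reserved disc blocks rather than a Python list:

```python
class HashedFile:
    """Toy direct-access file: a hashing function computes each record's
    slot address within a reserved area of fixed size."""

    def __init__(self, slots=8):
        self.slots = [None] * slots   # reserved space, initially empty

    def _addr(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        # Assumes the reserved area never completely fills.
        a = self._addr(key)
        while self.slots[a] is not None and self.slots[a][0] != key:
            a = (a + 1) % len(self.slots)   # linear probing on collision
        self.slots[a] = (key, value)

    def get(self, key):
        a = self._addr(key)
        for _ in range(len(self.slots)):
            if self.slots[a] is None:
                return None                 # empty slot: key is absent
            if self.slots[a][0] == key:
                return self.slots[a][1]
            a = (a + 1) % len(self.slots)
        return None
```

Lookup cost is near-constant regardless of file size, which is the "excellent search retrieval performance" the notes mention, at the price of reserved space and sensitivity to corruption.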

External File Structure and File Extensions






Microsoft Windows and MS-DOS File Systems

The external structure of a file depends on whether it is being created on a FAT or NTFS partition. The maximum filename length on an NTFS partition is 255 characters, and 11 characters on FAT (8-character name + "." + 3-character extension). NTFS filenames keep their case, whereas FAT filenames have no concept of case (and case is ignored when performing a search under NTFS). Also, there is the newer VFAT, which permits 255-character filenames.

UNIX and Apple Macintosh File Systems

The concept of directories and files is fundamental to the UNIX operating system. On Microsoft Windows-based operating systems, directories are depicted as folders, and moving about is accomplished by clicking on the different icons. In UNIX, the directories are arranged as a hierarchy with the root directory at the top of the tree. The root directory is always depicted as /. Within the / directory, there are subdirectories (e.g. etc and sys). Files can be written to any directory, depending on the permissions. Files can be readable, writable and/or executable.

Organizing files using Libraries

With the advent of Microsoft Windows 7, the concept of file organization and management improved drastically through a powerful tool called Libraries. A Library is a file organization system that brings together related files and folders stored in different locations on the local computer as well as on the network, so that they can be accessed centrally through a single access point. For instance, various images stored in different folders on the local computer and/or across a computer network can be accumulated in an Image Library. An aggregation of similar files can be manipulated, sorted or accessed conveniently as and when required through a single access point on the computer desktop by use of a Library. This feature is particularly useful for accessing similar or related content, and also for managing projects that use related and common data.