MS BackOffice Unleashed




Chapter 32


SQL Server Database Design and Planning


A good database design is fundamental to the success of any application. Database design primarily consists of two parts: logical design and physical design. Logical database design follows certain rules and results in a structure that can be translated into a physical model. A logical data model does not depend on the relational database engine you intend to use. A physical database design makes extensive use of the features of the underlying database engine to yield high performance for the application. SQL Server database engine features have been discussed in previous chapters. The topics described in this chapter include some of the rules for logical database design, indexing techniques, and their implementation in SQL Server.

What Is Normalization?


A database is a collection of related information. Data is stored in database tables consisting of rows and columns. Usually, a database includes numerous tables, each keeping relevant information. One might ask: why keep data in so many tables instead of just one large table that has all the columns? This question can be answered by keeping two facts in perspective:

To an organization, it is important to have a system that can be implemented with low cost and that yields high performance. Unfortunately, one goal is usually achieved at the expense of the other. Typically, you can achieve performance improvements by keeping redundant and duplicate data in the system. Normalization is a technique of spreading data across multiple tables so that related information is kept together, based on certain guidelines. Normalization results in controlled redundancy of data and therefore provides a good balance between disk space usage and performance. There are five rules of normalization. Most database designers, however, follow only the first three rules. The rules of normalization can be explained effectively with a real-life example. Consider the following scenario:

ABC Corp is a software development company developing multiple software packages. An employee within ABC Corp is known by a unique employee id, but in general an employee is recognized by her first name. Human Resources personnel keep track of some important attributes of an employee, such as exempt or nonexempt, salary, bonus, and gross salary. ABC Corp assigns a unique project ID for each project that is currently under development. Each project has a start date and an end date. Employees usually work on multiple projects and come and go as governed by the project requirements.

The goal for ABC Corp is to design a database that keeps track of all the employees working on various projects. From the scenario just explained, you can see the two main database entities here: Employee and Project. Each entity has some attributes associated with it that describe the entity. An attribute that uniquely defines an entity is called the primary key. For example, employee id is the primary key for the Employee entity because it uniquely identifies an employee. Some of the attributes are explained in the following table:
Attribute Name Abbreviated Name
Employee Entity
employee id emp_id
employee name name
salary salary
bonus bonus
gross salary gros_sal
exempt exempt
Project Entity
project id proj_id
project name p_name
project start date p_start
project end date p_end
employee project start date e_p_start
employee project end date e_p_end

Abbreviated attribute names will be used in the examples throughout this chapter. Now, to design a database, the simplest method is to create an employee_project table with all the columns in it. The primary key for this table is a composite key of emp_id + proj_id. Table 32.1 depicts a completely denormalized employee_project table containing a few records.

Table 32.1. Creating a denormalized employee_project table. The primary key is emp_id + proj_id.

emp_id name exempt salary bonus gros_sal proj_id p_name p_start p_end e_p_start
100 Jeff Y 50000 1000 51000 P100 internet 1/1/95 1/1/97 3/3/96
100 Jeff Y 50000 1000 51000 P200 admin 1/1/96 1/1/98 4/4/96
200 Sam N 60000 0000 60000 P100 internet 1/1/95 1/1/97 2/2/95
300 Sue Y 70000 1000 71000 P300 intranet 1/1/96 1/1/97 1/1/96
400 Dave Y 50000 2000 52000 P300 intranet 1/1/96 1/1/97 1/1/96
500 John N 80000 5000 85000 P300 intranet 1/1/96 1/1/97 3/18/96
500 John N 80000 5000 85000 P100 internet 1/1/95 1/1/97 1/1/96

The following sections describe the first three rules of normalization.

The First Rule of Normalization


The first rule of normalization requires removal of repeating data values and specifies that no two rows should be identical.

In the example shown in Table 32.1, for each employee the columns name, salary, bonus, and gros_sal are being repeated in each row of the table. To conform with the first rule of normalization, these values should be removed and stored separately along with the emp_id column. Therefore, you create a new table called employee_profile consisting of the columns that were being repeated. The resulting structures of the employee_profile and employee_project tables are shown in Tables 32.2 and 32.3.

Table 32.2. The employee_profile table in first normal form. The primary key is emp_id.

emp_id name exempt salary bonus gros_sal
100 Jeff Y 50000 1000 51000
200 Sam N 60000 0000 60000
300 Sue Y 70000 1000 71000
400 Dave Y 50000 2000 52000
500 John N 80000 5000 85000

Table 32.3. The employee_project table. The primary key is emp_id + proj_id.

emp_id proj_id p_name p_start p_end e_p_start
100 P100 internet 1/1/95 1/1/97 3/3/96
100 P200 admin 1/1/96 1/1/98 4/4/96
200 P100 internet 1/1/95 1/1/97 2/2/95
300 P300 intranet 1/1/96 1/1/97 1/1/96
400 P300 intranet 1/1/96 1/1/97 1/1/96
500 P300 intranet 1/1/96 1/1/97 3/18/96
500 P100 internet 1/1/95 1/1/97 1/1/96

The benefits gained here are obvious. The employee_project table is much smaller now, so SQL Server can pack more rows per data page, fewer I/Os are required to read the data, and more information can be cached in SQL Server memory.

The Second Rule of Normalization


An entity is in second normal form if it conforms to the first normal form and all nonkey attributes are fully dependent on the entire primary key. If the primary key consists of multiple attributes, the nonkey attributes should depend on the entire key and not just a part of the key. All nonkey attributes that are dependent on a part of the key should be removed and stored separately along with the attributes they are dependent on.

Take the example shown in Table 32.3. The primary key for the employee_project table is a composite key consisting of the emp_id and proj_id columns. To conform with the second normal form, all the nonkey attributes (such as p_start and p_end) should completely depend on the primary key (emp_id + proj_id). However, attributes such as project name (p_name), project start date (p_start), and project end date (p_end) are dependent only on the proj_id attribute and not on the emp_id. This is a violation of the second normal form. Therefore, these attributes need to be removed from the employee_project table and stored in a different table called project_profile. This is shown in Table 32.4 and Table 32.5.

Table 32.4. The project_profile table in second normal form. The primary key is proj_id.

proj_id p_name p_start p_end
P100 internet 1/1/95 1/1/97
P200 admin 1/1/96 1/1/98
P300 intranet 1/1/96 1/1/97

Table 32.5. The employee_project table. The primary key is emp_id + proj_id.

emp_id proj_id e_p_start
100 P100 3/3/96
100 P200 4/4/96
200 P100 2/2/95
300 P300 1/1/96
400 P300 1/1/96
500 P300 3/18/96
500 P100 1/1/96

The Third Rule of Normalization


An entity is in third normal form if it already conforms to the first two normal forms and no nonkey attribute is dependent on any other nonkey attribute. Any such attributes should be removed from the table.

For example, the attribute gros_sal in Table 32.2 is calculated as follows:

gros_sal = salary + bonus

The attribute gros_sal is dependent on two other nonkey attributes: salary and bonus. This is a violation of the third normal form. Because gros_sal is a computed field and can always be calculated from the other two fields (salary and bonus), it should be removed from the table. The resultant table structure is shown in Table 32.6, followed by a sketch of computing the value on the fly.

Table 32.6. The employee_profile table in third normal form. The primary key is emp_id.

emp_id name exempt salary bonus
100 Jeff Y 50000 1000
200 Sam N 60000 0000
300 Sue Y 70000 1000
400 Dave Y 50000 2000
500 John N 80000 5000
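Because gros_sal is derivable, a query can simply compute it whenever it is needed. A minimal sketch against the normalized employee_profile table, using the classic T-SQL column-alias form:

SELECT emp_id, name, gros_sal = salary + bonus
FROM employee_profile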

As you can see, by following the rules of normalization, one large table (employee_project) has been broken into three smaller tables (employee_profile, project_profile, and employee_project). The following section explains the benefits of normalization.

Why Normalize?


After you have designed your database to be in the third normal form, you can create physical tables in the database. As a result of normalization, because data is split into multiple tables with fewer columns, you can store more data (that is, more rows) on each page of the database. Among the advantages of normalization: redundant data is reduced, saving disk space; narrower tables allow more rows per page, so fewer I/Os are required and more data can be cached in memory; and smaller tables with smaller keys are faster to search and maintain.


Denormalizing a Database


After a database has been normalized to third normal form, sometimes it is necessary to backtrack from normalization for performance reasons. This step of rolling back from normalization is called denormalization. The sole purpose of denormalizing a database is to improve performance. One of the major benefits of normalization techniques is the reduction of redundant data, which results in efficient usage of disk storage space. But at the same time, because data is now split into multiple tables, joins are required to retrieve information that resides in various tables. This results in slower performance for queries. Continuing with the scenario of ABC Corp, assume that you want to find out the names of all the employees who are working on a project with project ID P100. For a normalized database (Tables 32.5 and 32.6), you can get the desired results by joining the employee_profile and employee_project tables. The query example is shown here:

SELECT name
FROM employee_profile, employee_project
WHERE employee_profile.emp_id = employee_project.emp_id
AND employee_project.proj_id = 'P100'

If the database were denormalized (see Table 32.1), you could have gotten the same results without the join. The query example is shown here:

SELECT name
FROM employee_project
WHERE proj_id = 'P100'

In the first example, storage efficiency is achieved at the cost of performance. In the second example, better performance is achieved at the cost of higher usage of disk space. You can see that there is a trade-off between performance and cost. Therefore, to build a system with low cost and high performance, you must first normalize your database; then if you experience any performance bottlenecks, you should consider denormalizing the database.

Criteria for Denormalization


Before you denormalize a database, weigh the expected performance gain against the cost of storing and maintaining redundant data, and keep the following point in mind:



If you are experiencing application-wide performance problems, denormalizing should not be the first step you take to try to fix the problem. Before you denormalize your database, make sure that other components of the system (such as hardware, network, and SQL Server) are configured optimally.


Methods of Denormalization


Various methods can be employed to denormalize a database table. One or more of these methods can be used to achieve the desired performance goals. Some of the most common methods for denormalization are discussed in the following paragraphs.

Horizontal Partitioning

In this method of denormalization, a large table is split into multiple smaller tables based on specific criteria. Each resultant table has the same data structure (that is, the same column names and column properties) as the parent table but contains mutually exclusive data. This method is commonly used for tables that contain an extremely large number of rows. Queries on such large tables are sometimes quite slow because of the large amount of data being scanned. Horizontal splitting of the data can be explained with the example of the employee_profile table (Table 32.6). Assume that most of the queries on the employee_profile table pertain to exempt employees and that the table contains employees of both types (exempt and nonexempt). This table can be split into two tables based on whether an employee is exempt or nonexempt. Tables 32.7 and 32.8 depict two tables, exempt_emp_profile and nonexempt_emp_profile, that are the result of a horizontal split of the employee_profile table.

Table 32.7. Horizontal partitioning: the exempt_emp_profile table.

emp_id name exempt salary bonus
100 Jeff Y 50000 1000
300 Sue Y 70000 1000
400 Dave Y 50000 2000

Table 32.8. Horizontal partitioning: the nonexempt_emp_profile table.

emp_id name exempt salary bonus
200 Sam N 60000 0000
500 John N 80000 5000


Horizontal partitioning is quite useful when a large amount of historical information is kept along with the currently active data. If historical information is used only occasionally, it can be placed in multiple tables (for example, one table for each month), and active data can be stored in a separate table. Programming with tables that are horizontally split, however, can sometimes become very complicated.
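When a query does need employees of both types, the partitions can be recombined with UNION ALL; a minimal sketch using the two tables above:

-- recombine the two horizontal partitions into one result set
SELECT emp_id, name, exempt, salary, bonus
FROM exempt_emp_profile
UNION ALL
SELECT emp_id, name, exempt, salary, bonus
FROM nonexempt_emp_profile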


Vertical Partitioning

A database in SQL Server consists of multiple 2K pages. The number of rows on a page depends on the width of the table (plus some overhead); the wider the table, the fewer the rows per page. Significant performance gains can be achieved by reducing the number of I/Os on a table. Vertical splitting is a method of reducing the width of a table by splitting the table into two or more tables: all frequently used columns are kept together in one table, and the others are kept in another table. Because each resulting table is narrower, more rows can be accommodated per page, fewer I/Os are generated, and more data can be cached in SQL Server memory. Take the example of the employee_profile table. If name and salary are the only two columns that are accessed frequently, the employee_profile table can be partitioned vertically as shown in Tables 32.9 and 32.10.

Table 32.9. Vertical partitioning: the primary_employee_profile table.

emp_id name salary
100 Jeff 50000
200 Sam 60000
300 Sue 70000
400 Dave 50000
500 John 80000

Table 32.10. Vertical partitioning: the secondary_employee_profile table.

emp_id exempt bonus
100 Y 1000
200 N 0000
300 Y 1000
400 Y 2000
500 N 5000
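To reassemble a complete employee row, the two partitions are joined on the shared emp_id key; a minimal sketch:

SELECT p.emp_id, p.name, p.salary, s.exempt, s.bonus
FROM primary_employee_profile p, secondary_employee_profile s
WHERE p.emp_id = s.emp_id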

Adding Redundant and Computed Columns

Joins are usually expensive in terms of CPU usage and from an I/O point of view. Performance gains can be achieved by reducing the number of joins in a query. You can do this by keeping some redundant information in a table so that you can avoid joins in some frequently executed queries. Assume that you want to find out the names of all the employees who are working on a project with project ID P100. For a normalized database (Tables 32.5 and 32.6), you can get the desired results by joining the employee_profile and employee_project tables. The query is shown here:

SELECT name
FROM employee_profile, employee_project
WHERE employee_profile.emp_id = employee_project.emp_id
AND employee_project.proj_id = 'P100'

Now if you denormalize the employee_project table and add an additional column, name, to the table (see Table 32.11), you eliminate the need for a join in the query. Although Table 32.11 is in violation of the first normal form, it gives better performance because of the absence of a join. The query is as shown here:

SELECT name
FROM employee_project
WHERE proj_id = 'P100'

Table 32.11. Adding a redundant column to the employee_project table. Denormalization with a violation of the first normal form.

emp_id name proj_id e_p_start
100 Jeff P100 3/3/96
100 Jeff P200 4/4/96
200 Sam P100 2/2/95
300 Sue P300 1/1/96
400 Dave P300 1/1/96
500 John P300 3/18/96
500 John P100 1/1/96

Some queries compute aggregate values derived from one or more columns of a table. Computation of aggregate values (such as mathematical calculations) can be highly CPU intensive. If such computations are frequent enough, they can impact performance adversely. One of the techniques for dealing with such situations is to create an additional column that stores the derived value. For the employee_profile table (Table 32.6), if the gross salary (salary + bonus) is being calculated quite often, it might be more beneficial to add a column called gross_salary to the employee_profile table. A trigger can be created on the table to maintain the value in the gross_salary column, as sketched after Table 32.12. Although this is a violation of the third normal form, it might provide significant performance improvements in an application that uses several derived columns. Table 32.12 illustrates this method of denormalizing the database.

Table 32.12. Adding a computed column to the employee_profile table.

emp_id name exempt salary bonus gross_salary
100 Jeff Y 50000 1000 51000
200 Sam N 60000 0000 60000
300 Sue Y 70000 1000 71000
400 Dave Y 50000 2000 52000
500 John N 80000 5000 85000
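A minimal sketch of such a trigger; the trigger name is illustrative, and the join through the inserted pseudo-table assumes the emp_id key shown above:

CREATE TRIGGER trg_gross_salary
ON employee_profile
FOR INSERT, UPDATE
AS
-- recompute gross_salary for only the rows touched by the statement
UPDATE employee_profile
SET gross_salary = employee_profile.salary + employee_profile.bonus
FROM employee_profile, inserted
WHERE employee_profile.emp_id = inserted.emp_id

Note that, per the rules for updates in place discussed later in this chapter, defining an update trigger on a table prevents SQL Server from performing updates in place on it.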

Using Indexes


SQL Server provides indexes for faster access to the data and to enforce the uniqueness of data in the database tables. Creating indexes on a table improves query search performance. Indexes are one of the most important aspects of database design and are essential for faster retrieval of data from various tables. You create indexes on a table by specifying the column names that constitute a key. SQL Server provides two types of indexes: clustered and nonclustered. Both are B-Tree structures and require separate storage space in the database. You can create indexes either by using the CREATE INDEX statement or by using a UNIQUE or PRIMARY KEY integrity constraint in the CREATE TABLE statement. SQL Server allows only one clustered index per table (soon I will explain why) and up to 249 nonclustered indexes. A sketch of both ways of creating an index follows.
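Here, the employee_profile table from the chapter is used; the constraint and index names, and the column datatypes, are illustrative:

CREATE TABLE employee_profile
(
emp_id int NOT NULL CONSTRAINT pk_employee_profile PRIMARY KEY CLUSTERED,
name varchar(30) NOT NULL,
exempt char(1) NOT NULL,
salary money NOT NULL,
bonus money NULL
)

-- a nonclustered index to speed searches on the name column
CREATE NONCLUSTERED INDEX ix_employee_name
ON employee_profile (name)

Before I get into the specifics of each type of index, it is important that you understand the storage and data-access techniques of SQL Server.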

SQL Server Data Access Technique


A SQL Server database is built on a logical storage hierarchy of pages, extents, and allocation units, each of which is described in the following sections.


Pages

A page is the minimum allocation unit in a SQL Server database. The size of each page is 2048 bytes (2K). Pages allocated to a table are maintained in the form of a doubly linked list; that is, each page has a pointer to the next as well as the previous page in the chain. Each page also begins with a header that stores housekeeping information, such as the page number, the object to which the page belongs, and the pointers to the previous and next pages in the chain.

Figure 32.1 shows a typical page structure in SQL Server.



Because each row must fit completely on a page, if a table has a row width greater than 981 bytes (1962 divided by 2), the remaining space on each page is rendered useless because no other row can fit into the rest of the space.

FIGURE 32.1. Page structure in a SQL Server database.

Extents

An extent in SQL Server is a block of eight contiguous 2K pages (16K). Extents are used for space allocation for tables in a database. Space is always allocated to tables one extent at a time. A new extent is allocated to a table at the time of table creation or when a previously allocated extent becomes full. An extent is deallocated from a table when none of its pages contain data. The extents within an allocation unit form a doubly linked list; that is, each extent has pointers to the previous and the next extent in the list. Page links are maintained only within an extent; links to pages on a different extent are maintained through extent-level links. Figure 32.2 shows a typical extent structure.

FIGURE 32.2. Extents in a SQL Server database.

Allocation Units

When a new database is created with the CREATE DATABASE T-SQL command, SQL Server internally divides the total space into blocks of 256 2K pages. Each such block is called an allocation unit, and each allocation unit is divided into 32 extents of eight contiguous pages. Allocation units, too, form a doubly linked list, with each allocation unit holding pointers to the previous and the next unit in the list. The first page of each allocation unit stores these pointers and tracks space usage within the unit: the free extents, as well as the extents that have been allocated to specific tables. A representation of the storage scheme of an allocation unit is shown in Figure 32.3.

FIGURE 32.3. Allocation units in SQL Server.

Because space allocation for tables is always in terms of extents, no two tables can share an extent within an allocation unit. Multiple tables can, however, share the same allocation unit, and a table can span multiple allocation units, depending on the number of rows in the table. If a table has no indexes, data is always added at the end of the last page of the table. For performance reasons, SQL Server tries to allocate data for a table in contiguous extents in the current allocation unit. If no contiguous extents are available, it tries to allocate an extent within the same allocation unit. If no extents are available within the allocation unit, it tries to allocate an extent on allocation units on which the table already has extents.

In the example shown in Figure 32.4, Table A and Table B are using separate extents on Allocation Unit 1. Table A also has an extent allocation on Allocation Unit 2. The first page of each table stores information about all the allocation units the table is split across. In the example shown in Figure 32.4, the first page of Table A keeps pointers to Allocation Units 1 and 2. For Table B, the pointers are only to Allocation Unit 1. The first page of each allocation unit stores the information about the extents that have been allocated to various tables. The location of the first page of each table is stored in the sysindexes system table. Now you'll see what happens if a user requests some data from Table A that resides on Page 5 of Extent 18 on Allocation Unit 2 (see Figure 32.4). SQL Server fetches the starting data page number of Table A from the sysindexes system table. Because the first page of a table has pointers to all the allocation units on which table data is present, this page has pointers to Allocation Units 1 and 2. The first page of Allocation Unit 1 points to Extent 2 and Extent 31. SQL Server reads all the pages on both extents, then continues on to Allocation Unit 2, which points to Extent 18 for Table A. SQL Server reads pages in Extent 18 until it reaches Page 5, where it finds the requested data.

FIGURE 32.4. Locating table data in SQL Server.

This whole process of reading all the pages in a table is called a table scan. Table scans are very slow because SQL Server starts reading from the first page and continues reading until it finds the desired result set. For tables with many records, and for complex joins involving such tables, table scans are unacceptably slow. That is why indexes are created on tables: to eliminate table scans and access the data directly rather than sequentially. Indexes in SQL Server are created as B-Tree structures, and they provide more direct access to the data. Like data pages, index pages are stored as 2K pages; unlike data pages, index pages store the values of the indexed columns and pointers to the data.
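As an aside, one way to see how many pages a query actually reads, and therefore whether it scanned the whole table, is the STATISTICS IO session setting; a minimal sketch:

-- report the scan count plus logical and physical reads for each table in the query
SET STATISTICS IO ON
SELECT name
FROM employee_profile
WHERE name = 'Sue'
SET STATISTICS IO OFF

The following paragraphs explain the B-Tree structure.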

B-Tree Structure


There are various methods of storing indexes; SQL Server stores its indexes as B-Tree (Balanced Tree) structures. As the name suggests, indexes are stored in the form of a balanced tree with multiple levels, and they use separate storage space in SQL Server. A B-Tree has three kinds of levels, described in the following sections: the root level, the intermediate levels, and the leaf level.


Root Level

The root level is the highest level of an index. Usually, index pages contain more rows than data pages, simply because index pages store only the key column values and pointers to the data pages. Depending on the number of rows in a table, the root level points either to the data pages or to a lower, intermediate level. There is only one page at the root level.

Leaf Level

The leaf level is the lowest level of an index. This level stores pointers that point directly to the data pages. There might be multiple pages at the leaf level, depending on the number of rows in the table.

Intermediate Level

All the levels between the root level and the leaf level are termed intermediate levels. The number of intermediate levels depends on the total number of records in the table and the size of the key; a larger key and a larger number of records result in a greater number of intermediate levels. The levels are numbered in increasing order from the leaf level (also called Level 0) to the root level.

An example is necessary here to demonstrate how indexes are maintained. Assume that the table employee is indexed on a column called empid. Also assume that the data page can store three rows per page and the index page can store five rows per page (in reality, there will be many more rows on data and index pages). In the beginning, let's say that the employee table contains five records, as shown in Figure 32.5.

FIGURE 32.5. Basic B-Tree structure.

In the example shown in Figure 32.5, there is only one level of index: the leaf level. Now if three more records are inserted in this table, one data page is added to accommodate the new rows, and two additional index pages are required to maintain the B-Tree. This structure is depicted in Figure 32.6.

FIGURE 32.6. B-Tree structure after the data insert.

You can see from Figure 32.6 that there is an extra level of index pages to support data in the table. As more data gets added to this table, more index pages are utilized, and more intermediate levels are created. If you create an additional index on the same table, a separate B-Tree is created to maintain that index.

As mentioned before, SQL Server provides two types of indexes: clustered and nonclustered. They are both stored as B-Tree structures. Both of the indexes are covered in the following paragraphs.

Clustered Index


A clustered index enables you to create an index in which the physical order of the rows on data pages is the same as the physical order of the keys on index pages. That means data is physically sorted on the pages in the order of the indexed keys. Therefore, if you create a clustered index on empid, the B-Tree shown in Figure 32.6 will look like the one shown in Figure 32.7.

FIGURE 32.7. Clustered indexes in SQL Server.

If you look closely, you will see that the key values on the data pages and the leaf pages are identical. Therefore, to eliminate redundancy, the leaf pages are replaced by the data pages themselves. That means the bottom level of the index is actually the data, and there is one less level of indexing. This is shown in Figure 32.8.

FIGURE 32.8. B-Tree structure for clustered indexes.

Because there can be only one physical order of the key, you can have only one clustered index per table. Also note that the pointers in a clustered index do not point directly to data rows, but to the page on which the rows reside.
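A minimal sketch of creating such an index on the running example's employee table (the index name is illustrative):

CREATE CLUSTERED INDEX ix_employee_empid
ON employee (empid)

It's now time to examine the effect of various data manipulation statements on a table with a clustered index on it.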

Data Manipulation with Clustered Indexes

When a row is inserted into a table, because data is physically sorted on the pages, some rows move down to accommodate the new row. A row is added at the end of the page only when the key value is greater than the previous maximum value for the key. For example, in Figure 32.8 if you insert a record with empid = 750, then to accommodate this row, rows with empid = 800 and 900 move down one position. SQL Server also modifies the B-Tree structure to accommodate this insert. The result is shown in Figure 32.9.

FIGURE 32.9. Insert operation with a clustered index.

A delete operation is just the opposite of the insert statement. For example, if you delete a row from the table shown in Figure 32.9 in which empid = 750, then rows with empid = 800 and 900 move up one row and the B-Tree is now modified to point to the page with empid = 800.

An update statement is essentially a delete followed by an insert statement. Therefore, if you update empid 750 with 850, there is physical movement of the rows for the delete statement and then for the insert statement.



SQL Server enables you to write queries in such a way that an update can be performed "in place." That means there is no physical movement of the row. To perform an update in place, you must follow certain rules, including the following ones:
A query cannot update the columns participating in the clustered index.
The table can't have an update trigger defined on it.
The table must not be used for data replication.
An update in place results in better performance of the query due to the lack of physical movement of the row and the lower amount of data logging by SQL Server. For a complete list of rules for an update in place, refer to the SQL Server documentation.


If you refer to Figure 32.9, you will notice that all the data pages are completely full and there is no place to add a new row. In such cases, when an insert takes place, SQL Server takes a new page from the extent, adds it to the page chain, and splits the rows 50-50 between the old page and the new page. Figure 32.10 shows an example of page splitting in which a new record is inserted with empid = 950.

FIGURE 32.10. Page splitting in SQL Server.
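Page splits on insert-heavy tables can be deferred by building the index with free space in its pages, using the FILLFACTOR option of CREATE INDEX. A minimal sketch, leaving each page half empty when the index is built (FILLFACTOR is not maintained as rows are later added):

CREATE CLUSTERED INDEX ix_employee_empid
ON employee (empid)
WITH FILLFACTOR = 50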

Nonclustered Index


Nonclustered indexes are also B-Tree structures. A nonclustered index differs from a clustered index in the following ways: the physical order of the rows on the data pages is independent of the order of the index keys; the leaf level stores pointers (row IDs) to the data rather than being the data itself; and a table can have up to 249 nonclustered indexes but only one clustered index.

A typical B-Tree structure for a nonclustered index is shown in Figure 32.11. Because there is an additional level of indexing, a nonclustered index requires extra storage space.

FIGURE 32.11. B-Tree structure for a nonclustered index.

Data Manipulation with Nonclustered Indexes

When a row is inserted into a table with a nonclustered index, data is always added at the end of the page, but the index pages are kept in sorted order. The row ID of a row is a pointer into an offset table on each data page. The offset table stores the starting location of the data for each row on the page. The offset table starts at the last byte of the page and grows backward as records are inserted, using a 2-byte entry for each row. The advantage of using ROWIDs and the offset table is that when data gets inserted into or deleted from a page, the existing ROWIDs and the corresponding index pointers for the other rows remain unaffected. For an insert statement, a new ROWID is assigned to the row, and changes are made to the offset table to point to the location of the data. Figure 32.12 illustrates a scenario in which a record with empid = 750 is inserted into the table. You will notice that the record is added at the end of the data page with a new ROWID and that, unlike with clustered indexes, there is no physical movement of the rest of the records.

FIGURE 32.12. Insert operation with a nonclustered index.

When a row is deleted, it is removed from the data page, and all the rows below it move up. The ROWIDs of the other records remain unaffected; the offset table is modified to reflect the new locations of the data for the records that physically moved on the page. On the index pages, the B-Tree is modified to remove the reference to the deleted row. Because the leaf pages store the ROWIDs of the rows, the other entries in the B-Tree remain unaffected by the delete operation. Figure 32.13 illustrates a scenario in which a record with empid = 700 is deleted from the table. You will notice that the ROWIDs of the other records remain unchanged.

FIGURE 32.13. Delete operation with a nonclustered index.

The Advantages of Indexes


Indexes are primarily used for faster retrieval of data, and creating appropriate indexes is one of the most important aspects of database design. Although indexes speed up search operations, inserts, deletes, and updates might be slower, because each of these operations might modify the B-Tree index structure of the table. An index created without any consideration for performance usually results in slower data modification statements. Therefore, it is very important to choose indexes carefully and to measure their overall performance impact on the system.

SQL Server indexes are mostly transparent to users. SQL Server's cost-based query optimizer decides whether to use an index on a table for a particular query. If a table has more than one index, the query optimizer chooses the index that is most effective in terms of the number of I/Os. The following broad guidelines might help you decide whether to index a column: index the columns that appear frequently in WHERE clauses and join conditions; avoid indexing columns with very few distinct values, because such indexes are rarely useful; and avoid indexing heavily modified tables too generously, because every index adds overhead to inserts, updates, and deletes. SQL Server does, however, allow an application developer to pass hints to the query optimizer to use a specific index in a query, as sketched next.
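A minimal sketch of an optimizer hint in T-SQL, naming an index after the table in the FROM clause; the index name is the illustrative one created earlier in this chapter:

SELECT name
FROM employee_profile (INDEX = ix_employee_name)
WHERE name = 'Sue'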


Things to Consider when Planning a Database


Database planning is one of the most complex parts of application design. The following paragraphs describe the steps necessary to ensure that you have a robust and high-performance database application.

Database Design


A good database design is fundamental to the success of any application environment. Database design can be split into two parts: logical design and physical design, discussed in turn in the following sections.


Logical Design

Logical design is the first step when you start planning your application. This step involves collecting user requirements. Database designers take these specifications and identify entities, attributes, relationships between entities (one to many, many to many, and so on), primary keys, and the like. Based on this information, an entity relationship model is built for the system. An entity relationship diagram should satisfy at least the first three rules of normalization. After this step, you are ready to translate your logical model into a physical one.

Physical Design

When the logical design is complete and conforms to the rules of normalization, entities become tables and attributes become columns in a table. This step involves actions such as creating the tables and their columns with appropriate datatypes, defining the primary keys, and choosing the indexes.

There are also some additional steps you should plan for. Although these steps are not related to the physical design of the database, they are important for providing maximum uptime for the database. They include planning backups of the databases and transaction logs and placing (and, if possible, mirroring) the transaction log on a separate disk to allow full recovery in the event of a disk crash.


Planning Client Configuration


It is always important to know what kind of resources will be required to run an application smoothly. An end-user application might suffer from performance problems if not enough memory is available on the client or if the hardware is slow. You should also determine whether the end-users will be running any other resource-intensive applications on the client along with this application. Therefore, at the time of designing the frontend application, estimate the amount of resources (memory, CPU, disk space) that will be required in order to run the application smoothly.

Planning Server Configuration


In a client-server environment, the server usually does most of the work. Therefore, it is important that the hardware is configured to provide an excellent price/performance ratio. The following sections discuss various components of the server configuration.

Hardware Components

This step deals with defining hardware resources for the machine on which SQL Server will be installed. The important subsystems of server hardware that can make a significant impact on performance are the system processors, the disk subsystem, and memory, each discussed in the following sections.

It is very important that sufficient resources always be available on the server hardware. All the previously listed parts of hardware work together. A system that suffers from performance problems at the hardware level might indicate that one subsystem is idle while another one has a very high percentage of utilization. For example, low CPU utilization might indicate that the processor is waiting on the disk subsystem to finish I/Os.

System Processors

To provide a sufficient number of processors on the machine, determine how many concurrent users the server must support and what kind of workload they will generate (many short transactions, a few heavy queries, and so on).

SQL Server supports Symmetric Multiprocessing (SMP). An SMP machine has more than one CPU, and each CPU can handle a load independent of the other CPUs. On SMP hardware, a SQL Server thread is not tied to a particular processor and might therefore execute on a different processor each time. This allows SQL Server to run multiple tasks simultaneously, resulting in excellent load balancing.

Disk Subsystem

Choosing the right kind of disk subsystem is an important aspect of application design. This step helps you determine how much disk space is required, how many physical drives to use, and where to place the data, the indexes, and the transaction logs.

The placement of data on devices makes a significant impact on the performance of the system. It is quite important that you keep a transaction log on a separate disk (and mirror it, if possible) to provide full recovery in the event of a disk crash. Keeping nonclustered indexes on one device and data on another device (by using SQL Server segments) usually results in good performance of the system. If the database and transaction logs are to be backed up on disk drives, you should consider that fact also in your capacity planning. You should also forecast the growth of the database and acquire disk space accordingly.
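As a sketch of placing a nonclustered index on its own device, assuming a segment named index_seg has already been defined on that device with sp_addsegment, the CREATE INDEX statement accepts a segment name:

-- build the index on the segment (and thus the device) named index_seg
CREATE NONCLUSTERED INDEX ix_employee_name
ON employee_profile (name)
ON index_seg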

Memory

Memory dedicated to SQL Server is primarily used to cache data and procedures. SQL Server also provides the capability to keep tempdb in RAM. The total physical memory on the machine is one of the important factors governing the performance of SQL Server. Insufficient memory causes more I/Os from the disk; because accessing a disk is much slower than accessing memory, a large number of physical I/Os can lead to performance degradation. If more memory is installed on a server, it can cache a larger amount of data, and many requests for data can be satisfied from memory rather than via physical I/Os. SQL Server can use a maximum of 2GB of memory on a machine. The amount of memory installed on a machine is primarily determined by the size of the database.

Network Configuration


It is important to choose an appropriate network protocol. For Microsoft SQL Server, the Named Pipes protocol delivers the highest performance. TCP/IP, however, is the most widely used protocol for SQL Server, and with TCP/IP, UNIX clients can also communicate with SQL Server. SQL Server can listen on several protocols simultaneously. Because listening on multiple protocols adds overhead, choose the ones that are appropriate for your network environment, and disable the others.

A network bottleneck is usually caused by insufficient network bandwidth, heavy network traffic, or a slow network. A poorly designed application can create a network bottleneck by retrieving hundreds of rows from the database even though only a few rows are necessary for the frontend application. Some network-related bottlenecks can be eliminated by efficient application and database design and by using database objects such as stored procedures, rules, and triggers. Other options include using a faster network, keeping the machine in a subnet with less traffic, and adding network redundancy.

SQL Server Configuration


All the previously mentioned components (client hardware, network, server hardware) are essential for a well-performing system. But they also add to the overall cost of the system, so investment in these components is limited by budget constraints. No matter how many resources you provide on the client, the network, and the hardware, there is no substitute for an efficient database design, well-written queries, and a well-configured SQL Server. A database administrator can configure SQL Server to provide excellent performance and throughput without compromising data integrity. SQL Server provides various parameters that can be configured to provide optimal performance for a particular environment.
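A minimal sketch of examining and changing such a parameter with the sp_configure system procedure; the memory value shown is purely illustrative (in SQL Server 6.5 the memory parameter is specified in 2K units, so 16384 means 32MB):

-- list the current configuration values
EXEC sp_configure

-- set SQL Server memory to 32MB (16384 x 2K pages), then apply the change
EXEC sp_configure 'memory', 16384
RECONFIGURE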

Denormalization


To improve performance, sometimes it is necessary to backtrack from a fully normalized database. The various denormalization techniques include horizontal partitioning, vertical partitioning, and adding redundant or computed columns.

These methods of denormalization were discussed earlier in this chapter, in the section Methods of Denormalization. Choose the method that provides the best performance improvement for your application design.

Summary


A good database design should conform to the third normal form. A logical model in the third normal form can be easily translated into a physical model. Denormalization techniques can be utilized for performance reasons. You should, however, have a very good understanding of the logical and physical model of the system before making an attempt at denormalizing data. The next chapter explores the areas where SQL Server can be used.
