MS BackOffice Unleashed




Chapter 32


SQL Server Database Design and Planning


A good database design is fundamental to the success of any application. Database design primarily consists of two parts: logical design and physical design. Logical database design follows certain rules and results in a structure that can be translated into a physical model. A logical data model does not depend on the relational database engine you intend to use. A physical database design makes extensive use of the features of the underlying database engine to yield high performance for the application. SQL Server database engine features have been discussed in previous chapters. The topics described in this chapter include some of the rules for logical database design, indexing techniques, and their implementation in SQL Server.

What Is Normalization?


A database is a collection of related information. Data is stored in database tables consisting of rows and columns. Usually, a database includes numerous tables, each keeping relevant information. One might ask: why keep data in so many tables instead of just one large table that has all the columns? This question can be answered by keeping two facts in perspective:

To an organization, it is important to have a system that can be implemented with low cost and that yields high performance. Unfortunately, one goal is usually achieved at the expense of the other. Typically, you can achieve performance improvements by keeping redundant and duplicate data in the system. Normalization is a technique of spreading data across multiple tables so that related information is kept together, based on certain guidelines. Normalization results in controlled redundancy of data and therefore provides a good balance between disk space usage and performance. There are five rules of normalization. Most database designers, however, follow only the first three rules. The rules of normalization can be explained effectively with a real-life example. Consider the following scenario:

ABC Corp is a software development company developing multiple software packages. An employee within ABC Corp is known by a unique employee id, but in general an employee is recognized by her first name. Human Resources personnel keep track of some important attributes of an employee, such as exempt or nonexempt, salary, bonus, and gross salary. ABC Corp assigns a unique project ID for each project that is currently under development. Each project has a start date and an end date. Employees usually work on multiple projects and come and go as governed by the project requirements.

The goal for ABC Corp is to design a database that keeps track of all the employees working on various projects. From the scenario just explained, you can see the two main database entities here: Employee and Project. Each entity has some attributes associated with it that describe the entity. An attribute that uniquely defines an entity is called the primary key. For example, employee id is the primary key for the Employee entity because it uniquely identifies an employee. Some of the attributes are explained in the following table:
Attribute Name Abbreviated Name
Employee Entity
employee id emp_id
employee name name
salary salary
bonus bonus
gross salary gros_sal
exempt exempt
Project Entity
project id proj_id
project name p_name
project start date p_start
project end date p_end
employee project start date e_p_start
employee project end date e_p_end

Abbreviated attribute names will be used in the examples throughout this chapter. Now, to design a database, the simplest method is to create an employee_project table with all the columns in it. The primary key for this table is a composite key of emp_id + proj_id. Table 32.1 depicts a completely denormalized employee_project table containing a few records.

Table 32.1. Creating a denormalized employee_project table. The primary key is emp_id + proj_id.

emp_id name exempt salary bonus gros_sal proj_id p_name p_start p_end e_p_start
100 Jeff Y 50000 1000 51000 P100 internet 1/1/95 1/1/97 3/3/96
100 Jeff Y 50000 1000 51000 P200 admin 1/1/96 1/1/98 4/4/96
200 Sam N 60000 0000 60000 P100 internet 1/1/95 1/1/97 2/2/95
300 Sue Y 70000 1000 71000 P300 intranet 1/1/96 1/1/97 1/1/96
400 Dave Y 50000 2000 52000 P300 intranet 1/1/96 1/1/97 1/1/96
500 John N 80000 5000 85000 P300 intranet 1/1/96 1/1/97 3/18/96
500 John N 80000 5000 85000 P100 internet 1/1/95 1/1/97 1/1/96

The following sections describe the first three rules of normalization.

The First Rule of Normalization


The first rule of normalization requires removal of repeating data values and specifies that no two rows should be identical.

In the example shown in Table 32.1, for each employee the columns name, salary, bonus, and gros_sal are being repeated in each row of the table. To conform with the first rule of normalization, these values should be removed and stored separately along with the emp_id column. Therefore, you create a new table called employee_profile consisting of the columns that were being repeated. The resulting structures of the employee_profile and employee_project tables are shown in Tables 32.2 and 32.3.

Table 32.2. The employee_profile table in first normal form. The primary key is emp_id.

emp_id name exempt salary bonus gros_sal
100 Jeff Y 50000 1000 51000
200 Sam N 60000 0000 60000
300 Sue Y 70000 1000 71000
400 Dave Y 50000 2000 52000
500 John N 80000 5000 85000

Table 32.3. The employee_project table. The primary key is emp_id + proj_id.

emp_id proj_id p_name p_start p_end e_p_start
100 P100 internet 1/1/95 1/1/97 3/3/96
100 P200 admin 1/1/96 1/1/98 4/4/96
200 P100 internet 1/1/95 1/1/97 2/2/95
300 P300 intranet 1/1/96 1/1/97 1/1/96
400 P300 intranet 1/1/96 1/1/97 1/1/96
500 P300 intranet 1/1/96 1/1/97 3/18/96
500 P100 internet 1/1/95 1/1/97 1/1/96

The benefits gained here are obvious. The employee_project table is much smaller now, so SQL Server can pack more rows per data page, fewer I/Os are required to read the data, and more information can be cached in SQL Server memory.

The Second Rule of Normalization


An entity is in second normal form if it conforms to the first normal form and all nonkey attributes are fully dependent on the entire primary key. If the primary key consists of multiple attributes, the nonkey attributes should depend on the entire key and not just a part of the key. All nonkey attributes that are dependent on a part of the key should be removed and stored separately along with the attributes they are dependent on.

Take the example shown in Table 32.3. The primary key for the employee_project table is a composite key consisting of the emp_id and proj_id columns. To conform with the second normal form, all the nonkey attributes (such as p_start and p_end) should completely depend on the primary key (emp_id + proj_id). However, attributes such as project name (p_name), project start date (p_start), and project end date (p_end) are dependent only on the proj_id attribute and not on the emp_id. This is a violation of the second normal form. Therefore, these attributes need to be removed from the employee_project table and stored in a different table called project_profile. This is shown in Table 32.4 and Table 32.5.

Table 32.4. The project_profile table in second normal form. The primary key is proj_id.

proj_id p_name p_start p_end
P100 internet 1/1/95 1/1/97
P200 admin 1/1/96 1/1/98
P300 intranet 1/1/96 1/1/97

Table 32.5. The employee_project table. The primary key is emp_id + proj_id.

emp_id proj_id e_p_start
100 P100 3/3/96
100 P200 4/4/96
200 P100 2/2/95
300 P300 1/1/96
400 P300 1/1/96
500 P300 3/18/96
500 P100 1/1/96

The Third Rule of Normalization


An entity is in third normal form if it already conforms to the first two normal forms and no nonkey attribute is dependent on any other nonkey attribute. Any such attributes should be removed from the table.

For example, the attribute gros_sal in Table 32.2 is calculated as follows:

gros_sal = salary + bonus

The attribute gros_sal is dependent on two other nonkey attributes: salary and bonus. This is a violation of the third normal form. Because gros_sal is a computed field and can always be calculated from the other two fields (salary and bonus), it should be removed from the table. The resultant table structure is shown in Table 32.6, followed by a sketch of computing the value on the fly.

Table 32.6. The employee_profile table in third normal form. The primary key is emp_id.

emp_id name exempt salary bonus
100 Jeff Y 50000 1000
200 Sam N 60000 0000
300 Sue Y 70000 1000
400 Dave Y 50000 2000
500 John N 80000 5000
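Because gros_sal is derivable, a query can simply compute it whenever it is needed. A minimal sketch against the normalized employee_profile table, using the classic T-SQL column-alias form:

SELECT emp_id, name, gros_sal = salary + bonus
FROM employee_profile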

As you can see, by following the rules of normalization, one large table (employee_project) has been broken into three smaller tables (employee_profile, project_profile, and employee_project). The following section explains the benefits of normalization.

Why Normalize?


After you have designed your database to be in the third normal form, you can create physical tables in the database. As a result of normalization, because data is split into multiple tables with fewer columns, you can store more data (that is, more rows) on each page of the database. Among the advantages of normalization: redundant data is reduced, saving disk space; narrower tables allow more rows per page, so fewer I/Os are required and more data can be cached in memory; and smaller tables with smaller keys are faster to search and maintain.


Denormalizing a Database


After a database has been normalized to third normal form, sometimes it is necessary to backtrack from normalization for performance reasons. This step of rolling back from normalization is called denormalization. The sole purpose of denormalizing a database is to improve performance. One of the major benefits of normalization techniques is the reduction of redundant data, which results in efficient usage of disk storage space. But at the same time, because data is now split into multiple tables, joins are required to retrieve information that resides in various tables. This results in slower performance for queries. Continuing with the scenario of ABC Corp, assume that you want to find out the names of all the employees who are working on a project with project ID P100. For a normalized database (Tables 32.5 and 32.6), you can get the desired results by joining the employee_profile and employee_project tables. The query example is shown here:

SELECT name
FROM employee_profile, employee_project
WHERE employee_profile.emp_id = employee_project.emp_id
AND employee_project.proj_id = 'P100'

If the database were denormalized (see Table 32.1), you could have gotten the same results without the join. The query example is shown here:

SELECT name
FROM employee_project
WHERE proj_id = 'P100'

In the first example, storage efficiency is achieved at the cost of performance. In the second example, better performance is achieved at the cost of higher usage of disk space. You can see that there is a trade-off between performance and cost. Therefore, to build a system with low cost and high performance, you must first normalize your database; then if you experience any performance bottlenecks, you should consider denormalizing the database.

Criteria for Denormalization


Before you denormalize a database, weigh the expected performance gain against the cost of storing and maintaining redundant data, and keep the following point in mind:



If you are experiencing application-wide performance problems, denormalizing should not be the first step you take to try to fix the problem. Before you denormalize your database, make sure that other components of the system (such as hardware, network, and SQL Server) are configured optimally.


Methods of Denormalization


Various methods can be employed to denormalize a database table. One or more of these methods can be used to achieve the desired performance goals. Some of the most common methods for denormalization are discussed in the following paragraphs.

Horizontal Partitioning

In this method of denormalization, a large table is split into multiple smaller tables based on specific criteria. Each resultant table has the same data structure (that is, the same column names and column properties) as the parent table but contains mutually exclusive data. This method is commonly used for tables that contain an extremely large number of rows. Queries on such large tables are sometimes quite slow because of the large amount of data being scanned. Horizontal splitting of the data can be explained with the example of the employee_profile table (Table 32.6). Assume that most of the queries on the employee_profile table pertain to exempt employees and that the table contains employees of both types (exempt and nonexempt). This table can be split into two tables based on whether an employee is exempt or nonexempt. Tables 32.7 and 32.8 depict two tables, exempt_emp_profile and nonexempt_emp_profile, that are the result of a horizontal split of the employee_profile table.

Table 32.7. Horizontal partitioning: the exempt_emp_profile table.

emp_id name exempt salary bonus
100 Jeff Y 50000 1000
300 Sue Y 70000 1000
400 Dave Y 50000 2000

Table 32.8. Horizontal partitioning: the nonexempt_emp_profile table.

emp_id name exempt salary bonus
200 Sam N 60000 0000
500 John N 80000 5000


Horizontal partitioning is quite useful when a large amount of historical information is kept along with the currently active data. If historical information is used only occasionally, it can be placed in multiple tables (for example, one table for each month), and active data can be stored in a separate table. Programming with tables that are horizontally split, however, can sometimes become very complicated.
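When a query does need employees of both types, the partitions can be recombined with UNION ALL; a minimal sketch using the two tables above:

-- recombine the two horizontal partitions into one result set
SELECT emp_id, name, exempt, salary, bonus
FROM exempt_emp_profile
UNION ALL
SELECT emp_id, name, exempt, salary, bonus
FROM nonexempt_emp_profile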


Vertical Partitioning

A database in SQL Server consists of multiple 2K pages. The number of rows on a page depends on the width of the table (plus some overhead); the wider the table, the fewer the rows per page. Significant performance gains can be achieved by reducing the number of I/Os on a table. Vertical splitting is a method of reducing the width of a table by splitting the table into two or more tables: all frequently used columns are kept together in one table, and the others are kept in another table. Because each resulting table is narrower, more rows can be accommodated per page, fewer I/Os are generated, and more data can be cached in SQL Server memory. Take the example of the employee_profile table. If name and salary are the only two columns that are accessed frequently, the employee_profile table can be partitioned vertically as shown in Tables 32.9 and 32.10.

Table 32.9. Vertical partitioning: the primary_employee_profile table.

emp_id name salary
100 Jeff 50000
200 Sam 60000
300 Sue 70000
400 Dave 50000
500 John 80000

Table 32.10. Vertical partitioning: the secondary_employee_profile table.

emp_id exempt bonus
100 Y 1000
200 N 0000
300 Y 1000
400 Y 2000
500 N 5000
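To reassemble a complete employee row, the two partitions are joined on the shared emp_id key; a minimal sketch:

SELECT p.emp_id, p.name, p.salary, s.exempt, s.bonus
FROM primary_employee_profile p, secondary_employee_profile s
WHERE p.emp_id = s.emp_id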

Adding Redundant and Computed Columns

Joins are usually expensive in terms of CPU usage and from an I/O point of view. Performance gains can be achieved by reducing the number of joins in a query. You can do this by keeping some redundant information in a table so that you can avoid joins in some frequently executed queries. Assume that you want to find out the names of all the employees who are working on a project with project ID P100. For a normalized database (Tables 32.5 and 32.6), you can get the desired results by joining the employee_profile and employee_project tables. The query is shown here:

SELECT name
FROM employee_profile, employee_project
WHERE employee_profile.emp_id = employee_project.emp_id
AND employee_project.proj_id = 'P100'

Now if you denormalize the employee_project table and add an additional column, name, to the table (see Table 32.11), you eliminate the need for a join in the query. Although Table 32.11 is in violation of the first normal form, it gives better performance because of the absence of a join. The query is as shown here:

SELECT name
FROM employee_project
WHERE proj_id = 'P100'

Table 32.11. Adding a redundant column to the employee_project table. Denormalization with a violation of the first normal form.

emp_id name proj_id e_p_start
100 Jeff P100 3/3/96
100 Jeff P200 4/4/96
200 Sam P100 2/2/95
300 Sue P300 1/1/96
400 Dave P300 1/1/96
500 John P300 3/18/96
500 John P100 1/1/96

Some queries compute aggregate values derived from one or more columns of a table. Computation of aggregate values (such as mathematical calculations) can be highly CPU intensive. If such computations are frequent enough, they can impact performance adversely. One of the techniques for dealing with such situations is to create an additional column that stores the derived value. For the employee_profile table (Table 32.6), if the gross salary (salary + bonus) is being calculated quite often, it might be more beneficial to add a column called gross_salary to the employee_profile table. A trigger can be created on the table to maintain the value in the gross_salary column, as sketched after Table 32.12. Although this is a violation of the third normal form, it might provide significant performance improvements in an application that uses several derived columns. Table 32.12 illustrates this method of denormalizing the database.

Table 32.12. Adding a computed column to the employee_profile table.

emp_id name exempt salary bonus gross_salary
100 Jeff Y 50000 1000 51000
200 Sam N 60000 0000 60000
300 Sue Y 70000 1000 71000
400 Dave Y 50000 2000 52000
500 John N 80000 5000 85000
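A minimal sketch of such a trigger; the trigger name is illustrative, and the join through the inserted pseudo-table assumes the emp_id key shown above:

CREATE TRIGGER trg_gross_salary
ON employee_profile
FOR INSERT, UPDATE
AS
-- recompute gross_salary for only the rows touched by the statement
UPDATE employee_profile
SET gross_salary = employee_profile.salary + employee_profile.bonus
FROM employee_profile, inserted
WHERE employee_profile.emp_id = inserted.emp_id

Note that, per the rules for updates in place discussed later in this chapter, defining an update trigger on a table prevents SQL Server from performing updates in place on it.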

Using Indexes


SQL Server provides indexes for faster access to the data and to enforce the uniqueness of data in the database tables. Creating indexes on a table improves query search performance. Indexes are one of the most important aspects of database design and are essential for faster retrieval of data from various tables. You create indexes on a table by specifying the column names that constitute a key. SQL Server provides two types of indexes: clustered and nonclustered. Both are B-Tree structures and require separate storage space in the database. You can create indexes either by using the CREATE INDEX statement or by using a UNIQUE or PRIMARY KEY integrity constraint in the CREATE TABLE statement. SQL Server allows only one clustered index per table (soon I will explain why) and up to 249 nonclustered indexes. A sketch of both ways of creating an index follows.
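Here, the employee_profile table from the chapter is used; the constraint and index names, and the column datatypes, are illustrative:

CREATE TABLE employee_profile
(
emp_id int NOT NULL CONSTRAINT pk_employee_profile PRIMARY KEY CLUSTERED,
name varchar(30) NOT NULL,
exempt char(1) NOT NULL,
salary money NOT NULL,
bonus money NULL
)

-- a nonclustered index to speed searches on the name column
CREATE NONCLUSTERED INDEX ix_employee_name
ON employee_profile (name)

Before I get into the specifics of each type of index, it is important that you understand the storage and data-access techniques of SQL Server.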

SQL Server Data Access Technique


A SQL Server database is built on a logical storage hierarchy of pages, extents, and allocation units, each of which is described in the following sections.


Pages

A page is the minimum allocation unit in a SQL Server database. The size of each page is 2048 bytes (2K). Pages allocated to a table are maintained in the form of a doubly linked list; that is, each page has a pointer to the next as well as the previous page in the chain. Each page also begins with a header that stores housekeeping information, such as the page number, the object to which the page belongs, and the pointers to the previous and next pages in the chain.

Figure 32.1 shows a typical page structure in SQL Server.



Because each row must fit completely on a page, if a table has a row width greater than 981 bytes (1962 divided by 2), the remaining space on each page is rendered useless because no other row can fit into the rest of the space.

FIGURE 32.1. Page structure in a SQL Server database.

Extents

An extent in SQL Server is a block of eight contiguous 2K pages (16K). Extents are used for space allocation for tables in a database. Space is always allocated to tables one extent at a time. A new extent is allocated to a table at the time of table creation or when a previously allocated extent becomes full. An extent is deallocated from a table when none of its pages contain data. The extents within an allocation unit form a doubly linked list; that is, each extent has pointers to the previous and the next extent in the list. Page links are maintained only within an extent; links to pages on a different extent are maintained through extent-level links. Figure 32.2 shows a typical extent structure.

FIGURE 32.2. Extents in a SQL Server database.

Allocation Units

When a new database is created with the CREATE DATABASE T-SQL command, SQL Server internally divides the total space into blocks of 256 2K pages. Each such block is called an allocation unit, and each allocation unit is divided into 32 extents of eight contiguous pages. Allocation units, too, form a doubly linked list, with each allocation unit holding pointers to the previous and the next unit in the list. The first page of each allocation unit stores these pointers and tracks space usage within the unit: the free extents, as well as the extents that have been allocated to specific tables. A representation of the storage scheme of an allocation unit is shown in Figure 32.3.

FIGURE 32.3. Allocation units in SQL Server.

Because space allocation for tables is always in terms of extents, no two tables can share an extent within an allocation unit. Multiple tables can, however, share the same allocation unit, and a table can span multiple allocation units, depending on the number of rows in the table. If a table has no indexes, data is always added at the end of the last page of the table. For performance reasons, SQL Server tries to allocate data for a table in contiguous extents in the current allocation unit. If no contiguous extents are available, it tries to allocate an extent within the same allocation unit. If no extents are available within the allocation unit, it tries to allocate an extent on allocation units on which the table already has extents.

In the example shown in Figure 32.4, Table A and Table B are using separate extents on Allocation Unit 1. Table A also has an extent allocation on Allocation Unit 2. The first page of each table stores information about all the allocation units the table is split across. In the example shown in Figure 32.4, the first page of Table A keeps pointers to Allocation Units 1 and 2. For Table B, the pointers are only to Allocation Unit 1. The first page of each allocation unit stores the information about the extents that have been allocated to various tables. The location of the first page of each table is stored in the sysindexes system table. Now you'll see what happens if a user requests some data from Table A that resides on Page 5 of Extent 18 on Allocation Unit 2 (see Figure 32.4). SQL Server fetches the starting data page number of Table A from the sysindexes system table. Because the first page of a table has pointers to all the allocation units on which table data is present, this page has pointers to Allocation Units 1 and 2. The first page of Allocation Unit 1 points to Extent 2 and Extent 31. SQL Server reads all the pages on both extents, then continues on to Allocation Unit 2, which points to Extent 18 for Table A. SQL Server reads pages in Extent 18 until it reaches Page 5, where it finds the requested data.

FIGURE 32.4. Locating table data in SQL Server.

This whole process of reading all the pages in a table is called a table scan. Table scans are very slow because SQL Server starts reading from the first page and continues reading until it finds the desired result set. For tables with many records, and for complex joins involving such tables, table scans are unacceptably slow. That is why indexes are created on tables: to eliminate table scans and access the data directly rather than sequentially. Indexes in SQL Server are created as B-Tree structures, and they provide more direct access to the data. Like data pages, index pages are stored as 2K pages; unlike data pages, index pages store the values of the indexed columns and pointers to the data.
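As an aside, one way to see how many pages a query actually reads, and therefore whether it scanned the whole table, is the STATISTICS IO session setting; a minimal sketch:

-- report the scan count plus logical and physical reads for each table in the query
SET STATISTICS IO ON
SELECT name
FROM employee_profile
WHERE name = 'Sue'
SET STATISTICS IO OFF

The following paragraphs explain the B-Tree structure.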

B-Tree Structure


There are various methods of storing indexes; SQL Server stores its indexes as B-Tree (Balanced Tree) structures. As the name suggests, indexes are stored in the form of a balanced tree with multiple levels, and they use separate storage space in SQL Server. A B-Tree has three kinds of levels, described in the following sections: the root level, the intermediate levels, and the leaf level.


Root Level

The root level is the highest level of an index. Usually, index pages contain more rows than data pages, simply because index pages store only the key column values and pointers to the data pages. Depending on the number of rows in a table, the root level points either to the data pages or to a lower, intermediate level. There is only one page at the root level.

Leaf Level

The leaf level is the lowest level of an index. This level stores pointers that point directly to the data pages. There might be multiple pages at the leaf level, depending on the number of rows in the table.

Intermediate Level

All the levels between the root level and the leaf level are termed intermediate levels. The number of intermediate levels depends on the total number of records in the table and the size of the key; a larger key and a larger number of records result in a greater number of intermediate levels. The levels are numbered in increasing order from the leaf level (also called Level 0) to the root level.

An example is necessary here to demonstrate how indexes are maintained. Assume that the table employee is indexed on a column called empid. Also assume that the data page can store three rows per page and the index page can store five rows per page (in reality, there will be many more rows on data and index pages). In the beginning, let's say that the employee table contains five records, as shown in Figure 32.5.

FIGURE 32.5. Basic B-Tree structure.

In the example shown in Figure 32.5, there is only one level of index: the leaf level. Now if three more records are inserted in this table, one data page is added to accommodate the new rows, and two additional index pages are required to maintain the B-Tree. This structure is depicted in Figure 32.6.

FIGURE 32.6. B-Tree structure after the data insert.

You can see from Figure 32.6 that there is an extra level of index pages to support data in the table. As more data gets added to this table, more index pages are utilized, and more intermediate levels are created. If you create an additional index on the same table, a separate B-Tree is created to maintain that index.

As mentioned before, SQL Server provides two types of indexes: clustered and nonclustered. They are both stored as B-Tree structures. Both of the indexes are covered in the following paragraphs.

Clustered Index


A clustered index enables you to create an index in which the physical order of the rows on data pages is the same as the physical order of the keys on index pages. That means data is physically sorted on the pages in the order of the indexed keys. Therefore, if you create a clustered index on empid, the B-Tree shown in Figure 32.6 will look like the one shown in Figure 32.7.

FIGURE 32.7. Clustered indexes in SQL Server.

If you look closely, you will see that the key values on the data pages and the leaf pages are identical. Therefore, to eliminate redundancy, the leaf pages are replaced by the data pages themselves. That means the bottom level of the index is actually the data, and there is one less level of indexing. This is shown in Figure 32.8.

FIGURE 32.8. B-Tree structure for clustered indexes.

Because there can be only one physical order of the key, you can have only one clustered index per table. Also note that the pointers in a clustered index do not point directly to data rows, but to the page on which the rows reside.
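A minimal sketch of creating such an index on the running example's employee table (the index name is illustrative):

CREATE CLUSTERED INDEX ix_employee_empid
ON employee (empid)

It's now time to examine the effect of various data manipulation statements on a table with a clustered index on it.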

Data Manipulation with Clustered Indexes

When a row is inserted into a table, because data is physically sorted on the pages, some rows move down to accommodate the new row. A row is added at the end of the page only when the key value is greater than the previous maximum value for the key. For example, in Figure 32.8 if you insert a record with empid = 750, then to accommodate this row, rows with empid = 800 and 900 move down one position. SQL Server also modifies the B-Tree structure to accommodate this insert. The result is shown in Figure 32.9.

FIGURE 32.9. Insert operation with a clustered index.

A delete operation is just the opposite of the insert statement. For example, if you delete a row from the table shown in Figure 32.9 in which empid = 750, then rows with empid = 800 and 900 move up one row and the B-Tree is now modified to point to the page with empid = 800.

An update statement is essentially a delete followed by an insert statement. Therefore, if you update empid 750 with 850, there is physical movement of the rows for the delete statement and then for the insert statement.



SQL Server enables you to write queries in such a way that an update can be performed "in place." That means there is no physical movement of the row. To perform an update in place, you must follow certain rules, including the following ones:
A query cannot update the columns participating in the clustered index.
The table can't have an update trigger defined on it.
The table must not be used for data replication.
An update in place results in better performance of the query due to the lack of physical movement of the row and the lower amount of data logging by SQL Server. For a complete list of rules for an update in place, refer to the SQL Server documentation.


If you refer to Figure 32.9, you will notice that all the data pages are completely full and there is no place to add a new row. In such cases, when an insert takes place, SQL Server takes a new page from the extent, adds it to the page chain, and splits the rows 50-50 between the old page and the new page. Figure 32.10 shows an example of page splitting in which a new record is inserted with empid = 950.

FIGURE 32.10. Page splitting in SQL Server.
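Page splits on insert-heavy tables can be deferred by building the index with free space in its pages, using the FILLFACTOR option of CREATE INDEX. A minimal sketch, leaving each page half empty when the index is built (FILLFACTOR is not maintained as rows are later added):

CREATE CLUSTERED INDEX ix_employee_empid
ON employee (empid)
WITH FILLFACTOR = 50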

Nonclustered Index


Nonclustered indexes are also B-Tree structures. A nonclustered index differs from a clustered index in the following ways: the physical order of the rows on the data pages is independent of the order of the index keys; the leaf level stores pointers (row IDs) to the data rather than being the data itself; and a table can have up to 249 nonclustered indexes but only one clustered index.

A typical B-Tree structure for a nonclustered index is shown in Figure 32.11. Because there is an additional level of indexing, a nonclustered index requires extra storage space.

FIGURE 32.11. B-Tree structure for a nonclustered index.

Data Manipulation with Nonclustered Indexes

When a row is inserted into a table with a nonclustered index, data is always added at the end of the page, but the index pages are kept in sorted order. The row ID of a row is a pointer into an offset table on each data page. The offset table stores the starting location of the data for each row on the page. The offset table starts at the last byte of the page and grows backward as records are inserted, using a 2-byte entry for each row. The advantage of using ROWIDs and the offset table is that when data gets inserted into or deleted from a page, the existing ROWIDs and the corresponding index pointers for the other rows remain unaffected. For an insert statement, a new ROWID is assigned to the row, and changes are made to the offset table to point to the location of the data. Figure 32.12 illustrates a scenario in which a record with empid = 750 is inserted into the table. You will notice that the record is added at the end of the data page with a new ROWID and that, unlike with clustered indexes, there is no physical movement of the rest of the records.

FIGURE 32.12. Insert operation with a nonclustered index.

When a row is deleted, it is removed from the data page, and all the rows below it move up. The ROWIDs of the other records remain unaffected; the offset table is modified to reflect the new locations of the data for the records that physically moved on the page. On the index pages, the B-Tree is modified to remove the reference to the deleted row. Because the leaf pages store the ROWIDs of the rows, the other entries in the B-Tree remain unaffected by the delete operation. Figure 32.13 illustrates a scenario in which a record with empid = 700 is deleted from the table. You will notice that the ROWIDs of the other records remain unchanged.

FIGURE 32.13. Delete operation with a nonclustered index.

The Advantages of Indexes


Indexes are primarily used for faster retrieval of data, and creating appropriate indexes is one of the most important aspects of database design. Although indexes speed up search operations, inserts, deletes, and updates might be slower, because each of these operations might modify the B-Tree index structure of the table. An index created without any consideration for performance usually results in slower data modification statements. Therefore, it is very important to choose indexes carefully and to measure their overall performance impact on the system.

SQL Server indexes are mostly transparent to users. SQL Server's cost-based query optimizer decides whether to use an index on a table for a particular query. If a table has more than one index, the query optimizer chooses the index that is most effective in terms of the number of I/Os. The following broad guidelines might help you decide whether to index a column: index the columns that appear frequently in WHERE clauses and join conditions; avoid indexing columns with very few distinct values, because such indexes are rarely useful; and avoid indexing heavily modified tables too generously, because every index adds overhead to inserts, updates, and deletes. SQL Server does, however, allow an application developer to pass hints to the query optimizer to use a specific index in a query, as sketched next.
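A minimal sketch of an optimizer hint in T-SQL, naming an index after the table in the FROM clause; the index name is the illustrative one created earlier in this chapter:

SELECT name
FROM employee_profile (INDEX = ix_employee_name)
WHERE name = 'Sue'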


Things to Consider when Planning a Database


Database planning is one of the most complex parts of application design. The following paragraphs describe the steps necessary to ensure that you have a robust and high-performance database application.

Database Design


A good database design is fundamental to the success of any application environment. Database design can be split into two parts: logical design and physical design, discussed in turn in the following sections.


Logical Design

Logical design is the first step when you start planning your application. This step involves collecting user requirements. Database designers take these specifications and identify entities, attributes, relationships between entities (one to many, many to many, and so on), primary keys, and the like. Based on this information, an entity relationship model is built for the system. An entity relationship diagram should satisfy at least the first three rules of normalization. After this step, you are ready to translate your logical model into a physical one.

Physical Design

When the logical design is complete and conforms to the rules of normalization, entities become tables and attributes become columns in a table. This step involves actions such as creating the tables and their columns with appropriate datatypes, defining the primary keys, and choosing the indexes.

There are also some additional steps you should plan for. Although these steps are not related to the physical design of the database, they are important for providing maximum uptime for the database. They include planning backups of the databases and transaction logs and placing (and, if possible, mirroring) the transaction log on a separate disk to allow full recovery in the event of a disk crash.


Planning Client Configuration


It is always important to know what kind of resources will be required to run an application smoothly. An end-user application might suffer from performance problems if not enough memory is available on the client or if the hardware is slow. You should also determine whether the end-users will be running any other resource-intensive applications on the client along with this application. Therefore, at the time of designing the frontend application, estimate the amount of resources (memory, CPU, disk space) that will be required in order to run the application smoothly.

Planning Server Configuration


In a client-server environment, the server usually does most of the work. Therefore, it is important that the hardware is configured to provide an excellent price/performance ratio. The following sections discuss various components of the server configuration.

Hardware Components

This step deals with defining hardware resources for the machine on which SQL Server will be installed. The important subsystems of server hardware that can make a significant impact on performance are the system processors, the disk subsystem, and memory, each discussed in the following sections.

It is very important that sufficient resources always be available on the server hardware. All the previously listed parts of hardware work together. A system that suffers from performance problems at the hardware level might indicate that one subsystem is idle while another one has a very high percentage of utilization. For example, low CPU utilization might indicate that the processor is waiting on the disk subsystem to finish I/Os.

System Processors

To provide a sufficient number of processors on the machine, determine how many concurrent users the server must support and what kind of workload they will generate (many short transactions, a few heavy queries, and so on).

SQL Server supports Symmetric Multiprocessing (SMP). An SMP machine has more than one CPU, and each CPU can handle a load independent of the other CPUs. On SMP hardware, a SQL Server thread is not tied to a particular processor and might therefore execute on a different processor each time. This allows SQL Server to run multiple tasks simultaneously, resulting in excellent load balancing.

Disk Subsystem

Choosing the right kind of disk subsystem is an important aspect of application design. This step helps you determine how much disk space is required, how many physical drives to use, and where to place the data, the indexes, and the transaction logs.

The placement of data on devices makes a significant impact on the performance of the system. It is quite important that you keep a transaction log on a separate disk (and mirror it, if possible) to provide full recovery in the event of a disk crash. Keeping nonclustered indexes on one device and data on another device (by using SQL Server segments) usually results in good performance of the system. If the database and transaction logs are to be backed up on disk drives, you should consider that fact also in your capacity planning. You should also forecast the growth of the database and acquire disk space accordingly.
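As a sketch of placing a nonclustered index on its own device, assuming a segment named index_seg has already been defined on that device with sp_addsegment, the CREATE INDEX statement accepts a segment name:

-- build the index on the segment (and thus the device) named index_seg
CREATE NONCLUSTERED INDEX ix_employee_name
ON employee_profile (name)
ON index_seg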

Memory

Memory dedicated to SQL Server is primarily used to cache data and procedures. SQL Server also provides the capability to keep tempdb in RAM. The total physical memory on the machine is one of the important factors governing the performance of SQL Server. Insufficient memory causes more I/Os from the disk; because accessing a disk is much slower than accessing memory, a large number of physical I/Os can lead to performance degradation. If more memory is installed on a server, it can cache a larger amount of data, and many requests for data can be satisfied from memory rather than via physical I/Os. SQL Server can use a maximum of 2GB of memory on a machine. The amount of memory installed on a machine is primarily determined by the size of the database.

Network Configuration


It is important to choose an appropriate network protocol. For Microsoft SQL Server, the Named Pipes protocol delivers the highest performance. TCP/IP, however, is the most widely used protocol for SQL Server, and with TCP/IP, UNIX clients can also communicate with SQL Server. SQL Server can listen on several protocols simultaneously. Because listening on multiple protocols adds overhead, choose the ones that are appropriate for your network environment, and disable the others.

A network bottleneck is usually caused by insufficient network bandwidth, heavy network traffic, or a slow network. A poorly designed application can create a network bottleneck by retrieving hundreds of rows from the database even though only a few rows are necessary for the frontend application. Some network-related bottlenecks can be eliminated by efficient application and database design and by using database objects such as stored procedures, rules, and triggers. Other options include using a faster network, keeping the machine in a subnet with less traffic, and adding network redundancy.

SQL Server Configuration


All the previously mentioned components (client hardware, network, server hardware) are essential for a well-performing system. But they also add to the overall cost of the system, so investment in these components is limited by budget constraints. No matter how many resources you provide on the client, the network, and the hardware, there is no substitute for an efficient database design, well-written queries, and a well-configured SQL Server. A database administrator can configure SQL Server to provide excellent performance and throughput without compromising data integrity. SQL Server provides various parameters that can be configured to provide optimal performance for a particular environment.
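A minimal sketch of examining and changing such a parameter with the sp_configure system procedure; the memory value shown is purely illustrative (in SQL Server 6.5 the memory parameter is specified in 2K units, so 16384 means 32MB):

-- list the current configuration values
EXEC sp_configure

-- set SQL Server memory to 32MB (16384 x 2K pages), then apply the change
EXEC sp_configure 'memory', 16384
RECONFIGURE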

Denormalization


To improve performance, sometimes it is necessary to backtrack from a fully normalized database. The various denormalization techniques include horizontal partitioning, vertical partitioning, and adding redundant or computed columns.

These methods of denormalization were discussed earlier in this chapter, in the section Methods of Denormalization. Choose the method that provides the best performance improvement for your application design.

Summary


A good database design should conform to the third normal form. A logical model in the third normal form can be easily translated into a physical model. Denormalization techniques can be utilized for performance reasons. You should, however, have a very good understanding of the logical and physical model of the system before making an attempt at denormalizing data. The next chapter explores the areas where SQL Server can be used.
