O'Reilly FYI

News from within O'Reilly

Relational Database Technology: A Crash Course

 
By Kathryn Barrett
November 6, 2008 | Comments: 4
smallerASPcover.gifThe following is an excerpt from the newly released Programming ASP.NET 3.5 by Jesse Liberty, Dan Maharry, and Dan Hurwitz. According to them, ASP.NET 3.5 is possibly the fastest, most efficient, most reliable, and best-supported way to create interactive web applications today. Combined with the development tools available from Microsoft (both free and commercial), it is incredibly easy to create websites that look great and perform well.

Appendix B. Relational Database Technology: A Crash Course

You can use .NET to access data from any data source: relational databases, object databases, flat files, and text files. The vast majority of web applications, however, will access data from a relational database such as SQL Server, Oracle, or MySQL. Though we could certainly write an entire book on relational databases and another on SQL, the essentials of these technologies are not hard to understand.

Tip

All of the examples in this appendix assume you are working with either SQL Server or SQL Server Express. Users of other relational databases will find that the lessons learned here transfer well to their environment, but be careful with applications such as Oracle, MySQL, and Access that use a different variation of SQL.

A database is a repository of data. A relational database organizes your data into tables that are connected to one another by the data they contain. For example, one table might contain a customer's information and a second table might contain information about orders, and each order is owned by an individual customer.

Similarly, you might have one table of cars and a second table of car parts. Each part can be in one or more cars, and each car is made up of many parts. Or you might have a table for bugs and a table for developers. Each bug is owned by one developer, and each developer has a list of bugs he owns.

Tables, Records, and Columns

The principal division of a database is into tables. Every table in a relational database is organized into rows, where each row represents a single record. The rows are organized into columns. All the rows in a table have the same column structure. For example, a Customer table might have columns for the customer ID (the unique ID of the customer placing the order), the name of the customer, the email address of the customer, and so forth.

It is common to make an analogy between tables and classes and between rows and objects. The Customer table, for example, tells you a great number of things about a customer, just as a Customer class does. Each row in the Customer table describes a particular customer, much as an instance of the Customer would do.

This analogy is compelling, but limited. Tables, like classes, typically describe one logical entity and all of what you know about that entity. The mapping isn't reflexive, though—classes can contain properties that actually map across several different tables. This is similar to the notion of relationships between tables to ensure that consistent data in a database is not reflected back as relationships between classes.

One of the challenges facing an object-oriented programmer is overcoming these and other design differences between the object model on the one hand and the database model on the other. Typically, the interface between the backend relational database and the objects in the application is managed by creating a database interface layer of objects that negotiate between the creation of objects and the storage of information in the database tables, all created by hand. The past few years, however, have seen a number of products known as object-relational mappers, which can scan the structure of a database and, with a little help from the developer, produce a set of classes which represent the entities and relationships and make that transition between your code and your database that much easier. Language Integrated Query (LINQ), covered in Chapter 10, Presenting LINQ, actually includes two object-relational mappers:

  • LINQ to SQL looks after the simpler case where your classes generally have a one-to-one relationship with tables in the database.

  • LINQ to Entities (available with VS2008 Service Pack 1) deals with the more enterprise-level case where your classes generally map to information in more than one table in your database.

This distinction is not a hard and fast rule by any means, but it does make the general distinction between the two that Microsoft is using.

Several other mappers are also available, including:

Table Design

To understand the issues in table design, consider a database for recording orders (such as the AdventureWorksLT database used in this book). You need to know who placed each order, and it would be useful to know the email address, phone number, and other identifying information about each person as well.

You can imagine a form in which you display details about a given order, and in that detail page you offer the email address and phone number of the customer so the salesperson working on the order can contact that customer in case of a delay or stock issue.

You could store the identifying information with each order in an Orders table (named SalesOrderHeader in the database, for reasons we'll mention in a minute), but that would be inefficient. If John Doe placed 50 orders, you'd rather not repeat John Doe's email address and phone number in 50 records. It's also a data maintenance nightmare. If John Doe changes his email address and phone number, you'd have to make the change in 50 places.

Instead, the customer details are kept in a second table, called Customer, in which each row represents a single customer. In the Customer table, there will be a column for the CustomerID. Each customer will have a unique ID, and that field will be marked as the primary key for the Customer table. A primary key is the column or combination of columns that uniquely identifies a record in a given table.

The Orders table will use the PersonID column as a foreign key. A foreign key is a column (or combination of columns) that is a primary (or otherwise unique) key from a different table. The Orders table uses the CustomerID, which is the primary key in Customer, to identify which person placed the order. If you need to determine the email address for that person, you can use the CustomerID to look up the Customer record in the Customer table and that will give you all the detailed information about that person.

By "factoring out" the details of the person's address into a Customer table, you reduce the redundant information in each Order record. This process of taking out redundant information from your tables is called normalization.

The same process explains why an order is actually stored in two tables—SalesOrderDetail and SalesOrderHeader—rather than one. An order can be split into individual requests for products and so the SalesOrderHeader table stores the details of the order as a whole—when it is due, when it was shipped, how much it cost in total, and so on—whereas rows in the SalesOrderDetail table contain details for each separate product in the order.

Normalization

Normalization makes your use of the database more efficient, and it reduces the likelihood of data corruption. If you kept the customer's email address in the Customer table and in the SalesOrderHeader table, you would run the risk that a change in one table might not be reflected in the other. Thus, if you changed the person's email address in the Customer table, that change might not be reflected in every row in the SalesOrderHeader table (or it would be a lot of work to make sure that it was reflected). By keeping only the CustomerID in SalesOrderHeader, you are free to change the email address or other personal information in Customer, and the change will automatically be reflected for each order.

Just as Visual Basic and C# programmers want the compiler to catch bugs at compile time rather than at runtime, database programmers want the database to help them avoid data corruption. A compiler helps avoid bugs by enforcing the rules of the language. For example, in C# you can't use a variable you've not defined. SQL Server and other modern relational databases help you avoid bugs by enforcing constraints that you create. For example, the Customer table marks the CustomerID field as a primary key. This creates a primary key constraint in the database, which ensures that each CustomerID is unique. If you were to enter a person named Dan Maharry with a PersonID of 123, and then you were to try to add Jesse Liberty with a PersonID of 123, the database would reject the second record because of the primary key constraint. You would need to give one of these people a different, and unique, CustomerID.

Declarative Referential Integrity

Relational databases use Declarative Referential Integrity (DRI) to establish constraints on the relationships among the various tables. For example, you might declare a constraint on the SalesOrderHeader table that dictates that no order may have a CustomerID unless that CustomerID represents a valid record in the Customer table. This helps you avoid two types of mistakes. First, you cannot enter a record with an invalid CustomerID. Second, you cannot delete a Customer record if that CustomerID is used in any order. The integrity of your data and the relationships among records is thus protected.

SQL

The language of choice for querying and manipulating databases is Structured Query Language, often referred to as SQL. SQL is often pronounced "sequel." SQL is a declarative language, as opposed to a procedural language, and it can take awhile to get used to working with a declarative language if you are used to languages such as Visual Basic and C#.

Most programmers tend to think in terms of a sequence of steps: "Find me all the orders, then get the customer's ID, then use that ID to look up that customer's records in Customer, then get me the email address." In a declarative language, such as SQL, you declare the entire query and the query engine returns a set of results. You are not thinking about a set of steps; rather, you are thinking about designing and "shaping" a set of data. Your goal is to make a single declaration that will return the right records. You do that by creating temporary "wide" tables that include all the fields you need and then filtering for only those records you want: "Widen the SalesOrderHeader table with the Customer table, joining the two on the CustomerID, then filter for only those that meet my criteria."

The heart of SQL is the query. A query is a statement that returns a set of records from the database. Typically, queries are in this form:

Select column,column,column from table where column = value

For example, you might like to see information about the customers served by Janet. To do so you would write:

Select FirstName, LastName, CompanyName from Customer where SalesPerson = 'Janet'

Joining Tables

SQL is capable of much more powerful queries (see the upcoming "SQL Select Statement" sidebar for the full syntax as provided by Microsoft).

For example, suppose you'd like to know about all the orders by all the customers served by Janet after January 1, 1997. You might create this query:

Select c.FirstName, c.LastName, c.CompanyName, o.OrderDate, o.TotalDue 
   from Customers c
inner join SalesOrderHeader o on o.customerID = c.customerID
where c.SalesPerson = 'Janet' and o.OrderDate > '1/1/97'

At first glance, you appear to be selecting orderDate from the Customer table, but that is not possible because the Customer table does not have an orderDate. You must take care to identify the table that each field comes from by prefixing each field with an "alias" defined in the query. The key phrase is:

inner join SalesOrderHeader o on o.customerID = c.customerID

It is as though the join phrase creates a temporary table that is the width of the Customer table and the SalesOrderHeader table joined together. The on keyword dictates how the tables are joined. In this case, the tables are joined on the CustomerID column in Customer and in SalesOrderHeader.

Each record in Customer (represented by the alias c) is joined to the appropriate record in Orders (represented by the alias o) when the CustomerID fields match in both records.

When you join two tables you can say "get every record that exists in one table but not necessarily the other" (this is called an outer join), or you can say, as we've done here, "get only those records that exist in both tables" (called an inner join).

Tip

Inner joins are the default, so writing join is the same as writing inner join.

The inner join shown in the preceding code snippet says: get only the records in Orders that match the records in Customer by having the same value in the CustomerID field (on o.customerID = c.customerID).

The where clause further constrains the search to those records where the SalesPerson field in Customer is an exact match for the string Janet and where the OrderDate in the Orders table is greater than January 1, 1997:

where c.SalesPerson = 'Janet' and o.orderDate > '1/1/97'

But because Orders is the only table with an orderDate column, there is no ambiguity if we use just the column name:

where c.SalesPerson = 'Janet' and orderDate > '1/1/97'

It's often easier to read, though, if all columns are given a prefix rather than you having to figure out which table each column is part of. SQL is able to translate the string "1/1/97" into a DateTime representing January 1, 1997.

Using SQL to Manipulate the Database

You can use SQL to search for and retrieve data and to create, update, and delete tables and generally manage and manipulate the content and the structure of the database. For example, you can update the FirstName of a specific company:

Update Customer set FirstName = 'Jesse' where CustomerId = '124'

You can add a new row to the Customer table:

INSERT INTO [Customer]
 ([FirstName], [LastName], [SalesPerson])
VALUES
 ('Dan', 'Maharry', 'Janet')

And you can delete a row from the table as well:

DELETE from Customer where FirstName='Dan' and LastName='Maharry'

Further Reference

In this book, the database server of choice is Microsoft's SQL Server 2005, but as we mentioned earlier, it is not the only kid on the block and there are a number of different free versions of the major databases to tide you over while you learn. They all have a lot of good documentation installed with them as well, making them well worth a read.

Likewise, there are many good books and websites to teach you a lot more about database design and SQL queries than you'll ever need (probably). For example:

  • http://www.sqljunkies.com

  • http://www.sqlservercentral.com

  • Head First SQL by Lynn Beighley (O'Reilly), a cracking beginner's guide to SQL

  • SQL Visual Quickstart Guide by Chris Fehily (Peachpit), a slimmer alternative to Head First SQL

  • SQL Queries for Mere Mortals by John L. Viescas and Michael J. Hernandez (Addison-Wesley), the book to look at once you've got the basics of SQL in your head

  • Professional SQL Server 2005 Programming by Rob Vieira (Wrox), the book to take you outside of just writing SQL and into thinking about SQL Server as a whole programming platform

  • Expert Oracle Database Architecture: 9i and 10g Programming Techniques by Thomas Kyte (Apress), the equivalent to Vieira's book for Oracle users


You might also be interested in:

4 Comments

Normalization makes your use of the database more efficient, and it reduces the likelihood of data corruption. If you kept the customer's email address in the Customer table and in the SalesOrderHeader table, you would run the risk that a change in one table might not be reflected in the other. Thus, if you changed the person's email address in the Customer table, that change might not be reflected in every row in the SalesOrderHeader table (or it would be a lot of work to make sure that it was reflected). By keeping only the CustomerID in SalesOrderHeader, you are free to change the email address or other personal information in Customer, and the change will automatically be reflected for each order.

This is the prime example where normalization fails. The SalesOrderHeader table should record the information that was current at the time of the order. The Orders table is supposed to hold an accurate record of the business transaction. This includes storing user data that was valid at the time of the order, not the one that is current six months later. In fact, that would mean tampering with data that you might want to use for legal disputes later on which is not a good idea. Auto updating the address of a user's past transaction is a plain silly idea (sorry :). Normalization doesn't buy you anything and only gets you in trouble.

>> Normalization doesn't buy you anything and only gets you in trouble.

On the contrary. The problem with the given example is that it is not normalised enough. A customer can have more than one address simultaneously - billing, shipping, legal - as well as the historical dimension you mention. Furthermore, more than one customer can share the same address.

So a proper solution would have Address as a separate table which is linked to Customer through an intersection table (to capture history). The Order would then merely have to link to the appropriate record in the CustomerAddress intersection table.

In my experience whats you into trouble is having to decide whether John Smith, Smith John and J Smith are three instances of the same customer.

Cheers, APC

On the contrary. The problem with the given example is that it is not normalised enough.

This is another valid conclusion, although, when adding natural object behaviour to a schema you end up with a 9-way-JOIN that will not perform very well. But I am biased towards the web here.

>> you end up with a 9-way-JOIN that will not perform very
>> well. But I am biased towards the web here.

I suppose I am biased towards applications working correctly. To a certain extent this is derives from the nature of the application.

To get a different result depending on which load balancer you hit doesn't really matter with something like Google. But it altogether matters with say online banking. I would rather my internet bank takes 10 secs to tell me I have 6000UKP in my account than incorrectly tell me I have 12000UKP in 0.1secs. Your tastes may vary.

Of course, in an ideal world we would deliver systems which are both fast and correct. Unfortunately this requires a degree of co-operation between data modellers, developers and DBAs which is usually lacking.

Cheers, APC

 

Popular Topics

Browse Books

Archives

Or, visit our complete archives.

FYI Topics

Recommended for You

Got a Question?