(Warning: this is a rather technical post, skip it if RDBMS and T-SQL sound like weird sexual practices to you, or at least jump to the end which concludes on a business perspective.)
As I’m redoing SciFan from the ground up to take it to the next level (i.e. to generate real money and go beyond its current limits), I’m spending time renormalizing its database, and more generally speaking, I’m considering doing more things at the database level rather than in higher-level code. I was already using stored procedures so I’m not starting from scratch, but the core of the database is a legacy left by someone else who was a bit sloppy in some places.
I’m learning about foreign key constraints to enforce referential integrity and I’m tempted to use cascading referential integrity constraints to avoid orphan or outdated data when rows are updated or deleted. I’m aware this should be done with caution since you might wipe out more than you think if you’re careless, so I’ve run a couple tests on a spare database to make sure I knew what I was doing. (I’ve already written and executed destructive SQL queries in the past, so I’d rather stop being stupid on that front — kind of like the time I bought more shares of a stock I was trying to dump.) But it’s attractive to know the database will sort of maintain itself by automatically deleting those obsolete rows in the ‘pivot’ tables used to maintain many-to-many relationships.
What I’m wondering about is, what’s the current practice among web architects and developers these days? Are people quite loose at the database level and stricter in their code, or do they indeed make extensive use of constraints (not only foreign keys but also check constraints to determine valid values, as well as more basic stuff such as disabling NULL values and choosing the most appropriate datatype), triggers, and stored procedures? It seems more healthy to do as much as you can in the database — why let your applications put bad data into it if you can prevent it in the first place? Curious minds want to know. For all I know I’m just late to the party, but when I see the nonsensical restrictions on most web forms that are unable to properly handle even slight format variations ("error, you didn’t write your phone number like a machine would have, please reformat your brain"), I suspect many developers are just lazy if not dumb in the way they approach data handling both on the front-end and back-end.
Since some of these features are product-specific (for instance in the Microsoft world cascading referential integrity constraints were introduced with SQL Server 2000), I wonder whether databases might end up being under-used in web environments. I guess corporate applications using Oracle stand on one end of the spectrum (where it seems it’s been a common practice for a while), MySQL/PHP sites on the other, and Microsoft shops somewhere in between, but I can use more education on this issue.
To frame this in terms of project management and business benefits, the idea is to enforce the logic behind your database as strictly as possible to guarantee the quality of the data on an ongoing basis. To protect itself from GIGO (garbage in, garbage out), see for instance how eBay is letting people who sell media products (CDs, DVDs et al.) select their item from a pre-existing list (I believe this is grown out of the former Half.com). If you’re in ecommerce and your data is not clean, people who browse or search your site will be led to dead-ends or won’t see some of the things they might be interested in, resulting in frustrated and confused customers and lower sales.
The downside to doing more in the database is that your developers need strong database skills, which again is probably already the case in big corporations and within online leaders, but what’s happening further down the totem pole? The other possible caveat I can see is potentially more work to change platforms or RDBMS versions, as here we’re leaving standard SQL domain to enter extensions and behaviors that change between products and releases. The next generation of Microsoft dev tools will take this one step further, but again this is drifting from the lowest common denominator. Portability has a cost.
Data integrity and quality are at stake, and while we’re at it I’d like advice on how to avoid duplicate entries at the data entry level, and clean them when they’re already in there (cheap and lightweight, thanks, I’m not in the market for something like Trillium). Moreover, performance can also benefit greatly (thanks to indexes, properly written stored procedures, maybe views as well, though I need to investigate if/when to use the later). What this means is that database skills are critical yet may lack the visibility they deserve within the skill set necessary to make a web project succeed. Do employers test the ability of web developers candidates to write efficient SQL queries and milk the database engine for all it’s worth?
08/21/04 update: what comes out must come in: Please, please, please, learn about injection attacks!
06/03/05 update: Databases as Services.