I’ve been researching data architecture practices. Along my travels I came across the book A Practical Guide to Enterprise Architecture (Coad Series). In the book, James McGovern, Scott W. Ambler, Michael E. Stevens, James Linn, Vikas Sharan, and Elias K. Jo write about effective data architecture practices.
Key Practices
With the benefit of experience, McGovern, Ambler, Stevens, Linn, Sharan, and Jo identify the following practices for effective data architecture:
- Establish an infrastructure that allows for rapid changes in business requirements and database technologies.
- Data that need to be shared and current should be centralized.
- Avoid monolithic database design.
- Use access rules to protect data.
- Data validation should occur at multiple tiers within an N-Tier architecture.
- Data should be replicated based on business requirements and only when necessary.
- Replicated data should be read-only.
- The replication topology implementation should meet existing business needs.
- Location transparency should be encouraged: applications that access data should not need to know where the data physically reside.
- Implement tools, processes, and policies to ensure data integrity.
- When designing a database for performance, consider total system performance, not just local performance.
- Prefer open standards to proprietary extensions.
- Protect credit card information using encryption (a short sketch follows this list).
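The credit card item is the only practice that doesn’t get its own section below, so here’s a rough idea of what field-level encryption can look like. This is a minimal sketch, assuming Python and the third-party cryptography package; the function names are my own, and a real system would load the key from a secrets store rather than generating it in place.

```python
from cryptography.fernet import Fernet

# Illustrative only: in production the key comes from a key vault or HSM,
# not from generate_key() at import time.
key = Fernet.generate_key()
cipher = Fernet(key)

def encrypt_card_number(card_number: str) -> bytes:
    """Turn the plaintext card number into an opaque token safe to store."""
    return cipher.encrypt(card_number.encode("utf-8"))

def decrypt_card_number(token: bytes) -> str:
    """Recover the plaintext; only components that truly need it should call this."""
    return cipher.decrypt(token).decode("utf-8")

if __name__ == "__main__":
    token = encrypt_card_number("4111111111111111")
    print(token)                      # this is what lands in the database
    print(decrypt_card_number(token))
```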
Establish an Infrastructure that Allows for Rapid Changes in Business Requirements and Database Technologies
McGovern, Ambler, Stevens, Linn, Sharan, and Jo write the following:
In most organizations, business requirements change frequently. The design of your data infrastructure must be agile and allow for easily and rapidly implemented changes to data models. Likewise, technology advances continue to outpace all but the most flexible organizations. Vendors are constantly adding new features to their engines that allow them to perform previously impossible tasks. The infrastructure must allow for replacing the database technology if necessary.
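As a rough illustration of what “allow for replacing the database technology” can look like in code, here’s a minimal sketch in Python (my choice of language, not the book’s). The environment variable names and the PostgreSQL fallback are assumptions; the point is that callers depend on a small factory rather than on a specific driver, so swapping engines becomes a configuration change.

```python
import os
import sqlite3

def get_connection():
    """Return a DB-API connection for whichever engine is configured."""
    engine = os.environ.get("DB_ENGINE", "sqlite")
    if engine == "sqlite":
        return sqlite3.connect(os.environ.get("DB_PATH", "app.db"))
    if engine == "postgres":
        import psycopg2  # third-party driver, needed only where this branch runs
        return psycopg2.connect(os.environ["DB_DSN"])
    raise ValueError(f"Unsupported engine: {engine}")

# Application code calls get_connection() and never imports a driver directly,
# so replacing the database technology doesn't ripple through the codebase.
```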
Data that Need to be Shared and Current Should Be Centralized
McGovern, Ambler, Stevens, Linn, Sharan, and Jo write the following:
Transactional data that are high volume and shared across locations should be centralized. Additionally, if the data for all locations must be current, centralization becomes mandatory. Some organizations have tried to work around this by replicating updates to distributed databases in remote locations with limited success. Using a replication approach increases the complexity of an organization’s data architecture and results in increased network traffic.
Data Validation Should Occur at Multiple Tiers within an N-Tier Architecture
McGovern, Ambler, Stevens, Linn, Sharan, and Jo write the following:
Data can be represented in many forms, including EDI streams, XML documents, and SOAP messages. Validation in these scenarios should occur before data are sent across the network. This reduces errors within a distributed architecture and can increase performance in many situations. For example, if a GUI-based application captures user-entered data in a data entry screen, errors can be corrected before the data are written to the database. In a distributed environment, it may be more efficient to validate the data within the GUI rather than have it travel through multiple geographically dispersed systems, only to be rejected by a database constraint.
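To make the multi-tier idea concrete, here’s a minimal sketch using Python and SQLite (my choice of stack, not the book’s). The same rule is checked in the client tier, where feedback is cheap, and again by a database constraint, which is the last line of defense; the table, fields, and rules are hypothetical.

```python
import re
import sqlite3

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_customer(record: dict) -> list[str]:
    """Client/GUI-tier validation: catch problems before anything crosses the network."""
    errors = []
    if not record.get("name", "").strip():
        errors.append("name is required")
    if not EMAIL_RE.match(record.get("email", "")):
        errors.append("email is malformed")
    return errors

# Database tier: the same rules enforced again by constraints, so a buggy or
# bypassed client still can't write bad data.
SCHEMA = """
CREATE TABLE customer (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL CHECK (length(trim(name)) > 0),
    email TEXT NOT NULL CHECK (email LIKE '%_@_%._%')
);
"""

def save_customer(conn: sqlite3.Connection, record: dict) -> None:
    problems = validate_customer(record)
    if problems:
        # Rejected at the edge: no network round trip, no constraint violation.
        raise ValueError("; ".join(problems))
    conn.execute("INSERT INTO customer (name, email) VALUES (?, ?)",
                 (record["name"], record["email"]))

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    save_customer(conn, {"name": "Ada", "email": "ada@example.com"})
```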
Data Should Be Replicated Based on Business Requirements and Only When Necessary
McGovern, Ambler, Stevens, Linn, Sharan, and Jo write the following:
Our next best practice states that it is better to have a single version of a data source whenever possible. The only viable reasons for maintaining a replicated environment are either to increase performance or to create a decision support database that mirrors an online transaction processing (OLTP) database.
Replicated Data Should Be Read-Only
McGovern, Ambler, Stevens, Linn, Sharan, and Jo write the following:
It is preferable that data architecture includes an authoritative source for data. This is where all updates should be directed. Allowing users to update replicated data unnecessarily complicates the replication environment and introduces potential data integrity problems. In a distributed environment, using a single source to guarantee data consistency is easier than managing updates in a multiple data source environment.
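A minimal sketch of what this looks like in application code, again using Python with SQLite standing in for both databases (the class and file names are illustrative): every write goes to the single authoritative source, and reads are served from a replica opened read-only so accidental updates fail loudly.

```python
import sqlite3

class DataRouter:
    """Send all writes to the authoritative source; serve reads from a replica."""

    def __init__(self, primary: sqlite3.Connection, replica: sqlite3.Connection):
        self._primary = primary
        self._replica = replica

    def write(self, sql: str, params: tuple = ()) -> None:
        # Updates are only ever directed at the single authoritative source.
        self._primary.execute(sql, params)
        self._primary.commit()

    def read(self, sql: str, params: tuple = ()) -> list:
        # Queries hit the replica; mode=ro below makes writes to it raise.
        return self._replica.execute(sql, params).fetchall()

# Example wiring (paths are placeholders):
#   primary = sqlite3.connect("orders.db")
#   replica = sqlite3.connect("file:orders_replica.db?mode=ro", uri=True)
#   router  = DataRouter(primary, replica)
```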
When Designing a Database for Performance, Consider Total System Performance, Not Just Local Performance
When tuning the data tier for performance, McGovern, Ambler, Stevens, Linn, Sharan, and Jo suggest the following (a short sketch of a few of these follows the list):
- Limit the number of indexes in a database.
- Limit ad hoc data access.
- Reduce the number of rows in a result set.
- Limit the number of joins.
- Avoid sorting.
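These are guidelines rather than code, but a short sketch shows how a few of them combine in a single query. SQLite and the schema below are my own illustration, not the book’s: one targeted index serves the hot path, the result set is filtered and capped in the database, and the ORDER BY is satisfied by the index so no separate sort is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        placed_at   TEXT    NOT NULL
    );
    -- One targeted index for the hot query path; resist adding an index for
    -- every ad hoc question, since each extra index slows down writes.
    CREATE INDEX idx_orders_customer ON orders (customer_id, placed_at);
""")

def recent_orders(customer_id: int, limit: int = 20) -> list:
    # Filter and cap the rows in the database rather than pulling everything
    # back and trimming it in application code.
    return conn.execute(
        "SELECT id, placed_at FROM orders "
        "WHERE customer_id = ? "
        "ORDER BY placed_at DESC "  # served by the index, so no extra sort pass
        "LIMIT ?",
        (customer_id, limit),
    ).fetchall()
```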
Prefer Open Standards to Proprietary Extensions
McGovern, Ambler, Stevens, Linn, Sharan, and Jo write the following:
By utilizing a separate data access tier, data architecture can minimize the impact on calling applications since the APIs used by higher tiers do not change. The implementation of stored procedures is another form of proprietary access to a database that should be avoided. Stored procedures are difficult to migrate if a different implementation of a database is required. … Use of database triggers is also a proprietary extension of a database. It is a good practice to use database triggers for the sole purpose of supporting referential integrity and no more.
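Here’s a rough sketch of what a separate data access tier might look like, using Python’s standard DB-API with SQLite (my example, not the book’s). Higher tiers call these functions and never see SQL, stored procedures, or triggers, so the vendor-specific details stay in one module; note that placeholder syntax still varies between drivers, which a real version would also hide.

```python
import sqlite3

def open_connection(path: str = "app.db") -> sqlite3.Connection:
    """The only place in the codebase that knows which engine is in use."""
    return sqlite3.connect(path)

def find_customer(conn: sqlite3.Connection, customer_id: int):
    # Plain SQL in the data tier instead of a vendor stored procedure, so the
    # statement ports across engines with little or no change.
    return conn.execute(
        "SELECT id, name, email FROM customer WHERE id = ?",
        (customer_id,),
    ).fetchone()

def rename_customer(conn: sqlite3.Connection, customer_id: int, new_name: str) -> None:
    conn.execute(
        "UPDATE customer SET name = ? WHERE id = ?",
        (new_name, customer_id),
    )
    conn.commit()
```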
Key Takeaways
Here are my key takeaways:
- Avoid replicating data where you can; when you must replicate, keep the replicas read-only. Writable replicated data leads to consistency issues.
- Decouple the application from the underlying data store.
- Keep your data architecture flexible.
- Correct and validate the data as close to the source as possible to reduce roundtrips, and validate throughout your layers and tiers.
Additional Resources
- SQL Server Performance Guidelines (GuidanceShare)
- Data Patterns (MSDN)