diff --git a/Physical_modelling.tex b/Physical_modelling.tex index d29e293..ba6e1c3 100755 --- a/Physical_modelling.tex +++ b/Physical_modelling.tex @@ -1,129 +1,124 @@ \documentclass{llncs} -\title{A Graphical Notation for Modelling Physical Database Structures} -\author{Antonia Pillay \and Nigel Stanger} -\institute{Department of Information Science, University of Otago, Dunedin, New Zealand \email{nstanger@infoscience.otago.ac.nz}} + +\usepackage{subfigure} +\usepackage{graphicx} + + +\title{A Graphical Notation for \\ Physical Database Modelling} +\author{Antonia Pillay\inst{1} \and Nigel Stanger\inst{2}} +\institute{???? Malaysia \and Department of Information Science, University of Otago, Dunedin, New Zealand \email{nstanger@infoscience.otago.ac.nz}} + \begin{document} + \maketitle + \begin{abstract} -In this paper we describe a graphical notation for modelling physical -database structures. Such a notation will provide database -administrators with a means to model the physical structures of new and -existing databases, and thus enable them to make more proactive and -informed tuning decisions, compared to existing database monitoring -tools. +In this paper we describe a graphical notation for physical database +modelling. This notation provides database administrators (DBAs) with a +means to model the physical structure of new and existing databases, +thus enabling them to make more proactive and informed tuning decisions, +compared to existing database monitoring tools. \end{abstract} \section{Introduction} -The design and implementation of a database system, as with most -information systems, typically goes through several phases -\cite{BeDa-P-2003}, including requirements analysis, conceptual -modelling, logical modelling, physical modelling and implementation. -Conceptual, logical and physical modelling are of particular interest -here, as each represents a step in a progression from a higher to a -lower level of abstraction \cite{Tsic-D-1978}. Conceptual models are -typically very high-level and abstract, using techniques such as -entity-relationship modelling, and are derived from the requirements. -Logical level models represent the structure of a database in a form -that is closer to the final physical representation, yet still -sufficiently abstract to isolate application from the physical -representation details (ref). Logical level models are typically derived -from the corresponding conceptual models by some transformation process -(ref), and use formalisms such as the relational model. +As with most information systems, the design and implementation of a +database goes through several phases, including conceptual, logical and +physical modelling \cite{BeDa-P-2003}. These three phases are of +particular interest, as they embody the progression from higher to lower +levels of abstraction \cite{Tsic-D-1978}. Conceptual models are +typically highly abstract, using techniques such as entity-relationship +modelling. Logical models represent the database structure in a form +that is closer to the physical representation, yet still sufficiently +abstract to isolate applications from the physical representation +\cite{Codd-EF-1970}, and are expressed using formalisms such as the +relational model. A logical model for a database can be derived by +transforming the corresponding conceptual model. -Physical level models represent the structure of a database in terms of -the physical storage structures provided by a specific database -management system (DBMS) such as Oracle or DB2. Once again, the physical -model for a database can be derived from the logical level model by some -form of transformation \cite{Bato-DS-1985,Conn-TM-2002}. Because of it's -low level of abstraction, modelling the physical level of a database is -typically one of the last phases of implementation, and tends to be very -product-specific. Consequently, physical level structures are not -typically modelled using graphical notations, unlike models at the -higher levels of abstraction. (refs?) +Physical models represent the database structure in terms of the +physical storage implementation of a specific database management system +(DBMS) such as Oracle or DB2. A physical model for a database can be +derived by transforming the corresponding logical model +\cite{Bato-DS-1985,Conn-TM-2002}. Because of their low abstraction +level, physical level database models have tended to not be expressed +using graphical notations, unlike models at higher levels of +abstraction. -Physical modelling is, however, equally as important as, if not -\emph{more} important than the higher levels of modelling, because it is -the physical level that has the greatest impact on the performance of a -database \cite{BeDa-P-2003}. It is therefore somewhat surprising that -there have been relatively few attempts to devise a graphical physical -modelling notation that can be used to represent the physical structures -of a database. Use of such a notation provides several advantages -\cite{BeDa-P-1992-PDD,Conn-TM-2002,Will-J-1992}: - +Physical level modelling, however, is equally as important as, if not +\emph{more} important than the higher levels, because it is the physical +level that determines the performance of a database \cite{BeDa-P-2003}. +It is somewhat surprising that there have been relatively few attempts +to devise a graphical physical modelling notation, because such a +notation can provide several advantages +\cite{Conn-TM-2002,BeDa-P-1992-PDD,Will-J-1992}: \begin{itemize} - \item it can reduce complexity and thus aid communication and - improve understandability \cite{Tuft-ER-1997}; + \item it can reduce complexity and thus improve understandability + \cite{Tuft-ER-1997}; \item it can provide a more complete and integrated display of - tables and performance tuning techniques in a database; + performance tuning techniques in a database; \item database developers can be more confident about the design decisions that they make for the performance of the database; - \item database performance problems are more easily visaualised + \item database performance problems are more easily visualised using a graphical notation; and - \item a specific methodology is developed and used, thus allowing - the developer to resolve physical performance issues in a more - systematic manner. + \item a specific methodology is developed and used, thus enabling + developers to resolve physical performance issues more + systematically. \end{itemize} -These benefits are embodied in database performance monitoring tools -such as Quest's SpotLight (ref), which provides a higher-level -visualisation of a database's structures in order to easily identify and -highlight performance problems. Such tools, however, are primarily -intended as \emph{monitoring} tools rather than as \emph{design} tools. -They may therefore unintentionally encourage database administrators -(DBAs) into a \emph{reactive} mode of continually ``tweaking'' the -database to resolve performance issues, rather than a \emph{proactive} -mode of anticipating problems, by means of prior analysis and modelling. +These benefits are embodied in modern database performance monitoring +tools, which provide higher-level visualisations of a database's +internals in order to easily identify and highlight performance +problems. Such tools, however, are primarily \emph{monitoring} tools +rather than \emph{design} tools. They may therefore unintentionally +encourage DBAs into a \emph{reactive} mode of continually ``tweaking'' +the database to resolve performance issues, rather than a +\emph{proactive} mode of anticipating and designing for expected usage. It may also be difficult for a DBA using such tools to gain a clear and comprehensive overview of all the tuning techniques that are in use within a particular database \cite{Core-MJ-1997-OracleDW}. -In this paper we propose a graphical notation for modelling the physical -structures of a database system, which builds upon a methodology for -physical database design proposed by Beynon-Davies -\cite{BeDa-P-1992-PDD}. In Section~\ref{sec-techniques}, we provide a -brief overview of current commonly used physical tuning techniques. We -then proceed in Section~\ref{sec-previous} to discuss prior approaches -to physical modelling and design. Section~\ref{sec-notation} introduces -our proposed notation, and Section~\ref{sec-future} discusses possible -future work. The paper concludes in Setion~\ref{sec-conclusion}. +In this paper we propose a graphical notation for physical database +modelling. In Sect.~\ref{sec-techniques}, we provide a brief overview of +commonly used physical tuning techniques. We then discuss in +Sect.~\ref{sec-previous} two earlier approaches upon which our work is +partially based. Sect.~\ref{sec-notation} introduces our proposed +notation, and Sect.~\ref{sec-future} discusses possible future work. The +paper concludes in Sect.~\ref{sec-conclusion}. -\section{Physical tuning techniques} +\section{Physical Tuning Techniques} \label{sec-techniques} -DBMSs tend to be I/O bound, so the main bottleneck in any database -system is usually the speed of offline storage such as disk drives. -Retrieving data from a hard disk is theoretically about six orders of -magnitude slower than retrieving the same data from RAM\footnote{On the -order of milliseconds (\(10^{-3}\)) for disk versus nanoseconds -(\(10^{-9}\)) for RAM.}. The ultimate aim of any physical tuning -strategy is thus to minimise the impact of slow storage, either by -directly reducing the number of physical disk accesses required, or by -parallelising access to the disk in order to reduce contention for a -limited resource. +Database management is generally an I/O bound task, so the main +performance bottleneck in most databases will be the speed of offline +storage such as disk drives. Retrieving data from a hard disk is +theoretically about six orders of magnitude slower than retrieving data +from RAM\footnote{On the order of milliseconds (\(10^{-3}\)) for disk +versus nanoseconds (\(10^{-9}\)) for RAM.}. The aim of any physical +tuning strategy must therefore be to minimise the impact of slow +storage, either by directly reducing the number of physical disk +accesses required, or by parallelising access to disk in order to reduce +contention. These considerations have led to the development of five general -physical tuning techniques, which have been implemented to various -degrees by most mainstream DBMS products: - -% following sections way too long --- shorten considerably +physical tuning techniques, which are implemented to various degrees by +most modern mainstream DBMS products: \begin{description} \item[Indexes] reduce the number of physical disk accesses required - to retrieve a specific record, most typically by building a B+-tree + to retrieve a specific record, most typically by building a B+-tree \cite{Knut-DE-1997-Art} based on some key value. Without any indexes, a DBMS often has little choice but to perform a sequential scan in order to locate a specific record. @@ -157,13 +152,11 @@ % typical database queries (ref), but can suffer in high update % environments due to the need to also update the index. - \item[Hashing] is a method of quickly locating specific records by - passing a key value through a \emph{hash function}. This function - returns the physical location of a hash bucket, which contains a - pointer to the associated physical record (check). Hashing schemes - typically require only a single disk access to retrieve a specific - record. + passing a key value to a \emph{hash function}. This function returns + the physical location of a hash bucket, which contains a pointer to + the associated physical record (check). Hashing schemes typically + require only a single disk access to retrieve a specific record. % Hashing is another pure disk access minimisation technique that performs % well for exact key queries on very large tables. Rather than build a @@ -174,13 +167,11 @@ % a specific record, but perform poorly for queries that require the % retrieval of multiple records. - \item[Clustering] minimises disk access by ensuring that related - records (such as an order header and its assocaited order lines) are - physically adjacent to each other on disk - \cite{Chan-N-2003-clustering}. This usually means that related - records will be stored in the same database block, and can thus all - be retrieved with a single disk access. + records (such as an order header and its associated order lines) are + physically adjacent on disk. This usually means that related records + will be stored in the same database block, and can thus be retrieved + with a single disk access. % Clustering is a disk access minimisation technique that can be applied % when two different record types are logically related, and thus commonly @@ -196,16 +187,15 @@ % with a single disk access. Clustering can, however, be expensive to % maintain in a high update environment. - \item[Partitioning] provides parallel access paths to data by - physically splitting a table into disjoint parts and ideally placing - them on separate disks. This is particularly advantageous in - situations where multiple users wish to access different subsets of - a set of records, because it provides a separate physical access - path to each of the partitions. Partitioning can also reduce the - number of disk accesses required, because the amount of data to be - searched in each partition is smaller than if the table were not - partitioned. + physically splitting a table into disjoint parts and placing them on + separate disks. This is particularly advantageous when multiple + users wish to access different subsets of a set of records, because + it provides a separate physical access path to each of the + partitions. +% Partitioning can also reduce the number of disk accesses +% required, because there are fewer records to scan in each partition +% than if the table were not partitioned. % In a typical untuned database, all data will be stored on the same disk. % If the database is heavily accessed, then contention for the single I/O @@ -222,12 +212,11 @@ % because the amount of data to be searched in each partition is smaller % than if the database were not partitioned. - - \item[Replication] provides parallel access to data by making - multiple copies of the same records and ideally placing them on - separate disks. This is particularly advantageous if multiple users - need to access the same sets of records, but is more complex to - manage due to the need to keep replicas synchronised. + \item[Replication] provides parallel access paths to data by making + multiple copies of the same records and placing them on separate + disks. This is particularly advantageous when multiple users need to + access the same sets of records, but is more complex to manage due + to the need to keep replicas synchronised. % Replication is an access parallelism technique in which multiple copies % of the same records are placed on different disks, thus providing @@ -235,200 +224,287 @@ \end{description} - -% \subsection{Discussion} - -The physical tuning techniques described above are normally used in -different parts of the database to achieve different effects. To choose -an appropriate physical tuning technique, the DBA must consider various -factors that will either benefit the users of the database or improve -the performance of the database as a whole. While most of the techniques -can be combined to varying degrees, simply applying all techniques is -usually not an optimal solution, because each technique has different -conditions in which it excels. What are optimal conditions for one -technique may be the exact opposite for another, so the DBA needs to be -able to model all the available information in order to develop an -appropriate physical design. +These techniques are normally applied to different parts of a database +to achieve different effects. In order to choose an appropriate physical +tuning technique, the DBA must consider various factors that may benefit +only some users of the database, or may improve the performance of the +database as a whole. While most of the techniques can be combined to +varying degrees, simply applying all techniques is usually not optimal, +because each technique excells under different conditions. That is, what +are optimal conditions for one technique may be the exact opposite for +another, so the DBA needs to be able to model all the available +information in order to develop an appropriate physical design. -\section{Prior physical modelling techniques} +\section{Prior Physical Modelling Techniques} \label{sec-previous} To achieve an effective physical design requires a large amount of -information, particularly with regard to the (predicted or actual) -volume and usage of data within the database \cite{BeDa-P-2003}. -Incorporating this information into a graphical model can provides a -more concise and clearer overview of the physical aspects of a database -system. In this section we discuss three previous efforts at modelling -such information in a graphical manner. +information, particularly with regard to the predicted or actual volume +and usage of data within the database \cite{BeDa-P-2003}. Incorporating +this information into a graphical model can provide a more concise and +clearer overview of the physical aspects of a database system. In this +section we briefly discuss two previous efforts at modelling such +information in a graphical manner. -\subsection{Agile modeling (Ambler)} +\subsection{Agile Modeling (Ambler)} -The notation suggested by Ambler -\cite{Ambl-SW-2003-ADT,Ambl-SW-2004-ObjPrimer3} is a type of model that -can be developed in UML using class boxes. He states that since Unified -Modelling Language (UML) does not cover data (i.e., ER) modelling yet, -he presents the solution in this paper. Ambler has argued for some time -for the presence of data modelling in UML \cite{Ambl-SW-1998-BOA}, and -has suggested various ways that it should be done. It is enlightening to -know that Ambler is not alone in his quest for adopting an industry -standard in data modelling. Other methodologies like -\cite{Naib-EJ-2001-UMLDD} have recognized the need as well. This model -type is built on the practice of UML 2.0 of separating core methodology. -Ambler admits that the methodology that he has presented is not perfect -and it focuses on the physical modelling of a relational database. In -this model he also tries to stress style issues that according to him -are not appropriate for a proper UML profile \cite{Ambl-SW-2003-ADT}. +Ambler proposed a physical modelling notation based on the Unified +Modeling Language (UML), as part of a larger effort to produce a +``traditional'' style data modelling profile for the UML +\cite{Ambl-SW-2003-ADT,Ambl-SW-2004-ObjPrimer3}. Ambler and others have +argued the need for such a profile for some time +\cite{Ambl-SW-1998-BOA,Naib-EJ-2001-UMLDD}. -This model suggests that a class box without a stereotype in a physical -database design is a table. It also represents views that have -dependencies on the table structures. Tables, entities and views are all -modelled using class boxes. The class boxes that appear on the logical -and conceptual model are entities so the stereotype is optional. -Similarly, it is assumed that any class box without a stereotype on a -physical data model is a table. Ambler's methodology seems to be limited -in terms of annotations. All representations whether tables or indices -are modelled using class boxes. This representation can sometimes be -confusing and exact identification of tables and indices in the system -has to be known by the database designer. In this model, relationships -are modelled using associations that are derived from the logical data -model. He further discusses his methodology for modelling attributes and -columns, keys, constraints and triggers, stored procedures and sections. -Ambler states that requirements for something should be identified -before it is built. He states the requirements of each model, e.g., -requirements that are needed to model entities and tables, requirements -that are needed to model relationships, etc. - -An example of Ambler's notation is shown in Figure~\ref{fig-Ambler}. +Ambler's notation focuses on the physical modelling of relational +databases. The notation uses class boxes without stereotypes to +represent physical tables, while indexes are represented by class boxes +with the stereotype \verb|<>|, as illustrated in +Fig.~\ref{fig-Ambler}. There appear to be no stereotypes for other +physical tuning techniques such as partitioning, although these could be +easily incorporated. \begin{figure} - \caption{Ambler's physical model notation (adapted from \cite{Ambl-SW-2003-ADT})} + \includegraphics[width=\columnwidth,keepaspectratio]{Ambler} + \caption{Ambler's physical modelling notation (adapted from \cite{Ambl-SW-2003-ADT})} \label{fig-Ambler} \end{figure} -Even though a lot of detail is covered in his suggestion, Ambler's -methodology somehow seems relatively weak. It is much too general in -annotations and has no specific description. The methodology seems to -consistently confuse the logical and physical design. For example: both -are modelled using class boxes but are later explained as representing -different objects. However, this model will definitely be useful as a -stepping-stone to work towards an official UML data modelling profile. +Ambler's approach suffers from two serious disadvantages. First, the +notation is very limited in the types of symbol used. All physical +level constructs are represented by class boxes, which in a complex +diagram could make distinguishing them difficult. This limitation +probably arises from the constraints on developing a new notation within +the existing UML framework. + +Second, his approach appears to consistently confuse the logical and +physical levels of abstraction: the same notations are used to represent +not only physical but also logical and conceptual elements +\cite{Ambl-SW-2003-ADT}. This confusion is illustrated by the inclusion +of a view (a non-physical construct) in Fig.~\ref{fig-Ambler}. + +In summary, while Ambler's notation does graphically model the physical +level of a database, the similarity of the graphical symbols and the +evident confusion between the physical and logical levels diminish its +usefulness. + +% . He states that since Unified +% Modelling Language (UML) does not cover data (i.e., ER) modelling yet, +% he presents the solution in this paper. Ambler has argued for some time +% for the presence of data modelling in UML \cite{Ambl-SW-1998-BOA}, and +% has suggested various ways that it should be done. It is enlightening to +% know that Ambler is not alone in his quest for adopting an industry +% standard in data modelling. Other methodologies like +% \cite{Naib-EJ-2001-UMLDD} have recognized the need as well. This model +% type is built on the practice of UML 2.0 of separating core methodology. +% Ambler admits that the methodology that he has presented is not perfect +% and it focuses on the physical modelling of a relational database. In +% this model he also tries to stress style issues that according to him +% are not appropriate for a proper UML profile \cite{Ambl-SW-2003-ADT}. + +% This model suggests that a class box without a stereotype in a physical +% database design is a table. It also represents views that have +% dependencies on the table structures. Tables, entities and views are all +% modelled using class boxes. The class boxes that appear on the logical +% and conceptual model are entities so the stereotype is optional. +% Similarly, it is assumed that any class box without a stereotype on a +% physical data model is a table. Ambler's methodology seems to be limited +% in terms of annotations. All representations whether tables or indices +% are modelled using class boxes. This representation can sometimes be +% confusing and exact identification of tables and indices in the system +% has to be known by the database designer. In this model, relationships +% are modelled using associations that are derived from the logical data +% model. He further discusses his methodology for modelling attributes and +% columns, keys, constraints and triggers, stored procedures and sections. +% Ambler states that requirements for something should be identified +% before it is built. He states the requirements of each model, e.g., +% requirements that are needed to model entities and tables, requirements +% that are needed to model relationships, etc. +% +% An example of Ambler's notation is shown in Fig.~\ref{fig-Ambler}. + +% Even though a lot of detail is covered in his suggestion, Ambler's +% methodology somehow seems relatively weak. It is much too general in +% annotations and has no specific description. The methodology seems to +% consistently confuse the logical and physical design. For example: both +% are modelled using class boxes but are later explained as representing +% different objects. However, this model will definitely be useful as a +% stepping-stone to work towards an official UML data modelling profile. -\subsection{Physical design using an entity model (Beynon-Davies)} +\subsection{Physical Design Using an Entity Model (Beynon-Davies)} -The closest model that demonstrates the link between conceptual, logical -and physical design work is by Beynon-Davies \cite{BeDa-P-1992-PDD}. He -suggests a method that can be used for the physical design process. The -paper defines physical modelling as ``the transformation of the logical -model into a definition of the physical model suitable for a specific -software/hardware configuration'' \cite{BeDa-P-1992-PDD}. +Beynon-Davies proposed a method for analysing and modelling the physical +usage patterns of a database \cite{BeDa-P-1992-PDD}. In his method, +various aspects of the physical performance of a database are measured, +such as the size and expected growth rates of tables (volume analysis), +the volatility of tables, and the frequency of transactions (usage +analysis). The data obtained from these analyses are then used to +annotate a logical level entity-relationship diagram (ERD) of the +database, producing what is known as a \emph{composite usage map} (see +Fig.~\ref{fig-Beynon-Davies} on page~\ref{fig-Beynon-Davies} for an +example). -According to Beynon-Davies one of the first steps that must be taken to -move from logical to physical database design is to establish estimates -of the average and maximum number of instances per entity (volume -analysis). Volatility analysis is represented in the same way. Similar -to volume analysis, volatility analysis cannot be used as a measure if a -table is continually growing. It is most effective if the size of the -table is static. Figure 9 summarizes Beynon-Davies's final modelling -method; he does not model all the individual annotations but generalizes -the entity relationship diagrams into a general model. +Beynon-Davies' method provides a very good mechanism for representing +the usage statistics of database in a coherent manner, but is somewhat +complex to execute without some form of automation. Our experience with +teaching this method at undergraduate level shows that even with a +relatively small database, the designer can quickly become overwhelmed +by the sheer volume of usage data involved. -It is understandable that the reason behind this generalization is to -minimize complexity and since individual annotations are developed in -the beginning, references can be made to them if needed. This might be -effective for small systems that have a limited number of tables. -However, for medium to large-scale projects, Beynon-Davies suggests that -a systematic analysis of volatility and usage should be done. This -analysis would constitute a map of the database application that would -be maintained by the DBA. This meets the objective of providing an -overview of all the performance tuning techniques and data in place. +In addition, Beynon-Davies' method does not produce any conclusions as +to which physical tuning methods should be implemented---rather it +provides much of the information required to enable these decisions to +be made. Beynon-Davies' method is thus more a notation for summarising +the physical usage patterns of a database, rather than a notation for +physical modelling per se. + +% The closest model that demonstrates the link between conceptual, logical +% and physical design work is by Beynon-Davies \cite{BeDa-P-1992-PDD}. He +% suggests a method that can be used for the physical design process. The +% paper defines physical modelling as ``the transformation of the logical +% model into a definition of the physical model suitable for a specific +% software/hardware configuration'' \cite{BeDa-P-1992-PDD}. +% +% According to Beynon-Davies one of the first steps that must be taken to +% move from logical to physical database design is to establish estimates +% of the average and maximum number of instances per entity (volume +% analysis). Volatility analysis is represented in the same way. Similar +% to volume analysis, volatility analysis cannot be used as a measure if a +% table is continually growing. It is most effective if the size of the +% table is static. Figure 9 summarizes Beynon-Davies's final modelling +% method; he does not model all the individual annotations but generalizes +% the entity relationship diagrams into a general model. +% +% It is understandable that the reason behind this generalization is to +% minimize complexity and since individual annotations are developed in +% the beginning, references can be made to them if needed. This might be +% effective for small systems that have a limited number of tables. +% However, for medium to large-scale projects, Beynon-Davies suggests that +% a systematic analysis of volatility and usage should be done. This +% analysis would constitute a map of the database application that would +% be maintained by the DBA. This meets the objective of providing an +% overview of all the performance tuning techniques and data in place. -\section{A new physical notation} +\section{A New Physical Notation} \label{sec-notation} -In this section we discuss a new notation for graphically representing -physical database models. While Ambler's notation is in many ways closer -to a physical modelling notation, the evident confusion between the -physical and logical levels diminishes its usefulness. Beynon-Davies' -notation, on the other hand, simply adds information relevant to -physical tuning to a standard entity-relationship diagram. Our -experience in teaching this technique at undergraduate level is that the -designer can quickly become overwhelemed by a ``mountain of numbers''. -In addition, Beynon-Davies' method does not provide any conclusions as -to which physical tuning methods should be implemented---rather it -provides the information required in order to draw these conclusions. -Beynon-Davies' method is thus more of a notation for summarising the -physical usage of a database, rather than a notation for representing -physical database structures. +Both of the notations discussed in the previous sections are limited in +their ability to graphically model the physical level of a database. +Ambler's notation lacks clarity and is thus potentially confusing, while +Beynon-Davies' notation only summarises the physical usage patterns of a +database rather than providing an actual physical level database model. +We have therefore adopted aspects from both approaches to devise a +graphical notation that enables database designers to graphically model +the common physical database tuning techniques discussed in +Sect.~\ref{sec-techniques}. -We have therefore taken Beynon-Davies' work one step further and -developed a graphical notation for representing the physical database -tuning techniques discussed in Section~\ref{sec-techniques}. - -The graphical notations use in this model is not complex and cluttered. -This is to enable a database designer to simplify the physical design in -a notation that will be easy to understand but comprehensive. Some of -the graphical notations use in this model was adapted from other models -and a few were modified by the author to suit the model. - -The table below illustrates some of the graphical representations that -were available to choose from. - -Clustering is represented by nesting one entity inside another, as shown -in Figure~\ref{fig-notation} (adapted from \cite{BeDa-P-1992-PDD}). This -notation is intuitive, and clearly indicates the field(s) on which the -records are clustered. - -Partitioning is represented by splitting an entity into either vertical -or horizontal partitions, as shown in Figure~\ref{fig-notation} (adapted -from \cite{Silb-A-2002-4E}). Once again, the notation is intuitive, and -allows the partition definitions to be easily listed. - -Replication is indicated by including multiple copies of the same -entity, with a diagonal bar across the corner of each replica (adapted -from a notation used in data flow diagrams \cite{Gane-C-1979}). - -Indexes are represented by a small tree-like symbol within an entity. -The index key is listed next to this symbol. Hashing is represented in a -similar way, but using an ``H'' symbol instead of a tree symbol. +The notations that we have adopted for this notation are shown in +Fig.~\ref{fig-notation}. Some of these are adapted from other notations, +while some we have created ourselves. The symbols have been chosen so as +to be as intuitive as possible, while not being so complex as to produce +a cluttered and confusing diagram. Physical models may be developed +using this notation either with or without a prior Beynon-Davies style +analysis. \begin{figure} - \caption{Graphical representations for physical structures} + \centering + \subfigure[Table]{\label{fig-notation-table}\includegraphics[scale=0.9]{notation-table}} + \hfill + \subfigure[Indexes]{\label{fig-notation-index}\includegraphics[scale=0.9]{notation-index}} + \hfill + \subfigure[Hashing]{\label{fig-notation-hash}\includegraphics[scale=0.9]{notation-hash}} \\ + \hfill + \subfigure[Clustering]{\label{fig-notation-cluster}\includegraphics[scale=0.9]{notation-cluster}} + \hfill + \subfigure[Partitoning]{\label{fig-notation-partition}\includegraphics[scale=0.9]{notation-partition}} + \hfill \\ + \subfigure[Replication]{\label{fig-notation-replica}\includegraphics[scale=0.9]{notation-replica}} + \caption{Notations for physical tuning techniques} \label{fig-notation} \end{figure} +A physical table is represented by a simple box, as shown in +Fig.~\ref{fig-notation-table}. This is similar to most logical and +conceptual level ERD notations. -\subsection{Example of use} +Indexes are represented by a small tree-like symbol within a table, as +shown in Fig.~\ref{fig-notation-index}. The index key is listed next to +this symbol. Composite keys are indicated by a grouping symbol. Hashing +is represented in a similar way, but uses an ``H'' symbol instead of a +tree symbol (see Fig.~\ref{fig-notation-hash}). -Consider the entity-relationship diagram shown in Figure~\ref{fig-ERD}, -which depicts a database system for a consumer electronics manufacturer. -A corresponding Beynon-Davies' annotated model is shown in -Figure~\ref{fig-Beynon-Davies}. Analysis of this figure enables us to -draw certain conclusions about the physical characteristics of this -database: +Clustering is represented by nesting one table inside another, as shown +in Fig.~\ref{fig-notation-cluster} (adapted from +\cite{BeDa-P-1992-PDD}). This notation is intuitive, and clearly +indicates the field(s) on which the records are clustered. +Partitioning is represented by splitting a table into either vertical or +horizontal partitions according to the style of partitioning, as shown +in Fig.~\ref{fig-notation-partition} (adapted from +\cite{Silb-A-2002-4E}). Once again, the notation is intuitive, and +allows the partition definitions to be easily indicated. + +Replication is indicated by including multiple copies of the same table, +with a diagonal bar across the corner of each replica, as shown in +Fig.~\ref{fig-notation-replica}. This is adapted from a similar notation +used in data flow diagrams \cite{Gane-C-1979}. + +% The graphical notations use in this model is not complex and cluttered. +% This is to enable a database designer to simplify the physical design in +% a notation that will be easy to understand but comprehensive. Some of +% the graphical notations use in this model was adapted from other models +% and a few were modified by the author to suit the model. +% +% The table below illustrates some of the graphical representations that +% were available to choose from. + +Consider the entity-relationship diagram shown in Fig.~\ref{fig-ERD}, +which depicts a database for a consumer electronics manufacturer. A +corresponding Beynon-Davies' composite usage map based on the fourteen +most significant transactions is shown in Fig.~\ref{fig-Beynon-Davies}. +The arrows represent physical access paths, while the number attached to +each access path indicates the number of disk accesses per hour along +that path. The diagram clearly highlights some potential performance +problem areas in the database, for example: \begin{itemize} - \item conclusions\ldots (see 321 assignment material) + \item There are many disk accesses on the access paths between the + Sale\_head/Sale\_line and the Order\_head/Order\_line tables. Since + these are tables that will normally be accessed together, both could + perhaps be candidates for clustering (depending on the mix of update + versus read operations). + \item The appear to be multiple transactions acessing the Staff + table. This could imply a need for partitioning. + + \item There is an extremely high access rate on the Customer table. + Further examination, however, reveals that this rate only occurs for + a short period once per month, and that the transaction in question + only requires read access. Replication of the Customer table could + therefore be a suitable solution. + \end{itemize} \begin{figure} + \includegraphics[width=\columnwidth,keepaspectratio]{ERD} \caption{ERD of the example database} \label{fig-ERD} \end{figure} \begin{figure} - \caption{Beynon-Davies annotated model for the example database} + \includegraphics[width=\columnwidth,keepaspectratio]{Beynon-Davies} + \caption{Beynon-Davies composite usage map for the example database} \label{fig-Beynon-Davies} \end{figure} -Figure~\ref{fig-physical-model} shows a physical model of this database -system using our proposed notation. +The suggestions above can be represented as a physical model using our +proposed modelling notation, as shown in Fig.~\ref{fig-physical-model}. +Note that we have placed indexes on all primary keys as a matter of +course. \begin{figure} \caption{Physical database model for the example database} @@ -436,12 +512,14 @@ \end{figure} -\section{Future research} +\section{Future Research} \label{sec-future} - symbols for different types of index (reverse-key, index-organised tables, R-trees, bitmap indexes). +- physical placement information, i.e., what devices partitions go on, etc. + - evaluation of the notation with undergraduate students \& comparison with Beynon-Davies (preliminary results by time of publication).