Round Pegs into Square Holes: Data Warehouses for the Hardware Impaired
Michael Davis, Bassett Consulting Services, Inc., North Haven, Connecticut
Abstract

One of the dirty secrets of most data warehouse projects is that they require vast amounts of disk drive space and robust computers for successful implementation. Not every organization that lusts for a comprehensive data warehouse can finance the tab for the hardware that may be required to create the warehouse of their dreams.

Bassett Consulting Services has helped two clients make better use of limited computing resources through a two-step approach. First, an active data dictionary permits "point and click" selection, renaming, joining, and sub-setting of data, even when the data resides on tape cartridges or in hierarchical files. Then "code engines" create SAS code on the fly to extract the requested data from their actual physical locations and transform the data into SAS data sets. These data sets even contain variable labels and formats, courtesy of the data dictionary, and are ready for further analytical processing.

Introduction

When Bassett first began creating SAS/AF® FRAME applications, our custom was to incorporate SAS data step program code through SCL SUBMIT blocks, which were compiled along with the rest of the application. However, we became dissatisfied with this approach for several reasons.

First, our clients were asking for "point and click" selection of variables. Within the programming structures that we used, it was not possible to comply with their requests without custom programming for each new data source. Compounding the problem was the realization that the data steps incorporated into our applications involved much repetitive code. Bassett wanted to employ an approach that would facilitate the adoption of object-oriented programming techniques.

Second, it seemed that the SAS System's best features for developing "point and click" interfaces required that the data be in a SAS data set or relational database management system (RDBMS) and stored on disk. Unfortunately, our clients could not afford to devote the disk space necessary to permanently store their data warehouses on disk.

Third, looking towards the future, Bassett wanted to position the applications it developed so that when Version 7 is released, SAS variable names up to 32 characters long can be accommodated without re-coding and recompiling. Similarly, when it was necessary to create time-series data sets, we wanted to be able to look up a five-character variable prefix easily. The data dictionary approach satisfies both of these needs.

Fourth, it was clear that no matter how well any data warehouse might be designed, there would always be some essential information not catalogued in the data warehouse. Thus a data consolidation structure should be created that would allow our clients to integrate data from SAS data sets, RDBMSs, and flat-files stored on disk or tape into tables for subsequent analysis and reporting.

This was the impetus for developing the Meta-Master utilities and the Data Integrator interfaces. The Meta-Master utilities consist of a data dictionary and SAS/AF applications designed to populate the dictionary, maintain the dictionary, and to define objects called variable lists.

Figure 1 illustrates the relationship between Data Integrator and Meta-Master. Underneath the user interface layer of Data Integrator are the Meta-Master data dictionary, the Meta-Master utilities, and the Data Integrator code engines. The data dictionary is where the details for each data element are stored. The Meta-Master utilities are used to automatically populate the data dictionary.

The core of Data Integrator is composed of "code engines". These engines write SAS code to extract, sort, combine, and manipulate data as specified by lists called "job streams". These job streams are specified through Data Integrator's "point and click" interface, made possible by the Meta-Master data dictionary.

To better understand how Meta-Master and Data Integrator work, let's examine how variable lists and job streams are defined.
Figure 1. Data Integrator components: SAS/AF FRAME Interface; Data Dictionary; Meta-Master Utilities; Code Engines; Variable Lists.

For the Meta-Master data dictionary and utilities to properly function, each data element must have a unique identification. Meta-Master uses the SAS System two-level convention (<libref>.<member>) to specify to which table a data element (variable) belongs. This is illustrated in Figure 2. Since the way in which Meta-Master employs this form of categorization has some unusual features, let us review each term.

Figure 2. Data Dictionary categorization by libref and table.

SAS users may already be familiar with the term libref. A libref is an alias which tells the SAS System the physical location of the library where it can find a specified data set member. However, since the Data Integrator system is designed to reference flat-files in a manner that is largely transparent to its users, Meta-Master extends the definition of libref to include a single flat-file or a group of related flat-files.

As an example, let us suppose that the Meta-Master routines were to be employed by a health insurer. We might have a series of patient claims data sets, one for each month. Rather than having to redundantly define the data elements in the various claims tables for each month, we can use a single libref (and table) to represent a typical month. Then, we merely have to tell Data Integrator for which month (and year) we desire our data.

Our use of the term "table" is similar to the SAS data set member. We decided to refer to data set members as tables for two reasons. First, our definition of tables should be viewed as logical or symbolic, while SAS data set members are but one form of physical representation. Second, we wanted to reserve the possibility that we might extend the Meta-Master utilities in the future to allow multiple tables to point back to a single SAS data set, as is currently permitted with flat-files.

The term "variable" refers to a single data element or SAS variable. The combination of libref, table, and variable points to a unique data element in our system of categorization. However, during testing of the Meta-Master prototype on a rather large table with several thousand variables, it was decided to provide two additional means to sub-categorize variables.

A variable "category" classifies similar variables for ease of selection. Where variables are distinguished only by the time period each variable represents or its position in a series, an "index value" is assigned to it in the data dictionary. It is important to note that the use of categories and index values in the Meta-Master system is optional and that categories and index values are only used to subset the number of variables shown for a table.
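To make the monthly claims example above concrete, the sketch below shows how a generic libref can be pointed at whichever month a user requests. The library path, member name, and macro variables are hypothetical stand-ins; Data Integrator generates the equivalent code from the data dictionary rather than from a hand-written program.

   %let year  = 1997;
   %let month = 03;

   /* point the generic CLAIMS libref at the requested month's library */
   libname claims "d:\warehouse\claims\y&year.m&month";

   data work.extract;
      set claims.detail;   /* DETAIL is the "typical month" table defined once */
   run;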
Variable List Structure

The heart of the variable list is a list of the variables to be included in a particular job step. When a variable list is created, it is stored in a SAS catalog as an SLIST entry, and the person creating the list is encouraged to supply a narrative description to be stored with the SLIST catalog entry. The structure of the variable list is illustrated in Figure 3.

Figure 3. Variable list structure (SLIST catalog entry): Variables; DROP= (dropped variables list); RENAME= (old variables = new variables); WHERE= (WHERE clauses).

Although all of the code engines created for the Data Integrator thus far handle only one libref and table per variable list, Meta-Master stores the libref and table for each variable in the event that it becomes desirable to relax this restriction in the future.
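The fragment below is a minimal SCL sketch of how a variable list of the form shown in Figure 3 might be built and saved as an SLIST catalog entry. It assumes the standard SCL list functions (MAKELIST, INSERTC, SETNITEMC, and SAVELIST); the variable names, named items, and catalog entry name are hypothetical rather than Meta-Master's own.

   INIT:
      varlist = makelist();                          /* new SCL list           */
      varlist = insertc(varlist, 'CLAIMNO', -1);     /* variables for the step */
      varlist = insertc(varlist, 'PAIDAMT', -1);
      rc = setnitemc(varlist, 'CLAIMTYPE', 'DROP');          /* drop list      */
      rc = setnitemc(varlist, 'PROVDR=PROVIDER', 'RENAME');  /* rename pair    */
      rc = setnitemc(varlist, 'PAIDAMT > 0', 'WHERE');       /* WHERE clause   */
      /* store the list as an SLIST catalog entry */
      rc = savelist('CATALOG', 'sasuser.meta.claims.slist', varlist);
   return;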
Other information is stored along with the list of variables for those job steps that may require that information. These pieces of information are used as data set options when the code engines build a job step.

For example, if a job step requires that some incoming variables be renamed to prevent "collisions", a list of original and renamed variables can be stored in the variable list. Variables that are not required for the current step can be dropped. If WHERE processing is required, it can be applied to only the data sets that require it and before the current observation is brought into the job step.

Variable lists are managed by the Meta-Master utilities and routines. These lists are referenced by the Data Integrator application, which invokes code engines written in SCL (Screen Control Language). Data Integrator uses job stream lists to combine these elements and accomplish data extraction and manipulation.
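To make the rename and WHERE discussion above concrete, a step that combines two of the hypothetical monthly claims tables might be rendered by a code engine along the lines of the following sketch (all names assumed, both tables assumed sorted by CLAIMNO):

   data work.combined;
      merge claims.m9701(rename=(paidamt=paid01) where=(claimstat = 'P'))
            claims.m9702(rename=(paidamt=paid02) where=(claimstat = 'P'));
      by claimno;   /* BY variable serves as the merge key */
   run;

Renaming PAIDAMT in each input data set keeps the two months' values from colliding, and the WHERE= data set options filter each input before its observations enter the step.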
Job Stream Lists

Each job stream list consists of at least one step and can have as many steps as desired. Job streams are stored as nested lists, in job step sequence, within an SLIST catalog entry. The structure of a typical job stream is illustrated in Figure 4.

Figure 4. Job stream list structure (SLIST catalog entry): JOB STEP=INFILE; JOB STEP=SORT; JOB STEP=MERGE.
Data Integrator processes a job stream one step at a time. The type of step is specified by the list name under which the job step was stored. At the time this paper was prepared, Bassett had created code engines to write SAS statements and other types of code for the following types of job steps:

• INFILE
• SET
• MERGE
• SORT
• TEMPLATE
• COPY
• DOWNLOAD

Some of the code engines execute on the desktop computer on which Data Integrator runs. Other code engines take advantage of the SAS System's ability to distribute program logic and submit code to be executed on a remote computer, such as an MVS mainframe.

What do the different engines do? The INFILE engine writes a data step to extract a list of variables from a disk or tape flat-file. The SET engine brings in one or more SAS data sets for sub-setting and transformation. The MERGE engine brings in two or more data sets and combines them using BY variables as the merge key.
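The data step below is a sketch of the kind of code the INFILE engine writes. The fileref, variable names, offsets, informats, labels, and formats are hypothetical stand-ins for values that the engine actually takes from the data dictionary.

   filename rawclm 'claims.y1997.m03.dat';   /* flat-file on disk or tape */

   data work.claims;
      infile rawclm;
      input @1   claimno  $char10.
            @11  svcdate  mmddyy8.
            @19  paidamt  pd5.2;
      label claimno = 'Claim Number'
            svcdate = 'Service Date'
            paidamt = 'Amount Paid';
      format svcdate date7.
             paidamt dollar12.2;
   run;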
The SORT engine does what one expects. It sorts a single SAS data set. However, one can use this engine to rename or drop variables and to subset the observations copied to the output data set. The output data set need not be the input data set.

The TEMPLATE engine was developed to accommodate those situations where complex data step logic is required. Rather than develop a convoluted menu structure to handle all potential situations, the user keys a program template. For some applications, the template is inserted exactly as keyed. However, the TEMPLATE engine shines when Data Integrator uses the data dictionary to "fill in the blanks". The user merely keys in the word INPUT on its own line and lists the variables to be included in the data step. Data Integrator looks up the variable offsets and informats required to fill out the INPUT statement. Thus a piece of program code is completed with very little keying.
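As an illustration of the "fill in the blanks" behavior, the fragment below sketches a template as a user might key it and the INPUT statement Data Integrator might generate from it. The variable names, offsets, and informats are assumed, not taken from an actual dictionary.

   /* template as keyed by the user:
         INPUT
         claimno
         svcdate
         paidamt
   */

   /* statement as completed from the data dictionary: */
   input @1   claimno  $char10.
         @11  svcdate  mmddyy8.
         @19  paidamt  pd5.2;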
As shown in the preceding example, Data Integrator completes these statements using information stored in the data dictionary.

The COPY and DOWNLOAD engines were developed for a client which routinely stores needed data on an MVS mainframe. After using Data Integrator to create and remotely submit an MVS batch job to extract the data, the COPY engine submits a remote SAS program to copy the extract to disk. Once the data has been copied to disk, the DOWNLOAD engine downloads the data to the desktop, where the Interface to PC File Formats converts the downloaded data to a Microsoft Excel .XLS file.
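A compressed sketch of the remote copy-and-download idea appears below, assuming SAS/CONNECT is available. The session name, MVS library, and member names are hypothetical, and the actual engines also generate and submit the batch extraction job described above.

   signon mvshost;                             /* connect to the MVS session  */
   rsubmit;
      libname extract 'PROD.CLAIMS.EXTRACT';   /* extract already copied to disk */
      proc download data=extract.claims out=work.claims;
      run;
   endrsubmit;
   signoff;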
Using the same technology, other data step processing or procedures can be accommodated. Further, jobs to be remotely submitted to any platform supported by the SAS System, including operating system and utility program statements, can be cloned from existing engines.
Additional Job Step Parameters

What kinds of additional information do the code engines require? It depends on the code engine. As an example, let us consider the SET code engine when the job is to be both created and run on the desktop platform. Five items must be specified:

• the output libref
• the output member
• a drop list
• a sub-setting IF statement
• one or more variable lists

The output libref and output member are inserted into the DATA statement created by the SET code engine. The drop list allows variables drawn from input SAS data sets only for use by a sub-setting IF statement to be dropped at the completion of the data step.

A SET job step may contain more than one variable list since a SET statement can specify more than one SAS data set to be incorporated into a data step. Data Integrator loops through all the requested SAS data sets as it builds the SET statement. The data set options specified within each variable list are inserted in the SET statement after each SAS data set.
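Putting these parameters together, a SET job step generated for the desktop might resemble the following sketch, with hypothetical librefs, members, and variables:

   data work.analysis;                          /* output libref and member */
      set claims.m9701(keep=claimno paidamt claimtype
                       rename=(paidamt=paid_amt))
          claims.m9702(keep=claimno paidamt claimtype
                       rename=(paidamt=paid_amt));
      if claimtype = 'IP';                      /* sub-setting IF            */
      drop claimtype;                           /* drop list: kept only for the IF */
   run;

The data set options taken from each variable list (here KEEP= and RENAME=; a WHERE= clause could be added in the same way) follow each data set named on the SET statement.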
Data Dictionary Features

One feature of the Meta-Master Data Dictionary is the ability to store the data type for each element. Data Type can assume one of several values.

With respect to the other fields shown on the screen in Figure 6, the Informat, Format, and Length fields are familiar to SAS users and retain their usual SAS System meanings. Offset is the starting position of the field to be read to create the SAS variable. It is required when reading data from a tape or disk flat-file. Libref, Table, and Variable Name have been discussed previously. Variable Type can be either character or numeric. Label is a forty-character field which contains the SAS variable label.

Transformation is a unique feature of the Meta-Master Data Dictionary. For example, a tape data set might have a transaction date, representing a Julian date, that is stored as a packed decimal. In our applications, we want to use it as a SAS date value.

When the code engines used to extract data are run, they inspect the value of the Transform(ation) data dictionary field. If it is not empty, Data Integrator creates an assignment statement which incorporates the transformation. In our example, the field on the tape might be read using the PD3. informat. The appropriate transformation would be entered in the dictionary, and Data Integrator would build the corresponding assignment statement in the created data step.
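As a sketch of such a generated statement, assume the packed-decimal Julian date has been read into a numeric variable named trandate (a hypothetical name) with the PD3. informat. The generated code might then be:

   trandate = datejul(trandate);   /* convert the Julian date to a SAS date value */
   format trandate date7.;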
While transformations were incorporated into Meta-Master for the purpose just described, they can be used to solve other problems. These uses include summing variables across an observation and counting observations either conditionally or unconditionally.

To illustrate the creation of a true/false variable, suppose we had entered a transform for a numeric variable named testco, where company is a character value containing the names of companies; testco would be true when company contained the word "BASSETT".
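A sketch of such a transform, using the variable names from this example, is:

   testco = (index(company, 'BASSETT') > 0);   /* 1 (true) when found, 0 otherwise */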
The key thing to remember about transformations is that once they are entered into the data dictionary, Data Integrator's code engines become smart enough to remember the desired processing logic whenever that SAS variable is selected.

Autoloader

The Meta-Master utilities would be far less useful if it were necessary to key in the dictionary information by hand. As observed in a preceding discussion, some of our clients routinely work with tables containing thousands of variables. Fortunately, one of the Meta-Master utilities is a FRAME application that automatically loads the data dictionary.

How is this done? For almost any table we might insert into the data dictionary, we have one of three items. We might have a copy of the data set or one with a similar structure. When this is not the case, we almost surely have a SAS program with an INPUT statement coded to read the raw data. Last, if the raw data was created by a production program, we might have access to a COBOL record description for the raw data file. In this case, we employ the COB2SAS utilities, available at no cost from SAS Institute's Technical Support department. COB2SAS is used to translate the record description into SAS INPUT and LABEL statements, which are then handled like any other SAS program containing an INPUT statement.

How does the Autoloader work? In all cases, Meta-Master reads the SAS System dictionary views to determine the table layout and to populate the Meta-Master data dictionary. In the case where a table is drawn from a SAS data set or view, all of the required information can be obtained from the SAS System dictionary views.
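The sketch below shows the kind of query that can recover a table layout from the SAS System dictionary views; the libref and member name are hypothetical.

   proc sql;
      create table work.layout as
         select name, type, length, format, informat, label
         from dictionary.columns
         where libname = 'CLAIMS' and memname = 'DETAIL';
   quit;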
In the case of tape and disk flat-files, two additional items must be obtained. These two items are the variable offsets and informats. This information is obtained from parsing the INPUT statement.

To use the Autoloader to define a table created from a flat-file, the user enters the name of the file where the SAS program containing an INPUT statement for the file is located. An editor window is opened and the file is loaded. The user deletes the DATA and INPUT statements since the Autoloader supplies them. Other statements, such as LABEL statements, are retained.

In addition to creating the SAS System dictionary views and parsing the INPUT statement for variable offsets and informats, the Autoloader identifies appropriate variable formats based on the informats specified in the INPUT statement.

Building Job Streams

The user clicks on a radio box station to select the next job step to be built. Clicking on the REVIEW job stream station prints a detailed list of the job stream under construction to the SAS log. If the wrong job step is selected, clicking on the UNDO last step button backs out the last step entered.

To keep users from wondering if the job step they just entered was accepted, a "stack" displays the job steps that have been previously entered.

Running Job Streams

One feature of this interface is that the librefs and filerefs associated with the selected job stream are verified before the job stream is actually executed. If a libref or fileref is unassigned, a pop-up window allows the user to make the assignment or cancel execution before any time is wasted running the job stream.
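The verification can be sketched with the SCL LIBREF and FILEREF functions, whose return codes indicate whether a libref or fileref is currently assigned; the names below are hypothetical.

   if libref('CLAIMS') ne 0 then do;
      /* libref not assigned: open the pop-up window or cancel the run */
   end;
   if fileref('RAWCLM') > 0 then do;
      /* fileref not assigned: open the pop-up window or cancel the run */
   end;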
Custom Interface for Remote MVS Batch Jobs

For one client, Bassett created a custom interface and code engines to meet their requirements. The data they wished to extract was stored in tape cartridges on an MVS mainframe. The data sets were very large, typically occupying 50 or more IBM 3480 tape cartridges. Their analysts needed the ability to extract data for a single client and download it for analysis with a Microsoft Excel spreadsheet.

When the client was shown a prototype Data Integrator interface that allowed them to define and store job stream lists, their reaction was that the interface was needlessly complex and confusing. To satisfy their concern, we modified the Data Integrator interface to build a disposable single-step job stream list. Thus users at this site only have to define one job step at a time.

The INFILE code engine was modified to submit the extraction program in batch mode, along with the required MVS JCL. To minimize both contention for the tape data sets and execution time, the code engine also inserts a SyncSort® extraction program into the generated batch job.

As part of this customized version of Data Integrator, the client was furnished with a utility FRAME application that catalogs which clients are stored on each tape cartridge of selected tape flat-files. This information is used by the SyncSort extraction program so that only the cartridges containing the selected clients' information are mounted and read.

Conclusion

The Data Integrator and Meta-Master represent an object-oriented approach using SAS System software to rapidly develop custom data extraction and manipulation applications. These applications may be used to populate data warehouses or for other analysis and reporting tasks.

Since the analysis data does not need to be retained on disk storage, this approach should be attractive to SAS System users who may be working under hardware or budget constraints yet do not want to sacrifice the "point and click" interfaces possible with SAS System software.

An ancillary benefit of this approach is that a data dictionary is created. This dictionary is a good tool for managing a data warehouse. Because the dictionary is stored as a SAS data set, it is relatively simple to interface it to other SAS applications.

Acknowledgments

SAS and SAS/AF are registered trademarks of SAS Institute Inc., Cary, NC, USA.

SyncSort is a registered trademark of Syncsort Inc.

Data Integrator, Meta-Master, and Data Builder are trademarks of Bassett Consulting Services, Inc.