Basic types of OLAP analysis. OLAP in financial management

In 1993, Edgar Codd, the originator of the relational approach to database design (a mathematician and IBM Fellow), together with his partners published an article commissioned by Arbor Software (today the well-known Hyperion Solutions) entitled "Providing OLAP (On-Line Analytical Processing) to User-Analysts", which formulated 12 features of OLAP technology; these were subsequently supplemented by six more. Together, these provisions became the core of a new and very promising technology.

Main features of OLAP technology (Basic):

  • multidimensional conceptual representation of data;
  • intuitive data manipulation;
  • availability and detail of data;
  • batch extraction vs. interpretive processing;
  • OLAP analysis models;
  • client-server architecture (OLAP accessible from the desktop);
  • transparency (transparent access to external data);
  • multi-user support.

Special Features:

  • treatment of non-normalized data;
  • storing OLAP results separately from the source data;
  • extraction of missing values;
  • treatment of missing values.

Report presentation features:

  • flexible reporting;
  • uniform reporting performance;
  • automatic adjustment of the physical level.

Dimension management:

  • generic dimensionality (universality of dimensions);
  • unlimited number of dimensions and aggregation levels;
  • unlimited operations between dimensions.

Historically, the term "OLAP" has come to imply not only a multidimensional view of the data on the end-user side, but also a multidimensional representation of the data in the target database. This is precisely why "Relational OLAP" (ROLAP) and "Multidimensional OLAP" (MOLAP) emerged as terms in their own right.

An OLAP service is a tool for analyzing large volumes of data in real time. Interacting with an OLAP system, the user can flexibly view information, obtain arbitrary data slices, and perform the analytical operations of drill-down, roll-up, end-to-end distribution, and comparison over time across many parameters simultaneously. All work with the OLAP system takes place in terms of the subject area and allows statistically sound models of the business situation to be built.

OLAP software is a tool for the operational analysis of data held in a warehouse. Its main feature is that it is aimed not at information-technology specialists or expert statisticians, but at professionals in the applied field of management: managers of departments and divisions and, ultimately, directors. The tools are designed to let the analyst communicate with the problem, not with the computer. Fig. 6.14 shows an elementary OLAP cube that allows data to be evaluated along three dimensions.


A multidimensional OLAP cube, together with a system of corresponding mathematical algorithms for statistical processing, allows data of any complexity to be analyzed over any time interval.

Fig. 6.14. An elementary OLAP cube

Having flexible mechanisms for data manipulation and visual display at his disposal (Fig. 6.15, Fig. 6.16), the manager first examines, from different angles, data that may (or may not) be related to the problem at hand.

Next, he compares various business indicators with one another, trying to uncover hidden relationships; he can look at the data more closely and in detail, for example breaking it down into components by time, region or customer, or, conversely, generalize the presentation further to remove distracting details. After that, using a statistical evaluation and simulation module, several scenarios are constructed and the most acceptable one is selected.
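
These manipulations can be sketched in a few lines of pandas, used here as a stand-in for an OLAP front end; the table, column names and figures are invented for illustration:

```python
import pandas as pd

# Hypothetical fact table: one row per sale, three dimensions and one measure.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["A", "B", "A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "amount":  [100, 150, 120, 80, 90, 200],
})

# "Rotate" the data: regions as rows, quarters as columns.
by_region = sales.pivot_table(index="region", columns="quarter",
                              values="amount", aggfunc="sum")

# Drill down: break the regional totals into components by product.
detailed = sales.pivot_table(index=["region", "product"], columns="quarter",
                             values="amount", aggfunc="sum")

# Roll up: generalize away the distracting detail again.
total_by_quarter = sales.groupby("quarter")["amount"].sum()

print(by_region, detailed, total_by_quarter, sep="\n\n")
```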

Fig. 6.15.

A company manager, for example, may have a hypothesis that the spread in asset growth across the company's branches depends on the ratio of specialists with technical and economic backgrounds working in them. To test this hypothesis, the manager can request from the warehouse, and display on a graph, this ratio for those branches whose asset growth in the current quarter fell by more than 10% against last year, and for those where it rose by more than 25%; he should be able to do so with a simple menu selection. If the results obtained split noticeably into two corresponding groups, this should become a stimulus for further testing of the hypothesis.

Currently, a direction called dynamic simulation, which fully implements the FASMI principle described below, is developing rapidly.

Using dynamic modeling, the analyst builds a model of a business situation that develops over time, according to a certain scenario. Moreover, the result of such modeling can be several new business situations, generating a tree of possible solutions with an assessment of the probability and prospects of each.

Fig. 6.16. An analytical IS for data extraction, processing and presentation of information

Table 6.3 shows the comparative characteristics of static and dynamic analysis.


1. The place of OLAP in the information structure of the enterprise.

The term "OLAP" is inextricably linked with the term "data warehouse" (Data Warehouse).

The data in the warehouse comes from operational systems (OLTP systems), which are designed to automate business processes. The warehouse can also be replenished from external sources, such as statistical reports.

The purpose of the warehouse is to provide the "raw material" for analysis in one place and in a simple, understandable structure.

There is one more reason justifying a separate warehouse: complex analytical queries against operational data slow down the company's current work, locking tables for long periods and seizing server resources.

A warehouse does not necessarily mean a gigantic accumulation of data; the main thing is that it is convenient for analysis.

Centralization and convenient structuring are not all an analyst needs. He also requires a tool for viewing and visualizing the information. Traditional reports, even those built on a single warehouse, lack one thing: flexibility. They cannot be "rotated", "expanded" or "collapsed" to obtain the desired view of the data. A tool that allows data to be expanded and collapsed simply and conveniently is exactly what is needed, and OLAP acts as such a tool.

Although OLAP is not a necessary attribute of a data warehouse, it is increasingly being used to analyze the information accumulated in the warehouse.

The place of OLAP in the information structure of the enterprise (Fig. 1).

Fig. 1. The place of OLAP in the information structure of the enterprise

Operational data is collected from various sources, cleansed, integrated and placed in a relational warehouse, where it is already available for analysis with various reporting tools. Then the data (in whole or in part) is prepared for OLAP analysis: it may be loaded into a dedicated OLAP database or left in the relational store. The most important element of the warehouse is its metadata, i.e. information about the structure, placement and transformation of the data; it is thanks to metadata that the various warehouse components interact effectively.

To summarize, we can define OLAP as a set of tools for multidimensional analysis of data accumulated in a warehouse.

2. Operational analytical data processing.

The OLAP concept is based on the principle of multidimensional data representation. In 1993, E. F. Codd examined the shortcomings of the relational model, pointing out above all the impossibility of "merging, viewing and analyzing data in terms of multiple dimensions, that is, in the way most understandable to enterprise analysts", and defined general requirements for OLAP systems that extend the functionality of relational DBMSs to include multidimensional analysis among their characteristics.

According to Codd, a multi-dimensional conceptual view is a multiple perspective consisting of several independent dimensions along which specific sets of data can be analyzed.

Simultaneous analysis across multiple dimensions is defined as multidimensional analysis. Each dimension includes consolidation paths made up of a series of successive generalization levels, where each higher level corresponds to a greater degree of data aggregation along that dimension.

Thus, the Performer dimension can be defined by a consolidation path consisting of the generalization levels "enterprise - division - department - employee". The Time dimension may even include two consolidation paths, "year - quarter - month - day" and "week - day", since counting time by months and by weeks is incompatible. It then becomes possible to select an arbitrary level of detail along each dimension.

The drill-down operation corresponds to movement from higher stages of consolidation to lower ones; conversely, the roll-up operation means movement from lower levels to higher ones (Fig. 2).


Fig. 2. Dimensions and directions of data consolidation
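
These two operations can be sketched as movement between levels of a consolidation path. A toy Python illustration, with invented hierarchy levels and figures:

```python
import pandas as pd

# Daily facts carrying the "year - quarter - month" consolidation path.
facts = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May"],
    "amount":  [10, 20, 30, 40],
})

path = ["year", "quarter", "month"]   # higher -> lower levels

def roll_up(df, level):
    """Aggregate facts up to the given level of the consolidation path."""
    keys = path[:path.index(level) + 1]
    return df.groupby(keys)["amount"].sum()

print(roll_up(facts, "quarter"))   # rolling up: quarterly totals
print(roll_up(facts, "month"))     # drilling down: monthly detail
```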

3. Requirements for online analytical processing tools.

The multidimensional approach arose almost simultaneously with, and in parallel to, the relational one. However, only from the mid-nineties, or more precisely from 1993, did interest in multidimensional DBMSs become widespread. That year saw a new programmatic article by E. Codd, one of the founders of the relational approach, in which he formulated 12 basic requirements for OLAP implementations (Table 1).

Table 1. Codd's 12 requirements for OLAP tools

1. Multidimensional data representation. Tools must support a conceptually multidimensional view of the data.

2. Transparency. The user does not need to know which specific tools are used to store and process the data, how the data is organized, or where it comes from.

3. Accessibility. The tools must themselves select the best data source, and connect to it, to form the answer to a given query; they should automatically map their own logical schema onto heterogeneous data sources.

4. Consistent performance. Performance should be virtually independent of the number of dimensions in the query.

5. Client-server architecture support. The tools must work in a client-server architecture.

6. Equality of all dimensions. No dimension should be privileged; all dimensions must be equal (symmetric).

7. Dynamic processing of sparse matrices. Undefined values must be stored and handled in the most efficient way possible.

8. Multi-user support. The tools must allow more than one user to work with the data.

9. Support for operations across different dimensions. All multidimensional operations (such as aggregation) must be applied uniformly and consistently to any number of dimensions.

10. Ease of data manipulation. The tools should have the most convenient, natural and comfortable user interface.

11. Advanced data presentation tools. The tools must support various ways of visualizing (presenting) the data.

12. Unlimited number of dimensions and levels of data aggregation. There should be no limit on the number of supported dimensions.

Rules for evaluating OLAP-class software products

This set of requirements, which served as a de facto definition of OLAP, should be treated as a guideline; specific products should be assessed by how closely they approach full compliance with all the requirements.

Codd's definition was later reworked into the so-called FASMI test, which requires that an OLAP application provide fast analysis of shared multidimensional information.

Remembering Codd's 12 rules is too burdensome for most people, and it turned out that the definition of OLAP can be summarized in just five key words: Fast Analysis of Shared Multidimensional Information, or FASMI for short.

This definition was first formulated in early 1995 and has not needed to be revised since then.

FAST means that the system should be able to deliver most responses to users within approximately five seconds, with the simplest queries handled within one second and very few taking more than 20 seconds. Research has shown that end users perceive a process as a failure if no results arrive within 30 seconds.

At first glance it may seem surprising, but when a report that not long ago took days is obtained in a minute, the user quickly grows bored waiting, and the project proves far less successful than one giving an instant response, even at the cost of less detailed analysis.

ANALYSIS means that the system can cope with any logical and statistical analysis characteristic of the application and ensures that results are saved in a form accessible to the end user.

It does not matter much whether the analysis is performed in the vendor's own tools or in a related external software product such as a spreadsheet; what matters is that all the required analysis functionality is provided in a way intuitive to end users. Analysis tools might include procedures such as time-series analysis, cost allocation, currency conversion, goal seeking, modification of multidimensional structures, non-procedural modeling, exception detection and data extraction, among other application-dependent operations. Such capabilities vary widely between products depending on their target orientation.

SHARED means that the system implements all privacy and security requirements (possibly down to the cell level) and, where multiple write access is required, locks modifications at the appropriate level. Not all applications need write-back, but the number of such applications is growing, and the system must be able to handle multiple modifications in a timely, secure manner.

MULTIDIMENSIONAL is the key requirement. If OLAP had to be defined in a single word, this would be it. The system must provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies, since this is clearly the most logical way to analyze businesses and organizations. There is no minimum number of dimensions that must be handled, as this too depends on the application, and most OLAP products have enough dimensions for the markets at which they are aimed.

INFORMATION is everything: the necessary information must be obtainable wherever it is needed. However, much depends on the application. The power of a product is measured by how much input data it can process, not by how many gigabytes it can store, and this power varies widely: the largest OLAP products can handle at least a thousand times more data than the smallest. There are many factors to consider here, including data duplication, RAM requirements, disk space usage, performance metrics, integration with information warehouses, and so on.

The FASMI test is a reasonable and understandable definition of the goals that OLAP is aimed at achieving.

4. Classification of OLAP products.

So, the essence of OLAP is that the initial information for analysis is presented as a multidimensional cube, which can be manipulated arbitrarily to obtain the required information slices, i.e. reports. The end user sees the cube as a multidimensional dynamic table that automatically summarizes data (facts) along various dimensions and allows interactive control over calculations and report layout. These operations are performed by the OLAP engine (the OLAP calculation engine).

Today, many products implementing OLAP technologies have been developed around the world. To make it easier to navigate among them, classifications of OLAP products are used: by the method of storing the data for analysis and by the location of the OLAP engine. Let's take a closer look at each category of OLAP products.

Classification by data storage method

Multidimensional cubes are built from source and aggregate data. Both source and aggregate data for cubes can be stored in either relational or multidimensional databases. Three methods of data storage are therefore currently used: MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP) and HOLAP (Hybrid OLAP). Accordingly, by data storage method, OLAP products fall into three similar categories:

1. In the case of MOLAP, source and aggregate data are stored in a multidimensional database or in a multidimensional local cube.

2. In ROLAP products, source data is stored in relational databases or in flat local tables on a file server; aggregate data can be placed in service tables in the same database. Conversion of the data from the relational database into multidimensional cubes is performed at the request of the OLAP tool.

3. With the HOLAP architecture, the source data remains in the relational database while the aggregates are placed in a multidimensional one. The OLAP cube is built at the request of the OLAP tool from both relational and multidimensional data.

Classification by location of the OLAP engine.

On this basis, OLAP products are divided into OLAP servers and OLAP clients:

· In server OLAP tools, calculation and storage of aggregate data are performed by a separate process, the server. The client application receives only the results of queries against the multidimensional cubes stored on the server. Some OLAP servers support data storage only in relational databases, some only in multidimensional ones. Many modern OLAP servers support all three data storage methods: MOLAP, ROLAP and HOLAP.

MOLAP.

MOLAP stands for Multidimensional On-Line Analytical Processing. It means the server uses a multidimensional database (MDB) to store data. The point of using an MDB is clear: it can efficiently store data that is multidimensional by nature, providing fast servicing of database queries. Data is transferred from the source into the multidimensional database and the database is then aggregated. Pre-calculation is what speeds up OLAP queries, because the summary data has already been computed: query time becomes a function solely of the time required to access a single piece of data and perform the calculation. This method supports the principle that the work is done once and the results are then used again and again. Multidimensional databases are a relatively young technology, and using an MDB carries the same disadvantages as most new technologies: they are not as stable as relational databases (RDBs), nor optimized to the same extent. Another weakness of MDBs is that most of them cannot be used while data is being loaded and aggregated, so it takes time before new information becomes available for analysis.
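
The effect of pre-calculation can be illustrated with a toy MOLAP-style store in which every query reduces to a dictionary lookup; the structure and the "*" marker for "all members" are purely illustrative, not any vendor's format:

```python
# Invented base cells: (product, region) -> sales amount.
base = {("A", "North"): 100, ("A", "South"): 120,
        ("B", "North"): 150, ("B", "South"): 80}

# Aggregate once at load time, using "*" as the "all members" marker.
cube = dict(base)
for p, r in list(base):
    for key in [("*", r), (p, "*"), ("*", "*")]:
        cube[key] = cube.get(key, 0) + base[(p, r)]

# Query time is now a single lookup, not a scan of the detail data.
print(cube[("A", "*")])    # total for product A across all regions -> 220
print(cube[("*", "*")])    # grand total -> 450
```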

ROLAP.

ROLAP stands for Relational On-Line Analytical Processing. The term means that the OLAP server is based on a relational database. Source data is entered into the relational database, typically in a star or snowflake schema, which helps reduce retrieval time. The server then provides a multidimensional data model by means of optimized SQL queries.

There are a number of reasons for choosing a relational rather than a multidimensional database. RDBs are a well-established technology with many opportunities for optimization, and real-world use has produced a more refined product. In addition, RDBs support larger data volumes than MDBs and are designed precisely for such volumes. The main argument against RDBs is the complexity of the SQL queries required to retrieve information from a large database: an inexperienced SQL programmer could easily tie up valuable system resources attempting a query of this kind, which would be far easier to execute against an MDB.
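
How a multidimensional request might be translated into SQL over a star schema can be sketched with sqlite3; the schema, data and query below are a simplified illustration, not the output of any particular ROLAP server:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Star schema: one fact table referencing two dimension tables.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, grp TEXT);
    CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, month TEXT, quarter TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, time_id INTEGER, amount REAL);

    INSERT INTO dim_product VALUES (1, 'Widget', 'Tools'), (2, 'Gadget', 'Toys');
    INSERT INTO dim_time    VALUES (1, 'Jan', 'Q1'), (2, 'Apr', 'Q2');
    INSERT INTO fact_sales  VALUES (1, 1, 100), (2, 1, 150), (1, 2, 90);
""")

-- = None  # (comment marker above is SQL's; Python code resumes here)

# A "slice" of the conceptual cube becomes a join-and-group-by over the star.
rows = con.execute("""
    SELECT t.quarter, p.grp, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY t.quarter, p.grp
""").fetchall()
print(rows)   # e.g. [('Q1', 'Tools', 100.0), ('Q1', 'Toys', 150.0), ('Q2', 'Tools', 90.0)]
```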

Aggregated/Pre-aggregated data.

Fast query execution is an imperative for OLAP: the ability to manipulate data intuitively requires rapid retrieval of information, and in general the more calculations needed to obtain a piece of information, the slower the response. To keep query times short, pieces of information that are accessed most often but also require calculation are pre-aggregated: they are computed once and then stored in the database as new data. A typical example of data worth pre-calculating is summary data, such as sales figures by month, quarter or year, where the actual entered data is daily figures.

Different vendors use different methods for selecting which parameters to pre-aggregate and how many values to pre-calculate. The aggregation approach affects both database size and query execution time. The more values are pre-calculated, the more likely it is that the user will request one that is already available, and the shorter the response time, since the original values need not be recomputed. However, calculating all possible values is not the best solution: the database size grows dramatically, making it unmanageable, and the aggregation time becomes too long. Moreover, when numeric values are added to the database or changed, this must be reflected in the pre-calculated values that depend on the new data, so refreshing the database can also take a long time when there are many pre-calculated values. Since the database typically runs offline during aggregation, it is desirable that the aggregation time not be too long.
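
The trade-off can be put in numbers. If every possible subtotal were materialized, an upper bound on the number of aggregate cells would be the product, over all dimensions, of the member count plus one (the extra member standing for the "all" total). A back-of-the-envelope sketch with invented dimension sizes:

```python
from math import prod

def max_aggregate_cells(members_per_dimension):
    """Upper bound on cells if every combination of members,
    including an 'all' total per dimension, is pre-calculated."""
    return prod(n + 1 for n in members_per_dimension)

# A modest cube: 1000 products, 100 customers, 36 months, 20 regions.
print(max_aggregate_cells([1000, 100, 36, 20]))   # ~7.9e7 potential cells
```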

An OLAP client is structured differently: the multidimensional cube is built, and the OLAP calculations performed, in the memory of the client computer. OLAP clients are likewise divided into ROLAP and MOLAP, and some support both data access options.

Each of these approaches has its own pros and cons. Contrary to the popular belief in the advantages of server tools over client tools, in a number of cases using an OLAP client can be more efficient and profitable for users than using an OLAP server.

Development of analytical applications using client OLAP tools is a fast process that requires no special training. A user who knows the physical implementation of the database can develop an analytical application independently, without involving an IT specialist.

When using an OLAP server, you need to learn two different systems, sometimes from different vendors: one to create cubes on the server, and another to develop the client application.

The OLAP client provides a single visual interface for describing cubes and setting up user interfaces for them.

So, in what cases can using an OLAP client be more effective and profitable for users than using an OLAP server?

· An OLAP server is economically justified when the data volumes are very large and overwhelming for an OLAP client; otherwise the use of the latter is more reasonable, since the OLAP client combines high performance with low cost.

· Analysts' powerful PCs are another argument in favor of OLAP clients: when an OLAP server is used, that capacity goes unused.

Among the advantages of OLAP clients are the following:

· Implementation and maintenance costs for an OLAP client are significantly lower than those for an OLAP server.

· With an OLAP client whose engine is built in, data is transferred over the network only once; performing OLAP operations generates no new data streams.

5. Operating principles of OLAP clients.

Let's look at the process of creating an OLAP application using a client tool (Fig. 1).

Fig. 1. Creating an OLAP application using a ROLAP client tool

The operating principle of ROLAP clients is the preliminary description of a semantic layer behind which the physical structure of the source data is hidden. Data sources may be local tables or RDBMSs; the list of supported sources is determined by the specific software product. After that, the user can independently manipulate objects he understands in terms of the subject area to create cubes and analytical interfaces.

The operating principle of the OLAP-server client is different: when creating cubes on an OLAP server, the user manipulates physical descriptions of the database, while custom descriptions are created in the cube itself. The OLAP-server client is configured only for the cube.

When the semantic layer is created, the data source tables are described in terms the end user can understand and become "Products" and "Deals". The "ID" field of the "Products" table is renamed "Code", "Name" becomes "Product", and so on.

Then the "Sales" business object is created. A business object is a flat table on whose basis a multidimensional cube is formed. When the business object is created, the "Products" and "Deals" tables are joined on the product "Code" field. Since not all table fields are needed in the report, the business object uses only the "Product", "Date" and "Amount" fields.

In our example, a report on product sales by month was built on the basis of the "Sales" business object.
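
The same steps can be sketched in pandas, with invented data standing in for the "Products" and "Deals" tables:

```python
import pandas as pd

# The two source tables, already renamed in end-user terms.
products = pd.DataFrame({"Code": [1, 2], "Product": ["Tea", "Coffee"]})
deals = pd.DataFrame({
    "Code":   [1, 2, 1, 2],
    "Date":   pd.to_datetime(["2024-01-10", "2024-01-15",
                              "2024-02-03", "2024-02-20"]),
    "Amount": [100, 250, 120, 300],
})

# The "Sales" business object: a flat table joining the sources on Code.
sales = deals.merge(products, on="Code")[["Product", "Date", "Amount"]]

# The report: product sales by month.
sales["Month"] = sales["Date"].dt.to_period("M")
report = sales.pivot_table(index="Product", columns="Month",
                           values="Amount", aggfunc="sum")
print(report)
```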

When working with an interactive report, the user can set filtering and grouping conditions with the same simple mouse movements. At this point, a ROLAP client accesses data in its cache, while an OLAP-server client generates a new query to the multidimensional database. For example, by applying a product filter to the sales report, one can obtain a report on sales of the products of interest.

All OLAP application settings can be stored in a dedicated metadata repository, in the application itself, or in the multidimensional database's system repository; the implementation depends on the specific software product.

What such pre-built applications offer is a standardized view of the interface, predefined functions and structure, and quick solutions for more or less standard situations. Financial packages, for example, are popular: pre-built financial applications let professionals use familiar financial instruments without having to design a database structure or the usual forms and reports.

The Internet is a new form of client. It also bears the stamp of new technologies: a host of Internet solutions differ significantly in their capabilities in general and as OLAP solutions in particular. Generating OLAP reports over the Internet has many advantages, the most significant of which seems to be the absence of any need for specialized software to access the information. This saves the company a great deal of time and money.

6. Selecting an OLAP application architecture.

When implementing an information-analytical system, it is important not to err in choosing the OLAP application architecture. The literal translation of On-Line Analytical Processing, "immediate analytical processing", is often taken to mean that data entering the system is analyzed immediately. This is a misconception: the immediacy of analysis has nothing to do with the real-time freshness of the data in the system. The characteristic refers to the response time of the OLAP system to user queries; meanwhile, the data being analyzed is often a snapshot of information "as of yesterday" if, for example, the warehouse data is refreshed once a day.

In this context, the translation of OLAP as “interactive analytical processing” is more accurate. It is the ability to analyze data in an interactive mode that distinguishes OLAP systems from systems for preparing regulated reports.

Another feature of interactive processing, in the formulation of OLAP's founder E. Codd, is the ability to "combine, view and analyze data from the point of view of multiple dimensions, i.e., in the way most understandable to corporate analysts". Codd himself used the term OLAP to refer exclusively to a specific way of presenting data at the conceptual level: multidimensional. At the physical level, the data may be stored in relational databases, although in practice OLAP tools usually work with multidimensional databases in which the data is organized as a hypercube (Fig. 1).

Fig. 1. OLAP cube (hypercube, metacube)

Moreover, the relevance of this data is determined by the moment the hypercube is filled with new data.

Obviously, the time it takes to create a multidimensional database depends significantly on the volume of data loaded into it, so it is reasonable to limit this volume. But how can one avoid narrowing the possibilities of analysis and depriving the user of access to all the information of interest? There are two alternative paths: Analyze then query and Query then analyze.

Followers of the first path propose loading generalized information into the multidimensional database, for example monthly, quarterly and annual results by department. When detail is needed, the user is invited to generate a report against the relational database containing the required selection, for example by day for a given department, or by month and employee for a selected division.

Proponents of the second path, on the contrary, suggest that the user, first of all, decide on the data that he is going to analyze and load it into a microcube - a small multidimensional database. Both approaches differ at a conceptual level and have their own advantages and disadvantages.

The advantage of the second approach is the "freshness" of the information the user receives in the form of a multidimensional report, a "microcube". The microcube is formed from information just requested from the live relational database. Work with the microcube is interactive: slices of information and their detailing within the microcube are obtained instantly. Another positive point is that the structure and contents of the microcube are designed and filled by the user on the fly, without a database administrator's involvement. However, the approach also suffers from serious shortcomings: the user does not see the big picture and must decide in advance the direction of his investigation. Otherwise the requested microcube may turn out too small, failing to contain all the data of interest, and the user will have to request a new microcube, then another and another. The Query-then-analyze approach is implemented by the BusinessObjects tool from the company of the same name and by the tools of Intersoft Lab's Kontur platform.

With the Analyze-then-query approach, the volume of data loaded into the multidimensional database can be quite large, and filling it must follow a schedule and may take considerable time. These disadvantages pay off later, however, when the user has access to almost all the data he needs in any combination. The source data in the relational database is consulted only as a last resort, when detailed information is needed, for example on a specific invoice.

The operation of a single multidimensional database is practically unaffected by the number of users accessing it: they only read the data available there. In the Query-then-analyze approach, by contrast, the number of microcubes can in the extreme case grow as fast as the number of users.

This approach increases the load on IT services, which must maintain multidimensional databases in addition to relational ones and are responsible for the timely automatic updating of the data in them.

The most prominent representatives of the “Analyze then query” approach are the PowerPlay and Impromptu tools from Cognos.

The choice of approach, and of the tool implementing it, depends above all on the goal pursued: one must always balance budget savings against the quality of service for end users. Bear in mind that, strategically, information-analytical systems are created to achieve competitive advantage, not to avoid automation costs. For example, a corporate information-analytical system can provide necessary, timely and reliable information about the company whose publication will assure potential investors of the company's transparency and predictability, which inevitably becomes a condition of its investment attractiveness.

7. Areas of application of OLAP technologies.

OLAP is applicable wherever there is a task of analyzing multidimensional data. In general, given a data table with at least one descriptive column (a dimension) and one numeric column (measures, or facts), an OLAP tool will usually be an effective means of analysis and reporting.

Let's look at some areas of application of OLAP technologies taken from real life.

1. Sales.

Analysis of the sales structure resolves questions needed for management decisions: changing the product range or prices, closing and opening stores and branches, terminating and signing dealer contracts, running or stopping advertising campaigns, and so on.

2. Procurement.

This task is the reverse of sales analysis. Many enterprises purchase components and materials from suppliers, and trading companies purchase goods for resale. There are many possible tasks in procurement analysis, from planning expenditure on the basis of past experience to monitoring purchasing managers and selecting suppliers.

3. Prices.

The analysis of market prices is closely related to the analysis of purchases. The purpose of this analysis is to optimize costs and select the most profitable offers.

4. Marketing.

By marketing analysis we mean here only the analysis of buyers or clients, the consumers of services. The purpose of this analysis is the correct positioning of a product, identifying buyer groups for targeted advertising, and optimizing the assortment. OLAP's task in this case is to give the user a tool for obtaining, at the speed of thought, answers to the questions that intuitively arise in the course of analysis.

5. Warehouse.

Analysis of the structure of warehouse balances by type of goods, warehouses, analysis of shelf life of goods, analysis of shipments by recipient and many other types of analysis that are important for the enterprise are possible if the organization has warehouse accounting.

6. Cash flow.

This is a whole area of ​​analysis that has many schools and methods. OLAP technology can serve as a tool for implementing or improving these techniques, but not as a replacement for them. Cash turnover of non-cash and cash funds is analyzed in terms of business operations, counterparties, currencies and time in order to optimize flows, ensure liquidity, etc. The composition of measurements strongly depends on the characteristics of the business, industry, and methodology.

7. Budget.

One of the most fertile areas for applying OLAP technologies: not for nothing is no modern budgeting system considered complete without OLAP tools for budget analysis. Most budget reports are easily built on OLAP systems, and they answer a very wide range of questions: analysis of the structure of expenses and income, comparison of expenses on particular items across divisions, analysis of the dynamics and trends of expenses on particular items, and analysis of costs and profit.

8. Accounts.

A classic trial balance, consisting of account numbers with opening balances, turnovers and closing balances, lends itself perfectly to analysis in an OLAP system. Moreover, an OLAP system can automatically and very quickly calculate consolidated balances for a multi-branch organization, balances for the month, quarter and year, aggregated balances over the account hierarchy, and analytical balances based on analytical attributes.

9. Financial reporting.

A technologically well-built reporting system is nothing more than a set of named indicators with date values that must be grouped and summarized in various sections to obtain specific reports. When this is the case, displaying and printing reports is implemented most easily and cheaply in OLAP systems. In any event, an enterprise's internal reporting system is not so conservative that it cannot be restructured to save money on the technical work of report creation and to gain multidimensional operational analysis capabilities.

10. Site traffic.

A web server's log file is multidimensional by nature, and therefore suited to OLAP analysis. The facts are the number of visits, the number of hits, the time spent on a page, and other information available in the log.

11. Production volumes.

This is another example of statistical analysis: one can analyze the volumes of potatoes grown, steel smelted, or goods produced.

12. Consumption of consumables.

Imagine a factory with dozens of workshops using coolants, flushing fluids, oils, rags, sandpaper: hundreds of kinds of consumables. Accurate planning and cost optimization require a thorough analysis of the actual consumption of consumables.

13. Use of premises.

Another type of statistical analysis. Examples: analysis of the workload of classrooms, rented buildings and premises, the use of conference rooms, etc.

14. Personnel turnover at the enterprise.

Analysis of personnel turnover at the enterprise by branches, departments, professions, level of education, gender, age, time.

15. Passenger transportation.

Analysis of the number of tickets sold and amounts by season, direction, type of carriage (class), type of train (airplane).

This list does not exhaust the areas where OLAP technologies apply. As an example, consider OLAP analysis technology in the field of sales.

8. An example of using OLAP technologies for sales analysis.

Designing a multidimensional data representation for OLAP analysis begins with forming a dimension map. For example, in analyzing sales it may be advisable to single out individual parts of the market (developing, stable, large and small consumers, the likelihood of new consumers, etc.) and to estimate sales volumes by product, territory, customer, market segment, sales channel and order size. These directions form the coordinate grid of the multidimensional representation of sales: the structure of its dimensions.

Since the activities of any enterprise take place over time, the first question arising in analysis is the question of the dynamics of business development. Correctly organizing the time axis makes it possible to answer this question well. Typically, the time axis is divided into years, quarters and months; even finer division into weeks and days is possible. The structure of the time dimension is formed with regard to the frequency with which data arrives, and may also be determined by the frequency with which the information is demanded.

The "Product group" dimension should reflect the structure of the goods sold as closely as possible. It is important to keep a certain balance: on the one hand avoiding excessive detail (the number of groups should remain surveyable), on the other not missing a significant market segment.

The "Customers" dimension reflects the structure of sales by territorial and geographical criteria. Each dimension can have its own hierarchies; in this dimension, for example: Countries - Regions - Cities - Clients.

To analyze the performance of departments, a separate dimension should be created. For example, two hierarchy levels can be distinguished, departments and the divisions within them, which should be reflected in the "Divisions" dimension.

In fact, the dimensions “Time”, “Products”, “Customers” quite fully define the space of the subject area.

In addition, it is useful to partition this space into conditional areas based on calculated characteristics, for example ranges of transaction volume in value terms: the whole business can then be divided into a number of cost ranges within which it is conducted. In this example we can confine ourselves to the following indicators: sales amount, number of goods sold, income, number of deals, number of customers, and volume of purchases from manufacturers.
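
Such a dimension map can be written down before any cube is built. A sketch of the structure just described (the level names follow the text; the details are illustrative):

```python
# A declarative dimension map for the sales cube: each dimension lists its
# hierarchy from the most general level to the most detailed one.
dimension_map = {
    "Time":      ["Year", "Quarter", "Month"],
    "Products":  ["Product group", "Product"],
    "Customers": ["Country", "Region", "City", "Client"],
    "Divisions": ["Department", "Division"],
}

measures = ["Sales amount", "Units sold", "Income",
            "Number of deals", "Number of customers", "Purchases volume"]

for dim, levels in dimension_map.items():
    print(f"{dim}: {' -> '.join(levels)}")
```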

The OLAP cube for this analysis will look as follows (Fig. 2):


Fig. 2. OLAP cube for analyzing sales volume

It is precisely this three-dimensional array that is called a cube in OLAP terms. Strictly speaking, mathematically such an array is not always a cube: a true cube must have the same number of elements along every dimension, whereas OLAP cubes have no such restriction. Nor does an OLAP cube have to be three-dimensional: it can be two- or many-dimensional, depending on the problem being solved. Serious OLAP products are designed for around 20 dimensions, while simpler desktop applications support about 6.

Not all elements of the cube must be filled in: if there is no information about sales of Product 2 to Customer 3 in the third quarter, the value in the corresponding cell simply will not be determined.

However, the cube itself is not suitable for direct analysis. While a three-dimensional cube can still be adequately pictured, with six or nineteen dimensions the situation is far worse. Therefore, before use, ordinary two-dimensional tables are extracted from the multidimensional cube. This operation is called "slicing" the cube: the analyst, as it were, cuts the cube along the dimension members of interest, obtains a two-dimensional slice (a report), and works with it. The structure of the report is shown in Fig. 3.

Fig. 3. Structure of the analytical report

Let's slice our OLAP cube and obtain a sales report for the third quarter; it will look like this (Fig. 4).

Fig. 4. Third-quarter sales report

We can slice the cube along another axis and obtain a report on sales of product group 2 over the year (Fig. 5).

Fig. 5. Quarterly sales report for product 2

Similarly, we can analyze relations with Client 4 by slicing the cube along the Clients dimension (Fig. 6).

Fig. 6. Report on deliveries of goods to Client 4

The report can be detailed by month, or restricted to deliveries of goods to a particular branch of the client.
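
Slicing can be sketched with a pandas MultiIndex standing in for the three-dimensional array; the dimension members and figures are invented:

```python
import pandas as pd

# The cube: facts indexed by three dimensions.
index = pd.MultiIndex.from_product(
    [["Product 1", "Product 2"], ["Customer 3", "Customer 4"], ["Q1", "Q2", "Q3"]],
    names=["product", "customer", "quarter"])
cube = pd.Series(range(12), index=index, name="amount")

# Fix the quarter axis at Q3: a two-dimensional products x customers report.
q3_report = cube.xs("Q3", level="quarter").unstack("customer")
print(q3_report)

# Fix the product axis instead: a customer x quarter report for Product 2.
p2_report = cube.xs("Product 2", level="product").unstack("quarter")
print(p2_report)
```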


Course work

discipline: Databases

Subject: OLAP Technology

Completed:

Chizhikov Alexander Alexandrovich

Introduction

1. Classification of OLAP products

2. OLAP client - OLAP server: pros and cons

3. Core OLAP system

3.1 Design principles

Conclusion

List of sources used

Applications

Introduction

It is difficult to find a person in the computer world who does not understand, at least intuitively, what databases are and why they are needed. Unlike traditional relational DBMSs, the concept of OLAP is not so widely known, although almost everyone has surely heard the mysterious term "OLAP cubes". So what is OnLine Analytical Processing?

OLAP is not a single software product, not a programming language, and not even a specific technology. If we try to embrace OLAP in all its manifestations, it is a set of concepts, principles and requirements underlying software products that make it easier for analysts to access data. Although hardly anyone would disagree with such a definition, it is doubtful that it brings non-specialists one iota closer to understanding the subject. Therefore, in seeking to understand OLAP, it is better to take a different path. First, we need to find out why analysts need their access to data specially facilitated.

The fact is that analysts are special consumers of corporate information. The analyst's task is to find patterns in large amounts of data, so he will not pay attention to a single fact; he needs information about hundreds and thousands of events. Incidentally, one of the significant points that led to the emergence of OLAP is productivity and efficiency. Let's imagine what happens when an analyst needs information but the enterprise has no OLAP tools. The analyst independently (which is unlikely) or with a programmer's help composes the appropriate SQL query and receives the data of interest as a report, or exports it to a spreadsheet. A great many problems arise here. First, the analyst is forced to do something other than his job (SQL programming) or to wait for programmers to complete the task for him; all this hurts labor productivity, heart-attack and stroke rates rise, and so on. Second, a single report or table, as a rule, does not satisfy the giants of thought and the fathers of Russian analysis, so the whole procedure has to be repeated again and again. Third, as we have already found out, analysts do not ask about trifles; they need everything at once. This means (although technology is advancing by leaps and bounds) that the corporate relational DBMS server the analyst queries can think deeply and at length, blocking other transactions.

The concept of OLAP appeared precisely to resolve such problems. OLAP cubes are, in essence, meta-reports: by cutting meta-reports (cubes, that is) along dimensions, the analyst obtains the "ordinary" two-dimensional reports that interest him (these are not necessarily reports in the usual sense of the term; what matters is data structures with the same functions). The advantages of cubes are obvious: the data need be requested from the relational DBMS only once, when the cube is built. Since analysts, as a rule, do not work with information that is being supplemented and changed on the fly, a generated cube stays relevant for quite a long time. Thanks to this, not only are interruptions to the relational DBMS server eliminated (there are no queries with thousands and millions of response rows), but the analyst's own speed of access to the data also rises sharply. In addition, as already noted, performance is also improved by calculating subtotals of hierarchies and other aggregate values at cube-building time.

Of course, this kind of performance gain has its price. It is sometimes said that the data structure simply "explodes": an OLAP cube can occupy tens or even hundreds of times more space than the source data.

Now that we understand a little of how OLAP works and what it is for, it is worth formalizing our knowledge somewhat and stating the OLAP criteria without simultaneous translation into ordinary human language. These criteria (12 in all) were formulated in 1993 by E. F. Codd, the creator of the relational DBMS concept and, concurrently, of OLAP. We will not examine them directly, since they were later reworked into the so-called FASMI test, which defines the requirements for OLAP products. FASMI is an acronym of the test's clauses:

Fast. The system must provide a response to a user request in about five seconds on average, with most requests handled within one second and the most complex within twenty seconds. Recent studies have shown that the user begins to doubt a request's success if it takes more than thirty seconds.

Analysis. The system must be able to handle any logical and statistical analysis typical of business applications and ensure that the results are stored in a form accessible to the end user. Analysis tools may include procedures for time-series analysis, cost allocation, currency conversion, modeling changes in organizational structures, and some others.

Shared. The system should provide ample facilities for restricting data access and for simultaneous work by many users.

Multidimensional. The system must provide a conceptually multidimensional view of the data, including full support for multiple hierarchies.

Information. The power of a software product is characterized by the volume of input data it can process. Different OLAP systems have different capacities: advanced OLAP solutions can handle at least a thousand times more data than the least powerful. When choosing an OLAP tool, a number of factors deserve consideration, including data duplication, memory requirements, disk-space usage, performance metrics, integration with information warehouses, and so on.

1. Classification of OLAP products

So, the essence of OLAP is that the initial information for analysis is presented in the form of a multidimensional cube, and it is possible to arbitrarily manipulate it and obtain the necessary information sections - reports. In this case, the end user sees the cube as a multidimensional dynamic table that automatically summarizes data (facts) in various sections (dimensions), and allows interactive management of calculations and report form. These operations are performed by the OLAP engine (or OLAP calculation engine).

Today, many products have been developed around the world that implement OLAP technologies. To make it easier to navigate among them, classifications of OLAP products are used: by the method of storing data for analysis and by the location of the OLAP machine. Let's take a closer look at each category of OLAP products.

I'll start with a classification based on the method of data storage. Let me remind you that multidimensional cubes are built on the basis of source and aggregate data. Both source and aggregate data for cubes can be stored in both relational and multidimensional databases. Therefore, three methods of data storage are currently used: MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP) and HOLAP (Hybrid OLAP). Accordingly, OLAP products are divided into three similar categories based on the method of data storage:

1. In the case of MOLAP, source and aggregate data are stored in a multidimensional database or in a multidimensional local cube.

2. In ROLAP products, source data is stored in relational databases or in flat local tables on a file server. Aggregate data can be placed in service tables in the same database. Conversion of data from a relational database into multidimensional cubes occurs at the request of an OLAP tool.

3. When using the HOLAP architecture, the source data remains in the relational database and the aggregates are placed in the multidimensional one. An OLAP cube is built at the request of an OLAP tool from both relational and multidimensional data.

The next classification is based on the location of the OLAP machine. Based on this feature, OLAP products are divided into OLAP servers and OLAP clients:

In server OLAP tools, calculations and storage of aggregate data are performed by a separate process - the server. The client application receives only the results of queries against multidimensional cubes that are stored on the server. Some OLAP servers support data storage only in relational databases, some only in multidimensional ones. Many modern OLAP servers support all three data storage methods: MOLAP, ROLAP and HOLAP.

The OLAP client is designed differently. The construction of a multidimensional cube and OLAP calculations are performed in the memory of the client computer. OLAP clients are also divided into ROLAP and MOLAP. And some may support both data access options.

Each of these approaches has its own pros and cons. Contrary to popular belief about the advantages of server tools over client tools, in a number of cases, using an OLAP client for users can be more effective and profitable than using an OLAP server.

2. OLAP client - OLAP server: pros and cons

When building an information system, OLAP functionality can be implemented using both server and client OLAP tools. In practice, the choice is a trade-off between performance and software cost.

The volume of data is determined by a combination of characteristics: the number of records, the number of dimensions, the number of dimension members, the length of the dimensions, and the number of facts. It is known that an OLAP server can process larger volumes of data than an OLAP client of equal computer power. This is because the OLAP server stores on hard drives a multidimensional database containing pre-computed cubes.

When performing OLAP operations, client programs execute queries against it in an SQL-like language, receiving not the entire cube but displayed fragments of it. The OLAP client, by contrast, must hold the entire cube in RAM while operating; with a ROLAP architecture, it must first load into memory the whole data array used to calculate the cube. Moreover, as the number of dimensions, facts or dimension members increases, the number of aggregates grows exponentially. Thus, the amount of data an OLAP client can process depends directly on the amount of RAM in the user's PC.

However, note that most OLAP clients provide for distributed computing. The number of processed records that limits a client OLAP tool is then understood not as the volume of primary data in the corporate database, but as the size of the aggregated sample drawn from it. The OLAP client sends the DBMS a query describing the filtering conditions and the algorithm for pre-grouping the primary data; the server finds and groups the records and returns a compact selection for further OLAP calculations. This sample can be tens or hundreds of times smaller than the volume of primary, non-aggregated records. Consequently, the OLAP client's demand for PC resources drops significantly.
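
A sketch of this scheme with sqlite3: the client pushes filtering and pre-grouping down to the DBMS and receives a compact aggregated sample (the data volumes are invented for illustration):

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day TEXT, product TEXT, amount REAL)")

# 100,000 invented primary records spread over 12 days and 5 products.
rows = [(f"2024-{1 + i % 12:02d}-01", f"P{i % 5}", random.uniform(1, 100))
        for i in range(100_000)]
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# The client describes filtering and pre-grouping; the server does the work.
sample = con.execute("""
    SELECT day, product, SUM(amount)
    FROM sales
    WHERE product IN ('P1', 'P2')
    GROUP BY day, product
""").fetchall()

# The compact aggregated sample is what travels to the client's RAM.
print(len(rows), "primary records ->", len(sample), "aggregated records")
```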

In addition, the number of dimensions is limited by human perception. It is known that the average person can simultaneously operate with 3-4, at most 8, dimensions; with more dimensions in a dynamic table, the information becomes significantly harder to take in. This factor should be considered when estimating in advance the RAM an OLAP client may require.

The length of the dimensions also affects the size of the OLAP engine's address space when computing an OLAP cube: the longer the dimensions, the more resources are required to pre-sort the multidimensional array, and vice versa. Short dimensions in the source data are thus another argument in favor of the OLAP client.

Performance is determined by the two factors discussed above: the volume of data processed and the power of the computers. As the number of dimensions increases, for example, the performance of all OLAP tools declines because of the sharp growth in the number of aggregates, but the rate of decline differs. Let's demonstrate this dependence on a graph.

Chart 1. Dependence of the performance of client and server OLAP tools on growth in data volume

An OLAP server's speed characteristics are less sensitive to data growth. This is explained by the different technologies with which OLAP servers and OLAP clients process user requests. For example, in a drill-down operation the OLAP server addresses the stored data and "pulls" the data of the relevant "branch", whereas the OLAP client computes the entire set of aggregates at load time. Up to a certain data volume, however, the performance of server and client tools is comparable. For OLAP clients supporting distributed computing, the range of comparable performance can extend to data volumes covering the OLAP analysis needs of a huge number of users. This is confirmed by the results of internal testing of MS OLAP Server and the "Kontur Standard" OLAP client. The test was performed on an IBM PC Pentium Celeron 400 MHz with 256 MB of RAM, on a sample of 1 million unique (i.e., aggregated) records with 7 dimensions containing from 10 to 70 members. Cube loading time in both cases did not exceed 1 second, and the various OLAP operations (drill up, drill down, move, filter, etc.) completed in hundredths of a second.

When the sample size exceeds the available RAM, swapping to disk begins and the OLAP client's performance drops sharply. Only from this point can one speak of an advantage for the OLAP server.

It should be remembered that the “breaking point” determines the limit of a sharp increase in the cost of an OLAP solution. For the tasks of each specific user, this point is easily determined by performance tests of the OLAP client. Such tests can be obtained from the development company.

In addition, the cost of a server-based OLAP solution grows as the number of users increases, because the OLAP server performs calculations for all users on a single computer: the more users, the more RAM and processing power are required. Thus, if the volumes of data being processed lie in the region where server and client systems perform comparably, then, other things being equal, using an OLAP client will be more profitable.

Using an OLAP server in the "classical" ideology involves uploading data from a relational DBMS into a multidimensional database. The upload is performed over a certain period, so the OLAP server's data does not reflect the current state. Only those OLAP servers that support a ROLAP mode of operation are free from this drawback.

Similarly, a number of OLAP clients allow you to implement ROLAP and Desktop architectures with direct access to the database. This ensures on-line analysis of source data.

An OLAP server places minimal requirements on the power of client terminals. Objectively, the requirements of an OLAP client are higher, because it performs calculations in the RAM of the user's PC. The state of a particular organization's hardware fleet is therefore the most important indicator to take into account when choosing an OLAP tool. Here, too, there are "pros" and "cons". An OLAP server does not exploit the considerable computing power of modern personal computers: if an organization already has a fleet of modern PCs, it is inefficient to use them only as display terminals while incurring additional costs for a central server.

If the power of the users' computers "leaves much to be desired," the OLAP client will work slowly or not be able to work at all. Buying one powerful server may be cheaper than upgrading all your PCs.

Here it is useful to take into account trends in hardware development. Since the volume of data for analysis is practically constant, a steady increase in PC power will lead to an expansion of the capabilities of OLAP clients and their displacement of OLAP servers into the segment of very large databases.

When using an OLAP server over the network, only the data to be displayed is transferred to the client's PC, while the OLAP client receives the entire volume of primary data.

Therefore, where an OLAP client is used, network traffic will be higher.

But when an OLAP server is used, user operations, for example drill-down, generate new queries to the multidimensional database and, therefore, new data transfers. The execution of OLAP operations by an OLAP client takes place in RAM and, accordingly, causes no new data flows in the network.

It should also be noted that modern network hardware provides a high level of bandwidth.

Therefore, in the vast majority of cases, analyzing a “medium” sized database using an OLAP client will not slow down the user’s work.

The cost of an OLAP server is quite high. This should also include the cost of a dedicated computer and the ongoing costs of administering a multidimensional database. In addition, the implementation and maintenance of an OLAP server requires fairly highly qualified personnel.

The cost of an OLAP client is an order of magnitude lower than the cost of an OLAP server. No server administration or additional hardware is required, and the personnel qualification requirements for implementing an OLAP client are low. An OLAP client can also be deployed much faster than an OLAP server.

Development of analytical applications with client OLAP tools is a fast process that requires no special training. A user who knows the physical implementation of the database can develop an analytical application independently, without involving an IT specialist. With an OLAP server, by contrast, one has to learn two different systems, sometimes from different vendors: one to create cubes on the server and another to develop the client application. The OLAP client provides a single visual interface for describing cubes and setting up user interfaces for them.

Let's walk through the process of creating an OLAP application using the client tool.

Scheme 2. Creating an OLAP application using a ROLAP client tool

ROLAP clients operate on the basis of a preliminary description of a semantic layer behind which the physical structure of the source data is hidden. The data sources can be local tables, relational DBMSs, and so on; the list of supported sources is determined by the specific software product. After that, the user can independently manipulate objects he understands in terms of the subject area to create cubes and analytical interfaces.

The client of an OLAP server works differently. When creating cubes on an OLAP server, the user manipulates physical descriptions of the database, while user-level descriptions are created in the cube itself. The OLAP server client is configured only for the cube.

Let us explain the operating principle of a ROLAP client using the example of building a dynamic sales report (see Scheme 2). Let the initial data for analysis be stored in two tables: Sales and Deal.

When the semantic layer is created, the data sources - the Sales and Deal tables - are described in terms understandable to the end user and become "Products" and "Deals". The "ID" field of the "Products" table is renamed to "Code", "Name" to "Product", and so on.

Then the Sales business object is created. A business object is a flat table on the basis of which a multidimensional cube is formed. When the business object is created, the "Products" and "Deals" tables are joined on the product "Code" field. Since not all of the table fields are needed in the report, the business object uses only the "Product", "Date" and "Amount" fields.

Next, an OLAP report is created based on the business object. The user selects a business object and drags its attributes onto the column or row areas of the report table. In our example, based on the “Sales” business object, a report on product sales by month was created.

When working with an interactive report, the user can set filtering and grouping conditions with the same simple mouse movements. At this point, the ROLAP client accesses the data in its cache, while the client of an OLAP server generates a new query to the multidimensional database. For example, by applying a product filter in the sales report, we can obtain a report on sales of just the products of interest.

All OLAP application settings can be stored in a dedicated metadata repository, in the application, or in a multidimensional database system repository. Implementation depends on the specific software product.

So, in what cases can using an OLAP client be more effective and profitable for users than using an OLAP server?

Using an OLAP server becomes economically justified when the data volumes are very large and overwhelming for an OLAP client; otherwise the use of the latter is more reasonable. In that case the OLAP client combines high performance with low cost.

Powerful PCs on analysts' desks are another argument in favor of OLAP clients: when an OLAP server is used, these capacities sit idle. Among the advantages of OLAP clients are the following:

The costs of implementing and maintaining an OLAP client are significantly lower than the costs of an OLAP server.

When an OLAP client with an embedded OLAP engine is used, data is transferred over the network once; performing OLAP operations generates no new data streams.

Setting up ROLAP clients is simpler because the intermediate step - creating a multidimensional database - is eliminated.

3. Core OLAP system

3.1 Design principles


From what has already been said it is clear that the OLAP mechanism is one of today's popular methods of data analysis. There are two main approaches to implementing it. The first is Multidimensional OLAP (MOLAP), which implements the mechanism with a multidimensional database on the server side; the second is Relational OLAP (ROLAP), which builds cubes on the fly from SQL queries to a relational DBMS. Each approach has its pros and cons; their comparative analysis is beyond the scope of this work. Here only the implementation of the core of a desktop ROLAP module will be described.

This task arose after experience with a ROLAP system built on the Decision Cube components included in Borland Delphi. Unfortunately, this component set showed poor performance on large volumes of data. The problem can be softened by cutting off as much data as possible before feeding it into the cubes, but this is not always enough.

You can find a lot of information about OLAP systems on the Internet and in the press, but almost nowhere is it said how they work inside.

Scheme of work:

The general scheme of operation of a desktop OLAP system can be represented as follows:

Scheme 3. Operation of a desktop OLAP system

The operating algorithm is as follows:

1. Obtaining data in the form of a flat table or as the result of executing an SQL query.

2. Caching the data and converting it into a multidimensional cube.

3. Displaying the constructed cube using a cross-tab, a chart, etc. In the general case, an arbitrary number of displays can be connected to one cube.

Let's consider how such a system can be arranged internally. We will start from the side that can be seen and touched, that is, from the displays. The displays used in OLAP systems most often come in two types - cross-tabs and charts. Let's look at a crosstab, which is the basic and most common way to display a cube.

In the figure below, the rows and columns containing aggregated results are shown in yellow, the cells containing facts are in light gray, and the cells containing dimensional data are in dark gray.

Thus, the table can be divided into the following elements, which we will work with from here on: the totals rows and columns, the fact cells, and the dimension headers.

When filling out the matrix with facts, we must proceed as follows:

1. Based on the dimension values, determine the coordinates of the element being added in the matrix.

2. Determine the coordinates of the totals columns and rows affected by the added element.

3. Add the element to the matrix and to the corresponding totals columns and rows.

It should be noted that the resulting matrix will be very sparse, which is why organizing it as a two-dimensional array (the option lying on the surface) is not only irrational but most likely impossible: the dimension of such a matrix is so large that no reasonable amount of RAM would suffice to store it. For example, if our cube holds sales information for one year and has only 3 dimensions - Customers (250), Products (500) and Date (365) - we get a fact matrix of 250 x 500 x 365 = 45,625,000 elements, even though the matrix may contain only a few thousand filled elements. Moreover, the greater the number of dimensions, the sparser the matrix will be.

Therefore, to work with this matrix we need to use special mechanisms for handling sparse matrices. Various ways of organizing a sparse matrix are possible; they are described quite well in the programming literature, for example in the first volume of Donald Knuth's classic "The Art of Computer Programming".
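To illustrate the idea, here is a minimal sketch in Delphi (our illustration, not the CubeBase code; the key-packing scheme and names are assumptions): the sparse fact matrix is kept as a hash map from packed cell coordinates to aggregated values, so only non-empty cells consume memory.

// uses System.SysUtils, System.Generics.Collections
var
  Cells: TDictionary<Int64, Double>; // packed (row, col) -> aggregated value
                                     // create once: Cells := TDictionary<Int64, Double>.Create;

function PackKey(Row, Col: Integer): Int64;
begin
  // Pack two 32-bit coordinates into a single 64-bit key.
  Result := (Int64(Row) shl 32) or Cardinal(Col);
end;

procedure AddFact(Row, Col: Integer; Value: Double);
var
  Key: Int64;
  Current: Double;
begin
  Key := PackKey(Row, Col);
  if Cells.TryGetValue(Key, Current) then
    Cells.AddOrSetValue(Key, Current + Value) // the cell accumulates a sum
  else
    Cells.Add(Key, Value);
end;

With 45,625,000 potential cells but only a few thousand filled ones, such a map stores just the filled cells plus hash-table overhead.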

Let us now consider how we can determine the coordinates of a fact, knowing the dimensions corresponding to it. To do this, let's take a closer look at the header structure:

In this case it is easy to find a way to determine the number of the cell corresponding to an element and of the totals cells it falls into. Several approaches can be proposed here. One is to use a tree to search for the matching cells; this tree can be built while traversing the selection. Alternatively, an analytical recurrence formula for computing the required coordinate can easily be derived.
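For instance (our reconstruction, not a formula given in the text): if the header is built over $k$ dimensions and the $i$-th of them has $N_i$ members, a row-major linearization assigns a column with coordinates $c_1, \dots, c_k$, where $0 \le c_i < N_i$, the number

$$idx = \sum_{i=1}^{k} c_i \prod_{j=i+1}^{k} N_j,$$

which can be computed by the recurrence $idx_0 = 0$, $idx_i = idx_{i-1} N_i + c_i$; the totals columns would need extra slots on top of this scheme.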

The data stored in the table must be transformed before it can be used. Thus, to improve performance when building the hypercube, it is desirable to find the unique elements stored in the columns that serve as cube dimensions. In addition, facts can be pre-aggregated for records with identical dimension values. As mentioned above, it is the unique values in the dimension fields that matter to us. The following structure can then be proposed for storing them:

Scheme 4. Structure for storing unique values

Using this structure significantly reduces the memory requirement, which matters because, to increase operating speed, it is advisable to keep the data in RAM. Moreover, only the array of elements can be kept there, while their values are dumped to disk, since we will need them only when the cross-tab is displayed.
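A minimal sketch of such a dictionary (our illustration of the idea behind Scheme 4, with assumed names): the unique values of a column are collected and sorted, so that an element's index in the array serves directly as its coordinate, and lookup is a binary search.

// uses System.Generics.Collections
// Builds a sorted array of the unique values of one source column;
// the element's index in the array then serves as its coordinate.
function BuildDictionary(const Column: array of string): TArray<string>;
var
  Seen: TDictionary<string, Boolean>;
  List: TList<string>;
  S: string;
begin
  Seen := TDictionary<string, Boolean>.Create;
  List := TList<string>.Create;
  try
    for S in Column do
      if not Seen.ContainsKey(S) then
      begin
        Seen.Add(S, True);
        List.Add(S);
      end;
    List.Sort; // alphabetical order makes the numbering unambiguous
    Result := List.ToArray;
  finally
    List.Free;
    Seen.Free;
  end;
end;

// Coordinate of a value = its position in the sorted array.
function CoordinateOf(const Dict: TArray<string>; const Value: string): Integer;
begin
  if not TArray.BinarySearch<string>(Dict, Value, Result) then
    Result := -1; // the value is absent from the dictionary
end;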

The ideas described above were the basis for creating the CubeBase component library.

Scheme 5. Structure of the CubeBase component library

TCubeSource performs caching and conversion of data into the internal format, as well as preliminary data aggregation. The TCubeEngine component performs hypercube calculations and operations on it; in effect, it is the OLAP engine that transforms a flat table into a multidimensional data set. The TCubeGrid component displays the crosstab and controls the display of the hypercube. TCubeChart allows the hypercube to be viewed as charts, and the TCubePivote component controls the operation of the cube core.

So, we have looked at the architecture and interaction of the components from which an OLAP engine can be built. Now let us take a closer look at the internal structure of the components.

The first stage of the system's operation is loading the data and converting it into an internal format. A logical question is why this is necessary, since one could simply use the data from a flat table, scanning it when constructing a cube slice. To answer it, let us look at the table structure from the point of view of an OLAP engine. For an OLAP system, table columns are either facts or dimensions, but the logic for working with these columns differs. In a hypercube the dimensions are the axes, and the dimension values are coordinates on those axes. The cube is filled very unevenly: there are coordinate combinations that correspond to no records, and combinations that correspond to several records in the original table, with the first situation the more common; the cube resembles the universe - empty space with clusters of points (facts) scattered through it.

Thus, if during the initial load we pre-aggregate the data, that is, combine records with identical dimension values while computing preliminary aggregated fact values, then afterwards we work with fewer records, which increases speed and reduces RAM requirements.
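A sketch of such pre-aggregation (our illustration; the record layout and names are invented for the example): records with identical coordinate tuples are merged, and the fact is accumulated as a sum.

// uses System.SysUtils, System.Generics.Collections
type
  TSourceRecord = record
    DimCoords: TArray<Integer>; // the record's coordinate in each dimension
    Fact: Double;               // the measure to be aggregated
  end;

// Builds a text key from the coordinate tuple so that equal tuples collide.
function CoordKey(const Coords: TArray<Integer>): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(Coords) do
    Result := Result + IntToStr(Coords[I]) + ';';
end;

// Merges records with identical dimension coordinates, summing the facts.
function PreAggregate(const Source: TArray<TSourceRecord>): TArray<TSourceRecord>;
var
  Index: TDictionary<string, Integer>; // coordinate key -> position in Res
  Res: TList<TSourceRecord>;
  Rec, Merged: TSourceRecord;
  Pos: Integer;
begin
  Index := TDictionary<string, Integer>.Create;
  Res := TList<TSourceRecord>.Create;
  try
    for Rec in Source do
      if Index.TryGetValue(CoordKey(Rec.DimCoords), Pos) then
      begin
        Merged := Res[Pos];                    // same coordinates already seen:
        Merged.Fact := Merged.Fact + Rec.Fact; // accumulate the fact
        Res[Pos] := Merged;
      end
      else
        Index.Add(CoordKey(Rec.DimCoords), Res.Add(Rec));
    Result := Res.ToArray;
  finally
    Res.Free;
    Index.Free;
  end;
end;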

To build slices of the hypercube we need two capabilities: determining the coordinates (that is, the dimension values) for table records, and determining the records that have specific coordinates (dimension values). Let us consider how these capabilities can be implemented. The easiest way to store the hypercube is to use a database in its own internal format.

Schematically, the transformations can be represented as follows:

Scheme 6. Converting the source table into a normalized database (the internal format)

That is, instead of one table we have obtained a normalized database. Database specialists may say that normalization reduces system speed, and they would certainly be right when we need to obtain the values of the dictionary elements (in our case, the dimension values). But the point is that at the slice-construction stage we do not need these values at all. As said above, we are interested only in coordinates in the hypercube, so we will define coordinates for the dimension values. The simplest approach is to renumber the element values. To make the numbering unambiguous within one dimension, we first sort the lists of dimension values (the dictionaries, in database terms) alphabetically. In addition, we renumber the facts, the facts having been pre-aggregated. We obtain the following scheme:

Scheme 7. Renumbering the normalized database to determine the coordinates of dimension values

Now all that remains is to link the elements of the different tables to each other. In relational database theory this is done with special intermediate tables. For us it is enough to associate with each entry in the dimension tables a list whose elements are the numbers of the facts in whose formation these dimension values participated (that is, to determine all facts having the same value of the coordinate described by this dimension). Each fact record, in turn, is matched with the values of the coordinates at which it lies in the hypercube. From here on, the coordinates of a record in the hypercube will be understood as the numbers of the corresponding records in the tables of dimension values. For our hypothetical example we then obtain the following set defining the internal representation of the hypercube:

Scheme 8. Internal representation of a hypercube

This will be our internal representation of the hypercube. Since we are not building it for a relational database, we simply use variable-length fields to link dimension values to facts (we could not do this in an RDB, since the number of table columns there is fixed in advance).

We could try to implement the hypercube on a set of temporary tables, but this method gives too low a performance (the Decision Cube components are an example), so we will use our own data storage structures.

To implement the hypercube we need data structures that ensure maximum performance and minimal RAM consumption. Obviously, the main structures will be those for storing the dictionaries and the fact table. Let us consider the tasks a dictionary must perform at maximum speed:

checking the presence of an element in the dictionary;

adding an element to the dictionary;

search for the numbers of records that have a specific coordinate value;

search for a coordinate by a dimension value;

search for a dimension value by its coordinate.

Various types of data structures can be used to meet these requirements. For example, arrays of records can be used. In a real case these arrays require additional indexing mechanisms that increase the speed of data loading and information retrieval.

To optimize the operation of the hypercube it is necessary to determine which tasks must be solved first and by what criteria the quality of work should be improved. The main thing for us is to increase program speed, while it is desirable to require not too much RAM. A speedup is possible through additional mechanisms for accessing data, for example indexing; unfortunately, this increases the RAM overhead. So let us determine which operations need to be performed at the highest speed. To do so, consider the individual components implementing the hypercube. These components are of two main kinds: the dimension and the fact table. Typical tasks for a dimension are:

adding a new value;

determining the coordinate from a dimension value;

determining the dimension value from a coordinate.

When adding a new element value, we must check whether such a value already exists; if so, we do not add a new one but use the existing coordinate, otherwise we add the element and determine its coordinate. A way to quickly check for the presence of the desired element is therefore needed (the same problem arises when determining a coordinate from an element's value). Hashing is optimal for this purpose, and the best structure is a hash tree in which we store references to the elements, the elements being the rows of the dimension dictionary. The structure of a dimension value can then be represented as follows:

PFactLink = ^TFactLink;

TFactLink = record
  FactNo: integer;   // fact index in the fact table
  Next: PFactLink;   // next item of the list (this field is implied by the text: without it the list cannot be traversed)
end;

TDimensionRecord = record
  Value: string;       // dimension value
  Index: integer;      // coordinate value
  FactLink: PFactLink; // pointer to the head of the list of fact table elements
end;

References to the unique elements will be stored in the hash tree. In addition, we need to solve the inverse problem: determining the dimension value from a coordinate. To ensure maximum performance, direct addressing should be used, so another array can be kept whose index is the dimension coordinate and whose value is a link to the corresponding dictionary entry. However, it is simpler (and saves memory) to order the array of elements itself so that an element's index is its coordinate.
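In code, the "add a value" task from the list above might look like this (a sketch; a TDictionary hash map stands in for the hash tree, and the element array doubles as the direct-addressing structure):

// uses System.Generics.Collections
// Returns the coordinate of Value, adding it as a new element if necessary.
function AddValue(Dict: TDictionary<string, Integer>;
                  Elements: TList<string>; const Value: string): Integer;
begin
  if not Dict.TryGetValue(Value, Result) then
  begin
    Result := Elements.Add(Value); // new coordinate = position in the element array
    Dict.Add(Value, Result);
  end;
end;

// The inverse transformation is then direct addressing:
// Elements[Coordinate] yields the dimension value at a given coordinate in O(1).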

Organizing the array that implements the list of facts presents no particular problems thanks to its simple structure. The only remark is that it is advisable to compute all the aggregations that may be needed and that can be calculated incrementally (for example, the sum).

So, we have described a way of storing data in the form of a hypercube. It allows us to generate a set of points in multidimensional space based on information held in the data warehouse. For a person to be able to work with these data, they must be presented in a form convenient for processing. The main forms of presentation are the pivot table and charts, both of which are in fact projections of the hypercube. To ensure maximum efficiency in constructing the representations, let us start from what these projections are. We begin with the pivot table, as the most important for data analysis.

Let us find ways to implement such a structure. A pivot table has three parts: row headers, column headers, and the table of aggregated fact values itself. The simplest way to represent the fact table is a two-dimensional array whose dimensions can be determined by building the headers. Unfortunately, the simplest way is also the most inefficient: since the table is very sparse, memory would be used extremely wastefully, and only very small cubes could be built before memory runs out. Thus, we need a data structure that ensures maximum speed of searching/adding a new element with minimum RAM consumption. This is the so-called sparse matrix, about which more can be read in Knuth. There are various ways to organize the matrix; to choose the one that suits us, let us first consider the structure of the table headers.

Headers have a clear hierarchical structure, so it is natural to use a tree to store them. The structure of a tree node is shown schematically in Appendix C.

In this node it is logical to store, as the dimension value, a link to the corresponding element of the dimension table of the multidimensional cube. This reduces the memory cost of storing the slice and speeds up the work. Links are also used for the parent and child nodes.
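A node of such a header tree might look roughly as follows (our reconstruction of the structure referred to in Appendix C; the field names are assumptions, and children are kept as a first-child/next-sibling list):

type
  PHeaderNode = ^THeaderNode;
  THeaderNode = record
    ValueIndex: Integer;      // coordinate of the dimension value (a link into the dictionary)
    Parent: PHeaderNode;      // parent node in the header hierarchy
    FirstChild: PHeaderNode;  // first node of the next hierarchy level
    NextSibling: PHeaderNode; // next node on the same level
  end;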

To add an element to the tree, we must have information about its location in the hypercube; as this information we use its coordinates, which are stored in the dictionaries of dimension values. Let us consider the scheme of adding an element to the header tree of a pivot table. The values of the dimension coordinates serve as the initial information. The order in which the dimensions are listed is determined by the desired aggregation method and matches the hierarchy levels of the header tree. As a result, we must obtain the list of the pivot table columns or rows to which the element is to be added (the procedure itself is shown in Appendix D).

We use the dimension coordinates as the initial data for determining this structure. For definiteness, we will assume that we are determining the column of interest in the matrix (how to determine a row will be considered a little later, since other data structures are more convenient there; the reasons for this choice are discussed below). As coordinates we take integers - the numbers of dimension values, determined as described above.

So, after performing this procedure, we obtain an array of references to the columns of the sparse matrix. Now we need to perform the necessary actions with the rows: inside each column, find the required element and add the corresponding value to it. For each dimension in the collection, we need to know the number of unique values and the actual set of those values.

Now let us look at how the values inside the columns should be represented, that is, how to determine the required row. Several approaches are possible. The simplest would be to represent each column as a vector, but since it will be very sparse, memory would be used extremely inefficiently. To avoid this, we will use data structures that represent sparse one-dimensional arrays (vectors) more efficiently. The simplest of them would be an ordinary singly or doubly linked list, but it is uneconomical for element access, so we will use a tree, which provides faster access to elements.

For example, we could use exactly the same tree as for the columns, but then we would have to create a separate tree for each column, which would lead to significant memory and processing-time overhead. Let us be a little more cunning: we create one tree for storing all the combinations of dimension values used in rows. This tree is identical to the one described above, but its elements are not pointers to rows (which do not exist as such) but row indices; the index values themselves are of no interest to us and are used only as unique keys. We then use these keys to find the needed element within a column. The columns themselves are most easily represented as ordinary binary trees. Graphically, the resulting structure can be represented as follows:

Scheme 9. Representation of a pivot table as binary trees
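In code, a sparse column keyed by row index might be stored as follows (a sketch under the assumptions above; an unbalanced binary search tree is used for brevity):

type
  PColumnNode = ^TColumnNode;
  TColumnNode = record
    RowKey: Integer;          // unique row index issued by the row header tree
    Value: Double;            // aggregated cell value
    Left, Right: PColumnNode; // binary search tree links
  end;

// Finds the cell with the given row key, creating it with a zero value if absent.
function FindOrAdd(var Root: PColumnNode; RowKey: Integer): PColumnNode;
begin
  if Root = nil then
  begin
    New(Root);
    Root^.RowKey := RowKey;
    Root^.Value := 0.0;
    Root^.Left := nil;
    Root^.Right := nil;
    Exit(Root);
  end;
  if RowKey = Root^.RowKey then
    Result := Root
  else if RowKey < Root^.RowKey then
    Result := FindOrAdd(Root^.Left, RowKey)
  else
    Result := FindOrAdd(Root^.Right, RowKey);
end;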

To determine the appropriate row numbers, we can use the same procedure as described above for determining the pivot table columns. The row numbers are unique within one pivot table and identify elements in the vectors that are the columns of the pivot table. The simplest way to generate these numbers is to keep a counter and increment it by one whenever a new element is added to the row header tree. The column vectors themselves are most easily stored as binary trees in which the row number value is used as the key; hash tables can also be used. Since procedures for working with such trees are discussed in detail in other sources, we will not dwell on them and will consider the general scheme of adding an element to a column.

In general, the sequence of actions for adding an element to the matrix can be described as follows:

1. Determine the row numbers to which elements are added.

2. Determine the set of columns to which elements are added.

3. For all columns, find the elements with the required row numbers and add the current element to them (adding means attaching the required number of fact values and computing the aggregated values, which can be determined incrementally).

After executing this algorithm, we obtain the matrix that is the pivot table we set out to build; a schematic code sketch of the procedure follows below.
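Putting the steps together (a schematic sketch reusing FindOrAdd from the previous fragment; not the CubeBase code): for one cached record we take its row key and the indexes of all affected columns - the data column plus the totals columns - and update each cell incrementally.

// Adds one fact both to its own cell and to every affected totals cell.
procedure AddToMatrix(var Columns: TArray<PColumnNode>;      // roots of the column trees
                      const ColumnIndexes: array of Integer; // data column + totals columns
                      RowKey: Integer; FactValue: Double);
var
  I: Integer;
  Cell: PColumnNode;
begin
  for I := 0 to High(ColumnIndexes) do
  begin
    Cell := FindOrAdd(Columns[ColumnIndexes[I]], RowKey);
    Cell^.Value := Cell^.Value + FactValue; // incremental aggregation (a sum here)
  end;
end;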

Now a few words about filtering during slice construction. It is easiest to perform at the stage of building the matrix, since at that stage we have access to all the required fields and, in addition, aggregation of values takes place. When a record is retrieved from the cache, it is checked against the filtering conditions, and if they are not met, the record is discarded.

Since the structure described above completely describes the pivot table, the task of visualizing it will be trivial. In this case, you can use standard table components that are available in almost all programming tools for Windows.

The first product to perform OLAP queries was Express (IRI). However, the term OLAP itself was coined by Edgar Codd, "the father of relational databases". Codd's work was funded by Arbor, a company that a year earlier had released its own OLAP product, Essbase (later acquired by Hyperion, which in turn was acquired by Oracle in 2007). Other well-known OLAP products include Microsoft Analysis Services (formerly OLAP Services, part of SQL Server), Oracle OLAP Option, IBM's DB2 OLAP Server (essentially Essbase with additions from IBM), SAP BW, and products from Brio, BusinessObjects, Cognos, MicroStrategy and other vendors.

From a technical point of view, the products on the market divide into "physical OLAP" and "virtual OLAP". In the first case there is a program that pre-computes the aggregates, which are then stored in a special multidimensional database providing fast retrieval; examples are Microsoft Analysis Services, Oracle OLAP Option, Oracle/Hyperion Essbase and Cognos PowerPlay. In the second case the data is stored in relational DBMSs, and the aggregates may not exist at all or may be created on first request in the DBMS or in the analytical software's cache; examples are SAP BW, BusinessObjects and MicroStrategy. Systems based on physical OLAP provide consistently better query response times than virtual OLAP systems; virtual OLAP vendors, for their part, claim greater scalability of their products for supporting very large volumes of data.

In this work, I would like to take a closer look at the BaseGroup Labs product - Deductor.

Deductor is an analytical platform, i.e. a basis for creating complete application solutions. The technologies implemented in Deductor make it possible to pass through all the stages of building an analytical system on the basis of a single architecture: from creating a data warehouse to automatically selecting models and visualizing the results obtained.

System composition:

Deductor Studio is the analytical core of the Deductor platform. Deductor Studio includes a full set of mechanisms that allows you to obtain information from an arbitrary data source, carry out the entire processing cycle (cleaning, transforming data, building models), display the results in the most convenient way (OLAP, tables, charts, decision trees...) and export results.

Deductor Viewer is the end-user workstation. The program minimizes the requirements on personnel, because all the required operations are performed automatically by previously prepared processing scripts; the user does not need to think about how the data is obtained or processed. A Deductor Viewer user only needs to select the report of interest.

Deductor Warehouse is a multidimensional cross-platform data warehouse that accumulates all the information necessary for analyzing the subject area. The use of a single repository allows for convenient access, high processing speed, consistency of information, centralized storage and automatic support for the entire data analysis process.

4. Client-Server

Deductor Server is designed for remote analytical processing. It provides the ability to both automatically “run” data through existing scripts on the server and retrain existing models. Using Deductor Server allows you to implement a full-fledged three-tier architecture in which it serves as an application server. Access to the server is provided using Deductor Client.

Work principles:

1. Import data

Analysis of any information in Deductor begins with importing data. As a result of import, the data is brought into a form suitable for subsequent analysis by all the mechanisms available in the program. The nature of the data, the format, the DBMS and so on do not matter, because the mechanisms for working with all of them are unified.

2. Export data

The presence of export mechanisms allows you to send the results to third party applications, for example, transfer a sales forecast to the system to generate a purchase order or post a prepared report on a corporate website.

3. Data processing

In Deductor, processing means any action associated with transforming data, for example filtering, model building, or cleaning. It is in this block that the actions most important for analysis are performed. The most significant feature of the processing mechanisms implemented in Deductor is that data obtained as a result of processing can be processed again by any of the methods available to the system. Thus, arbitrarily complex processing scenarios can be built.

4. Visualization

Data can be visualized in Deductor Studio (Viewer) at any stage of processing. The system itself determines the possible ways of doing so; for example, if a neural network has been trained, then in addition to tables and diagrams the neural network graph can be viewed. The user only needs to select the desired option from the list and configure a few parameters.

5. Integration mechanisms

Deductor does not provide data entry tools - the platform is focused solely on analytical processing. To use information stored in heterogeneous systems, flexible import-export mechanisms are provided. Interaction can be organized using batch execution, working in OLE server mode and accessing the Deductor Server.

6. Replication of knowledge

Deductor makes it possible to implement one of the most important functions of any analytical system: support for the knowledge replication process, i.e. giving employees who do not understand the analysis methods or the ways a particular result is obtained the opportunity to get an answer based on models prepared by an expert.

Conclusion

This paper has examined data analysis systems, an area of modern information technology. OLAP technology, the main tool for analytical information processing, has been analyzed: the essence of the OLAP concept and the importance of OLAP systems in the modern business process are set out in detail, and the structure and operation of a ROLAP system are described. The Deductor analytical platform is given as an example of the implementation of OLAP technologies. The documentation presented has been developed in accordance with the requirements.

OLAP technologies are a powerful tool for real-time data processing. An OLAP server allows you to organize and present data across various analytical areas and turns data into valuable information that helps companies make more informed decisions.

The use of OLAP systems provides consistently high levels of performance and scalability, supporting multi-gigabyte data volumes accessed by thousands of users. With OLAP technologies, information is accessed in real time: query processing no longer slows down analysis, ensuring its speed and efficiency. Visual administration tools make it possible to develop and implement even the most complex analytical applications simply and quickly.


The purpose of this course work is to study OLAP technology, the concept of its implementation, and its structure.

In the modern world, computer networks and computing systems make it possible to analyze and process large volumes of data.

A large volume of information greatly complicates the search for a solution, but it also makes much more accurate calculation and analysis possible. To handle this problem there is a whole class of information systems that perform analysis. They are called decision support systems (DSS, Decision Support System).

To perform analysis, a DSS must accumulate information, having means for its input and storage. In total, three main tasks solved in a DSS can be distinguished:

· data input;

· data storage;

· data analysis.

Data is entered into the DSS either automatically, from sensors characterizing the state of the environment or of a process, or by a human operator.

In the first case, the data is accumulated on a readiness signal arising when information appears, or by cyclic polling. If input is performed by a person, the system must provide users with convenient means of entering data, of checking it for input correctness, and of performing the necessary calculations.

When several operators enter data simultaneously, the problems of modification and of parallel access to the same data must be solved.

A DSS provides the analyst with data in the form of reports, tables and graphs for study and analysis; that is why such systems are said to provide decision support.

Data entry subsystems, called OLTP (On-line Transaction Processing), implement operational data processing. Conventional database management systems (DBMS) are used to implement them.

The analysis subsystem can be built on the basis of:

· information retrieval analysis subsystems based on relational DBMSs and static queries in the SQL language;

· operational analysis subsystems; to implement them, the technology of operational analytical data processing, OLAP, based on the concept of multidimensional data representation, is used;

· intelligent analysis subsystems, implementing Data Mining methods and algorithms.

From the user's point of view, OLAP systems provide tools for flexible viewing of information in various slices, automatic production of aggregated data, and the analytical operations of roll-up, drill-down, and comparison over time. Thanks to all this, OLAP systems are a solution with great advantages for preparing data for all kinds of business reporting that involves presenting data in various slices and at different levels of hierarchy: sales reports, various budget forms, and so on. OLAP systems also have great advantages in other forms of data analysis, including forecasting.

1.2 Definition of OLAP systems

The technology for complex multidimensional data analysis is called OLAP. OLAP is a key component of a data warehouse organization.

OLAP functionality can be implemented in various ways, both simple ones, such as data analysis in office applications, and more complex ones - distributed analytical systems based on server products.

OLAP (On-Line Analytical Processing) is a technology of operational analytical data processing that uses tools and methods for collecting, storing and analyzing multidimensional data in order to support decision-making processes.

The main purpose of OLAP systems is to support analytical activity and the arbitrary requests of user-analysts. The purpose of OLAP analysis is to test emerging hypotheses.

In the series of articles "Introduction to Databases" published recently (see ComputerPress No. 3'2000 - 3'2001), we discussed various technologies and software tools used in creating information systems: desktop and server DBMSs, data design tools, application development tools, and Business Intelligence - enterprise-scale data analysis and processing tools that are currently becoming increasingly popular throughout the world, including in our country. We note, however, that the use of Business Intelligence tools and the technologies for building applications of this class are still insufficiently covered in the domestic literature. In this new series of articles we will try to fill this gap and describe the technologies underlying such applications. As implementation examples we will mainly use Microsoft OLAP technologies (chiefly Analysis Services in Microsoft SQL Server 2000), but we hope that the bulk of the material will be useful to users of other tools as well.

The first article in this series is devoted to the basics of OLAP (On-Line Analytical Processing) - a technology for multidimensional data analysis. In it, we will look at the concepts of data warehousing and OLAP, the requirements for data warehousing and OLAP tools, the logical organization of OLAP data, and the basic terms and concepts used when discussing multidimensional analysis.

What is a data warehouse

Enterprise-scale information systems, as a rule, contain applications designed for complex multidimensional analysis of data, its dynamics, trends, etc. Such analysis is ultimately intended to support decision making. These systems are often called decision support systems.

It is impossible to make any management decision without the information needed for it, usually quantitative information. This requires the creation of data warehouses, that is, the process of collecting, sifting and pre-processing data in order to provide the resulting information to users for statistical analysis (and often for creating analytical reports).

Ralph Kimball, one of the originators of the data warehouse concept, described a data warehouse as "a place where people can access their data" (see, for example, Ralph Kimball, "The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses", John Wiley & Sons, 1996, and "The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse", John Wiley & Sons, 2000). He also formulated the basic requirements for data warehouses:

  • support for high speed data retrieval from storage;
  • maintaining internal data consistency;
  • the ability to obtain and compare so-called data slices (slice and dice);
  • availability of convenient utilities for viewing data in the storage;
  • completeness and reliability of stored data;
  • support for a high-quality data replenishment process.

It is often impossible to satisfy all of the above requirements within a single product, so several products are usually used to implement data warehouses: some constitute the actual data storage, others the tools for retrieving and viewing the data, still others the tools for replenishing it, and so on.

A typical data warehouse is typically different from a typical relational database. First, regular databases are designed to help users perform day-to-day work, while data warehouses are designed for decision making. For example, the sale of goods and the issuance of invoices are carried out using a database designed for transaction processing, and the analysis of sales dynamics over several years, which allows planning work with suppliers, is carried out using a data warehouse.

Second, while traditional databases change constantly as users work, a data warehouse is relatively stable: the data in it is usually updated on a schedule (for example, weekly, daily or hourly, depending on needs). Ideally, replenishment simply adds new data for a period of time without changing the information already in the warehouse.

And thirdly, regular databases are most often the source of data that ends up in the warehouse. In addition, the repository can be replenished from external sources, such as statistical reports.

What is OLAP

Decision support systems usually have the means to provide the user with aggregate data for various samples from the original set in a form convenient for perception and analysis. Typically, such aggregate functions form a multidimensional (and therefore non-relational) data set (often called a hypercube or metacube), whose axes contain parameters, and whose cells contain aggregate data that depends on them. Along each axis, data can be organized into a hierarchy, representing different levels of detail. Thanks to this data model, users can formulate complex queries, generate reports, and obtain subsets of data.

The technology for complex multidimensional data analysis is called OLAP (On-Line Analytical Processing). OLAP is a key component of data warehousing. The concept of OLAP was described in 1993 by Edgar Codd, a renowned database researcher and author of the relational data model (see E.F. Codd, S.B. Codd, and C.T. Salley, Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. Technical report, 1993). In 1995, based on the requirements set out by Codd, the so-called FASMI test (Fast Analysis of Shared Multidimensional Information) was formulated, including the following requirements for applications for multidimensional analysis:

  • providing the user with analysis results in an acceptable time (usually no more than 5 s), even at the cost of a less detailed analysis;
  • the ability to perform any logical and statistical analysis specific to a given application and save it in a form accessible to the end user;
  • multi-user access to data with support for appropriate locking mechanisms and authorized access means;
  • multidimensional conceptual representation of data, including full support for hierarchies and multiple hierarchies (this is a key requirement of OLAP);
  • the ability to access any necessary information regardless of its volume and storage location.

It should be noted that OLAP functionality can be implemented in various ways, from the simplest data analysis tools in office applications to distributed analytical systems based on server products. But before we talk about the different implementations of this functionality, let's look at what OLAP cubes are from a logical point of view.

Multidimensional cubes

In this section we will take a closer look at the concept of OLAP and at multidimensional cubes. As an example of a relational database for illustrating OLAP principles, we will use the Northwind database included with Microsoft SQL Server and Microsoft Access - a typical database storing the trade information of a wholesale food supply company. The data includes information about suppliers, customers, shipping companies, the list of goods supplied and their categories, orders and ordered goods, and a list of company employees. A detailed description of the Northwind database can be found in the Microsoft SQL Server or Microsoft Access documentation; we do not reproduce it here for reasons of space.

To explore the concept of OLAP, we'll use the Invoices view and the Products and Categories tables from the Northwind database to create a query that will result in detailed information about all the goods ordered and invoices issued:

SELECT dbo.Invoices.Country, dbo.Invoices.City, dbo.Invoices.CustomerName,
       dbo.Invoices.Salesperson, dbo.Invoices.OrderDate, dbo.Categories.CategoryName,
       dbo.Invoices.ProductName, dbo.Invoices.ShipperName, dbo.Invoices.ExtendedPrice
FROM dbo.Products
  INNER JOIN dbo.Categories ON dbo.Products.CategoryID = dbo.Categories.CategoryID
  INNER JOIN dbo.Invoices ON dbo.Products.ProductID = dbo.Invoices.ProductID

In Access 2000, a similar query looks like this:

SELECT Invoices.Country, Invoices.City, Invoices.Customers.CompanyName AS CustomerName, Invoices.Salesperson, Invoices.OrderDate, Categories.CategoryName, Invoices.ProductName, Invoices.Shippers.CompanyName AS ShipperName, Invoices.ExtendedPrice FROM Categories INNER JOIN (Invoices INNER JOIN Products ON Invoices.ProductID = Products.ProductID) ON Categories.CategoryID = Products.CategoryID;

This query accesses the Invoices view, which contains information about all invoices issued, as well as the Categories and Products tables, which contain information about the categories of products that were ordered and the products themselves, respectively. The result of this request is a set of order data that includes the category and name of the item ordered, the date the order was placed, the name of the invoicing person, the city, country and company name of the ordering company, as well as the name of the shipping company.

For convenience, let us save this query as a view named Invoices1. The result of accessing this view is shown in Fig. 1.

What aggregate data can we get from this view? Typically these are answers to questions like:

  • What is the total value of orders placed by customers from France?
  • What is the total value of orders placed by customers in France and delivered by Speedy Express?
  • What is the total value of orders placed by customers in France in 1997 and delivered by Speedy Express?

Let's translate these questions into queries in SQL (Table 1).

The result of any of the above queries is a number. If in the first query you replace the 'France' parameter with 'Austria' or the name of another country, you can run this query again and get a different number. By performing this procedure with all countries, we get the following data set (a fragment is shown below):

Country      SUM(ExtendedPrice)
Argentina      7327.3
Austria      110788.4
Belgium       28491.65
Brazil        97407.74
Canada        46190.1
Denmark       28392.32
Finland       15296.35
France        69185.48
Germany      209373.6

The resulting set of aggregate values ​​(in this case, sums) can be interpreted as a one-dimensional data set. The same data set can also be obtained as a result of a query with a GROUP BY clause of the following form:

SELECT Country, SUM(ExtendedPrice) FROM Invoices1 GROUP BY Country

Now let us look at the second of the queries above, which contains two conditions in the WHERE clause. If we run this query, substituting all possible values of the Country and ShipperName parameters into it, we obtain a two-dimensional data set of the following form (a fragment is shown below):

                  ShipperName
Country           Federal Shipping   Speedy Express   United Package
Argentina          1 210.30           1 816.20         5 092.60
Austria           40 870.77          41 004.13        46 128.93
Belgium           11 393.30           4 717.56        17 713.99
Brazil            16 514.56          35 398.14        55 013.08
Canada            19 598.78           5 440.42        25 157.08
Denmark           18 295.30           6 573.97         7 791.74
Finland            4 889.84           5 966.21         7 954.00
France            28 737.23          21 140.18        31 480.90
Germany           53 474.88          94 847.12        81 962.58

Such a data set is called a pivot table or cross table. Many spreadsheets and desktop DBMSs allow you to create such tables - from Paradox for DOS to Microsoft Excel 2000. For example, this is what a similar query looks like in Microsoft Access 2000:

TRANSFORM Sum(Invoices1.ExtendedPrice) AS SumOfExtendedPrice SELECT Invoices1.Country FROM Invoices1 GROUP BY Invoices1.Country PIVOT Invoices1.ShipperName;

Aggregate data for such a pivot table can also be obtained using a regular GROUP BY query:

SELECT Country, ShipperName, SUM(ExtendedPrice) FROM Invoices1 GROUP BY Country, ShipperName

Note, however, that the result of this query will not be the pivot table itself, but only the set of aggregate data for building it (a fragment is shown below):

Country     ShipperName        SUM(ExtendedPrice)
Argentina   Federal Shipping       845.5
Austria     Federal Shipping     35696.78
Belgium     Federal Shipping      8747.3
Brazil      Federal Shipping     13998.26

The third of the queries discussed above already has three parameters in the WHERE condition. By varying them, we obtain a three-dimensional data set (Fig. 2).

The cells of the cube shown in Fig. 2 contain aggregate data corresponding to the values of the query parameters from the WHERE clause plotted on the cube's axes.

You can obtain a set of two-dimensional tables by cutting a cube with planes parallel to its faces (the terms cross-sections and slices are used to denote them).

Obviously, the data contained in the cube cells can also be obtained using an appropriate query with a GROUP BY clause. In addition, some spreadsheets (particularly Microsoft Excel 2000) also allow you to plot a three-dimensional data set and view different cross-sections of the cube parallel to its face as shown on the workbook sheet.

If the WHERE clause contains four or more parameters, the resulting set of values (also called an OLAP cube) can be 4-dimensional, 5-dimensional, and so on.

Having looked at what multidimensional OLAP cubes are, let's move on to some key terms and concepts used in multidimensional data analysis.

Some terms and concepts

Along with sums, the cells of an OLAP cube may contain the results of other SQL aggregate functions, such as MIN, MAX, AVG, COUNT, and in some cases others (variance, standard deviation, etc.). The term summary is used for the data values in the cells (in the general case there may be several of them in one cube); the term measure denotes the source data on the basis of which they are calculated; and the term dimension denotes the query parameters plotted on the cube's axes. The values plotted along the axes are called dimension members.

When talking about dimensions, it is worth mentioning that the values plotted on the axes can have different levels of detail. For example, we may be interested in the total value of orders made by customers in different countries, in different cities, or even by individual customers. Naturally, the resulting set of aggregate data in the second and third cases will be more detailed than in the first. Note that the ability to obtain aggregate data at varying degrees of detail meets one of the requirements for data warehouses - the availability of various data slices for comparison and analysis.

Since in the example considered each country can, in general, have several cities, and a city several clients, we can speak of hierarchies of values in dimensions. In this case, countries are at the first level of the hierarchy, cities at the second, and clients at the third (Fig. 3).

Note that hierarchies can be balanced, such as the hierarchy shown in Fig. 3 or hierarchies based on date-time data, and unbalanced. A typical example of an unbalanced hierarchy is a "superior-subordinate" hierarchy (it can be built, for example, from the values of the Salesperson field of the original data set from the example discussed above), shown in Fig. 4.

Sometimes the term Parent-child hierarchy is used for such hierarchies.

There are also hierarchies occupying an intermediate position between balanced and unbalanced (they are designated by the term ragged). They typically contain members whose logical "parents" are not at the immediately higher level; for example, a geographic hierarchy has the levels Country, State and City, but the data set contains countries that have no states or regions between the Country and City levels (Fig. 5).

Note that unbalanced and ragged hierarchies are not supported by all OLAP tools. For example, Microsoft Analysis Services 2000 supports both types, while Microsoft OLAP Services 7.0 supports only balanced ones. The number of hierarchy levels, the maximum allowed number of members at one level, and the maximum possible number of dimensions can differ between OLAP tools.

Conclusion

In this article we learned the basics of OLAP. We learned the following:

  • The purpose of data warehouses is to provide users with information for statistical analysis and management decision-making.
  • Data warehouses must ensure high speed of data retrieval, the ability to obtain and compare so-called data slices, as well as consistency, completeness and reliability of data.
  • OLAP (On-Line Analytical Processing) is a key component of building and using data warehouses. This technology is based on the construction of multidimensional data sets - OLAP cubes, the axes of which contain parameters, and the cells contain aggregate data that depends on them.
  • Applications with OLAP functionality must provide the user with analysis results in an acceptable time, perform logical and statistical analysis, support multi-user access to data, provide a multi-dimensional conceptual representation of data, and be able to access any necessary information.

In addition, we reviewed the basic principles of the logical organization of OLAP cubes, and also learned the basic terms and concepts used in multidimensional analysis. Finally, we learned what the different types of hierarchies are in OLAP cube dimensions.

In the next article in this series, we will look at the typical structure of data warehouses, talk about what client and server OLAP is, and also focus on some technical aspects of multidimensional data storage.

ComputerPress 4'2001
