Basic types of OLAP analysis. Introduction to OLAP

Data warehouses are formed from snapshots of an organization's operational databases, recorded in the information system over a long period, and possibly from various external sources. Data warehouses rely on database technologies, OLAP, deep data analysis, and data visualization.

Main characteristics of data warehouses:

  • contains historical data;
  • stores detailed information, as well as partially and completely summarized data;
  • the data is mostly static;
  • an ad hoc, unstructured and heuristic way of processing data;
  • medium and low transaction processing intensity;
  • unpredictable way of using data;
  • intended for analysis;
  • focused on subject areas;
  • support for strategic decision making;
  • serves a relatively small number of management employees.

The term OLAP (On-Line Analytical Processing) is used to describe the model for presenting data and, accordingly, the technology for processing it in data warehouses. OLAP uses a multidimensional view of aggregated data to provide quick access to strategically important information for the purpose of in-depth analysis. OLAP applications must have the following basic properties:

  • multidimensional data presentation;
  • support for complex calculations;
  • correct consideration of the time factor.

Advantages of OLAP:

  • increased productivity of production staff and application developers, and timely access to strategic information;
  • providing sufficient opportunity for users to make their own changes to the schema.
  • OLAP applications rely on data warehouses and OLTP systems, receiving current data from them, which preserves integrity control over corporate data;
  • reducing the load on OLTP systems and data warehouses.

OLAP and OLTP. Characteristics and main differences

Data sources. OLAP: the data warehouse should include both internal corporate data and external data. OLTP: the main source of information entering the operational database is the corporation's own activities, while data analysis requires the involvement of external sources of information (for example, statistical reports).
Data volume. OLAP: the volume of analytical databases is at least an order of magnitude larger than that of operational ones, since reliable analysis and forecasting in a data warehouse require information about the corporation's activities and market conditions over several years. OLTP: prompt processing requires only the data for the last few months.
Data consistency. OLAP: the warehouse must contain uniformly presented and consistent information that is as close as possible to the content of the operational databases, so a component is needed to extract and "clean" information from the different sources. OLTP: in many large corporations, several operational information systems with their own databases coexist (for historical reasons), and they may contain semantically equivalent information presented in different formats, with different indications of the time of its receipt, sometimes even contradictory.
Query workload. OLAP: the set of queries to an analytical database cannot be predicted - data warehouses exist to respond to ad hoc requests from analysts; one can only count on queries arriving relatively rarely but involving large amounts of information, and the size of the analytical database encourages queries with aggregates (sum, minimum, maximum, average value, etc.). OLTP: data processing systems are created to solve specific problems; information is selected from the database frequently and in small portions, and the set of queries to an operational database is usually known already at design time.
Variability. OLAP: since analytical databases change rarely (only when data is loaded), ordering of arrays, faster indexing methods for mass sampling, and storage of pre-aggregated data are all reasonable. OLTP: data processing systems are by nature highly variable, which is taken into account in the DBMSs used (normalized database structure, rows stored out of order, B-tree indexes, transaction support).
Protection. OLAP: analytical information is so critical for a corporation that finer-grained protection is required (individual access rights to certain rows and/or columns of a table). OLTP: for data processing systems, protection at the table level is usually sufficient.

Codd's rules for OLAP systems

In 1993, Codd published "Providing OLAP to User-Analysts: An Essential Mandate." In it, he outlined the basic concepts of online analytics and defined 12 rules that must be met by products that provide online analytics capabilities.

  1. Conceptual multidimensional representation. An OLAP model must be multidimensional at its core. A multidimensional conceptual schema or user view facilitates modeling, analysis, and computation.
  2. Transparency. The user is able to obtain all the necessary data from the OLAP engine without even knowing where it comes from. Whether or not the OLAP product is part of the user's tools, this fact should be invisible to the user. If OLAP is provided by client-server computing, this too should, if possible, be invisible to the user. OLAP must be provided in the context of a truly open architecture, allowing the user, wherever he is, to communicate with the server through an analytical tool. Transparency should also be maintained when the analytical tool interacts with homogeneous and heterogeneous database environments.
  3. Accessibility. OLAP must provide its own logical schema for access to heterogeneous database environments and perform the appropriate transformations to deliver data to the user. Moreover, it is necessary to decide in advance where and how, and in what types of physical organization, the data will actually be stored. An OLAP system should access only the data that is actually required, rather than applying the general "kitchen funnel" principle, which entails unnecessary input.
  4. Consistent reporting performance. Report generation performance should not degrade significantly as the number of dimensions and the size of the database grow.
  5. Client-server architecture. The product must not only be client-server, but the server component must be intelligent enough to allow different clients to connect with a minimum of effort and programming.
  6. Generic dimensionality. All dimensions must be equal: each must be equivalent in both structure and operational capabilities. Additional operational capabilities may be granted to individual dimensions (presumably time is implied), but such additional functionality must be available to any dimension. Basic data structures and computational or reporting formats should not be specific to any one dimension.
  7. Dynamic handling of sparse matrices. OLAP systems must automatically adjust their physical schema depending on the model type, data volumes, and database sparsity.
  8. Multi-user support. An OLAP tool must provide shared access (query and update), integrity, and security.
  9. Unrestricted cross-dimensional operations. All types of operations must be allowed for any dimensions.
  10. Intuitive data manipulation. Data manipulation should be carried out through direct actions on cells in viewing mode, without menus or multiple operations.
  11. Flexible reporting options. Dimensions should be placed in the report exactly the way the user needs them.
  12. Unlimited dimensions and aggregation levels. The number of supported dimensions and aggregation levels should not be artificially limited.

Online Analytical Processing, or OLAP, is an efficient data processing technology that turns huge arrays of heterogeneous data into final, usable information. It is a powerful product that helps you access, retrieve, and view information on your PC, analyzing it from different perspectives.

OLAP is a tool that provides a strategic position for long-term planning and looks at the underlying operational data over a horizon of 5, 10, or more years. Data is stored in the database together with its dimensions, which act as its attributes. Users can view the same data set with different attributes, depending on the purpose of the analysis.

History of OLAP

OLAP is not a new concept and has been used for decades. In fact, the origins of the technology can be traced back to 1962. But the term was only coined in 1993 by database researcher Ted Codd, who also established the 12 rules for such products. As with many other applications, the concept has undergone several stages of evolution.

The history of OLAP technology itself dates back to 1970, when Express, from Information Resources, was released as the first OLAP server. It was acquired by Oracle in 1995 and subsequently became the basis of the multidimensional computing engine that the famous computer brand provided in its database. In 1992, another well-known online analytical processing product, Essbase, was released by Arbor Software (acquired by Oracle in 2007).

In 1998, Microsoft released its online analytical processing server, MS Analysis Services. This contributed to the popularity of the technology and prompted the development of other products. Today there are several world-renowned vendors offering OLAP applications, including IBM, SAS, SAP, Essbase, Microsoft, Oracle, and icCube.

Online analytical processing

OLAP is a tool that supports decisions about planned events. An atypical OLAP calculation can be more complex than simply aggregating data. Analytical queries per minute (AQM) is used as a standard benchmark metric for comparing the performance of different tools. These systems should shield users from complex query syntax as much as possible and provide consistent response times for all queries, however complex.

There are the following main characteristics of OLAP:

  1. Multidimensional data representation.
  2. Support for complex calculations.
  3. Time intelligence.

A multidimensional view provides the basis for analytical processing through flexible access to enterprise data. It allows users to analyze data in any dimension and at any level of aggregation.

Support for complex calculations is the core of OLAP software.

Time intelligence is used to evaluate the performance of an analytics application over time: for example, this month compared to last month, or this month compared to the same month last year.
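
As an illustration, here is a minimal sketch of such period-over-period comparisons in Free Pascal (Delphi-style, the dialect used in the implementation part of this text); the table, names, and figures are invented.

program TimeIntelligence;
{$mode delphi}

type
  TMonthlySale = record
    Year, Month: Integer;
    Amount: Double;
  end;

const
  Sales: array[0..3] of TMonthlySale = (
    (Year: 2022; Month: 5; Amount: 100.0),
    (Year: 2022; Month: 6; Amount: 120.0),
    (Year: 2023; Month: 5; Amount: 140.0),
    (Year: 2023; Month: 6; Amount: 180.0));

{ total for one month; a real engine would read a pre-aggregated cell }
function AmountFor(AYear, AMonth: Integer): Double;
var
  i: Integer;
begin
  Result := 0.0;
  for i := Low(Sales) to High(Sales) do
    if (Sales[i].Year = AYear) and (Sales[i].Month = AMonth) then
      Result := Result + Sales[i].Amount;
end;

begin
  { this month compared to last month }
  WriteLn('2023-06 vs 2023-05: ', AmountFor(2023, 6) - AmountFor(2023, 5):0:1);
  { this month compared to the same month last year }
  WriteLn('2023-06 vs 2022-06: ', AmountFor(2023, 6) - AmountFor(2022, 6):0:1);
end.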

Multidimensional data structure

One of the main characteristics of online analytical processing is the multidimensional structure of the data: a cube can have several dimensions. Thanks to this model, the entire OLAP analysis process is straightforward for managers and executives, because the objects represented in the cells are real-world business objects. In addition, this data model lets users process not only structured arrays but also unstructured and semi-structured ones. All this makes OLAP cubes especially popular for data analysis and BI applications.

Main characteristics of OLAP systems:

  1. Use multidimensional data analysis methods.
  2. Provide advanced database support.
  3. Create easy-to-use end-user interfaces.
  4. Support client/server architecture.

One of the main components of the OLAP concept is the server side. In addition to aggregating and pre-processing data from a relational database, it provides advanced calculation and write-back options, additional functions, and advanced query capabilities.

Depending on the application, various data models and tools are available, including real-time alerting, what-if scenarios, optimization, and sophisticated OLAP reporting.

Cube structure

The concept is based on the cube. The layout of the data shows how OLAP adheres to the principle of multidimensional analysis, resulting in a data structure designed for fast and efficient analysis.

An OLAP cube is also called a "hypercube". It is described as consisting of numerical facts (measures) classified into facets (dimensions). Dimensions refer to the attributes that define a business problem. Simply put, a dimension is a label that describes a measure. For example, in sales reports, the measure would be sales volume, and the dimensions would include sales period, salespeople, product or service, and sales region. In manufacturing operations reporting, the measure may be total manufacturing costs and units of production. The dimensions will be the date or time of production, the production stage or phase, even the workers involved in the production process.

The OLAP data cube is the cornerstone of the system. The data in a cube is organized using either a star or snowflake schema. In the center there is a fact table containing aggregates (measures). It is associated with a series of dimension tables containing information about measures. The dimensions describe how these measures can be analyzed. If a cube contains more than three dimensions, it is often called a hypercube.
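
A sketch of the star schema just described (Free Pascal, Delphi-style; all names and numbers invented): dimension tables arranged around a central fact table whose rows reference them by key and carry the measure.

program StarSchema;
{$mode delphi}

type
  TFactRow = record
    ProductId, RegionId, MonthId: Integer; { keys into the dimension tables }
    Amount: Double;                        { the measure }
  end;

const
  { dimension tables }
  Products: array[0..1] of string = ('Laptop', 'Phone');
  Regions:  array[0..1] of string = ('North', 'South');
  Months:   array[0..1] of string = ('2023-05', '2023-06');
  { the fact table in the center of the star }
  Facts: array[0..2] of TFactRow = (
    (ProductId: 0; RegionId: 0; MonthId: 0; Amount: 10.0),
    (ProductId: 0; RegionId: 1; MonthId: 0; Amount: 7.5),
    (ProductId: 1; RegionId: 0; MonthId: 1; Amount: 12.0));

var
  i: Integer;
begin
  { every measure is addressed by its dimension coordinates }
  for i := Low(Facts) to High(Facts) do
    WriteLn(Products[Facts[i].ProductId], ' / ',
            Regions[Facts[i].RegionId], ' / ',
            Months[Facts[i].MonthId], ': ', Facts[i].Amount:0:1);
end.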

One of the cube's main features is its static nature: once developed, the cube cannot be changed. Therefore, the process of assembling the cube and setting up the data model is a critical step toward proper data processing in the OLAP architecture.

Data aggregation

The use of aggregations is the main reason why queries are processed much faster in OLAP tools (compared to OLTP). Aggregations are summaries of data that have been pre-calculated during processing. All members stored in OLAP dimension tables determine the queries that the cube can receive.

In a cube, the accumulated information is stored in cells whose coordinates are given by specific dimensions. The number of aggregates a cube can contain depends on all possible combinations of dimension members; a typical cube in an application may therefore contain an extremely large number of aggregates. Pre-calculation is performed only for the key aggregates distributed throughout the cube, which significantly reduces the time needed to compute any remaining aggregations when a query runs against the data model.

There are also two aggregation-related options that can be used to improve the performance of the finished cube: caching aggregations and creating aggregations based on user queries.
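
A minimal sketch of pre-computed aggregation, assuming a single product dimension (names and figures invented): the totals are computed once, while the cube is processed, and a query is then answered by a lookup instead of a scan of the facts.

program PreAggregation;
{$mode delphi}

const
  ProductCount = 2;
  { primary facts: a product coordinate and an amount }
  FactProduct: array[0..5] of Integer = (0, 1, 0, 1, 0, 1);
  FactAmount:  array[0..5] of Double  = (10.0, 4.0, 6.0, 8.0, 2.0, 5.0);

var
  Totals: array[0..ProductCount - 1] of Double;
  i: Integer;
begin
  { pre-compute the aggregate once, during cube processing }
  for i := 0 to ProductCount - 1 do
    Totals[i] := 0.0;
  for i := Low(FactAmount) to High(FactAmount) do
    Totals[FactProduct[i]] := Totals[FactProduct[i]] + FactAmount[i];

  { at query time the answer is a single lookup, not a scan }
  WriteLn('Total for product 0: ', Totals[0]:0:1);
  WriteLn('Total for product 1: ', Totals[1]:0:1);
end.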

Principle of operation

Typically, the analysis of operational information obtained from transactions can be performed with a simple spreadsheet, where data values are presented in rows and columns. This works well given the two-dimensional nature of such data. OLAP differs in that it deals with multidimensional data arrays; moreover, because the data often comes from different sources, a spreadsheet cannot always process it efficiently.

The cube solves this problem and makes the OLAP data warehouse work in a logical and orderly manner. A business collects data from numerous sources, presented in different formats such as text files, multimedia files, Excel spreadsheets, Access databases, and even OLTP databases.

All data is collected in a warehouse filled directly from the sources. In it, the raw information obtained from OLTP and other sources will be cleared of any erroneous, incomplete and inconsistent transactions.

Once cleaned and transformed, the information will be stored in a relational database. It will then be uploaded to a multidimensional OLAP server (or Olap cube) for analysis. End users responsible for business applications, data mining and other business operations will have access to the information they need from the Olap cube.

Advantages of the Array Model

OLAP provides fast query performance, achieved through optimized storage, multidimensional indexing, and caching, which are among the system's significant advantages. Further advantages include:

  1. Smaller data size on disk.
  2. Automated calculation of higher level data aggregates.
  3. Array models provide natural indexing.
  4. Effective data extraction is achieved through preliminary structuring.
  5. Compactness for low-dimensional data sets.

The disadvantages of OLAP include the fact that some decisions (processing steps) can take quite a long time, especially with large volumes of information. This is usually corrected by performing only incremental processing (learning from the data that has changed).

Basic analytical operations

Roll-up (drill-up), also known as "consolidation". Rolling up involves taking all the available data and computing totals along one or more dimensions; most often this requires applying a mathematical formula. As an OLAP example, consider a retail chain with outlets in different cities: to identify patterns and anticipate future sales trends, sales data from all locations is "rolled up" to the company's main sales department for consolidation and calculation.

Drill-down. This is the opposite of roll-up: the process starts with a large data set and breaks it down into smaller parts, allowing users to view the details. In the retail chain example, the analyst would drill into the sales data to look at the individual brands or products considered best sellers at each outlet in each city.

Slice and dice. Here the analysis involves two actions: extracting a specific set of data from the OLAP cube (the "slicing" aspect) and viewing it from different points of view or angles (the "dicing"). This can happen once all the point-of-sale data has been received and entered into the hypercube: the analyst slices off the set of data related to sales and then views it while analyzing sales of individual units in each region, while other users can focus on assessing the cost-effectiveness of sales or on evaluating a marketing and advertising campaign.

Pivot. This rotates the data axes to provide an alternative presentation of the information.
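
To make these four operations concrete, here is a minimal sketch on a tiny two-dimensional cube in Free Pascal (Delphi-style); all names and figures are invented. Roll-up consolidates the city dimension, drill-down expands it back, a slice fixes one coordinate, and a pivot swaps the axes.

program BasicOperations;
{$mode delphi}

const
  Products: array[0..1] of string = ('Tea', 'Coffee');
  Cities:   array[0..1] of string = ('Moscow', 'Kazan');
  { Sales[p, c]: sales of product p in city c }
  Sales: array[0..1, 0..1] of Double = ((5.0, 3.0), (7.0, 2.0));

var
  p, c: Integer;
  Total: Double;
begin
  { roll-up: consolidate the city dimension into company-wide totals }
  for p := 0 to 1 do
  begin
    Total := 0.0;
    for c := 0 to 1 do
      Total := Total + Sales[p, c];
    WriteLn(Products[p], ', all cities: ', Total:0:1);
  end;

  { drill-down: expand one product back into its per-city details }
  for c := 0 to 1 do
    WriteLn(Products[0], ' / ', Cities[c], ': ', Sales[0, c]:0:1);

  { slice: fix one coordinate (city = Moscow) and view what remains }
  for p := 0 to 1 do
    WriteLn(Products[p], ' in ', Cities[0], ': ', Sales[p, 0]:0:1);

  { pivot: swap the axes, so cities become rows and products columns }
  for c := 0 to 1 do
  begin
    for p := 0 to 1 do
      Write(Sales[p, c]:8:1);
    WriteLn('   <- ', Cities[c]);
  end;
end.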

Types of databases

At their core, all of these implement analytical processing of multidimensional data through an OLAP cube (a data cube), so that the analytical process can add dimensions as needed. Any information loaded into a multidimensional database is stored or archived and can be recalled when required. The main types are the following.

Relational OLAP (ROLAP)

ROLAP works on top of an ordinary relational DBMS, mapping multidimensional data onto standard relational operations.

Multidimensional OLAP (MOLAP)

MOLAP stores and processes the data directly in multidimensional structures.

Hybrid Online Analytical Processing (HOLAP)

In the HOLAP approach, aggregated totals are stored in a multidimensional database, while detailed information is stored in a relational database. This provides both the storage efficiency of the ROLAP model and the performance of the MOLAP model.

Desktop OLAP (DOLAP)

In Desktop OLAP, the user downloads a portion of the data from a database to a local machine and analyzes it there. DOLAP is relatively cheap to deploy, as it offers very little functionality compared to other OLAP systems.

Web OLAP (WOLAP)

Web OLAP is an OLAP system accessible through a web browser. WOLAP is a three-tier architecture, consisting of three components: client, middleware, and database server.

Mobile OLAP

Mobile OLAP helps users access and analyze OLAP data using their mobile devices.

Spatial OLAP

SOLAP was created to facilitate the management of both spatial and non-spatial data in a geographic information system (GIS).

There are lesser-known OLAP systems or technologies, but these are the main ones currently used by large corporations, businesses, and even governments.

OLAP Tools

Online analytics tools are widely available on the Internet, in both paid and free versions.

The most popular of them:

  1. Dundas BI from Dundas Data Visualization is a browser-based platform for business intelligence and data visualization that includes integrated dashboards, OLAP reporting, and data analytics.
  2. Yellowfin is a business intelligence platform that provides a single integrated solution designed for companies of different industries and sizes. This system is customized for enterprises in the fields of accounting, advertising, and agriculture.
  3. ClicData is a business intelligence (BI) solution designed primarily for small and medium-sized businesses; the tool allows end users to create reports and dashboards.
  4. Board integrates business intelligence and corporate performance management in a full-featured system that serves mid-market and enterprise companies.
  5. Domo is a cloud-based business management suite that integrates with multiple data sources, including spreadsheets, databases, social media, and any existing cloud or on-premises software solution.
  6. InetSoft Style Intelligence is a software platform for business analytics that allows users to create dashboards, visual OLAP analyses, and reports using a mashup engine.
  7. Birst from Infor is a web-based business intelligence and analytics solution that connects the insights of diverse teams to help make informed decisions, and allows decentralized users to scale up the enterprise team model.
  8. Halo is a comprehensive supply chain management and business intelligence system that helps in business planning and inventory forecasting; it uses data from all sources: large, small, and in between.
  9. Chartio is a cloud solution for business analytics that provides founders, business teams, data analysts, and product teams with organizational tools for their daily work.
  10. Exago BI is a web-based solution designed for embedding in web applications; it enables companies of all sizes to provide their clients with tailored, timely, and interactive reporting.

Business Impact

The user will find OLAP in most business applications across industries. The analysis is used not only by business, but also by other interested parties.

Some of its most common applications include:

  1. Marketing data analysis.
  2. Financial reporting, which covers sales and expenses, budgeting and financial planning.
  3. Business process management.
  4. Sales analysis.
  5. Database marketing.

The industry continues to grow, which means users will soon see even more OLAP applications. Multidimensional adaptive processing provides more dynamic analysis; it is for this reason that such OLAP systems and technologies are used to evaluate what-if scenarios and alternative business scenarios.


Course work

discipline: Databases

Subject: OLAP Technology

Completed:

Chizhikov Alexander Alexandrovich

Introduction

1. Classification of OLAP products

2. OLAP client - OLAP server: pros and cons

3. Core OLAP system

3.1 Design principles

Conclusion

List of sources used

Appendices

Introduction

It is difficult to find a person in the computer world who, at least on an intuitive level, does not understand what databases are and why they are needed. Unlike traditional relational DBMSs, the concept of OLAP is not so widely known, although almost everyone has probably heard the mysterious term “OLAP cubes”. What is OnLine Analytical Processing?

OLAP is not a separate software product, not a programming language, or even a specific technology. If we try to cover OLAP in all its manifestations, then it is a set of concepts, principles and requirements that underlie software products that make it easier for analysts to access data. Although no one would disagree with such a definition, it is doubtful that it would bring non-specialists one iota closer to understanding the subject. Therefore, in your quest to understand OLAP, it is better to take a different path. First, we need to find out why analysts need to somehow specifically facilitate access to data.

The fact is that analysts are special consumers of corporate information. The analyst's task is to find patterns in large amounts of data. Therefore, the analyst will not pay attention to a single fact; he needs information about hundreds and thousands of events. By the way, one of the significant points that led to the emergence of OLAP is productivity and efficiency. Let's imagine what happens when an analyst needs to obtain information, but there are no OLAP tools in the enterprise. The analyst independently (which is unlikely) or with the help of a programmer makes the appropriate SQL query and receives the data of interest in the form of a report or exports it to a spreadsheet. A great many problems arise in this case. Firstly, the analyst is forced to do something other than his job (SQL programming) or wait for programmers to complete the task for him - all this negatively affects labor productivity, the rate of heart attack and stroke increases, and so on. Secondly, a single report or table, as a rule, does not save the giants of thought and the fathers of Russian analysis - and the whole procedure will have to be repeated again and again. Thirdly, as we have already found out, analysts do not ask about trifles - they need everything at once. This means (although technology is advancing by leaps and bounds) that the corporate relational DBMS server accessed by the analyst can think deeply and for a long time, blocking other transactions.

The concept of OLAP appeared precisely to solve such problems. OLAP cubes are essentially meta-reports. By cutting meta-reports (cubes, that is) along dimensions, the analyst actually receives the "ordinary" two-dimensional reports that interest him (these are not necessarily reports in the usual sense of the term - we are talking about data structures with the same functions). The advantages of cubes are obvious - data needs to be requested from a relational DBMS only once - when building a cube. Since analysts, as a rule, do not work with information that is supplemented and changed on the fly, the generated cube remains relevant for quite a long time. Thanks to this, not only are interruptions in the operation of the relational DBMS server eliminated (there are no queries with thousands and millions of response lines), but the speed of access to data for the analyst himself also increases sharply. In addition, as already noted, performance is also improved by calculating subtotals of hierarchies and other aggregated values at the time the cube is built.

Of course, you have to pay to increase productivity in this way. It is sometimes said that the data structure simply “explodes” - an OLAP cube can take up tens or even hundreds of times more space than the original data.

Now that we have a little understanding of how OLAP works and what it serves, it is still worth formalizing our knowledge somewhat and giving OLAP criteria without simultaneous translation into ordinary human language. These criteria (12 in total) were formulated in 1993 by E.F. Codd - the creator of the concept of relational DBMS and, concurrently, OLAP. We will not consider them directly, since they were later reworked into the so-called FASMI test, which determines the requirements for OLAP products. FASMI is an acronym for the name of each test item:

Fast. The system must respond to a user request in about five seconds on average, with most requests processed within one second and the most complex within twenty seconds. Recent studies have shown that users begin to doubt that a request has succeeded if it takes more than thirty seconds.

Analysis. The system must be able to handle any logical and statistical analysis typical of business applications and ensure that the results are stored in a form accessible to the end user. Analysis tools may include procedures for time-series analysis, cost allocation, currency conversion, modeling changes in organizational structures, and others.

Shared. The system should provide ample opportunities for restricting access to data and simultaneous operation of many users.

Multidimensional. The system must provide a conceptually multidimensional view of the data, including full support for multiple hierarchies.

Information. The power of various software products is characterized by the amount of input data processed. Different OLAP systems have different capacities: advanced OLAP solutions can handle at least a thousand times more data than the least powerful ones. When choosing an OLAP tool, there are a number of factors to consider, including data duplication, memory requirements, disk space usage, performance metrics, integration with information warehouses, and so on.

1. Classification of OLAP products

So, the essence of OLAP is that the initial information for analysis is presented in the form of a multidimensional cube, and it is possible to arbitrarily manipulate it and obtain the necessary information sections - reports. In this case, the end user sees the cube as a multidimensional dynamic table that automatically summarizes data (facts) in various sections (dimensions), and allows interactive management of calculations and report form. These operations are performed by the OLAP engine (or OLAP calculation engine).

Today, many products have been developed around the world that implement OLAP technologies. To make it easier to navigate among them, classifications of OLAP products are used: by the method of storing data for analysis and by the location of the OLAP machine. Let's take a closer look at each category of OLAP products.

I'll start with a classification based on the method of data storage. Let me remind you that multidimensional cubes are built on the basis of source and aggregate data. Both source and aggregate data for cubes can be stored in both relational and multidimensional databases. Therefore, three methods of data storage are currently used: MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP) and HOLAP (Hybrid OLAP). Accordingly, OLAP products are divided into three similar categories based on the method of data storage:

1. In the case of MOLAP, source and aggregate data are stored in a multidimensional database or in a multidimensional local cube.

2. In ROLAP products, source data is stored in relational databases or in flat local tables on a file server. Aggregate data can be placed in service tables in the same database. Conversion of data from a relational database into multidimensional cubes occurs at the request of an OLAP tool.

3. When using HOLAP architecture, the source data remains in the relational database, and the aggregates are placed in the multidimensional one. An OLAP cube is built at the request of an OLAP tool based on relational and multidimensional data.

The next classification is based on the location of the OLAP machine. Based on this feature, OLAP products are divided into OLAP servers and OLAP clients:

In server OLAP tools, calculations and storage of aggregate data are performed by a separate process - the server. The client application receives only the results of queries against multidimensional cubes that are stored on the server. Some OLAP servers support data storage only in relational databases, some only in multidimensional ones. Many modern OLAP servers support all three data storage methods: MOLAP, ROLAP and HOLAP.

The OLAP client is designed differently. The construction of a multidimensional cube and OLAP calculations are performed in the memory of the client computer. OLAP clients are also divided into ROLAP and MOLAP. And some may support both data access options.

Each of these approaches has its own pros and cons. Contrary to popular belief about the advantages of server tools over client tools, in a number of cases, using an OLAP client for users can be more effective and profitable than using an OLAP server.

2. OLAP client - OLAP server: pros and cons

When building an information system, OLAP functionality can be implemented using both server and client OLAP tools. In practice, the choice is a trade-off between performance and software cost.

The volume of data is determined by the combination of the following characteristics: number of records, number of dimensions, number of dimension elements, length of dimensions and number of facts. It is known that an OLAP server can process larger volumes of data than an OLAP client with equal computer power. This is because the OLAP server stores a multidimensional database containing precomputed cubes on hard drives.

When performing OLAP operations, client programs execute queries against it in an SQL-like language, receiving not the entire cube but only the fragments to be displayed. An OLAP client, by contrast, must hold the entire cube in RAM while it operates; in the case of a ROLAP architecture, it must first load into memory the entire data array used to calculate the cube. Moreover, as the number of dimensions, facts, or dimension members increases, the number of aggregates grows exponentially. Thus, the amount of data an OLAP client can process is directly dependent on the amount of RAM in the user's PC.

However, note that most OLAP clients provide distributed computing. Therefore, the number of processed records that limits a client OLAP tool is understood not as the volume of primary data in the corporate database but as the size of the aggregated sample drawn from it. The OLAP client generates a request to the DBMS describing the filtering conditions and the algorithm for the preliminary grouping of the primary data. The server finds and groups the records and returns a compact selection for further OLAP calculations. The size of this selection can be tens or hundreds of times smaller than the volume of primary, non-aggregated records; consequently, such an OLAP client's demands on PC resources are significantly reduced.
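
A minimal sketch of that preliminary grouping (Free Pascal, Delphi-style; the "server" side is simulated by a loop, and all names and figures are invented): repeating combinations of dimension values in the primary records collapse into one aggregated row each, so the client receives far fewer records than the source holds.

program PreGrouping;
{$mode delphi}

type
  TRow = record
    Product, Month: Integer;
    Amount: Double;
  end;

const
  { primary records, with repeating dimension combinations }
  Primary: array[0..5] of TRow = (
    (Product: 0; Month: 1; Amount: 2.0),
    (Product: 0; Month: 1; Amount: 3.0),
    (Product: 1; Month: 1; Amount: 4.0),
    (Product: 0; Month: 2; Amount: 1.0),
    (Product: 1; Month: 1; Amount: 6.0),
    (Product: 0; Month: 2; Amount: 5.0));

var
  Grouped: array[0..5] of TRow;
  n, i, j: Integer;
  Found: Boolean;
begin
  n := 0;
  for i := Low(Primary) to High(Primary) do
  begin
    Found := False;
    for j := 0 to n - 1 do
      if (Grouped[j].Product = Primary[i].Product) and
         (Grouped[j].Month = Primary[i].Month) then
      begin
        { same dimension values: merge into the existing aggregated row }
        Grouped[j].Amount := Grouped[j].Amount + Primary[i].Amount;
        Found := True;
        Break;
      end;
    if not Found then
    begin
      Grouped[n] := Primary[i];
      Inc(n);
    end;
  end;
  WriteLn(Length(Primary), ' primary records -> ', n, ' aggregated records');
end.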

In addition, the number of dimensions is subject to limitations in human perception. It is known that the average person can simultaneously operate with 3-4, maximum 8 dimensions. With a larger number of dimensions in a dynamic table, the perception of information becomes significantly more difficult. This factor should be taken into account when preliminary calculating the RAM that may be required by the OLAP client.

The length of the dimensions also affects the size of the OLAP engine's address space when computing the cube: the longer the dimensions, the more resources are required to pre-sort the multidimensional array, and vice versa. Short dimensions in the source data are another argument in favor of the OLAP client.

This characteristic is determined by the two factors discussed above: the volume of data processed and the power of computers. As the number, for example, of dimensions increases, the performance of all OLAP tools decreases due to a significant increase in the number of aggregates, but the rate of decrease is different. Let's demonstrate this dependence on a graph.

Scheme 1. Dependence of the performance of client and server OLAP tools on an increase in data volume

The speed characteristics of an OLAP server are less sensitive to data growth. This is explained by the different technologies with which the OLAP server and the OLAP client process user requests. For example, during a drill-down operation, the OLAP server accesses the stored data and "pulls" the data of that "branch", whereas the OLAP client calculates the entire set of aggregates at load time. Up to a certain amount of data, however, the performance of server and client tools is comparable. For OLAP clients that support distributed computing, the range of comparable performance can extend to data volumes that cover the OLAP analysis needs of a huge number of users. This is confirmed by the results of internal testing of MS OLAP Server and the OLAP client "Kontur Standard". The test was performed on an IBM PC Pentium Celeron 400 MHz with 256 MB of RAM on a sample of 1 million unique (i.e., aggregated) records with 7 dimensions containing from 10 to 70 members. The cube loading time in both cases did not exceed 1 second, and the various OLAP operations (drill up, drill down, move, filter, etc.) completed in hundredths of a second.

When the sample size exceeds the amount of RAM, swapping with the disk begins and the performance of the OLAP client drops sharply. Only from this moment can we talk about the advantage of the OLAP server.

It should be remembered that the “breaking point” determines the limit of a sharp increase in the cost of an OLAP solution. For the tasks of each specific user, this point is easily determined by performance tests of the OLAP client. Such tests can be obtained from the development company.

In addition, the cost of a server OLAP solution increases as the number of users grows. The fact is that the OLAP server performs calculations for all users on one computer: the greater the number of users, the more RAM and processing power are required. Thus, if the volumes of data being processed lie in the region where server and client performance is comparable, then, other things being equal, using an OLAP client will be more profitable.

Using an OLAP server in the “classical” ideology involves uploading relational DBMS data into a multidimensional database. The upload is performed over a certain period, so the OLAP server data does not reflect the current state. Only those OLAP servers that support the ROLAP mode of operation are free from this drawback.

Similarly, a number of OLAP clients allow you to implement ROLAP and Desktop architectures with direct access to the database. This ensures on-line analysis of source data.

The OLAP server places minimal requirements on the power of client terminals. Objectively, the requirements of an OLAP client are higher, because it performs calculations in the RAM of the user's PC. The state of a particular organization's hardware fleet is therefore the most important indicator to take into account when choosing an OLAP tool. But here, too, there are pros and cons. An OLAP server does not use the enormous computing power of modern personal computers: if an organization already has a fleet of modern PCs, it is ineffective to use them only as display terminals while incurring additional costs for a central server.

If the power of the users' computers "leaves much to be desired," the OLAP client will work slowly or not be able to work at all. Buying one powerful server may be cheaper than upgrading all your PCs.

Here it is useful to take into account trends in hardware development. Since the volume of data for analysis is practically constant, a steady increase in PC power will lead to an expansion of the capabilities of OLAP clients and their displacement of OLAP servers into the segment of very large databases.

When using an OLAP server over the network, only the data to be displayed is transferred to the client's PC, while the OLAP client receives the entire volume of primary data.

Therefore, where an OLAP client is used, network traffic will be higher.

But, when using an OLAP server, user operations, for example, detailing, generate new queries to the multidimensional database, and, therefore, new data transfer. The execution of OLAP operations by an OLAP client is performed in RAM and, accordingly, does not cause new data flows in the network.

It should also be noted that modern network hardware provides high levels of throughput.

Therefore, in the vast majority of cases, analyzing a “medium” sized database using an OLAP client will not slow down the user’s work.

The cost of an OLAP server is quite high. This should also include the cost of a dedicated computer and the ongoing costs of administering a multidimensional database. In addition, the implementation and maintenance of an OLAP server requires fairly highly qualified personnel.

The cost of an OLAP client is an order of magnitude lower than the cost of an OLAP server. No administration or additional technical equipment is required for the server. There are no high requirements for personnel qualifications when implementing an OLAP client. An OLAP client can be implemented much faster than an OLAP server.

Development of analytical applications using client OLAP tools is a fast process that requires no special training. A user who knows the physical implementation of the database can develop an analytical application independently, without involving an IT specialist. With an OLAP server, one has to learn two different systems, sometimes from different vendors: one to create cubes on the server and another to develop the client application. The OLAP client provides a single visual interface for describing cubes and setting up user interfaces for them.

Let's walk through the process of creating an OLAP application using the client tool.

Diagram 2. Creating an OLAP application using a ROLAP client tool

The operating principle of ROLAP clients is a preliminary description of the semantic layer behind which the physical structure of the source data is hidden. The data sources can be local tables or relational DBMSs; the list of supported sources is determined by the specific software product. After this, the user can independently manipulate objects that he understands in terms of the subject area in order to create cubes and analytical interfaces.

The operating principle of the OLAP server client is different. In an OLAP server, when creating cubes, the user manipulates the physical descriptions of the database.

At the same time, custom descriptions are created in the cube itself. The OLAP server client is configured only for the cube.

Let us explain the principle of operation of the ROLAP client using the example of creating a dynamic sales report (see Diagram 2). Let the initial data for analysis be stored in two tables: Sales and Deal.

When creating the semantic layer, the data sources - the Sales and Deal tables - are described in terms the end user can understand and turn into "Products" and "Deals". The "ID" field of the "Products" table is renamed "Code", the "Name" field becomes "Product", and so on.

Then the Sales business object is created. A business object is a flat table on whose basis a multidimensional cube is formed. When the business object is created, the "Products" and "Deals" tables are joined by the product "Code" field. Since not all of the table fields are needed in the report, the business object uses only the "Item", "Date", and "Amount" fields.

Next, an OLAP report is created based on the business object: the user selects a business object and drags its attributes onto the column or row areas of the report table. In our example, a report on product sales by month was created from the "Sales" business object.
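
A sketch of what the "Sales" business object does under the hood, using the tables from this example (Free Pascal, Delphi-style; the join is shown as a naive nested loop, and the field names follow the text):

program SalesBusinessObject;
{$mode delphi}

type
  TProduct = record
    Code: Integer;
    Name: string;
  end;
  TDeal = record
    Code: Integer;
    Date: string;
    Amount: Double;
  end;

const
  Products: array[0..1] of TProduct = (
    (Code: 1; Name: 'Tea'),
    (Code: 2; Name: 'Coffee'));
  Deals: array[0..2] of TDeal = (
    (Code: 1; Date: '2023-05'; Amount: 5.0),
    (Code: 2; Date: '2023-05'; Amount: 7.0),
    (Code: 1; Date: '2023-06'; Amount: 3.0));

var
  i, j: Integer;
begin
  { the flat business object: only Item, Date, and Amount survive the merge }
  for i := Low(Deals) to High(Deals) do
    for j := Low(Products) to High(Products) do
      if Deals[i].Code = Products[j].Code then
        WriteLn(Products[j].Name, '  ', Deals[i].Date, '  ',
                Deals[i].Amount:0:1);
end.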

When working with an interactive report, the user can set filtering and grouping conditions with the same simple mouse movements. At this point, the ROLAP client accesses the data in the cache. The OLAP server client generates a new query to the multidimensional database. For example, by applying a filter by product in a sales report, you can get a report on sales of products that interest us.

All OLAP application settings can be stored in a dedicated metadata repository, in the application, or in a multidimensional database system repository. Implementation depends on the specific software product.

So, in what cases can using an OLAP client be more effective and profitable for users than using an OLAP server?

The economic feasibility of using an OLAP server arises when the volumes of data are very large and overwhelming for the OLAP client, otherwise the use of the latter is more justified. In this case, the OLAP client combines high performance characteristics and low cost.

Powerful PCs for analysts are another argument in favor of OLAP clients. When using an OLAP server, these capacities are not used. Among the advantages of OLAP clients are the following:

The costs of implementing and maintaining an OLAP client are significantly lower than the costs of an OLAP server.

When using an OLAP client with a built-in engine, the data is transferred over the network once; performing OLAP operations then generates no new data streams.

Setting up ROLAP clients is simplified by eliminating the intermediate step - creating a multidimensional database.

3. Core OLAP system

3.1 Design principles


From what has already been said, it is clear that the OLAP mechanism is one of the popular methods of data analysis today. There are two main approaches to solving this problem. The first of them is called Multidimensional OLAP (MOLAP) - implementation of the mechanism using a multidimensional database on the server side, and the second Relational OLAP (ROLAP) - building cubes on the fly based on SQL queries to a relational DBMS. Each of these approaches has its pros and cons. Their comparative analysis is beyond the scope of this work. Only the core implementation of the desktop ROLAP module will be described here.

This task arose after using a ROLAP system built on the basis of Decision Cube components included in Borland Delphi. Unfortunately, the use of this set of components showed poor performance on large amounts of data. This problem can be mitigated by trying to cut out as much data as possible before feeding it into cubes. But this is not always enough.

You can find a lot of information about OLAP systems on the Internet and in the press, but almost nowhere is it said about how it works inside.

Scheme of work:

The general scheme of operation of a desktop OLAP system can be represented as follows:

Diagram 3. Operation of a desktop OLAP system

The operating algorithm is as follows:

1. Obtaining data in the form of a flat table or the result of executing an SQL query.

2. Caching the data and converting it to a multidimensional cube.

3. Displaying the constructed cube using a cross-tab, a chart, etc. In the general case, an arbitrary number of displays can be connected to one cube.

Let's consider how such a system can be arranged internally. We will start from the side that can be seen and touched, that is, from the displays. The displays used in OLAP systems most often come in two types - cross-tabs and charts. Let's look at a crosstab, which is the basic and most common way to display a cube.

In the picture below, yellow rows and columns containing aggregated results are displayed, cells containing facts are marked in light gray, and cells containing dimensional data are marked in dark gray.

Thus, the table can be divided into the following elements, which we will work with in the future:

When filling out the matrix with facts, we must proceed as follows:

Based on the measurement data, determine the coordinates of the element to be added in the matrix.

Determine the coordinates of the columns and rows of the totals that are affected by the added element.

Add an element to the matrix and the corresponding total columns and rows.

It should be noted that the resulting matrix will be very sparse, which is why organizing it as a two-dimensional array (the option lying on the surface) is not only irrational but most likely impossible: the dimension of the matrix is so large that no reasonable amount of RAM could hold it. For example, if our cube contains sales information for one year and has only 3 dimensions - Customers (250), Products (500), and Date (365) - we get a fact matrix of 250 x 500 x 365 = 45,625,000 elements, even though the matrix may hold only a few thousand filled elements. Moreover, the greater the number of dimensions, the sparser the matrix will be.

Therefore, to work with this matrix, you need to use special mechanisms for handling sparse matrices. Various options for organizing a sparse matrix are possible; they are quite well described in the programming literature, for example in the first volume of Donald Knuth's classic "The Art of Computer Programming".
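
A minimal sketch of one such mechanism, a coordinate list that stores only the filled cells: the 45,625,000-cell matrix from the example above shrinks to however many facts actually exist. The structure and names are illustrative; a real engine would index the search instead of scanning.

program SparseFacts;
{$mode delphi}

type
  TCell = record
    Customer, Product, Day: Integer; { the cell's coordinates }
    Amount: Double;                  { the fact }
  end;

var
  Cells: array of TCell; { only non-empty cells are stored }

procedure AddFact(ACustomer, AProduct, ADay: Integer; AAmount: Double);
begin
  SetLength(Cells, Length(Cells) + 1);
  Cells[High(Cells)].Customer := ACustomer;
  Cells[High(Cells)].Product := AProduct;
  Cells[High(Cells)].Day := ADay;
  Cells[High(Cells)].Amount := AAmount;
end;

{ an empty cell is simply absent and reads as zero }
function Lookup(ACustomer, AProduct, ADay: Integer): Double;
var
  i: Integer;
begin
  Result := 0.0;
  for i := 0 to High(Cells) do
    if (Cells[i].Customer = ACustomer) and (Cells[i].Product = AProduct)
       and (Cells[i].Day = ADay) then
    begin
      Result := Cells[i].Amount;
      Exit;
    end;
end;

begin
  AddFact(17, 230, 364, 99.5);
  WriteLn(Lookup(17, 230, 364):0:1, ' ', Lookup(0, 0, 0):0:1);
end.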

Let us now consider how we can determine the coordinates of a fact, knowing the dimensions corresponding to it. To do this, let's take a closer look at the header structure:

In this case, you can easily find a way to determine the numbers of the corresponding cell and the totals in which it falls. Several approaches can be proposed here. One is to use a tree to find matching cells. This tree can be constructed by traversing the selection. In addition, an analytical recurrence formula can be easily defined to calculate the required coordinate.

The data stored in the table needs to be transformed before it can be used. To improve performance when building a hypercube, it is desirable to find the unique elements stored in the columns that serve as the cube's dimensions. In addition, facts can be pre-aggregated for records that have the same dimension values. As mentioned above, it is the unique values in the dimension fields that matter to us. The following structure can then be proposed for storing them:

Scheme 4. Structure for storing unique values

By using this structure we significantly reduce the memory requirement, which matters because, to increase operating speed, it is advisable to keep the data in RAM. In addition, it is possible to store only the array of elements and dump their values to disk, since we will need them only when displaying the cross-tab.
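
A sketch of such a dictionary (names illustrative): each unique dimension value is kept once, its position in the array serves as its coordinate, and the reverse lookup from coordinate to value is direct addressing.

program DimensionDictionary;
{$mode delphi}

var
  Values: array of string; { unique dimension values; index = coordinate }

{ returns the coordinate of V, adding it to the dictionary if it is new }
function CoordOf(const V: string): Integer;
var
  i: Integer;
begin
  for i := 0 to High(Values) do
    if Values[i] = V then
    begin
      Result := i;
      Exit;
    end;
  SetLength(Values, Length(Values) + 1);
  Values[High(Values)] := V;
  Result := High(Values);
end;

begin
  { three records, two unique values: each value is stored once }
  WriteLn(CoordOf('Moscow'));  { 0 }
  WriteLn(CoordOf('Kazan'));   { 1 }
  WriteLn(CoordOf('Moscow'));  { 0 again: the existing coordinate is reused }
  { coordinate -> value is a direct array access }
  WriteLn(Values[1]);
end.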

The ideas described above were the basis for creating the CubeBase component library.

Diagram 5. Structure of the CubeBase component library

TCubeSource performs caching and conversion of data into the internal format, as well as preliminary aggregation of data. The TCubeEngine component performs hypercube calculations and operations on it; in effect, it is the OLAP engine that transforms a flat table into a multidimensional data set. The TCubeGrid component displays the cross-tab and controls the display of the hypercube. TCubeChart allows you to see the hypercube in the form of charts, and the TCubePivote component controls the operation of the cube core.

So, I looked at the architecture and interaction of components that can be used to build an OLAP machine. Now let's take a closer look at the internal structure of the components.

The first stage of the system's operation is loading the data and converting it into the internal format. A logical question is why this is necessary, since one could simply use the data from a flat table, scanning it when constructing a cube slice. To answer it, let's look at the table structure from the point of view of an OLAP engine. For OLAP systems, table columns are either facts or dimensions, and the logic for working with these columns differs. In a hypercube, the dimensions are actually the axes, and the dimension values are coordinates on those axes. The cube is filled very unevenly: there are coordinate combinations that correspond to no records and combinations that correspond to several records in the original table, with the first situation being more common; that is, the cube resembles the universe - empty space with clusters of points (facts) here and there. Thus, if we pre-aggregate the data during the initial load, that is, combine records that have the same dimension values while calculating preliminary aggregated fact values, we will subsequently have to work with fewer records, which increases the speed of work and reduces the RAM requirements.

To build slices of a hypercube, we need the following capabilities: determining the coordinates (that is, the dimension values) for table records, and determining the records that have specific coordinates (dimension values). Let's consider how these capabilities can be realized. The easiest way to store a hypercube is to use a database of its own internal format.

Schematically, the transformations can be represented as follows:

Figure 6: Converting an Internal Format Database to a Normalized Database

That is, instead of one table, we got a normalized database. Database specialists may say that normalization actually reduces the speed of the system, and in this they would certainly be right - for the case when we need to get the values of dictionary elements (in our case, dimension values). But the thing is that at the slice-construction stage we do not need these values at all. As mentioned above, we are only interested in the coordinates in our hypercube, so we will define coordinates for the dimension values. The simplest thing would be to renumber the element values. For the numbering to be unambiguous within one dimension, we first sort the lists of dimension values (the dictionaries, in database terms) alphabetically. In addition, we renumber the facts, which have been pre-aggregated. We get the following diagram:

Scheme 7. Renumbering the normalized database to determine the coordinates of measurement values

Now all that remains is to connect the elements of the different tables with each other. In relational database theory this is done using special intermediate tables. It is enough for us to associate each entry in the dimension tables with a list whose elements are the numbers of the facts in whose formation that dimension value participated (that is, to determine all facts that have the same value of the coordinate described by this dimension). Each fact record, in turn, is matched to the values of the coordinates at which it is located in the hypercube. In what follows, the coordinates of a record in the hypercube will mean the numbers of the corresponding records in the dimension-value tables. Then for our hypothetical example we get the following set defining the internal representation of the hypercube:

Diagram 8. Internal representation of a hypercube

This will be our internal representation of the hypercube. Since we are not building it for a relational database, we simply use variable-length fields to link the dimension values to their facts (we could not do this in an RDB, where the number of table columns is predetermined).
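
A sketch of this representation with two dimensions (all data invented): each fact carries its coordinates, and each dimension value keeps the variable-length list of the numbers of the facts in which it participates.

program HypercubeIndex;
{$mode delphi}

const
  DimProductValues: array[0..1] of string = ('Coffee', 'Tea');
  { pre-aggregated facts: a coordinate along each dimension plus the measure }
  FactProduct: array[0..3] of Integer = (0, 0, 1, 1);
  FactMonth:   array[0..3] of Integer = (0, 1, 0, 1);
  FactValue:   array[0..3] of Double  = (5.0, 3.0, 7.0, 2.0);

var
  { for each product coordinate, the list of fact numbers with that value }
  FactsOfProduct: array[0..1] of array of Integer;
  i, p, j: Integer;
begin
  { build the dimension-to-facts lists }
  for i := Low(FactProduct) to High(FactProduct) do
  begin
    p := FactProduct[i];
    SetLength(FactsOfProduct[p], Length(FactsOfProduct[p]) + 1);
    FactsOfProduct[p][High(FactsOfProduct[p])] := i;
  end;
  { "all facts with the coordinate Tea" is now a stored list, not a search }
  for j := 0 to High(FactsOfProduct[1]) do
    WriteLn(DimProductValues[1], ': fact ', FactsOfProduct[1][j],
            ' = ', FactValue[FactsOfProduct[1][j]]:0:1);
end.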

We could try to use a set of temporary tables to implement the hypercube, but that method provides too low a performance (the Decision Cube component set being a case in point), so we will use our own data storage structures.

To implement a hypercube, we need to use data structures that will ensure maximum performance and minimal RAM consumption. Obviously, our main structures will be for storing dictionaries and fact tables. Let's look at the tasks that a dictionary must perform at maximum speed:

checking the presence of an element in the dictionary;

adding an element to the dictionary;

search for record numbers that have a specific coordinate value;

search for a coordinate by dimension value;

search for a dimension value by its coordinate.

Various data types and structures can be used to implement these requirements. For example, you can use arrays of structures. In a real case, these arrays require additional indexing mechanisms that will increase the speed of loading data and retrieving information.

To optimize the operation of a hypercube, we must determine which tasks need to be solved first, and by what criteria the quality of the work should be improved. The main thing for us is to increase the speed of the program, while keeping the required amount of RAM reasonably small. Higher performance is possible through additional data-access mechanisms, for example indexing, but unfortunately they increase the RAM overhead. Therefore, we will determine which operations we need to perform at the highest speed. To do this, consider the individual components that implement the hypercube. They are of two main types: the dimension and the fact table. For a dimension, the typical tasks are:

adding a new value;

determining the coordinate from a dimension value;

determining the value from a coordinate.

When adding a new element value, we need to check whether such a value already exists; if it does, we do not add a new one but use the existing coordinate; otherwise we add a new element and determine its coordinate. To do this we need a way to quickly test for the presence of the desired element (the same problem arises when determining a coordinate from an element value). Hashing is optimal for this purpose, and the optimal structure is a hash tree in which we store references to the elements; the elements themselves are the rows of the dimension dictionary. Then the structure of a dimension value can be represented as follows:

type
  PFactLink = ^TFactLink;

  TFactLink = record
    FactNo: integer;   // fact index in the fact table
    Next: PFactLink;   // next element of the list (field restored here: the list is singly linked)
  end;

  TDimensionRecord = record
    Value: string;       // dimension value
    Index: integer;      // coordinate value
    FactLink: PFactLink; // pointer to the head of the list of fact table entries
  end;

In the hash tree we store links to the unique elements. We must also solve the inverse problem: determining the dimension value from its coordinate. For maximum performance, direct addressing should be used, so we could keep another array whose index is the dimension coordinate and whose value is a link to the corresponding dictionary entry. However, it is simpler (and saves memory) to order the array of elements so that each element's index is its coordinate.
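To make this concrete, here is a minimal sketch of a dimension dictionary in Delphi-style Pascal, compilable with Delphi or Free Pascal assuming the Generics.Collections unit is available. All identifiers are illustrative, not taken from any real OLAP library: TDictionary stands in for the hash structure described above, while the inverse lookup (coordinate to value) uses direct addressing into an array whose index is the coordinate.

program DimensionDictDemo;

uses
  SysUtils, Generics.Collections;

type
  TDimension = record
    Values: array of string;              // element index = coordinate (direct addressing)
    Lookup: TDictionary<string, Integer>; // hash lookup: value -> coordinate
  end;

// Returns the coordinate of Value, adding the value to the dictionary if it is new.
function AddOrGetCoord(var Dim: TDimension; const Value: string): Integer;
begin
  if not Dim.Lookup.TryGetValue(Value, Result) then
  begin
    Result := Length(Dim.Values);        // the new coordinate is the next free index
    SetLength(Dim.Values, Result + 1);
    Dim.Values[Result] := Value;
    Dim.Lookup.Add(Value, Result);
  end;
end;

var
  Dim: TDimension;
begin
  Dim.Lookup := TDictionary<string, Integer>.Create;
  try
    WriteLn(AddOrGetCoord(Dim, 'Moscow'));  // 0: a new value
    WriteLn(AddOrGetCoord(Dim, 'Kiev'));    // 1: another new value
    WriteLn(AddOrGetCoord(Dim, 'Moscow'));  // 0: the existing coordinate is reused
    WriteLn(Dim.Values[1]);                 // Kiev: inverse lookup by direct addressing
  finally
    Dim.Lookup.Free;
  end;
end.

Both lookups run in constant time on average, which is exactly what the list of dictionary tasks above demands.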

Organizing an array that implements the list of facts presents no particular problems owing to its simple structure. The only remark is that it is advisable to calculate in advance all the aggregation methods that may be needed and that can be computed incrementally (for example, the sum).
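As an illustration of such incremental aggregation, here is a small self-contained sketch in the same Pascal style (the cell layout is an assumption made for the example): when a fact value is attached to a cell, the running aggregates are updated in constant time instead of being recomputed from all the facts.

program IncrementalAggDemo;

type
  TCell = record
    Sum: Double;    // incrementally maintained sum
    Count: Integer; // lets us derive the average as Sum / Count
    Min, Max: Double;
  end;

// Attach one fact value to the cell, updating all incremental aggregates.
procedure AddFact(var Cell: TCell; const V: Double);
begin
  if Cell.Count = 0 then
  begin
    Cell.Min := V;
    Cell.Max := V;
  end
  else
  begin
    if V < Cell.Min then Cell.Min := V;
    if V > Cell.Max then Cell.Max := V;
  end;
  Cell.Sum := Cell.Sum + V;
  Inc(Cell.Count);
end;

var
  Cell: TCell;
begin
  FillChar(Cell, SizeOf(Cell), 0); // zero-initialize the aggregates
  AddFact(Cell, 10.0);
  AddFact(Cell, 4.0);
  WriteLn('sum = ', Cell.Sum:0:1, ', min = ', Cell.Min:0:1,
          ', avg = ', (Cell.Sum / Cell.Count):0:1);
end.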

So, we have described a way of storing data in the form of a hypercube. It allows us to generate a set of points in multidimensional space from the information held in the data warehouse. For a person to be able to work with these data, they must be presented in a form convenient for processing. The main forms of presentation are the pivot table and charts, and both are in fact projections of the hypercube. To build these views as efficiently as possible, we will start from what the projections represent. Let's begin with the pivot table, the most important one for data analysis.

Let's find ways to implement such a structure. A pivot table consists of three parts: row headers, column headers, and the table of aggregated fact values itself. The simplest way to represent the fact table is a two-dimensional array whose dimensions are determined by constructing the headers. Unfortunately, the simplest method is also the most inefficient: the table will be very sparse and memory will be used extremely wastefully, so only very small cubes could be built before memory runs out. We therefore need a data structure that ensures the maximum speed of searching for and adding a new element together with minimum RAM consumption. This structure is the so-called sparse matrix, about which you can read in more detail in Knuth. There are various ways of organizing such a matrix; to choose the one that suits us, let's first consider the structure of the table headers.

Headers have a clear hierarchical structure, so it is natural to store them in a tree. The structure of a tree node can be schematically depicted as follows:

Appendix C

In this case it is logical to store, as the dimension value, a link to the corresponding element of the dimension table of the multidimensional cube. This reduces the memory cost of storing the slice and speeds up the work. The parent and child nodes are likewise represented by links.

To add an element to the tree, we must know its location in the hypercube; this information is its coordinates, which are stored in the dictionary of dimension values. Let's look at the scheme of adding an element to the header tree of a pivot table. The values of the dimension coordinates serve as the initial information. The order in which the dimensions are listed is determined by the desired aggregation method and matches the hierarchy levels of the header tree. As a result we must obtain the list of pivot-table columns or rows to which the element is to be added.

Appendix D

We use the dimension coordinates as the initial data for determining this structure. For definiteness, let's assume we are determining the matrix column of interest (how to determine a row is considered a little later, since other data structures are more convenient there; the reason for this choice is also explained below). As coordinates we take integers: the numbers of the dimension values, which can be determined as described above.
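A minimal sketch of this procedure in the same Pascal style (the first-child/next-sibling node layout and all identifiers are illustrative assumptions, not the code of the appendix):

program HeaderTreeDemo;

type
  PHeaderNode = ^THeaderNode;
  THeaderNode = record
    Coord: Integer;           // coordinate of the dimension value at this level
    FirstChild: PHeaderNode;  // first node of the next hierarchy level
    NextSibling: PHeaderNode; // next node on the same level
    ColumnIndex: Integer;     // for leaves: the pivot-table column index
  end;

var
  ColumnCount: Integer = 0;

// Walks the header tree along the given coordinates, creating missing nodes;
// returns the leaf node that identifies a pivot-table column.
function AddPath(var Root: PHeaderNode; const Coords: array of Integer): PHeaderNode;
var
  Level: Integer;
  Slot: ^PHeaderNode; // the place where a node with the required coordinate must be
  Node: PHeaderNode;
begin
  Slot := @Root;
  Node := nil;
  for Level := 0 to High(Coords) do
  begin
    // search among the siblings of the current level
    while (Slot^ <> nil) and (Slot^^.Coord <> Coords[Level]) do
      Slot := @Slot^^.NextSibling;
    if Slot^ = nil then
    begin
      New(Node);                     // the node is missing: create it
      Node^.Coord := Coords[Level];
      Node^.FirstChild := nil;
      Node^.NextSibling := nil;
      Node^.ColumnIndex := -1;
      Slot^ := Node;
    end;
    Node := Slot^;
    Slot := @Node^.FirstChild;       // descend to the next level
  end;
  if Node^.ColumnIndex = -1 then
  begin
    Node^.ColumnIndex := ColumnCount; // a new pivot-table column appears
    Inc(ColumnCount);
  end;
  Result := Node;
end;

var
  Root: PHeaderNode = nil;
begin
  // the coordinates are the integer numbers of dimension values, as described above
  WriteLn(AddPath(Root, [0, 2])^.ColumnIndex); // 0: the first column
  WriteLn(AddPath(Root, [0, 3])^.ColumnIndex); // 1: a new leaf under the same parent
  WriteLn(AddPath(Root, [0, 2])^.ColumnIndex); // 0: the existing path is reused
end.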

So, after performing this procedure we obtain an array of references to the columns of the sparse matrix. Now we need to perform the required actions with the rows: find the required element inside each column and add the corresponding value there. For each dimension involved we need to know the number of unique values and the actual set of those values.

Now let's look at how the values inside the columns should be represented, that is, how to determine the required row. Several approaches are possible here. The simplest would be to represent each column as a vector, but since it would be very sparse, memory would be used extremely inefficiently. To avoid this, we will use data structures that represent sparse one-dimensional arrays (vectors) more efficiently. The simplest of these is an ordinary list, singly or doubly linked, but it is uneconomical in terms of element access, so we will use a tree, which provides faster access to elements.

We could, for example, use exactly the same tree as for the columns, but then we would have to create a separate tree for every column, which would lead to significant memory and processing-time overhead. Let's be a little more cunning: we will create one tree storing all the combinations of dimensions used in the rows. It will be identical to the tree described above, but its elements will hold not pointers to rows (which do not exist as such) but row indices; the index values themselves are of no interest and are used only as unique keys. These keys will then be used to find the required element within a column. The columns themselves are most easily represented as ordinary binary trees. Graphically, the resulting structure can be depicted as follows:

Diagram 9. Image of a pivot table as a binary tree

To determine the appropriate row numbers we can use the same procedure as described above for determining the pivot-table columns. Row numbers are unique within one pivot table and identify the elements in the vectors that form its columns. The simplest way to generate these numbers is to maintain a counter and increment it by one whenever a new element is added to the row header tree. The column vectors themselves are most easily stored as binary trees in which the row number serves as the key; hash tables could also be used. Since procedures for working with such trees are discussed in detail in other sources, we will not dwell on them and will consider the general scheme of adding an element to a column.
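A minimal sketch of such a column in the same Pascal style (illustrative names): an ordinary binary tree keyed by the row number, in which a cell appears only when the first fact falls into it.

program SparseColumnDemo;

type
  PCellNode = ^TCellNode;
  TCellNode = record
    RowNo: Integer;         // unique row number: the search key
    Sum: Double;            // incrementally maintained aggregate of the cell
    Left, Right: PCellNode; // binary tree links
  end;

// Finds the cell with the given row number in the column tree, creating it if absent.
function FindOrAddCell(var Root: PCellNode; RowNo: Integer): PCellNode;
begin
  if Root = nil then
  begin
    New(Root);
    Root^.RowNo := RowNo;
    Root^.Sum := 0.0;
    Root^.Left := nil;
    Root^.Right := nil;
    Result := Root;
  end
  else if RowNo < Root^.RowNo then
    Result := FindOrAddCell(Root^.Left, RowNo)
  else if RowNo > Root^.RowNo then
    Result := FindOrAddCell(Root^.Right, RowNo)
  else
    Result := Root;
end;

var
  Column: PCellNode = nil;
  Cell: PCellNode;
begin
  Cell := FindOrAddCell(Column, 7); // the cell is created on first use
  Cell^.Sum := Cell^.Sum + 100.0;
  Cell := FindOrAddCell(Column, 7); // the same cell is found again
  Cell^.Sum := Cell^.Sum + 50.0;
  WriteLn(FindOrAddCell(Column, 7)^.Sum:0:1); // 150.0
end.

Only the cells that actually contain facts consume memory, which is the whole point of the sparse representation.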

In general, the sequence of actions for adding an element to the matrix can be described as follows:

1. Determine the row numbers to which elements are added.

2. Define the set of columns to which elements are added.

3. For all these columns, find the elements with the required row numbers and add the current element to them (adding includes attaching the required number of fact values and calculating the aggregated values, which can be determined incrementally).

After executing this algorithm we obtain a matrix that is exactly the pivot table we needed to build.

Now a few words about filtering when constructing a slice. It is easiest to perform at the stage of matrix construction, since at that stage we have access to all the required fields and, in addition, the values are being aggregated. When a record is retrieved from the cache, its compliance with the filter conditions is checked; if they are not met, the record is discarded.
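A sketch of what such a check might look like (the record layout and the convention that -1 means "no restriction on this dimension" are assumptions made for the example):

program FilterDemo;

type
  TFactRecord = record
    Coords: array[0..2] of Integer; // coordinates of the fact along three dimensions
    Value: Double;
  end;

// True if the record passes the slice filter: for every dimension either
// no restriction is set (-1) or an exact coordinate match is required.
function PassesFilter(const R: TFactRecord; const Filter: array of Integer): Boolean;
var
  i: Integer;
begin
  Result := True;
  for i := 0 to High(Filter) do
    if (Filter[i] <> -1) and (R.Coords[i] <> Filter[i]) then
    begin
      Result := False;
      Exit;
    end;
end;

var
  R: TFactRecord;
begin
  R.Coords[0] := 2; R.Coords[1] := 5; R.Coords[2] := 0;
  R.Value := 10.0;
  WriteLn(PassesFilter(R, [2, -1, -1])); // TRUE: only dimension 0 is restricted
  WriteLn(PassesFilter(R, [3, -1, -1])); // FALSE: the record is discarded
end.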

Since the structure described above completely defines the pivot table, the task of visualizing it becomes trivial: one can use the standard table components available in almost all programming tools for Windows.

The first product to perform OLAP queries was Express (IRI), although the term OLAP itself was coined by Edgar Codd, "the father of relational databases". Codd's work was funded by Arbor, a company that had released its own OLAP product, Essbase, the year before (Essbase later passed to Hyperion, which Oracle in turn acquired in 2007). Other well-known OLAP products include Microsoft Analysis Services (formerly called OLAP Services, part of SQL Server), Oracle OLAP Option, IBM's DB2 OLAP Server (in fact, Essbase with additions from IBM), SAP BW, and products from Brio, BusinessObjects, Cognos, MicroStrategy, and other manufacturers.

From a technical point of view, the products on the market are divided into “physical OLAP” and “virtual”. In the first case, there is a program that performs a preliminary calculation of aggregates, which are then stored in a special multidimensional database that provides quick retrieval. Examples of such products are Microsoft Analysis Services, Oracle OLAP Option, Oracle/Hyperion EssBase, Cognos PowerPlay. In the second case, the data is stored in relational DBMSs, and aggregates may not exist at all or may be created upon the first request in the DBMS or analytical software cache. Examples of such products are SAP BW, BusinessObjects, Microstrategy. Systems based on “physical OLAP” provide consistently better response times to queries than “virtual OLAP” systems. Virtual OLAP vendors claim greater scalability of their products to support very large volumes of data.

In this work, I would like to take a closer look at the BaseGroup Labs product - Deductor.

Deductor is an analytics platform, i.e. a basis for creating complete application solutions. The technologies implemented in Deductor make it possible to go through all the stages of building an analytical system on a single architecture: from creating a data warehouse to automatically selecting models and visualizing the results.

System composition:

Deductor Studio is the analytical core of the Deductor platform. It includes a full set of mechanisms allowing you to obtain information from an arbitrary data source, carry out the entire processing cycle (cleaning, transforming data, building models), display the results in the most convenient way (OLAP, tables, charts, decision trees...), and export the results.

Deductor Viewer is the end-user workstation. The program minimizes the personnel requirements, because all the required operations are performed automatically by previously prepared processing scripts; there is no need to think about how the data is obtained or processed. The Deductor Viewer user only has to select the report of interest.

Deductor Warehouse is a multidimensional cross-platform data warehouse that accumulates all the information necessary for analyzing the subject area. The use of a single repository allows for convenient access, high processing speed, consistency of information, centralized storage and automatic support for the entire data analysis process.

Deductor Server is designed for remote analytical processing. It makes it possible both to automatically "run" data through existing scripts on the server and to retrain existing models. Deductor Server allows a full three-tier architecture to be implemented, in which it serves as the application server; access to it is provided by Deductor Client.

Work principles:

1. Import data

Analysis of any information in Deductor begins with importing the data. As a result of the import, the data is brought into a form suitable for subsequent analysis by all the mechanisms available in the program. The nature of the data, its format, the DBMS, etc. do not matter, because the mechanisms for working with all of them are unified.

2. Export data

Export mechanisms allow the results to be sent to third-party applications, for example to transfer a sales forecast to a system generating purchase orders, or to post a prepared report on a corporate website.

3. Data processing

Processing in Deductor means any action associated with some transformation of data, for example filtering, model building, or cleaning. In this block the actions most important from the analysis point of view are performed. The most significant feature of the processing mechanisms implemented in Deductor is that data obtained as a result of processing can be processed again by any of the methods available to the system; thus arbitrarily complex processing scenarios can be built.

4. Visualization

Data can be visualized in Deductor Studio (Viewer) at any stage of processing. The system itself determines the available ways of doing so; for example, if a neural network has been trained, then in addition to tables and charts the neural network graph can be viewed. The user only needs to select the desired option from the list and configure a few parameters.

5. Integration mechanisms

Deductor does not provide data entry tools - the platform is focused solely on analytical processing. To use information stored in heterogeneous systems, flexible import-export mechanisms are provided. Interaction can be organized using batch execution, working in OLE server mode and accessing the Deductor Server.

6. Replication of knowledge

Deductor implements one of the most important functions of any analytical system: support for the process of knowledge replication, i.e. enabling employees who are not versed in analysis methods or in the way a particular result is obtained to get answers based on models prepared by an expert.

Conclusion

In this work we have examined data analysis systems, an important area of modern information technology. OLAP technology, the main tool for analytical information processing, has been analyzed: the essence of the OLAP concept and the importance of OLAP systems in modern business processes are covered in detail, and the structure and operation of a ROLAP server are described. The Deductor analytical platform is presented as an example of an implementation of OLAP technologies.

OLAP technologies are a powerful tool for real-time data processing. An OLAP server allows you to organize and present data across various analytical areas and turns data into valuable information that helps companies make more informed decisions.

The use of OLAP systems provides consistently high levels of performance and scalability, supporting multi-gigabyte data volumes accessed by thousands of users. With OLAP technologies, information is accessed in real time, i.e. query processing no longer slows down the analysis, ensuring its speed and efficiency. Visual administration tools make it possible to develop and implement even the most complex analytical applications simply and quickly.


OLAP (OnLine Analytical Processing) is the name not of a specific product, but of an entire technology for operational analytical processing, which involves analyzing data and obtaining reports. The user is provided with a multidimensional table that automatically summarizes data in various sections and allows you to quickly manage calculations and the report form.

Although in some publications analytical processing is called both online and interactive, the adjective "online" most accurately reflects the meaning of OLAP technology. The development of management decisions by a manager is among the areas hardest to automate; however, today it is possible to assist the manager in developing decisions and, most importantly, to significantly speed up the process of developing, selecting, and adopting them.

Decision support systems usually have means of providing the user with aggregate data for various samples from the original set in a form convenient for perception and analysis. As a rule, such aggregate functions form a multidimensional data set, often called a hypercube or metacube, whose axes contain parameters and whose cells contain the aggregate data depending on them. Such data can also be stored in relational tables, but in this case we are talking about the logical organization of the data, not the physical implementation of its storage.

Along each axis, data can be organized into a hierarchy, representing different levels of detail.

The dimensions of the multidimensional model correspond to the factors influencing the enterprise's activities (for example: time, products, company branches, etc.). The resulting OLAP cube is then filled with indicators of the enterprise's activity (prices, sales, plan, profits, surpluses, etc.). Note that, unlike a geometric cube, the faces of an OLAP cube need not be of equal size. The cube can be filled both with real data from operational systems and with forecasts based on historical data. The dimensions of the hypercube can be complex and hierarchical, and relationships can be established between them. During the analysis the user can change the point of view on the data (the so-called operation of changing the logical view), viewing the data from various perspectives and solving specific problems. Various operations can be performed on cubes, including forecasting and conditional planning (what-if analysis).

Thanks to this data model, users can formulate complex queries, generate reports, and obtain subsets of data. Operational analytical processing can significantly simplify and speed up the preparation and adoption of decisions by management personnel; it serves the purpose of turning data into information. It is fundamentally different from the traditional decision-support process, which is most often based on reviewing structured reports.


OLAP technology is a form of intelligent analysis and rests on 12 principles:

1. Conceptual multidimensional representation. The user-analyst sees the world of the enterprise as multidimensional by nature; accordingly, the OLAP model must be multidimensional at its core.

2. Transparency. The architecture of an OLAP system must be open, allowing the user, wherever he is, to communicate with the server by means of an analytical tool, the client.

3. Availability. The OLAP user-analyst must be able to perform analysis based on a common conceptual schema covering enterprise-wide data in a relational database as well as data from legacy databases, on common access methods, and on a common analytical model. An OLAP system should access only the data actually needed, rather than taking a general "kitchen sink" approach that pulls in unnecessary input.

4. Consistent performance in report development. As the number of dimensions or the size of the database grows, the user-analyst should not experience a significant decrease in performance.

5. Client-server architecture. Most of the data that today requires online processing is held on mainframes, with user workstations accessing it via a LAN. This means that OLAP products must be able to work in a client-server environment.

6. Generic dimensionality. Every dimension must be equivalent in both its structure and its operational capabilities; basic data structures, formulas, and reporting formats should not be biased toward any one dimension.

7. Dynamic management of sparse matrices. The physical design of an OLAP tool must be fully adapted to the specific analytical model for optimal management of sparse matrices. Sparsity (measured as the percentage of empty cells among all possible cells) is one of the characteristics of multidimensional data.

8. Multi-user support. An OLAP tool must provide concurrent retrieval and update access for multiple user-analysts while maintaining integrity and security.

9. Unrestricted cross operations. Because of their hierarchical nature, the various operations in the OLAP model can stand in dependent relationships to one another, i.e. they are cross-dimensional. Performing them should not require the user-analyst to redefine the associated calculations and operations.

10. Intuitive data manipulation. The user-analyst's view of the dimensions defined in the analytical model must contain all the information needed to act on the OLAP model directly, i.e. without resorting to a menu system or other multi-step user-interface operations.

11. Flexible reporting options. Reporting tools must present data or information synthesized from the data model in any possible orientation. This means that the rows, columns, or pages of a report must be able to display several dimensions of the OLAP model simultaneously, with the ability to show any subset of a dimension's members (values) in any order.

12. Unlimited dimensions and aggregation levels. A study of the possible number of dimensions required in an analytical model showed that a user-analyst may need up to 19 dimensions at once; hence the recommendation on the number of dimensions an OLAP system should support. Moreover, none of these dimensions should be limited in the number of user-defined aggregation levels.

Specialized OLAP systems currently offered on the market include CalliGraph and Business Intelligence.

For solving simple data analysis problems a budget solution can be used: the Microsoft Office applications Excel and Access, which contain basic OLAP tools allowing pivot tables to be created and various reports to be built on their basis.

Purpose of the report

This report will focus on one of the categories of intelligent technologies that are a convenient analytical tool - OLAP technologies.

The purpose of the report is to reveal and highlight two issues: 1) the concept of OLAP and its applied importance in financial management; 2) the implementation of OLAP functionality in software solutions: differences, opportunities, advantages, disadvantages.

I would like to note right away that OLAP is a universal tool applicable in any area that requires data analysis by various methods, and not just in finance (as might be understood from the title of the report).

Financial management

Financial management is an area in which analysis matters more than in any other. Any financial or management decision arises as the result of certain analytical procedures. Today financial management is essential to the successful functioning of an enterprise. Even though it is an auxiliary process, it requires special attention, since erroneous financial and management decisions can lead to large losses.

Financial management is aimed at providing the enterprise with financial resources in the required volumes, at the right time and in the right place in order to obtain the maximum effect from their use through optimal distribution.

It is perhaps difficult to define the level of "maximum resource efficiency", but in any case the CFO should always know:

  • how many financial resources are available;
  • where the funds will come from and in what quantities;
  • where to invest them more effectively and why;
  • at what points in time all this needs to be done;
  • how much is needed to ensure the normal operation of the enterprise.

To give reasoned answers to these questions, one must have a sufficiently large number of performance indicators, and know how to analyze them. In addition, financial management covers a huge number of areas: cash flow analysis, analysis of assets and liabilities, profitability analysis, margin analysis, assortment analysis.

Knowledge

Therefore, a key factor in the effectiveness of the financial management process is the availability of knowledge:

  • Personal knowledge in the subject area (one might say theoretical and methodological), including experience, intuition of a financier/finance director
  • General (corporate) knowledge or systematic information about the facts of financial transactions in an enterprise (i.e. information about the past, present and future state of the enterprise, presented in various indicators and measurements)

While the first kind lies within the competence of the financier himself (or of the HR director who hired him), the second must be purposefully built up in the enterprise by the joint efforts of the finance and information services.

What is there now

Yet a paradoxical situation is now typical of enterprises: there is information, a lot of it, too much. But it is in a chaotic state: unstructured, inconsistent, fragmented, not always reliable, often plainly erroneous, and almost impossible to find or obtain. Mountains of financial reports are generated at length and often uselessly; they are inconvenient for financial analysis and difficult to understand, since they are created not for internal management but for submission to external regulatory authorities.

According to a study conducted by Reuters among 1,300 international managers, 38% of respondents say they spend a lot of time trying to find the information they need. Thus a highly qualified specialist spends highly paid time not on analyzing data but on collecting, searching for, and systematizing the information needed for that analysis. At the same time, managers are overloaded with data that is often irrelevant, which further reduces the effectiveness of their work. The reason for this situation: an excess of information and a lack of knowledge.

What to do

Information must be turned into knowledge. For modern business, valuable information and its systematic acquisition, synthesis, exchange, and use are a kind of currency; but to obtain it, information must be managed like any other business process.

The key to information management is delivering the right information in the right form to stakeholders within the organization at the right time. The goal of such management is to help people work better together using increasing amounts of information.

Information technology in this case acts as a means by which it would be possible to systematize information in an enterprise, provide certain users with access to it and give them the tools to transform this information into knowledge.

Basic concepts of OLAP technologies

OLAP technology (On-Line Analytical Processing) is the name not of a specific product but of an entire technology for the operational analysis of multidimensional data accumulated in a warehouse. To understand the essence of OLAP, let's consider the traditional process of obtaining information for decision-making.

Traditional decision support system

Here, of course, there can be many variants: from complete information chaos to the most typical situation, in which the enterprise has operational systems recording the facts of certain operations and storing them in databases, with a system of queries built on top to extract specific data samples for analytical purposes.

But this method of decision support lacks flexibility and has many disadvantages:

  • only a negligible share of the data potentially useful for decision-making is used;
  • sometimes complex multi-page reports are created of which only 1-2 lines are actually used (the rest just in case): information overload;
  • the process responds slowly to change: if a new data view is needed, the query must be formally described and coded by a programmer and only then executed, which takes hours or days, whereas a decision may be needed now, immediately; and after the new information is received, a new, clarifying question will arise.

While query reports are presented in a one-dimensional format, business problems are usually multidimensional and multifaceted. To get a clear picture of a company's business, the data must be analyzed from various perspectives.

Many companies create excellent relational databases, ideally organizing mountains of unused information, which in itself provides neither a fast nor a sufficiently competent response to market events. Yes, relational databases were, are, and will be the most suitable technology for storing corporate data. What is needed is not a new database technology, but analysis tools that complement the functions of existing DBMSs and are flexible enough to provide and automate the various kinds of intellectual analysis inherent in OLAP.

Understanding OLAP

What does OLAP provide?

  • Advanced tools for accessing data in the warehouse
  • Dynamic interactive data manipulation (rotation, consolidation or drill-down)
  • Clear visual display of data
  • Fast – analysis is carried out in real time
  • Multidimensional data presentation - simultaneous analysis of a number of indicators along several dimensions

To benefit from OLAP technologies, one must: 1) understand the essence of the technologies and their capabilities; 2) clearly define which processes are to be analyzed, which indicators characterize them, and in which dimensions it makes sense to view them, i.e. create an analysis model.

The basic concepts that OLAP technologies operate on are as follows:

Multidimensionality

To understand the multidimensionality of data, first imagine a table showing, for example, the enterprise's costs broken down by economic element and by business unit.

This data is presented in two dimensions:

  • cost item
  • business unit

This table is not very informative, as it shows costs for one specific period of time. For different time periods the analyst would have to compare several tables (one for each period):

The figure adds a third dimension, Time, to the first two (cost item, business unit).

Another way to show multidimensional data is to represent it in the form of a cube:

OLAP cubes allow analysts to obtain data in various slices and get answers to the questions posed by the business:

  • Which costs are critical in which business units?
  • How do business unit costs change over time?
  • How do cost items change over time?

Answers to such questions are necessary for making management decisions: reducing certain cost items, influencing their structure, identifying the causes of cost changes over time and of deviations from the plan, and eliminating them, i.e. optimizing the cost structure.

In this example only 3 dimensions are considered. More than 3 dimensions are difficult to depict, but they work in exactly the same way.

Typically, OLAP applications allow data to be obtained across 3 or more dimensions; for example, one more dimension can be added: Plan/Actual, cost category (direct, indirect), orders, or months. The additional dimensions provide more analytical slices and answer questions with multiple conditions.
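For illustration only, here is a toy cube in the same Pascal style: a dense three-dimensional array Cost[cost item, business unit, month] and a slice aggregated over business units. A real OLAP cube is sparse and stored as described earlier in this work, but the logical operation is the same.

program CubeSliceDemo;

const
  Items = 2;  // cost items
  Units = 2;  // business units
  Months = 3; // time dimension

var
  Cost: array[0..Items - 1, 0..Units - 1, 0..Months - 1] of Double;
  i, u, m: Integer;
  Total: Double;
begin
  // fill the cube with illustrative numbers
  for i := 0 to Items - 1 do
    for u := 0 to Units - 1 do
      for m := 0 to Months - 1 do
        Cost[i, u, m] := 100 * (i + 1) + 10 * (u + 1) + m;

  // slice "cost item 0 over time": aggregate over business units for each month
  for m := 0 to Months - 1 do
  begin
    Total := 0;
    for u := 0 to Units - 1 do
      Total := Total + Cost[0, u, m];
    WriteLn('month ', m, ': ', Total:0:1);
  end;
end.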

Hierarchy

OLAP also allows analysts to organize each dimension into a hierarchy of groups, subgroups, and totals that reflect the measure across the entire organization—the most logical way to analyze a business.

For example, it is advisable to group costs hierarchically:

OLAP allows analysts to obtain data at the common summary level (in effect, the upper level) and then drill down to the lower levels, thereby discovering the exact reason for a change in the indicator.

By letting analysts use multiple dimensions in the data cube, with the dimensions built hierarchically, OLAP provides a picture of the business that is not flattened by the structure of the information warehouse.

Changing directions of analysis in a cube (rotating data)

As a rule, one operates with the following concepts: the dimensions placed in columns and in rows (there may be several of each), the remaining dimensions forming slices, and the contents of the table cells forming the indicators (sales, costs, cash).

Typically, OLAP allows you to change the orientation of cube dimensions, thereby presenting the data in different views.

The display of the cube data depends on:

  • the orientation of the dimensions: which of them are placed in rows, in columns, and in slices;
  • the groups of indicators highlighted in rows, columns, and sections.

Changing the orientation of the dimensions is within the scope of the user's actions.

Thus, OLAP makes it possible to carry out various kinds of analysis and to understand how their results relate to one another:

  • Deviation analysis is an analysis of plan implementation, which is supplemented by factor analysis of the causes of deviations by detailing the indicators.
  • Dependency analysis: OLAP allows various dependencies between indicators to be identified; for example, when beer was removed from the assortment for the first two months, a drop in sales of dried roach was discovered.
  • Comparison (comparative analysis). Comparison of the results of changes in an indicator over time, for a given group of goods, in different regions, etc.
  • Analysis of dynamics allows us to identify certain trends in changes in indicators over time.

Efficiency: one can say that OLAP rests on the laws of psychology, namely the ability to process information requests in "real time", at the pace of the user's analytical comprehension of the data.

If a relational database can read about 200 records per second and write about 20, then a good OLAP server, using calculated rows and columns, can consolidate 20,000-30,000 cells (each equivalent to one record in a relational database) per second.

Visibility: it should be emphasized that OLAP provides advanced means of graphical data presentation to the end user. The human brain can perceive and analyze information presented as geometric images in volumes several orders of magnitude greater than information presented in alphanumeric form. An example: suppose you need to find a familiar face in one of a hundred photographs; this will take you no more than a minute. Were you offered instead a hundred verbal descriptions of the same faces, you would hardly solve the problem at all.

Simplicity: the main feature of these technologies is that they are intended for use not by an information technology specialist or an expert statistician, but by a professional in the applied field: a credit department manager, a budget department manager, and, finally, a director. They are designed so that the analyst communicates with his problem, not with the computer.

Despite the great capabilities of OLAP (and the relative age of the idea, which dates back to the 1960s), it is hardly ever actually used in our enterprises. Why?

  • there is no information about it, or its possibilities are not understood
  • the habit of thinking two-dimensionally
  • the price barrier
  • the excessive technological content of articles on OLAP, whose unusual terms are frightening: OLAP, "data mining and slicing", "ad hoc queries", "identification of significant correlations"

Our approach and the Western approach to the use of OLAP

In addition, we have our own specific understanding of the applied utility of OLAP, even where its technological capabilities are understood.

Russian authors of various materials on OLAP mostly express the following opinion: they see OLAP as a tool that allows data to be expanded and collapsed simply and conveniently, performing whatever manipulations come to the analyst's mind during the analysis. The more "slices" and "sections" of data the analyst sees, the more ideas he has, which in turn require more and more slices for verification. This is not right.

The Western understanding of the usefulness of OLAP is based on a methodological analysis model that must be built in when designing OLAP solutions. The analyst should not play with the OLAP cube, aimlessly changing its dimensions, levels of detail, data orientation, and graphical display (and this really takes time!), but should clearly understand which views he needs, in what sequence, and why (elements of "discovery" may of course occur here, but they are not what makes OLAP fundamentally useful).

Applications of OLAP

  • Budgeting

One of the most fertile areas of application of OLAP technologies. It is no accident that no modern budgeting system is considered complete without OLAP tools for budget analysis. Most budget reports are easily built on OLAP systems, and they answer a very wide range of questions: analysis of the structure of expenses and income, comparison of expenses for particular items across different divisions, analysis of the dynamics and trends of expenses for particular items, analysis of costs and profit.

  • Cash flow

OLAP allows you to analyze cash inflows and outflows in the context of business operations, counterparties, currencies, and time in order to optimize their flows.

  • Financial and management reporting (with analytics that management needs)
  • Marketing
  • Balanced Scorecard
  • Profitability Analysis

If you have the appropriate data, you can find various applications of OLAP technology.

OLAP products

This section will discuss OLAP as a software solution.

General requirements for OLAP products

There are many ways to implement OLAP applications, so no particular technology should be required or even recommended; under different conditions and circumstances one approach may be preferable to another. Implementation techniques include many proprietary ideas that vendors take pride in: variations of client-server architecture, time-series analysis, object orientation, data storage optimization, parallel processing, etc. But such technologies cannot be part of the definition of OLAP.

There are, however, characteristics that must hold in every OLAP product (if it is indeed an OLAP product); together they form the ideal of the technology. These are the five key definitions characterizing OLAP (the so-called FASMI test): Fast Analysis of Shared Multidimensional Information.

  • Fast means that the system should provide most responses to users within approximately five seconds. Even if the system warns that the process will take significantly longer, users may become distracted and lose their train of thought, and the quality of the analysis will suffer. Such speed is not easy to achieve with large amounts of data, especially if special on-the-fly calculations are required. Vendors resort to a wide variety of methods to achieve this goal, including specialized forms of data storage, extensive pre-computation, or stricter hardware requirements; however, there are currently no fully optimized solutions. At first glance it may seem surprising, but even when a report that recently took days is now delivered in a minute, the user quickly becomes bored while waiting, and the project turns out to be much less successful than one with instant response, even at the cost of less detailed analysis.
  • Shared means that the system makes it possible to fulfill all data protection requirements and implement distributed and simultaneous access to data for different levels of users. The system must be able to handle multiple data changes in a timely, secure manner. This is a major weakness of many OLAP products, which tend to assume that all OLAP applications are read-only and provide simplified security controls.
  • Multidimensional is the key requirement. If OLAP had to be defined in one word, this would be it. The system must provide a multidimensional conceptual view of the data, including full support for hierarchies and multiple hierarchies, since this determines the most logical way to analyze a business. There is no minimum number of dimensions that must be handled, as this depends on the application, and most OLAP products have a sufficient number of dimensions for the markets they target. Again, we do not specify what underlying database technology should be used, as long as the user obtains a truly multidimensional conceptual view of the information. This feature is the heart of OLAP.
  • Information. The necessary information must be obtained where it is needed, regardless of its volume and storage location. However, a lot depends on the application. The power of various products is measured in terms of how much input data they can process, but not how many gigabytes they can store. The power of the products varies widely - the largest OLAP products can handle at least a thousand times more data than the smallest. There are many factors to consider in this regard, including data duplication, RAM requirements, disk space usage, performance metrics, integration with information warehouses, etc.
  • Analysis means that the system can handle any logical and statistical analysis specific to the given application and ensures that the results are saved in a form accessible to the end user. The user should be able to define new ad hoc calculations as part of the analysis without having to program. That is, all the required analytical functionality must be provided in a way intuitive for end users. Analysis tools may include procedures such as time-series analysis, cost allocation, currency translation, goal seeking, etc. Such capabilities vary widely among products, depending on their target orientation.

In other words, these 5 key definitions are the goals that OLAP products are designed to achieve.

Technological aspects of OLAP

An OLAP system includes certain components. Various schemes of their operation exist, and a particular product may implement one or another of them.

Components of OLAP systems (what does an OLAP system consist of?)

Typically, an OLAP system includes the following components:

  • Data source
    The source from which data for analysis is taken (data warehouse, database of operational accounting systems, set of tables, combinations of the above).
  • OLAP server
    Data from the source is transferred or copied to the OLAP server, where it is systematized and prepared for faster generation of responses to queries.
  • OLAP client
    The user interface to the OLAP server, in which the user works.

It should be noted that not all components are required. There are desktop OLAP systems that allow you to analyze data stored directly on the user's computer and do not require an OLAP server.

However, the data source is a required element: the availability of data is a fundamental issue. If the data exists in any form (an Excel table, the database of an accounting system, structured branch reports), the IT specialist will be able to integrate it with the OLAP system directly or via an intermediate conversion; OLAP systems have special tools for this. If the data does not exist, or is of insufficient completeness and quality, OLAP will not help. OLAP is only a superstructure over data; without the data it is useless.

Most data for OLAP applications originates in other systems, although in some applications (for example, planning or budgeting) the data can be created directly in the OLAP application. When the data comes from other applications, it usually has to be stored for the OLAP application in a separate, duplicated form; hence the advisability of creating data warehouses.

It should be noted that the term "OLAP" is inextricably linked with the term "data warehouse". A data warehouse is a subject-oriented, time-variant, non-volatile collection of data for supporting management decision-making. Data comes into the warehouse from operational systems (OLTP systems), which are designed to automate business processes; the warehouse can also be replenished from external sources, for example statistical reports.

Although data warehouses contain obviously redundant information that already exists in the databases or files of the operational systems, they are necessary because:

  • the data is fragmented and stored in different DBMS formats;
  • data retrieval performance improves;
  • even if all the enterprise's data were stored on a central database server (which is extremely rare), the analyst would probably not understand its complex, sometimes confusing structures;
  • complex analytical queries against operational information slow down the current work of the company, locking tables for long periods and seizing server resources;
  • the warehouse makes it possible to clean and harmonize the data;
  • it is impossible or very difficult to analyze data directly in the operational systems.

The purpose of the warehouse is to provide the "raw material" for analysis in one place and in a simple, understandable structure. That is, Data Warehousing is not a concept of data analysis but rather a concept of preparing data for analysis; it presupposes the implementation of a single integrated data source.

OLAP products: architectures

When using OLAP products, two questions are important: how and where the data is stored and how it is processed. Depending on how these two processes are implemented, different OLAP architectures are distinguished: there are 3 ways of storing data for OLAP and 3 ways of processing it. Many manufacturers offer several options; some try to prove that their approach is the only prudent one, which is of course absurd. However, very few products can operate efficiently in more than one mode.

OLAP data storage options

Storage in this context means keeping the data in a constantly updated state.

  • Relational databases: a typical choice if the enterprise stores its accounting data in an RDB. In most cases the data should be kept in a denormalized structure (a star schema is the most suitable); a normalized database is unacceptable because of the very low query performance when generating aggregates for OLAP (the resulting data is often kept in aggregate tables).
  • Database files on the client computer (kiosks or data marts): the data can be distributed in advance or created on request on the client computers.

  • Multidimensional databases: this option assumes that the data is kept in a multidimensional database on a server. It can include data extracted and summarized from other systems and relational databases, end-user files, etc. In most cases multidimensional databases are kept on disk, but some products allow the use of RAM, calculating the most frequently used data on the fly. Very few products based on multidimensional databases allow multi-user editing of the data; many allow single-user editing with multi-user reading, while others are limited to reading only.

These three storage places offer different storage capacities, and they are listed in descending order of capacity. They also have different query performance characteristics: relational databases are noticeably slower than the other two options.

Options for processing OLAP data

There are likewise 3 options for processing the data:

  • Using SQL: this option is naturally used when the data is stored in an RDB. However, SQL does not allow multidimensional calculations in a single query, so achieving more than basic multidimensional functionality requires writing complex SQL queries. This does not stop developers from trying, though. In most cases they perform a limited number of suitable calculations in SQL, with the remaining results obtained by multidimensional processing on the server or on the client machine. It is also possible to use a RAM cache that keeps data across more than one query, which dramatically improves response times.
  • Multidimensional processing on the client: The client OLAP product does the calculations itself, but such processing is only available if users have relatively powerful PCs.

  • Server-side multidimensional processing: this is a popular place for performing multidimensional calculations in client-server OLAP applications and is used in many products. Performance is usually high because most of the calculation has already been done; however, this requires a lot of disk space.

Matrix of OLAP architectures

Accordingly, by combining the storage and processing options, a matrix of OLAP system architectures can be obtained: theoretically there can be 9 combinations of these methods, but since 3 of them lack common sense, in reality there are only 6 options for storing and processing OLAP data.

The rows of the matrix are the multidimensional data processing options and the columns are the data storage options; the six meaningful cells (sectors) are as follows:

SQL processing:
  • Relational database (sector 1): Cartesis Magnitude

Server-side multidimensional processing:
  • Relational database (sector 2): Crystal Holos (ROLAP mode), IBM DB2 OLAP Server, CA EUREKA:Strategy, Informix MetaCube, Speedware Media/MR, Microsoft Analysis Services, Oracle Express (ROLAP mode), Pilot Analysis Server, Applix iTM1
  • Server-side multidimensional database (sector 4): Crystal Holos, Comshare Decision, Hyperion Essbase, Oracle Express, Speedware Media/M, Microsoft Analysis Services, PowerPlay Enterprise Server, Pilot Analysis Server, Applix iTM1

Client-side multidimensional processing:
  • Relational database (sector 3): Oracle Discoverer
  • Server-side multidimensional database (sector 5): Informix MetaCube
  • Client computer (sector 6): Dimensional Insight, Hyperion Enterprise, Cognos PowerPlay, Personal Express, iTM1 Perspectives

Since it is storage that determines processing, it is customary to group products by storage option, that is:

  • ROLAP products: sectors 1, 2, and 3
  • Desktop OLAP products: sector 6
  • MOLAP products: sectors 4 and 5
  • HOLAP products (allowing both multidimensional and relational data storage): sectors 2 and 4 (the products that appear in both sectors)

Categories of OLAP products

There are more than 40 OLAP vendors, although they cannot all be considered competitors, since their capabilities differ greatly and they in fact operate in different market segments. They can be grouped into 4 fundamental categories whose differences run along two axes: complex versus simple functionality, and performance versus disk space. It is useful to depict the categories as a square, since this clearly shows the relationships between them: the distinctive feature of each category is represented on its own side, its similarities with others on the adjacent sides, so that the categories on opposite sides are fundamentally different.

Peculiarities

Advantages

Flaws

Representatives

Applied OLAP

Complete applications with rich functionality. Almost all require a multidimensional database, although some work with a relational one. Many of this category of applications are specialized, such as sales, manufacturing, banking, budgeting, financial consolidation, sales analysis

Possibility of integration with various applications

High level of functionality

High level of flexibility and scalability

Application complexity (user training required)

High price

Hyperion Solutions

Crystal Decisions

Information Builders

The product is based on a non-relational data structure that provides multidimensional storage, processing and presentation of data. During the analysis process, data is selected exclusively from a multidimensional structure. Despite the high level of openness, suppliers persuade buyers to purchase their own tools

High performance (fast calculations of summary indicators and various multidimensional transformations for any of the dimensions). The average response time to an ad hoc analytical query when using a multidimensional database is usually 1-2 orders of magnitude less than in the case of an RDB

High level of openness: a large number of products with which integration is possible

They easily cope with the tasks of including various built-in functions in the information model, conducting specialized analysis by the user, etc.

The need for large disk space to store data (due to redundancy of data that is stored). This is an extremely inefficient use of memory - due to denormalization and pre-executed aggregation, the volume of data in a multidimensional database corresponds to 2.5-100 times less than the volume of the original detailed data. In any case, MOLAP does not allow working with large databases. The real limit is a database of 10-25 gigabytes

The potential for a database “explosion” is an unexpected, sharp, disproportionate increase in its volume

Lack of flexibility when it comes to modifying data structures. Any change in the structure of dimensions almost always requires a complete restructuring of the hypercube

For multidimensional databases, there are currently no uniform standards for the interface, languages ​​for describing and manipulating data

Hyperion (Essbase)

DOLAP (Desktop OLAP)

Peculiarities: client OLAP products that are fairly easy to deploy and have a low cost per seat. They target analytical processing where the hypercubes are small, their dimensionality is low and the needs are modest, so a desktop PC is sufficient. The goal of vendors in this market is to automate hundreds and thousands of workplaces where users perform fairly simple analysis; buyers are often encouraged to purchase more seats than necessary.

Advantages:
  • good integration with databases, both multidimensional and relational;
  • possibility of bulk purchases, which reduces the cost of implementation projects;
  • ease of use of the applications.

Flaws:
  • very limited functionality (not comparable in this respect with specialized products);
  • very limited power (small data volumes, small number of dimensions).

Representatives: Cognos (PowerPlay), Business Objects, Crystal Decisions.

ROLAP (Relational OLAP)

Peculiarities: this is the smallest sector of the market. Detailed data remains where it originally was, in the relational database; some aggregates are stored in the same database in specially created service tables (a minimal sketch of such a service table follows this block).

Advantages:
  • capable of handling very large amounts of data (cost-effective storage);
  • multi-user operation, including editing and not just reading;
  • higher level of data protection and good options for differentiating access rights;
  • frequent changes to the dimension structure are possible without physical reorganization of the database.

Flaws:
  • low performance, significantly inferior to multidimensional systems in response speed (responses to complex queries are measured in minutes or even hours rather than seconds); these are better report builders than interactive analysis tools;
  • complexity of the products: to achieve performance comparable to MOLAP, relational systems require careful design of the database schema and tuning of indexes, i.e. significant maintenance effort from database administrators and IT specialists;
  • expensive to implement;
  • the limitations of SQL remain a reality, preventing the implementation in an RDBMS of many built-in functions that are easily provided in systems based on a multidimensional representation of data.

Representatives: Information Advantage, Informix (MetaCube).
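
To make the idea of service tables concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema and data are hypothetical; it only illustrates the pattern described above: detailed facts stay in a relational table while a pre-computed aggregate lives in a separate service table.

```python
# ROLAP-style storage sketch: detailed facts in a relational table,
# pre-computed aggregates in a "service" table in the same database.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (               -- detailed fact table
        day TEXT, product TEXT, region TEXT, amount REAL
    );
    INSERT INTO sales VALUES
        ('2024-01-10', 'A', 'North', 100.0),
        ('2024-01-15', 'A', 'South',  80.0),
        ('2024-02-03', 'B', 'North',  50.0);

    -- service table: aggregates by month and product
    CREATE TABLE agg_sales_month AS
    SELECT substr(day, 1, 7) AS month, product,
           SUM(amount) AS amount, COUNT(*) AS fact_count
    FROM sales
    GROUP BY month, product;
""")

# Summary queries now read the small aggregate table, not the raw facts.
for row in con.execute("SELECT * FROM agg_sales_month ORDER BY month"):
    print(row)
```

The flip side noted among the flaws also shows here: whenever the dimensions or facts change, such service tables must be re-derived, which is where the administration effort comes from.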

It should be noted that users of hybrid products that allow choosing between ROLAP and MOLAP modes, such as Microsoft Analysis Services, Oracle Express, Crystal Holos and IBM DB2 OLAP Server, almost always select the MOLAP mode.

Each of the presented categories has its strengths and weaknesses; there is no single optimal choice. The choice involves a trade-off among 3 important aspects: 1) performance; 2) disk space for data storage; 3) capabilities, functionality and especially scalability of the OLAP solution. One must take into account the volume of the data being processed, the power of the hardware and the needs of the users, and seek a compromise between speed and the redundancy of the disk space occupied by the database, between simplicity and versatility.

Classification of Data Warehouses in accordance with the volume of the target database

Disadvantages of OLAP

Like any technology, OLAP has its drawbacks: high hardware requirements, high demands on the training and knowledge of both administrative staff and end users, and high costs of the implementation project (monetary as well as time and intellectual).

Selecting an OLAP product

Choosing the right OLAP product is difficult, but very important if the project is not to fail.

As you can see, the differences between products lie in many areas: functional, architectural, technical. Some products are very limited in their configurability. Some are created for specialized subject areas: marketing, sales, finance. There are also general-purpose products with no specific application focus, which must therefore be quite flexible; as a rule, such products are cheaper than specialized ones, but their implementation costs are higher. The range of OLAP products is very wide, from the simplest tools for building pivot tables and charts that ship as part of office products to data-analysis and pattern-discovery tools that cost tens of thousands of dollars.
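
At the simplest end of that spectrum, a pivot table is a few lines of code. The sketch below uses pandas (an assumption of this example, not a product named in the text) with invented sales data, purely to illustrate what "building pivot tables" means.

```python
# A pivot table: the entry-level form of multidimensional analysis.
import pandas as pd

# Hypothetical detailed sales records.
df = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "amount":  [100.0, 120.0, 80.0, 95.0, 40.0],
})

# Rotate the facts into a region x quarter summary with grand totals.
pivot = pd.pivot_table(df, values="amount", index="region",
                       columns="quarter", aggfunc="sum", margins=True)
print(pivot)
```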

As in any other field, in the field of OLAP there cannot be clear guidelines for choosing tools. You can only focus on a number of key points and compare the proposed software capabilities with the needs of the organization. One thing is important: without properly thinking about how you are going to use OLAP tools, you risk creating a major headache for yourself.

During the selection process, two assessments need to be made:

  • assess the needs and capabilities of the enterprise;
  • evaluate the existing offering on the market, including its development trends.

Then compare the two and, in fact, make the choice.

Needs assessment

You can't make a rational product choice without understanding what it will be used for. Many companies want the “best possible product” without a clear understanding of how it should be used.

For the project to be implemented successfully, the financial director must, at a minimum, competently formulate his wishes and requirements to the manager and to the automation specialists. Many problems arise from insufficient preparation for, and awareness of, the choice of OLAP; IT specialists and end users often have communication difficulties simply because they operate with different concepts and terms and put forward conflicting preferences. There must be consistency of goals within the company.

Some factors are already obvious from the overview of OLAP product categories above, namely:

Technical aspects

  • Data sources: corporate data warehouse, OLTP systems, table files, relational databases. Check whether the OLAP tools can be linked to every DBMS used in the organization. As practice shows, integrating heterogeneous products into a stably operating system is one of the most important issues, and solving it can in some cases involve big problems. It is necessary to understand how simply and reliably the OLAP tools can be integrated with the DBMSs already in place. It is also important to evaluate integration not only with data sources but also with other applications to which you may need to export data: e-mail, office applications
  • Variability (volatility) of the data that must be taken into account
  • Server platform: NT, Unix, AS/400, Linux; do not insist, however, that OLAP products run on questionable or dying platforms that you still happen to use
  • Client-side and browser standards
  • Deployment architecture: local network with modem-connected PCs, high-speed client/server, intranet, extranet, Internet
  • International features: multi-currency support, multilingual operation, data sharing, localization, licensing, Windows updates
  • Volumes of input information that are available now and that will appear in the future

Users

  • Area of application: sales/marketing analysis, budgeting/planning, performance analysis, analysis of accounting reports, qualitative analysis, analysis of financial condition, preparation of analytical materials (reports)
  • Number of users and their location, requirements for the division of access rights to data and functions, secrecy (confidentiality) of information
  • User type: senior management, finance, marketing, HR, sales, production, etc.
  • User experience and skill level; consider providing training. It is very important that the OLAP client application is designed so that users feel confident and can use it effectively

Key features: data write-back needs, distributed computing, complex currency conversions, report printing needs, a spreadsheet interface, the complexity of the application logic, the number of dimensions required, and the types of analysis: statistical, goal seeking, what-if analysis
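
To illustrate one of the analysis types just listed, here is a minimal goal-seeking sketch: find the input value at which a model hits a target. The profit model, its numbers and the bisection approach are assumptions made for this example only.

```python
# Goal seeking: search for the price at which a toy profit model
# reaches a target value. Model and numbers are invented.

def profit(price: float) -> float:
    units_sold = 10_000 - 40 * price   # hypothetical linear demand
    return units_sold * (price - 50)   # assumed unit cost of 50

def goal_seek(target: float, lo: float, hi: float, eps: float = 1e-6) -> float:
    """Bisection; assumes profit() is increasing on [lo, hi]."""
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if profit(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# At what price does profit reach 300,000?
print(f"{goal_seek(300_000, lo=50, hi=150):.2f}")   # ~100.00
```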

Implementation

  • Who will be involved in implementation and operation: external consultants, internal IT function or end users
  • Budget: software, hardware, services, data transfer. Remember that OLAP product licenses are only a small part of the total project cost. Implementation and hardware costs may exceed the license fee, and long-term support, operation and administration costs almost certainly will, by a significant margin. If you decide to buy the wrong product just because it is cheaper, you may end up with a higher overall project cost through increased maintenance, administration and/or hardware expenses, while likely getting a lower level of business benefit. When estimating total costs, be sure to ask: how broad are the available sources of implementation, training and support? Is the potential pool of people (employees, contractors, consultants) likely to grow or shrink? How widely can you draw on available industry expertise?

Although the cost of analytical systems remains quite high even today, and the methodologies and technologies for implementing them are still maturing, the economic effect they provide already significantly exceeds the effect of traditional operational systems.

The effect of proper organization and of strategic and operational planning of business development is difficult to quantify in advance, but it can evidently exceed the cost of implementing such systems by tens or even hundreds of times. However, one should not be deceived: the effect is produced not by the system itself but by the people working with it. Declarations like “a data warehouse with OLAP technologies will help the manager make the right decisions” are therefore not entirely correct. Modern analytical systems are not artificial-intelligence systems, and they can neither help nor hinder decision making. Their goal is to promptly provide the manager with all the information needed for a decision, in a convenient form. What information is requested, and what decision is made on its basis, depends only on the specific person using the system.

It remains to be said that such systems can help solve many business problems and can have a far-reaching positive effect. Time will tell who will be the first to realize the advantages of this approach and get ahead of the rest.

Internet