Tuesday, September 3, 2013

Optimizing Essbase with Regression Analysis


Hyperion Environment Statistics Gathering/Trend And Causality Analysis

In complex environments, continuous monitoring of system health is necessary for proactive issue resolution, especially for issues related to performance degradation. As data volumes grow and new system elements and integration points are added, issue resolution becomes difficult without a systematic approach. I experimented with a framework for automatic metrics gathering that allows continuous system monitoring, statistical regression analysis, and causality inference between failure modes and system statistics. Based on failure likelihood, severity, and the results of causality analysis, we can prioritize issues and allocate resources for their resolution. The framework gathers statistics from Essbase databases and logs, as well as ETL and automation performance metrics. The information is gathered before and after events like calc script executions, data loads, metadata synchronization, etc.


Optimization approaches
Usually Hyperion developers have to deal with application optimization either as part of the original implementation, or when optimizing existing applications whose performance has degraded substantially.

The implicit assumption behind this approach is that applications change slowly, and that we can afford this investment every once in a while.


Causes of degradation
Typically degradation comes from the following areas:
  1. Inefficient allocation of hardware resources to applications and calculations
  2. Architectural deficiencies
    1. Inefficient outline, bad calculations
    2. Overall bad design
    3. High system complexity - usually caused by conflicting or inconsistent requirements
    4. Inappropriate choice of tools
  3. Bad integration processes
    1. Data and metadata loads
    2. Inefficient integration with other applications
    3. High reliance on batch processes instead of just-in-time/real-time processes.
  4. And finally, bad business processes around development, change control, data quality control, and so forth.

We should keep in mind that the quality of a system, or how efficient it is, is expressed not only by its current performance, but also by factors like how fast, expensive, and risky it is to scale the system to a certain level, how reliable it is, and what it costs to maintain it at its current level.

These performance improvement areas are ordered more or less by their complexity and by the level of effort required to fix them.
It is relatively easy to fine-tune caches or the number of threads for calculations. Optimizing outlines and calculations is more difficult. And business process change can involve many parties and can be really hard.

Resource allocation, outline, and calculation optimizations have traditionally been the primary focus of Essbase performance tuning. A developer would start by collecting current database statistics such as fragmentation ratio, block density, parent-child ratios of different dimensions, memory and thread allocation, number of blocks, and so forth.

Many of those optimization techniques are less relevant nowadays.
This is because of the new infrastructure Hyperion tools run on, and because Essbase itself has changed a lot. Large memory volumes, SSD drives, and 64-bit operating systems have had a profound effect on how we design and tune applications.




Non-linear relationships
On the one hand, there are many more resources available; on the other, the relationship between resources and performance is not linear, and often has local or global extremes.
Examples of this are:
  • The number of CPUs used for CALCPARALLEL: at some point adding CPUs may result in decreased performance
  • Adding more memory to the calculator cache and other caches
  • Block size








Since those optimal resource allocations depend on specific data volumes and outlines, even if the system is optimized, it will move away from its performance optimum the moment you add a member or send additional data into the database.

Here we need to look closely at the assumption we mentioned in the beginning: how slowly do our systems really change? From a business point of view, we are exposed to larger and more diverse data sets over time. The ability to bring that diversity into our analytics creates a competitive advantage; hence the motivation to constantly incorporate new requirements into the systems. In addition, more companies implement master data management and standard ETL tools, which continually change outlines and database structures. With those factors in mind, our manual, infrequent optimization effort may look like shooting at a moving target. And how can we do that when we have 10, 20 or 50 applications to deal with?



Trajectory vs single observation
An alternative is to track performance constantly and make adjustments on a regular basis. This is much easier if, in addition to the current state, we can observe the system's trajectory. Once we have historical system metrics, we can use statistical inference to understand what causes performance degradation.



You can automatically collect metrics from Essbase databases and logs, and from any object of the ETL job, before and after events like calc script executions, data loads, metadata synchronization, etc.
We used Python for this purpose, but the next implementation will replace Python with Java.
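
As a minimal sketch of the idea (the function and metric names below are illustrative, not the actual framework): wrap each monitored step, snapshot statistics before and after it, and record the elapsed time along with both snapshots.

import time
from datetime import datetime

def run_monitored_step(job, step, action, collectors, sink):
    """Execute 'action' and record metric snapshots around it.

    collectors: dict of {source_name: callable returning {metric: value}}
    sink:       callable(job, step, phase, metrics) that persists the rows
    """
    before = {src: collect() for src, collect in collectors.items()}
    start = datetime.now()
    t0 = time.time()
    try:
        action()  # calc script execution, data load, metadata sync, etc.
    finally:
        elapsed = time.time() - t0
        after = {src: collect() for src, collect in collectors.items()}
        sink(job, step, 'BEFORE', dict(flatten(before), START_TIME=start.isoformat()))
        # SECLEN is assumed here to mean the step duration in seconds used later in the view
        sink(job, step, 'AFTER', dict(flatten(after), SECLEN=elapsed))

def flatten(snapshots):
    # Prefix each metric with its source, e.g. 'DB.average_clustering_ratio'
    return dict(('%s.%s' % (src, k), v)
                for src, metrics in snapshots.items()
                for k, v in metrics.items())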

Metrics Gathering Workflow
This diagram shows the workflow of metrics gathering. 

Steps 100-110-111 represent the hierarchy of steps within the process. It can be a complex ETL job that involves metadata and data loads for multiple applications, or be as simple as a single business rule executed by a user.
You can monitor steps at the most granular level, and also at the higher level of task groups. Usually the lowest-level monitoring makes the most sense, but monitoring at the group level as well can be useful when the objective is to optimize the performance of the group as a whole, and not of individual tasks.

In our case, the metrics gathering methods write into a relational database.
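
A sketch of that write step, assuming a simple name/value metrics table (the table, columns, and driver choice here are illustrative); storing one row per metric is what keeps the storage agnostic of which statistics a particular step collects:

# Illustrative storage table:
#   CREATE TABLE HYP_ESS_PRF_METRIC (
#       JOB VARCHAR2(100), STEP VARCHAR2(100), PHASE VARCHAR2(10),
#       SNAPSHOT_TS TIMESTAMP, METRIC_NAME VARCHAR2(100), METRIC_VALUE VARCHAR2(400))
import cx_Oracle

def write_metrics(conn, job, step, phase, metrics):
    # One row per metric name/value pair
    rows = [(job, step, phase, name, str(value)) for name, value in metrics.items()]
    cursor = conn.cursor()
    cursor.executemany(
        """INSERT INTO HYP_ESS_PRF_METRIC
           (JOB, STEP, PHASE, SNAPSHOT_TS, METRIC_NAME, METRIC_VALUE)
           VALUES (:1, :2, :3, SYSTIMESTAMP, :4, :5)""",
        rows)
    conn.commit()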

In this example we collect metrics from:
  • The Essbase log for the executed task
  • Database settings, using MaxL commands
  • And obviously, the start and end time of each step is recorded as well.

What can we do with this raw metrics data?
  • We can run a query to see how a particular task executes over time.
  • We can send an alert if performance drops below a certain threshold (a sketch of these first two follows the list).
  • We can display the data in graphical form.
  • We can run statistical regressions to explain performance degradation by other statistics.
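
A minimal sketch of the first two bullets, assuming the HYP_V_ESS_PRF_MONITOR_ALL view used in the regression query below; the START_TIME column and the threshold value are illustrative:

import cx_Oracle

THRESHOLD_SECONDS = 1200  # illustrative alert threshold

def check_step(conn, job_pattern, step_pattern):
    # Pull the elapsed-time history for one job/step and flag the latest run if it is too slow
    cursor = conn.cursor()
    cursor.execute(
        """SELECT START_TIME, SECLEN
             FROM HYP_V_ESS_PRF_MONITOR_ALL
            WHERE upper(JOB) LIKE upper(:job)
              AND STEP LIKE :step
            ORDER BY START_TIME""",
        job=job_pattern, step=step_pattern)
    history = cursor.fetchall()
    if history and history[-1][1] > THRESHOLD_SECONDS:
        print('ALERT: %s / %s ran %.0f seconds' % (job_pattern, step_pattern, history[-1][1]))
    return history

# Example: check_step(conn, '%load_legal%', '%agg_act%')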

The following is a piece of the Essbase log relevant to the latest agg.csc calc script execution.

Received Command [Calculate] from user [XXXXX@Native Directory] using [agg.csc]
Commit Blocks Interval for the calculation is [3000]
Remote bitmap cache is [Disabled]
Maximum Number of Lock Blocks: [420] Blocks
Completion Notice Messages: [Disabled]
Calculations On Updated Blocks Only: [Disabled]
Clear Update Status After Full Calculations: [Enabled]
Calculator Cache With Single Bitmap For: [Product]
Aggregate #Missing values:  [Enabled]
Calculating in parallel with [4] threads
Calculation task schedule [24330,5110,250,10]
Parallelizing using [3] task dimensions.
Empty tasks [21054,3988,175,7]
Total Block Created: [0.0000e+000] Blocks
Sparse Calculations: [5.6873e+006] Writes and [1.1818e+007] Reads
Dense Calculations: [0.0000e+000] Writes and [0.0000e+000] Reads
Sparse Calculations: [1.1830e+009] Cells
Dense Calculations: [0.0000e+000] Cells
Total Calc Elapsed Time for [agg.csc] : [924.415] seconds
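
Pulling the numbers out of such a log fragment is straightforward with regular expressions; a sketch (the metric names on the left are our own, chosen to match the columns used in the regression query below):

import re

LOG_PATTERNS = {
    'CALCPARELLEL':        r'Calculating in parallel with \[(\d+)\] threads',
    'TOTALBLOCKSSCREATED': r'Total Block Created: \[([\d.e+-]+)\] Blocks',
    'SPARSEWRITE':         r'Sparse Calculations: \[([\d.e+-]+)\] Writes and \[[\d.e+-]+\] Reads',
    'SPARSEREAD':          r'Sparse Calculations: \[[\d.e+-]+\] Writes and \[([\d.e+-]+)\] Reads',
    'SPARSECELLS':         r'Sparse Calculations: \[([\d.e+-]+)\] Cells',
    'TOTALTIME':           r'Total Calc Elapsed Time for \[.*\] : \[([\d.]+)\] seconds',
}

def parse_calc_log(text):
    # Return {metric_name: value} for every pattern found in the log fragment
    metrics = {}
    for name, pattern in LOG_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            metrics[name] = float(match.group(1))
    return metrics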


And this is the output of the display database MaxL statement.

number_of_dimensions:17
declared_block_size:1344
actual_block_size:468
declared_maximum_blocks:110074828662099840
actual_maximum_blocks:78385241435054400
number_of_non_missing_leaf_blocks:84816
number_of_non_missing_non_leaf_blocks:5687309
number_of_total_blocks:5772125
index_type:1
average_block_density:1.683760683761
average_sparse_density:7.363790548228e-009
block_compression_ratio:0.04364779874214
average_clustering_ratio:0.4646076943676
average_fragmentation_quotient:10.44245235983
free_space_is_recoverable:FALSE
estimated_bytes_of_recoverable_free_space:0 
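
These name:value pairs are trivial to turn into metrics; a sketch, assuming the MaxL output has already been spooled to a file or captured as text:

def parse_display_database(text):
    # Turn the spooled "display database" output into a {statistic: value} dictionary
    metrics = {}
    for line in text.splitlines():
        if ':' not in line:
            continue
        name, _, value = line.partition(':')
        value = value.strip()
        try:
            metrics[name.strip()] = float(value)   # numeric statistics
        except ValueError:
            metrics[name.strip()] = value          # e.g. free_space_is_recoverable:FALSE
    return metrics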


This is how the view data set looks. Please note that we may need to collect different sets of statistics for different process steps. The table in which they are stored is agnostic of the number and types of statistics; the relevant set of metrics is defined dynamically in the view.
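
One way to define such a view dynamically (a sketch that assumes the name/value table from the earlier write_metrics example; all names are illustrative) is to generate a pivot over whatever metric names were actually recorded:

def build_pivot_sql(metric_names):
    # Generate a pivot SELECT over the generic name/value table, so the result set
    # adapts to whichever statistics were collected for a step
    cols = ",\n  ".join(
        "MAX(CASE WHEN METRIC_NAME = '%s' THEN TO_NUMBER(METRIC_VALUE) END) AS %s" % (name, name)
        for name in metric_names)
    return ("SELECT JOB, STEP, SNAPSHOT_TS,\n  %s\n"
            "FROM HYP_ESS_PRF_METRIC\n"
            "GROUP BY JOB, STEP, SNAPSHOT_TS" % cols)

print(build_pivot_sql(['SECLEN', 'SPARSECELLS', 'CLUSTERINGRATIO']))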



Once we have a sufficient number of data points, we can run statistical regressions. In this example we use the built-in Oracle function REGR_R2 for single-variable regression.

SELECT
REGR_R2(SECLEN, LEAFBLOCKS)  LEAFBLOCKS
,REGR_R2(SECLEN, CALCPARELLEL) CALCPARELLEL
,REGR_R2(SECLEN, TOTALBLOCKSSCREATED) TOTALBLOCKSSCREATED
,REGR_R2(SECLEN, SPARSEWRITE) SPARSEWRITE
,REGR_R2(SECLEN, SPARSEREAD) SPARSEREAD
,REGR_R2(SECLEN, DENSEWRITE) DENSEWRITE
,REGR_R2(SECLEN, DENSEREAD) DENSEREAD
,REGR_R2(SECLEN, SPARSECELLS) SPARSECELLS
,REGR_R2(SECLEN, DENSECELLS) DENSECELLS
,REGR_R2(SECLEN, TOTALTIME) TOTALTIME
,REGR_R2(SECLEN, DECLAREBLOCKSIZE) DECLAREBLOCKSIZE
,REGR_R2(SECLEN, NONLEAFBLOCKS) NONLEAFBLOCKS
,REGR_R2(SECLEN, BLOCKDENSITY) BLOCKDENSITY
,REGR_R2(SECLEN, SPARSEDENSITY) SPARSEDENSITY
,REGR_R2(SECLEN, BLOCKCOMPRESS) BLOCKCOMPRESS
,REGR_R2(SECLEN, CLUSTERINGRATIO) CLUSTERINGRATIO
,REGR_R2(SECLEN, FRAGQUOTIENT) FRAGQUOTIENT

FROM HYP_V_ESS_PRF_MONITOR_ALL
 WHERE 1=1
     and upper(JOB) LIKE upper('%load_legal%')
     and CLUSTERINGRATIO is not null
     and STEP like '%agg_act%'


R2 indicates how well the data points fit a regression line; in other words, what percentage of the dependent variable's variability is explained by the variability of the independent variable. I have to say that by running these regressions we are making several assumptions, but we will omit those in this discussion.
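
For reference, for a single-variable least-squares regression of y on x with an intercept, R2 is just the squared correlation coefficient:

R^2 = \left(\frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}\right)^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}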

Statistic              R2      Statistic          R2
LEAFBLOCKS             6%      NONLEAFBLOCKS      7%
TOTALBLOCKSSCREATED    8%      BLOCKDENSITY       0%
SPARSEWRITE            48%     SPARSEDENSITY      9%
SPARSEREAD             55%     BLOCKCOMPRESS      1%
DENSEWRITE             0%      CLUSTERINGRATIO    40%
DENSEREAD              0%      FRAGQUOTIENT       21%
SPARSECELLS            46%


In this table we can see that four variables have the highest R2:
  • Number of sparse cells in the database (SPARSECELLS)
  • Number of sparse cells read during calculation (SPARSEREAD)
  • Number of sparse cells written during calculation (SPARSEWRITE)
  • Clustering ratio (CLUSTERINGRATIO)

It is obvious that calculation time variability will be explained by the number of cells that need to be calculated. Yet that explains only 55% of the variability; the other 45% is explained by other factors. One distinctive factor is the clustering ratio. If the clustering ratio were completely unrelated to the number of cells in the database, those two factors together would explain 95% (55% + 40%) of the variability in execution time. But we know that the more cells we calculate and change, the more fragmented the database becomes. To establish that relationship we need to find the R2 of SPARSECELLS on CLUSTERINGRATIO; it happens to be ~30%.
So the conclusion would be that roughly 27% of calc execution variability (about 40% x (1 - 30%)) is explained by database fragmentation itself.
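
Such pairwise checks are easy to automate once the view data is pulled into Python; a sketch using pandas (the view and column names are as in the query above, everything else is illustrative):

import pandas as pd

def pairwise_r2(df, target, predictors):
    # Squared Pearson correlation of each predictor column with the target column
    r = df[predictors].corrwith(df[target])
    return (r ** 2).sort_values(ascending=False)

# Example usage against a frame loaded from the monitoring view:
# df = pd.read_sql("SELECT * FROM HYP_V_ESS_PRF_MONITOR_ALL", conn)
# print(pairwise_r2(df, 'SECLEN', ['SPARSECELLS', 'SPARSEREAD', 'SPARSEWRITE', 'CLUSTERINGRATIO']))
# print(pairwise_r2(df, 'CLUSTERINGRATIO', ['SPARSECELLS']))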

The next step would be to assess the severity of the particular performance issue for business users (in this case the increase in calc time) and the probability of the causing factor (database fragmentation), then prioritize issue resolution and allocate resources for it.