Presentation I was making at Oracle Open World 09/26/2013
Hyperion Environment Statistics Gathering/Trend And Causality Analysis
In complex environments, continuous monitoring of system health is necessary for proactive issue resolution, especially for issues related to performance degradation. As data volumes grow and new system elements and integration points are added, resolving issues becomes difficult without a systematic approach. I experimented with a framework for automatic metrics gathering that allows continuous system monitoring, statistical regression analysis, and causality inference between failure modes and system statistics. Based on failure likelihood, severity, and the results of causality analysis, we can prioritize issues and allocate resources for resolution. The framework gathers statistics from Essbase databases and logs, as well as ETL and automation performance metrics. The information is gathered before and after events like calc script execution, data loads, metadata synchronization, etc.
Usually Hyperion developers have to deal with application optimization either as part of the original implementation, or when optimizing existing applications whose performance has degraded substantially.
The implicit assumption behind this approach is that applications change slowly, and we can afford this investment every once in a while.
Causes of degradation
Typically degradation comes from the following areas:
- Inefficient allocation of hardware resources to applications and calculations
- Architectural deficiencies
- Inefficient outline, bad calculations
- Overall bad design
- High system complexity - usually caused by conflicting or inconsistent requirements
- Inappropriate choice of tools
- Bad integration processes
- Data and metadata loads
- Inefficient integration with other applications
- High reliance on batch processes instead of just-in-time/real-time processes
- And finally, bad business processes around development, change control, data quality control, and so forth.
We should keep in mind that the quality of a system, or how efficient it is, is expressed not only by its current performance, but also by factors like how fast, expensive, and risky it is to scale the system to a certain level, how reliable it is, and what it costs to maintain the system at its current level.
These performance improvement areas are ordered more or less by their level of complexity and the level of effort required to fix them.
It is relatively easy to fine-tune caches or the number of threads for calculations. Optimizing the outline and calculations is more difficult. And changing a business process can involve many parties, and can be really hard.
Resource allocation, outline, and calculation optimizations have usually been the primary focus of Essbase performance tuning. A developer would start by collecting current database statistics such as fragmentation ratios, block density, parent-child ratios of different dimensions, memory and thread allocation, number of blocks, and so forth.
Many of those optimization techniques are less relevant nowadays.
This is because of the new infrastructure Hyperion tools run on, and because Essbase itself has changed a lot. Large memory volumes, SSD drives, and 64-bit operating systems have had a profound effect on how we design and tune applications.
On one hand, there are many more resources available; on the other, the relationship between resources and performance is not linear, often having local or global extremes.
Examples include:
- The number of CPUs for CALCPARALLEL: at some point adding CPUs may result in decreased performance
- Adding more memory to calculator cache and other caches
- Block sizes
Since those optimal resource allocations depend on specific data volumes and outlines, even if the system is optimized, it will move away from its performance optimum the moment you add a member or load additional data into the database.
Here we need to look closely at the assumption we mentioned in the beginning: how slowly do our systems really change? From a business point of view, we are exposed to larger and more diverse data sets over time. The ability to bring that diversity into our analytics creates a competitive advantage, hence the motivation to constantly incorporate new requirements into the systems. In addition, more companies implement master data management and standard ETL tools, which continually change outlines and database structures. With those factors in mind, our manual, infrequent optimization effort may look like shooting at a moving target. And how can we do that when we have 10, 20, or 50 applications to deal with?
Trajectory vs single observation
An alternative is to track performance constantly and make adjustments on a regular basis. And it is much easier if, in addition to the current state, we can observe the system's trajectory. Once we have historical system metrics, we can use statistical inference to understand what causes performance degradation.
You can automatically collect metrics from Essbase databases and logs, and from any object of the ETL job, before and after events like calc script execution, data loads, metadata synchronization, etc.
We used Python for this purpose, but the next implementation will replace Python with Java.
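To make this concrete, here is a minimal Python sketch of such before/after metrics capture. Everything here is illustrative: the SQLite backend, the step_metrics table, and the metric names are placeholders standing in for whatever relational schema the actual framework uses.

```python
import sqlite3
import time
from contextlib import contextmanager

# Illustrative stand-in for the relational store: one generic table,
# agnostic of the number and type of statistics collected per step.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE step_metrics (job TEXT, step TEXT, metric TEXT, value REAL, ts REAL)"
)

@contextmanager
def monitored_step(job, step, collect=None):
    """Record the elapsed time of a process step; `collect` is an optional
    callable returning extra metrics gathered after the step completes
    (e.g. values parsed from the Essbase log or MaxL output)."""
    start = time.time()
    yield
    elapsed = time.time() - start
    rows = [(job, step, "SECLEN", elapsed, start)]
    if collect is not None:
        rows += [(job, step, name, value, start) for name, value in collect().items()]
    conn.executemany("INSERT INTO step_metrics VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

# Usage: wrap a step such as a calc script execution or a data load.
with monitored_step("load_legal", "agg_act", collect=lambda: {"SPARSEREAD": 1.1818e7}):
    pass  # the real step (MaxL call, ETL task, business rule) runs here
```

The point of the generic (metric, value) layout is that new statistics can be added per step without schema changes.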
Metrics Gathering Workflow
This diagram shows the workflow of metrics gathering.
Steps 100-110-111 represent the hierarchy of steps within the process. It can be a complex ETL job that involves metadata and data loads for multiple applications, or as simple as a single business rule executed by a user.
You could monitor steps at the most granular level, or at a higher level of task groups. Usually the lowest-level monitoring makes the most sense, but monitoring at a group level as well can be useful when the objective is to optimize group performance as a whole, and not individual tasks.
In our case metrics gathering methods write into relational database.
In this example we collect metrics from:
- Essbase log for the executed task
- Database settings using maxl commands
- Obviously, the start and end time of each step are recorded as well.
What can we do with this raw metrics data?
- We can run a query to see how a particular task executes over time.
- We can send alert if performance dropped below certain threshold.
- Display data in graphical form
- Run statistical regressions to explain performance degradation by other statistics
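The threshold alert in particular is simple once history is in place. A Python sketch (the 1.5x-of-median rule here is an arbitrary example, not a recommendation):

```python
def degraded(history, latest, factor=1.5):
    """Return True if the latest run time exceeds `factor` times the
    historical median -- a crude, illustrative alerting rule."""
    ordered = sorted(history)
    median = ordered[len(ordered) // 2]
    return latest > factor * median

# Past SECLEN (elapsed seconds) observations for a step, plus a new one:
past = [870.0, 905.2, 881.3, 924.4, 899.7]
print(degraded(past, 1500.0))  # slow run: alert
print(degraded(past, 910.0))   # within normal range
```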
The following is a piece of the Essbase log relevant to the latest agg.csc calc script execution.
Received Command [Calculate] from user [XXXXX@Native Directory] using [agg.csc]
Commit Blocks Interval for the calculation is 
Remote bitmap cache is [Disabled]
Maximum Number of Lock Blocks:  Blocks
Completion Notice Messages: [Disabled]
Calculations On Updated Blocks Only: [Disabled]
Clear Update Status After Full Calculations: [Enabled]
Calculator Cache With Single Bitmap For: [Product]
Aggregate #Missing values: [Enabled]
Calculating in parallel with  threads
Calculation task schedule [24330,5110,250,10]
Parallelizing using  task dimensions.
Empty tasks [21054,3988,175,7]
Total Block Created: [0.0000e+000] Blocks
Sparse Calculations: [5.6873e+006] Writes and [1.1818e+007] Reads
Dense Calculations: [0.0000e+000] Writes and [0.0000e+000] Reads
Sparse Calculations: [1.1830e+009] Cells
Dense Calculations: [0.0000e+000] Cells
Total Calc Elapsed Time for [agg.csc] : [924.415] seconds
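Extracting metrics from such log fragments is straightforward pattern matching. A Python sketch against lines like the ones above (the metric names mirror the ones used later in the regression):

```python
import re

# Patterns for a few of the log lines shown above; extend as needed.
PATTERNS = {
    "SPARSEWRITE": r"Sparse Calculations: \[([\deE.+-]+)\] Writes",
    "SPARSEREAD": r"Sparse Calculations: \[[\deE.+-]+\] Writes and \[([\deE.+-]+)\] Reads",
    "SPARSECELLS": r"Sparse Calculations: \[([\deE.+-]+)\] Cells",
    "TOTALTIME": r"Total Calc Elapsed Time for \[[^\]]*\] : \[([\d.]+)\] seconds",
}

def parse_calc_log(text):
    """Extract numeric metrics from an Essbase application log fragment."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            metrics[name] = float(match.group(1))
    return metrics
```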
And this is the output of the display database MaxL statement.
This is what the view data set looks like. Note that we may need to collect different sets of statistics for different process steps. The table in which they are stored is agnostic of the number and type of statistics; the relevant set of metrics is defined dynamically in the view.
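In Python terms, what the view does is pivot the generic (job, step, timestamp, metric, value) rows into one wide record per step execution. A sketch (the row layout and metric names are illustrative):

```python
from collections import defaultdict

def pivot_metrics(rows):
    """Pivot (job, step, ts, metric, value) rows into one wide record per
    step execution -- the reshaping the relational view does dynamically."""
    wide = defaultdict(dict)
    for job, step, ts, metric, value in rows:
        wide[(job, step, ts)][metric] = value
    return dict(wide)

rows = [
    ("load_legal", "agg_act", 1.0, "SECLEN", 924.4),
    ("load_legal", "agg_act", 1.0, "SPARSEREAD", 1.1818e7),
    ("load_legal", "agg_act", 2.0, "SECLEN", 981.0),
]
wide = pivot_metrics(rows)  # two executions, each with its own metric set
```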
Once we have a sufficient number of data points, we can run statistical regressions. In this example we use the built-in Oracle function REGR_R2 for single-variable regression.
REGR_R2(SECLEN, LEAFBLOCKS) LEAFBLOCKS
,REGR_R2(SECLEN, CALCPARELLEL) CALCPARELLEL
,REGR_R2(SECLEN, TOTALBLOCKSSCREATED) TOTALBLOCKSSCREATED
,REGR_R2(SECLEN, SPARSEWRITE) SPARSEWRITE
,REGR_R2(SECLEN, SPARSEREAD) SPARSEREAD
,REGR_R2(SECLEN, DENSEWRITE) DENSEWRITE
,REGR_R2(SECLEN, DENSEREAD) DENSEREAD
,REGR_R2(SECLEN, SPARSECELLS) SPARSECELLS
,REGR_R2(SECLEN, DENSECELLS) DENSECELLS
,REGR_R2(SECLEN, TOTALTIME) TOTALTIME
,REGR_R2(SECLEN, DECLAREBLOCKSIZE) DECLAREBLOCKSIZE
,REGR_R2(SECLEN, NONLEAFBLOCKS) NONLEAFBLOCKS
,REGR_R2(SECLEN, BLOCKDENSITY) BLOCKDENSITY
,REGR_R2(SECLEN, SPARSEDENSITY) SPARSEDENSITY
,REGR_R2(SECLEN, BLOCKCOMPRESS) BLOCKCOMPRESS
,REGR_R2(SECLEN, CLUSTERINGRATIO) CLUSTERINGRATIO
,REGR_R2(SECLEN, FRAGQUOTIENT) FRAGQUOTIENT
and upper(JOB) LIKE upper('%load_legal%')
and CLUSTERINGRATIO is not null and STEP like '%agg_act%'
R2 indicates how well data points fit a regression line; in other words, what percentage of the dependent variable's variability is explained by the variability of the independent variable. I have to say that by running these regressions we are making several assumptions, but we will omit those in this discussion.
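For intuition, the quantity REGR_R2 returns can be reproduced in plain Python. This sketch keeps the same argument order (dependent variable first) and ignores the edge cases, such as null inputs and zero variance, that the Oracle function handles:

```python
def regr_r2(ys, xs):
    """Coefficient of determination for a simple linear regression of
    y on x -- the quantity Oracle's REGR_R2(y, x) computes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

# On perfectly linear data, R2 is 1.0:
print(regr_r2([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0]))  # 1.0
```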
In this table we can see that four variables have the highest R2:
- Number of sparse cells in the database
- Number of sparse cells read during calculation
- Number of sparse cells written during calculation
- Clustering ratio
It is obvious that calculation time variability will be explained by the number of cells that need to be calculated. Yet that explains only 55% of the variability; the other 45% is explained by other factors. One distinctive factor is the clustering ratio. If the clustering ratio were completely unrelated to the number of cells in the database, we would explain 95% of the variability in execution time by those two factors. But we know that the more cells we calculate and change, the larger the database fragmentation becomes. To establish that relationship, we need to find the R2 of SPARSECELLS on CLUSTERINGRATIO; it happens to be ~30%.
So the conclusion would be that roughly 27% of calc execution variability is explained by database fragmentation.
The next step would be to assess the severity for business users of a particular performance issue (in this case, the increase in calc time) and the probability of the causing factor (database fragmentation), then prioritize issue resolution and allocate resources accordingly.
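One simple way to turn those ingredients into a ranking, purely as an illustration (the multiplicative weighting and all the numbers here are hypothetical, not something the framework prescribes):

```python
def priority(severity, likelihood, r2):
    """Toy scoring: weight business severity by how likely the failure
    mode is and how strongly the suspected cause explains it (R2)."""
    return severity * likelihood * r2

# Hypothetical issues with made-up severity (1-3), likelihood, and R2:
issues = {
    "calc time increase / fragmentation": priority(3, 0.8, 0.27),
    "load time increase / block density": priority(2, 0.5, 0.10),
}
ranked = sorted(issues, key=issues.get, reverse=True)
print(ranked[0])  # the issue to resource first
```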