Thursday, July 9, 2015

Essbase Sessions Monitor: ODI Process with Jython





Abstract
Workflow
ODI process
Loop Control
Display Sessions
Parse Log
Termination
Notify The User
Hand Brake, Purge, Repeat


Abstract



The purpose of this post is to demonstrate the process that monitors essbase sessions and takes appropriate actions in different circumstances.


If you use native essbase and either SmartView add-hoc reporting, or excel add-in, you probably have to deal with situations when user starts a huge retrieve, which never ends. That can happen as a result of an innocent mistake, or because of the “rogue” user retrieving monstrous templates, despite all efforts to persuade him not to do so.


Sometimes users may need to use different data sources for different retrieves. Some retrieves should happen in central ASO reporting cube. If ran against a Planning BSO cube - it would take much longer. I've had such a situation recently when large retrieves with multiple attributes were performing better from partitioned ASO cube, than from planning BSO. Both cubes could return the same data set, but ASO was much faster. You can train the users, but still, it is easy to make a mistake and use the wrong data source.


Whatever the reason - you end up with those “spreadsheet operations” sessions that keep running until you kill them.


There are many ways handling this situation, depending on the frequency of those events and how badly they affect the system. Here i show an automated ODI monitor, that will
  • Continuously check running sessions
  • If session request time is longer than N seconds for a particular database, it will be terminated automatically.
  • Send notification to the user, whose session was killed, saying something, “Hey, you made a mistake trying to run your huge report from this database. Please run it from a different cube.”


Workflow





In terms of the workflow we would like our monitor to do the following:


1. Check if it still needs to run. It it does, continue to the next step. If not, stop the monitor. We will pass a JOB_DURATION parameter that will tell the monitor for how long it should run. Why do we need this parameter, if the monitor is supposed to run continuously? We’ll get the answer when we go through the steps of the ODI job.


2. Connect to essbase server and execute maxl statement that displays sessions on a particular essbase server. It should look something like:


set column_width 50;
display session all;


Width is an important parameter here, since it will be used later during parsing.


3. Now we have maxl output, and we need to parse it.
It will look similar to this one:


So we would need to get session IDs that meet the following criteria:
  • Application = our application of interest (we pass it as a parameter to the job)
  • Request type = SpreadsheetOperation
  • Request Time > threshold we pass to the job.


4. If some sessions meet the criteria of time threshold, request type and application, we need to terminate them, and send email to the user.


5. If no session meets our criteria, we need to wait for N sec, and repeat the check.


ODI process



Below is the ODI process that implements the workflow we just defined.


Loop Control

The first step of the workflow corresponds to a set of refreshing variables and evaluating LOOP_CONTRALL variable in ODI diagram. We compare current time to job start time plus a parameter JOB_DURATION .
JOB_DURATION specifies how long current instance of the monitor will be running. In the end the monitor needs to run continuously at all times. But that would create a large number of sessions for DisplaySession scenario (which runs MAXL statement), and steps inside current main session. And that could cause our operator repository to run out of space.


To avoid this situation there are purge log steps that delete previous successful monitor sessions, and individual DisplaySessions sessions. Since we cannot delete steps from current session, we need to stop the monitor, and delete current session from the next session. For example JOB_DURATION=7200 means that the job will stop in 2 hours, and once it is restarted, the prior sessions will be cleared.


That means we need to schedule our monitor to start every 2 hours (if JOB_DURATION=7200 sec). If you use Windows scheduler you need to check “Repeat task every 2 hours” .




Display Sessions

DisplaySessions scenario is just a generic scenario that runs MAXL script on a particular essbase server. We pass maxl script as a parameter, and ODI scenario runs the script.
One caveat is that you may want to run it in asynchronous mode, and check for the output in the next step. Why so? Well, what happens if essbase went down? Clearly the error would be raised, and the next step would be FINISHFAILED, which would send and email, text, or whatever needs to be done in such gruesome event.
But what if Essbase didn’t really crash, but in a “hung” state? ODI just waits for MAXL to return any feedback. And it can happen after several minutes, or hours in the form of phone call in the middle of the night.  Regardless of the reasons why it can happen, you may want to capture those events and take some action. That’s why if output is not generated by the DisplaySession in a given number of seconds, failure notification is sent. This is controlled by the
ESSBASE_MONITOR_TIMEOUT=120000 inside WaitForLog step, which is 2 min in our case.


Parse Log



So you got the output from your DISPLAYSESSION scenario. Now we are getting to the fun part - when we can parse the log, and find sessions that need to be killed. Below is a code that does that.


import os
import re
file_name_option = '<%=odiRef.getOption("FileName")%>'

file_name='%s' % file_name_option
out_file = '<%=odiRef.getOption("OutputFile")%>'
output_write=open(out_file,'a')
usersfile = '<%=odiRef.getOption("UsersFile")%>'
usersfile_write=open(usersfile,'w')
userlist=''
str_len=<%=odiRef.getOption("ColWidth")%>
if os.path.exists(file_name):
 for line in open(file_name, 'r'):
   if  len(line)>str_len*6:
     split_list = [line[i:i+str_len] for i in range(0, len(line), str_len)]
     if  '<%=odiRef.getOption("StringToSearch")%>'.upper() == split_list[6].strip().upper() and \
       int(float('<%=odiRef.getOption("TimeThreshold")%>')) < int(float(split_list[7].upper())) and \
       ('<%=odiRef.getOption("AppName")%>'.upper() == 'ALL' or '<%=odiRef.getOption("AppName")%>'.upper() == split_list[3].strip().upper()):
         output_write.write('alter system logout session '+split_list[1].strip()+' force;'+'\n')
         userlist=userlist+split_list[0].strip()+'@salesforce.com,'
 usersfile_write.write(userlist)
 usersfile_write.close()
 output_write.close()
else:
 output_write.close()
 usersfile_write.close()


And below is the same code, but with comments you can read for detailed explanation of each line. (FYI, if you decide to copy the code - make sure you have the right indentation).


import os
import re
###------------------------------------------------------------------------###
### FileName option is passed as a parameter from ODI scenario.------------###
### This is the log file we want to parse. --------------------------------###
###------------------------------------------------------------------------###
file_name_option = '<%=odiRef.getOption("FileName")%>'
file_name='%s' % file_name_option


###------------------------------------------------------------------------###
### out_file is the file into which we write results of the parsing.-------###
### E.g. maxl statements to kill sessions. --------------------------------###
### Filename is also passed as an option from ODI scenario.----------------###
###------------------------------------------------------------------------###


out_file = '<%=odiRef.getOption("OutputFile")%>'
output_write=open(out_file,'a')


###------------------------------------------------------------------------###
### Similarly to out_file, usersfile defines the file with users that -----###
### will receive email notifications. Don't forget to replace yourdomain---###
### with actual domain name.-----------------------------------------------###
### It is also passed as an option.----------------------------------------###
###------------------------------------------------------------------------###


usersfile = '<%=odiRef.getOption("UsersFile")%>'
usersfile_write=open(usersfile,'w')
userlist=''


###------------------------------------------------------------------------###
### ColWidth is another option that defines the width of the column in-----###
### MAXL output. That's what you put in -----------------------------------###
### set column_width XX;---------------------------------------------------###
### It is also passed as an option.----------------------------------------###
###------------------------------------------------------------------------###


str_len=<%=odiRef.getOption("ColWidth")%>


if os.path.exists(file_name):


###------------------------------------------------------------------------###
### Nothing fancy, just loop through all lines in the MAXL output.---------###
### E.g. through all the sessions.-----------------------------------------###
###------------------------------------------------------------------------###


 for line in open(file_name, 'r'):


###------------------------------------------------------------------------###
### The following condition distinguishes line in a log with sessions------###
### from other output, like -----------------------------------------------###
### essmsh timestamp: Tue Mar 31 15:33:13 2015 ----------------------------###
###------------------------------------------------------------------------###


   if  len(line)>str_len*6:


###------------------------------------------------------------------------###
### Split the line into list of fixed width elements. Width is defined-----###
### by str_len a.k.a. getOption("ColWidth"), a.k.a. XX from MAXL-----------###
### set column_width XX; --------------------------------------------------###
###------------------------------------------------------------------------###


     split_list = [line[i:i+str_len] for i in range(0, len(line), str_len)]


###------------------------------------------------------------------------###
### Check if current session is the one you are looking for.---------------###
### "StringToSearch" is request type like SpreadsheetOperation.------------###
### "TimeThreshold" is how long this operation is allowed to run.----------###
### As far as application you can either pass ALL, in which case-----------###
### sessions of particular type of all applications will be terminated ----###
### once they cross the TimeThreshold, or you can pass a specific app name-###
###------------------------------------------------------------------------###


     if  '<%=odiRef.getOption("StringToSearch")%>'.upper() \
          == split_list[6].strip().upper() and \
          int(float('<%=odiRef.getOption("TimeThreshold")%>')) <
          int(float(split_list[7].upper())) and \
          ('<%=odiRef.getOption("AppName")%>'.upper() == 'ALL' or \
          '<%=odiRef.getOption("AppName")%>'.upper() == \
          split_list[3].strip().upper()):


###------------------------------------------------------------------------###
### If the condition is met, write maxl statement to terminate current ----###
### session. Also write the user who's session is going to be killed.------###
###------------------------------------------------------------------------###
         output_write.write('alter system logout session '+split_list[1].strip()+' force;'+'\n')
         userlist=userlist+split_list[0].strip()+'@yourdomain.com,'
 usersfile_write.write(userlist)
 usersfile_write.close()
 output_write.close()
 # tst_write.close()
else:
 output_write.close()
 usersfile_write.close()


Termination

You saw (if you decided not to skip the previous section) that this Jython script generates a MAXL file, say SessionsMonitorV2.SESSIONSTOKILL.20150226034010.MXL, which has the following content:


set timestamp on;
spool stdout on to 'D:\xxx\xxx\SessionsMonitorV2.SESSIONSTOKILL.20150226034010.MXL.log.txt';
spool stderr on to 'D:\xxx\xxx\SessionsMonitorV2.SESSIONSTOKILL.20150226034010.MXL.err.txt';


alter system logout session 1962934007 force;
alter system logout session 1962354007 force;
alter system logout session 1962354089 force;


Obviously the last 3 rows is the whole purpose of this process. Sessions which match our criteria for application, request type (e.g. SpreadsheetOperation), and time threshold (say 300 sec), will be terminated.


The execution of this new MAXL script happens in the next 2 steps:
If the new MAXL file is found (with corresponding timestamp), it will be executed by the same MAXL-running scenario, which accepts a different MAXL file name as a parameter (SessionsMonitorV2.SESSIONSTOKILL.20150226034010.MXL).


Notify The User



We saw that the parser also created a file (<%=odiRef.getOption("UsersFile")%>) with the list of users, sessions of which are terminated. All we need to do now is:


  • Read from that file and assign value to variable
  • Read some text from another file that would be a body of an email. Apparently sometimes you want the text of an email to be in external text file, so that you can easily modify it, without changing ODI job. (Being able to change a message to the users on the fly is a good thing, when you realize the users have no idea what you want from them).
  • Send an email




Hand Brake, Purge, Repeat



In case you want to stop the monitor in the middle of the running cycle (defined by JOB_DURATION, e.g. 2 hours), you can put some text in some file, and MonitorStop procedure will check the size of that file. If the size is positive, the job will stop. This is probably not the most elegant way to stop the monitor, but it is easy and provides control from the file system.


As we mentioned before, in the next steps we are purging session logs of previously completed monitor cycles.

And then we wait… for N sec (value of the #POLLING_INTERVAL parameter, which is also provided to ODI job).

2 comments:

  1. Why did you decide to go this route instead of using the QRYGOVEXECTIME Essbase.cfg setting?

    ReplyDelete
    Replies
    1. We needed more control in terms of notifying the users and IT team. So that folks don't freak out when their retrieves get terminated. They will know right away that they need to connect to a different database for example.
      Also this allows you to define different behavior for different users. Some may actually have a good reason to run long report, while others don't.

      Delete