An interesting issue came up for me recently. What happens when, after a server crash or restart, Essbase refuses to start? The service shows Running, but Essbase isn’t listening and isn’t responding. The start attempt produces no errors and no logs; it simply does not start.
OPMN has been part of the Essbase stack for a few years now. However, because it manages the service so seamlessly, most people don’t even notice it’s there. But it is there, making sure Essbase is running and (if you’re clustering) that only one node runs at a time.
OPMN starts Essbase, stops Essbase and monitors Essbase. This allows Essbase to be installed on several servers, with OPMN reporting back to Foundation so that Foundation can determine whether a failover is necessary. Once you understand this, some otherwise hidden behavior comes into play.
What happens if OPMN crashes (not Essbase)? How does it recover? What happens if it doesn’t and how do you see that or fix it?
In your instance’s bin directory (e.g. D:\Oracle\Middleware\user_projects\epmsystem1\bin), you will find opmnctl (.bat on Windows, .sh on UNIX). This utility shows the actual Essbase status; the Service in the control panel is actually the OPMN service that controls Essbase, not Essbase itself. In a cluster, it will report which node is up, which one is down, etc. It also allows you to stop or start a controlled service.
The OPMN service uses the configuration found in the config directory (e.g. D:\Oracle\Middleware\user_projects\epmsystem1\config\opmn) to direct startup, but ALSO to hold component status information through a restart.
Now with this background we can look under the hood when things go wrong. Remember the Essbase “service” (actually OPMN) is running, but Essbase is down. Open a command prompt and navigate to the bin directory (e.g. D:\Oracle\Middleware\user_projects\epmsystem1\bin), then run ‘opmnctl status’. This will (or at least should) show the actual status of the Essbase process. In my example, I get:
Processes in Instance: EPM_DEMO
---------------------------------+--------------------+---------+---------
ias-component                    | process-type       |     pid | status
---------------------------------+--------------------+---------+---------
Essbase1                         | EssbaseAgent       |     899 | Down
Note that a PID is listed, but the process is down. We try to start the Essbase instance with ‘opmnctl startproc ias-component=Essbase1’ and it rejects the command, saying Essbase is already running. That’s a huge red flag! The status is Down, but when you try to start it, OPMN says it’s running.
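If you want to catch this state from a script rather than by eye, the status table can be parsed. The following is a minimal sketch, assuming the output has the four pipe-separated columns shown above; the function names parse_opmn_status and suspicious_rows are made up for illustration and are not part of any Oracle tooling. It flags any component that is reported Down yet still lists a numeric PID.

```python
def parse_opmn_status(text):
    """Parse the table printed by 'opmnctl status' into a list of dicts.

    Assumes the four pipe-separated columns shown in the example output.
    """
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # title line, separator line, or blank line
        component, process_type, pid, status = parts
        if component == "ias-component":
            continue  # header row
        rows.append({
            "component": component,
            "process_type": process_type,
            "pid": pid,
            "status": status,
        })
    return rows


def suspicious_rows(rows):
    """Return rows that are Down but still carry a numeric PID."""
    return [r for r in rows
            if r["status"].lower() == "down" and r["pid"].isdigit()]


SAMPLE = """\
Processes in Instance: EPM_DEMO
---------------------------------+--------------------+---------+---------
ias-component                    | process-type       |     pid | status
---------------------------------+--------------------+---------+---------
Essbase1                         | EssbaseAgent       |     899 | Down
"""

if __name__ == "__main__":
    for row in suspicious_rows(parse_opmn_status(SAMPLE)):
        print(row["component"], "is Down but still holds PID", row["pid"])
```

In practice you would feed it the captured output of ‘opmnctl status’ instead of the SAMPLE string.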
The next step is to see what’s actually running on PID 899. Open Task Manager, add the PID column and sort (on UNIX, use ps -e | grep <PID>). In my case, I found “wininit.exe”, a core Windows process.
OK, so we have OPMN thinking Essbase is running on a PID that is in use by something else. If the process were a non-critical item, we could simply kill it, freeing PID 899, but in this case it was a core process (or it could be something equally undesirable to kill). Either way, killing the process is not an option in this example.
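The same lookup can be scripted. This is a rough sketch, not a definitive check: it shells out to tasklist on Windows and ps on UNIX, and the helper names are hypothetical. The idea that a genuine Essbase process has a name starting with “essbase” or “esssvr” is my assumption; adjust it to whatever your installation actually shows.

```python
import csv
import io
import os
import subprocess


def parse_tasklist_csv(output):
    """Return the image name from 'tasklist /FO CSV /NH' output, or None."""
    for row in csv.reader(io.StringIO(output)):
        # A data row looks like: "wininit.exe","899","Services","0","4,512 K"
        if len(row) >= 2 and row[1].strip().isdigit():
            return row[0]
    return None  # e.g. the "INFO: No tasks are running..." message


def process_name_for_pid(pid):
    """Ask the operating system which process currently owns the PID."""
    if os.name == "nt":
        cmd = ["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"]
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        return parse_tasklist_csv(out)
    cmd = ["ps", "-p", str(pid), "-o", "comm="]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    return out or None


def looks_like_essbase(name):
    """Assumption: Essbase process names start with 'essbase' or 'esssvr'."""
    return name is not None and name.lower().startswith(("essbase", "esssvr"))


if __name__ == "__main__":
    name = process_name_for_pid(899)  # the PID reported by opmnctl status
    print("PID 899 is owned by:", name)
    print("Looks like Essbase:", looks_like_essbase(name))
```

If the PID belongs to something that does not look like Essbase, OPMN is holding a stale PID.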
So how do we clear this? Restarting OPMN doesn’t clear it (and with this error, you will find stopping and starting OPMN take a very long time – after all, it’s trying to stop Essbase when it’s not running and has to time out).
The way to fix this is to remove OPMN’s stored memory of the process while OPMN is down.
- Stop the Essbase (OPMN) service
- Navigate to the OPMN states folder in the config directory (e.g. D:\Oracle\Middleware\user_projects\epmsystem1\config\OPMN\opmn\states)
- You should find one or more files whose names start with “p” (do NOT touch the .opmndat file). If there is more than one, open each of them and look for the one containing the PID that OPMN reports for Essbase. If there is only one, you can skip that step.
- Delete the “p” file containing the PID that is a problem
- Start the Essbase (OPMN) service
- Verify Essbase starts correctly again
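Finding the right “p” file can also be scripted. Here is a minimal sketch, assuming the PID appears inside the file as plain digits (the exact file format is not something I can vouch for) and using a hypothetical helper name. It only lists candidates and deliberately deletes nothing, so you can review the result before removing the file by hand with OPMN stopped.

```python
import re
from pathlib import Path


def find_state_files_with_pid(states_dir, pid):
    """List 'p*' files in the OPMN states folder that mention the given PID.

    Assumes the PID is stored as ASCII digits somewhere in the file.
    Files not starting with 'p' (including .opmndat) are never considered.
    """
    # Match the PID only as a whole number, so 899 does not match 1899.
    pattern = re.compile(rb"(?<!\d)" + str(pid).encode("ascii") + rb"(?!\d)")
    matches = []
    for path in sorted(Path(states_dir).iterdir()):
        if not path.is_file() or not path.name.startswith("p"):
            continue
        if pattern.search(path.read_bytes()):
            matches.append(path)
    return matches


if __name__ == "__main__":
    # Example path from this post; change it to match your instance.
    states = r"D:\Oracle\Middleware\user_projects\epmsystem1\config\OPMN\opmn\states"
    for candidate in find_state_files_with_pid(states, 899):
        print("Candidate for deletion (OPMN must be stopped):", candidate)
```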
There are quite a few safeguards against the above scenario. However, OPMN stopping abruptly, combined with bad luck on PID assignment, can leave you high and dry. This gets you moving again.
