Friday, January 4, 2013

OSB cluster synching problem

We ran into an OSB cluster synching problem in our production environment. The key words are "production environment": anywhere else it would not be an issue, because you could simply restart both nodes and they would sync up without a problem. We needed a solution that did not require shutting down the cluster.

The problem happened after we updated an OSB project that contains a collection of proxies, business services, etc. To be exact, we used "ant", which in turn uses WLST to run the deployment, but I don't think that's relevant.

We have two OSB nodes. After the deployment we started to see exceptions in the logs of node1 (osb_server1).

<Error> <ConfigFwk> <BEA-000000> <notifyAfterEnd() failed for listener ResourceListenerNotifier
java.lang.NoClassDefFoundError: com/bea/alsb/coherence/impl/RemoveProcessor
at com.bea.alsb.coherence.impl.CoherenceCache.removeAll(CoherenceCache.java:132)
at com.bea.wli.sb.service.resultcache.ResultCache.removeAll(ResultCache.java:254)

and 
<Error> <OSB Kernel> <BEA-382016> <Failed to instantiate router for service ProxyService CIS/proxy/GetAutopayWithdrawOptionByAccountID: com.bea.wli.sb.management.BrokerManagementException: The configurations for the proxy service is in flux.
com.bea.wli.sb.management.BrokerManagementException: The configurations for the proxy service is in flux.
at com.bea.wli.sb.pipeline.RouterContext.getInstance(RouterContext.java:178)
at com.bea.wli.sb.pipeline.RouterManager.processMessage
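
A quick way to check whether a node is showing these two symptoms after a deployment is to count the matching lines in its server log. What follows is only a minimal Python sketch: the two patterns are copied from the excerpts above, while the function names and the log path you pass in are assumptions of mine, not part of any Oracle tool.

```python
import re
from collections import Counter

# Patterns copied from the two log excerpts above; the dictionary keys are
# hypothetical labels chosen for this sketch.
SIGNATURES = {
    "coherence_class_missing": re.compile(r"NoClassDefFoundError: com/bea/alsb/coherence/"),
    "router_in_flux": re.compile(r"<BEA-382016>"),
}


def count_sync_errors(lines):
    """Count how many lines match each known out-of-synch signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log(path):
    """Scan one server log file (the path is supplied by the caller)."""
    with open(path, errors="replace") as handle:
        return count_sync_errors(handle)
```

A non-zero count on one node and zero on the other matches the situation described here, where only node1 was affected.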


We had to shut down node1; otherwise it caused nasty service interruptions. We logged a Sev-1 SR with Oracle. The response we got from support bordered on irresponsible: they merely told us to shut down both nodes and start them up again. Hello..., this is a production environment! You can't simply give the lame "reboot" response like Windows desktop tech support!

Additionally, Oracle support pointed out that we were deploying updates while the OSB proxies were running. I am not sure I get it. Does it mean that we need to shut down the server to deploy any updates???

Anyway, I noticed that the exceptions indicated some kind of issue with "coherence", so we decided to leave node2 running for over 24 hours, hoping the "coherence" cache might get cleared. Then we tried to start node1 again and, unfortunately, got the same error.

After poking my head all over the place, I had a hunch that node1 (osb_server1) might be keeping its cached values somewhere under <domain home>/servers/osb_server1.

I renamed the directory "osb_server1" to "osb_server1.bak" and re-created an empty "osb_server1" directory. I restarted osb_server1 (from the WebLogic console) and, voila, it started up beautifully. Node1 synched up with node2 without incident!
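
The rename-and-recreate step can be sketched in a few lines of Python. This is only an illustration of what I did by hand: the function name and its arguments are hypothetical, and it assumes the managed server is already stopped (node1 was down at this point) and that the server directory lives under <domain home>/servers.

```python
import os


def reset_server_dir(domain_home, server_name):
    """Rename <domain_home>/servers/<server_name> to <server_name>.bak and
    re-create it as an empty directory.

    Assumption: the managed server is stopped before this is called.
    """
    server_dir = os.path.join(domain_home, "servers", server_name)
    backup_dir = server_dir + ".bak"
    if not os.path.isdir(server_dir):
        raise FileNotFoundError(server_dir)
    if os.path.exists(backup_dir):
        # Refuse to overwrite an earlier backup.
        raise FileExistsError(backup_dir)
    os.rename(server_dir, backup_dir)
    os.mkdir(server_dir)
    return backup_dir
```

Keeping the old directory as a ".bak" copy, rather than deleting it, means it can be renamed back if the restart does not go well.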

Lessons learned:

1. In a production environment, try to deploy OSB updates by resource, not as a whole project, because a whole-project update has a bigger footprint that may affect many proxies and other resources. If you only changed a couple of resources, update only those, so you minimize the chance of a cluster synching problem.

2. Do not take tech support's advice blindly. If we had, we would have needed to shut down a busy client portal site. Shutting down a production environment should be the last resort, not the first and only advice given to clients.
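
For lesson 1, it helps to know exactly which resources changed before deciding what to deploy. Here is a rough Python sketch that compares two directory trees by content hash. It assumes both versions of the project are available as plain directory trees (for example, two checkouts of the source); the function names are hypothetical, and it does not touch OSB at all.

```python
import hashlib
import os


def _digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    result = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(folder, name)
            with open(full, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            result[os.path.relpath(full, root)] = digest
    return result


def changed_resources(deployed_root, candidate_root):
    """Return sorted relative paths that are new or differ in candidate_root."""
    old = _digests(deployed_root)
    new = _digests(candidate_root)
    return sorted(path for path, digest in new.items() if old.get(path) != digest)
```

Note that this only reports new and modified files; resources that were deleted from the candidate tree would need a separate check.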

additional note (4/29/13):

Recently, I noticed another example of an OSB server being out of synch. We had added a "report" activity, but it caused an issue with the OSB report DB. The symptom was that OSB kept trying to commit the DB transaction after it failed, creating an infinite loop that filled up the log file quickly and brought the server down. We quickly removed the report activity from the proxy, activated the change, and bounced both OSB nodes. But it didn't help.

One particular behavior was that the error only happened on one OSB node (node2). My theory is that when the original proxy call was made, the request was dispatched to node2 and failed during the DB transaction, so it was stuck only on node2 (still attempting to commit the DB transaction).

I applied the same trick by blowing away the osb_server2 directory. No luck :(

Eventually, I found out that this deleted report activity (with one unfinished DB transaction) was somehow stuck in the JMS persistent store (go figure!). In our case, the persistent store is .../osb_cluster/jms/FileStore_auto_2. We blew this file away (gee, not sure we should just do that) and restarted node2, and that took care of the problem.
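
A more cautious variant of "blowing it away" is to move the store aside under a timestamped name, so it can be restored if something else turns out to depend on it. The Python sketch below is a hypothetical helper, not what we actually ran; it assumes the server that owns the store is stopped, and the path passed in is whatever your own persistent store location is.

```python
import os
import shutil
import time


def move_aside(path):
    """Move a file or directory to '<path>.<timestamp>.bak' and return the new path.

    Assumption: nothing has the store open, i.e. the owning server is stopped.
    """
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    target = "%s.%s.bak" % (path, time.strftime("%Y%m%d-%H%M%S"))
    if os.path.exists(target):
        # Refuse to clobber a backup made within the same second.
        raise FileExistsError(target)
    shutil.move(path, target)
    return target
```

Keep in mind that any other pending messages in the store are set aside along with the stuck one, which is the same trade-off as deleting the file.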

3 comments:

  1. Hi Yuan,

    We are also facing a production issue with the OSB Email adapter, which I feel might be similar to the one you described.


    I would like to summarise the scenario. We have used an OSB email adaptor to poll emails with attachments and do subsequent processing.
    In the Pre prod and Prod environments we have a clustered environment with 2 nodes (OSB1 & OSB2), each on a separate physical machine.
    As per the config of the OSB email adaptor, in the Email Transport configuration--> managed server->
    we have specified OSB1 to poll for emails.

    Now in production we are facing the error discussed below. What we understand is that when the request is handled by the OSB2 node,
    it looks for the attachment file in the folder "D:\MI_EMAILS\Emails\Attachments\2013-08-26-16-54-05-746" on the machine where OSB2 is running, is unable to find it, and gives the following error stack trace:



    Error Details:
    [OSB Tracing] Internal Error while tracing message
    D:\MI_EMAILS\Emails\Attachments\2013-08-26-16-54-05-746 (The system cannot find the file specified) SUBSYSTEM = OSB Kernel USERID = SEVERITY = Error THREAD = [ACTIVE] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)' MSGID = BEA-398204 MACHINE = soaprd-app02 TXID = BEA1-09F8D776D51FE5E0D3D7 CONTEXTID = 4f30ff80bc2046d4:f2d474:140baebdca4:-8000-0000000000000003 TIMESTAMP = 1377532448920
    WatchAlarmType: AutomaticReset
    WatchAlarmResetPeriod: 60000
    >
    ####<26-Aug-2013 16:54:09 o'clock BST> <[ACTIVE] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <4f30ff80bc2046d4:f2d474:140baebdca4:-8000-0000000000000003> <1377532449013> <
    [OSB Tracing] Exiting consuming-emails/proxy/EmailConsumer>
    ####<26-Aug-2013 16:54:09 o'clock BST> <[ACTIVE] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <4f30ff80bc2046d4:f2d474:140baebdca4:-8000-0000000000000003> <1377532449013> <Failed to process request message for service ProxyService consuming-emails/proxy/EmailConsumer: com.bea.wli.sb.context.BindingLayerException: Failure while unmarshalling message: Failed to parse MIME multipart
    com.bea.wli.sb.context.BindingLayerException: Failure while unmarshalling message: Failed to parse MIME multipart

    We have also confirmed that the file being looked for is available in the OSB1 server's attachments folder, which is on a different box.

    So as we see it, the requests handled by OSB1 are getting processed but those handled by OSB2 are not.
    Exactly the same thing is happening in the Pre prod environment.
    Can you shed any light on this? This is a production issue and needs to be solved at the earliest.

    With regards,
    Rudra

  2. Can you share what you have tried so far to address the issue?

    I have never used the email adapter with OSB, so my advice is based only on other adapters.

    Sometimes when people say OSB XYZ adapter, it can loosely refer to one of two things: 1. an XYZ JCA adapter: created in JDeveloper, then imported into OSB. 2. a straightforward OSB XYZ transport protocol.

    For example, you can import a JCA file adapter, or you can use the OSB file transport protocol. Which one is it in your case?

    A few more things that I can think of:

    1. Make sure that "D:\MI_EMAILS\Emails\Attachments" is a shared directory with proper permissions. I assume you already confirmed you can access the directory from OSB2. BTW, are all servers running on Windows?

    2. Look into whether there is an HA email adapter. I know that's the case for the file adapter: you HAVE to use the HA file adapter in order for it to work in a clustered environment. For example, the JCA file adapter has two versions: FileAdapter and HAFileAdapter. See http://yuanmengblog.blogspot.com/2012/02/jca-file-adapter-process-same-file.html

    3. In a clustered environment (HA), a JCA adapter may require DB configuration (e.g. HAFileAdapter or DB adapter). You need to make sure your SOADBAdapter (can't remember the exact JNDI name) is configured properly (targeted to your OSB managed servers).

    4. As mentioned in my post, if it's purely an out-of-synch problem, you may try to clean out those folders.

    5. finally, check SOA docs, e.g. http://docs.oracle.com/cd/E17904_01/doc.1111/e15866/http_poller.htm#autoId0
    find this doc or newer version of it:
    Oracle® Service Bus
    HTTP and Poller Transports User Guide
    10g Release 3 (10.3)
    October 2008

    good luck.

  3. Hi Shiva
    Did you resolve this Email Adapter issue? We are facing a similar problem too.
