This discussion has been locked.

Error in SEC 4.5 Routing

I have an open support case that has been lingering for quite some time and was wondering if anyone has seen this issue or has any ideas on a workaround.

For the last month or so, I have been having issues where Remote Messaging from the console stops responding. The only fix is to stop and start the service, after which everything flows fine.

We made a few changes which seemed to alleviate the issue somewhat but created a separate issue. We upgraded from 9.7.7 to 10.0, which has some fixes in RMS.

Now the RMS process's memory usage climbs to about 1.9 GB and then the process freezes. The only fix is to kill the process and restart the service. EVERY SINGLE time this happens, we see an error in the router logs that reads "SSL3_GET_RECORD:decryption failed or bad record mac"; 99% of the time it comes from a RELAY/SUM. We have verified that all the certs are correct, and it doesn't seem to be a cert issue.

I really hate spending my last 3 weekends in front of a machine restarting a service. Any ideas out there?

  • Hi,

    The TLS protocol specification (http://www.ietf.org/rfc/rfc2246.txt), in section "7.2. Alert protocol", defines the alerts that can be raised. For example: bad_record_mac = 20, certificate_unknown = 46, etc.
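
To make the alert numbering concrete, here is a minimal Python sketch that decodes the two-byte body of an alert (level, then description). The function name `decode_alert` is made up for illustration, and the table is only a subset of RFC 2246 section 7.2. It also assumes the alert is readable in the capture; alerts sent after the handshake completes are encrypted, so their bytes will not be visible this way.

```python
# Subset of the alert levels and descriptions from RFC 2246, section 7.2.
ALERT_LEVELS = {1: "warning", 2: "fatal"}
ALERT_DESCRIPTIONS = {
    0: "close_notify",
    10: "unexpected_message",
    20: "bad_record_mac",
    21: "decryption_failed",
    40: "handshake_failure",
    42: "bad_certificate",
    46: "certificate_unknown",
    48: "unknown_ca",
    70: "protocol_version",
    80: "internal_error",
}


def decode_alert(payload: bytes) -> str:
    """Decode the two-byte body of an unencrypted TLS alert record."""
    if len(payload) != 2:
        raise ValueError("an alert body is exactly two bytes: level, description")
    level, description = payload[0], payload[1]
    return "{} {} ({})".format(
        ALERT_LEVELS.get(level, "unknown-level"),
        ALERT_DESCRIPTIONS.get(description, "unknown"),
        description,
    )


print(decode_alert(bytes([2, 20])))
```

The first byte is the level and the second is the description, so a fatal bad_record_mac appears on the wire as 02 14 in hex.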

     
    If you can run a network capture, for example with Microsoft Network Monitor, you can apply a filter something like:
     
    TLS and
    Conversation.ProcessName == "RouterNT.exe" and
    destination == "[ServerRouterIP]" and
    Description == "TLS:TLS Rec Layer-1 Encrypted Alert"

     
    Enter the IP of the machine throwing the errors in the logs as the "destination". You should be able to see the error message in the trace when this occurs.
     
    For example, if you launch the OpenSSL client application (Windows version available here: http://www.slproweb.com/products/Win32OpenSSL.html) to connect to the message router on port 8194 using an unsupported version of TLS (-tls1_2):
    openssl.exe s_client -host [ServerRouterIP] -port 8194 -tls1_2
    This would fail, and in the router logs you would get:
     
    01.04.2012 23:41:43 AE34 W SSL connection alert, peer address [IP of Source]
    01.04.2012 23:41:43 AE34 E ACE_SSL (84540|110132) error code: 336151598 - error:1409442E:SSL routines:SSL3_READ_BYTES:tlsv1 alert protocol version

     
    In the packet trace I would expect to see alert 70 (protocol_version), which is what "tlsv1 alert protocol version" in the log line refers to, following the level code 2 for fatal.
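
The numeric error code in the router log line can be unpacked to see which alert it refers to. This is a rough Python sketch that assumes the OpenSSL 1.0.x packing of error codes (8 bits of library, 12 bits of function, 12 bits of reason) and the convention that a reason above 1000 is a received alert description plus 1000. The function name `unpack_openssl_error` is hypothetical, and newer OpenSSL versions pack the code differently.

```python
def unpack_openssl_error(code: int) -> dict:
    """Split an OpenSSL 1.0.x style error code into its packed fields."""
    lib = (code >> 24) & 0xFF      # library number (20 is the SSL library)
    func = (code >> 12) & 0xFFF    # function number inside that library
    reason = code & 0xFFF          # reason number
    # Assumption: reasons above 1000 are TLS alert descriptions offset by 1000;
    # anything else is treated as an error detected locally.
    alert = reason - 1000 if reason > 1000 else None
    return {
        "hex": format(code, "08X"),
        "lib": lib,
        "func": func,
        "reason": reason,
        "alert": alert,
    }


info = unpack_openssl_error(336151598)  # decimal code from the log line above
print(info)
```

For the code in the log excerpt, this gives hex 1409442E, reason 1070, and therefore alert 70.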
     
    It would be good to confirm that a packet trace taken when the problem occurs shows alert 20 (bad_record_mac). Ideally we would get to see that trace, as it could help explain the cause of the problem.
     
    You could also set up some sort of auto-recovery that restarts the message router when the problem symptoms appear, in order to keep things running. The markers for a problem seem to be the entry in the router log and the high memory usage. One option is a script that runs every 5 minutes, looks for the message in the router logs, and restarts the router if it is found. Another is a script that checks the memory usage of the process and restarts it when it is too high, for example:
    ' Check Router Memory
    ' Restarts the message router service when RouterNT.exe exceeds a working-set limit.
    Option Explicit

    Const intMaxWorkingSetSizeInMB                   = 1500
    Const intTimeBetweenStopAndStartOfServiceSeconds = 30
    Const strLogFile                                 = "SophosRouterMemLog.txt" ' relative to the working directory
    Const strServiceName                             = "Sophos Message Router"

    'init
    Dim intCurrentInMB, objProcess
    intCurrentInMB = 0

    Dim objWMIService : Set objWMIService = GetObject("winmgmts:{impersonationLevel=impersonate}!\\.\root\cimv2")
    Dim colObjects : Set colObjects = objWMIService.ExecQuery("Select * From Win32_Process Where Name = 'RouterNT.exe'")

    Dim objFSO : Set objFSO = CreateObject("Scripting.FileSystemObject")
    Dim objFile : Set objFile = objFSO.OpenTextFile(strLogFile, 8, True) ' 8 = append, True = create if missing

    objFile.WriteLine Date & " - " & Time & " Starting check ========================================================"

    For Each objProcess In colObjects
        ' WorkingSetSize is reported in bytes; convert to megabytes
        intCurrentInMB = Round(CDbl(objProcess.WorkingSetSize) / 1024 / 1024, 1)
    Next

    If intCurrentInMB = 0 Then ' process wasn't found, so start the service
        objFile.WriteLine Date & " - " & Time & " " & strServiceName & " service wasn't running, will start."
        RestartService strServiceName
        objFile.WriteLine Date & " - " & Time & " will not check memory usage on this check."
        Cleanup
        WScript.Quit
    End If

    If intCurrentInMB > intMaxWorkingSetSizeInMB Then
        objFile.WriteLine Date & " - " & Time & " " & intCurrentInMB & " MB is greater than " & intMaxWorkingSetSizeInMB & " MB, will restart the service: " & strServiceName & "."
        RestartService strServiceName
    Else
        objFile.WriteLine Date & " - " & Time & " " & intCurrentInMB & " MB is not greater than " & intMaxWorkingSetSizeInMB & " MB, nothing to do"
    End If

    Cleanup

    'Procedures
    Sub Cleanup()
        objFile.WriteLine Date & " - " & Time & " Ending check =========================================================="
        objFile.Close
        Set objFile       = Nothing
        Set objFSO        = Nothing
        Set colObjects    = Nothing
        Set objWMIService = Nothing
    End Sub

    Sub RestartService(strService)
        Dim colListOfServices, objService
        Set colListOfServices = objWMIService.ExecQuery("Select * From Win32_Service Where Name = '" & strService & "'")
        For Each objService In colListOfServices
            ' If the process is frozen, the stop request may not complete within the wait below
            objService.StopService()
            WScript.Sleep intTimeBetweenStopAndStartOfServiceSeconds * 1000
            objService.StartService()
            WScript.Sleep intTimeBetweenStopAndStartOfServiceSeconds * 1000
        Next
        ' Query again: the objects fetched above still hold the state from before the restart
        For Each objService In objWMIService.ExecQuery("Select * From Win32_Service Where Name = '" & strService & "'")
            objFile.WriteLine Date & " - " & Time & " Service " & strService & " is: " & objService.State
        Next
    End Sub
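
The other option mentioned above, looking for the message in the router log, could be sketched roughly as follows in Python. Everything here is an assumption for illustration: the function name `recent_marker_hits`, the five-minute window, and the reading of the log timestamps as day.month.year (the excerpt above does not make the order clear). The sample lines are synthetic, not real router output.

```python
from datetime import datetime, timedelta

# Text quoted in the original question; the check is a plain substring match.
MARKER = "decryption failed or bad record mac"


def recent_marker_hits(lines, now, window_minutes=5):
    """Return timestamps of log lines containing MARKER within the window."""
    cutoff = now - timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        if MARKER not in line:
            continue
        try:
            # Assumed layout: "dd.mm.yyyy HH:MM:SS" in the first 19 characters.
            stamp = datetime.strptime(line[:19], "%d.%m.%Y %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if cutoff <= stamp <= now:
            hits.append(stamp)
    return hits


# Synthetic sample lines, shaped like the excerpt above.
sample = [
    "02.04.2012 09:00:00 AE34 E ... SSL3_GET_RECORD:decryption failed or bad record mac",
    "02.04.2012 10:14:05 AE34 W SSL connection alert, peer address [IP of Source]",
    "02.04.2012 10:14:05 AE34 E ... SSL3_GET_RECORD:decryption failed or bad record mac",
]
hits = recent_marker_hits(sample, now=datetime(2012, 4, 2, 10, 15, 0))
print(len(hits))
```

A scheduled task could feed the lines of the current router log into this check and trigger the restart only when the list is non-empty.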

    The only downside to restarting the router on a busy system is its need to re-read all the messages. This could take some time and memory, so you have to be careful not to get into a vicious cycle.

    Does the message correlate with any specific activity, message type, or scenario?
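
One way to avoid such a cycle is to put a simple guard in front of the restart. The Python sketch below is only an illustration: the name `may_restart` and the limits (at least 30 minutes between restarts, at most 4 per day) are hypothetical values, not recommendations from Sophos, and would need tuning to how long the router takes to re-read its messages.

```python
from datetime import datetime, timedelta


def may_restart(now, previous_restarts, min_gap_minutes=30, max_per_day=4):
    """Decide whether another automatic restart is allowed right now."""
    day_ago = now - timedelta(days=1)
    recent = [t for t in previous_restarts if t > day_ago]
    if len(recent) >= max_per_day:
        return False  # too many restarts in the last 24 hours
    if recent and now - max(recent) < timedelta(minutes=min_gap_minutes):
        return False  # the router may still be re-reading its messages
    return True


now = datetime(2012, 4, 2, 12, 0, 0)
print(may_restart(now, [datetime(2012, 4, 2, 11, 50, 0)]))
print(may_restart(now, [datetime(2012, 4, 2, 9, 0, 0)]))
```

The script would need to record each restart time somewhere persistent, such as its own log file, for this check to work across scheduled runs.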

    Regards,

    Jak
