This discussion has been locked.
You can no longer post new replies to this discussion. If you have a question, you can start a new discussion.

Bug FQDND

LS,

Our FQDND process crashed. We use Consul DNS to create hostname-based NAT policies for our VMs. That works very well, which is great. However, we stumbled upon a bug in FQDND.

MESSAGE   Apr 28 22:46:23 [4157512512]: execute_fqdn_opcode:call do_opcode for FQDN Host : traefik.service.consul
MESSAGE   Apr 28 22:46:25 [4158342912]: execute_fqdn_opcode:call do_opcode for FQDN Host : traefik.service.consul
MESSAGE   Apr 28 22:46:35 [4157512512]: execute_fqdn_opcode:call do_opcode for FQDN Host : traefik.service.consul
MESSAGE   Apr 28 22:46:36 [4158342912]: execute_fqdn_opcode:call do_opcode for FQDN Host : traefik.service.consul
MESSAGE   Apr 28 22:57:06 [4157512512]: execute_fqdn_opcode:call do_opcode for FQDN Host : traefik.service.consul
MESSAGE   Apr 28 22:57:06 [4157512512]: execute_fqdn_opcode:call do_opcode for FQDN Host : firewall-traefik.service.consul
MESSAGE   Apr 28 22:57:07 [4158342912]: execute_fqdn_opcode:call do_opcode for FQDN Host : firewall-traefik.service.consul
free(): invalid next size (normal)

Our FQDND process crashed, and we had to restart it via the advanced shell. When I got there, I captured the lines above from the logs (fqdndebug.log was empty). For your information, the hostname firewall-traefik.service.consul has a TTL of 2 s, and traefik.service.consul has a TTL of 0 s. I think I initially created the wrong hostname (without the firewall- prefix) in the interface and then switched to the correct hostname (with the prefix). On the firewall we want to use a 2 s TTL because we noticed that with a 0 s TTL many DNS requests are executed per second. I have no idea why FQDND crashes with "invalid next size".
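
To quantify how often each hostname is re-resolved, the log excerpt above can be summarised with a short script. This is only a minimal sketch: the regular expression is inferred from the lines quoted in this post, not from any documented fqdnd log format, and `lookup_gaps` and its `year` parameter are hypothetical names (the log carries no year, so one is assumed purely to make the timestamps parseable).

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern inferred from the quoted log lines (an assumption, not a documented format):
# level, timestamp, thread id in brackets, then the FQDN host.
LINE_RE = re.compile(
    r"^MESSAGE\s+(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<tid>\d+)\]: execute_fqdn_opcode:call do_opcode "
    r"for FQDN Host : (?P<host>\S+)$"
)


def lookup_gaps(lines, year=2020):
    """Return {host: [seconds between consecutive lookups]}.

    `year` is an assumed value; the log lines themselves do not contain one.
    """
    times = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skips non-matching lines such as the glibc free() error
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        times[m["host"]].append(ts)
    return {
        host: [int((b - a).total_seconds()) for a, b in zip(ts, ts[1:])]
        for host, ts in times.items()
    }
```

Fed the seven MESSAGE lines above, this reports gaps of 2, 10, 1 and 630 seconds for traefik.service.consul and a single 1-second gap for firewall-traefik.service.consul. The gaps mix lookups from both thread IDs, so they describe the overall request rate rather than a per-thread schedule.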

Hopefully you can figure out the problem and fix it in a newer version.

Regards,
Frederik



  • Hi Frederik,

    1) Please let us know on which platform (HW/SW) and on which firmware version this issue is observed.

    2) If the device is not running the latest version, is there any chance to upgrade it to the latest v18 release and confirm whether the issue persists?

    3) Do you have any specific steps that reproduce this issue? If so, please share them with us so that we can try to reproduce it in our local lab.

    4) Was a core dump generated on the appliance around that time? Is there any segfault or call trace in syslog.log around that time? If so, please share those logs with us as well.

    # ls -lah /var/cores

  • 1) XG135 (SFOS 18.0.0 GA-Build354.HF042920)

    2) n.a.

    3) no, ran into the issue once

    4) at the time of the last line in fqdnd.log, there is no log in syslog.log, but yes, ls -alh /var/cores gives me: -rw-------    1 root     0          18.0M Apr 28 23:03 core.fqdnd

    But core.fqdnd is not human readable. How do I share the content with you (18MB)?

     
