Useful Commands
1. Remove a node's logs
rm -rf /mnt/logs
2. Continuously stream the p2_ha log
tail -f /mnt/logs/p2ha.log
3. Dump troubleshooting statistics for bug fixing
dumpStat.sh
4. Get neighbors from the neighbor table
cat /sys/kernel/p2_nbr/nbr_tbl_dump
5. Restart syslogd when logs stop being written
/etc/init.d/syslogd restart
6. View p2 routing protocol information
p2rp_cli -u -D
7. Capture packets to and from the eth1 interface, filtered to port 16001 only (note: `-p` disables promiscuous mode and is not a port filter; use the `port` filter expression)
tcpdump -i eth1 port 16001
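When hunting for a specific event, command 2's log is usually filtered rather than streamed in full. A minimal sketch; the helper name and usage are assumptions, not an existing tool on the node:

```shell
#!/bin/sh
# Hypothetical helper: show the last N matching lines of a log,
# e.g. `last_matches /mnt/logs/p2ha.log ERROR 20`.
last_matches() {
    # $1 = log file, $2 = grep pattern, $3 = number of lines to keep
    grep -- "$2" "$1" | tail -n "$3"
}
```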
Useful Metadata
1. Log path of the launcher on Windows
C:\Program Files\Anywhere Node Manager Launcher\user_data\logdata
Troubleshooting Notes
Scenario 1: In A-NM, a node's information does not show up in the mesh topology after performing node recovery (given that the lost node is physically powered on)
Procedure
1. Check the console log's HTTP response output and compare the result with the UI behavior
2. Check whether the node is in the managed device list (note: a node recovery failure may leave the lost node unmanaged)
3. If step 1 fails to retrieve the information, get the reason for the failure, map the failure object to the error object documented in "doc_ui_p2_controller", and try to resolve the issue using the detailed error message returned from the controller
4. Try to isolate the problem from the UI using "controller_restful_tester": run the get-nodeinfo command against the remote node from the controller. If it succeeds, the controller <--> node connection should be good
5. Try to further isolate the issue from the controller by retrieving node information directly with the protobuf tool. Remember to run the meshTopology command first to obtain the access port of the remote node
& '<pythonEXE_dir>' .\cli\ha.py --pw mgnt_pwd mgnt_ip -p access_port 1 > mesh_topology.txt
The "1" refers to the request type, no need to place the whole action name
6. If step 5 still fails, troubleshoot on the host node with tcpdump (a data-network packet analyzer) to at least confirm that packets flow normally between controller <---> host_node and host_node <--> remote_node on their respective interfaces and ports
e.g.:
Let's say we have a controller (10.240.2.34), a host node (10.240.222.224) and a remote node.
First we run "tcpdump -i eth1 port 16001" on host node to ensure packet is running to remote node through host node (as 16001 is the NAT port of the remote node), if there are packets routing through port 16001, it implies there are some traffic from 10.240.222.224 to the remote node through the ethernet port connecting between controller
Then we run "tcpdump -i mesh0 port 12381" on remote node, making sure packet is coming to and from the remote node
If the result is positive, we can be sure the layer 3 and layer 4 connectivity is working as expected.
7. Next, use command 4 (the neighbor table dump) to confirm the neighbor link is discovered on both nodes
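The step-6 captures can also be checked mechanically instead of eyeballed. A minimal sketch, assuming you redirect a line-buffered capture to a text file first (e.g. `tcpdump -l -i eth1 port 16001 > cap.txt`); the helper name is an assumption:

```shell
#!/bin/sh
# Sketch: count packets in a saved tcpdump text dump whose address field
# mentions a given port (tcpdump prints addresses as host.port).
count_port_pkts() {
    # $1 = capture text file, $2 = port number
    grep -c "\.$2[ :]" "$1"
}
```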
Scenario 2: Capturing logs and sending the output to other parties for bug fixing
1. Before capturing any logs, make sure the timestamps of 1) the nodes, 2) the controller and 3) the UI are in sync.
2. To sync node time, navigate to the cluster configuration, set the timezone to Hong Kong, and reboot the nodes in the cluster to apply the change
3. Navigate to system settings and configure the NTP server (IP only, same L2 environment). You should therefore first find a PC (preferably Linux-based), install ntpd, and configure it against the HK time server before proceeding to 4)
4. Log in to the cluster and run the "date" command on all nodes to make sure their time is in sync with the time server configured in 3)
5. Since the controller and the UI share the same clock (the Windows system time), it is enough to make sure they are in sync with the Windows time
6. Navigate to /mnt/logs on the target nodes and remove all logs
7. Remove all controller logs in "C:\Program Files\Anywhere Node Manager Launcher\user_data\logdata"
8.
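The clock check in step 4 can be scripted. A sketch under the assumption that you collect `date +%s` output from the time server and from each node (e.g. over SSH) and compare the values; the function name and tolerance are assumptions:

```shell
#!/bin/sh
# Sketch: report whether two epoch timestamps are within a tolerance (seconds).
# In practice $1 and $2 would come from `date +%s` on the time server and a node.
drift_ok() {
    a=$1; b=$2; tol=$3
    d=$((a - b))
    [ "$d" -lt 0 ] && d=$((0 - d))
    [ "$d" -le "$tol" ]
}
```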
Miscellaneous Materials
1. SNAT, DNAT and masquerade
https://www.huaweicloud.com/articles/90a13a644803d0efcd024df76fb130ae.html
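The distinction in brief, as illustrative iptables rules (a config fragment; addresses and interface names are examples, not taken from the article above):

```shell
# SNAT: rewrite the source address of outgoing packets to a fixed address
iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j SNAT --to-source 203.0.113.1
# DNAT: rewrite the destination of incoming packets to an internal host
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to-destination 192.168.1.10:80
# MASQUERADE: like SNAT, but uses whatever address eth0 currently has
# (useful when the uplink address is dynamic, e.g. DHCP/PPPoE)
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```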