Hi,
I have two ConnectX-3 FDR InfiniBand + 40GigE cards in separate machines, connected directly to each other (Machine A: port 1, Machine B: port 2). Both machines run Ubuntu 12.04. I am trying to run an MPI application over the InfiniBand cable instead of a normal TCP/IP (Ethernet) cable. I have managed to install the driver and set up IP addresses for the ports by following this article, but I cannot ping between the two machines using the IP addresses I set. The ping command returns the following:
root@gpu0:/# ping 172.31.128.53
PING 172.31.128.53 (172.31.128.53) 56(84) bytes of data.
From 172.31.128.51 icmp_seq=1 Destination Host Unreachable
From 172.31.128.51 icmp_seq=2 Destination Host Unreachable
From 172.31.128.51 icmp_seq=3 Destination Host Unreachable
With ibping ( ibping -G 0xf4521403007f6082 ), the command just hangs. How can I debug this problem? Here is some information about the setup of both machines.
ibstat
root@gpu0:/# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.30.3110
Hardware version: 1
Node GUID: 0xf4521403007f6060
System image GUID: 0xf4521403007f6063
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID: 0xf4521403007f6061
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 1
LMC: 0
SM lid: 1
Capability mask: 0x0251486a
Port GUID: 0xf4521403007f6062
Link layer: InfiniBand
root@gpu1:/# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.30.3110
Hardware version: 1
Node GUID: 0xf4521403007f6080
System image GUID: 0xf4521403007f6083
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02514868
Port GUID: 0xf4521403007f6081
Link layer: InfiniBand
Port 2:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 2
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0xf4521403007f6082
Link layer: InfiniBand
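To compare the two ibstat dumps at a glance, here is a minimal sketch in Python that reduces ibstat text to one record per port. It assumes the output layout shown above (a `CA '<name>'` line, then `Port N:` headers followed by `State:` and `Physical state:` lines). The function name `summarize_ibstat` is only illustrative and not part of any InfiniBand tooling.

```python
import re


def summarize_ibstat(text):
    """Return one dict per port found in ibstat output.

    Assumes the layout shown above: a "CA '<name>'" line, then
    "Port N:" headers, each followed by "State:" and
    "Physical state:" lines. Leading whitespace is ignored.
    """
    ports = []
    ca = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"^CA '(.+)'$", line)
        if m:
            ca = m.group(1)
            continue
        # "Port GUID: ..." must not be mistaken for a port header,
        # so require a bare number followed by a colon.
        m = re.match(r"^Port (\d+):$", line)
        if m:
            current = {"ca": ca, "port": int(m.group(1)),
                       "state": None, "phys": None}
            ports.append(current)
            continue
        if current is None:
            continue
        if line.startswith("State:"):
            current["state"] = line.split(":", 1)[1].strip()
        elif line.startswith("Physical state:"):
            current["phys"] = line.split(":", 1)[1].strip()
    return ports
```

Feeding it the gpu0 dump above should report port 1 as Active/LinkUp and port 2 as Down/Disabled; for gpu1 it is the other way round, which matches the cabling described (Machine A port 1 to Machine B port 2).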
ifconfig
root@gpu0:/# ifconfig
ib0 Link encap:UNSPEC HWaddr A0-00-01-00-FE-80-00-00-00-00-00-00-00-00-00-00
inet6 addr: fe80::f652:1403:7f:6061/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:49 errors:0 dropped:0 overruns:0 frame:0
TX packets:233 errors:0 dropped:33 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:10933 (10.9 KB) TX bytes:33254 (33.2 KB)
ib1 Link encap:UNSPEC HWaddr A0-00-01-10-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:172.31.128.51 Bcast:172.31.128.255 Mask:255.255.255.0
UP BROADCAST MULTICAST MTU:2044 Metric:1
RX packets:104 errors:0 dropped:0 overruns:0 frame:0
TX packets:114 errors:0 dropped:5 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:14695 (14.6 KB) TX bytes:16489 (16.4 KB)
root@gpu1:/# ifconfig
ib0 Link encap:UNSPEC HWaddr A0-00-01-00-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:172.31.128.53 Bcast:172.31.128.255 Mask:255.255.255.0
UP BROADCAST MULTICAST MTU:4092 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
ib1 Link encap:UNSPEC HWaddr A0-00-01-10-FE-80-00-00-00-00-00-00-00-00-00-00
inet6 addr: fe80::f652:1403:7f:6082/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:262 errors:0 dropped:0 overruns:0 frame:0
TX packets:159 errors:0 dropped:31 overruns:0 carrier:0
collisions:0 txqueuelen:1024
RX bytes:31927 (31.9 KB) TX bytes:27854 (27.8 KB)
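As a cross-check on the ifconfig dumps, the following Python sketch lists interfaces that carry an IPv4 address but do not show the RUNNING flag, which usually means the link under that interface is not up. It assumes the older net-tools ifconfig format shown above (`Link encap:`, `inet addr:`, and a flags line ending in `MTU:`). The helper names are hypothetical.

```python
import re


def parse_ifconfig(text):
    """Map interface name -> {"addr": IPv4 or None, "flags": set of flags}.

    Assumes the net-tools format shown above, where each interface
    starts with a line containing "Link encap:".
    """
    ifaces = {}
    cur = None
    for raw in text.splitlines():
        line = raw.strip()
        if "Link encap:" in line:
            cur = {"addr": None, "flags": set()}
            ifaces[line.split()[0]] = cur
        elif cur is not None:
            # Matches "inet addr:" only; "inet6 addr:" lines are skipped.
            m = re.search(r"inet addr:(\S+)", line)
            if m:
                cur["addr"] = m.group(1)
            elif "MTU:" in line:
                # Flags are the words before "MTU:", e.g. "UP BROADCAST RUNNING".
                cur["flags"] = set(line.split("MTU:")[0].split())
    return ifaces


def addressed_but_not_running(ifaces):
    """Names of interfaces that have an IPv4 address but lack RUNNING."""
    return sorted(name for name, info in ifaces.items()
                  if info["addr"] and "RUNNING" not in info["flags"])
```

On the gpu0 dump above this should return `ib1`, and on the gpu1 dump `ib0`: in both cases the interface holding the 172.31.128.x address is the one without RUNNING, while the RUNNING interface has only an IPv6 link-local address.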
route
root@gpu0:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 10.1.32.254 0.0.0.0 UG 0 0 0 eth0
10.1.32.0 * 255.255.255.0 U 1 0 0 eth0
link-local * 255.255.0.0 U 1000 0 0 eth0
172.31.128.0 * 255.255.255.0 U 0 0 0 ib1
192.168.122.0 * 255.255.255.0 U 0 0 0 virbr0
root@gpu1:/# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
172.31.128.0 * 255.255.255.0 U 0 0 0 ib0
192.168.122.0 * 255.255.255.0 U 0 0 0 virbr0
Note: Besides this problem, I also have trouble starting opensm: it always hangs. Instead of starting opensm, I start opensmd. I am not sure whether they are the same thing.
Amirul