VREDPro HPC Raytracing Cluster

1 HPC Raytracing Cluster
1.1 Introduction
1.2 Configuration
1.2.1 Cluster Options
1.2.2 Network Options
1.2.3 Render Node Options
1.2.4 Preferences
1.2.5 Starting the cluster service
1.3 Network requirements
1.4 VRED Cluster on Linux
1.4.1 Extracting the Linux installation
1.4.2 Automatically start VRED Cluster Service on Linux startup
1.4.3 Connecting a license server
1.4.4 Accessing log information
1.4.5 Useful commands
1.5 Fully nonblocking networks / blocking networks
1.6 Troubleshooting
1.6.1 Unable to connect a rendering node
1.6.2 Unable to use Infiniband to connect cluster servers
1.6.3 The frame rate is not as high as expected
1.6.4 Why is the network utilization less than 100%
1.6.5 Why is the CPU utilization less than 100%

1 HPC Raytracing Cluster

1.1 Introduction

With VREDPro it is possible to speed up rendering by using many PCs connected over a network. To distribute the rendering work between multiple PCs, the whole scene graph is transferred over the network to each PC. The image is then split into tiles and each PC renders a subset of all tiles. The results are sent back to the master, where the final image is composed.

1.2 Configuration

To configure a cluster, the IP address or name of every host must be known. Master and rendering nodes are connected through a network. The configuration is presented as a tree view with the global options, the network options and the options for each rendering node. By clicking one of the entries in the tree, you'll see the corresponding options on the left side. Entries can be added or removed through the Edit menu or through the context menu shown when right-clicking a node.

Adding many nodes for a large cluster can be tedious. To speed up the configuration, use the Edit compute nodes command from the Edit menu. The configuration can then be edited in the Configuration text box. A wide range of formats is recognized, including the connection string from the previous Cluster module:

PC1 PC2 PC3 PC4
PC1, PC2, PC3, PC4
PC1 - PC4
Master1 {PC1 PC2 PC3}
Master1[forwards=PC1 PC2 PC3]
Master1 {C1PC01 - C1PC50} Master2 {C2PC00 - C2PC50}

The field Read configuration from file can be used to read the configuration from an externally generated file. For example, a startup script could generate a file that contains the names of all usable cluster nodes (see the script sketch below). If this field contains a valid file name, the configuration is read each time the Start button is pressed.

If the cluster nodes are located in a local network that is not accessible from the host where VRED is running, an intermediate node must be configured. This node must be connected to the same network as the host where VRED is running and, through a second network card, to the cluster network. In this setup two networks have to be configured: the first one connects VRED with the gateway node, the second one connects the cluster nodes.
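The following is only a sketch of such a startup script, not part of the product: the host names, the ping-based availability check and the file location are examples; adapt them to however your site tracks available nodes.

#!/bin/sh
# Hypothetical example: write all reachable render nodes into a node list file.
NODEFILE=/tmp/vred-cluster-nodes.txt
: > "$NODEFILE"
for n in node01 node02 node03 node04; do
    if ping -c 1 -W 1 "$n" > /dev/null 2>&1; then
        echo "$n" >> "$NODEFILE"
    fi
done
# Point "Read configuration from file" at /tmp/vred-cluster-nodes.txt;
# the file is re-read each time the Start button is pressed.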

1.2.1 Cluster Options

Data compression
  Compression method used to transfer the scene graph from VRED to the cluster nodes.
  Off: No compression
  Medium: Fast, but low compression ratio
  High: Slow, but higher compression ratio

Image compression
  Compression method used to transfer the rendered images from the cluster nodes to VRED.
  Off: No compression
  Low (lossless): Low compression without loss of quality
  Medium (lossy): Fixed compression ratio of 1/3 and very fast
  High (lossy): Slow, but high compression ratio

Start/Stop action
  This option can be used to prevent very long rendering times after a cluster stop or a cluster crash.
  Do nothing: No action is performed on cluster start or cluster stop
  Downscale Off/On: Switch downscale off on cluster start and back on when the cluster stops
  Raytracing On/Off: Enable raytracing during cluster rendering and use OpenSG rendering when the cluster stops
  Rendering On/Off: Rendering is switched off if the cluster is not running

Still supersampling
  If the CPU utilization during still antialiasing is very poor, increasing the image size in this way can raise the utilization. With a supersampling of 2x2, 4 pixels are rendered by the cluster and later combined into a single pixel. To get the same rendering quality, the number of image samples should be reduced by a factor of 4 (9 with 3x3, 16 with 4x4).

Parallel frames
  On clusters with many nodes (>100) and low resolution rendering, this option can be used to increase the frame rate. If there are 500 nodes and parallel frames is set to 2, then 250 nodes render one image while the other 250 render the next image. This increases latency and should only be used at high frame rates.

Frame buffers
  VRED is able to render a new image while the previous images are still being transferred over the network. On slow internet connections, increasing the number of frame buffers can increase the frame rate.

1.2.2 Network Options

Enabled
  If not checked, the nodes connected to this network are not used during cluster rendering.

Description
  Just for documentation.

Network type
  Ethernet: Standard network connection.
  Infiniband: On Linux the ibverbs interface is used to communicate with Infiniband network cards. In some situations this can be faster than using IPoIB. To be able to use Infiniband, the nodes must also be reachable over Ethernet.

Network speed
  VRED uses this option to optimize the communication. In most cases Auto should be selected. With 10GBit Ethernet you can get a higher throughput by selecting 10 Gigabit. If you have a 10GBit link to a 1GBit switch, then 10G uplink to 1G should be selected.

Service port
  This socket port is used to connect cluster nodes. Make sure that this port is not blocked by a firewall (see the firewall sketch after this section).

Encrypted
  The whole communication in the network is encrypted. A new key is used for each session, so the communication is safe against listening to and recording network sessions. If you expect attacks on your network infrastructure and elaborate man-in-the-middle attacks, you should consider other mechanisms such as encryption hardware or encryption secured by signed certificates.
  The following algorithms are used for encryption:
  Key negotiation: Unified Diffie-Hellman, Ephemeral Unified Model Scheme (X9.63) or Static Unified Model (NIST), 1024 bit
  Transport layer security: RSA, 128 bit

Service Port
  This is the TCP socket port where the VREDClusterService is listening for incoming connections. Communication over this port must not be blocked by a firewall. 8889 is the default port. To start VREDClusterService on another port, use the -p option.
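The service port (8889 by default) must be open in the firewall on the master and on every render node. The following is only a sketch, assuming firewalld (Red Hat/CentOS) or ufw (Ubuntu); use whatever firewall management your distribution provides:

# firewalld (Red Hat / CentOS)
firewall-cmd --permanent --add-port=8889/tcp
firewall-cmd --reload

# ufw (Ubuntu)
ufw allow 8889/tcp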

1.2.3 Render Node Options

Enabled
  If not checked, this node is not used for cluster rendering.

Name
  IP address or host name of the compute node.

1.2.4 Preferences

Default config path
  This configuration file is loaded on startup.

Consumption based license check out
  Use a special license for consumption based billing.

Check out all licenses from master
  If the cluster nodes are not able to connect to a license server, licenses for all nodes can be checked out from the master.

1.2.5 Starting the cluster service

For cluster rendering a rendering server must be started on each rendering node. To automate this process, VREDClusterService is used. This program starts and stops cluster servers on demand. By default the service listens on port 8889 for incoming connections. Only one service can be started on a given port at a time. If multiple services are to be started, each service must use a different port; use the -p option to specify the port. For testing, VREDClusterService can be started manually. Adding the options -e -c will produce some debug information on the console.
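As a sketch of such a manual test run, combining the -p and -e -c options described above (the exact binary location depends on the platform, see section 1.4 for Linux; port 8890 is only an example for a second instance):

# start the service in console mode with debug output
VREDClusterService -e -c

# start an additional service instance listening on a different port
VREDClusterService -p 8890 -e -c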

On Windows the VRED installer adds a system service named VRED Cluster Daemon. This service is responsible for starting VREDClusterService on system startup. You can check this by selecting View local services from the Windows Control Panel: the path should point to the correct installation, the startup type should be Automatic, and the service status should be Started. On Linux the service must be started by adding bin/vredclusterservice to the init system or by running bin/clusterservice manually.

1.3 Network requirements

The network hardware is the limiting factor for the maximum reachable frame rate and for the speed with which the scene graph can be distributed to the rendering nodes in a cluster. We have measured frame rates and data transfer rates on a 32 node cluster with different network configurations. We used a very simple scene with CPU rasterization to generate the images, to make sure that the rendering speed is not CPU bound. All measurements were taken in a dedicated network without other traffic and with well configured network components.

The data transfer rate is always limited by the slowest network link in the cluster. Infiniband can in theory transfer up to 30 GBit/s, but the 10 GBit link from the master makes it impossible to reach this speed. Currently we have no experience with large clusters connected by 10GB Ethernet. The 10GB->Infiniband configuration has been tested with up to 300 nodes.

Network            uncompressed (MB/s)   medium (MB/s)   high (MB/s)
10GB->Infiniband   990                   1100            910
10GB->1GB          110                   215             280
1GB                110                   215             280

In contrast to the transfer rate of the scene graph, the frame rate is in some cases not limited by the slowest connection. If the rendering nodes are connected to a 1 GBit switch with a 10 GBit uplink to the master, then 10 rendering nodes can together utilize the whole 10 GBit uplink. As the influence of network latency becomes more important on large clusters, Infiniband should be the preferred cluster network.

Resolution   Network            uncompressed (fps)   lossless (fps)   medium lossy (fps)   high lossy (fps)
1024x768     10GB->Infiniband   250                  220              300                  250
1024x768     10GB->1GB          250                  220              300                  250
1024x768     1GB                50                   100              145                  220
1920x1080    10GB->Infiniband   130                  120              200                  130
1920x1080    10GB->1GB          130                  120              200                  130
1920x1080    1GB                20                   40               50                   130
4096x2160    10GB->Infiniband   40                   40               55                   40
4096x2160    10GB->1GB          40                   40               55                   40
4096x2160    1GB                5                    10               15                   40
8192x4320    10GB->Infiniband   10                   12               14                   13
8192x4320    10GB->1GB          10                   12               14                   13
8192x4320    1GB                1                    2                3                    10

We have tested Mellanox and QLogic/Intel Infiniband cards. For 10 Gigabit Ethernet we used an Intel X540-T1 network card.

VRED will try to use as much bandwidth as possible. This can cause problems for other users on the same network; conversely, large file transfers on the same network will slow down the rendering. If possible, use dedicated network infrastructure for the cluster.

1.4 VRED Cluster on Linux

1.4.1 Extracting the Linux installation

The Linux VREDCluster installer is shipped as a self-extracting binary. The file name of the package contains the version number and a build date. To extract the content to the current directory, run the installer as a shell command:

sh VREDCluster-2016-SP4-8.04-Linux64-Build20150930.sh

This creates the directory VREDCluster-8.04 containing all the files required for cluster rendering. Each cluster node must have read permission for this directory. The package can be installed individually on each node or on a shared network drive. To be able to connect a cluster node, the cluster service must be started. This can be done by calling the following command manually or by adding the script to the Linux startup process:

VREDCluster-8.04/bin/clusterService
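A sketch of installing once on a shared network drive and starting the service on every node; the mount point /shared is only an example and must be readable by all render nodes:

# on one machine: extract the package onto the shared drive
cd /shared
sh VREDCluster-2016-SP4-8.04-Linux64-Build20150930.sh

# on every render node: start the cluster service from the shared location
/shared/VREDCluster-8.04/bin/clusterService &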

1.4.2 Automatically start VRED Cluster Service on Linux startup

On Linux there are many ways to start a program on system startup, and they differ slightly between distributions. One possible solution is to add the start command to /etc/rc.local. This script is executed on startup and on each runlevel change. If the installation files are stored, for example, in /opt/vredcluster-8.04, then rc.local should be edited as follows:

#!/bin/sh -e
killall -9 VREDClusterService || true
/opt/vredcluster-8.04/bin/clusterservice &
exit 0

If you do not want to run the service as root, you can start it as a dedicated user, for example vreduser:

#!/bin/sh -e
killall -9 VREDClusterService || true
su vreduser -c /opt/vredcluster-8.04/bin/clusterservice &
exit 0

Another option for starting the VRED service on system startup is to add an init script to /etc/init.d. Copy the init script bin/vredclusterservice from the installation directory to /etc/init.d and activate it for the different run levels.

On Red Hat or CentOS Linux use:
chkconfig vredclusterservice on

On Ubuntu use:
update-rc.d vredclusterservice defaults

1.4.3 Connecting a license server

The address of the license server must either be stored in the config file .flexlmrc in your home directory or be set in the environment variable ADSKFLEX_LICENSE_FILE. If the license server is reachable at 146.248.3.122, ~/.flexlmrc should contain:

ADSKFLEX_LICENSE_FILE=@146.248.3.122

Or add to your login script:

export ADSKFLEX_LICENSE_FILE=@146.248.3.122

A host name (for example mylicenseserver) can be used instead of the IP address. Make sure to set up the environment for the user that starts the cluster service.

1.4.4 Accessing log information

Log files are written to the directory /tmp/clusterserver/log and should be viewed with a web browser. For troubleshooting it can be helpful to start the cluster service in console mode: stop a running cluster service and start bin/clusterservice -e -c manually. All debug and error messages are then written to the console.
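A sketch of such a console-mode debugging session, assuming the /opt/vredcluster-8.04 location used in the examples above; writing the output to a file via tee is optional and the file name is arbitrary:

# stop a service instance that is already running
killall -9 VREDClusterService

# run the service in the foreground; debug and error messages appear on the console
/opt/vredcluster-8.04/bin/clusterservice -e -c 2>&1 | tee /tmp/vredclusterservice-console.log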

1.4.5 Useful commands

Check if the service is running:  ps -C VREDClusterService
Stop the cluster service:         killall -9 VREDClusterService
Stop a cluster server:            killall -9 VREDClusterServer

1.5 Fully nonblocking networks / blocking networks

A fully nonblocking network allows all connected nodes to communicate with each other at full link speed. For a 16 port gigabit switch this means that the switch is able to handle the traffic of 16 ports reading and writing in parallel, which requires an internal bandwidth of 32 GBit/s. VRED works best with nonblocking networks. For example, the scene graph is transferred by sending packets from node to node like a pipeline, which is much faster than sending the whole scene directly to each node. If possible, do not use network hubs: a hub sends incoming packets to all connected nodes, which results in a very slow scene graph transfer.

On large clusters it can be very expensive to use fully nonblocking networks. If the cluster nodes are connected to multiple switches, good performance can be reached by simply sorting the rendering nodes by the switches they are connected to. For example, consider a cluster with 256 nodes connected to 4 switches. The worst configuration would be an unsorted list:

Node001 Node177 Node002 Node200 Node090 ...

A better result can be achieved with a sorted list:

Node001 - Node256

If this configuration still causes performance problems, you can try to model your network topology into the configuration:

Node001 { Node002 - Node064 } Node065 { Node066 - Node128 } Node129 { Node130 - Node192 } Node193 { Node194 - Node256 }

With this configuration the amount of data that is transferred between the switches is efficiently reduced.

1.6 Troubleshooting

1.6.1 Unable to connect a rendering node

- Use ping hostname to check if the node is reachable.
- Try to use the IP address instead of the host name.
- Port 8889 must be allowed through the firewall.
- VREDClusterService must be started on the rendering node. Try to connect with a web browser via http://hostname:8889.

1.6.2 Unable to use Infiniband to connect cluster servers

- Communication over Infiniband without an IP layer is only supported on Linux.
- The Infiniband network must be configured properly. Use ibstat to check your network state, or use ib_write_bw to check the communication between two nodes.
- Check ulimit -l: the amount of locked memory should be set to unlimited.
- Check ulimit -n: the number of open files should be set to unlimited or to a large value.
- If it is not possible to run VREDClusterServer over Infiniband, in most cases IP over Infiniband can be used with very little performance loss.

1.6.2.1 Occasional hangs at otherwise high frame rates

- Maybe the network is used by other applications.
- Try to change the network speed to Auto.
- Try to reduce the image compression or set image compression to Off.

1.6.3 The frame rate is not as high as expected

- If the CPU usage is high, the rendering speed is limited by the CPUs. Frame rates can be increased by reducing the image quality.
- If the CPU usage is low and the network utilization is high, enabling compression could provide higher frame rates.
- Small screen areas with very complex material properties can lead to very bad load balancing between the rendering nodes. The result is low CPU usage and low network usage. The only option in this case is to locate these hot spots and reduce the material quality.

1.6.4 Why is the network utilization less than 100%

- If the rendering speed is limited by the CPU, the network utilization will always be less than 100%. Providing a faster network will, in this case, not provide a higher frame rate.
- Try to use tools like iperf to check the achievable network throughput. On 10 Gigabit Ethernet VRED uses multiple channels to speed up communication; this can be simulated with iperf -P 4. Check the throughput in both directions.
- Network speed is limited by the network cards, but also by the network switches and hubs. If the network is used by many people, the throughput can vary strongly over time.
- Tests have shown that it is very difficult to utilize 100% of a 10 Gigabit Ethernet card on Windows. Without optimization you can expect 50%. You can try to use measurement tools and adjust the performance parameters of your network card.
- If the rendering speed really is limited by your network, you should set image compression to lossless or higher.
- For Intel 10 Gigabit X540-T1 Ethernet cards, the following options could increase the network throughput on Windows:
  - Receive Side Scaling Queues: default = 2, changed to 4
  - Performance Options / Receive Buffers: default = 512, changed to 2048
  - Performance Options / Transmit Buffers: default = 512, changed to 2048
- Sometimes installing the network card into another PCI slot can increase the network throughput.
- On Linux you can increase the parameters net.ipv4.tcp_rmem, net.ipv4.tcp_wmem and net.ipv4.tcp_mem in /etc/sysctl.conf (a sketch with example values follows after section 1.6.5).

1.6.5 Why is the CPU utilization less than 100%

- If the rendering speed is limited by the network, the CPU utilization will not reach 100%.
- To distribute the work between multiple rendering nodes, the screen is split into tiles, and each node renders a number of these tiles. In most cases it is not possible to assign each node exactly the same amount of work; depending on the scene, this can lead to low CPU utilization.
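The following sketches the network checks and Linux TCP tuning mentioned in section 1.6.4. The iperf call mirrors the -P 4 hint above; the sysctl values are illustrative assumptions, not recommendations from this guide, so measure the throughput before and after changing them:

# measure achievable throughput between two cluster hosts
iperf -s                          # on the receiving machine
iperf -c rendermaster -P 4        # on the sending machine; "rendermaster" is a placeholder host name
# repeat with the roles swapped to check the other direction

# /etc/sysctl.conf - example buffer sizes only; apply with "sysctl -p"
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_mem = 786432 1048576 1572864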