RapidScale Clusters, LLC

Amazon EC2 "Spot Instances"

New from Amazon! Spot instances will be perfect for large render farms that want to grow on demand at those times when capacity dynamically becomes less expensive. From Amazon:


Amazon EC2 Spot Instances
Spot Instances are a new way to purchase and consume Amazon EC2 Instances by allowing you to bid on unused Amazon EC2 capacity and run those instances for as long as your bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and when your bid meets or exceeds the Spot Price, you'll gain access to the available Spot Instances (paying only the Spot Price, regardless of what amount you bid). These instances will continue to run until you choose to terminate the instances, or your maximum bid falls below the Spot Price (whichever is sooner). Spot Instances are complementary to On-Demand Instances and Reserved Instances, providing another option for obtaining compute capacity.

If you have flexibility in when your applications can run, Spot Instances can significantly lower your Amazon EC2 costs. Additionally, Spot Instances can provide access to large amounts of additional capacity for applications with urgent needs. A few examples of categories of applications well suited to Spot Instances include image and video processing, conversion and rendering, scientific research data processing, and financial modeling and analysis.

Visit the Amazon EC2 Spot Instances page to learn more and get started using this new pricing feature.
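To make the bidding rule in Amazon's description concrete, here is a minimal Python sketch of it. It assumes the Spot Price changes once per hour and that usage is billed per hour at the Spot Price; both are simplifications on my part, and the prices in the example are made up for illustration. The function name simulate_spot_run is hypothetical, not part of any Amazon API.

```python
def simulate_spot_run(max_bid, hourly_spot_prices):
    """Return (hours_run, total_cost) for a single Spot Instance request.

    Assumptions (not from Amazon's docs): one Spot Price per hour, the
    instance starts in the first hour where the bid meets or exceeds the
    Spot Price, and it is terminated the first time the price rises above
    the bid.
    """
    hours_run = 0
    total_cost = 0.0
    running = False
    for price in hourly_spot_prices:
        if max_bid >= price:
            running = True
            hours_run += 1
            total_cost += price  # you pay the Spot Price, not your bid
        elif running:
            break  # Spot Price rose above the bid: instance is terminated
    return hours_run, round(total_cost, 4)


if __name__ == "__main__":
    # Made-up hourly Spot Prices in USD, purely for illustration.
    prices = [0.12, 0.08, 0.07, 0.09, 0.11, 0.06]
    hours, cost = simulate_spot_run(0.10, prices)
    print(f"ran {hours} hours, paid ${cost:.2f}")
```

With a bid of 10 cents the instance sits out the first hour, runs for the next three, and is stopped when the price climbs to 11 cents. For a render farm that means jobs should be able to survive a node disappearing mid-frame.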

 

Mental Ray frame buffer issue

We just went through a render farm tech support call with a customer using Maya / Mental Ray on RenderFarmer 1.1, and I thought this information might be useful to others. The problem was that Mental Ray likes to use /usr/tmp for its local frame buffer, but /usr/tmp is shared by all nodes because of the shared root filesystem. So not only is it slow to access because it's across the network, but the nodes are writing all over each other's temp files.

First we tried a bind mount that hooks /usr/tmp into /tmp, which was mounted on a local ramdisk. The access speed was great, but processing time was very slow. Why? The ramdisk was tiny, just 10 MB by default. Nowhere near enough for Mental Ray scratch space.

Solution two was to remove the symlink on each node to /usr/tmp, then create a local /usr/tmp (inside the rootfs, which is also a ramdisk). Presto!  We now have a /usr/tmp that is not limited to an arbitrary size, and because it's in memory, it's lightning fast.

Write speeds observed on a RenderFarmer 1.1 cluster using gigabit Ethernet:

  Source           Target                     Write speed
  Compute Node --> remote filesystem            70 MB/s
  Master Server -> local disk filesystem       720 MB/s
  Compute Node --> local ramdisk /usr/tmp     2900 MB/s
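If you want to take a comparable reading on your own nodes, a rough sequential write test is easy to script. The Python sketch below is one way to do it; it is not necessarily how the figures above were collected, and the function name and default sizes are my own choices. It writes a temporary file into the directory you point it at, forces it to the underlying storage with fsync, and reports MB/s.

```python
import os
import tempfile
import time


def measure_write_mb_per_s(directory, total_mb=64, chunk_mb=1):
    """Write roughly total_mb of zeros into `directory` and return MB/s.

    A rough sequential-write test only: results vary with caching,
    file size, and whatever else the node is doing at the time.
    """
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    chunks = max(1, total_mb // chunk_mb)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(chunks):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually left the buffer
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the scratch file
    return (chunks * chunk_mb) / elapsed


if __name__ == "__main__":
    # Replace the directory with the one under test, e.g. /usr/tmp.
    target = tempfile.gettempdir()
    print(f"{target}: {measure_write_mb_per_s(target):.0f} MB/s")
```

Run it once against the shared filesystem and once against the local /usr/tmp on a compute node, and keep total_mb well under the free memory when the target is a ramdisk.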
 

atl1e Attansic Linux driver

Several customers have had issues getting the atl1e driver to work correctly on CentOS Linux 5.2 and 5.3. There is a kmod-atl1e package floating around some RPM repositories, but it doesn't always work right. After spending a fair amount of time helping customers work around problems caused by kmod (drivers are not in the normal place, updating a kernel breaks it and then it won't let you remove / reinstall, etc.) I found the easy way to get these drivers.

Start here. This gentleman was kind enough to rework the Makefile that comes with the atl1e driver source and post the new one with lots of fixes. If that ever goes down you can download it from us here.

Just untar it with 'tar xvf l1e.tar' and cd into the src/ subdirectory, and run 'make install'. The modules are now in the usual place, /lib/modules/<kernel-version>/kernel/drivers/net/atl1e.

 

EC2 render speed comparison

Just an update on how the speed tests are going with the Amazon EC2 render farm. There does not seem to be any speedup from going 64 bit for my test renders. Possibly this is due to the particulars of the job I am running... But the "extra large" 64 bit EC2 instance I ran that had 4 dual cores and 8 GB of memory finished the render exactly 4 times faster than the 32 bit instance with 1 dual core.

So the price and time to completion were identical either way. Now I am moving on to testing the EC2 cloud versus my own render nodes, mano a mano. And the results are in!

Several tests showed the same result, so I will not bore everyone with every last detail. The bottom line is that local renders were performed on a DL380 with 2 GB of memory and dual Xeon 2.8 GHz CPUs. They are hyperthreaded so they show up in Linux as 4 CPUs. On average, frames were rendered fastest when processed 4 at a time and they finished in about 40 seconds per frame.

In EC2 I used a single "high-cpu medium" instance for comparison. It claims to provide "virtual" dual Xeon 2.3 GHz CPUs and 1.7 GB of memory. It rendered the same images in about 65 seconds each, on average. So the remote renders were about 1.6 times slower than local, about 90% of which I attribute to CPU power. To estimate actual CPU power based on the experimental results, calling it a virtual dual Xeon 2.0 GHz would be more accurate. Of course, the wonderful thing about EC2 is that I can provision as many of them as I want! The only thing you really need to know is how to baseline them... And now you have something for comparison. At 20 cents per hour per instance, it's a bargain if you want to burst your render farm capacity with Amazon's EC2.
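For anyone who wants to rerun this baseline with their own numbers, the arithmetic is short enough to script. The Python sketch below uses the figures from this post (40 seconds per frame locally, 65 seconds on the EC2 instance, 20 cents per hour) and assumes cost can be prorated by the second, which is a simplification on my part. The function names are just for illustration.

```python
LOCAL_SECONDS_PER_FRAME = 40   # DL380, dual Xeon 2.8 GHz (from the test above)
EC2_SECONDS_PER_FRAME = 65     # "high-cpu medium" instance (from the test above)
EC2_DOLLARS_PER_HOUR = 0.20


def slowdown(remote_seconds, local_seconds):
    """How many times longer the remote render takes than the local one."""
    return remote_seconds / local_seconds


def frames_per_hour(seconds_per_frame):
    """Frames one machine finishes in an hour at this average rate."""
    return 3600 / seconds_per_frame


def cost_per_frame(seconds_per_frame, dollars_per_hour):
    """Cost of one frame, assuming the hourly price is prorated by the second."""
    return seconds_per_frame / 3600 * dollars_per_hour


if __name__ == "__main__":
    ratio = slowdown(EC2_SECONDS_PER_FRAME, LOCAL_SECONDS_PER_FRAME)
    print(f"EC2 takes {ratio:.3f}x as long per frame")
    print(f"EC2 frames per hour: {frames_per_hour(EC2_SECONDS_PER_FRAME):.1f}")
    cents = 100 * cost_per_frame(EC2_SECONDS_PER_FRAME, EC2_DOLLARS_PER_HOUR)
    print(f"EC2 cost per frame: {cents:.2f} cents")
```

Swap in your own seconds per frame for a local node and you have a quick way to decide how many instances it takes to match the hardware you already own.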

 

Amazon EC2 Cloud Render Farm Integration

On request from a prospective customer, I looked into the idea of integrating RenderFarmer with cloud computing technology. Of course we are already compatible with a PXE booted VMware virtual machine image, but cloud computing is all about external resources that you can offload work to. Amazon EC2 is the leader in the "cloud for rent" category, so I did some homework to figure out how this might work. The idea is simple: If you want on demand capacity in your render farm, just boot up some cloud based "virtual render nodes" and they'll join your farm and go to work alongside whatever local nodes you might have.

At first, the integration does not look straightforward. In RenderFarmer, everything gets tied in to a cluster shared root filesystem by PXE booting into an image that goes out over the local network and links to whatever it needs.  The nodes are diskless. But the latency between the cloud and us is far too great for that kind of live network interaction to work in that case. So no PXE. However, the whole point of PXE / shared root is that it scales fast. You can add nodes in a few minutes with no sysadmin skills or effort. Likewise, Amazon's EC2 scales fast. They have their own methods for doing this, so let's take advantage of them and not worry about PXE!

A "server" in EC2 is stored as an image, called an AMI (Amazon Machine Image). There are lots of basic images publicly available. You can choose one and then launch as many instances of it as you like. So our first challenge was to build our own AMI that had whatever cluster magic we would have gotten via PXE boot. After a while, a 32 bit Fedora Core 8 render node emerged that had all the required ingredients. 

Now the basic AMI instance was running but local rendering was incredibly slow. Turns out the caching does not work well enough; all the render engine files had to be local to the instance. I rebuilt the AMI to include the Maya render engine and tested again. OK, rendering is much better now!  

But we still need some level of tighter integration with the render farm here in the office. For instance, the output directory for the images would ideally be the same for all nodes, whether local or remote. And there were some other remote directories that the DrQueue render queue manager needed in order for jobs to propagate information... There are several ways to do this. We could take the easy way and use NFS or even rsync at certain intervals, but this would take away a big advantage of RenderFarmer. By using the cluster filesystem, we retain the ability to add more servers later for read and write striping and fault tolerance. This gives us the power to increase our i/o to almost any level that we need, and to do so transparently. By adding servers in the back end, "underneath" the clients, they all get the i/o benefits without any configuration changes. Trying to add this on later to an NFS / rsync system would be painful or impossible. The cluster filesystem is more difficult to use, but it's worth it.

After many hours of script fu, we had a cloud based render node that processed jobs in real time. The question now is how fast the different types of instances you can deploy from the cloud are... There are three types of instances we are interested in. The first is a "small" that has one CPU. Amazon says it is equivalent to a 1 GHz Xeon or Opteron and it costs 10 cents (.1 USD) per hour to run. The second is a "medium" instance that has dual Xeons and costs 20 cents per hour. The third is the Extra Large High CPU instance, which comes up after the first tests.

The first test was to render 100 frames of GiantStormAnim.mb, a sample scene that comes with Maya. I clicked two or three buttons and deployed 10 "small" nodes, and started the job when they were done booting. 41 minutes and 20 seconds later, it was done! Next I deployed 5 dual cpu "medium" instances and ran it again. It finished in about 31% less time, completing in just 28 minutes and 25 seconds. Good to know that the second setup has more CPU horsepower, despite having the same number of cores and the same $1 per hour cost (10 cents * 10 nodes or 20 cents * 5 nodes).
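Here is the arithmetic behind that comparison as a small Python sketch, using the two timings above and the $1 per hour farm cost. The prorated cost is an assumption of mine for comparison purposes; if instances are billed in whole hours, both runs would simply cost the same $1. The helper names are hypothetical.

```python
FRAMES = 100
FARM_DOLLARS_PER_HOUR = 1.00  # 10 small nodes or 5 medium nodes


def to_seconds(minutes, seconds):
    """Convert a minutes:seconds timing to seconds."""
    return minutes * 60 + seconds


def time_saved_fraction(slow_seconds, fast_seconds):
    """Fraction of wall-clock time saved by the faster run."""
    return 1 - fast_seconds / slow_seconds


def prorated_cost(run_seconds, dollars_per_hour):
    """Cost of a run if the hourly price were prorated by the second."""
    return run_seconds / 3600 * dollars_per_hour


if __name__ == "__main__":
    small = to_seconds(41, 20)   # 10 "small" instances
    medium = to_seconds(28, 25)  # 5 "medium" instances
    print(f"time saved: {100 * time_saved_fraction(small, medium):.1f}%")
    print(f"speedup: {small / medium:.2f}x")
    for name, secs in (("small", small), ("medium", medium)):
        print(f"{name}: {secs / FRAMES:.2f} s per frame across the farm, "
              f"${prorated_cost(secs, FARM_DOLLARS_PER_HOUR):.2f} prorated")
```

The per-frame figure here is wall-clock time for the whole farm divided by the frame count, so it is a throughput number, not the time a single node spends on one frame.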

Next up is the Extra Large High CPU instance, which has 8 dual Xeons for a whopping 16 cores per instance. But I can't try it until I create a 64 bit OS image... That's ok, we came a long way so far and I'm very happy with the results. In effect, it is possible to deploy a temporary virtual render farm OF ANY SIZE that can be integrated 100% with RenderFarmer, just as if the nodes were local. The nodes can be all virtual, or the virtual nodes can just supplement any existing local nodes. From a user perspective (when launching or checking on jobs, say) one cannot tell that some nodes are local and some aren't. And when you aren't using your virtual nodes, you aren't paying for them! Also, just like with local PXE booting nodes, it only takes moments to add nodes on the fly and have them join into running jobs.

We are ready to help our customers get their own AMI images ready and working, and anyone interested in beta testing this exciting new technology should contact us as soon as possible.  Any customers with a support agreement will receive free assistance and integration! Of course, if you are not yet a customer but have questions about this technology, we are happy to assist you in determining if Amazon EC2 cloud integration will work for you. 

 



 

Copyright 2009    RapidScale Clusters, LLC