Networking-Forums.com

Professional Discussions => Everything Else in the Data Center => Topic started by: ggnfs000 on October 30, 2017, 09:05:09 PM

Title: DGX-1 Nvidia shortcoming?
Post by: ggnfs000 on October 30, 2017, 09:05:09 PM
I have been following Nvidia a lot these days. Their DGX-1 supercomputer, priced at $168K and built on the Volta architecture, is claimed to replace 400 servers in a datacenter and is specifically designed for AI tasks. There is also a smaller personal version with several Volta GPUs at $69K.

https://www.computerworld.com/article/3196105/data-center/nvidias-volta-based-dgx-1-supercomputer-puts-400-servers-in-a-box.html

There are hundreds of apps listed by Nvidia that take advantage of GPU-accelerated computing, although most of them do not seem to be mainstream apps; they look like very scientific, professional, research-based applications. Assuming more mainstream apps will take advantage of GPU-accelerated computing, the question that lingers is: can one of these supercomputers really replace 400 servers? If one needs to build a DC with 400 servers, then there would be no need to build one (not to mention the costs involved: building infrastructure, cabling, and cooling).

But there seems to be one major catch. Data center performance is not measured by computing power alone, which is where Nvidia excels; there are also networking and storage, and on those fronts these boxes just do not compete. Now let's say one heads out to build a miniature datacenter with just one of these boxes.

There is going to be a bottleneck in terms of networking, as it has only a couple of 10G NIC adapters. Nowadays 10G is the norm, 40G is gaining traction, and 100G is already out. Assuming the 400 servers each had just one NIC interface at 40G, the overall network throughput for all of them would be around 16,000 Gbit/s. Just to point out, I am talking very rough numbers here.
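The rough arithmetic above can be sketched out quickly. The NIC counts and link speeds here are the assumptions from this post (one 40G NIC per server, a couple of 10G NICs on the box), not official specs:

```python
# Back-of-the-envelope network bandwidth comparison, per the rough
# assumptions in this post (not official DGX-1 specifications).

servers = 400
per_server_nic_gbps = 40
aggregate_gbps = servers * per_server_nic_gbps  # 400 x 40G = 16,000 Gbit/s

box_nics = 2
box_nic_gbps = 10
box_aggregate_gbps = box_nics * box_nic_gbps  # 2 x 10G = 20 Gbit/s

print(f"400-server aggregate: {aggregate_gbps} Gbit/s")
print(f"Single-box aggregate: {box_aggregate_gbps} Gbit/s")
print(f"Ratio: {aggregate_gbps // box_aggregate_gbps}x")
```

Even with these very rough numbers, the gap is on the order of hundreds of times, which is the point: raw compute is not the whole story.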

Storage is another concern. What does a typical storage server hold nowadays? I think it is measured in petabytes. This box has a meager 4x 1.8TB SSDs, for a total of less than 10TB. I am not sure if this is a direct storage-to-CPU PCIe connection or still going over a conventional storage controller.
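The same kind of rough comparison applies to storage, using the figures estimated in this post (4x 1.8TB SSDs against a petabyte-scale storage tier):

```python
# Rough local-storage comparison, using the estimates from this post.

ssd_count = 4
ssd_tb = 1.8
local_tb = ssd_count * ssd_tb  # 4 x 1.8TB = 7.2 TB, under 10 TB total

petabyte_tb = 1000  # 1 PB expressed in TB (decimal units)
print(f"Local SSD capacity: {local_tb:.1f} TB")
print(f"Fraction of 1 PB: {local_tb / petabyte_tb:.2%}")
```

So the box's local storage is well under one percent of a single petabyte, which is why a serious deployment would still need an external storage tier.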

So I still doubt that it can replace 400 servers quite the way it is now. There is no question this supercomputer is an amazing beast. However, to really replace 400 servers, I think Nvidia needs to do more in terms of networking and storage.

Perhaps partner with storage and networking vendors to come out with PODs? I think this might be the best way forward for Nvidia, at least for now.
It might be too much for Nvidia to build storage and network gear in-house at this point.

Title: Re: DGX-1 Nvidia shortcoming?
Post by: deanwebb on November 01, 2017, 07:33:52 AM
This could be great for a specialized, low-data / high-compute task such as running weather models or other analytical tasks of that nature, where the benefit is not how fast it comes up with one answer, but how fast it comes up with ten billion answers and then aggregates them into a set of most likely outcomes.
Title: Re: DGX-1 Nvidia shortcoming?
Post by: ggnfs000 on November 06, 2017, 06:42:20 PM
Quote from: deanwebb on November 01, 2017, 07:33:52 AM
This could be great for a specialized, low-data / high-compute task such as running weather models or other analytical tasks of that nature, where the benefit is not how fast it comes up with one answer, but how fast it comes up with ten billion answers and then aggregates them into a set of most likely outcomes.

Yes, for specific applications I think it will be great. I just don't think it is ready to replace an entire data center yet.