DGX-1 Nvidia shortcoming?

Started by ggnfs000, October 30, 2017, 09:05:09 PM


ggnfs000

I have been following Nvidia a lot these days. Their DGX-1 supercomputer, priced at $168K with the Volta architecture, is claimed to replace 400 servers in a datacenter and is specifically designed for AI tasks. There is also a smaller, personal-computer version encompassing several Volta GPUs at $69K.

https://www.computerworld.com/article/3196105/data-center/nvidias-volta-based-dgx-1-supercomputer-puts-400-servers-in-a-box.html

There are hundreds of apps listed by Nvidia that take advantage of GPU-accelerated computing, although most of them seem to be scientific, professional, research-oriented applications rather than mainstream ones. Assuming more mainstream apps will take advantage of GPU-accelerated computing, the lingering question is: can one of these supercomputers really replace 400 servers? If it can, then anyone needing a DC with 400 servers would have no need to build one (not to mention the costs avoided: building infrastructure, cabling, and cooling).

But there seems to be one major catch. Data center performance is not measured by computing power alone, which is where Nvidia seems to excel; there are also networking and storage, and on those fronts the box just does not compete. Now let's say one heads out to build a miniature datacenter with just one of these boxes.

There is going to be a bottleneck in terms of networking, as the box has only a couple of 10G NIC adapters. Nowadays 10G is the norm, 40G is gaining traction, and 100G is already out. Assuming the 400 servers each had just one 40G NIC interface, their aggregate network throughput would be around 16,000 Gb/s. Just to point out, I am talking very rough numbers here (see the sketch below).
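A quick back-of-envelope calculation in Python makes the gap concrete. The one-40G-NIC-per-server figure and the two 10G NICs on the DGX-1 are just the assumptions from the paragraph above, not measured numbers:

# Rough aggregate network throughput comparison (figures in Gb/s).
servers = 400
nic_per_server_gbps = 40                            # assumed: one 40G NIC per server
legacy_aggregate = servers * nic_per_server_gbps    # 16,000 Gb/s

dgx1_nic_count = 2                                  # assumed: two 10G NICs on the DGX-1
dgx1_nic_gbps = 10
dgx1_aggregate = dgx1_nic_count * dgx1_nic_gbps     # 20 Gb/s

print("400-server aggregate:", legacy_aggregate, "Gb/s")
print("DGX-1 aggregate:     ", dgx1_aggregate, "Gb/s")
print("Shortfall factor:     %dx" % (legacy_aggregate / dgx1_aggregate))  # 800x

Even with these very rough inputs, the box comes up short on aggregate network bandwidth by orders of magnitude.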

Storage is another concern. What does a typical storage server hold nowadays? I think it is measured in petabytes. The box has a meager 4x 1.8TB SSDs, for a total of less than 10TB. I am not sure if this is a direct storage-to-CPU PCIe connection or if it still goes over a conventional storage controller.
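The same kind of rough comparison works for storage. The 4TB-per-server figure below is purely my assumption for illustration; the DGX-1 number is the 4x 1.8TB quoted above:

# Rough raw storage capacity comparison (figures in TB).
servers = 400
tb_per_server = 4                             # assumed average raw capacity per server
legacy_capacity = servers * tb_per_server     # 1,600 TB = 1.6 PB

dgx1_capacity = 4 * 1.8                       # 4x 1.8TB SSDs, ~7.2 TB total

print("400-server capacity: %d TB (~%.1f PB)" % (legacy_capacity, legacy_capacity / 1000))
print("DGX-1 capacity:      %.1f TB" % dgx1_capacity)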

So I still doubt that it can replace 400 servers quite the way it is now. There is no question this supercomputer is an amazing beast. However, to really replace 400 servers, I think Nvidia needs to do more in terms of networking and storage.

Perhaps partner with storage and networking vendors to come out with PODs? I think this might be the best way forward for Nvidia, at least for now. Building storage and network gear in-house might be too much for Nvidia at this point.

deanwebb

This could be great for a specialized, low-data / high-compute task such as running weather models or other analytical tasks of that nature, where the benefit is not how fast it comes up with one answer, but how fast it comes up with ten billion answers and then aggregates them into a set of most likely outcomes.
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

ggnfs000

Quote from: deanwebb on November 01, 2017, 07:33:52 AM
This could be great for a specialized, low-data / high-compute task such as running weather models or other analytical tasks of that nature, where the benefit is not how fast it comes up with one answer, but how fast it comes up with ten billion answers and then aggregates them into a set of most likely outcomes.

Yes, for specific applications I think it will be great. I just don't think it is ready to replace an entire data center yet.