03 August 2021
When describing the latest technologies, we use the phrase “state of the art” without being aware that we’re actually referring to a moving target.
Let’s consider high-performance computing (HPC) storage solutions, which help keep pace with the massive volumes of information that need processing.
Increasingly, HPC is central in tackling some of the most complicated tasks, from gene sequencing to vaccine development.
But describing an HPC system as state of the art doesn’t really account for all the factors that customers planning large-scale storage must have in mind. Not only are these systems expensive to buy, maintain and operate, but the costs of downtime and outages are often overlooked until it is too late.
Users are waking up to this. While 57 per cent of HPC storage buyers said performance was the top criterion, 37 per cent mentioned long-term value or total cost of ownership (TCO) as a key factor, according to a study by Hyperion Research for Panasas.
Familiar headaches
HPC storage historically focussed on managing “big” files, whether a massive climate simulation or the streaming files needed for CGI. Many installations relied on file systems that were ostensibly open source. Platforms often required retuning after completing a job or when preparing for the next.
But in the commercial world, there’s no tolerance for downtime or for the extra staff required to keep things running.
Systems are expected to show a return on investment and handle multiple workloads simultaneously. In recent years, small files have played an increasing role, partly due to the demands of AI workloads, though anecdotally a similar pattern is being seen in traditional HPC areas such as life sciences and computational fluid dynamics.
Parallel file systems, with all the components talking to each other, were in danger of being swamped as the ratio of comms overhead to processing overhead increased.
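To see why small files strain a parallel file system, consider a rough model in which every file incurs a fixed coordination cost on top of the time spent moving its data. The Python sketch below is illustrative only: the 1 ms per-file overhead and 1 GB/s bandwidth are assumed figures, not measurements from any real system, and the function name is hypothetical.

```python
def overhead_ratio(file_size_bytes,
                   per_file_overhead_s=0.001,       # assumed fixed coordination cost per file
                   bandwidth_bytes_per_s=1e9):      # assumed sustained transfer rate
    """Return the ratio of per-file coordination time to data transfer time.

    A ratio well below 1 means the system spends most of its time moving data;
    a ratio well above 1 means coordination traffic dominates.
    """
    transfer_s = file_size_bytes / bandwidth_bytes_per_s
    return per_file_overhead_s / transfer_s


if __name__ == "__main__":
    for label, size in [("1 GB file", 1_000_000_000), ("4 KB file", 4096)]:
        print(f"{label}: overhead/transfer ratio = {overhead_ratio(size):.3f}")
```

Under these assumptions, coordination is a rounding error for a gigabyte-sized file but outweighs the data transfer by a factor of roughly 244 for a 4 KB file, which is the pattern the paragraph above describes.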
The use of flash helped, but it is expensive compared to hard drives. One solution is to integrate flash and traditional storage. But that raises the challenge of managing the various tiers, to ensure they are used in the most performant way possible.
The cost of complexity
Hyperion’s study revealed insights about the total cost of ownership (TCO) as it applies to HPC storage.
One major cost is people. In 43 per cent of installations, one to three people were required to maintain the system, while eight per cent needed four or five. Five or more were required at 10 per cent of HPC storage installations.
So, although just over a quarter of installations spent $100,000 or less on staffing, almost a third saw costs of $100,000 to $300,000, and almost 14 per cent spent over $500,000. Simply recruiting and training experts was the most challenging aspect of HPC storage for 38 per cent of organisations.
Installation is also a major challenge. Just six per cent of organisations had their HPC storage rigs operational within a day, with 38 per cent needing two to three days. Over a quarter needed four to five days, with a similar number still unboxing after a week.
Downtime is also a major headache: almost half said they had to tune and retune systems monthly, with four per cent retuning weekly and two per cent daily. Additionally, monthly failures were reported by one-third of organisations and eight per cent reported weekly outages.
While 59 per cent said recovery usually took a day or less, 24 per cent needed two to three days, 14 per cent took up to a week and three per cent needed longer.
This is expensive, particularly for commercial customers adopting HPC storage, with 41 per cent putting the cost of a day’s outage at up to $99,000. Fourteen per cent put the cost at $100,000 to $500,000, with six per cent hitting $500,000 to $1M.
For four per cent, the daily outage cost was a shocking one million dollars.
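To put these figures in annual terms, a back-of-the-envelope calculation helps. The Python sketch below multiplies failure frequency, recovery time and daily outage cost. It is a hypothetical worked example: combining monthly failures, one-day recovery and a $99,000 daily cost is an assumption made for illustration, since the survey does not say that the same organisations reported all three.

```python
def annual_outage_cost(failures_per_year, recovery_days, cost_per_day):
    """Estimate yearly outage cost as failures x days lost per failure x daily cost.

    All three inputs are illustrative assumptions supplied by the caller,
    not figures tied to any single survey respondent.
    """
    return failures_per_year * recovery_days * cost_per_day


if __name__ == "__main__":
    # Assumed scenario: one failure a month, one day to recover,
    # and the top of the lowest daily cost band quoted above.
    cost = annual_outage_cost(failures_per_year=12,
                              recovery_days=1,
                              cost_per_day=99_000)
    print(f"Estimated annual outage cost: ${cost:,}")
```

Even in this relatively mild scenario the estimate comes to $1,188,000 a year, more than the highest staffing band reported in the survey.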
With HPC storage installations expected to facilitate a wider variety of jobs, involving different file types, and with organisations developing a lower tolerance for failure, buyers will inevitably pay more attention to TCO.
Measuring the performance of HPC storage is an inexact science. There is a range of well-established parallel file systems, running on a variety of hardware. Each installation is built for the specific needs of the client and its chosen applications.
These are important considerations as HPC storage installations increasingly tackle a broader range of problems. The shortcomings of traditional approaches are becoming increasingly clear and it is harder to disguise or ignore the hidden costs of staffing and outages.