Kicking things off with HPC and Big Data:
Coming from the research side in 2005, I wasn’t sure what defined HPC and Big Data, or where to start. Sure, I had experience with a variety of operating systems on individual servers, but that was it. The hallway lab (see the previous blog) was more of a science project than anything else: it lacked proper design, scalability, and automation. However, it did give us an initial direction to aim for right out of the gate. One thing I knew was that I wanted to run the new initiative as an R&D group: dynamic, flexible, and driven by innovation.
Management demanded an environment that could grow significantly over time; the expectation was that data would grow at the pace of Moore’s Law or faster. Data would stream in at a steady rate from the outside for processing, consisting of files no larger than a few kilobytes. The resulting small file problem had to be addressed as early as possible. Speech algorithms use statistical analysis and access data randomly across the environment, so every piece of data had to have the same fast access time: it didn’t matter whether the data was collected six months or six minutes ago. There was no data expiration and no plan to delete data.
We wanted an architecture that was flexible from a technology perspective and able to scale massively. At first, people advised me to look at traditional HPC and Big Data vendors, so I did. They delivered solutions built on dedicated hardware with a software layer wrapped around it for ease of use, allowing configuration and minor system adjustments. Not very cost effective, with the focus mostly on hardware rather than software, and there wasn’t much flexibility if you wanted to deviate from the delivered solution. Growth was also a concern: I didn’t see those solutions scaling to handle our projected growth. The outcome would have been a static solution that was tough to maintain and grow. Not exactly what we were looking for at the time.
Instead, we went for a design built around software that could handle a variety of storage, network, and compute hardware, independent of the vendor. The ability to swap out devices whenever we felt the need to move to more advanced technology or another provider was essential: no vendor lock-in. An emphasis on automation tools for the heavy lifting, repetitive tasks, firmware upgrades, and testing was critical.
At the heart of the environment, we used a clustered file system to handle data access between the storage devices and the compute nodes. A cluster of servers ran the clustered file system software, creating an “abstraction” layer between storage devices and compute nodes and ultimately presenting a unified “logical” view of all our “physical” storage devices. The compute nodes accessed the data through POSIX file systems exported by the cluster. We wanted a cluster setup for both High Availability (HA) and performance.
The data-centric approach made it easy to add and remove storage devices. The cluster would be made aware of an additional storage device and add it to a file system, increasing both capacity and performance, followed by an optional data re-balance to make sure all devices carried roughly the same amount of data. That prevents a single storage device from being overloaded or reaching maximum capacity, all while the systems stay up and users keep hitting the file systems.
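The re-balance idea can be sketched in a few lines: repeatedly shift data from the fullest device to the emptiest until utilization is roughly even. This is an illustrative toy, not the actual file system’s algorithm; the function name and the 5% tolerance are assumptions for the example.

```python
# Toy sketch of a data re-balance: even out utilization across devices.
# Not the real clustered file system logic; purely illustrative.

def rebalance(used, tolerance=0.05):
    """used: dict of device name -> fraction of capacity in use (0.0-1.0)."""
    devices = dict(used)
    while True:
        fullest = max(devices, key=devices.get)
        emptiest = min(devices, key=devices.get)
        gap = devices[fullest] - devices[emptiest]
        if gap <= tolerance:
            break                     # close enough to balanced
        move = gap / 2                # shift half the gap between the two
        devices[fullest] -= move
        devices[emptiest] += move
    return devices

balanced = rebalance({"dev-a": 0.90, "dev-b": 0.40, "dev-c": 0.20})
```

The total amount of data is conserved; only its placement changes, which mirrors why the re-balance can run while the file systems stay online.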
The Platform is the name we internally use to describe all the software tools that make up the environment.
In the beginning, I convinced myself that I needed to introduce three tiers of storage: very expensive, expensive, and hopefully not so expensive. I assumed that performance could only be obtained with large, expensive storage devices; in other words, performance and cost would go hand in hand. Luckily, we first decided to see how far we could get with the “hopefully not so expensive” track. Think white box servers with a bunch of SATA drives running the AoE (ATA over Ethernet) network protocol.
So, we tried 10 SATA drives. Then we tried 100 drives and later 1000 drives. It worked! A server with SATA drives was a building block and performance was a matter of adding the right number of blocks to the Platform. We believed we could achieve all our data needs using building blocks.
We identified two data profiles based on the requirements, each with its own file systems:
- Store the incoming data, acting as WORM storage (Write Once, Read Many)
  - Adding building blocks to meet the required capacity (not performance)
- Store intermediate test results and user-specific tools
  - Substantial reading and writing
  - Adding building blocks to obtain the desired performance
  - Performance defined as a certain throughput per compute core (a requirement)
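That per-core throughput requirement makes capacity planning a back-of-the-envelope calculation: multiply cores by the required throughput per core and divide by what one building block delivers. All the numbers below are invented for illustration; only the arithmetic is the point.

```python
# Back-of-the-envelope sizing under a "throughput per compute core"
# requirement. The figures are assumptions, not our actual numbers.
import math

cores = 2000            # compute cores hitting the file system (assumed)
mb_per_core = 1.0       # required MB/s per core (assumed)
mb_per_block = 400.0    # MB/s one building block delivers (assumed)

blocks_needed = math.ceil(cores * mb_per_core / mb_per_block)
print(blocks_needed)    # 5
```

This is exactly why the building block model worked for us: meeting a new performance target was a matter of adding blocks, not redesigning the system.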
File systems had a size limit of ~100TB, because a file system check (aka fsck) would take too long in case of file system errors. Given the white box concept, we used RAID6 on all file systems. The incoming data was bundled and compressed into datasets to avoid the small file problem.
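Bundling many small files into one compressed dataset can be sketched with Python’s standard `tarfile` module. This is an illustration of the general technique, not our actual ingest tooling; the file names and paths are made up.

```python
# Illustrative sketch: pack many small files into one compressed "dataset"
# archive, sidestepping the small file problem. Names are invented.
import tarfile
import tempfile
from pathlib import Path

def bundle(files, archive_path):
    """Pack a list of small files into a single gzip-compressed tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for f in files:
            tar.add(f, arcname=Path(f).name)

def read_member(archive_path, name):
    """Read one member back without unpacking the whole archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.extractfile(name).read()

# Tiny demonstration with a throwaway file.
workdir = Path(tempfile.mkdtemp())
small = workdir / "utt-0001.txt"
small.write_text("a few kilobytes of speech metadata")
archive = workdir / "dataset-0001.tar.gz"
bundle([small], archive)
payload = read_member(archive, "utt-0001.txt")
```

The file system then sees one large object per dataset instead of thousands of kilobyte-sized files, which is far friendlier to metadata servers and fsck times alike.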
Eventually, we moved away from the white box building block for two main reasons. First, we had no idea when things broke; we needed an automated process to regularly query devices, asking them how they were doing (a positive query). Second, there was a lack of scalability, and we had to look for a solution that could scale beyond thousands of drives. Each server held a limited number of drives and used multiple 1Gbit/s connections, requiring a lot of (costly) network ports. We desperately needed a bigger and more efficient enclosure to keep up with the growth.
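The “positive query” idea can be sketched as a simple polling loop: instead of waiting for failures to surface, actively probe every device on a schedule and flag whatever fails to answer healthily. `check_device` below is a placeholder for a real probe (a SMART query, a ping, a RAID status check); the fleet data is invented.

```python
# Sketch of a "positive query" health poll: ask every device how it is
# doing instead of waiting for it to break. Purely illustrative.

def check_device(device):
    # Placeholder probe; a real one would query SMART data, RAID
    # controller state, or simply ping the enclosure.
    return device.get("healthy", False)

def poll(devices):
    """Return the names of devices that failed the positive query."""
    return [d["name"] for d in devices if not check_device(d)]

fleet = [
    {"name": "jbod-01", "healthy": True},
    {"name": "jbod-02", "healthy": False},
]
print(poll(fleet))  # ['jbod-02']
```

In practice the loop would run from a scheduler and feed an alerting system, so a dead drive shows up in minutes rather than when a user hits it.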
As the years went by we continuously upgraded the building blocks as new technology became available and we had to adapt to changing requirements. More details in a subsequent blog. Stay tuned!