To understand the CEPH storage architecture and the architecture of an Infinite Storage System explained in Part-1, we first need to understand the very basics of information storage. In ancient times, before the invention of paper and ink, humans shared information by passing it on orally to one another. The only storage medium available then was the human brain, so humans acted as both the processing unit and the storage medium. As civilization advanced, humans made different tools for hunting and other purposes. One of these tools was the chisel, which they used to carve information on rocks, stones or leaves.
Then came the invention of ink and paper. Here humans acted as the computing device, while the ink and paper served as the storage medium.
As humans advanced, so did the technology they used. The invention of magnetic tape was one of the top milestones we achieved. Here, the computer or computing device sits between you (the human) and your information (the tape); i.e., technology sits between you and your data, as shown in the diagram below. The technology decides where and how to put your data on the tape.
The above diagram can also be represented as shown below.
So far, this has been a brief history of storage in layman’s terms.
As the volume of information to be stored grew bigger and bigger, a scenario like the one shown below was adopted first: multiple storage disks were attached to a computer, and multiple humans handled the computer to increase computing capacity.
But this was also not sufficient to handle the growing information storage needs. Later on, the scenario above changed to the one shown in the diagram below. Only the number of humans handling the computer increased at this stage.
The volume of data kept growing day by day, and the setup above also became inadequate for the ever-expanding storage requirements. So instead of a single ordinary computer, we used a Giant Spendy Computer (a dedicated machine or supercomputer that could handle large volumes of data). With this architecture we were able to handle huge volumes of data effectively.
But over time, the volume of information became so big that even the giant spendy computer could not handle the reads and writes effectively. Later on, the individual disks were replaced by storage embedded in the computer itself; i.e., the computer had its own storage device, as shown in the diagram below. This architecture reduced the human effort to a certain extent.
Up to this point, we just kept updating or scaling the basic architecture to meet the growing storage requirements. To store the huge and ever-growing volume of data, the entire architecture had to be rebuilt rather than just scaled up, and thus came the invention of Storage Appliances.
Storage Appliance – Advantages and Limitations
A Storage Appliance is a physical device in which many computers and hard disks are incorporated into a single box or physical module. Roughly, a storage appliance looks as shown below.
There are many manufacturers of Storage Appliances, such as IBM and Oracle. They have their own proprietary hardware, proprietary software and their own support team. The diagram below gives a skeleton view of a Storage Appliance.
Storage appliances are found in almost every data center. Each vendor uses its own hardware and software for the appliance. Implementing such storage appliances improved the efficiency of large-scale data storage and offered relatively easy administration. The vendor-supplied or tailor-made software is built to work smoothly with the proprietary hardware, so storing and handling huge volumes of data was no longer an issue.
But the entire architecture costs a lot. Proprietary hardware costs more than standard hardware, the software is licensed, and you have to pay for any support or maintenance. The research and development behind a storage appliance also costs far more than that of a standard storage solution. In general, if you want to configure a storage system to handle huge volumes of data, you need to buy a storage appliance along with its support and maintenance, which will cost you a great deal of money.
Cloud Stacks And CEPH
Then came the invention of Cloud Stacks to handle different computing needs more efficiently and cost-effectively. The cloud stack architecture has formed the basis of most computing needs ever since.
There are three main cloud stacks in use:
- Compute Cloud
- Network Cloud
- Storage Cloud
CEPH is open source, community-based software that works with standard storage hardware. CEPH is commercially backed by a company named Inktank, which offers optional support and maintenance at a low cost. The CEPH storage architecture is similar to the architecture of a storage appliance, except for some key differences.
Ceph comes under the Storage Cloud stack architecture. For a better understanding of CEPH, you can refer to my previous blog post
CEPH – BASIC ARCHITECTURE
The skeleton view of a Ceph storage architecture is shown in the diagram below.
CEPH uses standard hardware (hardware that we already know and use), which effectively addresses the high cost of the proprietary hardware used in storage appliances. Almost all storage disks found in standard computers are supported by Ceph.
Ceph is also community-based open source software, so peer support is quite strong. This reduces maintenance costs significantly, while a support subscription from the vendor Inktank remains available as an option.
PHILOSOPHY AND DESIGN
The people behind the CEPH storage architecture wanted it to overcome the design flaws of previous models, with the following features:
- Open Source – The people behind Ceph wanted it to be an open source project because it’s the best available way to spread an emerging technology quickly.
- Community Based – Unlike many open source projects, Ceph is highly community focused. Any member of the community can have a say in what new features Ceph should have.
- Scalable – The Ceph storage architecture design team wanted Ceph to be infinitely scalable. We have discussed infinite scaling in Part-1, so I won’t go further into it in this post.
- No Single Point of Failure – The developers of Ceph wanted the architecture to have no single point of failure, not even one; i.e., it avoids a client-server model entirely.
- Software Based – Ceph was to be software oriented rather than a hardware-based architecture. A hardware-based architecture tends to have a single point of failure, and the cost of appliances is high.
- Self Sustaining – The developers wanted Ceph to be self-managing. Without that, a cluster with a large number of nodes would be a big problem: whenever a piece of hardware fails, the only way to fix things would be a manual task (replacing the defective hardware).
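To make the "No Single Point of Failure" and "Self Sustaining" goals a bit more concrete, here is a minimal toy sketch in Python. It is not Ceph's actual placement or recovery logic; the `Cluster` class, the `REPLICAS` constant and the node names are all hypothetical and exist only for illustration. The idea it shows: every object lives on more than one node, and when a node fails the cluster restores the missing copies on its own, with no human stepping in.

```python
import random

REPLICAS = 2  # assumed replication factor, chosen only for this illustration


class Cluster:
    """Toy cluster: maps each object name to the set of nodes holding a copy."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.placement = {}

    def store(self, obj):
        # Place copies of the object on REPLICAS distinct nodes.
        self.placement[obj] = set(random.sample(sorted(self.nodes), REPLICAS))

    def fail(self, node):
        # A node dies: forget it and its copies, then heal automatically.
        self.nodes.discard(node)
        for holders in self.placement.values():
            holders.discard(node)
        self.heal()

    def heal(self):
        # Re-create missing copies on surviving nodes that lack the object.
        for holders in self.placement.values():
            if not holders:
                continue  # every copy was lost; nothing left to copy from
            spare = sorted(self.nodes - holders)
            while len(holders) < REPLICAS and spare:
                holders.add(spare.pop())


cluster = Cluster(["node-a", "node-b", "node-c", "node-d"])
for name in ["obj1", "obj2", "obj3"]:
    cluster.store(name)

cluster.fail("node-a")  # the cluster repairs itself; no manual step needed
for name, holders in sorted(cluster.placement.items()):
    print(name, sorted(holders))
```

After `node-a` fails, every object is again held by two surviving nodes. In this toy model, the only manual task left is swapping out the broken machine, which is the kind of behaviour the list above describes.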
CEPH was released to the public after 8 years of hard work and around 20,000 code commits.
References
The diagrams and metaphors used are inspired by a talk introducing CEPH given by Ross Turk, Vice President at Inktank.