Ceph Infrastructure Planning

Ceph is open-source storage software that provides highly scalable object, block, and file-based storage. Because of its open-source nature it can run on almost any kind of hardware. The CRUSH algorithm balances and replicates all data throughout the cluster.

At first I ran Ceph on regular servers (HP ProLiant, Dell), but about two years ago I decided to experiment with custom configurations, hoping to squeeze more IOPS out of them. As I found out, a big part of getting the most out of Ceph is the network. My last Ceph configuration was so fast it took down all our servers 🙁

What was I doing?

In order to get the most out of Ceph I created a setup using only NVMe drives. Here is the configuration of my Ceph nodes:

Supermicro SYS-1028U-TN10RT (link)
Intel E5-2640v4 (10 cores, 2.4 GHz)
1x Intel SSD DC P4600
1x ASUS Hyper M.2 x16 Card
4x Samsung 960 Pro NVMe
1x Intel X520-DA2 dual-port 10GbE

I am using the Intel SSD DC P4600 as the main (fast) storage and the Samsung NVMe drives for slightly slower storage. Everything was looking dandy.

So what happened?

One of our clients woke up and thought, "hey, let's migrate some stuff today." The migration generated only 1 Gbps of traffic on the uplinks to the internet, so everything looked fine. Fifteen minutes later, everything went offline: all VMs stopped and storage was down. It turned out the client was migrating a set of databases to our network. After the scp copy finished, they extracted the files and imported them into the database server, generating a lot of IO on that virtual server. That virtual server was, however, part of a database cluster, so it replicated all that data to three other VMs.

As Ceph wrote the data to disk, it also replicated it across the Ceph pool and through the network, generating way too much traffic (20 Gbps per Ceph host) on the access layer. Ceph was so fast, it took down the entire network.
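A rough back-of-the-envelope sketch of why this happens, assuming a replicated pool with the default size of 3 (the function name and numbers are illustrative, not measured values from the incident):

```python
# Sketch of Ceph's write amplification on the network for a
# replicated pool. Each client write lands on the primary OSD,
# which forwards a copy to (pool_size - 1) replica OSDs over the
# cluster network.
def replication_traffic_gbps(client_write_gbps: float, pool_size: int = 3) -> float:
    """Extra inter-host traffic generated per unit of client writes."""
    return client_write_gbps * (pool_size - 1)

# A 10 Gbps burst of database imports becomes ~20 Gbps of
# replication traffic between the Ceph hosts:
print(replication_traffic_gbps(10.0))  # → 20.0
```

So even modest client write bursts are multiplied on the storage network, which is why the access layer saturated long before the internet uplinks did.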

How to resolve

As storage (especially Ceph 😉 ) is getting faster, the network topology needs to change in order to handle the traffic. A design that is becoming very popular is the leaf-spine design.

This topology replaces the well-known access/aggregation/core design; by putting more bandwidth on the connections to the spines, we can handle much more traffic.
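When sizing the spine uplinks, the number to watch is the oversubscription ratio of each leaf. A minimal sketch (the port counts and speeds below are hypothetical examples, not my actual switches):

```python
# Oversubscription of a leaf switch: total server-facing bandwidth
# divided by total spine-facing bandwidth. A ratio of 1.0 means the
# leaf is non-blocking; higher ratios mean the uplinks can saturate.
def oversubscription_ratio(downlinks: int, downlink_gbps: float,
                           uplinks: int, uplink_gbps: float) -> float:
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Example: 48x 10 GbE server ports with 4x 40 GbE uplinks to the spines:
print(oversubscription_ratio(48, 10, 4, 40))  # → 3.0
```

For a replication-heavy workload like Ceph, you want this ratio as close to 1.0 as the budget allows.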

A 10 Gbit network is the recommended minimum for Ceph. 1 Gbit will work, but the latency and bandwidth limitations make it almost unacceptable for a production network; 10 Gbit is a significant improvement in latency alone. It is also recommended to connect each Ceph node to two different leaf switches, separating external traffic (clients/VMs) from internal traffic (replication).
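Ceph supports this split natively: the public and cluster networks can be defined in `ceph.conf`, so replication traffic stays on its own leaf. A minimal sketch, with placeholder subnets you would replace with your own:

```ini
# ceph.conf fragment separating client and replication traffic.
# The subnets below are placeholders, not my production networks.
[global]
# Client/VM-facing traffic (external leaf)
public network = 192.168.10.0/24
# OSD replication and recovery traffic (internal leaf)
cluster network = 192.168.20.0/24
```

With this in place, a database import like the one above would hammer the cluster network without starving client traffic on the public side.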