Data, Storage and how to get from there to here
- Organizations generate data.
- Either via gathering external data, or by internal computations.
- Data needs to be stored, and data needs to be retrieved.
- The world of organizational data retrieval is going through rapid
changes.
- We will discuss some of the challenges
- ...and some of the methods being explored these days.
Data Explosion
- The first thing we need to handle is that data grows exponentially.
- ...doubling every 1-2 years.
- Although the capacity of disks managed to double every year,
- ...their speed did not.
- And even with capacity growing - so does the need for electricity:
- to run the disks.
- to cool off data centers.
Data Need - ILM
- If we could store all data on fast disks all the time - life would be
sweet.
- Unfortunately, this is not economically viable - so we need to move
less needed data into cheaper (and slower) storage.
- Thus a term was coined - ILM - Information Life-cycle Management.
- There are two things normally done here:
- Manually moving data to slower storage (e.g. backup and delete
from main storage).
- Automatic tiering - storage systems move data from expensive disks to
cheaper disks based on LRU information.
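- A minimal Python sketch of such LRU-based automatic tiering (the tier size,
block IDs and class name below are made up for illustration):

    from collections import OrderedDict

    # Blocks live on the fast tier; when it fills up, the least-recently-used
    # block is demoted to the slow (cheaper) tier.
    class TieredStore:
        def __init__(self, fast_capacity):
            self.fast_capacity = fast_capacity
            self.fast = OrderedDict()   # block_id -> data, ordered by recency
            self.slow = {}              # block_id -> data (cheaper, slower tier)

        def read(self, block_id):
            if block_id in self.fast:
                self.fast.move_to_end(block_id)      # mark as recently used
                return self.fast[block_id]
            data = self.slow.pop(block_id)           # promote on access
            self._put_fast(block_id, data)
            return data

        def write(self, block_id, data):
            self.slow.pop(block_id, None)
            self._put_fast(block_id, data)

        def _put_fast(self, block_id, data):
            self.fast[block_id] = data
            self.fast.move_to_end(block_id)
            if len(self.fast) > self.fast_capacity:  # demote the LRU block
                old_id, old_data = self.fast.popitem(last=False)
                self.slow[old_id] = old_data

    store = TieredStore(fast_capacity=2)
    store.write("a", "...")
    store.write("b", "...")
    store.write("c", "...")    # "a" is demoted to the slow tier
    print(sorted(store.slow))  # ['a']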
Data - The Need for Speed
- As data grows - finding it and processing it takes longer.
- CPUs became faster, RAM became exponentially larger, and we now get
exponential growth in disk capacity.
- The bottleneck moved to the performance of the disk systems.
- Very large disk arrays can handle very high throughput for sequential
I/O...
- ...but they are lousy at handling random I/O and inter-dependent I/O:
- If a disk can handle around 120 random I/O operations per
second (IOPS).
- 100 disks will handle around 12000 IOPS - assuming I/O is split
evenly.
- With fully inter-dependent I/Os (when the result of one determines
what to send next) - even 1000 disks won't do more
than 120 IOPS.
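- A back-of-the-envelope Python sketch of this IOPS arithmetic (using the
~120 IOPS-per-disk figure assumed above):

    # With independent I/Os the array scales linearly; with fully
    # inter-dependent I/Os only one disk is busy at a time.
    IOPS_PER_DISK = 120

    def array_iops(num_disks, independent=True):
        return num_disks * IOPS_PER_DISK if independent else IOPS_PER_DISK

    print(array_iops(100))                      # 12000 - evenly split random I/O
    print(array_iops(1000, independent=False))  # 120   - serialized, dependent I/O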
The good ole' Hard Disk Drives
- Let's take a closer look at the good old hard disk drives,
- and find what causes their limitations.
HDD Structure
- A disk is made of one or more platters.
- Each platter is split into tracks.
- Each track is split into sectors (of a fixed size)
- thus outer tracks contain more sectors.
- Each platter may be two-sided, with a read/write head per side.
- All read/write heads are handled by a single arm, and thus all
move together.
- All tracks that can be accessed at once on all platers (i.e. that sit
at the same distance from the outer edge) are collectively called a
Cylinder.
The Rotating Nature Of HDDs
- Platters revolve at a fixed speed.
- We have disks revolving at 5400 Revolutions per minute (RPM), 7200 RPM,
10000 RPM and 15000 RPM.
- The arm moves only in or out (i.e. between cylinders). To access different
sectors on a disk - the platters need to rotate.
- To get to a given location on the disk, it needs to rotate to the right
angle, and the arm needs to move in or out to the right cylinder.
- This operation is called "seek".
- The worst seek operation will require a full rotation of the platters.
- For a 10000 RPM disk - the rotation time is 6 milli-seconds.
- Add the time to move the arm - and you get several milli-seconds
of seek time.
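- A small Python sketch of this rotational-latency arithmetic (taking the
worst case as one full rotation, before adding arm-movement time):

    # A full rotation at 10000 RPM takes 6 ms; the average rotational delay
    # is roughly half of a full rotation.
    def rotation_ms(rpm):
        return 60_000.0 / rpm            # milli-seconds per full rotation

    for rpm in (5400, 7200, 10000, 15000):
        full = rotation_ms(rpm)
        print(f"{rpm:5d} RPM: full rotation {full:.1f} ms, "
              f"average rotational delay {full / 2:.1f} ms")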
HDD - Sequential access patterns
- When accessing data sequentially:
- the read head is placed on the first sector
- and data is read as the disk revolves.
- This implies that throughput depends on the RPM of the disk.
- which means data is read slower as we move to inner tracks (which
contain fewer sectors).
- For long sequential access - the initial seek time is negligible.
- Note: since moving the heads between tracks takes time, the sectors of
inner tracks can be stored with a rotational offset, so a sequential read
does not miss a full rotation when crossing tracks.
HDD - Random access patterns
- For random access - the seek time becomes very dominant.
- Consider a disk that can read 100MB per second.
- The time for a seek could be 5 milli-seconds.
- The time to read 4KB of data will be ~0.04 milli-seconds.
- even if we need to read 1MB of data - the read time will be only
10 milli-seconds, so the seek still adds a significant overhead.
- Note: caching will help very little when randomly accessing a data set
far larger than RAM (e.g. a system that contains 100TB of data).
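- A small Python sketch of the resulting effective throughput (using the
hypothetical 100MB/second and 5 milli-second figures above, and treating
1MB as 1024KB):

    # For random reads, each access pays one seek plus the transfer time,
    # so the disk delivers far less than its sequential throughput.
    SEQ_MB_PER_SEC = 100.0
    SEEK_MS = 5.0

    def random_read_throughput(read_kb):
        transfer_ms = read_kb / (SEQ_MB_PER_SEC * 1024) * 1000   # data transfer time
        total_ms = SEEK_MS + transfer_ms                         # plus one seek per read
        return read_kb / 1024 / (total_ms / 1000)                # MB per second

    print(f"4KB reads: {random_read_throughput(4):.2f} MB/s")     # ~0.78 MB/s
    print(f"1MB reads: {random_read_throughput(1024):.1f} MB/s")  # ~66.7 MB/s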
Disk Interfaces
- Let us talk a little about disk interfaces
- which dictate how a computer is connected to the disks
- and how it "talks" to the disks.
The IDE Interface
- Originally designed by Western Digital (~1986)
- Was used in the cheaper PC market.
- The controller is part of the disk drive
- So the "controller" on the host was a simple bridge (pass-through)
- The ATA protocol was sector-based with simple read and write commands.
- which was different from its predecessors (ST-506 and ESDI).
- The max cable length was short
- So it was used for internal storage.
The SCSI Interface
- An earlier (~1982), and much more elaborate, peripheral interface.
- Defined commands, protocols, initiator/target operation mode.
- Allowed attaching internal or external disks
- supported chaining devices (with an address per device)
- and sharing of devices between machines.
- Later enhanced to work over networks:
- FCP - (SCSI over) Fibre-Channel (used in SANs).
- iSCSI - SCSI over IP
The SATA Interface
- Serial-ATA - faster and better than ATA
- (which was renamed to PATA - Parallel ATA)
- Due to electrical signalling issues - serial protocols scale
better than parallel protocols.
- and require simpler, thinner and cheaper cables.
- Supports Native Command Queuing (NCQ) at the controller level.
- Today, most PCs come with SATA disks.
The SAS Interface
- Serial-Attached SCSI - a modern (~2008) serial version of the (parallel)
SCSI protocol.
- allows connecting many more disks (up to 64K) to each chain, than what
parallel SCSI allowed (up to 16).
- Works at faster speeds (up to 6Gbps - Vs. ~5 Gbps for the fastest SCSI
standard)
HDD - fighting for speed
- Disk speeds grow very slowly
- while the other parts of the machines increase in speed at an
exponential rate.
- Thus, a lot of creativity was used to make disks work faster.
HDD - Getting More Bandwidth
- The first thing was getting more throughput (or bandwidth)
- This was achieved by combining disks in groups.
- This is usually done by using disk striping
- which is the simplest form of RAID (RAID 0)
- If 1 disk = 100MB/second, 10 disks = 1GB/second.
- With many disks - the chance of a failure increases
- Thus protection schemes were formed:
- Mirroring - RAID 1 - no performance penalty.
- parity protection - RAID 4 - parity (re)calculations slow down writes.
- Distributed (rotating) parity protection - RAID 5
- dual-failure parity protection - RAID 6.
- Combining mirroring with striping - RAID 10.
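- A very rough Python sketch of the usable capacity and streaming throughput
of these RAID levels (disk size and speed below are made-up figures):

    # Hypothetical array of N identical 100MB/s, 1TB disks.
    DISK_MBPS, DISK_TB = 100, 1

    def raid_summary(level, n):
        if level == "RAID0":        # pure striping - all capacity, no protection
            usable = n * DISK_TB
        elif level == "RAID1":      # mirroring - half the capacity
            usable = n * DISK_TB / 2
        elif level == "RAID5":      # one disk's worth of (rotating) parity
            usable = (n - 1) * DISK_TB
        elif level == "RAID6":      # two disks' worth of parity
            usable = (n - 2) * DISK_TB
        elif level == "RAID10":     # striped mirrors
            usable = n * DISK_TB / 2
        else:
            raise ValueError(level)
        return usable, n * DISK_MBPS   # streaming reads can use all spindles (roughly)

    for level in ("RAID0", "RAID1", "RAID5", "RAID6", "RAID10"):
        usable, mbps = raid_summary(level, 10)
        print(f"{level:6s}: usable {usable:5.1f} TB, ~{mbps} MB/s streaming reads")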
HDD - Getting More IOPS
- The IOPS problem is harder - when we deal with random access.
- Here, using 10 disks will give us only ~1200 IOPS.
- So we aggregate hundreds of disks - but then we get too much capacity.
- Which means that to get high IOPS we use a very small part of
the storage capacity
- which is wasteful in purchase price, cooling needs, and floor space.
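- A small Python sketch of this capacity-waste arithmetic (the target IOPS
and data-set size below are made up for illustration):

    # To reach a random-IOPS target with ~120-IOPS spindles, we need far more
    # raw capacity than the data set itself requires.
    IOPS_PER_DISK, DISK_TB = 120, 1
    target_iops, dataset_tb = 100_000, 20

    disks_needed = -(-target_iops // IOPS_PER_DISK)        # ceiling division
    raw_capacity = disks_needed * DISK_TB
    print(f"disks needed: {disks_needed}")                  # 834
    print(f"raw capacity: {raw_capacity} TB for a {dataset_tb} TB data set")
    print(f"capacity actually used: {dataset_tb / raw_capacity:.1%}")  # ~2.4%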
FLASH - Non-Volatile Memory
- Someone created FLASH memory.
- Where no electrical current is required to sustain the data.
- (A form of EEPROM - but FLASH is erased in large blocks, which makes
it denser and cheaper).
- This was first used in multimedia devices (e.g. cameras) and embedded
devices (e.g. phones)
- As the market for FLASH grew - the price per MB went down.
- Until someone decided to use NAND-FLASH memory in the form-factor and
interface of a disk.
- And thus we got SSD - Solid-State Disks.
- First in the form of the M-SYSTEMS "disk on key"
- and later in other form factors.
The Internal Mechanism Of NAND-FLASH
- NAND-FLASH memory can be read a page at a time.
- it can be programmed in only one way - changing a bit from 1 to 0.
- Changing back to 1 can be done only in large blocks (e.g. 512KB).
- The last operation is called "erasing a block".
- This means flash is much faster in reading than in writing
- (because a write might require erasing an entire block, and
re-programming its contents).
- FLASH has one great limitation - each cell may be programmed up to 100,000
times (in SLC) or less (in MLC - where each cell stores 2 or more bits).
- After this - the cell deteriorates and loses its data.
- If we have a "hot-spot" location with many writes - this could be a real
problem real fast.
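- A tiny Python sketch of the "programming can only clear bits" rule,
modeled on a single byte:

    # An erased cell reads as all 1s; programming ANDs new data in (1 -> 0
    # only); getting a bit back to 1 requires erasing the whole block.
    ERASED = 0xFF

    def program(cell, data):
        return cell & data          # can only turn 1s into 0s

    cell = ERASED
    cell = program(cell, 0b1100_1010)
    print(bin(cell))                # 0b11001010
    cell = program(cell, 0b1111_0000)
    print(bin(cell))                # 0b11000000 - bits only went down
    # program(cell, 0b1111_1111) would NOT restore the cleared bits;
    # that requires erasing the containing block back to 0xFF.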
The Internal Mechanism Of an SSD
- SSDs supply an interface of writing in pages - hiding the "full block
erasure" nature of NAND-FLASH.
- To achieve more capacity - an SSD may contain several FLASH chips.
- This will allow faster access as well (striping data across the FLASH
chips).
- To make the writes faster, it uses dynamic mapping:
- If we write to the same cell again:
- The data is written to a different block
- and the LBA (Logical Block Address) is mapped to the new
location.
- This technique is known as LSA - Log-Structured Array.
- This mapping is also used to handle wear leveling - see below...
- SSDs often employ standard disk interfaces (e.g. SATA, SAS).
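- A minimal Python sketch of the LSA-style mapping idea (page numbering and
class name are made up; garbage collection is not shown):

    # Rewriting an LBA does not touch the old flash page - the data goes to
    # the next free page and the mapping table is updated.
    class FlashTranslationLayer:
        def __init__(self):
            self.mapping = {}        # LBA -> physical flash page
            self.invalid = set()     # pages that hold stale (superseded) data
            self.next_free = 0       # next fresh page in the write log

        def write(self, lba):
            if lba in self.mapping:
                self.invalid.add(self.mapping[lba])   # old copy becomes garbage
            self.mapping[lba] = self.next_free        # program a fresh page instead
            self.next_free += 1                       # (garbage collection not shown)

    ftl = FlashTranslationLayer()
    ftl.write(lba=7)
    ftl.write(lba=7)          # rewriting the same LBA lands on a new physical page
    print(ftl.mapping[7])     # 1
    print(ftl.invalid)        # {0}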
Endurance
- As we said, each FLASH cell can sustain up to 100,000 program/erase
cycles.
- This makes the life of an SSD very short under heavy write load.
- To alleviate this - the SSD controller employs "wear-leveling" techniques.
- It counts how many times each page was programmed.
- When a write arrives, it chooses a block of pages that were least
written into.
- This way, all cells will have performed approximately the same
number of program/erase cycles, at any given time.
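- A minimal Python sketch of such a wear-leveling policy (the per-block
counters below are made-up numbers):

    # Keep a program/erase counter per block and direct new writes to the
    # block that has been erased the fewest times.
    erase_counts = {0: 17, 1: 3, 2: 9, 3: 3}       # hypothetical counters

    def pick_block_for_write(counts):
        return min(counts, key=counts.get)          # least-worn block first

    block = pick_block_for_write(erase_counts)
    erase_counts[block] += 1                        # account for the new erase cycle
    print(block, erase_counts)                      # 1 {0: 17, 1: 4, 2: 9, 3: 3}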
Write Amplification
- Each write requires programming a fresh page (a page that wasn't
programmed since its block was last erased).
- Writing into the same LBA again - invalidates the specific page that
contains the old data.
- Over time this causes fragmentation, which will leave no fresh pages
for writing.
- Before this happens - the SSD will choose the least-valid blocks, copy
their valid pages to an empty block - and mark all those blocks as free.
- These techniques may cause a single page write to perform several
page writes - write amplification.
- If each write causes two page writes on average, the SSD is said to
have a write amplification factor of 2.
- Enterprise-Grade SSD designers employ complicated algorithms to reduce
the write-amplification factor of their devices.
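- A small Python sketch of the write-amplification arithmetic (the
garbage-collection figure below is a made-up assumption):

    # The factor is the total number of pages the SSD programmed, divided by
    # the pages the host actually asked to write.
    host_page_writes = 1_000_000
    gc_page_copies = 1_000_000      # assume garbage collection copied as much again

    write_amplification = (host_page_writes + gc_page_copies) / host_page_writes
    print(write_amplification)      # 2.0 - each host write costs two flash writes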
SSD And The HDD Interface Limitations
- The HDD interfaces were designed with slow devices (rotating disks) in
mind.
- As such, they introduce a very large latency on top of that of the FLASH
memory of SSDs.
- If an SLC FLASH memory can fetch a 4KB page with a latency of ~25
micro-seconds,
- The SAS or SATA protocols may add more than 100 micro-seconds of
overhead.
- In addition, going via the system peripheral bus can limit throughput.
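- A small Python sketch of this latency arithmetic (using the ~25 and ~100
micro-second figures above):

    # Most of the observed latency comes from the interface, not the media.
    flash_us, interface_us = 25, 100
    total_us = flash_us + interface_us
    print(f"total latency  : {total_us} us")
    print(f"media share    : {flash_us / total_us:.0%}")      # 20%
    print(f"interface share: {interface_us / total_us:.0%}")  # 80%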
PCIe Interface For SSD
- In 2007, Fusion-IO introduced an SSD on a controller that attaches
directly to a PCIe slot.
- This bypasses the slower interfaces, and allows for read latency
of ~60 micro-seconds (for SLC FLASH) or a little more (for MLC FLASH).
- This can also use the 2GB/second throughput of PCIe (vs. 300MB/second for
SAS/SATA and 600MB/second for SAS2).
SSD and Price fights
- FLASH media costs more than hard disk platters. Reliable FLASH-based
SSDs cost much more than hard disks, at similar capacities.
- If consumer-grade hard disks cost about 7.5 cents per GB - consumer-grade
SSDs cost about $1.5 per GB.
- and these are not enterprise-grade SSDs (nor enterprise-grade HDDs).
- As a result, putting all of your data on SSDs is very expensive.
- Especially when considering an enterprise with hundreds of TB of data.
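- A small Python sketch of this price arithmetic, applied to a hypothetical
100TB data set (using the consumer-grade prices above, with 1TB = 1000GB):

    hdd_per_gb, ssd_per_gb = 0.075, 1.50     # dollars per GB
    capacity_gb = 100 * 1000                 # 100TB data set

    print(f"HDD: ${hdd_per_gb * capacity_gb:>10,.0f}")   # $     7,500
    print(f"SSD: ${ssd_per_gb * capacity_gb:>10,.0f}")   # $   150,000
    print(f"ratio: {ssd_per_gb / hdd_per_gb:.0f}x")      # 20x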
SSD and Price fights - Splitting Based On Applications
- The first method to fight the high price of SSD - is using it for specific
applications:
- Either applications where performance will give you the best ROI
(Return On Investment)
- Or applications that won't meet their response-time requirements without
the increased performance.
- The problems:
- Putting SSDs inside traditional RAID systems - eliminates most
performance gains.
- Putting SSDs outside the RAID systems - eliminates the ease of use
of a single storage management system and the storage applications
support of the large RAID manufacturers (snapshots, mirroring,
cloning...)
SSD and Price fights - Data Tiering
- Another method to fight the price - is using data tiering.
- In this mode - you'll build a single system with both medias,
- and make it automatically move data from SSD storage to HDD storage, when
it is less used.
- The problems:
- Sometimes the raw data is too large - if you have a single DB with
random access patterns - how will you split it evenly?
- Even with a clear "active-set", moving the data on demand is not
practical, as it takes a lot of time to move large quantities.
- A partial relief - using manual tiering - let the user decide which
LUN contains critical production data, and which contains less
critical data.
SSD and Price fights - Data Caching
- Another possibility - using SSDs as cache for HDD systems.
- This will place all writes onto the SSDs - which will slowly move them
back to the HDDs later on.
- This can be useful for applications that do mostly writes and fewer reads.
- Or for applications with a small active set (i.e. that read the same set
of disk blocks over and over again) - but with an active set too large to
fit into a host system's RAM.
Caching inside the storage system
- Some systems go with the direction of placing the cache inside the
storage system.
- The advantages:
- easier management (the single management system notion).
- Ease of provisioning the cache to different servers.
- The disadvantages:
- high latency overhead of the RAID system on top of the SSD.
- Requires a completely new, SSD-aware path inside the RAID system,
to avoid killing the performance gain of SSDs.
Caching inside the host
- Others (sometimes - the same people) attempt to put the cache inside the
servers.
- Advantages:
- Full use of the low-latency of SSD storage.
- Allows doing the caching at the file-system level.
- Disadvantages:
- No flexibility in assigning SSD space to servers - needs careful
planning and on-going management.
- PCI-e SSDs (which are faster than SAS SSDs) are not hot-swappable -
any need for upgrade means downtime.
- Not manageable without proper software support (e.g. the storage
snapshots won't include data freshly written to the SSD cache).
- Requires a whole new software stack to really become useful.
- Example: EMC's VFCache ("project lightning"): http://www.emc.com/about/news/press/2012/20120206-01.htm
References
- IDE and ATA-1 - wikipedia -
http://en.wikipedia.org/wiki/Parallel_ATA#IDE_and_ATA-1
- SAS - Serial-Attached SCSI -
http://en.wikipedia.org/wiki/Serial_attached_SCSI
- Solid-State drive -
http://en.wikipedia.org/wiki/Solid-state_drive
Originally written by
guy keren