Data, Storage and how to get from there to here
- Organizations generate data.
- Either via gathering external data, or by internal computations.
- Data needs to be stored, and data needs to be retrieved.
- The world of organizational data retrieval is going through rapid
changes.
- We will discuss some of the challenges
- ...and some of the methods being explored these days.
Data Explosion
- The first thing we need to handle is that data grows exponentially.
- ...doubling every 1-2 years.
- Although the capacity of disks managed to double every year,
- ...their speed did not.
- And even with capacity growing - so does the need for electricity:
- to run the disks.
- to cool off data centers.
Data Need - ILM
- If we could store all data on fast disks all the time - life would be
sweet.
- Unfortunately, this is not economically viable - so we need to move
less needed data into cheaper (and slower) storage.
- Thus a term was coined - ILM - Information Life-cycle Management.
- There are two things normally done here:
- Manually moving data to slower storage (e.g. backup and delete
from main storage).
- Automatic tiering - storage systems move data from expensive disks to
cheaper disks based on LRU information.
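- A minimal Python sketch of such LRU-based automatic tiering (the tier size,
block IDs and class name below are made up for illustration):

    from collections import OrderedDict

    # Blocks live on the fast tier; when it fills up, the least-recently-used
    # block is demoted to the slow (cheaper) tier.
    class TieredStore:
        def __init__(self, fast_capacity):
            self.fast_capacity = fast_capacity
            self.fast = OrderedDict()   # block_id -> data, ordered by recency
            self.slow = {}              # block_id -> data (cheaper, slower tier)

        def read(self, block_id):
            if block_id in self.fast:
                self.fast.move_to_end(block_id)      # mark as recently used
                return self.fast[block_id]
            data = self.slow.pop(block_id)           # promote on access
            self._put_fast(block_id, data)
            return data

        def write(self, block_id, data):
            self.slow.pop(block_id, None)
            self._put_fast(block_id, data)

        def _put_fast(self, block_id, data):
            self.fast[block_id] = data
            self.fast.move_to_end(block_id)
            if len(self.fast) > self.fast_capacity:  # demote the LRU block
                old_id, old_data = self.fast.popitem(last=False)
                self.slow[old_id] = old_data

    store = TieredStore(fast_capacity=2)
    store.write("a", "...")
    store.write("b", "...")
    store.write("c", "...")    # "a" is demoted to the slow tier
    print(sorted(store.slow))  # ['a']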
Data - The Need for Speed
- As data grows - finding it and processing it takes longer.
- CPUs became faster, RAM became exponentially larger, and we now get
exponential growth in disk capacity.
- The bottleneck moved to the performance of the disk systems.
- Very large disk arrays can handle very high throughput for sequential
I/O...
- ...but they are lousy at handling random I/O and inter-dependent I/O:
- If a disk can handle around 120 random I/O operations per
second (IOPS).
- 100 disks will handle around 12000 IOPS - assuming I/O is split
evenly.
- With fully inter-dependent I/Os (when the result of one determines
what to send next) - even 1000 disks won't do more
than 120 IOPS.
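- A back-of-the-envelope Python sketch of this IOPS arithmetic (using the
~120 IOPS-per-disk figure assumed above):

    # With independent I/Os the array scales linearly; with fully
    # inter-dependent I/Os only one disk is busy at a time.
    IOPS_PER_DISK = 120

    def array_iops(num_disks, independent=True):
        return num_disks * IOPS_PER_DISK if independent else IOPS_PER_DISK

    print(array_iops(100))                      # 12000 - evenly split random I/O
    print(array_iops(1000, independent=False))  # 120   - serialized, dependent I/O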
The good ole' Hard Disk Drives
- Let's take a closer look at the good old hard disk drives,
- and find what causes their limitations.
HDD Structure
- A disk is made of one or more platters.
- Each platter is split into tracks.
- Each track is split into sectors (of a fixed size)
- thus outer tracks contain more sectors.
- Each platter may be two-sided, with a read/write head per side.
- All read/write heads are handled by a single arm, and thus all
move together.
- All tracks that can be accessed at once on all platers (i.e. that sit
at the same distance from the outer edge) are collectively called a
Cylinder.
The Rotating Nature Of HDDs
- Platters revolve at a fixed speed.
- We have disks revolving at 5400 Revolutions per minute (RPM), 7200 RPM,
10000 RPM and 15000 RPM.
- The arm moves only in or out (i.e. between cylinders). To access different
sectors on a disk - the platters need to rotate.
- To get to a given location on the disk, it needs to rotate to the right
angle, and the arm needs to move in or out to the right cylinder.
- This operation is called "seek".
- The worst seek operation will require a full rotation of the platters.
- For a 10000 RPM disk - the rotation time is 6 milli-seconds.
- Add the time to move the arm - and you get several milli-seconds
of seek time.
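- A small Python sketch of this rotational-latency arithmetic (taking the
worst case as one full rotation, before adding arm-movement time):

    # A full rotation at 10000 RPM takes 6 ms; the average rotational delay
    # is roughly half of a full rotation.
    def rotation_ms(rpm):
        return 60_000.0 / rpm            # milli-seconds per full rotation

    for rpm in (5400, 7200, 10000, 15000):
        full = rotation_ms(rpm)
        print(f"{rpm:5d} RPM: full rotation {full:.1f} ms, "
              f"average rotational delay {full / 2:.1f} ms")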
HDD - Sequential access patterns
- When accessing data sequentially:
- the read head is placed on the first sector
- and data is read as the disk revolves.
- This implies that throughput depends on the RPM of the disk.
- which means data is read slower as we move to inner tracks (which
contain fewer sectors).
- For long sequential access - the initial seek time is negligible.
- Note: since moving the heads between tracks takes time, the sectors of
inner tracks can be stored with a rotational offset, so a sequential read
does not miss a full rotation when crossing tracks.
HDD - Random access patterns
- For random access - the seek time becomes very dominant.
- Consider a disk that can read 100MB per second.
- The time for a seek could be 5 milli-seconds.
- The time to read 4KB of data will be ~0.04 milli-seconds.
- even if we need to read 1MB of data - the read time will be only
10 milli-seconds, so the seek still adds a significant overhead.
- Note: caching will help very little when randomly accessing a data set
far larger than RAM (e.g. a system that contains 100TB of data).
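- A small Python sketch of the resulting effective throughput (using the
hypothetical 100MB/second and 5 milli-second figures above, and treating
1MB as 1024KB):

    # For random reads, each access pays one seek plus the transfer time,
    # so the disk delivers far less than its sequential throughput.
    SEQ_MB_PER_SEC = 100.0
    SEEK_MS = 5.0

    def random_read_throughput(read_kb):
        transfer_ms = read_kb / (SEQ_MB_PER_SEC * 1024) * 1000   # data transfer time
        total_ms = SEEK_MS + transfer_ms                         # plus one seek per read
        return read_kb / 1024 / (total_ms / 1000)                # MB per second

    print(f"4KB reads: {random_read_throughput(4):.2f} MB/s")     # ~0.78 MB/s
    print(f"1MB reads: {random_read_throughput(1024):.1f} MB/s")  # ~66.7 MB/s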
Disk Interfaces
- Let us talk a little about disk interfaces
- which dictate how a computer is connected to the disks
- and how it "talks" to the disks.
The IDE Interface
- Originally designed by Western Digital (~1986)
- Was used in the cheaper PC market.
- The controller is part of the disk drive
- So the "controller" on the host was a simple bridge (pass-through)
- The ATA protocol was sector-based with simple read and write commands.
- which was different from its predecessors (ST-506 and ESDI).
- The max cable length was short
- So it was used for internal storage.
The SCSI Interface
- An earlier (~1982), and much more elaborate, peripheral interface.
- Defined commands, protocols, initiator/target operation mode.
- Allowed attaching internal or external disks
- supported chaining devices (with an address per device)
- and sharing of devices between machines.
- Later enhanced to work over networks:
- FCP - (SCSI over) Fibre-Channel (used in SANs).
- iSCSI - SCSI over IP
The SATA Interface
- Serial-ATA - faster and better than ATA
- (which was renamed to PATA - Parallel ATA)
- Due to electrical signalling issues - serial protocols scale
better than parallel protocols.
- and require simpler, thinner and cheaper cables.
- Supports Native Command Queuing (NCQ) at the controller level.
- Today, most PCs come with SATA disks.
The SAS Interface
- Serial-Attached SCSI - a modern (~2008) serial version of the (parallel)
SCSI protocol.
- allows connecting many more disks (up to 64K) to each chain, than what
parallel SCSI allowed (up to 16).
- Works at faster speeds (up to 6Gbps - Vs. ~5 Gbps for the fastest SCSI
standard)
HDD - fighting for speed
- Disk speeds grow very slowly
- while the other parts of the machines increase in speed at an
exponential rate.
- Thus, a lot of creativity was used to make disks work faster.
HDD - Getting More Bandwidth
- The first thing was getting more throughput (or bandwidth)
- This was achieved by combining disks in groups.
- This is usually done by using disk striping
- which is the simplest form of RAID (RAID 0)
- If 1 disk = 100MB/second, 10 disks = 1GB/second.
- With many disks - the chance of a failure increases
- Thus protection schemes were formed:
- Mirroring - RAID 1 - no performance penalty.
- parity protection - RAID 4 - parity (re)calculations slow down writes.
- Distributed (rotating) parity protection - RAID 5
- dual-failure parity protection - RAID 6.
- Combining mirroring with striping - RAID 10.
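- A very rough Python sketch of the usable capacity and streaming throughput
of these RAID levels (disk size and speed below are made-up figures):

    # Hypothetical array of N identical 100MB/s, 1TB disks.
    DISK_MBPS, DISK_TB = 100, 1

    def raid_summary(level, n):
        if level == "RAID0":        # pure striping - all capacity, no protection
            usable = n * DISK_TB
        elif level == "RAID1":      # mirroring - half the capacity
            usable = n * DISK_TB / 2
        elif level == "RAID5":      # one disk's worth of (rotating) parity
            usable = (n - 1) * DISK_TB
        elif level == "RAID6":      # two disks' worth of parity
            usable = (n - 2) * DISK_TB
        elif level == "RAID10":     # striped mirrors
            usable = n * DISK_TB / 2
        else:
            raise ValueError(level)
        return usable, n * DISK_MBPS   # streaming reads can use all spindles (roughly)

    for level in ("RAID0", "RAID1", "RAID5", "RAID6", "RAID10"):
        usable, mbps = raid_summary(level, 10)
        print(f"{level:6s}: usable {usable:5.1f} TB, ~{mbps} MB/s streaming reads")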
HDD - Getting More IOPS
- The IOPS problem is harder - when we deal with random access.
- Here, using 10 disks will give us only ~1200 IOPS.
- So we aggregate hundreds of disks - but then we get too much capacity.
- Which means that to get high IOPS we use a very small part of
the storage capacity
- which is wasteful in purchase price, cooling needs, and floor space.
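- A small Python sketch of this capacity-waste arithmetic (the target IOPS
and data-set size below are made up for illustration):

    # To reach a random-IOPS target with ~120-IOPS spindles, we need far more
    # raw capacity than the data set itself requires.
    IOPS_PER_DISK, DISK_TB = 120, 1
    target_iops, dataset_tb = 100_000, 20

    disks_needed = -(-target_iops // IOPS_PER_DISK)        # ceiling division
    raw_capacity = disks_needed * DISK_TB
    print(f"disks needed: {disks_needed}")                  # 834
    print(f"raw capacity: {raw_capacity} TB for a {dataset_tb} TB data set")
    print(f"capacity actually used: {dataset_tb / raw_capacity:.1%}")  # ~2.4%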
FLASH - Non-Volatile Memory
- Someone created FLASH memory.
- Where no electrical current is required to sustain the data.
- (A form of EEPROM - but FLASH is erased in large blocks, which makes
it denser and cheaper).
- This was first used in multimedia devices (e.g. cameras) and embedded
devices (e.g. phones)
- As the market for FLASH grew - the price per MB went down.
- Until someone decided to use NAND-FLASH memory in the form-factor and
interface of a disk.
- And thus we got SSD - Solid-State Disks.
- First in the form of the M-SYSTEMS "disk on key"
- and later in other form factors.
The Internal Mechanism Of NAND-FLASH
- NAND-FLASH memory can be read a page at a time.
- it can be programmed in only one way - changing a bit from 1 to 0.
- Changing back to 1 can be done only in large blocks (e.g. 512KB).
- The last operation is called "erasing a block".
- This means flash is much faster in reading than in writing
- (because a write might require erasing an entire block, and
re-programming its contents).
- FLASH has one great limitation - each cell may be programmed up to 100,000
times (in SLC) or less (in MLC - where each cell stores 2 or more bits).
- After this - the cell deteriorates and loses its data.
- If we have a "hot-spot" location with many writes - this could be a real
problem real fast.
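- A tiny Python sketch of the "programming can only clear bits" rule,
modeled on a single byte:

    # An erased cell reads as all 1s; programming ANDs new data in (1 -> 0
    # only); getting a bit back to 1 requires erasing the whole block.
    ERASED = 0xFF

    def program(cell, data):
        return cell & data          # can only turn 1s into 0s

    cell = ERASED
    cell = program(cell, 0b1100_1010)
    print(bin(cell))                # 0b11001010
    cell = program(cell, 0b1111_0000)
    print(bin(cell))                # 0b11000000 - bits only went down
    # program(cell, 0b1111_1111) would NOT restore the cleared bits;
    # that requires erasing the containing block back to 0xFF.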
The Internal Mechanism Of an SSD
- SSDs supply an interface of writing in pages - hiding the "full block
erasure" nature of NAND-FLASH.
- To achieve more capacity - an SSD may contain several FLASH chips.
- This will allow faster access as well (striping data across the FLASH
chips).
- To make the writes faster, it uses dynamic mapping:
- If we write to the same cell again:
- The data is written to a different block
- and the LBA (Logical Block Address) is mapped to the new
location.
- This technique is known as LSA - Log-Structured Array.
- This mapping is also used to handle wear leveling - see below...
- SSDs often employ standard disk interfaces (e.g. SATA, SAS).
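- A minimal Python sketch of the LSA-style mapping idea (page numbering and
class name are made up; garbage collection is not shown):

    # Rewriting an LBA does not touch the old flash page - the data goes to
    # the next free page and the mapping table is updated.
    class FlashTranslationLayer:
        def __init__(self):
            self.mapping = {}        # LBA -> physical flash page
            self.invalid = set()     # pages that hold stale (superseded) data
            self.next_free = 0       # next fresh page in the write log

        def write(self, lba):
            if lba in self.mapping:
                self.invalid.add(self.mapping[lba])   # old copy becomes garbage
            self.mapping[lba] = self.next_free        # program a fresh page instead
            self.next_free += 1                       # (garbage collection not shown)

    ftl = FlashTranslationLayer()
    ftl.write(lba=7)
    ftl.write(lba=7)          # rewriting the same LBA lands on a new physical page
    print(ftl.mapping[7])     # 1
    print(ftl.invalid)        # {0}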
Endurance
- As we said, each FLASH cell can sustain up to 100,000 program/erase
cycles.
- This makes the life of an SSD very short under heavy write load.
- To alleviate this - the SSD controller employs "wear-leveling" techniques.
- It counts how many times each page was programmed.
- When a write arrives, it chooses a block of pages that were least
written into.
- This way, all cells will have performed approximately the same
number of program/erase cycles, at any given time.
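- A minimal Python sketch of such a wear-leveling policy (the per-block
counters below are made-up numbers):

    # Keep a program/erase counter per block and direct new writes to the
    # block that has been erased the fewest times.
    erase_counts = {0: 17, 1: 3, 2: 9, 3: 3}       # hypothetical counters

    def pick_block_for_write(counts):
        return min(counts, key=counts.get)          # least-worn block first

    block = pick_block_for_write(erase_counts)
    erase_counts[block] += 1                        # account for the new erase cycle
    print(block, erase_counts)                      # 1 {0: 17, 1: 4, 2: 9, 3: 3}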
Write Amplification
- Each write requires programming a fresh page (a page that wasn't
programmed since its block was last erased).
- Writing into the same LBA again - invalidates the specific page that
contains the old data.
- Over time this causes fragmentation, which will leave no fresh pages
for writing.
- Before this happens - the SSD will choose the least-valid blocks, copy
their valid pages to an empty block - and mark all those blocks as free.
- These techniques may cause a single page write to perform several
page writes - write amplification.
- If each write causes two page writes on average, the SSD is said to
have a write amplification factor of 2.
- Enterprise-Grade SSD designers employ complicated algorithms to reduce
the write-amplification factor of their devices.
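- A small Python sketch of the write-amplification arithmetic (the
garbage-collection figure below is a made-up assumption):

    # The factor is the total number of pages the SSD programmed, divided by
    # the pages the host actually asked to write.
    host_page_writes = 1_000_000
    gc_page_copies = 1_000_000      # assume garbage collection copied as much again

    write_amplification = (host_page_writes + gc_page_copies) / host_page_writes
    print(write_amplification)      # 2.0 - each host write costs two flash writes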
SSD And The HDD Interface Limitations
- The HDD interfaces were designed with slow devices (rotating disks) in
mind.
- As such, they introduce a very large latency on top of that of the FLASH
memory of SSDs.
- If an SLC FLASH memory can fetch a 4KB page with a latency of ~25
micro-seconds,
- The SAS or SATA protocols may add more than 100 micro-seconds of
overhead.
- In addition, going via the system peripheral bus can limit throughput.
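- A small Python sketch of this latency arithmetic (using the ~25 and ~100
micro-second figures above):

    # Most of the observed latency comes from the interface, not the media.
    flash_us, interface_us = 25, 100
    total_us = flash_us + interface_us
    print(f"total latency  : {total_us} us")
    print(f"media share    : {flash_us / total_us:.0%}")      # 20%
    print(f"interface share: {interface_us / total_us:.0%}")  # 80%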
PCIe Interface For SSD
- In 2007, Fusion-IO introduced an SSD on a controller that attaches
directly to a PCIe slot.
- This bypasses the slower interfaces, and allows for read latency
of ~60 micro-seconds (for SLC FLASH) or a little more (for MLC FLASH).
- This can also use the 2GB/second throughput of PCIe (vs. 300MB/second for
SAS/SATA and 600MB/second for SAS2).
SSD and Price fights
- FLASH media costs more than hard disk platters. Reliable FLASH-based
SSDs cost much more than hard disks, at similar capacities.
- If consumer-grade hard disks cost about 7.5 cents per GB - consumer-grade
SSDs cost about $1.5 per GB.
- and these are not enterprise-grade SSDs (nor enterprise-grade HDDs).
- As a result, putting all of your data on SSDs is very expensive.
- Especially when considering an enterprise with hundreds of TB of data.
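- A small Python sketch of this price arithmetic, applied to a hypothetical
100TB data set (using the consumer-grade prices above, with 1TB = 1000GB):

    hdd_per_gb, ssd_per_gb = 0.075, 1.50     # dollars per GB
    capacity_gb = 100 * 1000                 # 100TB data set

    print(f"HDD: ${hdd_per_gb * capacity_gb:>10,.0f}")   # $     7,500
    print(f"SSD: ${ssd_per_gb * capacity_gb:>10,.0f}")   # $   150,000
    print(f"ratio: {ssd_per_gb / hdd_per_gb:.0f}x")      # 20x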
SSD and Price fights - Splitting Based On Applications
- The first method to fight the high price of SSD - is using it for specific
applications:
- Either applications where performance will give you the best ROI
(Return On Investment)
- Or applications that won't meet their response-time requirements without
the increased performance.
- The problems:
- Putting SSDs inside traditional RAID systems - eliminates most
performance gains.
- Putting SSDs outside the RAID systems - eliminates the ease of use
of a single storage management system and the storage applications
support of the large RAID manufacturers (snapshots, mirroring,
cloning...)
SSD and Price fights - Data Tiering
- Another method to fight the price - is using data tiering.
- In this mode - you'll build a single system with both medias,
- and make it automatically move data from SSD storage to HDD storage, when
it is less used.
- The problems:
- Sometimes the raw data is too large - if you have a single DB with
random access patterns - how will you split it evenly?
- Even with a clear "active-set", moving the data on demand is not
practical, as it takes a lot of time to move large quantities.
- A partial relief - using manual tiering - let the user decide which
LUN contains critical production data, and which contains less
critical data.
SSD and Price fights - Data Caching
- Another possibility - using SSDs as cache for HDD systems.
- This will place all writes onto the SSDs - which will slowly move them
back to the HDDs later on.
- This can be useful for applications that do mostly writes and fewer reads.
- Or for applications with a small active set (i.e. that read the same set
of disk blocks over and over again) - but with an active set too large to
fit into a host system's RAM.
Caching inside the storage system
- Some systems go with the direction of placing the cache inside the
storage system.
- The advantages:
- easier management (the single management system notion).
- Ease of provisioning the cache to different servers.
- The disadvantages:
- high latency overhead of the RAID system on top of the SSD.
- Requires a completely new, SSD-aware path inside the RAID system,
to avoid killing the performance gain of SSDs.
Caching inside the host
- Others (sometimes - the same people) attempt to put the cache inside the
servers.
- Advantages:
- Full use of the low-latency of SSD storage.
- Allows doing the caching at the file-system level.
- Disadvantages:
- No flexibility in assigning SSD space to servers - needs careful
planning and on-going management.
- PCI-e SSDs (which are faster than SAS SSDs) are not hot-swappable -
any need for upgrade means downtime.
- Not manageable without proper software support (e.g. the storage
snapshots won't include data freshly written to the SSD cache).
- Requires a whole new software stack to really become useful.
- Example: EMC's VFCache ("project lightning"): http://www.emc.com/about/news/press/2012/20120206-01.htm
References
- IDE and ATA-1 - wikipedia -
http://en.wikipedia.org/wiki/Parallel_ATA#IDE_and_ATA-1
- SAS - Serial-Attached SCSI -
http://en.wikipedia.org/wiki/Serial_attached_SCSI
- Solid-State drive -
http://en.wikipedia.org/wiki/Solid-state_drive
Originally written by
guy keren