ZFS Pool setup for SuperMicro server.
For this SuperMicro server, I'm using three-way mirroring, based mainly on the advice here. The 33% storage
efficiency is acceptable for this system, partly because the 2TB SAS drives are used (so their reliability
is unknown), and since they were fairly inexpensive (around $30 each), having more mirrors can't hurt.
Choice of Hard Drive
The BPN-SAS-846A backplane accepts both SATA and SAS hard drives,
although it's tricky to find inexpensive used SAS drives; I'm not sure why.
An example of an inexpensive (used) SAS drive is the Seagate 2TB ST32000445SS, which
can be had for around $25 these days (in lots of ten drives). Here I'll be using 24 of them, for a total
of around $650 (plus $100 for four spare drives, to replace any that fail).
Even at these lowish unit prices, the total cost of all the hard drives adds up, rivaling the
price of the server itself.
The Seagate 2TB ST32000445SS is actually a "SED", one of the acronyms used for self-encrypting drives.
See here on how to use the SED feature with hdparm on SATA drives.
Since the Seagate 2TB ST32000445SS is a SAS drive and hdparm only works with SATA drives
(see here),
I'm not sure yet whether there's any way to encrypt the disk encryption key using Linux.
It seems it might only be possible to perform a secure erase of the ST32000445SS drives
using proprietary software
(see section "13.7 LSI MegaRAID SafeStore Encryption Services" of the "MegaRAID SAS Software User Guide" manual) and
a different controller card with SafeStore software pre-installed.
That wouldn't be of much use except for secure-erasing the drives
prior to disposing of them, as we obviously want to use the M1015 card
with Linux.
The Seagate 2TB ST32000445SS has a sector size of 512 bytes, so you have to use ashift=9
when creating the ZFS pool. That's a little old-fashioned compared to modern "Advanced Format" drives,
which have a 4k sector size and use ashift=12, but these ST32000445SS SAS drives seem reliable
and fairly speedy, and they run at less than around 40C even in the hot summertime.
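To double-check the sector size a drive reports before creating the pool, something like the following can be used (a quick sketch; the device name is just an example, and the -o ashift=9 option is only needed if you want the sector-size assumption to be explicit rather than auto-detected):
cat /sys/block/sde/queue/logical_block_size     # logical sector size, 512 for these drives
cat /sys/block/sde/queue/physical_block_size    # physical sector size
sudo zpool create -o ashift=9 testpool mirror sde sdd   # hypothetical pool name, explicit ashift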
Since it's a SAS drive, remember that in smartmontools
you have to use the -A flag to smartctl
to read the drive's S.M.A.R.T. info.
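For example (the device path here is just one of the by-id names used later on this page):
sudo smartctl -A /dev/disk/by-id/scsi-35000c500342222cb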
In Debian Jessie, the drives show up as /dev/disk/by-id/scsi-35000c500342222cb,
which is not their serial number
(perhaps it's the drive's WWN),
making it a little inconvenient to locate a specific drive.
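To map a by-id name back to the serial number printed on the drive's label, the drive's identity page from smartctl can be used, e.g. (again using one of the drives from the tables below):
sudo smartctl -i /dev/disk/by-id/scsi-35000c500342222cb | grep -i 'serial'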
Configuring the ZFS Pool
The ZFS pool provides 16TB of usable space (48TB raw) and consists of eight three-drive
mirror vdevs. If a single controller fails, only one group of eight drives
out of the 24 would drop out, so each vdev would still have one
mirror drive of redundancy remaining. Admittedly, on the surface,
three-way mirroring seems wasteful of both power and storage, as only a
third of the raw capacity is usable, but the eight additional drives
don't use much power, and a proper "enterprise-level" of
redundancy in a simple configuration that also provides very high
performance is a requirement for this system. The three-way
mirror also allows the technique of "splitting the mirror" to create
an immediate off-site backup (resilvering the replacement drives only
takes a few hours).
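For reference, once the pool is a three-way mirror, "splitting the mirror" can be done with zpool split, which detaches the last drive of each mirror vdev and turns those drives into a new, exported pool; a minimal sketch (the new pool name is arbitrary):
sudo zpool split sminception sminception_offsite
The detached drives can then be moved off-site and imported on another machine with zpool import.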
It's a good idea to take a photo of the label on each SAS hard drive
before placing it into a drive slot, so that it'll be easier later on to
figure out which drive has failed. I've connected the SAS-846A
backplane to the controller cards as follows:
SAS slots  | M1015 controller | iPass connections
#0 to #7   | #0 | JSM1 goes to controller #0 port #0; JSM2 goes to controller #0 port #1
#8 to #15  | #1 | JSM3 goes to controller #1 port #0; JSM4 goes to controller #1 port #1
#16 to #23 | #2 | JSM5 goes to controller #2 port #0; JSM6 goes to controller #2 port #1
To connect the drives in the SAS-846A backplane to the M1015 controller
cards, I needed to hunt down the proper Mini SAS to Mini SAS SFF-8087
"iPass" cables, which also turned out to be quite tricky; I resorted
to reading the Molex data sheet to find the magical part number:
79576-2104. These elusive 79576-2104 cables are apparently what's needed
to connect the M1015 cards to the backplane. They're basically
four SATA cables "rolled into one", and the one-meter length
is an eye-popping $16 each. I can't believe they even consider them to
be "enterprise" connectors: they seem very flimsy compared to proper
SCSI connectors, but I guess this is how things are in the SAS / SATA
world -- they need to keep costs down however they can.
Drives on controller #0:
Device-by-id name      | Linux device name | Backplane slot number
scsi-35000c50034157a8f | /dev/sdi | SAS #0
scsi-35000c5003424a11f | /dev/sdh | SAS #1
scsi-35000c50034241d4f | /dev/sdg | SAS #2
scsi-35000c5003425027f | /dev/sdf | SAS #3
scsi-35000c500342222cb | /dev/sde | SAS #4
scsi-35000c5003417889f | /dev/sdd | SAS #5
scsi-35000c50034241cd3 | /dev/sdc | SAS #6
scsi-35000c50034249a1b | /dev/sdb | SAS #7
Drives on controller #1:
Device-by-id name      | Linux device name | Backplane slot number
scsi-35000c50034248657 | /dev/sdq | SAS #8
scsi-35000c50034150623 | /dev/sdp | SAS #9
scsi-35000c50034247a63 | /dev/sdo | SAS #10
scsi-35000c50034157d6f | /dev/sdn | SAS #11
scsi-35000c5003424a0e7 | /dev/sdm | SAS #12
scsi-35000c5003423f1ff | /dev/sdl | SAS #13
scsi-35000c50034149b6b | /dev/sdk | SAS #14
scsi-35000c5003414d617 | /dev/sdj | SAS #15
Drives on controller #2:
Device-by-id name      | Linux device name | Backplane slot number
scsi-35000c500342497a3 | /dev/sdr | SAS #16
scsi-35000c50034192787 | /dev/sds | SAS #17
scsi-35000c50034247a43 | /dev/sdy | SAS #18
scsi-35000c5003423c29b | /dev/sdx | SAS #19
scsi-35000c50034249537 | /dev/sdw | SAS #20
scsi-35000c50034241d67 | /dev/sdv | SAS #21
scsi-35000c5003423f337 | /dev/sdu | SAS #22
scsi-35000c5003418ca53 | /dev/sdt | SAS #23
Creating the ZFS pool
I created the ZFS pool in two steps, first setting it up as a set of two-way
mirrors, so that controller #2 could be used for the initial import of
the data (copying from drives attached to a local controller
is far quicker than a network copy). So the initial pool layout was
like this:
VDEV name | SAS drive member | SAS drive member
mirror-0 | SAS_#0 | SAS_#8
mirror-1 | SAS_#1 | SAS_#9
mirror-2 | SAS_#2 | SAS_#10
mirror-3 | SAS_#3 | SAS_#11
mirror-4 | SAS_#4 | SAS_#12
mirror-5 | SAS_#5 | SAS_#13
mirror-6 | SAS_#6 | SAS_#14
mirror-7 | SAS_#7 | SAS_#15
The command to create the ZFS pool uses the /dev/disk/by-id
device names, as follows:
sudo zpool create -f sminception \
mirror scsi-35000c50034157a8f scsi-35000c50034248657 \
mirror scsi-35000c5003424a11f scsi-35000c50034150623 \
mirror scsi-35000c50034241d4f scsi-35000c50034247a63 \
mirror scsi-35000c5003425027f scsi-35000c50034157d6f \
mirror scsi-35000c500342222cb scsi-35000c5003424a0e7 \
mirror scsi-35000c5003417889f scsi-35000c5003423f1ff \
mirror scsi-35000c50034241cd3 scsi-35000c50034149b6b \
mirror scsi-35000c50034249a1b scsi-35000c5003414d617
At this point, it's always a good idea to export the ZFS pool and import it again, to
make sure nothing strange happens and that the devices are always shown by their
device id in the zpool status
output (rather than by /dev/sdb-style device names,
which can change after drives are added or removed). When importing your newly-created
ZFS pool, you may get a ridiculous error such as "One or more devices are missing from the system."
with a nonsensical suggestion such as
"The pool cannot be imported. Attach the missing devices and try again."
and a useless link to the generic Oracle documentation.
When exporting and then re-importing the pool, I found that you always need to
use the -d /dev/disk/by-id
option to the zpool import
command, otherwise the ZFS pool
cannot be properly imported and devices mysteriously "go missing" (but only intermittently).
Creating the ZFS pool "by path" rather than "by id" does not work any better (the devices
still mysteriously go missing occasionally), so the trick is to always use the "-d /dev/disk/by-id"
option, e.g. as follows:
zpool export sminception
zpool import -d /dev/disk/by-id sminception
So at this point, export and re-import the newly-created zpool as above, and check its status using zpool status sminception, e.g.:
root@sm:~# zpool status sminception
pool: sminception
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
sminception ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
scsi-35000c50034157a8f ONLINE 0 0 0
scsi-35000c50034248657 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
scsi-35000c5003424a11f ONLINE 0 0 0
scsi-35000c50034150623 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
scsi-35000c50034241d4f ONLINE 0 0 0
scsi-35000c50034247a63 ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
scsi-35000c5003425027f ONLINE 0 0 0
scsi-35000c50034157d6f ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
scsi-35000c500342222cb ONLINE 0 0 0
scsi-35000c5003424a0e7 ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
scsi-35000c5003417889f ONLINE 0 0 0
scsi-35000c5003423f1ff ONLINE 0 0 0
mirror-6 ONLINE 0 0 0
scsi-35000c50034241cd3 ONLINE 0 0 0
scsi-35000c50034149b6b ONLINE 0 0 0
mirror-7 ONLINE 0 0 0
scsi-35000c50034249a1b ONLINE 0 0 0
scsi-35000c5003414d617 ONLINE 0 0 0
A quick test of sequential I/O performance at this point (two-way mirrors) gives about 700
MB/s for local sequential writes and around 950 MB/s for local sequential reads. Obviously,
this is local only, not over the network.
dd if=/dev/zero of=test conv=fsync bs=1M count=1000000
222198+0 records in
222198+0 records out
232991490048 bytes (233 GB) copied, 321.869 s, 724 MB/s
dd if=test of=/dev/null conv=fsync bs=1M
127933+0 records in
127932+0 records out
134146424832 bytes (134 GB) copied, 139.599 s, 961 MB/s
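One caveat about using /dev/zero as the test source: if compression is enabled on the dataset, zeros compress almost entirely away and the throughput numbers become meaningless, so it's worth confirming the setting first:
zfs get compression sminception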
At this point, I created a new "dataset" called inception
and copied the data into this new dataset.
zfs create sminception/inception
Make sure not to fill your ZFS pools too much; they should keep a lot of
free space, as ZFS doesn't cope well when pools get too full.
The usable space looks like this at the moment.
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
sminception/inception 15325971968 6738102656 8587869312 44% /mnt/sm/inception
sminception 8587869312 0 8587869312 0% /sminception
$ df -hl
Filesystem Size Used Avail Use% Mounted on
sminception/inception 15T 6.3T 8.0T 44% /mnt/sm/inception
sminception 8.0T 0 8.0T 0% /sminception
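ZFS's own tools report the same information from the pool's point of view, for example:
zpool list sminception
zfs list -o space sminception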
Adding eight more drives, to make a three-way mirror
Now that the copy of the data into the pool has completed, to
increase the redundancy of the pool I populate the remaining eight
slots and attach the drives on controller #2 to the mirrored pool
sminception, so that it becomes like this:
VDEV name | SAS drive member | SAS drive member | SAS drive member
mirror-0 | SAS_#0 | SAS_#8 | SAS_#16
mirror-1 | SAS_#1 | SAS_#9 | SAS_#17
mirror-2 | SAS_#2 | SAS_#10 | SAS_#18
mirror-3 | SAS_#3 | SAS_#11 | SAS_#19
mirror-4 | SAS_#4 | SAS_#12 | SAS_#20
mirror-5 | SAS_#5 | SAS_#13 | SAS_#21
mirror-6 | SAS_#6 | SAS_#14 | SAS_#22
mirror-7 | SAS_#7 | SAS_#15 | SAS_#23
The commands used were:
sudo zpool attach -f sminception \
scsi-35000c50034248657 scsi-35000c500342497a3
sudo zpool attach -f sminception \
scsi-35000c50034150623 scsi-35000c50034192787
sudo zpool attach -f sminception \
scsi-35000c50034247a63 scsi-35000c50034247a43
sudo zpool attach -f sminception \
scsi-35000c50034157d6f scsi-35000c5003423c29b
sudo zpool attach -f sminception \
scsi-35000c5003424a0e7 scsi-35000c50034249537
sudo zpool attach -f sminception \
scsi-35000c5003423f1ff scsi-35000c50034241d67
sudo zpool attach -f sminception \
scsi-35000c50034149b6b scsi-35000c5003423f337
sudo zpool attach -f sminception \
scsi-35000c5003414d617 scsi-35000c5003418ca53
After the above additions, the resilvering progressed as follows, which appears to be ridiculously slow;
apparently the default settings are designed to conserve bandwidth, at the cost of leaving your data exposed for
longer to the pool's reduced redundancy during the resilver. See here, here and here for some explanation and suggestions on tuning.
# zpool status sminception
pool: sminception
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Jul 22 10:39:55 2015
1.21G scanned out of 4.38T at 177M/s, 7h13m to go
1.18G resilvered, 0.03% done
config:
NAME STATE READ WRITE CKSUM
sminception ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
scsi-35000c50034157a8f ONLINE 0 0 0
scsi-35000c50034248657 ONLINE 0 0 0
scsi-35000c500342497a3 ONLINE 0 0 0 (resilvering)
mirror-1 ONLINE 0 0 0
scsi-35000c5003424a11f ONLINE 0 0 0
scsi-35000c50034150623 ONLINE 0 0 0
scsi-35000c50034192787 ONLINE 0 0 0 (resilvering)
mirror-2 ONLINE 0 0 0
scsi-35000c50034241d4f ONLINE 0 0 0
scsi-35000c50034247a63 ONLINE 0 0 0
scsi-35000c50034247a43 ONLINE 0 0 0 (resilvering)
mirror-3 ONLINE 0 0 0
scsi-35000c5003425027f ONLINE 0 0 0
scsi-35000c50034157d6f ONLINE 0 0 0
scsi-35000c5003423c29b ONLINE 0 0 0 (resilvering)
mirror-4 ONLINE 0 0 0
scsi-35000c500342222cb ONLINE 0 0 0
scsi-35000c5003424a0e7 ONLINE 0 0 0
scsi-35000c50034249537 ONLINE 0 0 0 (resilvering)
mirror-5 ONLINE 0 0 0
scsi-35000c5003417889f ONLINE 0 0 0
scsi-35000c5003423f1ff ONLINE 0 0 0
scsi-35000c50034241d67 ONLINE 0 0 0 (resilvering)
mirror-6 ONLINE 0 0 0
scsi-35000c50034241cd3 ONLINE 0 0 0
scsi-35000c50034149b6b ONLINE 0 0 0
scsi-35000c5003423f337 ONLINE 0 0 0 (resilvering)
mirror-7 ONLINE 0 0 0
scsi-35000c50034249a1b ONLINE 0 0 0
scsi-35000c5003414d617 ONLINE 0 0 0
scsi-35000c5003418ca53 ONLINE 0 0 0 (resilvering)
errors: No known data errors
Speedy Resilvering Tunable zfs_resilver_delay
The tunables I decided to use to speed up the resilver (and scrubs) were as follows on Debian Jessie:
ZFS Tunable's Purpose | Command to change the ZFS Tunable on Debian Jessie
Prioritize resilvering by setting the delay to zero (the default is 2) | echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
Prioritize scrubs by setting the delay to zero (the default is 4) | echo 0 > /sys/module/zfs/parameters/zfs_scrub_delay
Maximum number of in-flight I/Os (adjust for your environment; the default is 32) | echo 128 > /sys/module/zfs/parameters/zfs_top_maxinflight
Resilver for five seconds per TXG (the default is 3000) | echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
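Note that these echo commands only last until the next reboot (or module reload). If you wanted them to persist on Debian, the usual mechanism is a ZFS module options file, e.g. something like the following, although since these settings favor resilver/scrub traffic over normal I/O you may prefer to apply them only for the duration of a resilver:
cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_resilver_delay=0 zfs_scrub_delay=0 zfs_top_maxinflight=128 zfs_resilver_min_time_ms=5000
EOF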
- For the two-way mirror, local sequential write I/O performance was
around 724 MB/s; with the three-way mirror it's around 775 MB/s.
- For the two-way mirror, local sequential read I/O performance was
around 961 MB/s; with the three-way mirror it's around 1024 MB/s.
dd if=/dev/zero of=test conv=fsync bs=1M count=100000
55871+0 records in
55871+0 records out
58584989696 bytes (59 GB) copied, 75.5945 s, 775 MB/s
dd if=test of=/dev/null conv=fsync bs=1M
55871+0 records in
55871+0 records out
58584989696 bytes (59 GB) copied, 56.3685 s, 1.0 GB/s
Speedy Resilvering Tunable zfs_resilver_delay
on Apple OSX Yosemite
On Apple OSX Yosemite, see here for how to do it.
The following command sets kernel tunables that prioritize
resilvering and scrubbing at the expense of everything else, so they
will likely noticeably reduce performance if you want to use your pool
for anything else during a resilver or a scrub.
sudo /usr/sbin/sysctl -w \
kstat.zfs.darwin.tunable.scrub_max_active=6 \
kstat.zfs.darwin.tunable.zfs_resilver_delay=0 \
kstat.zfs.darwin.tunable.zfs_scrub_delay=0
Note that the above tunables will revert to their defaults after a reboot,
but if you need to set them back sooner, the defaults are:
Tunable | Default
scrub_max_active | 2
zfs_resilver_delay | 2
zfs_scrub_delay | 4
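So restoring the defaults by hand would look something like this:
sudo /usr/sbin/sysctl -w \
    kstat.zfs.darwin.tunable.scrub_max_active=2 \
    kstat.zfs.darwin.tunable.zfs_resilver_delay=2 \
    kstat.zfs.darwin.tunable.zfs_scrub_delay=4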
Temperature of Hard Drives
The following script can be used to check the temperatures of the drives
in the slots, to see if any are overheating. In hot weather, the
drives may get to around 40C.
Slot temperatures
~~~~~~~~~~~~~~~~~
echo "Drives on controller 0"
echo "~~~~~~~~~~~~~~~~~~~~~~"
echo "scsi-35000c50034249a1b SAS #7"
smartctl /dev/disk/by-id/scsi-35000c50034249a1b -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034241cd3 SAS #6"
smartctl /dev/disk/by-id/scsi-35000c50034241cd3 -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003417889f SAS #5"
smartctl /dev/disk/by-id/scsi-35000c5003417889f -A|grep 'Current Drive Temperature'
echo "scsi-35000c500342222cb SAS #4"
smartctl /dev/disk/by-id/scsi-35000c500342222cb -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003425027f SAS #3"
smartctl /dev/disk/by-id/scsi-35000c5003425027f -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034241d4f SAS #2"
smartctl /dev/disk/by-id/scsi-35000c50034241d4f -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003424a11f SAS #1"
smartctl /dev/disk/by-id/scsi-35000c5003424a11f -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034157a8f SAS #0"
smartctl /dev/disk/by-id/scsi-35000c50034157a8f -A|grep 'Current Drive Temperature'
echo "Drives on controller 1"
echo "~~~~~~~~~~~~~~~~~~~~~~"
echo "scsi-35000c5003414d617 SAS #15"
smartctl /dev/disk/by-id/scsi-35000c5003414d617 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034149b6b SAS #14"
smartctl /dev/disk/by-id/scsi-35000c50034149b6b -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003423f1ff SAS #13"
smartctl /dev/disk/by-id/scsi-35000c5003423f1ff -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003424a0e7 SAS #12"
smartctl /dev/disk/by-id/scsi-35000c5003424a0e7 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034157d6f SAS #11"
smartctl /dev/disk/by-id/scsi-35000c50034157d6f -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034247a63 SAS #10"
smartctl /dev/disk/by-id/scsi-35000c50034247a63 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034150623 SAS #9"
smartctl /dev/disk/by-id/scsi-35000c50034150623 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034248657 SAS #8"
smartctl /dev/disk/by-id/scsi-35000c50034248657 -A|grep 'Current Drive Temperature'
echo "Drives on controller 2"
echo "~~~~~~~~~~~~~~~~~~~~~~"
echo "scsi-35000c500342497a3 SAS #16"
smartctl /dev/disk/by-id/scsi-35000c500342497a3 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034192787 SAS #17"
smartctl /dev/disk/by-id/scsi-35000c50034192787 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034247a43 SAS #18"
smartctl /dev/disk/by-id/scsi-35000c50034247a43 -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003423c29b SAS #19"
smartctl /dev/disk/by-id/scsi-35000c5003423c29b -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034249537 SAS #20"
smartctl /dev/disk/by-id/scsi-35000c50034249537 -A|grep 'Current Drive Temperature'
echo "scsi-35000c50034241d67 SAS #21"
smartctl /dev/disk/by-id/scsi-35000c50034241d67 -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003423f337 SAS #22"
smartctl /dev/disk/by-id/scsi-35000c5003423f337 -A|grep 'Current Drive Temperature'
echo "scsi-35000c5003418ca53 SAS #23"
smartctl /dev/disk/by-id/scsi-35000c5003418ca53 -A|grep 'Current Drive Temperature'
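The same check can also be written as a loop; here's a sketch (run as root) covering the drives on controller #0, using the by-id-to-slot mapping from the tables above. Extend the array for the other two controllers; note that bash associative arrays don't preserve insertion order:
#!/bin/bash
# by-id name -> backplane slot (drives on M1015 controller #0)
declare -A SLOT=(
  [scsi-35000c50034157a8f]="SAS #0"
  [scsi-35000c5003424a11f]="SAS #1"
  [scsi-35000c50034241d4f]="SAS #2"
  [scsi-35000c5003425027f]="SAS #3"
  [scsi-35000c500342222cb]="SAS #4"
  [scsi-35000c5003417889f]="SAS #5"
  [scsi-35000c50034241cd3]="SAS #6"
  [scsi-35000c50034249a1b]="SAS #7"
)
for id in "${!SLOT[@]}"; do
  echo "$id ${SLOT[$id]}"
  smartctl -A "/dev/disk/by-id/$id" | grep 'Current Drive Temperature'
done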
Output is like this initially (on a hot day):
Drives on controller 0
~~~~~~~~~~~~~~~~~~~~~~
scsi-35000c50034249a1b SAS #7
Current Drive Temperature: 30 C
scsi-35000c50034241cd3 SAS #6
Current Drive Temperature: 30 C
scsi-35000c5003417889f SAS #5
Current Drive Temperature: 29 C
scsi-35000c500342222cb SAS #4
Current Drive Temperature: 30 C
scsi-35000c5003425027f SAS #3
Current Drive Temperature: 30 C
scsi-35000c50034241d4f SAS #2
Current Drive Temperature: 30 C
scsi-35000c5003424a11f SAS #1
Current Drive Temperature: 29 C
scsi-35000c50034157a8f SAS #0
Current Drive Temperature: 30 C
Drives on controller 1
~~~~~~~~~~~~~~~~~~~~~~
scsi-35000c5003414d617 SAS #15
Current Drive Temperature: 30 C
scsi-35000c50034149b6b SAS #14
Current Drive Temperature: 30 C
scsi-35000c5003423f1ff SAS #13
Current Drive Temperature: 30 C
scsi-35000c5003424a0e7 SAS #12
Current Drive Temperature: 30 C
scsi-35000c50034157d6f SAS #11
Current Drive Temperature: 29 C
scsi-35000c50034247a63 SAS #10
Current Drive Temperature: 29 C
scsi-35000c50034150623 SAS #9
Current Drive Temperature: 30 C
scsi-35000c50034248657 SAS #8
Current Drive Temperature: 30 C
Drives on controller 2
~~~~~~~~~~~~~~~~~~~~~~
scsi-35000c500342497a3 SAS #16
Current Drive Temperature: 30 C
scsi-35000c50034192787 SAS #17
Current Drive Temperature: 29 C
scsi-35000c50034247a43 SAS #18
Current Drive Temperature: 30 C
scsi-35000c5003423c29b SAS #19
Current Drive Temperature: 30 C
scsi-35000c50034249537 SAS #20
Current Drive Temperature: 31 C
scsi-35000c50034241d67 SAS #21
Current Drive Temperature: 31 C
scsi-35000c5003423f337 SAS #22
Current Drive Temperature: 30 C
scsi-35000c5003418ca53 SAS #23
Current Drive Temperature: 29 C
The output is like this after the machine has been up for an hour doing a zfs send,
on a cold day in winter (14C ambient temperature, 9C outside); the warmest drive is around 25C:
Drives on controller 0
~~~~~~~~~~~~~~~~~~~~
scsi-35000c50034249a1b SAS #7
Current Drive Temperature: 24 C
scsi-35000c50034241cd3 SAS #6
Current Drive Temperature: 25 C
scsi-35000c5003417889f SAS #5
Current Drive Temperature: 21 C
scsi-35000c500342222cb SAS #4
Current Drive Temperature: 23 C
scsi-35000c5003425027f SAS #3
Current Drive Temperature: 24 C
scsi-35000c50034241d4f SAS #2
Current Drive Temperature: 24 C
scsi-35000c5003424a11f SAS #1
Current Drive Temperature: 24 C
scsi-35000c50034157a8f SAS #0
Current Drive Temperature: 25 C
Drives on controller 1
~~~~~~~~~~~~~~~~~~~~
scsi-35000c5003414d617 SAS #15
Current Drive Temperature: 23 C
scsi-35000c50034149b6b SAS #14
Current Drive Temperature: 24 C
scsi-35000c5003423f1ff SAS #13
Current Drive Temperature: 24 C
scsi-35000c5003424a0e7 SAS #12
Current Drive Temperature: 24 C
scsi-35000c50034157d6f SAS #11
Current Drive Temperature: 22 C
scsi-35000c50034247a63 SAS #10
Current Drive Temperature: 23 C
scsi-35000c50034150623 SAS #9
Current Drive Temperature: 24 C
scsi-35000c50034248657 SAS #8
Current Drive Temperature: 24 C
Drives on controller 2
~~~~~~~~~~~~~~~~~~~~
scsi-35000c500342497a3 SAS #16
Current Drive Temperature: 23 C
scsi-35000c50034192787 SAS #17
Current Drive Temperature: 22 C
scsi-35000c50034247a43 SAS #18
Current Drive Temperature: 25 C
scsi-35000c5003423c29b SAS #19
Current Drive Temperature: 24 C
scsi-35000c50034249537 SAS #20
Current Drive Temperature: 24 C
scsi-35000c50034241d67 SAS #21
Current Drive Temperature: 25 C
scsi-35000c5003423f337 SAS #22
Current Drive Temperature: 24 C
scsi-35000c5003418ca53 SAS #23
Current Drive Temperature: 22 C
Monitoring performance
Note that you can use zpool iostat as well as plain iostat.
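For example, to watch per-vdev activity at five-second intervals:
zpool iostat -v sminception 5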
Checking the S.M.A.R.T. status.
Since these are SAS drives, the -A
flag to the smartctl
command is used to see the Elements in grown defect list.
Note that SAS devices do not provide SATA S.M.A.R.T. attributes like "Reallocated Sector Count".
sudo apt-get install smartmontools
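A quick way to check a single drive for growing defects is to pull just that line out of the -A output, e.g.:
sudo smartctl -A /dev/disk/by-id/scsi-35000c50034157a8f | grep 'Elements in grown defect list'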
Setting the mount point for a ZFS filesystem
By default, a ZFS filesystem's mountpoint value is "inherited", which
may not be what you want if you have your own convention for how
filesystems are mounted on your system.
So if we look at the inception
filesystem's mount point, it's like this by default (after the above default setup steps):
zfs get all sminception/inception|grep mountpoint
sminception/inception mountpoint /sminception/inception default
To change that, so that the mountpoint follows, for example, the convention of /mnt/<server>/<dataset>,
we can use the following command,
which automatically attempts to unmount the filesystem from where it's
currently mounted and re-mounts it at the new place. Note that this
only affects local mounting of the filesystem; NFS clients can decide for
themselves where to mount it, but it's obviously best if
they follow the same convention.
zfs set mountpoint=/mnt/sm/inception sminception/inception
Sharing over NFS
To export the filesystem over NFS, the commands are as follows. On Debian Jessie, make sure statd
is started properly; note that statd
is required for Apple OSX Yosemite NFS clients to work properly (otherwise they hang).
apt-get install nfs-kernel-server
echo '/dummy_for_etc_exports_moronic localhost(ro,subtree_check)' >> /etc/exports
zfs set sharenfs="rw=@192.168.1.0/24,insecure" sminception/inception
sudo zfs share sminception/inception
showmount -e
zfs get sharenfs
# showmount -e
Export list for sm:
/mnt/sm/inception 192.168.1.0/24
/dummy_for_etc_exports_moronic localhost
# zfs get sharenfs
NAME PROPERTY VALUE SOURCE
sminception sharenfs off default
sminception/inception sharenfs rw=@192.168.1.0/24,insecure local
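On a Linux client, mounting the exported filesystem then looks something like this (assuming the server is reachable by the hostname sm used elsewhere on this page):
sudo mkdir -p /mnt/sm/inception
sudo mount -t nfs sm:/mnt/sm/inception /mnt/sm/inception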
zdb
The output of the zdb
command is like this:
root@sm:~# zdb
sminception:
version: 5000
name: 'sminception'
state: 0
txg: 59992
pool_guid: 7993001279504182280
errata: 0
hostid: 8323329
hostname: 'sm'
vdev_children: 8
vdev_tree:
type: 'root'
id: 0
guid: 7993001279504182280
children[0]:
type: 'mirror'
id: 0
guid: 10523391604328058912
metaslab_array: 43
metaslab_shift: 34
ashift: 9
asize: 2000384688128
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 6139581275235454379
path: '/dev/disk/by-id/scsi-35000c50034157a8f-part1'
whole_disk: 1
DTL: 364
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 17713344311634408801
path: '/dev/disk/by-id/scsi-35000c50034248657-part1'
whole_disk: 1
DTL: 363
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 7461526471984904477
path: '/dev/disk/by-id/scsi-35000c500342497a3-part1'
whole_disk: 1
DTL: 345
create_txg: 4
children[1]:
type: 'mirror'
id: 1
guid: 13489788898704216389
metaslab_array: 41
metaslab_shift: 34
ashift: 9
asize: 2000384688128
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 9764520435028323644
path: '/dev/disk/by-id/scsi-35000c5003424a11f-part1'
whole_disk: 1
DTL: 360
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 6587559338994911545
path: '/dev/disk/by-id/scsi-35000c50034150623-part1'
whole_disk: 1
DTL: 359
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 14631248896257805436
path: '/dev/disk/by-id/scsi-35000c50034192787-part1'
whole_disk: 1
DTL: 347
create_txg: 4
children[2]:
type: 'mirror'
id: 2
guid: 18127905325371103265
metaslab_array: 40
metaslab_shift: 34
ashift: 9
asize: 2000384688128
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 43503354591006241
path: '/dev/disk/by-id/scsi-35000c50034241d4f-part1'
whole_disk: 1
DTL: 358
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 15701981273058322080
path: '/dev/disk/by-id/scsi-35000c50034247a63-part1'
whole_disk: 1
DTL: 357
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 17147423204722466029
path: '/dev/disk/by-id/scsi-35000c50034247a43-part1'
whole_disk: 1
DTL: 367
create_txg: 4
children[3]:
type: 'mirror'
id: 3
guid: 2871032375477364202
metaslab_array: 39
metaslab_shift: 34
ashift: 9
asize: 2000384688128
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 4236933644699323838
path: '/dev/disk/by-id/scsi-35000c5003425027f-part1'
whole_disk: 1
DTL: 356
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 9004728264698295353
path: '/dev/disk/by-id/scsi-35000c50034157d6f-part1'
whole_disk: 1
DTL: 355
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 18428541736462747800
path: '/dev/disk/by-id/scsi-35000c5003423c29b-part1'
whole_disk: 1
DTL: 370
create_txg: 4
children[4]:
type: 'mirror'
id: 4
guid: 1048981656851707422
metaslab_array: 38
metaslab_shift: 34
ashift: 9
asize: 2000384688128
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 4260926722300526777
path: '/dev/disk/by-id/scsi-35000c500342222cb-part1'
whole_disk: 1
DTL: 354
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 14783070303109957676
path: '/dev/disk/by-id/scsi-35000c5003424a0e7-part1'
whole_disk: 1
DTL: 353
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 10595685646790827725
path: '/dev/disk/by-id/scsi-35000c50034249537-part1'
whole_disk: 1
DTL: 372
create_txg: 4
children[5]:
type: 'mirror'
id: 5
guid: 14789964356999802181
metaslab_array: 37
metaslab_shift: 34
ashift: 9
asize: 2000384688128
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 65763515804082926
path: '/dev/disk/by-id/scsi-35000c5003417889f-part1'
whole_disk: 1
DTL: 352
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 6982179362716627328
path: '/dev/disk/by-id/scsi-35000c5003423f1ff-part1'
whole_disk: 1
DTL: 351
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 16167140138866994948
path: '/dev/disk/by-id/scsi-35000c50034241d67-part1'
whole_disk: 1
DTL: 375
create_txg: 4
children[6]:
type: 'mirror'
id: 6
guid: 17364531108284266094
metaslab_array: 36
metaslab_shift: 34
ashift: 9
asize: 2000384688128
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 15427417627729432411
path: '/dev/disk/by-id/scsi-35000c50034241cd3-part1'
whole_disk: 1
DTL: 350
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 8043972884445530254
path: '/dev/disk/by-id/scsi-35000c50034149b6b-part1'
whole_disk: 1
DTL: 349
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 13316974147855006133
path: '/dev/disk/by-id/scsi-35000c5003423f337-part1'
whole_disk: 1
DTL: 378
create_txg: 4
children[7]:
type: 'mirror'
id: 7
guid: 755944469145100657
metaslab_array: 34
metaslab_shift: 34
ashift: 9
asize: 2000384688128
is_log: 0
create_txg: 4
children[0]:
type: 'disk'
id: 0
guid: 15959964977682225405
path: '/dev/disk/by-id/scsi-35000c50034249a1b-part1'
whole_disk: 1
DTL: 362
create_txg: 4
children[1]:
type: 'disk'
id: 1
guid: 12475628906210417449
path: '/dev/disk/by-id/scsi-35000c5003414d617-part1'
whole_disk: 1
DTL: 361
create_txg: 4
children[2]:
type: 'disk'
id: 2
guid: 1030968181531841641
path: '/dev/disk/by-id/scsi-35000c5003418ca53-part1'
whole_disk: 1
DTL: 380
create_txg: 4
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
Scrubbing the pool
The pool can be checked using zpool scrub sminception.
The scrub starts out quite slow and speeds up later on; here it's almost
finished (I think it took less than two hours to complete).
pool: sminception
state: ONLINE
scan: scrub in progress since Fri Jul 24 17:30:21 2015
2.93T scanned out of 4.38T at 643M/s, 0h39m to go
0 repaired, 66.82% done
config:
The iostat output looks like this during the scrub:
avg-cpu: %user %nice %system %iowait %steal %idle
0.17 0.00 23.47 0.15 0.00 76.21
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 0.00 0.00 0.00 0 0
sdc 710.40 90505.20 5.25 905052 52
sdk 709.40 90426.75 5.25 904267 52
sdb 703.00 88750.55 5.25 887505 52
sdj 704.10 88841.80 5.25 888418 52
sdr 703.30 88922.65 5.25 889226 52
sds 707.00 90069.05 5.25 900690 52
sdd 705.60 89016.15 0.40 890161 4
sdl 707.00 89023.20 0.40 890232 4
sdt 704.70 88888.20 0.40 888882 4
sde 730.00 89814.85 0.00 898148 0
sdf 730.70 90748.90 0.00 907489 0
sdn 731.10 90701.45 0.00 907014 0
sdm 724.00 89781.85 0.00 897818 0
sdv 730.00 90927.75 0.00 909277 0
sdh 720.60 89937.45 2.90 899374 29
sdu 725.50 89973.85 0.00 899738 0
sdg 723.90 89723.25 2.90 897232 29
sdp 715.10 90004.75 2.90 900047 29
sdo 717.20 89686.45 2.90 896864 29
sdx 713.60 89812.70 2.90 898127 29
sdw 712.50 89134.05 2.90 891340 29
sdi 716.00 90203.45 7.75 902034 77
sdq 714.50 90416.30 7.75 904163 77
sdy 719.40 91029.65 7.75 910296 77
dm-0 0.00 0.00 0.00 0 0
When running zpool status, remember to include -T d
to show the timestamp, and make sure your hard drives are shown by-id and not by their Linux device names.
# zpool status -v -T d
Sun Aug 9 11:41:40 PDT 2015
pool: sminception
state: ONLINE
scan: scrub repaired 0 in 1h56m with 0 errors on Fri Jul 24 19:27:20 2015
config:
NAME STATE READ WRITE CKSUM
sminception ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
scsi-35000c50034157a8f ONLINE 0 0 0
scsi-35000c50034248657 ONLINE 0 0 0
scsi-35000c500342497a3 ONLINE 0 0 0
mirror-1 ONLINE 0 0 0
scsi-35000c5003424a11f ONLINE 0 0 0
scsi-35000c50034150623 ONLINE 0 0 0
scsi-35000c50034192787 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
scsi-35000c50034241d4f ONLINE 0 0 0
scsi-35000c50034247a63 ONLINE 0 0 0
scsi-35000c50034247a43 ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
scsi-35000c5003425027f ONLINE 0 0 0
scsi-35000c50034157d6f ONLINE 0 0 0
scsi-35000c5003423c29b ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
scsi-35000c500342222cb ONLINE 0 0 0
scsi-35000c5003424a0e7 ONLINE 0 0 0
scsi-35000c50034249537 ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
scsi-35000c5003417889f ONLINE 0 0 0
scsi-35000c5003423f1ff ONLINE 0 0 0
scsi-35000c50034241d67 ONLINE 0 0 0
mirror-6 ONLINE 0 0 0
scsi-35000c50034241cd3 ONLINE 0 0 0
scsi-35000c50034149b6b ONLINE 0 0 0
scsi-35000c5003423f337 ONLINE 0 0 0
mirror-7 ONLINE 0 0 0
scsi-35000c50034249a1b ONLINE 0 0 0
scsi-35000c5003414d617 ONLINE 0 0 0
scsi-35000c5003418ca53 ONLINE 0 0 0
errors: No known data errors
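Rather than remembering to run scrubs by hand, a periodic scrub can be scheduled from root's crontab (sudo crontab -e); the schedule below (03:00 on the first of each month) is just a suggestion:
0 3 1 * * /sbin/zpool scrub sminception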
Quirks in ZFS
During resilvering, zpool iostat 1
doesn't show the write bandwidth; see here for the apparently unanswered question as to why not, and the suggestion to use the -v
flag to see the write bandwidth to the resilvering drives.
Perhaps drives that are in the process of being resilvered are not yet
considered full members of the pool, so their write bandwidth isn't
of interest, but it seems strange that -v
would be used in this fashion, as normally it just means "verbose".
zpool iostat -v 1
Further info...
For more on ZFS, Aaron's guide to ZFS is a good place to start. Also, his article about parchive is interesting.