The mdadm package provides the mdadm command, which allows creating and manipulating RAID arrays, as well as scripts and tools integrating it into the rest of the system, including the monitoring system.
sda disk, 4 GB, is entirely available;
sde disk, 4 GB, is also entirely available;
sdg disk, only partition sdg2 (about 4 GB) is available;
sdh disk, still 4 GB, entirely available.
# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sde
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
# mdadm --query /dev/md0
/dev/md0: 8.00GiB raid0 2 devices, 0 spares. Use mdadm --detail for more detail.
# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Thu Sep 30 15:21:15 2010
     Raid Level : raid0
     Array Size : 8388480 (8.00 GiB 8.59 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent
    Update Time : Thu Sep 30 15:21:15 2010
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
     Chunk Size : 512K
           Name : squeeze:0  (local to host squeeze)
           UUID : 0012a273:cbdb8b83:0ee15f7f:aec5e3c3
         Events : 0
    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       1       8       64        1      active sync   /dev/sde
# mkfs.ext4 /dev/md0
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
524288 inodes, 2097152 blocks
104857 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=2147483648
55 block groups
32768 blocks per group, 32768 fragments per group
8160 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 26 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# mkdir /srv/raid-0
# mount /dev/md0 /srv/raid-0
# df -h /srv/raid-0
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              8.0G  249M  7.4G   4% /srv/raid-0
The mdadm --create command requires several parameters: the name of the volume to create (/dev/md*, with MD standing for Multiple Device), the RAID level, the number of disks (which is compulsory despite being mostly meaningful only with RAID-1 and above), and the physical drives to use. Once the device is created, we can use it like a normal partition: create a filesystem on it, mount that filesystem, and so on. Note that creating our RAID-0 volume on md0 is nothing but a coincidence; the numbering of the array doesn't need to be correlated to the chosen amount of redundancy.
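As a rough cross-check of the sizes reported by mdadm, the usable capacity of the two levels used in this section can be computed from the member sizes. The following Python sketch is a simplification: it ignores the space md reserves for its own metadata, and the function name md_array_size_kib is hypothetical. It assumes that each member of md0 contributes the same 4194240 KiB that mdadm reports as the per-device size of md1.

```python
def md_array_size_kib(level: int, member_sizes_kib: list[int]) -> int:
    """Approximate usable size of an md array, in KiB.

    Simplification: member sizes are taken as-is, without subtracting
    the space used by the md superblock.
    """
    if not member_sizes_kib:
        raise ValueError("an array needs at least one member")
    if level == 0:
        # RAID-0 spreads data over all members, so their capacities add up.
        return sum(member_sizes_kib)
    if level == 1:
        # RAID-1 keeps a full copy on each member; the smallest one limits the size.
        return min(member_sizes_kib)
    raise NotImplementedError(f"RAID level {level} is not covered by this sketch")


# Two members of 4194240 KiB each (assumed per-device size).
print(md_array_size_kib(0, [4194240, 4194240]))  # 8388480, the Array Size of md0
print(md_array_size_kib(1, [4194240, 4194240]))  # 4194240, the Array Size of md1
```

The same function also shows why RAID-0 is the level to choose for raw capacity and RAID-1 for safety: the first adds sizes up, the second only keeps the smallest one.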
# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdg2 /dev/sdh
mdadm: largest drive (/dev/sdg2) exceed size (4194240K) by more than 1%
Continue creating array? y
mdadm: array /dev/md1 started.
# mdadm --query /dev/md1
/dev/md1: 4.00GiB raid1 2 devices, 0 spares. Use mdadm --detail for more detail.
# mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Thu Sep 30 15:39:13 2010
     Raid Level : raid1
     Array Size : 4194240 (4.00 GiB 4.29 GB)
  Used Dev Size : 4194240 (4.00 GiB 4.29 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent
    Update Time : Thu Sep 30 15:39:26 2010
          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
 Rebuild Status : 10% complete
           Name : squeeze:1  (local to host squeeze)
           UUID : 20a8419b:41612750:b9171cfe:00d9a432
         Events : 27
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       1       8      112        1      active sync   /dev/sdh
# mdadm --detail /dev/md1
/dev/md1:
[...]
          State : active
[...]
mdadm notices that the physical elements have different sizes; since this implies that some space will be lost on the bigger element, a confirmation is required.
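The confirmation can be modelled as a simple size comparison. In this Python sketch, the 1% threshold comes from the mdadm message shown above, but the comparison is a simplification (mdadm compares against the computed array size, which also accounts for metadata). The helper name and the 4300000 KiB size used for sdg2 are hypothetical, since the exact size of that partition is not given.

```python
def mirror_size_check(member_sizes_kib: list[int]) -> tuple[int, int, bool]:
    """Return (usable KiB, KiB wasted on the largest member, needs confirmation)."""
    usable = min(member_sizes_kib)
    largest = max(member_sizes_kib)
    wasted = largest - usable
    # Integer form of "largest > usable * 1.01", which avoids float rounding.
    needs_confirmation = largest * 100 > usable * 101
    return usable, wasted, needs_confirmation


# Hypothetical size for sdg2; sdh is taken as 4194240 KiB.
print(mirror_size_check([4300000, 4194240]))  # (4194240, 105760, True)
```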
The /dev/md1 device is usable, and a filesystem can be created on it, as well as some data copied onto it.
mdadm, in particular its --fail option, allows simulating such a disk failure:
# mdadm /dev/md1 --fail /dev/sdh
mdadm: set /dev/sdh faulty in /dev/md1
# mdadm --detail /dev/md1
/dev/md1:
[...]
    Update Time : Thu Sep 30 15:45:50 2010
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0
           Name : squeeze:1  (local to host squeeze)
           UUID : 20a8419b:41612750:b9171cfe:00d9a432
         Events : 35
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       1       0        0        1      removed
       2       8      112        -      faulty spare   /dev/sdh
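The State line of mdadm --detail is enough to notice that an array has lost its redundancy. mdadm's own monitoring daemon, configured through MAILADDR in mdadm.conf, already sends alerts; this Python sketch merely illustrates reading that line from captured text. The function name is hypothetical, and in practice the text would come from running mdadm --detail (for instance through subprocess.run), which requires root privileges.

```python
def md_state_flags(detail_text: str) -> set[str]:
    """Extract the comma-separated flags of the 'State :' line of mdadm --detail."""
    for line in detail_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "State":
            return {flag.strip() for flag in value.split(",") if flag.strip()}
    return set()


# Excerpt of the output above, after sdh was marked as faulty.
SAMPLE = """/dev/md1:
    Update Time : Thu Sep 30 15:45:50 2010
          State : active, degraded
 Active Devices : 1
"""

flags = md_state_flags(SAMPLE)
if "degraded" in flags:
    print("md1 is running without redundancy:", sorted(flags))
```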
Should the sdg disk fail in turn, the data would be lost. We want to avoid that risk, so we'll replace the failed disk with a new one, sdi:
# mdadm /dev/md1 --add /dev/sdi
mdadm: added /dev/sdi
# mdadm --detail /dev/md1
/dev/md1:
[...]
   Raid Devices : 2
  Total Devices : 3
    Persistence : Superblock is persistent
    Update Time : Thu Sep 30 15:52:29 2010
          State : active, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 1
 Rebuild Status : 45% complete
           Name : squeeze:1  (local to host squeeze)
           UUID : 20a8419b:41612750:b9171cfe:00d9a432
         Events : 53
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       3       8      128        1      spare rebuilding   /dev/sdi
       2       8      112        -      faulty spare   /dev/sdh
# [...]
[...]
# mdadm --detail /dev/md1
/dev/md1:
[...]
    Update Time : Thu Sep 30 15:52:35 2010
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 1
  Spare Devices : 0
           Name : squeeze:1  (local to host squeeze)
           UUID : 20a8419b:41612750:b9171cfe:00d9a432
         Events : 71
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       1       8      128        1      active sync   /dev/sdi
       2       8      112        -      faulty spare   /dev/sdh
The sdh disk is about to be removed from the array, so as to end up with a classical RAID mirror on two disks:
# mdadm /dev/md1 --remove /dev/sdh
mdadm: hot removed /dev/sdh from /dev/md1
# mdadm --detail /dev/md1
/dev/md1:
[...]
    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       1       8      128        1      active sync   /dev/sdi
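The device table at the end of the mdadm --detail output tells which members are candidates for --remove. The following Python sketch, with a hypothetical function name, lists the members whose state includes faulty; rows without a device path, such as a removed slot, are skipped. It only prints the commands for review and does not run them.

```python
def faulty_members(detail_text: str) -> list[str]:
    """Return device paths whose state column in mdadm --detail contains 'faulty'."""
    faulty = []
    for line in detail_text.splitlines():
        tokens = line.split()
        # Table rows look like: Number Major Minor RaidDevice State... /dev/xxx
        if len(tokens) >= 6 and tokens[0].isdigit() and tokens[-1].startswith("/dev/"):
            state = tokens[4:-1]
            if "faulty" in state:
                faulty.append(tokens[-1])
    return faulty


# Device table of md1 once sdi has been rebuilt, before sdh is removed.
TABLE = """    Number   Major   Minor   RaidDevice State
       0       8       98        0      active sync   /dev/sdg2
       1       8      128        1      active sync   /dev/sdi
       2       8      112        -      faulty spare   /dev/sdh
"""

for dev in faulty_members(TABLE):
    print(f"mdadm /dev/md1 --remove {dev}")
```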
If the sdh disk failure had been real (instead of simulated) and the system had been restarted without removing this sdh disk, the disk could start working again after being probed during the reboot. The kernel would then have three physical elements, each claiming to contain half of the same RAID volume. Another source of confusion can arise when RAID volumes from two servers are consolidated onto a single server. If these arrays were running normally before the disks were moved, the kernel would be able to detect and reassemble the pairs properly; but if the moved disks had been aggregated into an md1 on the old server, and the new server already has an md1, one of the mirrors would be renamed.
The RAID configuration can be recorded in the /etc/mdadm/mdadm.conf file, an example of which is listed here:
Example 12.1. mdadm configuration file
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#
# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE /dev/sd*
# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR root
ARRAY /dev/md0 metadata=1.2 name=squeeze:0 UUID=6194b63f:69a40eb5:a79b7ad3:c91f20ee
ARRAY /dev/md1 metadata=1.2 name=squeeze:1 UUID=20a8419b:41612750:b9171cfe:00d9a432
The DEVICE option lists the devices where the system will automatically look for components of RAID volumes at start-up time. In our example, we replaced the default value, partitions, with an explicit list of device files, since we chose to use entire disks, and not only partitions, for some volumes.
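The DEVICE line accepts shell-style wildcards. The Python sketch below approximates how a pattern such as /dev/sd* selects candidate devices, using the standard fnmatch module; the device list is hypothetical, and the special value partitions (which reads /proc/partitions) is not handled.

```python
from fnmatch import fnmatchcase


def devices_to_scan(patterns: list[str], available: list[str]) -> list[str]:
    """Keep the device paths matching at least one DEVICE pattern."""
    return [dev for dev in available
            if any(fnmatchcase(dev, pattern) for pattern in patterns)]


# Hypothetical set of device nodes present on the machine.
AVAILABLE = ["/dev/sda", "/dev/sde", "/dev/sdg2", "/dev/hda1", "/dev/sr0"]
print(devices_to_scan(["/dev/sd*"], AVAILABLE))
# ['/dev/sda', '/dev/sde', '/dev/sdg2']
```

Unlike a shell, fnmatch lets * match a slash as well; that makes no difference for flat device names such as these.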
/dev/md* device name).
# mdadm --misc --detail --brief /dev/md?
ARRAY /dev/md0 metadata=1.2 name=squeeze:0 UUID=6194b63f:69a40eb5:a79b7ad3:c91f20ee
ARRAY /dev/md1 metadata=1.2 name=squeeze:1 UUID=20a8419b:41612750:b9171cfe:00d9a432
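Since mdadm --misc --detail --brief prints lines in the same ARRAY format as mdadm.conf, a script can check whether every running array is already recorded in the configuration. A minimal Python sketch, comparing arrays by UUID; the function names are hypothetical, and the configuration text is cut down to a single ARRAY line for the example.

```python
def parse_array_lines(text: str) -> dict[str, dict[str, str]]:
    """Map each ARRAY device name to its key=value attributes."""
    arrays = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ARRAY":
            arrays[fields[1]] = dict(f.split("=", 1) for f in fields[2:] if "=" in f)
    return arrays


def missing_from_config(brief_output: str, conf_text: str) -> list[str]:
    """Arrays reported by mdadm whose UUID does not appear in mdadm.conf."""
    known = {attrs.get("UUID") for attrs in parse_array_lines(conf_text).values()}
    return [dev for dev, attrs in parse_array_lines(brief_output).items()
            if attrs.get("UUID") not in known]


BRIEF = """ARRAY /dev/md0 metadata=1.2 name=squeeze:0 UUID=6194b63f:69a40eb5:a79b7ad3:c91f20ee
ARRAY /dev/md1 metadata=1.2 name=squeeze:1 UUID=20a8419b:41612750:b9171cfe:00d9a432"""
CONF = """DEVICE /dev/sd*
ARRAY /dev/md1 metadata=1.2 name=squeeze:1 UUID=20a8419b:41612750:b9171cfe:00d9a432"""

print(missing_from_config(BRIEF, CONF))  # ['/dev/md0']
```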
/dev hierarchy, so there's no risk of using them directly.
/dev, and it can be used as any other physical partition can be (most commonly, to host a filesystem or swap space).
sdb disk, a sdb2 partition, 4 GB;
sdc disk, a sdc3 partition, 3 GB;
sdd disk, 4 GB, is fully available;
sdf disk, a sdf1 partition, 4 GB; and a sdf2 partition, 5 GB.
The sdb and sdf disks are faster than the other two.
pvcreate:
# pvdisplay
# pvcreate /dev/sdb2
  Physical volume "/dev/sdb2" successfully created
# pvdisplay
  "/dev/sdb2" is a new physical volume of "4,00 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdb2
  VG Name
  PV Size               4.00 GiB
  Allocatable           NO
  PE Size (KByte)       0
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               9JuaGR-W7jc-pNgj-NU4l-2IX1-kUJ7-m8cRim
# for i in sdc3 sdd sdf1 sdf2 ; do pvcreate /dev/$i ; done
  Physical volume "/dev/sdc3" successfully created
  Physical volume "/dev/sdd" successfully created
  Physical volume "/dev/sdf1" successfully created
  Physical volume "/dev/sdf2" successfully created
# pvdisplay -C
  PV         VG   Fmt  Attr PSize PFree
  /dev/sdb2       lvm2 a-   4.00g 4.00g
  /dev/sdc3       lvm2 a-   3.09g 3.09g
  /dev/sdd        lvm2 a-   4.00g 4.00g
  /dev/sdf1       lvm2 a-   4.10g 4.10g
  /dev/sdf2       lvm2 a-   5.22g 5.22g
The pvdisplay command lists the existing PVs, with two possible output formats: a detailed one by default, and a compact table with the -C option.
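The compact format of pvdisplay -C is convenient to process in scripts. As an illustration, this Python sketch (hypothetical function names) sums the free space of the PVs listed above. It reads the PFree value from the last column, because the VG column is empty for PVs not yet assigned to a group, so the number of fields varies; it only understands the m, g and t unit suffixes, which LVM expresses in powers of 1024.

```python
UNITS_IN_GIB = {"m": 1 / 1024, "g": 1.0, "t": 1024.0}


def size_to_gib(value: str) -> float:
    """Convert an LVM size such as '4.00g' or '144.00m' to GiB."""
    return float(value[:-1]) * UNITS_IN_GIB[value[-1].lower()]


def free_space_by_pv(table: str) -> dict[str, float]:
    """Map each PV of a 'pvdisplay -C' table to its free space in GiB."""
    free = {}
    for line in table.splitlines():
        tokens = line.split()
        if tokens and tokens[0].startswith("/dev/"):
            free[tokens[0]] = size_to_gib(tokens[-1])  # PFree is the last column
    return free


PV_TABLE = """  PV         VG   Fmt  Attr PSize PFree
  /dev/sdb2       lvm2 a-   4.00g 4.00g
  /dev/sdc3       lvm2 a-   3.09g 3.09g
  /dev/sdd        lvm2 a-   4.00g 4.00g
  /dev/sdf1       lvm2 a-   4.10g 4.10g
  /dev/sdf2       lvm2 a-   5.22g 5.22g
"""

free = free_space_by_pv(PV_TABLE)
print(f"{len(free)} PVs, {sum(free.values()):.2f} GiB free")  # 5 PVs, 20.41 GiB free
```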
vgcreate. We'll gather only PVs from the fast disks into a vg_critical VG; the other VG, vg_normal, will also include slower elements.
# vgdisplay
# vgcreate vg_critical /dev/sdb2 /dev/sdf1
  Volume group "vg_critical" successfully created
# vgdisplay
  --- Volume group ---
  VG Name               vg_critical
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               8.14 GB
  PE Size               4.00 MB
  Total PE              2084
  Alloc PE / Size       0 / 0
  Free  PE / Size       2084 / 8.14 GB
  VG UUID               6eG6BW-MmJE-KB0J-dsB2-52iL-N6eD-1paeo8
# vgcreate vg_normal /dev/sdc3 /dev/sdd /dev/sdf2
  Volume group "vg_normal" successfully created
# vgdisplay -C
  VG          #PV #LV #SN Attr   VSize  VFree
  vg_critical   2   0   0 wz--n-  8.14g  8.14g
  vg_normal     3   0   0 wz--n- 12.30g 12.30g
vgdisplay proposes two output formats). Note that it's quite possible to put two partitions of the same physical disk into two different VGs, as we did here with sdf1 and sdf2. Note also that we used a vg_ prefix to name our VGs, but it's nothing more than a convention.
lvcreate command, and a slightly more complex syntax:
# lvdisplay
# lvcreate -n lv_files -L 5G vg_critical
  Logical volume "lv_files" created
# lvdisplay
  --- Logical volume ---
  LV Name                /dev/vg_critical/lv_files
  VG Name                vg_critical
  LV UUID                4QLhl3-2cON-jRgQ-X4eT-93J4-6Ox9-GyRx3M
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                5.00 GB
  Current LE             1280
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0
# lvcreate -n lv_base -L 1G vg_critical
  Logical volume "lv_base" created
# lvcreate -n lv_backups -L 12G vg_normal
  Logical volume "lv_backups" created
# lvdisplay -C
  LV         VG          Attr   LSize  Origin Snap%  Move Log Copy%  Convert
  lv_base    vg_critical -wi-a-  1.00G
  lv_files   vg_critical -wi-a-  5.00G
  lv_backups vg_normal   -wi-a- 12.00G
lvcreate as options. The name of the LV to be created is specified with the -n option, and its size is generally given using the -L option. We also need to tell the command what VG to operate on, of course, hence the last parameter on the command line.
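LVM allocates space in whole physical extents (PEs), so the size given with -L is converted into a number of logical extents, rounded up if needed. With the 4 MB PE size reported by vgdisplay for vg_critical, the following Python sketch reproduces the 8.14 GB size of the VG, the 1280 extents of lv_files, and the 2.14g of free space that vgdisplay -C reports once lv_files and lv_base exist. The helper name extents_for is hypothetical.

```python
import math

PE_MIB = 4          # "PE Size 4.00 MB" for vg_critical
TOTAL_PE = 2084     # "Total PE" for vg_critical


def extents_for(size_gib: float, pe_mib: int = PE_MIB) -> int:
    """Number of extents needed for an LV of size_gib, rounded up to a whole extent."""
    return math.ceil(size_gib * 1024 / pe_mib)


print(TOTAL_PE * PE_MIB / 1024)          # 8.140625 GiB, shown as 8.14 by vgdisplay
lv_files = extents_for(5)                 # 1280, the "Current LE" of lv_files
lv_base = extents_for(1)                  # 256
free_pe = TOTAL_PE - lv_files - lv_base   # 548
print(lv_files, lv_base, free_pe, free_pe * PE_MIB / 1024)  # 1280 256 548 2.140625
```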
/dev/mapper/:
# ls -l /dev/mapper
total 0
crw-rw---- 1 root root  10, 59  5 oct.  17:40 control
lrwxrwxrwx 1 root root      7  5 oct.  18:14 vg_critical-lv_base -> ../dm-1
lrwxrwxrwx 1 root root      7  5 oct.  18:14 vg_critical-lv_files -> ../dm-0
lrwxrwxrwx 1 root root      7  5 oct.  18:14 vg_normal-lv_backups -> ../dm-2
# ls -l /dev/dm-*
brw-rw---- 1 root disk 253, 0  5 oct.  18:14 /dev/dm-0
brw-rw---- 1 root disk 253, 1  5 oct.  18:14 /dev/dm-1
brw-rw---- 1 root disk 253, 2  5 oct.  18:14 /dev/dm-2
# ls -l /dev/vg_critical
total 0
lrwxrwxrwx 1 root root 7  5 oct.  18:14 lv_base -> ../dm-1
lrwxrwxrwx 1 root root 7  5 oct.  18:14 lv_files -> ../dm-0
# ls -l /dev/vg_normal
total 0
lrwxrwxrwx 1 root root 7  5 oct.  18:14 lv_backups -> ../dm-2
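The names under /dev/mapper/ join the VG and LV names with a hyphen, while /dev/vg/lv offers a more readable alias for the same dm-N device. Because the hyphen serves as the separator, LVM doubles any hyphen that appears inside a VG or LV name; the volumes in this section only use underscores, so that escaping does not show in the listings above. A small Python sketch of the naming scheme, with hypothetical function names and a hypothetical vg-data group:

```python
def dm_name(vg: str, lv: str) -> str:
    """device-mapper name of an LV: hyphens inside each name are doubled."""
    return f"{vg.replace('-', '--')}-{lv.replace('-', '--')}"


def lv_paths(vg: str, lv: str) -> tuple[str, str]:
    """The two paths under which an LV is usually reachable."""
    return f"/dev/mapper/{dm_name(vg, lv)}", f"/dev/{vg}/{lv}"


print(lv_paths("vg_critical", "lv_files"))
# ('/dev/mapper/vg_critical-lv_files', '/dev/vg_critical/lv_files')
print(dm_name("vg-data", "lv_home"))  # vg--data-lv_home
```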
# mkfs.ext4 /dev/vg_normal/lv_backups
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
[...]
This filesystem will be automatically checked every 34 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# mkdir /srv/backups
# mount /dev/vg_normal/lv_backups /srv/backups
# df -h /srv/backups
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_normal-lv_backups 12G 159M 12G 2% /srv/backups
# [...]
[...]
# cat /etc/fstab
[...]
/dev/vg_critical/lv_base    /srv/base       ext4
/dev/vg_critical/lv_files   /srv/files      ext4
/dev/vg_normal/lv_backups   /srv/backups    ext4
vg_critical, we can grow lv_files. For that purpose, we'll use the lvresize command, then resize2fs to adapt the filesystem accordingly:
# df -h /srv/files/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_files 5.0G 4.6G 142M 98% /srv/files
# lvdisplay -C vg_critical/lv_files
  LV       VG          Attr   LSize Origin Snap%  Move Log Copy%  Convert
  lv_files vg_critical -wi-ao 5.00g
# vgdisplay -C vg_critical
  VG          #PV #LV #SN Attr   VSize VFree
  vg_critical   2   2   0 wz--n- 8.14g 2.14g
# lvresize -L 7G vg_critical/lv_files
  Extending logical volume lv_files to 7.00 GB
  Logical volume lv_files successfully resized
# lvdisplay -C vg_critical/lv_files
  LV       VG          Attr   LSize Origin Snap%  Move Log Copy%  Convert
  lv_files vg_critical -wi-ao 7.00g
# resize2fs /dev/vg_critical/lv_files
resize2fs 1.41.12 (17-May-2010)
Filesystem at /dev/vg_critical/lv_files is mounted on /srv/files; on-line resizing required
old desc_blocks = 1, new_desc_blocks = 1
Performing an on-line resize of /dev/vg_critical/lv_files to 1835008 (4k) blocks.
The filesystem on /dev/vg_critical/lv_files is now 1835008 blocks long.
# df -h /srv/files/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_files 6.9G 4.6G 2.1G 70% /srv/files
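The figures in this transcript are consistent with each other. Growing lv_files from 5 GB to 7 GB consumes 2 GB, that is 512 extents of 4 MB, out of the 2.14 GB (about 548 extents) that were free; the 36 remaining extents correspond to the 144.00m of free space that vgdisplay reports in the next listing. resize2fs then extends the filesystem to fill the LV, counted in 4 KB blocks. A short Python check of that arithmetic, assuming the 4 MB PE size and the 4096-byte block size reported earlier (the 548 figure is derived from the VFree value, not printed by LVM):

```python
PE_MIB = 4          # PE size of vg_critical
BLOCK_SIZE = 4096   # ext4 block size reported by mke2fs


def fs_blocks(lv_gib: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of filesystem blocks filling an LV of lv_gib GiB."""
    return lv_gib * 1024**3 // block_size


free_pe_before = 548                      # 2.14 GiB free before lvresize
grow_pe = (7 - 5) * 1024 // PE_MIB        # 512 extents to go from 5G to 7G
free_pe_after = free_pe_before - grow_pe  # 36

print(fs_blocks(7))             # 1835008, as printed by resize2fs
print(free_pe_after * PE_MIB)   # 144 MiB, the "144.00m" left in the VG
```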
# df -h /srv/base/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_base 1008M 835M 123M 88% /srv/base
# vgdisplay -C vg_critical
  VG          #PV #LV #SN Attr   VSize VFree
  vg_critical   2   2   0 wz--n- 8.14g 144.00m
The sdb1 partition, which had so far been used outside of LVM, only contained archives that could be moved to lv_backups. We can now recycle it and integrate it into the volume group, thereby reclaiming some available space. This is the purpose of the vgextend command. Of course, the partition must first be prepared as a physical volume. Once the VG has been extended, we can use the same commands as before to grow the logical volume, then the filesystem:
# pvcreate /dev/sdb1
  Physical volume "/dev/sdb1" successfully created
# vgextend vg_critical /dev/sdb1
  Volume group "vg_critical" successfully extended
# vgdisplay -C vg_critical
  VG          #PV #LV #SN Attr   VSize VFree
  vg_critical   3   2   0 wz--n- 9.09g 1.09g
# [...]
[...]
# df -h /srv/base/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_base 2.0G 835M 1.1G 44% /srv/base
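The whole procedure, from a spare partition to a larger filesystem, follows four steps in a fixed order: prepare the PV, extend the VG, grow the LV, then grow the filesystem. The Python sketch below only builds the command lines for review and does not run them. The helper name is hypothetical, and the 2G target for lv_base is an assumption inferred from the final df output, since the commands actually used to grow it are elided above.

```python
import shlex


def grow_with_new_pv(partition: str, vg: str, lv: str, new_size: str) -> list[str]:
    """Command lines to add a partition to a VG, then grow one LV and its ext filesystem."""
    return [
        shlex.join(["pvcreate", partition]),
        shlex.join(["vgextend", vg, partition]),
        shlex.join(["lvresize", "-L", new_size, f"{vg}/{lv}"]),
        shlex.join(["resize2fs", f"/dev/{vg}/{lv}"]),
    ]


for command in grow_with_new_pv("/dev/sdb1", "vg_critical", "lv_base", "2G"):
    print(command)
```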
sda and sdc. They are partitioned identically along the following scheme:
# fdisk -l /dev/sda
Disk /dev/sda: 300.0 GB, 300090728448 bytes
255 heads, 63 sectors/track, 36483 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00039a9f
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1         124      995998+  fd  Linux raid autodetect
/dev/sda2             125         248      996030   82  Linux swap / Solaris
/dev/sda3             249       36483   291057637+   5  Extended
/dev/sda5             249       12697    99996561   fd  Linux raid autodetect
/dev/sda6           12698       25146    99996561   fd  Linux raid autodetect
/dev/sda7           25147       36483    91064421   8e  Linux LVM
md0. This mirror is directly used to store the root filesystem.
The sda2 and sdc2 partitions are used as swap partitions, providing a total of 2 GB of swap space. With 1 GB of RAM, the workstation has a comfortable amount of available memory.
The sda5 and sdc5 partitions, as well as sda6 and sdc6, are assembled into two new RAID-1 volumes of about 100 GB each, md1 and md2. Both these mirrors are initialized as physical volumes for LVM and assigned to the vg_raid volume group, which thus contains about 200 GB of safe space.
The remaining two partitions, sda7 and sdc7, are directly used as physical volumes and assigned to another VG called vg_bulk, which therefore ends up with roughly 200 GB of space.
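The sizes quoted in the last three paragraphs can be derived from the Blocks column of the fdisk listing, which counts 1 KiB blocks. This Python sketch assumes that sdc is partitioned exactly like sda, as stated above, and ignores RAID and LVM metadata overhead; the function names are hypothetical. It gives about 186 GB for vg_bulk, which the text rounds to roughly 200 GB.

```python
def parse_fdisk_rows(text: str) -> dict[str, tuple[int, str]]:
    """Map partition name to (size in 1 KiB blocks, partition type id)."""
    parts = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].startswith("/dev/"):
            continue
        if len(tokens) > 1 and tokens[1] == "*":  # drop the boot flag column
            del tokens[1]
        if len(tokens) >= 5:
            name, _start, _end, blocks, type_id = tokens[:5]
            parts[name] = (int(blocks.rstrip("+")), type_id)
    return parts


def kib_to_gb(kib: int) -> float:
    return kib * 1024 / 1e9


FDISK_ROWS = """/dev/sda1   *           1         124      995998+  fd  Linux raid autodetect
/dev/sda2             125         248      996030   82  Linux swap / Solaris
/dev/sda3             249       36483   291057637+   5  Extended
/dev/sda5             249       12697    99996561   fd  Linux raid autodetect
/dev/sda6           12698       25146    99996561   fd  Linux raid autodetect
/dev/sda7           25147       36483    91064421   8e  Linux LVM
"""

parts = parse_fdisk_rows(FDISK_ROWS)
size = {name: kib_to_gb(kib) for name, (kib, _type) in parts.items()}
swap_gb = 2 * size["/dev/sda2"]                      # sda2 + sdc2
vg_raid_gb = size["/dev/sda5"] + size["/dev/sda6"]   # md1 + md2, one member size each
vg_bulk_gb = 2 * size["/dev/sda7"]                   # sda7 + sdc7
print(f"swap {swap_gb:.1f} GB, vg_raid {vg_raid_gb:.1f} GB, vg_bulk {vg_bulk_gb:.1f} GB")
# swap 2.0 GB, vg_raid 204.8 GB, vg_bulk 186.5 GB
```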
LVs created in vg_raid will be preserved even if one of the disks fails, which will not be the case for LVs created in vg_bulk; on the other hand, the latter will be allocated in parallel on both disks, which allows higher read or write speeds for large files.
We will create the lv_usr, lv_var and lv_home LVs on vg_raid to host the matching filesystems; another large LV, lv_movies, will be used to host the final versions of movies after editing. The other VG will be split into a large lv_rushes, for data straight out of the digital video cameras, and an lv_tmp for temporary files. The location of the work area is a less straightforward choice: while good performance is needed for that volume, is it worth risking losing work if a disk fails during an editing session? Depending on the answer to that question, the relevant LV will be created on one VG or the other.
/usr/ can be grown painlessly.