[parted-devel] Small writes being split with fdatasync based on non-aligned partition ending

Jens Rosenboom j.rosenboom at x-ion.de
Thu Feb 11 09:54:41 UTC 2016


2016-02-11 4:48 GMT+01:00 Sitsofe Wheeler <sitsofe at gmail.com>:
> Trying to cc the GNU parted and linux-block mailing lists.
>
> On 9 February 2016 at 13:02, Jens Rosenboom <j.rosenboom at x-ion.de> wrote:
>> While trying to reproduce some performance issues I have been seeing
>> with Ceph, I have come across a strange behaviour which is seemingly
>> affected only by the end point (and thereby the size) of a partition
>> being an odd number of sectors. Since all documentation about
>> alignment only refers to the starting point of the partition, this was
>> pretty surprising and I would like to know whether this is expected
>> behaviour or maybe a kernel issue.
>>
>> The command I am using is pretty simple:
>>
>> fio --rw=randwrite --size=1G --fdatasync=1 --bs=4k --filename=/dev/sdb2 --runtime=10 --name=test
>>
>> The difference shows itself when the partition is created either by
>> sgdisk or by parted:
>>
>> sgdisk --new=2:6000M: /dev/sdb
>>
>> parted -s /dev/sdb mkpart osd-device-1-block 6291456000B 100%
>>
>> The difference in the partition table looks like this:
>>
>> <  2      6291456000B  1600320962559B  1594029506560B               osd-device-1-block
>> ---
>>>  2      6291456000B  1600321297919B  1594029841920B               osd-device-1-block
>
> Looks like parted took you at your word when you asked for your
> partition at 100%. Just out of curiosity if you try and make the same
> partition interactively with parted do you get any warnings after
> making and after running align-check ?

No warnings and everything fine for align-check. I found out that I
can get the same effect if I step the partition ending manually in
parted in 1s increments. The sequence of write sizes is 8, 1, 2, 1, 4,
1, 2, 1, 8, ... which corresponds to the size (unit s) of the
resulting partition mod 8.
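
The pattern can be written down as a small Python sketch. It only
models the observation reported here (write size equals the largest
power of two, capped at 8 sectors, that divides the partition size) and
makes no claim about where in the kernel this happens. The names
observed_write_sectors and bytes_to_sectors are made up for
illustration, and 512-byte sectors are assumed.

```python
from math import gcd

SECTOR_BYTES = 512   # assumption: 512-byte logical sectors, as in the traces
PAGE_SECTORS = 8     # 4 KiB expressed in 512-byte sectors


def bytes_to_sectors(size_bytes: int) -> int:
    """Convert a size as printed by parted (unit B) to 512-byte sectors."""
    if size_bytes % SECTOR_BYTES:
        raise ValueError("size is not a whole number of sectors")
    return size_bytes // SECTOR_BYTES


def observed_write_sectors(size_in_sectors: int) -> int:
    """Empirical model: largest power of two (at most 8) dividing the size."""
    return gcd(size_in_sectors, PAGE_SECTORS)


if __name__ == "__main__":
    # Growing the partition one sector at a time gives 8, 1, 2, 1, 4, 1, 2, 1, 8.
    print([observed_write_sectors(n) for n in range(8, 17)])

    # The two partition sizes from the partition table comparison in this thread.
    for size_bytes in (1594029506560, 1594029841920):
        sectors = bytes_to_sectors(size_bytes)
        print(size_bytes, sectors % 8, observed_write_sectors(sectors))
```

Under this model the 1594029506560B partition is a multiple of 8
sectors, while the 1594029841920B one has an odd sector count and so
maps to 1-sector writes.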

>> So this is really only the end of the partition that is different.
>> However, in the first case, the 4k writes all get broken up into 512b
>> writes somewhere in the kernel, as can be seen with btrace:
>>
>>   8,16   3       36     0.000102666  8184  A  WS 12353985 + 1 <- (8,18) 65985
>>   8,16   3       37     0.000102739  8184  Q  WS 12353985 + 1 [fio]
>>   8,16   3       38     0.000102875  8184  M  WS 12353985 + 1 [fio]
>>   8,16   3       39     0.000103038  8184  A  WS 12353986 + 1 <- (8,18) 65986
>>   8,16   3       40     0.000103109  8184  Q  WS 12353986 + 1 [fio]
>>   8,16   3       41     0.000103196  8184  M  WS 12353986 + 1 [fio]
>>   8,16   3       42     0.000103335  8184  A  WS 12353987 + 1 <- (8,18) 65987
>>   8,16   3       43     0.000103403  8184  Q  WS 12353987 + 1 [fio]
>>   8,16   3       44     0.000103489  8184  M  WS 12353987 + 1 [fio]
>>   8,16   3       45     0.000103609  8184  A  WS 12353988 + 1 <- (8,18) 65988
>>   8,16   3       46     0.000103678  8184  Q  WS 12353988 + 1 [fio]
>>   8,16   3       47     0.000103767  8184  M  WS 12353988 + 1 [fio]
>>   8,16   3       48     0.000103879  8184  A  WS 12353989 + 1 <- (8,18) 65989
>>   8,16   3       49     0.000103947  8184  Q  WS 12353989 + 1 [fio]
>>   8,16   3       50     0.000104035  8184  M  WS 12353989 + 1 [fio]
>>   8,16   3       51     0.000104150  8184  A  WS 12353990 + 1 <- (8,18) 65990
>>   8,16   3       52     0.000104219  8184  Q  WS 12353990 + 1 [fio]
>>   8,16   3       53     0.000104307  8184  M  WS 12353990 + 1 [fio]
>>   8,16   3       54     0.000104452  8184  A  WS 12353991 + 1 <- (8,18) 65991
>>   8,16   3       55     0.000104520  8184  Q  WS 12353991 + 1 [fio]
>>   8,16   3       56     0.000104609  8184  M  WS 12353991 + 1 [fio]
>>   8,16   3       57     0.000104885  8184  I  WS 12353984 + 8 [fio]
>>
>> whereas in the second case, I'm getting the expected 4k writes:
>>
>>   8,16   6       42 1266874889.659842036  8409  A  WS 12340232 + 8 <- (8,18) 52232
>>   8,16   6       43 1266874889.659842167  8409  Q  WS 12340232 + 8 [fio]
>>   8,16   6       44 1266874889.659842393  8409  G  WS 12340232 + 8 [fio]
>
> This is weird because --size=1G should mean that fio is "seeing" an
> aligned end. Does direct=1 with a sequential job of iodepth=1 show the
> problem too?

IIUC fio uses the size only to decide where to write. It opens
the block device and passes the resulting fd to the fdatasync call, so
the kernel does not know what size fio thinks the device has. In
fact, the effect is the same without the size=1G option; I used it
just to make sure that the writes do not go anywhere near the badly
aligned partition ending.

direct=1 kills the effect, i.e. all writes will be 4k size again.
Astonishingly though, sequential writes also are affected, i.e.
changing to rw=write in my sample above behaves the same as randwrite.
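
To compare runs such as the buffered and direct=1 variants without
reading btrace output by eye, the request sizes can be tallied per
action. This is a minimal Python sketch, assuming the default blkparse
line layout seen in the traces quoted in this thread. The function name
tally_request_sizes and the regular expression are illustrative and not
part of blktrace.

```python
import re
from collections import Counter

# Matches the leading fields of a default blkparse line:
# dev cpu seq timestamp pid action rwbs sector + count
LINE_RE = re.compile(
    r"^\s*(?P<dev>\d+,\d+)\s+(?P<cpu>\d+)\s+(?P<seq>\d+)\s+(?P<ts>[\d.]+)\s+"
    r"(?P<pid>\d+)\s+(?P<action>[A-Z]+)\s+(?P<rwbs>[A-Z]+)\s+"
    r"(?P<sector>\d+)\s+\+\s+(?P<count>\d+)"
)

# A few lines copied from the two traces in this thread.
SPLIT_SAMPLE = """\
  8,16   3       36     0.000102666  8184  A  WS 12353985 + 1 <- (8,18) 65985
  8,16   3       37     0.000102739  8184  Q  WS 12353985 + 1 [fio]
  8,16   3       38     0.000102875  8184  M  WS 12353985 + 1 [fio]
  8,16   3       57     0.000104885  8184  I  WS 12353984 + 8 [fio]
"""
WHOLE_SAMPLE = """\
  8,16   6       42 1266874889.659842036  8409  A  WS 12340232 + 8 <- (8,18) 52232
  8,16   6       43 1266874889.659842167  8409  Q  WS 12340232 + 8 [fio]
  8,16   6       44 1266874889.659842393  8409  G  WS 12340232 + 8 [fio]
"""


def tally_request_sizes(lines, action="Q"):
    """Count request sizes (in sectors) for all lines with the given action."""
    sizes = Counter()
    for line in lines:
        line = line.lstrip("> ")  # tolerate mail quoting in pasted traces
        match = LINE_RE.match(line)
        if match and match.group("action") == action:
            sizes[int(match.group("count"))] += 1
    return sizes


if __name__ == "__main__":
    print("split run, queued:", dict(tally_request_sizes(SPLIT_SAMPLE.splitlines())))
    print("split run, inserted:", dict(tally_request_sizes(SPLIT_SAMPLE.splitlines(), "I")))
    print("whole run, queued:", dict(tally_request_sizes(WHOLE_SAMPLE.splitlines())))
```

Counting Q (queued) and I (inserted) separately shows both the size at
which requests enter the block layer and the size after merging.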

>> The above examples are from running with an SSD, where the small
>> writes get merged together again before hitting the block device,
>> which is still pretty o.k. performance wise. But when I run the same
>> test on some NVMe device, the writes do not get merged, instead the
>> performance drops to less than 10% of what I get in the second case.
>
> Perhaps the ioscheduler doesn't have the opportunity with the NVMe device...

Yes, there is no scheduler available in this case:

$ cat /sys/block/nvme0n1/queue/scheduler
none

This is just to show that the argument "Don't bother, the writes get
merged back together anyway" doesn't hold true in all cases.
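
For checking several devices at once, the scheduler file can also be
read programmatically. The following Python sketch assumes the usual
sysfs format, where the active scheduler is shown in square brackets
(for example "noop deadline [cfq]") or as a single bare name such as
"none". The helper names parse_scheduler and read_scheduler are
hypothetical.

```python
from pathlib import Path


def parse_scheduler(text):
    """Return (active, available) from the contents of queue/scheduler."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]   # the bracketed entry is the active scheduler
            active = name
        available.append(name)
    if active is None and len(available) == 1:
        active = available[0]   # a lone entry such as "none"
    return active, available


def read_scheduler(device, sysfs_root="/sys/block"):
    """Read and parse the scheduler file of a device name like 'nvme0n1'."""
    path = Path(sysfs_root, device, "queue", "scheduler")
    return parse_scheduler(path.read_text())


if __name__ == "__main__":
    print(parse_scheduler("none\n"))
    print(parse_scheduler("noop deadline [cfq]\n"))
```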

>> If this is indeed expected behaviour from the kernel pov, it might
>> need some better documentation and probably sgdisk should also be
>> enhanced to align the end of the partition as well. FWIW, this happens
>> on a stock 4.4.0 kernel as well as recent Ubuntu and CentOS kernels.
>
> Do you mean parted?

No, as I am currently assuming that the issue is caused by some effect
happening inside the kernel during the fdatasync call, there was the
idea that only certain kernels might be affected. But I don't have a
clue yet how far back I would have to go in order to find a kernel
that behaves differently.


