[Nut-upsdev] some megatec-usb issues
Alexander I. Gordeev
lasaine at lvk.cs.msu.su
Wed Feb 7 15:56:11 CET 2007
Hi All,
I've finally found solutions for my previous problems.
But since they include changes to shared files, I'd like
to discuss them here.
1. Driver restart problem.
When I start the driver for the first time, everything is OK. But when I
restart it, problems begin: the driver fails to read the Report descriptor the
second time (libusb_open is used to open the device). This is caused by a quirk
of my particular UPS model. I can read the Report descriptor only once, and
only before reading any "special" string descriptors. After that I can do
everything else without trouble. But if I attempt to read it after reading
string descriptors, then everything stops working (string descriptors too). So I
have to reset the UPS...
This works perfectly under Windows, as the Report descriptor is retrieved only
once, by the OS. The program that drives the UPS doesn't do it.
So the solution is simple: don't retrieve any HID descriptors, as I don't need
them and they cause problems. I think the best time to decide whether to
retrieve the HID stuff is while device matching is performed, since there could
be other devices and subdrivers that need these things. My solution is to
reserve a value that a matcher can return to indicate that we don't need HID,
for example 2. I looked through the other drivers that use libusb_open, and
their matchers all return 1. (If I'm wrong, please tell me.) So this change
wouldn't affect other drivers. I also want to note that any matcher should be
able to prevent retrieval of HID descriptors.
libusb_open already has everything that is needed. If the "mode" variable is set
to MODE_REOPEN, then HID descriptors aren't retrieved. So the simplest thing is
to change it:
--- libusb.c (revision 799)
+++ libusb.c (working copy)
@@ -180,7 +180,7 @@
} else if (ret==-2) {
upsdebugx(2, "matcher: unspecified error");
goto next_device;
- }
+ } else if (ret==2) mode = MODE_REOPEN;
}
upsdebugx(2, "Device matches");
This change is enough, it doesn't affect other drivers and IMHO it makes
libusb_open more generic. Of course, some comments should also be there.
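For illustration, the matcher convention proposed above could be sketched as
follows. Only MODE_REOPEN and the return values 1, 2 and -2 appear in this
message; MODE_OPEN, the enum names, the numeric mode values, the meaning of 0
and the helper apply_matcher_result are assumptions made for this sketch, not
the actual libusb.c code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values; only the name MODE_REOPEN comes from the message. */
enum open_mode { MODE_OPEN = 0, MODE_REOPEN = 1 };

/* Hypothetical names for the matcher return values. */
enum matcher_result {
    MATCH_ERROR      = -2, /* "matcher: unspecified error" */
    MATCH_NO         =  0, /* assumed: device does not match */
    MATCH_YES        =  1, /* device matches, HID descriptors are read */
    MATCH_YES_NO_HID =  2  /* proposed: device matches, skip HID descriptors */
};

/* Returns 1 if the device should be used, 0 if it should be skipped.
 * When the matcher returns 2, *mode is switched to MODE_REOPEN so that
 * the HID descriptors are not retrieved. */
int apply_matcher_result(int ret, enum open_mode *mode)
{
    assert(mode != NULL);
    switch (ret) {
    case MATCH_YES:
        return 1;
    case MATCH_YES_NO_HID:
        *mode = MODE_REOPEN;
        return 1;
    default:
        return 0;
    }
}
```

A matcher that keeps returning 1 leaves the mode untouched, which is why
existing drivers would not be affected.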
2. "UPS No Ack" problem.
I also have a solution for the problem of the "UPS No Ack" answers.
The problem: when I do "informative" requests (Q1, I, F; descriptors 0x3, 0xc
and 0xd respectively) I sometimes get an unusual response: "UPS No Ack". So the
data gets stale rather often. By the way, this is the only thing I get when
doing "non-informative" requests.
Example:
tornado:/home/alex/development/nut/tools# ./get_descriptor 002 003 1 0 0 0x81 3 3
Bus 002 device 003 configuration 1 interface 0 altsetting 0 endpoint 129 descriptor 0x03 index 3:
60 03 28 00 32 00 31 00 37 00 2e 00 30 00 20 00 31 00 36 00 35 00 2e 00
30 00 20 00 32 00 31 00 37 00 2e 00 30 00 20 00 30 00 30 00 30 00 20 00
35 00 30 00 2e 00 30 00 20 00 31 00 33 00 2e 00 34 00 20 00 30 00 30 00
2e 00 30 00 20 00 30 00 30 00 30 00 30 00 31 00 30 00 30 00 30 00 0d 00
`.(.2.1.7...0. .1.6.5...0. .2.1.7...0. .0.0.0. .5.0...0. .1.3...4. .0.0.
..0. .0.0.0.0.1.0.0.0...
tornado:/home/alex/development/nut/tools# ./get_descriptor 002 003 1 0 0 0x81 3 3
Bus 002 device 003 configuration 1 interface 0 altsetting 0 endpoint 129 descriptor 0x03 index 3:
16 03 55 00 50 00 53 00 20 00 4e 00 6f 00 20 00 41 00 63 00 6b 00
..U.P.S. .N.o. .A.c.k.
As you can see, the program doesn't report any errors, so this is not a USB
transfer error. It must be an internal UPS issue. I think my UPS is serial
internally but uses some custom hack to enable USB communication, and sometimes
this mechanism fails because of bad design or something else. There is another
fact in support of this theory: if I power off the UPS, its USB part continues
to work. It behaves much the same, but the only data I can get from these
"special" string descriptors is "UPS No Ack"! So I think the custom
serial-over-USB hardware is USB-powered, but it fails to read data from the
serial part since the latter is powered off.
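The two replies shown above are ordinary USB string descriptors: a length byte,
the type byte 0x03, then UTF-16LE text. A minimal sketch of decoding one and
recognising the "UPS No Ack" reply might look like this. The function names are
hypothetical and not part of the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode a USB string descriptor (bLength, bDescriptorType = 0x03, then
 * UTF-16LE code units) into a NUL-terminated ASCII string. Non-ASCII code
 * units are replaced by '?'. Returns the number of characters written,
 * or -1 if the buffer does not look like a string descriptor. */
int decode_string_descriptor(const unsigned char *buf, size_t len,
                             char *out, size_t outsz)
{
    size_t i, n = 0;

    assert(buf != NULL && out != NULL);
    if (len < 2 || outsz == 0 || buf[1] != 0x03 || buf[0] < 2 || buf[0] > len)
        return -1;

    for (i = 2; i + 1 < buf[0] && n + 1 < outsz; i += 2) {
        unsigned int unit = buf[i] | ((unsigned int)buf[i + 1] << 8);
        out[n++] = (unit < 0x80) ? (char)unit : '?';
    }
    out[n] = '\0';
    return (int)n;
}

/* Returns 1 if the decoded reply is the "UPS No Ack" placeholder. */
int is_no_ack(const char *reply)
{
    assert(reply != NULL);
    return strcmp(reply, "UPS No Ack") == 0;
}
```

Fed the 22 bytes of the second dump (16 03 55 00 ...), the decoder yields the
10-character string "UPS No Ack", so the failure is visible only in the payload,
not in the transfer status.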
My tests show that the driver gets "UPS No Ack" in about 17,5% of all requests.
This is quite a large share, because every time the data gets stale I get some
log records. The result is heavy log pollution: there would be several thousand
records per day. This is unacceptable for what is just buggy hardware.
Also, the driver sometimes refuses to start if there are too many failed
attempts during driver startup.
My solution is to try to get the descriptor again, without delay, in the same
get_data call. You may think this is a bad idea because the additional load
will increase the number of errors. This is not the case. Incredibly, my tests
show that there will be _fewer_ errors with this solution. Furthermore, the UPS
behaves best when it is polled continuously without any delay.
Now, about the tests. I've made 3 versions of the driver: (1) without retries,
(2) with an unlimited number of retries (until success) and (3) one that keeps
asking the UPS without delay even after a success (this was not really a
driver, since it never returns data). Here are their results, respectively:
(1)
All requests: 17779
Fails: 3112
Percent failed: 17,5038
Distribution 1:
19.236178 18.223747 17.211317 14.680241 16.705102 17.211317 17.717532 19.067439
21.092300 14.342764 18.055009 16.198886 16.198886 17.717532 15.355194 18.561224
17.042578 20.248608 15.523933 18.561224 16.536363 16.367625 21.261038 18.898701
15.692671 18.392486 16.030148 17.886270 18.223747 16.873840
Distribution 2:
17.042578 15.861410 15.017718 20.417346 18.561224 20.586085 17.548794 17.042578
19.742393 16.536363 18.223747 20.248608 18.392486 17.211317 16.705102 16.367625
15.692671 16.873840 18.392486 14.174026 20.248608 16.536363 16.536363 15.523933
15.861410 17.211317 18.392486 18.561224 18.561224 17.042578
Continuous fail subsequences lengths:
1: 1769 (73.463455%)
2: 582 (24.169435%)
3: 50 (2.076412%)
4: 6 (0.249169%)
5: 1 (0.041528%)
(2)
All requests: 20056
Fails: 2629
Percent failed: 13,1083
Distribution 1:
11.368169 10.769844 14.509374 10.919426 15.107698 13.462306 14.509374 12.116075
14.210211 14.658955 14.210211 12.116075 13.312724 10.620263 13.163143 13.312724
13.611887 12.564819 13.013562 13.163143 13.013562 15.406861 13.163143 13.462306
12.714400 14.359793 12.863981 11.816913 12.564819 13.163143
Distribution 2:
11.966494 10.769844 13.462306 11.816913 12.863981 13.312724 11.368169 15.706023
11.667331 12.116075 14.958117 11.966494 11.816913 15.706023 12.714400 12.714400
14.658955 12.714400 13.611887 14.958117 13.911049 10.769844 15.556442 12.116075
13.462306 10.470682 13.611887 13.761468 14.060630 14.658955
Continuous fail subsequences lengths:
1: 2623 (99.885758%)
2: 3 (0.114242%)
(3)
All requests: 7349
Fails: 385
Percent failed: 5,23881
Distribution 1:
3.265750 6.123282 4.082188 3.265750 4.898626 7.347938 4.082188 5.306844
9.797251 3.673969 7.756157 12.246564 8.572595 6.939720 5.715063 5.306844
6.531501 3.673969 2.449313 4.490407 4.898626 6.123282 2.449313 2.857532
6.939720 4.898626 4.490407 4.898626 2.041094 2.041094
Distribution 2:
5.715063 5.306844 6.939720 3.673969 1.632875 7.756157 7.756157 4.898626
6.531501 8.164376 5.715063 4.082188 4.490407 4.082188 4.490407 4.490407
7.756157 2.857532 6.123282 4.898626 3.265750 4.490407 3.673969 6.123282
3.265750 4.082188 7.347938 6.123282 4.082188 7.347938
Continuous fail subsequences lengths:
1: 385 (100.000000%)
(By fail sequence I mean the sequence of the numbers (positions) of the failed
attempts in the list of all attempts.)
Distributions 1 and 2 are two simple tests I thought up myself, since I don't
remember how to check sequences for being well-distributed. I think they show
at least that these fail sequences are evenly distributed. :)
In these tests every element of the fail sequence falls into one of 30 clusters
according to a criterion (cnum = number of clusters, e = current element, anum =
overall number of attempts): anum * i/cnum <= e < anum * (i+1)/cnum for the i-th
cluster in the first test, and e%cnum == i for the i-th cluster in the second
test. Then the size of each cluster is printed as a percentage of the maximum
size.
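The two clustering criteria can be sketched in C as below. The function names
are hypothetical. The printed percentages look consistent with dividing each
cluster count by anum/cnum (for example, 19.236178 in test (1) matches 114 fails
against 17779/30), but that reading of "maximum size" is an assumption.

```c
#include <assert.h>

/* Test 1: cluster i holds elements with anum*i/cnum <= e < anum*(i+1)/cnum,
 * i.e. the list of attempts is cut into cnum equal consecutive ranges. */
int cluster_by_range(long e, long anum, int cnum)
{
    assert(cnum > 0 && anum > 0 && e >= 0 && e < anum);
    return (int)((e * cnum) / anum);
}

/* Test 2: cluster i holds elements with e % cnum == i. */
int cluster_by_modulo(long e, int cnum)
{
    assert(cnum > 0 && e >= 0);
    return (int)(e % cnum);
}

/* Cluster size as a percentage of anum/cnum, the largest number of
 * attempts that can fall into one cluster (assumed interpretation). */
double cluster_percent(long count, long anum, int cnum)
{
    assert(cnum > 0 && anum > 0 && count >= 0);
    return 100.0 * (double)count * (double)cnum / (double)anum;
}
```

If the fails were spread perfectly evenly, every cluster in test (1) would show
the overall fail percentage, so the spread around 17,5 gives a rough picture of
how uniform the fails are.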
Maybe the distribution tests don't make much sense for the test with an
unlimited number of retries (2).
These tests show at least that the flow of fails is rather constant.
The other thing which is of particular interest (at least in tests 2 and 3)
is the maximum length of continuous fail subsequences and the number of fail
subsequences of a certain length.
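Counting continuous fail subsequences amounts to a run-length histogram over the
per-attempt results. A minimal sketch, with a hypothetical function name and an
assumed input format (one flag per attempt), could be:

```c
#include <assert.h>
#include <stddef.h>

/* failed[i] is nonzero if attempt i returned "UPS No Ack".
 * hist[k] receives the number of runs of exactly k consecutive fails
 * (1 <= k < max_len); runs of max_len or more are counted in hist[max_len].
 * hist must have max_len + 1 elements. Returns the longest run seen. */
size_t count_fail_runs(const int *failed, size_t n,
                       size_t *hist, size_t max_len)
{
    size_t i, run = 0, longest = 0;

    assert(failed != NULL && hist != NULL && max_len > 0);
    for (i = 0; i <= max_len; i++)
        hist[i] = 0;

    /* Iterate one step past the end so a trailing run is closed too. */
    for (i = 0; i <= n; i++) {
        if (i < n && failed[i]) {
            run++;
            continue;
        }
        if (run > 0) {
            hist[run < max_len ? run : max_len]++;
            if (run > longest)
                longest = run;
            run = 0;
        }
    }
    return longest;
}
```

The sum of k * hist[k] over all k gives back the total number of fails; in test
(1) that is 1769 + 2*582 + 3*50 + 4*6 + 5*1 = 3112, which matches the "Fails"
line.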
These tests show that the probability of a fail is much lower when there is no
delay between attempts.
But retries increase the overall number of attempts in the same time period, so
it is better to compare the probability of a fail in a single get_data call for
both cases, with retries and without retries.
Without retries it is 17,5038%, the same as in the first test.
With retries it is "Fails"/("All requests" - "Fails") from the test with
unlimited retries, because with retries the number of calls is the same as the
number of successful attempts. So with retries it is 15,0858%.
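The arithmetic behind the two figures can be checked with a short sketch. The
function names are hypothetical; the inputs are the "All requests" and "Fails"
lines of tests (1) and (2).

```c
#include <assert.h>

/* Share of failed attempts among all attempts, in percent. */
double fail_percent_per_attempt(long requests, long fails)
{
    assert(requests > 0 && fails >= 0 && fails <= requests);
    return 100.0 * (double)fails / (double)requests;
}

/* Fails per get_data call when every call retries until success:
 * the number of calls then equals the number of successful attempts. */
double fail_percent_per_call(long requests, long fails)
{
    assert(fails >= 0 && requests > fails);
    return 100.0 * (double)fails / (double)(requests - fails);
}
```

With the numbers above, fail_percent_per_attempt(17779, 3112) gives about
17.5038 and fail_percent_per_call(20056, 2629) gives 2629/17427, about 15.0858.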
These values show that the probability of a failed attempt per single get_data
call is _lower_ if my solution is used. This comes on top of the benefit of
having far fewer useless records in the logs!
Tests show that having 2-3 extra attempts would be quite enough. I'd rather not
have an unlimited number of retries because I want to detect when the UPS
powers off.
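A bounded retry inside a single get_data call might be sketched like this. The
callback type, the return codes, the function names and the stand-in device are
all assumptions for illustration; the real get_data in the driver may look
different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: fills buf with one decoded reply and returns 0,
 * or returns a negative value on a USB transfer error. */
typedef int (*request_fn)(void *ctx, char *buf, size_t bufsz);

/* Ask again immediately while the UPS answers "UPS No Ack".
 * max_tries is the total number of attempts (first try plus retries).
 * Returns 0 on a usable reply, -1 on a transfer error and -2 if every
 * attempt returned "UPS No Ack" (e.g. the UPS is powered off). */
int get_data_with_retries(request_fn request, void *ctx,
                          char *buf, size_t bufsz, int max_tries)
{
    int attempt;

    assert(request != NULL && buf != NULL && bufsz > 0 && max_tries > 0);
    for (attempt = 0; attempt < max_tries; attempt++) {
        if (request(ctx, buf, bufsz) < 0)
            return -1;
        if (strcmp(buf, "UPS No Ack") != 0)
            return 0;
        /* deliberately no sleep here: retry without delay */
    }
    return -2;
}

/* Stand-in for the real device, used only to exercise the loop: answers
 * "UPS No Ack" no_acks_left times, then the status line from the dump. */
struct fake_ups { int no_acks_left; int calls; };

int fake_request(void *ctx, char *buf, size_t bufsz)
{
    struct fake_ups *ups = ctx;
    const char *reply = (ups->no_acks_left > 0)
        ? "UPS No Ack"
        : "(217.0 165.0 217.0 000 50.0 13.4 00.0 00001000\r";

    ups->calls++;
    if (ups->no_acks_left > 0)
        ups->no_acks_left--;
    strncpy(buf, reply, bufsz - 1);
    buf[bufsz - 1] = '\0';
    return 0;
}
```

With max_tries set to 3 (one attempt plus two retries), a device that only ever
answers "UPS No Ack" is reported after three attempts instead of being polled
forever.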
What do you think about all this? Please tell me if you want me to define
anything more precisely.
I can say that with these two solutions applied the driver is pretty stable.
--
Alexander