[Nut-upsdev] some megatec-usb issues

Wed Feb 7 15:56:11 CET 2007

Hi All,

I've finally found solutions for my previous problems.
But since they involve changes to shared files, I'd like
to discuss them here.

1. Driver restart problem.
When I start the driver for the first time, everything is OK. But when I
restart it, the problems begin: the driver fails to read the Report descriptor
the second time (libusb_open is used to open the device). This is a quirk of my
particular UPS model. I can read the Report descriptor only once, and only
before reading any of the "special" string descriptors. After that I can do
everything else without trouble. But if I try to read it after reading the
string descriptors, then everything stops working (the string descriptors too),
and I have to reset the UPS.

This works perfectly under Windows, because there the Report descriptor is
retrieved only once, by the OS. The program that drives the UPS doesn't do it.

So the solution is simple: don't retrieve any HID descriptors, since I don't
need them and they cause problems. I think the best time to decide whether to
retrieve the HID stuff is while device matching is performed, because there may
be other devices and subdrivers that do need it. My solution is to reserve a
value that a matcher can return to indicate that HID is not needed, for
example 2. I looked through the other drivers that use libusb_open, and their
matchers all return 1 on a match. (If I'm wrong, please tell me.) So this
change wouldn't affect other drivers. I also want to note that any matcher
should be able to prevent the HID descriptors from being retrieved.

libusb_open already has everything that is needed. If the "mode" variable is set
to MODE_REOPEN, the HID descriptors aren't retrieved. So the simplest thing is
to change it:

--- libusb.c    (revision 799)
+++ libusb.c    (working copy)
@@ -180,7 +180,7 @@
                                 } else if (ret==-2) {
                                         upsdebugx(2, "matcher: unspecified error");
                                         goto next_device;
-                               }
+                               } else if (ret==2) mode = MODE_REOPEN;
                         }
                         upsdebugx(2, "Device matches");

This change is enough, it doesn't affect other drivers and IMHO it makes
libusb_open more generic. Of course, some comments should also be there.
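
To spell out the intended mapping from matcher result to open mode, here is a
minimal sketch. Only MODE_REOPEN, the value 1 for a normal match and the
proposed value 2 come from this mail; MODE_OPEN, the numeric mode values and
both helper names are assumptions made for illustration, not the actual NUT
definitions.

```c
#include <assert.h>

/* MODE_REOPEN is the name used by libusb_open; MODE_OPEN and the numeric
 * values are assumed here for illustration only. */
enum open_mode { MODE_OPEN = 0, MODE_REOPEN = 1 };

/* Hypothetical names for the two "match" results discussed in this mail. */
enum { MATCH_OK = 1, MATCH_NO_HID = 2 };

/* True if the matcher accepted the device, with or without HID. */
int matcher_accepts(int ret)
{
    return ret == MATCH_OK || ret == MATCH_NO_HID;
}

/* Mode to continue with after a successful match: a result of 2 switches to
 * MODE_REOPEN, so the HID descriptors are skipped; a result of 1 leaves the
 * mode chosen by the caller untouched. */
enum open_mode mode_after_match(int ret, enum open_mode mode)
{
    assert(matcher_accepts(ret));
    return (ret == MATCH_NO_HID) ? MODE_REOPEN : mode;
}
```

Existing matchers that return 1 keep their current behaviour; only a matcher
that deliberately returns 2 changes what libusb_open does.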

2. "UPS No Ack" problem.

I also have a solution for the problem of the "UPS No Ack" answers.
The problem: when I do "informative" requests (Q1, I, F - descriptors 0x3, 0xc
and 0xd respectively), I sometimes get an unusual response, "UPS No Ack", so the
data goes stale rather often. By the way, this is the only answer I ever get
to "non-informative" requests.

Example:

tornado:/home/alex/development/nut/tools# ./get_descriptor 002 003 1 0 0 0x81 3 3
Bus 002 device 003 configuration 1 interface 0 altsetting 0 endpoint 129 descriptor 0x03 index 3:

  60 03 28 00 32 00 31 00 37 00 2e 00 30 00 20 00 31 00 36 00 35 00 2e 00
  30 00 20 00 32 00 31 00 37 00 2e 00 30 00 20 00 30 00 30 00 30 00 20 00
  35 00 30 00 2e 00 30 00 20 00 31 00 33 00 2e 00 34 00 20 00 30 00 30 00
  2e 00 30 00 20 00 30 00 30 00 30 00 30 00 31 00 30 00 30 00 30 00 0d 00

  `.(.2.1.7...0. .1.6.5...0. .2.1.7...0. .0.0.0. .5.0...0. .1.3...4. .0.0.
  ..0. .0.0.0.0.1.0.0.0...
tornado:/home/alex/development/nut/tools# ./get_descriptor 002 003 1 0 0 0x81 3 3
Bus 002 device 003 configuration 1 interface 0 altsetting 0 endpoint 129 descriptor 0x03 index 3:

  16 03 55 00 50 00 53 00 20 00 4e 00 6f 00 20 00 41 00 63 00 6b 00

  ..U.P.S. .N.o. .A.c.k.

As you can see, the program doesn't report any errors, so this is not a USB
transfer error. It must be an internal UPS issue. I think my UPS is serial
internally and uses some custom hack to provide USB communication, and sometimes
this mechanism fails because of bad design or for some other reason. There is
another fact in support of this theory: if I power off the UPS, its USB part
continues to work. It behaves much the same, but the only data I can get from
the "special" string descriptors is "UPS No Ack"! So I think the custom
serial-over-USB hardware is USB-powered, but it fails to read data from the
serial part since the latter is powered off.
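
For reference, the dumps above are ordinary USB string descriptors: a length
byte, the type byte 0x03, then the text as 16-bit little-endian characters.
A minimal sketch of decoding one and spotting the "UPS No Ack" answer could
look like this. Both function names are made up for illustration and are not
part of the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode a USB string descriptor (bLength, bDescriptorType 0x03, UTF-16LE
 * text) into a NUL-terminated ASCII string. Non-ASCII characters become '?'.
 * Returns the number of characters, or -1 if the buffer is not a well-formed
 * string descriptor or the output buffer is too small. */
int decode_string_descriptor(const unsigned char *desc, size_t len,
                             char *out, size_t outsize)
{
    size_t nchars, i;

    assert(desc != NULL && out != NULL && outsize > 0);
    if (len < 2 || desc[1] != 0x03)
        return -1;
    if (desc[0] < 2 || desc[0] > len || desc[0] % 2 != 0)
        return -1;

    nchars = (size_t)(desc[0] - 2) / 2;
    if (nchars >= outsize)
        return -1;

    for (i = 0; i < nchars; i++) {
        unsigned char lo = desc[2 + 2 * i];
        unsigned char hi = desc[3 + 2 * i];
        out[i] = (hi == 0 && lo < 0x80) ? (char)lo : '?';
    }
    out[nchars] = '\0';
    return (int)nchars;
}

/* True if the decoded reply is the UPS's "no answer" placeholder. */
int is_no_ack(const char *reply)
{
    return strcmp(reply, "UPS No Ack") == 0;
}
```

Since the transfer itself succeeds, a check like is_no_ack() on the decoded
text is the only way to tell this failure apart from real data.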

My tests show that the driver gets "UPS No Ack" in about 17.5% of all requests.
That is a lot, because every time the data goes stale I get log records. The
result is heavy log pollution: several thousand records per day. This is
unacceptable for what is just buggy hardware.
Also, the driver sometimes refuses to start if there are too many failed
attempts during startup.

My solution is to try to get the descriptor again, without delay, in the same
get_data call. You may think this is a bad idea because the additional load
would increase the number of errors. That is not the case. It is hard to
believe, but my tests show that there are _fewer_ errors with this solution.
Furthermore, the UPS behaves best when it is queried continuously without any
delay.

Now, about the tests. I've made 3 versions of the driver: (1) without retries,
(2) with an unlimited number of retries (until success) and (3) one which
keeps querying the UPS without delay even after a success (this was not really
a driver, since it never returns data). Here are their results,
respectively:

(1)

All requests: 17779
Fails: 3112
Percent failed: 17,5038
Distribution 1:
19.236178       18.223747       17.211317       14.680241       16.705102       17.211317       17.717532       19.067439
21.092300       14.342764       18.055009       16.198886       16.198886       17.717532       15.355194       18.561224
17.042578       20.248608       15.523933       18.561224       16.536363       16.367625       21.261038       18.898701
15.692671       18.392486       16.030148       17.886270       18.223747       16.873840
Distribution 2:
17.042578       15.861410       15.017718       20.417346       18.561224       20.586085       17.548794       17.042578
19.742393       16.536363       18.223747       20.248608       18.392486       17.211317       16.705102       16.367625
15.692671       16.873840       18.392486       14.174026       20.248608       16.536363       16.536363       15.523933
15.861410       17.211317       18.392486       18.561224       18.561224       17.042578
Continuous fail subsequences lengths:
1:      1769 (73.463455%)
2:      582 (24.169435%)
3:      50 (2.076412%)
4:      6 (0.249169%)
5:      1 (0.041528%)

(2)

All requests: 20056
Fails: 2629
Percent failed: 13,1083
Distribution 1:
11.368169       10.769844       14.509374       10.919426       15.107698       13.462306       14.509374       12.116075
14.210211       14.658955       14.210211       12.116075       13.312724       10.620263       13.163143       13.312724
13.611887       12.564819       13.013562       13.163143       13.013562       15.406861       13.163143       13.462306
12.714400       14.359793       12.863981       11.816913       12.564819       13.163143
Distribution 2:
11.966494       10.769844       13.462306       11.816913       12.863981       13.312724       11.368169       15.706023
11.667331       12.116075       14.958117       11.966494       11.816913       15.706023       12.714400       12.714400
14.658955       12.714400       13.611887       14.958117       13.911049       10.769844       15.556442       12.116075
13.462306       10.470682       13.611887       13.761468       14.060630       14.658955
Continuous fail subsequences lengths:
1:      2623 (99.885758%)
2:      3 (0.114242%)

(3)

All requests: 7349
Fails: 385
Percent failed: 5,23881
Distribution 1:
3.265750        6.123282        4.082188        3.265750        4.898626        7.347938        4.082188        5.306844
9.797251        3.673969        7.756157        12.246564       8.572595        6.939720        5.715063        5.306844
6.531501        3.673969        2.449313        4.490407        4.898626        6.123282        2.449313        2.857532
6.939720        4.898626        4.490407        4.898626        2.041094        2.041094
Distribution 2:
5.715063        5.306844        6.939720        3.673969        1.632875        7.756157        7.756157        4.898626
6.531501        8.164376        5.715063        4.082188        4.490407        4.082188        4.490407        4.490407
7.756157        2.857532        6.123282        4.898626        3.265750        4.490407        3.673969        6.123282
3.265750        4.082188        7.347938        6.123282        4.082188        7.347938
Continuous fail subsequences lengths:
1:      385 (100.000000%)

(By a fail sequence I mean the sequence of positions of the failed attempts in
the list of all attempts.)

Distributions 1 and 2 are two simple tests I thought up myself, since I don't
remember how to check whether a sequence is well-distributed. I think they show
at least that these fail sequences are spread evenly. :)
In these tests every element of the fail sequence falls into one of 30 clusters
if it meets a criterion (cnum = number of clusters, e = current element, anum =
overall number of attempts): anum * i/cnum <= e < anum * (i+1)/cnum for the i-th
cluster in the first test, and e%cnum == i for the i-th cluster in the second
test. Then the size of each cluster is printed as a percentage of the maximum
size.
Maybe the distribution tests don't make much sense for the test with an
unlimited number of retries (2).
These tests show at least that the flow of failures is rather constant.
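
A minimal sketch of the two clustering criteria, for anyone who wants to
reproduce the numbers. The function names are made up for this mail. Reading
"maximum size" as anum/cnum is an inference from the printed figures (for
example, 114 failures in one cluster of run (1) gives 19.236178), and positions
are assumed to be counted from 0.

```c
#include <assert.h>

/* Test 1: cluster i holds positions e with
 * anum * i/cnum <= e < anum * (i+1)/cnum, i.e. i = floor(e * cnum / anum). */
int cluster_by_position(long e, long anum, int cnum)
{
    assert(anum > 0 && cnum > 0 && e >= 0 && e < anum);
    return (int)(e * cnum / anum);
}

/* Test 2: cluster i holds positions e with e % cnum == i. */
int cluster_by_remainder(long e, int cnum)
{
    assert(cnum > 0 && e >= 0);
    return (int)(e % cnum);
}

/* Cluster size as a percentage of the largest possible size, assumed here
 * to be anum / cnum (every attempt in that cluster failed). */
double cluster_percent(long count, long anum, int cnum)
{
    assert(anum > 0 && cnum > 0 && count >= 0);
    return 100.0 * (double)count * (double)cnum / (double)anum;
}
```

The first criterion groups failures by time slice, the second interleaves them,
so a burst of failures would stand out in test 1 and a periodic pattern in
test 2.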

The other thing which is of particular interest (at least in the tests 2 and 3)
is the maximum length of continuous fail subsequences and the number of fail
subsequences of a certain length.
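
Counting these subsequences is straightforward. A minimal sketch, assuming the
attempts are recorded as an array of flags (nonzero = failed); the function
name and the data layout are illustrative and not taken from my test program.

```c
#include <assert.h>
#include <stddef.h>

/* Build a histogram of the lengths of consecutive-failure runs.
 * failed[i] != 0 means attempt i failed. hist must have maxlen + 1 entries;
 * hist[k] receives the number of runs of length k, and runs longer than
 * maxlen are counted in hist[maxlen]. Returns the longest run seen. */
size_t count_fail_runs(const unsigned char *failed, size_t n,
                       size_t *hist, size_t maxlen)
{
    size_t i, run = 0, longest = 0;

    assert(failed != NULL && hist != NULL && maxlen >= 1);
    for (i = 0; i <= maxlen; i++)
        hist[i] = 0;

    /* Iterate one step past the end so a run that ends the array is closed. */
    for (i = 0; i <= n; i++) {
        if (i < n && failed[i]) {
            run++;
            continue;
        }
        if (run > 0) {
            hist[run < maxlen ? run : maxlen]++;
            if (run > longest)
                longest = run;
            run = 0;
        }
    }
    return longest;
}
```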

These tests show that the probability of a failure is much lower when there is
no delay between attempts.

But with retries the overall number of attempts in the same time period
increases, so it is better to compare the probability of a failure per single
get_data call, with and without retries.
Without retries it is 17.5038%, the same as in the first test.
With retries it is "Fails"/("All requests" - "Fails") from the test with
unlimited retries, because with retries the number of calls equals the number
of successful attempts. That gives 15.0858%.
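
The arithmetic behind these two figures, as a small sketch (the function names
are made up for illustration). Strictly speaking the second value is the
average number of failed attempts per call, expressed as a percentage.

```c
#include <assert.h>

/* Failures per attempt, in percent: Fails / All requests. */
double fail_percent_per_attempt(long attempts, long fails)
{
    assert(attempts > 0 && fails >= 0 && fails <= attempts);
    return 100.0 * (double)fails / (double)attempts;
}

/* Failures per get_data call when every call retries until it succeeds:
 * the number of calls equals the number of successful attempts, so the
 * ratio is Fails / (All requests - Fails). */
double fail_percent_per_call(long attempts, long fails)
{
    assert(fails >= 0 && fails < attempts);
    return 100.0 * (double)fails / (double)(attempts - fails);
}
```

With the totals from run (1), 3112 of 17779, the first function gives the
17.5038 figure; with the totals from run (2), 2629 of 20056, the second gives
15.0858.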

These values show that the probability of a failed attempt per single get_data
call is _lower_ if my solution is used. This comes on top of the benefit of
having far fewer useless records in the logs!

Tests show that 2-3 extra attempts would be quite enough. I'd rather not have
an unlimited number of retries, because I want to detect when the UPS powers
off.
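
A minimal sketch of what a bounded retry could look like. This is not the
actual get_data code: the query callback type, the return codes and fake_ups
(a stand-in used only to exercise the loop) are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical query callback: fills buf with the decoded reply and
 * returns 0, or returns nonzero on a real USB transfer error. */
typedef int (*query_fn)(void *ctx, char *buf, size_t buflen);

/* Ask once, then retry immediately (no delay) up to max_extra more times
 * while the UPS answers "UPS No Ack".
 * Returns 0 on success, -1 on a transfer error, -2 if every attempt got
 * "UPS No Ack" (for example because the UPS is powered off). */
int get_data_with_retries(query_fn query, void *ctx,
                          char *buf, size_t buflen, int max_extra)
{
    int attempt;

    assert(query != NULL && buf != NULL && buflen > 0 && max_extra >= 0);
    for (attempt = 0; attempt <= max_extra; attempt++) {
        if (query(ctx, buf, buflen) != 0)
            return -1;
        if (strcmp(buf, "UPS No Ack") != 0)
            return 0;
    }
    return -2;
}

/* Stand-in for the UPS: answers "UPS No Ack" while *ctx > 0, then a Q1-style
 * status line like the one in the dump earlier in this mail. */
static int fake_ups(void *ctx, char *buf, size_t buflen)
{
    int *remaining = ctx;
    const char *reply = "(217.0 165.0 217.0 000 50.0 13.4 00.0 00001000";

    if (*remaining > 0) {
        (*remaining)--;
        reply = "UPS No Ack";
    }
    strncpy(buf, reply, buflen - 1);
    buf[buflen - 1] = '\0';
    return 0;
}
```

With max_extra set to 2 or 3, the occasional single failure is hidden from the
logs, while a UPS that only ever answers "UPS No Ack" is still reported after a
few attempts.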


What do you think about all this? Please tell me if you'd like me to describe
anything in more detail.

I can say that with these two solutions applied the driver works quite stably.

-- 
   Alexander