Need help diagnosing failure with Linux [CLOSED]

Xan
edited December 2013 in BlinkyTape Troubleshooting
Hello,

I need help diagnosing some issues that BlinkyTape has under Linux. NONE of these problems occur on Windows with the very same code.

I've been using the provided Python library, with a few modifications, but I consistently run into the following failure (with the original firmware): after a certain number of 255 control bytes have been sent, BlinkyTape starts behaving erratically. Any of the following can happen:

1) The port will give abysmal pauses, about a second long, on writes.
2) BlinkyTape crashes and stops responding.
3) BlinkyTape crashes and reboots.
4) BlinkyTape starts displaying the data sent incorrectly (wrong pixel, wrong color).

Interesting pointers:
* Effect (1) is prevalent, and the place at which it happens in the stream of commands is deterministic: it doesn't depend on host hardware (checked with another Linux machine), it doesn't depend on the command rate, and it doesn't depend on pixel data sent, only on the number of 255-codes sent.
* If you close and then reopen the serial port, the effect vanishes and you can usually run without problems; not always though, Blinky can still crash or the port may give an I/O error.
* If you just issue commands as fast as you can, without pauses of at least 10 ms between control characters, Blinky usually crashes promptly.

This might have something to do with how the serial port / USB driver works on Linux, or with the PySerial library. Maybe it's just that the serial protocol handling differs from Windows.

I'm still working on isolating the root of the problem. However, I'm more of a QA guy than a developer. Therefore, I created test code for you to try and replicate the same problem:
https://github.com/kav2k/BlinkyTape_Python/tree/master/failure_test

* scanline.py is a script that waits 10 ms between each pixel array sent and reopens the port every time. This is mostly stable.
* scanline_no_reopen.py is a script that waits 10 ms between each pixel array sent, but keeps the port open. On both machines I tested, this exhibited problem (1) after about 4000 animation frames sent. Debug mode, which is enabled, will count the frames for you.
* scanline_half_no_reopen.py is the same as above, but only uses 30 pixels. Data stream is therefore different, but the effect is the same as for no_reopen, down to the same number of frames.
* scanline_no_reopen_no_buffer.py additionally doesn't use my command buffering, making it closer to the original library. It still doesn't work; it's just slower overall because each command is sent separately.
* scanline_no_reopen_no_wait.py is the last one; it issues commands as fast as you can pump them out and BlinkyTape promptly crashes.
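
For anyone who wants to pin down where problem (1) starts without the test scripts, here is a rough sketch of the idea: time every write and report the first frame whose write takes suspiciously long. This is not the code from the repository above; the function name `first_slow_write`, the 0.5 s threshold and the injectable `clock`/`sleep` arguments are all assumptions made for illustration. `port` can be any object with a `write()` method, such as an open pySerial port.

```python
import time


def first_slow_write(port, frames, threshold=0.5, delay=0.010,
                     clock=time.monotonic, sleep=time.sleep):
    """Return the index of the first frame whose write exceeds `threshold`
    seconds, or None if every write was fast.

    `threshold` and `delay` are illustrative defaults (hypothetical), chosen
    to match the roughly one-second stalls and 10 ms pauses described above.
    """
    for index, frame in enumerate(frames):
        started = clock()
        port.write(frame)
        if clock() - started > threshold:
            return index
        sleep(delay)  # pause between frames, as the scanline scripts do
    return None
```

If the stall really depends only on the number of 255 codes sent, the returned index should be the same from run to run on a given setup.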

Note, ALL OF IT works under Windows with no problems, even the no_wait one. I haven't tried OS X yet.

I need your help solving this enigma.
The questions, assuming you can reproduce the problem, are:
* Where precisely does the fault lie?
* Why is it happening?
* Is there an easy way to fix it, say, by setting some PySerial options?
* If not, who is to blame (i.e., who should be asked for a fix)?

Best regards,
Alexander.

Comments

  • Xan
    edited December 2013
    P.S. Test machines ran Ubuntu 12.04 and 12.10
    I tested with the provided versions of python, pyserial and arduino-core, and also tried upgrading the last two to the latest versions. No dice.
  • Thanks for the super detailed analysis. I've got a 12.04 machine here and will go through this procedure later.

    Serial support across platforms does seem to be wildly varying, and of course pySerial has a different backend for each platform.

    We've had good success with Linux and pySerial on previous projects (domeStar and Crystal Archway). There were some nonobvious tricks required, though, such as sending smaller blocks of data (10 bytes, 20 bytes, etc.) at a time, in order to trick the USB serial driver into sending data at a rate that the Arduino firmware can handle without crashing.

    We ended up rolling a non-Arduino based firmware using LUFA, which might be an option for the BlinkyTape as well. It still uses the bootloader, so it doesn't prevent anyone from using Arduino sketches, and the stock firmware can be flashed back on transparently using PatternPaint.

    Another option that we've used has been to write a BlinkyTape host program in C++, and then connect to it from Python using a port, circumventing pySerial completely. That's a bit heavy-handed though, and it would be better to find a suitable solution using pySerial for desktop (Ubuntu 12.04 on desktop and Raspberry Pi are the main targets).
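
The small-blocks trick mentioned above could look roughly like the sketch below. It is not the code from domeStar or Crystal Archway; `write_in_chunks`, the default chunk size of 20 and the optional `pause` are assumptions for illustration, and `RecordingPort` is only a stand-in so the example runs without hardware. With real hardware, `port` would be an open pySerial port.

```python
import time


class RecordingPort:
    """Hypothetical stand-in for a serial port; remembers the size of each write."""

    def __init__(self):
        self.sizes = []

    def write(self, data):
        self.sizes.append(len(data))
        return len(data)


def write_in_chunks(port, data, chunk_size=20, pause=0.0):
    """Write `data` in blocks of at most `chunk_size` bytes.

    `pause` (seconds between blocks) is an extra knob assumed here; the
    comment above only mentions splitting the data into smaller blocks.
    """
    for start in range(0, len(data), chunk_size):
        port.write(data[start:start + chunk_size])
        if pause:
            time.sleep(pause)
```

For a 60-pixel frame plus the control byte (181 bytes), `write_in_chunks(port, frame)` issues nine 20-byte writes followed by a single 1-byte write.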
  • Okay, I found the culprit (and can't believe I didn't before).

    While diagnosing the problem with my friend, we took another look at the firmware, and noticed that it writes back a diagnostic byte for every show command. Problem is, the Python library is not reading those from the queue.

    On Windows, this causes no problems whatsoever. But under Linux the port and/or BlinkyTape controller start to behave erratically when the buffer fills up.

    As soon as we remove that diagnostic write, the problems disappear. That correlates with the facts above (depends on the number of shows and not on the rest of the stream or the rate).

    I'm going to see how the library should be modified to work with the stock firmware and the firmware without diagnostics at the same time.
  • Aha, that makes sense. I think if you just read any data that came back from the tape after (or before) a write to it, that should flush it out. Thanks for catching it!
  • Indeed, now it is simply a matter of adding the following line to show() on the Python side:

    self.serial.flushInput() # Clear responses from BlinkyTape, if any

    Consider this question closed; I will tidy up my library version and pull-request it in the main repo.
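
To show where that line sits, here is a minimal sketch of a show() that drains the replies after every frame. It is not the library's actual code: the free-standing `show(port, pixels)` signature, clamping channel values to 254 so they cannot be mistaken for the 255 control code, and the `FakePort` class (which imitates the one-reply-byte-per-show behaviour described above, with a placeholder value of 0) are all assumptions made so the example runs without a BlinkyTape attached.

```python
class FakePort:
    """Hypothetical stand-in for a serial port with an unread reply queue."""

    def __init__(self):
        self.sent = bytearray()
        self.pending = bytearray()  # reply bytes the host has not read yet

    def write(self, data):
        self.sent.extend(data)
        # Imitate the firmware behaviour described above: one diagnostic
        # byte comes back for every 255 (show) code received.
        self.pending.extend(b"\x00" * data.count(b"\xff"))
        return len(data)

    def flushInput(self):
        self.pending.clear()


def show(port, pixels):
    """Send one frame of (r, g, b) tuples, then discard any replies."""
    frame = bytearray()
    for r, g, b in pixels:
        # Assumption: 255 is reserved as the control code, so clamp to 254.
        frame.extend(min(channel, 254) for channel in (r, g, b))
    frame.append(255)  # control code that ends the frame
    port.write(bytes(frame))
    port.flushInput()  # Clear responses from BlinkyTape, if any
```

Without the flushInput() call, `port.pending` in this sketch grows by one byte per frame, which mirrors the unread queue that filled up on Linux. As far as I know, newer pySerial releases (3.0 and later) call the same operation reset_input_buffer() and keep flushInput() only as a deprecated alias.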
  • Cool, thanks Xan!