APM 3.3-rc8 beta crash log and forensics. Help appreciated

Hey

Yesterday we had a flight crash with the Navio version of AC3.3-rc8. We had some software panics with the compass and the ArduCopter-octa-quad not starting on bootup. It was spurious and sometimes the APM started.

We did a first flight where everything went well. On the second flight, the one where we had the crash, we had a first sign that things were not good, when the aircraft suddenly changed attitude about 13 seconds before the event that resulted in the crash.

The pilot tried to slowly bring it to the ground so as to land it safely, but the aircraft again changed attitude fast and started moving horizontally in a slightly oblique trajectory. The logs are not perfectly formated because the battery was ejected upon touch down, immediately turning off the Navio.

Even so we detected some very odd things that we find worthy of mentioning and are noted in the graphics:

  1. In both of the uncontrollable moments there is a visible slope change in the PM message TIMEUS and the RCIN/RCOUT TIMEUS values.
  2. For some unexpected reason some time between the first glitch and the crash the PM message stops being emitted.
  3. In the first glitch the desired pitch and RCIN are maintained but the PITCH has a step input out of nowhere.
  4. In the crash glitch(2nd), as can be seen in the video and logs the pilot was slowly reducing altitude when suddenly the aircraft started
  5. The NLon time starts increasing since the beginning of the flight and the last values are 125, which for a NLoop of 4000 amounts to 3.125%
  6. Most importantly the MaxT is completely out of space in the entire flight. This is the most likely reason and the significant difference between the previous normal flight, where the MaxT never goes indecently high. Comparing to the 23.BIN, a flight 10 minutes before with no changes in hardware at all, we find absolutely normal MaxT values for the whole flight as can be seen in the attachment. The MaxT values we used to guide ourselves are stated in Randy Mackay post in Drones Discuss

One important detail is that a USB dongle attached to the Navio. We really really hope it was not the USB but we will make further tests to confirm this. We had previously flied successfully with USBs attached in the previous firmware version. We will later check the logs to try to spot something abnormal.

The first flight log is 23.BIN
The crash flight log is 24.BIN
I also keep a file which is a mavlink dump of the secondary telemetry, don’t know if i really helps. ard.log

We also recorded a bit of video. In the video you can see the slight attitude change around here:

And the moments before the crash the attitude change is:


@pneves , Sorry about the crash! What were the specs of the octa-quad? You said you have a USB dongle, what was the dongle for? 4g, wifi, etc.

In the “attitude change” you point out in your video, does the pilot re-correct from the horizontal movement, or was the entire movement done by the autopilot?

I too was getting intermittent compass panic messages w/ 3.3-rc8 when I initialized arducopter.

My logs were not formatted correctly either, I will compare your logs to mine and see if the formatting is similar.

Keep in mind my flights are on a 550 hexa. I have usb in use on both firmware’s without problem.

@aquila We had a wifi dongle connected. The same setup flew previously without any problems also…Can you share your logs with the dongles on?

In the first attitude change it was so fast the pilot, did not even attempt to correct it.
On the second, due to its violent nature the pilot tried to correct but there was no visible response. As you can see in the last part of the video it just banked violently and started moving horizontally with no changes until it crashed against a wall. It didn’t loose much altitude.

Regarding the logs not being correctly formatted, it seems a bug then. Strangely some really old logs seem to parse correctly but more recent ones not. Even logs produced with the stable version of Ardupilot for Navio. Can you confirm this?

Regarding the compass it seems a general problem then…maybe somebody from the Navio team can weigh in.

Thanks for the feedback @aquila

I got the LogAnalyzer found in Ardupilot repository to work by modifying it to ignore the lack of the ThrCr parameter in mode change. It is really not a solution but after the change it ran with no problems on the Navio’s BIN files. I edited the first post to include the graphs.

The result of the LogAnalizer only shows problems with the compass but i think the offset ones are normal. May be somebody from Navio can tell me what are the Navio normal parameters so I can fork the file and make the LogAnalizer for Navio typical values.

Test Results:

            Empty:  GOOD                                                              
               VCC:  FAIL       VCC below minimum of 4.6v (0.0v)                           [GRAPH]
          Brownout:  GOOD                                                              
        Parameters:  GOOD                                                              
    Event/Failsafe:  GOOD                                                              
           Compass:  FAIL       FAIL: Large compass offset params (X:-229.00, Y:-716.00, Z:-302.00)    [GRAPH]
                                FAIL: Large compass offset in MAG data (X:-229.00, Y:-716.00, Z:-302.00)
                                Large change in mag_field (113.33%)
                                Max mag field length (816.13) > recommended (550.00)
     Dupe Log Data:  GOOD                                                              
               GPS:  GOOD

Hello Paulo,

Sorry about the crash and thank you for a detailed report.

Firstly, regarding the compass offsets - they are not too large, they are just different units. We are introducing changes to APM code, that will allow all compasses report values in SI units.

We will look into your logs first thing tomorrow morning. Please provide as much information as you can. Have you used a fresh image install, or was something added\ reconfigured?

Hi, I was quiet, but same happen to me with this revision and crash too, but i already rebuild it and i am afraid of flying it again so I’m doing a lot of test before i fly it (hexa 680), I didn’t comment because i been analyzing my logs, and trying to understand the problem, i don’t use the USB dongle, so i will say that is not the problem i think. Understand your problem will be a big help for me too, i will take a look at your log and comment later. Your TimeUS step has nothing compere with mine, mine looks like and abyss.

Regards
Mojica

@pneves

Looking at the logs, timing is very bad on both flights (23.bin, 24.bin), so there is definitely something wrong with the set up. I think that if you turn it on and record logs on the bench, you will see bad timings again. Also, try to turn your copter on again and post a screenshot of top command with everything running.

Our initial guess is that you have two instances of APM running at the same time for some reason. They both try to access same buses, sensors etc and this results in bad timing. It is a miracle that it even works for some time this way.

Please post your start up scripts and any modifications you have made to the system, because otherwise it is really hard to tell what had happened.

It is definitely a good call that we should add protection from running second instance of APM.

@porelmundo

We will gladly have a look at your logs as well.

Maybe one could implement a pid-file support and check in the vehicle code and avoid double executions already at this point.

This is my file, I already correct some problems, but is good that you take a look, when start to do erratic movements i get nervous and start change modes, also my transmitter is doing spikes on other modes when i flip the switch, im working on that. Thank you for your support

Regards.

https://mega.nz/#!0QYTADba!MYpJrAmZcmJ4RngLCnzXuWTN3HpOOqIK6a8ocz1ICfs

Very good idea, dgrat.

Igor Please,

How can I check if timing is good or not? Only to double check my envirolment.

And how can I check if two instances of APM running at the same time? Because when I start the code is showing the start message twice.

Try to use:

ps -A or # top

that will show you what is running or use # pstree

@ivereninov Thanks for your feedback. Unfortunately I am almost sure that 2 executions are not happening at the first time.

First because I have tried it before and it doesn’t work. The second instance just aborts because it can’t get a lock on something…(pigpio?) I can’t tell you exactly because I am away from the office. This behavior was consistent in the stable version.

The second reason it is not running is because I have my /etc/rc.local simply with:
sudo ArduCopter-octa-quad -C /dev/ttyAMA0.
(I know the sudo redundant in rc.local which is ran as root, but it surely does not run twice, see 3rd)

The 3rd and to answer @robertb I have killed the Navio many times by PID when needed running:

 ps aux |grep Ardu*

And there are never more than 1 processes. For the record this will also return your own command as a result because the regex will match itself.

There were no further modifications to the system and we ran the original SD card image with the new binaries downloaded from diydrones as stated in the announcement post. I replaced the previous binary directly in /usr/bin/ and added the needed executable file.

i really hope that you are right, and it was my mistake. Unfortunately I cant verify the status of the /etc/rc.local because the card got corrupted later. Even so I will use ext4undelete to recover the specific file when i have the card with me again.

Also, and because it is good to cross check behavior, I would like to know if the second instance also aborts if one is already running. Maybe the 3.3-rc8 doesn’t abort, and the stable did. Hope it helps and thanks for the feedback.

Thanks for the “commands”… I know Linux a little…
But how can I check what’s the process “are running” related with APM, and what’s a good threshold for timing?
Other point, when I started de 3.3 rc-8 I receiving the following message twice , is it normal ?

Raspberry PI 2 with CM2709!
Raspberry PI 2 with CM2709!

Really sorry but I have more than 1.000 dollars flying over my head….

@robertb the process name is the ArduCopter-quad for example if you are running a quad version. Like i said if you put a * you will match all possible Arducopter versions running.

Regarding the repeated lines…indeed that also happens in mine and I noticed it but didnt remember until now. Good catch!

###Running two instances at once
We have conducted tests and it is impossible to run two rc8 instances at the same time, but you actually can run one from the .deb and one rc8 at the same time! It results in really bad timings, similar to what we see in your logs.

Why are we so sure that it is not a random bug? We can never be sure, but it looks like the issue persisted continuously for two flights, which means that it should be directly reproducible. If it was a bug, we would see a short lag and then everything going back to normal. It is really hard to stress the system to a point where performance of APM is so bad, It can only happen if you start using the buses that it tries to access. I see two ways of achieving that - either running second instance of APM or launching some of examples that would try to write to I2C\SPI as well.

###Repeated BCM2709 lines
This is actually normal, it happens due to RPi version detection code working in two separate locations within APM. We will look into optimizing that in the future.

###Checking your timing
Look for PM->maxT in logs which shows maximum main loop execution time. Ideally it should be 2.5ms (2500) for copter with reasonable threshold, if it occasionally spikes to 3-4ms there is nothing to worry about. PM->NLon will show you how many loops actually took more than expected to execute.

In @pneves logs, maxT reaches 130ms! It is a miracle that the thing was even flying. Something was really wrong and we need to find out what exactly.

On this graph, made from @porelmundo logs, you can see perfect timing in flight with Nlon = 0. This is what you would expect when running on defaults.

###Another possible scenario
If you have stopped the stable version, which relies on pigpio by PID, pigpio daemon was not stopped. New rc-8 does not rely on pigpio, but interferes with it. As a result you had rc8 and pigpio running at the same time, which could have caused these issues as well.

###Let’s exclude hardware
To make sure that it is not caused by some weird hardware behaviour please flash a new SD card, run APM and upload the logs. We will be looking at the maxT numbers. If they are reasonable, we will then proceed to reproduce your issue and eliminate the possibility to repeat it for other users.

2 Likes

I just wrote a patch. This is not a real problem. Just some redundant code.

@ivereninov Well you seem convincing and the explanation is thorough but then I am out of words as to how 2 different versions were running at the same time if the /usr/bin ArduCopter binaries that the deb installed, were overwritten by the new versions when I installed 3.3, and this I am very sure of.

The alternative scenario is not on the table because I only kill manually when I log through ssh and not when I launch directly from /etc/rc.local, which was the case for both flight logs.

@ivereninov Thank you for the assistance and we will try a fresh image, repeat my previous installation procedure and check back with you.

@dgrat Where I can i find the patch you wrote?

@dgrat We really appreciate the work that you are doing on APM, but please could we discuss that in a separate topic? It is the second time when other users get confused that it will solve issues that they are experiencing.

Thank you and keep up the good work!

@ivereninov
Yes, I think this was a mistake…