Ok, I’ve got some updates on the issue I’m having in particular. I’ve done loads of reboots and have gathered some more info.
A general point. There appear to be four different outcomes of any given boot. Note that the estimates % of occurrence really is a rough estimate:
A (roughly 50% of boots). Everything works as intended. Example dmesg:
$ sudo dmesg | grep -i cros_ec
[ 2.292908] cros_ec_lpcs GOOG0004:00: Chrome EC device registered
[ 2.297332] input: cros_ec_buttons as /devices/pci0000:00/0000:00:1f.0/PNP0C09:00/GOOG0004:00/GOOG0007:00/input/input6
B (roughly 30% of boots). There are errors in dmesg but suspend works correctly:
$ sudo dmesg | grep -i cros_ec
[ 2.272811] cros_ec_lpcs GOOG0004:00: packet too long (4 bytes, expected 0)
[ 2.285466] input: cros_ec_buttons as /devices/pci0000:00/0000:00:1f.0/PNP0C09:00/GOOG0004:00/GOOG0007:00/input/input6
[ 2.384045] cros_ec_lpcs GOOG0004:00: Chrome EC device registered
C (roughly 10% of boots). There are errors in dmesg, though the Chrome EC device gets registered. Sending suspend to the EC does not work:
$ journalctl --boot=-30 -k | grep cros_ec
Nov 26 02:42:46 archlinux kernel: input: cros_ec_buttons as /devices/pci0000:00/0000:00:1f.0/PNP0C09:00/GOOG0004:00/GOOG0007:00/input/input6
Nov 26 02:42:46 archlinux kernel: cros_ec_lpcs GOOG0004:00: failed to retrieve wake mask: -22
Nov 26 02:42:46 archlinux kernel: cros_ec_lpcs GOOG0004:00: Chrome EC device registered
In that case here’s some dmesg output from a suspend that fails to communicate with the EC:
Nov 26 02:50:47 archlinux kernel: PM: suspend entry (s2idle)
Nov 26 02:50:47 archlinux kernel: Filesystems sync: 0.027 seconds
Nov 26 02:51:13 archlinux kernel: Freezing user space processes
Nov 26 02:51:13 archlinux kernel: Freezing user space processes completed (elapsed 0.001 seconds)
Nov 26 02:51:13 archlinux kernel: OOM killer disabled.
Nov 26 02:51:13 archlinux kernel: Freezing remaining freezable tasks
Nov 26 02:51:13 archlinux kernel: Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
Nov 26 02:51:13 archlinux kernel: printk: Suspending console(s) (use no_console_suspend to debug)
Nov 26 02:51:13 archlinux kernel: wlan0: deauthenticating from 94:2a:6f:b6:d7:9f by local choice (Reason: 3=DEAUTH_LEAVING)
Nov 26 02:51:13 archlinux kernel: ACPI: EC: interrupt blocked
Nov 26 02:51:13 archlinux kernel: ACPI: EC: interrupt unblocked
Nov 26 02:51:13 archlinux kernel: i915 0000:00:02.0: [drm] GT0: GuC firmware i915/tgl_guc_70.bin version 70.49.4
Nov 26 02:51:13 archlinux kernel: i915 0000:00:02.0: [drm] GT0: HuC firmware i915/tgl_huc.bin version 7.9.3
Nov 26 02:51:13 archlinux kernel: i915 0000:00:02.0: [drm] GT0: HuC: authenticated for all workloads
Nov 26 02:51:13 archlinux kernel: i915 0000:00:02.0: [drm] GT0: GUC: submission enabled
Nov 26 02:51:13 archlinux kernel: i915 0000:00:02.0: [drm] GT0: GUC: SLPC enabled
Nov 26 02:51:13 archlinux kernel: i915 0000:00:02.0: [drm] GT0: GUC: RC enabled
Nov 26 02:51:13 archlinux kernel: cros_ec_lpcs GOOG0004:00: Transfer error 2/4: -22
Nov 26 02:51:13 archlinux kernel: cros-ec-keyb GOOG0007:00: PM: dpm_run_callback(): acpi_subsys_resume returns -22
Nov 26 02:51:13 archlinux kernel: cros-ec-keyb GOOG0007:00: PM: failed to resume: error -22
Nov 26 02:51:13 archlinux kernel: mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915])
Nov 26 02:51:13 archlinux kernel: OOM killer enabled.
Nov 26 02:51:13 archlinux kernel: Restarting tasks ...
Nov 26 02:51:13 archlinux kernel: mei_pxp 0000:00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:00:02.0 (ops i915_pxp_tee_component_ops [i915])
Nov 26 02:51:13 archlinux kernel: done.
Nov 26 02:51:13 archlinux kernel: random: crng reseeded on system resumption
Nov 26 02:51:13 archlinux kernel: PM: suspend exit
D(roughly 10% of boots). The Chrome EC device fails to register completely:
$ sudo dmesg | grep -i cros_ec
[ 4.358557] cros_ec_lpcs GOOG0004:00: bad packet checksum 95
[ 5.359213] cros_ec_lpcs GOOG0004:00: EC response timed out
[ 5.359220] cros_ec_lpcs GOOG0004:00: Transfer error 2/4: -5
[ 5.359344] cros_ec_lpcs GOOG0004:00: EC response timed out
[ 5.360806] cros_ec_lpcs GOOG0004:00: Cannot identify the EC: error -5
[ 5.362360] cros_ec_lpcs GOOG0004:00: couldn't register ec_dev (-5)
[ 5.363817] cros_ec_lpcs GOOG0004:00: probe with driver cros_ec_lpcs failed with error -5
Here’s some observations:
- In all failure cases (C and D) reloading the kernel module fixes the issue
- In both failure cases (C and D) manually sending
HOST_SLEEP_EVENT_S0IX_SUSPEND via ectool --interface=lpc hostsleepstate freeze makes the suspend work correctly
- This issue was observed on kernel 6.14 and 6.17 (up to 6.17.10)
- blocking the
cros_ec_lpcs module from loading automatically and then loading it later fixes the issue
- waiting in the bootloader (
systemd-boot) for ~30 seconds before starting the kernel does not fix the issue
- adding a 10 second sleep into
cros_ec_lpc_probe fixes the issue
- kernel 6.18 (released this Sunday) fixes the issue (~30 boots that resulted in case A)
git cherry-picking all 8 commits to drivers/platform/chrome from 6.17 to 6.18 onto 6.17.10 does not fix the issue on 6.17
To me this definitely looks like some timing / race condition somewhere. It’s unclear that 6.18 does actually fix it. It might just get lucky, especially considering that adding all changes to drivers/platform/chrome onto 6.17.10 does not fix the issue.
@MrChromebox have you seen any of these issues (B, C, and D) on any of your devices? I’m seeing this on both roric and craasneto. If you haven’t could you try looking at the dmesg from ~10 or so boots?