One more milestone!

After combining this USB-to-Serial converter, this USB 3.0 breakout board, and this mini-USB 3.0 cable into one thing I got the following:

/galleries/dropbox/acer_a1-serial-debug.thumbnail.jpg

Which in turn gave me the following on the serial console when I powered up the device (pastebin link):

set AMSS mode:0 (Normal)
[AMSS_LOG] ..\..\acer\l1\smem\acer_l1_smem_cmd.c - 18 : Acer smem proc comm cmd 0.
Device will now reset to get into OS!
[AMSS_LOG] ..\..\drivers\hw\src\hw.c - 681 : Acer main reset.
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 113 : HW version is DVT5.
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 132 : RF Band is EU.
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 138 : Acer AMSS Boot Check.
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 139 : magic number:A4BC385F, reason:00000000.
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 140 : AMSS SW version = A1-05.01.06, Date = Nov 24 2010, Time = 17:34:05
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 143 : Boot Check magic num 680601 87654321 6
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 144 : DDR REG CHECK 40 60 1E0 1E0
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 148 : SA Fuse value = A077
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 744 : Boot power on reason : 00000051.
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 745 : Boot reset status : 6.
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 772 : Button check reason : 00000000.
[AMSS_LOG] ..\..\..\..\..\..\..\acer\l1\boot\acer_l1_boot.c - 232 : Boot in normal mode.
[AMSS_LOG] ..\..\..\..\..\..\..\core\boot\secboot2\osbl\shared\src\osbl_aarm_boot.c - 363 : Boot aARM
on_msm_proc_comm
@@ i2c_init: HW mode
[AMSS_LOG] ..\..\services\task\mobile.c - 525 : Qualcomm SW info = 8650B-SDCAOLYM-4075
[AMSS_LOG] ..\..\services\task\mobile.c - 526 : AMSS SW version = A1-05.01.06, Date = Nov 24 2010, Time = 16:32:58
[AMSS_LOG] ..\..\services\task\mobile.c - 528 : Acer AMSS main code start -----.
[AMSS_LOG] ..\..\drivers\chg\charger.c - 1298 : Charging state machine change from 33 to 0
[AMSS_LOG] ..\..\drivers\chg\charger.c - 1298 : Charging state machine change from 0 to 32
[AMSS_LOG] ..\..\acer\l1\smem\acer_l1_smem_cmd.c - 18 : Acer smem proc comm cmd 2.
ptn 0 name='misc' start=451 len=5
ptn 1 name='recovery' start=456 len=40
ptn 2 name='boot' start=496 len=40
ptn 3 name='userdata' start=536 len=1600
ptn 4 name='system' start=2136 len=1600
ptn 5 name='cache' start=3736 len=335
ptn 6 name='kpanic' start=4071 len=16
ptn 7 name='dinfo' start=4087 len=2
ptn 8 name='setupdata' start=4089 len=3
ptn 9 name='splash1' start=4092 len=4
ptn 10 name='modem' start=24 len=156
nandid: 5500bcec maker ec device bc
nandcfg: aad40000 0004745e (initial)
nandcfg: e85408c0 0004745e (used)
load logo...
load checksum size:780
flash_read_image: block:4089~4091 success (0 errors)
flash_read_image: block:4092~4095 success (0 errors)
mddi_init(): AUO, build date Feb 18 2011-15:00:12
AVR reset#1
LCDC: HW_VER > 0.2, Power-On Seq v2!
[AMSS_LOG] ..\..\drivers\chg\charger.c - 1298 : Charging state machine change from 32 to 1
AVR reset#2
panel is 480 x 800
fail:msm_i2c_poll_notbusy,status:00002100
I2C_INTERFACE_SELECT:00000300
[AMSS_LOG] ..\..\acer\l1\smem\acer_l1_smem_cmd.c - 18 : Acer smem proc comm cmd 1.
[AMSS_LOG] ..\..\acer\l1\smem\acer_l1_smem_cmd.c - 18 : Acer smem proc comm cmd 2.
fail:I2C:Error waiting for notbusy
USB FastBoot:  V1.10.00
Machine ID:    1009000 v0
Build Date:    Feb 18 2011, 14:59:50
HW version:    V1.0
AMSS version:  A1-05.01.06
usb_init_basic
usb_chg_detect_type: AC Charger Detected!
get AMSS mode:0 (Normal)
boot_reason=00000000
@@BUILDID=Acer_LiquidE_4.008.08_EMEA-VFIT
on_msm_proc_comm
Check 512MB: true
sn[0]:00000020
sn[1]:00000045
sn[2]:00000095
sn[3]:00000018
sn[4]:00000019
sn[5]:00000005
sn[6]:00000000
sn[7]:00000000
sn[8]:00000000
sn[9]:00000000
sn[10]:00000000
sn[11]:00000000
serialno:025459819150
@@ i2c_init: GPIO mode
@@ i2c_init: HW mode
[AMSS_LOG] ..\..\acer\l1\smem\acer_l1_smem_cmd.c - 18 : Acer smem proc comm cmd 2.
boot_reason=00000000
SMEM: disable Modem UART messages(2)
check bat_cap
read battery:  6342,6819=92%%
** BOOTING LINUX FROM FLASH **
Check device is 256M or 512M
Check 512MB: true
Pass!
flash_read_image: block:496~535 success (0 errors)
flash_read_image: block:496~535 success (0 errors)
flash_read_image: block:496~535 success (0 errors)

kernel  @ 20008000 (2609080 bytes)
ramdisk @ 24000000 (177420 bytes)


LCM ID:00000013 000000a1 00000000
Check 512MB: true
cmdline = 'console=null hw_ver=6 amss_ver=A1-05.01.06 androidboot.serialno=025459819150 g_android.serial_number=025459819150 lcm_id=2 androidboot'

Booting Linux
Check 512MB: true

The RX pin should be connected to GND_D, TX should go to SSTX-, GND should go to GND on the board, and VBUS should be powered with 5V from the USB-to-serial converter if you are using the mini-USB 3.0 hack. A serial root console is available immediately if you set androidboot.console=ttyMSM2 console=ttyMSM2 on the kernel command line (no init changes needed). Serial port settings: 115200 8N1, no hardware flow control, no software flow control.

Even if this does not yield anything useful, it is a good thing to know I can solder things and they continue working after that.

I had seen all these strings in the string dumps of appsboot.mbn and amss.mbn, and it is such a relief to see that they are actually being used. It is also interesting to see that the device is using an L1 bootloader. Acer L1 is actually a beTouch 200 phone running Microsoft Windows Mobile 6.5 Professional. Sweet.

One drawback for my current setup is that I can't have both serial debug and regular USB connection (though charging works), so it's time for some more soldering!

Building Gingerbread: 1

Roman Yepishev

2013-06-27 18:44

W/dalvikvm( 1729): Exception Ljava/util/regex/PatternSyntaxException; thrown while initializing Landroid/graphics/Rect;
I/dalvikvm( 1729): Rejecting re-init on previously-failed class Landroid/graphics/Rect; v=0x0
E/dalvikvm( 1729): Class lookup Ljava/lang/NoClassDefFoundError; attempted while exception Ljava/lang/ExceptionInInitializerError; pending

This exception is misleading and frustrating. It means that the ICU library failed to load its data from /system/usr/lib/icu/icudt44l.dat. Put the proper file there to fix this.

froyo_almond on Acer Liquid E

Roman Yepishev

2013-05-29 13:41

Great success!

No, really, bringing a newer kernel to a device abandoned by the vendor was (and still is) an interesting experience. The result is here: acer_a1_froyo_almond_kernel. It requires the updated liblights for the LCD backlight and keypad LEDs to work.

So here are the interesting things:

RAM console

Before you start debugging the device, enable the RAM console. This is a region of device memory where the console output goes. You may not have a UART console, but the RAM console is something you will definitely want. After a kernel panic or a spontaneous reboot, /proc/last_kmsg will hold the contents of the last session. Since the device memory is not reinitialized on startup, you will find the memory the way it was immediately before the reboot. Of course, if the RAM is powered down there is some corruption, but during my tests that was not the case. Eventually I disabled the kpanic support because a) the Acer changes could not be easily updated to match the newer kernel and b) the RAM console provided everything I needed.

This functionality is now present in the mainline kernel as the pstore driver, but it appears that Android started using it only around the 3.8 kernel.

Kernel patches

It looked like an easy task: find the revision the kernel was branched from, apply the Acer changes, and continue merging the Codeaurora changes.

Not so fast.

Not only was it nearly impossible to figure out when the branch was made, but the cherry-picks of 2.6.32 features brought into 2.6.29 did not make it any easier.

Since that was not crazy enough, Acer decided not to guard the changes they introduced with proper #ifdefs, so hunting for the useful changes became even harder. Eventually I managed to get the Codeaurora eclair_chocolate kernel to boot with the Acer patches, refactor some of them and bring them over to froyo_almond, but that was not a pleasant experience.

unlock_john

The Qualcomm MSM kernel allows one to talk to the AMSS via the diag interface. By default, however, it is guarded by a strong password: until you enter that secret password, the diag interface is locked.

The password for unlocking is "acer.llxdiafkZidf#$i1234(@01xdiP" and the password to lock the device is "acer.dfzse,eizdfXD3#($%)@dxiexAA". These can be discovered by reading the diagfwd.c module in the released kernel; the firmware update application uses them before communicating with AMSS.

These passwords are long, so Acer developers found a better way to unlock the device. If you are running an Acer kernel, echo unlock_john > /proc/debug_smem will unlock the diag interface for you. Reading that file tells you the boot mode from the shared memory allocated by Acer in AMSS space.

The diag interface is fun if you have the information about what methods you can call, but all this info is proprietary.

Who is that John?

Userspace woes

Updating the kernel is half the problem; the userspace libraries may be an even bigger one.

In order to release Android 2.2 on the Liquid E, Acer did nothing better than bring over ONLY the changes that are critical for Android 2.2 and apply them onto the 2.1 libraries and the 2.1 kernel (2.6.29). So when the kernel gets updated, the userspace breaks:

kgsl

KGSL is Kernel Graphics Support Layer and it provides the access to the hardware through the proprietary drivers. At some point (commit ebb6af2d125dca230c6ca539f58a63d0c077dc25) the 3D driver was split into 2 devices. Since Acer used old libraries, they are unaware of these changes. As these libraries are proprietary, it is not possible to bring in the newer userspace, so all kgsl commits had to be reverted. Without hardware acceleration support the screen went completely crazy, flickering and alternating between different framebuffer contents.

qaudio6

Update: Added a workaround to the kernel so notifications are played properly both from the headset and the phone speaker. It looks like the libaudio library is neither old nor sufficiently new. In any case a replacement is ready to be tested with bluetooth devices too (wired/speaker is ok).

Framebuffer format

If your screenshots appear red like the one pictured here, you should check that you have the proper framebuffer depth. This is what an RGB565 framebuffer looks like when it is read as RGBA8888:

/galleries/dropbox/device-2013-05-29-115731-180x300.png

Some other non-Acer devices had the issue too originally. As of 2013 it looks like no vendor releases a device with broken framebuffer setting.
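
To make the mismatch concrete, here is a minimal Python sketch (not code from the device or from Android) that decodes a raw buffer as RGB565 and, for comparison, slices the same bytes as if they were RGBA8888. The function names and the little-endian byte order are assumptions made for the illustration.

```python
import struct


def rgb565_to_rgb888(pixel):
    """Expand one 16-bit RGB565 pixel into an (r, g, b) tuple of 8-bit values."""
    r5 = (pixel >> 11) & 0x1F
    g6 = (pixel >> 5) & 0x3F
    b5 = pixel & 0x1F
    # Copy the high bits into the low bits so that full scale maps to 255.
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))


def decode_rgb565(raw):
    """Decode a buffer correctly: two bytes per pixel (little-endian assumed)."""
    count = len(raw) // 2
    pixels = struct.unpack("<%dH" % count, raw[:count * 2])
    return [rgb565_to_rgb888(p) for p in pixels]


def misread_as_rgba8888(raw):
    """Slice the same bytes four at a time, as a tool expecting RGBA8888 would."""
    count = len(raw) // 4
    return [tuple(raw[i * 4:i * 4 + 4]) for i in range(count)]
```

The misread buffer yields half as many pixels, and each of its channel values is really a fragment of two neighbouring 16-bit pixels, which is why the colors come out wrong.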

FM2018

The legends say that original Acer A1 had 2 microphones so that it would be possible to get rid of background noise to achieve crystal-clear voice quality. Nexus One definitely had 2 microphones, but Acer decided to remove the hardware at some point.

Nowadays the kernel module is present, it is even being initialized but quits early after finding out that the hardware revision it is running at is newer than expected. I am fairly sure that there is no customer version of Acer Liquid E with dual microphones so that fm2018 module is completely useless.

Update: I found the pictures of the second microphone in early hw version in the service manual:

/galleries/dropbox/noise-reduction-mic.jpg

Wifi

Update: Most things are fixed. The key was to prevent wifi_carddetect from being disabled when the driver was switching the power on and off.

Summary

I am now using 2.6.32 as my main kernel, and it is actually working. It's not yet 2.6.35 and Android 2.3 is still far away with all the userspace issues but it is worth it. I learned how to use Git the hard way. I had over 500 different kernel builds and I will never buy a device made by Acer again.

Fixing USB disconnect on tapping in Lenovo E420

Roman Yepishev

2013-05-19 18:59

Update: It looks like this is a manufacturing fault since the issue re-appeared after a month.

A USB device connected to the right (yellow) USB port got disconnected whenever I typed something. Upon closer examination I found that a fairly light tap on the bottom right corner of the laptop would immediately disconnect and reconnect the USB device.

I decided to check what was going on there and took the laptop apart. It turns out that the ExpressCard adapter and the USB port are on a daughterboard which is connected to the motherboard, and that it is possible for it to have a less than ideal connection. I cleaned the contacts, reassembled the laptop and it does not disconnect the USB device any more.

You have to be extremely careful when disassembling the device: I broke all three tabs that held the top cover on the right (then superglued them back).

You will want to detach the right and bottom side first, then gently disconnect the left one. The tabs on the left side hold the bottom assembly from the inside and they can be easily broken.

My Development Board

Roman Yepishev

2013-05-15 11:33

CPU	S1 Scorpion ARMv7
Chipset	Qualcomm 8250 SoC
LCD Module & Touchscreen	AUO H353VL10
Accelerometer	Bosch SMB380
Ambient Light and Proximity Sensor	Intersil ISL29018
Compass	Yamaha MS-3C
Amplifier	Texas Instruments TPA2018
Power Management IC	Texas Instruments TPS65023
WiFi/Bluetooth/FM Radio module	Broadcom 4325
Camera	Aptina MT9P012
LED driver	Texas Instruments TCA6507
Input	Atmel AVR-driven 4 keys capacitive keypad over I2C, GPIO keyboard

That's my Acer Liquid E phone, which I am trying to use in more ways than was originally intended. Since I had never done anything major in the kernel aside from a couple of simple patches for spca and the alps touchpad module, I decided to get familiar with the way things are done in the kernel by updating the Acer Liquid E kernel from 2.6.29 + Acer patches + cherry-picks from Codeaurora to 2.6.35 + Codeaurora patches.

Before that I decompiled Acer's EUU with Red Gate Reflector and wrote a .bin file extractor (no longer used on newer devices) and a simple test application to talk to the device via the DIAG interface, which is present on all Qualcomm MSM devices but may be disabled by the OEM. It does not look like these two have brought me closer to the goal, but it was a nice re-introduction to C programming, which I had not worked with for quite a while.

I am still working on cleaning up the 2.6.29 kernel and the major changes already implemented are:

AVR MFD Device

The AVR is a programmable microcontroller by Atmel; the exact model is not known (I was not able to obtain high-resolution photos of the Acer Liquid (A1) board). The original Acer driver had everything in one place and was driven by ioctl(). It took me a long time to understand how MFD devices work, and the result is in the avr_mfd branch. There is also a repository for hardware/msm7k that implements the required userspace changes and is compatible with the Acer Android framework.

LED Driver, TI TCA6507

This seemed to be easy: a couple of cherry-picks for the LED class and I am done. Well, the kernel and liblights are complete, but vold changes are now required to fix the permissions; see my previous post for quick info on that.

The repository holds some other changes too. I am still learning, and the Linux kernel looks really beautiful on the inside.

Android timer-triggered LED

Roman Yepishev

2013-05-15 00:29

Enabling LED timer trigger support in Android requires:

An LED class driver that implements the blinking.
A kernel patch that sends uevents on sysfs changes.
A vold patch that changes the ownership of the delay_on/delay_off files upon receiving that uevent.
A liblights that sets up the “timer” trigger every time blinking is required (setting the brightness to 0 disables the trigger).
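
As a rough illustration of the last step, the sketch below does from Python what liblights would do in C: select the “timer” trigger, then write the on and off periods. It is not the liblights code; the function names are made up for the example, and the LED directory (something like /sys/class/leds/<name>) is passed in by the caller. On a real device delay_on and delay_off only appear once the timer trigger is selected, which is why their ownership has to be fixed afterwards.

```python
import os


def _write(led_dir, attribute, value):
    """Write one value to an LED class attribute file."""
    with open(os.path.join(led_dir, attribute), "w") as handle:
        handle.write(str(value))


def start_blinking(led_dir, on_ms, off_ms):
    """Select the timer trigger, then set the on/off periods in milliseconds."""
    _write(led_dir, "trigger", "timer")
    _write(led_dir, "delay_on", on_ms)
    _write(led_dir, "delay_off", off_ms)


def stop_blinking(led_dir):
    """Setting the brightness to 0 turns the LED off and disables the trigger."""
    _write(led_dir, "brightness", 0)
```

Since the trigger has to be set up again every time blinking is required, a caller would invoke start_blinking() for each new notification rather than once at startup.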

This was discussed here – http://comments.gmane.org/gmane.linux.kernel/1331117, but the patches have not entered the mainline kernel.

It turns out that’s not as straightforward as I thought.

How I Failed a Google SRE Interview

Roman Yepishev

2013-04-23 14:25

Yesterday I got the final answer from the hiring committee: I am not going to be an SRE (Site Reliability Engineer) at Google.

It all started 3 months ago when I got the first screening interview. I answered some basic questions (only 50% of the answers were correct, though) and was notified that I would get a second phone interview with a real SRE.

During this interview I enjoyed the troubleshooting part the most. It was basically a role-playing game: you ask the interviewer questions about the system or name the commands you would run on the machine, and the interviewer tells you what you see on the screen. It took me mere seconds to figure out which machine was failing, but then I started going back and forth between the different components. In the end I did manage to find the actual failure. The coding exercise was fun too. I learned that I had forgotten everything about hashes (dict.items()? dict.entries()? It is actually dict.items()) but managed to write working code on the first try in the shared Google document. There were also some basic UNIX questions, like how to delete a file whose name starts with a dash, as well as basic networking, such as how a switch works.

After a week or so I was notified that they would like me to come to Zurich, Switzerland for an on-site interview. I had 2 weeks to read the recommended books, play with various system and network configurations, get the visa and come to Switzerland (the flights and the hotel were paid for by Google). I had to leave home quite early since we had just had a huge snow storm, so I was quite sleepy and did not walk around much.

The Engimatt Hotel was within walking distance of the Google Zurich office, which looked quite... let's say "cost-effective".

/galleries/dropbox/GIMG_20130325_1534143.thumbnail.jpg

I walked around the location to get familiar with the surroundings, found the city to be quite expensive, returned to the hotel, and shut down.

On the 26th of March I woke up, went to the hotel restaurant for breakfast, where I was the only one not wearing a suit (the recruiters had explicitly told me to stay away from the suit and tie), and went to the Google office.

The inside of the office looked extremely different: it was joyful, interesting and well thought out. I passed through the cafeteria, received my bottle of Google Water and was led to the conference room. Each interview lasts 45 minutes, and there are 5 interviews and a lunch break.

The interviews focused on the following:

Linux internals.
System administration.
Python coding skills.
Troubleshooting.
Large scale systems design.

Linux internals

You will need to know how malloc() works and how the memory allocator is implemented in, say, glibc, as well as how processes are started and how they pass data between themselves at the low level. I believe I got something around 3/5 there, because I had never gone that far into the kernel for memory allocation and never had to find out why strace output is full of brk() and mmap() calls.

System administration

Having just finished building my little system for the kernel bisection with PXE I had the answer even before the interviewer finished the question and the overall experience was fun. I played with puppet and rsync in the past, so the questions about synchronizing machine data and configuration were simple. After all the prepared questions were answered we had a chat about ZFS and how awesome it is. I believe this is the only interview where I scored 5/5.

Python coding skills

I started pretty well by explaining what's wrong with def foo(data=[]): ... but I failed miserably at implementing a simple calculator performing only additions and multiplications. For some reason I forgot all the Python I knew ("elseif?", definitely not "else if?"... it is "elif", yes) and for the remaining 30 minutes I struggled to come up with something I believed to be working. After spending 3 years primarily on little one-time scripts and support issues, I had become really rusty and uncertain about programming something without running the intermediate solution to see whether it works. I feel that I got 2/5 there, and when I returned home I had a beautiful solution written in 5 minutes. The critical point is that you need to find out how you would do this task yourself; don't immediately try to think about how the computer should be doing it.
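
For readers who have not met these two questions, here is a small Python sketch. It is not the solution I wrote at the interview or at home; the function names are invented for the illustration, and the calculator assumes the input contains only non-negative integers, '+' and '*' (no parentheses).

```python
def append_shared(item, data=[]):
    """Buggy on purpose: the default list is created once, at definition time,
    so every call that omits `data` appends to the same list."""
    data.append(item)
    return data


def append_fresh(item, data=None):
    """The usual fix: use None as a sentinel and build a new list per call."""
    if data is None:
        data = []
    data.append(item)
    return data


def calculate(expression):
    """Evaluate integers joined by '+' and '*', e.g. "2+3*4"."""
    total = 0
    for term in expression.split("+"):    # '+' has the lowest precedence
        product = 1
        for factor in term.split("*"):    # '*' binds tighter than '+'
            product *= int(factor)        # int() tolerates surrounding spaces
        total += product
    return total
```

The calculator follows the way a person would do it by hand: split the expression into the terms of the sum, multiply out each term, then add the products.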

Troubleshooting

Oh, this was fun. I managed to get through 2 tasks (I don't know how many more the interviewer had prepared), but they were great. I had never dealt with as many failures as I got at the interview while trying to do a simple "umount /usr"; however, I spent more time than I should have on finding the reason for a terminal hanging during an SSH connection. Locales, environment, libraries, networking MTU... everything had to be considered. I think I got 4/5 there, but the score might have been lower.

Large scale systems design

I've never designed anything like this before, and while I had read a lot about it, I haven't had the hands-on experience. The task was humble: design the logging system for the whole of Google. I had a difficult time coming up with a design and the technologies to use, and it was a completely failed interview: 1/5 or even 0/5.

Afterwards

I was escorted out of the building, went to the hotel, had dinner there, prepared for the upcoming flight and shut down. It was a pretty tiring experience but it was well worth it.

Usually it takes one or two weeks for the hiring committee to come up with a decision. This year, however, the H1B US visa cap was reached in a mere 5 days, and while the original US hiring committee decision was negative, the recruiters decided to forward the interview results to the Europe-based committee too. The overall review process took almost 4 weeks and yesterday I finally received an answer: "No". The committee noted that I have potential, though.

I feared I would face the people who would be all "I work for Google, you are nobody". Instead, everybody remained positive even when I was obviously failing. The office internals are interesting. You do want to spend the day inside. I sometimes visit my wife at Global Logic Kiev and now I know where they tried to copy the environment from. One of the interviewers brought the Chromebook Pixel to take the notes (and guess what, I did not ask to look at the screen, double fail!), Lenovo Thinkpads and Apple MacBooks were spotted as well. The food is free for all the employees and there are 4 or 5 restaurants which prevent the over-crowding during the lunch hours, there is a gym and a workshop where you can build something (as in, the real, physical workshop with the power tools etc.), there is a room full of trees and plants and you definitely feel that there is more oxygen there. Every room has a natural light source, i.e. a window while all the stairs are located in the center of the building. The office felt really bright and light.

I saw a Hipster Dog image pinned to the corkboard.

/galleries/dropbox/i-was-on-google-plus-before-hipster-dog-meme.thumbnail.jpg

It's definitely a place one would like working at.

Recommended books

You would want to read UNIX and Linux System Administration Handbook for system administration. It took me a week to get through all the linux things and it structured my administration knowledge. It is fun and easy to read.

As per recommended reading by Google itself:

TCP/IP Illustrated, Volume 3 by W. Richard Stevens, but you don’t really need to dive into the details of TCP for Transactions, which turned out to have security flaws. An understanding of IP, TCP/IP and UDP/IP is a must.
The Design and Implementation of the 4.4BSD Operating System by Marshall Kirk McKusick
The UNIX Programming Environment by Brian Kernighan and Rob Pike
The Practice of System and Network Administration by Tom Limoncelli, Christina Hogan and Strata R Chalup

Bisecting Kernel Using PXE Boot

Roman Yepishev

2013-03-19 15:01

Recently I found that an Atheros AR5BXB64 wireless card in my Acer Aspire One AOA150 netbook stopped working properly in 13.04 with 3.8 kernel. The problem was that any file download over the wireless interface had a high chance of being corrupted. I was not able to download updates - every time something big (like a kernel upgrade) got downloaded, I got the file, but the checksum did not match. The card was able to inject copies of previous buffers into the stream in a way that still passed IP and TCP checksum verification.

This was exactly the same symptom we had when Active State Power Management (ASPM) L0s was enabled for that PCI-E device, when the hardware itself was not actually able to handle this state properly. It was fixed in 2009 but suddenly got broken in 3.8 again.

I filed a bug (LP:1137513) and, since I had a workaround, I just continued using the netbook with a setpci call in rc.local until I was asked whether any prior Raring kernels exhibited this behavior.

I booted 12.10 from a Flash drive and found that the card was working properly there. Then I checked 3.7 kernels and found that the issue did not exist there and appeared only in 3.8.0-rc1, which suggested there was indeed a regression in the kernel. And the difference was that 3.7 and earlier kernels had LinkCtl: ASPM Disabled while 3.8+ had LinkCtl: ASPM L0s L1 Enabled in the lspci -vv output.

Since I was not able to find any relevant commits that looked like they were deliberately doing something to ASPM handling, the only way to proceed was to run a binary search. Since it was possible to check whether the bug was there or not just by checking lspci, it could and should have been automated.

Bisect works by splitting the commits between known good and known bad revision in half with every iteration and testing the commit in between. Rinse and repeat until you hit a commit that made the switch from good to bad and that's where the things broke.

I had to test 5840 commits, with roughly 13 steps (2¹³=8192, 2¹²=4096). 13 manual reboots are 12 reboots too many so I went on to set up an environment that would do everything for me. Additionally I did not want to build the kernels on that netbook itself as it is pretty slow. The build host and the netbook should have been separated if I wanted to finish the task in a reasonable amount of time.
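
The step count can be checked with a few lines of Python. This is only a simulation of the halving described above, not what git does internally; bisect_first_bad and the is_bad callback are names made up for the example, and the position of the breaking commit is an arbitrary assumption.

```python
def bisect_first_bad(n_commits, is_bad):
    """Find the first bad commit among indices 0..n_commits-1.

    Index 0 must be known good and index n_commits-1 known bad.
    Returns (index of the first bad commit, number of commits tested).
    """
    good, bad = 0, n_commits - 1
    tests = 0
    while bad - good > 1:
        middle = (good + bad) // 2
        tests += 1
        if is_bad(middle):
            bad = middle      # the breakage is at or before `middle`
        else:
            good = middle     # the breakage is after `middle`
    return bad, tests


# 5840 untested commits between one known-good and one known-bad revision.
# The breaking commit (index 4000) is an arbitrary choice for the demo.
first_bad, tests = bisect_first_bad(5840 + 2, lambda index: index >= 4000)
```

Whichever commit is the culprit, a range of this size takes 12 or 13 tests, which matches the estimate above.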

So, here's the plan:

Set up a PXE environment.
Find the minimal kernel config that exhibits the issue.
Create a custom initrd containing all the tools needed to detect whether an issue is there.
Create a script that can be reliably used by git bisect to make it all run automatically.
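
The check at the heart of the last step could look roughly like the Python sketch below. It is not the script I actually used: classify_lspci is a hypothetical helper, and it assumes it is given the lspci -vv output for the wireless card only. The return values follow the git bisect run convention: exit code 0 marks a commit as good, 1 as bad, and 125 tells git to skip a commit that cannot be tested.

```python
import re

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by "git bisect run"


def classify_lspci(output):
    """Map the ASPM state reported in a LinkCtl line to a bisect exit code."""
    match = re.search(r"LinkCtl:\s*ASPM\s+([^;\n]*)", output)
    if match is None:
        return SKIP            # no LinkCtl line: the kernel could not be tested
    state = match.group(1).strip()
    if state.startswith("Disabled"):
        return GOOD            # 3.7 behaviour: ASPM Disabled
    if "Enabled" in state:
        return BAD             # 3.8 behaviour: ASPM L0s L1 Enabled
    return SKIP
```

A wrapper on the build host would boot the freshly built kernel on the netbook, collect the lspci output, and exit with the value returned by classify_lspci().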

Poor man's Linux bridge port security

Roman Yepishev

2013-03-14 17:17

A bridge is an L2 device that brings two Ethernet segments together. A segment is a collision domain. Since we all use switches now, collision domains are restricted to single ports, so a segment in our modern world is just the path between your computer and the switch. When there are more than 2 segments, such a bridge is called a “switch”.

In the 7-layer OSI model, L2 is the Data Link layer. It sits between the Physical layer (L1), which actually carries the bits back and forth, and the Network layer (L3), which is what routers are busy working with. A bridge is what you usually have virtual machines connected to, and it is what allows your wireless-enabled router to connect the wireless and wired interfaces together and have a single network configuration.

If you need to restrict access to a certain port on a standard Linux bridge (not an OpenvSwitch one), then you will have to add the rules to PREROUTING chain on nat table in ebtables.

You may want to group rules by port and not restrict the access to the bridge from all other ports the way libvirt is doing that:

ebtables -t nat -N port-eth0-1
ebtables -t nat -A port-eth0-1 -s  52:54:00:3f:e0:8c -j RETURN
ebtables -t nat -A port-eth0-1 -j DROP
ebtables -t nat -A PREROUTING -i eth0.1 -j port-eth0-1

Why is that even needed?

I have a br-lan bridge that has the eth0.1 (LAN) and wlan0 (Wi-Fi) interfaces connected, and I found that I can easily reboot my OpenWRT-based TP-Link 1043ND router with the macof tool from the dsniff package, because the Linux bridge module does not limit the number of MAC addresses it learns. It took me quite a while to find out why this happens, because OpenWRT's busybox brctl applet lacks a vital command, brctl showmacs $bridge-interface. When I installed a proper brctl from the repository and ran brctl showmacs br-lan after a second of macof, I was impressed.

root@gw:~# brctl showmacs br-lan | wc -l
25821

By default the MAC addresses are set to expire after 5 minutes (300 seconds). After 5 minutes any hosts that haven't contacted the bridge during this time will be purged from the MAC address cache. macof can generate tens of thousands of Ethernet frames per second consuming the router's RAM even before the aging time lets any entries expire.
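
The effect can be modelled in a few lines of Python. This is a toy model of a learning table with ageing, not the kernel's bridge code; the MacTable class, the spoofed_mac helper and the flood rate are all assumptions for the illustration.

```python
class MacTable:
    """Toy learning table: remembers every source MAC until it ages out."""

    def __init__(self, ageing_time=300.0):
        self.ageing_time = ageing_time
        self.entries = {}                  # MAC address -> (port, last seen)

    def learn(self, mac, port, now):
        # There is no upper bound here, which is exactly the problem.
        self.entries[mac] = (port, now)

    def expire(self, now):
        stale = [mac for mac, (_, seen) in self.entries.items()
                 if now - seen >= self.ageing_time]
        for mac in stale:
            del self.entries[mac]

    def lookup(self, mac, now):
        """Return the port for a MAC, or None if the frame must be flooded."""
        self.expire(now)
        entry = self.entries.get(mac)
        return entry[0] if entry else None


def spoofed_mac(index):
    """Build a distinct locally administered MAC address for each index."""
    return "02:00:%02x:%02x:%02x:%02x" % (
        (index >> 24) & 0xFF, (index >> 16) & 0xFF,
        (index >> 8) & 0xFF, index & 0xFF)
```

Feeding such a table 25000 spoofed addresses within one simulated second leaves all 25000 entries in place until the 300 seconds have passed.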

You cannot set the number of MAC addresses allowed on a port, as you can with IOS switchport port-security maximum value, so this is not as versatile as dedicated managed switches can be. However, it is enough to make sure that an untrusted machine connected to a Linux-based router won't be able to crash it just by sending a ton of spoofed Ethernet frames.

Contrary to the documentation, brctl setageing $bridge_interface 0 does not make the existing entries permanent. It drops all the learned non-local addresses and stops learning new ones, effectively transforming the switch into a hub that floods all the interfaces when it receives a frame on one of its ports.

In case of libvirt machines, this can be accomplished by adding a no-mac-spoofing filter to the guest domain definition under network interface:

<interface type='network'>
    <mac address='52:54:00:82:e4:6c'/>
    <source network='vm100'/>
    <model type='virtio'/>
    <filterref filter='no-mac-spoofing'/>
    <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>

I found that ebtables was removed from OpenWRT at some point due to performance issues. My recent measurements, however, showed almost no difference at all: I was getting 570Mb/s with and without the module when connected via the wire. While the 1043ND is a gigabit router, it looks like 1 gigabit is shared among all the ports, so each one got roughly half of what the backplane is capable of during my test.

It does not look like there is built-in support for manipulating ebtables using uci, so for now I have a simple script that sets up the restrictions:

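#!/bin/sh
set -x

# Accept everything while the rules are being rebuilt.
ebtables -t nat -P PREROUTING ACCEPT

# Flush the old rules. On the first run the per-port chain does not exist yet,
# so the commands touching it report an error, which is harmless.
ebtables -t nat -F port-eth0-1
ebtables -t nat -F PREROUTING

# The chain can be deleted only after PREROUTING no longer references it.
ebtables -t nat -X port-eth0-1

ebtables -t nat -N port-eth0-1

# My entries in /etc/ethers have the following format:
# de:ad:be:ef:00:12  eth.servername
# 01:02:03:04:05:06  wlan0.laptopname

# Allow only the MAC addresses of the known wired hosts.
for X in $(awk '/eth[0-9]?\./ { print $1 }' /etc/ethers); do
    ebtables -t nat -A port-eth0-1 -s "$X" -j RETURN
done

ebtables -t nat -A PREROUTING -i eth0.1 -j port-eth0-1

# Anything that did not match an allowed address is dropped.
ebtables -t nat -P port-eth0-1 DROP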
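
The selection step of the script can be tried without root or a router using the Python sketch below. It is a re-statement for illustration, not part of my setup: allowed_macs and ebtables_rules are hypothetical helpers, and unlike the awk pattern, which matches anywhere in the line, the sketch only looks at the host name column.

```python
import re


def allowed_macs(ethers_text):
    """Return the MAC addresses of entries whose host name marks a wired host."""
    macs = []
    for line in ethers_text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop comments, split columns
        if len(fields) >= 2 and re.search(r"eth[0-9]?\.", fields[1]):
            macs.append(fields[0].lower())
    return macs


def ebtables_rules(macs, chain="port-eth0-1"):
    """Build one argument list per allowed address, ready for subprocess.run()."""
    return [["ebtables", "-t", "nat", "-A", chain, "-s", mac, "-j", "RETURN"]
            for mac in macs]
```

Building the argument lists first makes it easy to print and review the rules before anything is applied to the bridge.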

MySQL SSL Checklist

Roman Yepishev

2013-02-27 10:43

Verify that the modulus of the key is the same as the certificate one.

These two hashes should be the same. In case they are not, then the certificate does not match the key. MySQL server will have have_ssl set to DISABLED in this case.

$ openssl x509 -noout -modulus -in server.crt | md5sum
$ openssl rsa -noout -modulus -in server.key | md5sum

Verify that the CA certificate is correct and that the certificate chain is in proper order.

For example, StartSSL Class 1 certificates should have sub.class1.server.ca.pem first and then the ca.pem root CA certificate itself. However, I found out that specifying the Class 1 intermediate certificate is enough. Failure to set the CA certificates right would lead to the following message on the client:

ERROR 2026 (HY000): SSL connection error: ASN: bad other signature confirmation

After connection, check that it is actually done over SSL:

mysql> status
...
SSL:            Cipher in use is DHE-RSA-AES256-SHA
...

Use REQUIRE keyword to specify the SSL connection checks.

SSL simply requires that the connection uses SSL. X509 requires the client certificate to be verifiable against the server CA certificate. Additionally you can use ISSUER, SUBJECT and CIPHER. However, due to a bug in MySQL that requires a specific order of RDNs, and an Ubuntu MySQL upstart configuration bug that breaks logging to syslog, this may take a while to figure out; see the bug report for a workaround. The following works with my StartSSL certificate:

GRANT ALL on *.* TO 'user'@'my-host' REQUIRE
    SUBJECT '/C=UA/CN=my-clienthost.lappyfamily.net/emailAddress=my-email@example.net'
    AND ISSUER '/C=IL/O=StartCom Ltd./OU=Secure Digital Certificate Signing/CN=StartCom Class 1 Primary Intermediate Server CA';

Configure firewall to limit access to MySQL server to a set of known addresses.

Remember that mysql clients don't verify the server certificate host name. If you are using REQUIRE X509 only, this is also one of the cases when you may want to create your own CA, to prevent other clients of the same CA from getting through the certificate check phase just because their certificates happen to be issued by the same authority.
