Improving En Pyssant

How to improve the performance of a Python program

I have been working on En Pyssant for a while: a Free Software chess engine written in Python. The engine part of the program is not (yet) complete, but the rules of chess are wholly implemented. My primary goal in writing this engine is to learn new things by throwing in everything and the kitchen sink: the program may not strictly need a full testing and benchmarking suite, but I have implemented one anyway.

Though there is a benchmarking suite, I have not really prioritised performance at all during development. But when it takes 10.4 seconds to parse the following game blurb, I figured maybe it would be a good time to have a look at performance:

[Event "F/S Return Match"]
[Site "Belgrade, Serbia JUG"]
[Date "1992.11.04"]
[Round "29"]
[White "Fischer, Robert J."]
[Black "Spassky, Boris V."]
[Result "1/2-1/2"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5
7. Bb3 d6 8. c3 O-O 9. h3 Nb8 10. d4 Nbd7 11. c4 c6 12. cxb5 axb5
13. Nc3 Bb7 14. Bg5 b4 15.  Nb1 h6 16. Bh4 c5 17. dxe5 Nxe4
18. Bxe7 Qxe7 19. exd6 Qf6 20. Nbd2 Nxd6 21.  Nc4 Nxc4 22. Bxc4 Nb6
23. Ne5 Rae8 24. Bxf7+ Rxf7 25. Nxf7 Rxe1+ 26. Qxe1 Kxf7 27. Qe3
Qg5 28. Qxg5 hxg5 29. b3 Ke6 30. a3 Kd6 31. axb4 cxb4 32. Ra5 Nd5
33.  f3 Bc8 34. Kf2 Bf5 35. Ra7 g6 36. Ra6+ Kc5 37. Ke1 Nf4 38. g3
Nxh3 39. Kd2 Kb5 40. Rd6 Kc5 41. Ra6 Nf2 42. g4 Bd3 43. Re6 1/2-1/2

So let’s dive deeper. First, let’s see what hardware I’m running this on:

> inxi -c0 -v3 -z
System:    Host: carmen-desktop Kernel: 4.15.13-1-default x86_64
           bits: 64 gcc: 7.3.1
           Desktop: Gnome 3.28.0 (Gtk 3.22.29)
           Distro: openSUSE Tumbleweed 20180403
Machine:   Device: desktop Mobo: ASUSTeK model: M5A97 R2.0 v: Rev 1.xx serial: N/A
           UEFI: American Megatrends v: 0601 date: 07/17/2012
Battery    hidpp__0: charge: N/A condition: NA/NA Wh
           model: Logitech Performance MX status: Discharging
CPU:       Octa core AMD FX-8350 Eight-Core (-MCP-)
           arch: Bulldozer rev.0 cache: 16384 KB

           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm) bmips: 64217
           clock speeds: max: 4000 MHz 1: 1403 MHz 2: 1403 MHz 3: 1547 MHz
           4: 1493 MHz 5: 1396 MHz 6: 1402 MHz 7: 1477 MHz 8: 1531 MHz
Graphics:  Card: Advanced Micro Devices [AMD/ATI] Tonga PRO [Radeon R9 285/380]
           bus-ID: 01:00.0
           Display Server: x11 (X.Org 1.19.6 ) driver: amdgpu
           Resolution: 1920x1080@59.93hz
           OpenGL: renderer: AMD Radeon R9 380 Series (TONGA / DRM 3.23.0 / 4.15.13-1-default, LLVM 6.0.0)
           version: 4.5 Mesa 18.0.0 Direct Render: Yes
Network:   Card: Realtek RTL8111/8168/8411 PCIE Gigabit Ethernet Controller
           driver: r8169 v: 2.3LK-NAPI port: d000 bus-ID: 02:00.0
           IF: enp2s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:    HDD Total Size: 1750.4GB (13.0% used)
           ID-1: model: Crucial_CT750MX3
           ID-2: model: ST1000DX001
Info:      Processes: 352 Uptime: 1 day 15:24 Memory: 9945.2/15968.2MB
           Init: systemd runlevel: 5 Gcc sys: 7.3.1
           Client: Shell (bash 4.4.191) inxi: 2.3.40

An AMD FX-8350, basically. I did my development on a Core 2 Duo, though, which is quite a bit slower.

Then, let’s take a look at the profiler graph generated by running cProfile on Game.from_pgn(...) on the above game blurb, using e0630c98 as a starting point. See here for the code that generates the graph. I will be looking exclusively at the BitBoard implementation in this article.

Profiler graph before applying any optimisations

How to read the graph: every square has a function name, a cumulative percentage, a total percentage, and the number of times the function has been called. The arrows show cumulatively how much time one function spends calling another, and how often it does so. The colours help you read the graph at a glance: red means “spending a lot of (cumulative) time in this function”, and blue means “barely spending any time in this function”.

As a short addendum: “cumulative time” means how much time a function takes altogether. “Total time” (deceptively) means how much time passes while inside the function itself, excluding time spent in calls to sub-functions. Logically, then, the top function has a cumulative time of 100% but very little total time, because it does hardly any computation on its own.
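To see the two numbers side by side, here is a tiny self-contained demonstration (a toy example, nothing to do with En Pyssant): `helper` does all the work and accumulates total time, while `top` mostly waits on its callee and so shows high cumulative but low total time.

```python
import cProfile
import io
import pstats

def helper():
    # Does the actual computation; shows up with high *total* time.
    return sum(range(100_000))

def top():
    # Barely computes anything itself: high *cumulative* time (it waits
    # on helper()), but very little total time.
    return [helper() for _ in range(20)]

profiler = cProfile.Profile()
profiler.enable()
top()
profiler.disable()

stream = io.StringIO()
# In the report, 'tottime' is time inside the function itself and
# 'cumtime' includes time spent in callees.
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats()
report = stream.getvalue()
```

Sorting by `'cumulative'` puts `top` near the top of the report even though nearly all of its time is really spent inside `helper`.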

The graph seems more or less normal. One thing that sticks out is that it spends a lot of time in Enum.__hash__. That’s a bit of a pain, because a lot of my core data types (Type, Side, Direction, MoveFlag) are Enums. And I’ve grown quite attached to the Enums. They are very pretty, you can do some nifty things with them, and removing them would mean breaking backward compatibility. But let’s see what happens if I replace the Enums with regular classes, and fix all the other code to do away with Enum-specific magic.

In 45aa9ed4, I have reduced the time it takes to run the program to 6.7 seconds. Great success! It only cost us our precious Enums. But we can do better than that, probably.
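For flavour, here is a minimal before/after of such a replacement (the real En Pyssant classes are more involved). On the CPython 3.6 of this article, Enum defines __hash__ in pure Python, while a plain class inherits object’s fast C-level hash:

```python
from enum import Enum

class SideEnum(Enum):
    WHITE = 'white'
    BLACK = 'black'

class Side:
    # Plain class: inherits object.__hash__, a C-level slot. The Enum
    # version above goes through a Python-level __hash__ on every dict
    # or set operation (on the CPython 3.6 this article was written for).
    def __init__(self, value):
        self.value = value

WHITE = Side('white')
BLACK = Side('black')

# Both still work as dictionary keys and set members; the plain class
# just hashes more cheaply.
enum_scores = {SideEnum.WHITE: 1, SideEnum.BLACK: 0}
plain_scores = {WHITE: 1, BLACK: 0}
```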

If we look at the graph generated from running the code without enums, we notice that over 40% of our time is spent in Board.__getitem__, which wastes 10% just calling Board.get (because function calls are expensive in Python). Why not just call Board.get directly instead? So I go through my codebase, and I replace (almost) all instances of board[square] with board.get(square) (5279dd45). As a result, the program now takes 6.0 seconds. This is predictable, because we have shaved off the 10% overhead. But we can do better again.
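To make the overhead concrete, here is a toy stand-in (not the real En Pyssant Board) showing the delegation pattern: `__getitem__` does nothing but forward to `get`, so every `board[square]` pays for one extra Python-level function call.

```python
class Board:
    """Illustrative stand-in, not the real En Pyssant Board."""

    def __init__(self):
        self._squares = {'a1': 'R', 'e1': 'K'}

    def get(self, square):
        # The real accessor.
        return self._squares.get(square)

    def __getitem__(self, square):
        # Pure delegation: board[square] costs an extra Python-level
        # call per lookup compared to board.get(square).
        return self.get(square)

board = Board()
```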

If we look at the graph again, we notice that Move.from_san calls Move.san a lot, and Move.san performs the lion’s share of all expensive calls. Maybe we can improve this. What if we didn’t depend on Move.san?

To understand what’s going on, it’s probably a good idea to illustrate what these functions do first. Move.san turns a Move object (an object containing an origin, a destination, and some other information about a given move) into Standard Algebraic Notation (SAN). For instance, ‘Na3’ is SAN for moving a kNight from ‘b1’ to ‘a3’. Where does ‘b1’ come from? Well, that’s implied, because the only knight that could possibly move to ‘a3’ is on ‘b1’. If there were another knight that could move to ‘a3’, you would have to disambiguate the SAN by adding more coordinates (for instance, ‘N1a3’ if there is another knight on ‘c4’. Where did the ‘c’ go? Don’t ask).

As such, generating a SAN string requires you to generate all moves that can land on a certain destination square and perform a bit of voodoo magic. Generating all legal moves is expensive, but not outrageous. What is outrageous is that Move.san is called over a thousand times, even though there aren’t that many notations in the game blurb that is being parsed. What gives?

Laziness gives. Having performed the SAN voodoo once, I did not want to do it in reverse for Move.from_san: generating a Move object from a SAN string. So I created a lazy implementation. Given san and position, just create all legal moves for position, and generate the SAN of each legal move. Then compare each generated SAN string against san. If it’s a match, then obviously we have found the correct move, and we can return that. This looks a little like this (57c6f1a7):

def from_san(cls, position, ruleset, san):
    for move in ruleset.moves(position):
        move_san = move.san(position, ruleset)
        if move_san == san:
            return move

    raise ValueError('{!r} is not a valid move'.format(san))

We can do better than that… And it turns out, I did do better than that. I changed that function some time after to be more relaxed about its input. ‘Nb1a3’ and ‘Na3’ are the same move, only one of them is needlessly disambiguated. I wanted a way to also accept the verbose SAN strings, so I got to work and ended up with this:

# Strips trailing check/mate and annotation symbols, e.g. '+', '#', '!?'.
SAN_END_PATTERN = re.compile(r'[+#]?[?!]*$')
SAN_PATTERN = re.compile(
    r'([PNBRQK])?([a-h])?([1-8])?x?-?([a-h][1-8])([QRBN])?', re.IGNORECASE)

def stripped_san(san: str) -> str:
    """Remove superfluous symbols from *san*."""
    return re.sub(SAN_END_PATTERN, '', san)


def from_san(
        cls,
        position,
        ruleset,
        san: str,
        strict: bool = False) -> 'Move':
    if not strict:
        san = stripped_san(san)
    match = SAN_PATTERN.match(san)
    castling = 'O-O' in san
    if not match and not castling:
        raise ValueError('{!r} is not a valid notation'.format(san))
    if match:
        piece, file, rank, destination, promotion = match.groups()

    for move in ruleset.moves(position):
        move_san = move.san(position, ruleset)
        if not strict:
            move_san = stripped_san(move_san)
        if move_san == san:
            return move
        # pylint: disable=too-many-boolean-expressions
        if (
                strict
                or castling
                or move.destination != destination
                or (file and move.origin.file != file)
                or (rank and str(move.origin.rank) != rank)
                or (piece and move.piece.type != piece.lower())
                or (promotion
                    and move.promotion != promotion.lower())
                or (not piece and move.piece.type != Type.PAWN)):
            continue
        if not promotion and move.promotion:
            continue
        return move

    raise ValueError('{!r} is not a valid move'.format(san))

Regexes! This solution is almost perfect. The only mistake I’ve made here is that I keep generating SANs anyway, even though I have all the scaffolding in place to return a move without generating the SAN. So after a bit of fiddling (e262997c), Move.san is no longer called nearly as often. It still uses Move.san when resolving castling moves or when being strict about what notations are allowed, but sorting that out is left as an exercise for the reader.
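The fix boils down to filtering candidate moves on the parsed regex groups instead of rendering a SAN string per legal move. A simplified, self-contained sketch of that idea (a toy Move tuple and plain string squares stand in for the real classes; promotion and castling handling are omitted):

```python
import re
from typing import NamedTuple

SAN_PATTERN = re.compile(
    r'([PNBRQK])?([a-h])?([1-8])?x?-?([a-h][1-8])([QRBN])?', re.IGNORECASE)

class Move(NamedTuple):
    origin: str       # e.g. 'b1'
    destination: str  # e.g. 'a3'
    piece: str        # lower-case piece letter; 'p' for pawn

def from_san(san, legal_moves):
    # Parse once, then filter the candidates on the parsed fields
    # instead of generating a SAN string for every legal move.
    match = SAN_PATTERN.match(san)
    if not match:
        raise ValueError('{!r} is not a valid notation'.format(san))
    piece, file, rank, destination, _promotion = match.groups()
    for move in legal_moves:
        if move.destination != destination:
            continue
        if file and move.origin[0] != file:
            continue
        if rank and move.origin[1] != rank:
            continue
        if piece and move.piece != piece.lower():
            continue
        if not piece and move.piece != 'p':
            continue
        return move
    raise ValueError('{!r} is not a valid move'.format(san))

moves = [Move('b1', 'a3', 'n'), Move('e2', 'e4', 'p')]
```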

Having changed the innards of Move.from_san, the benchmark really flies. It now only takes 0.46 seconds to run the benchmark. At this point, I am basically satisfied. Well, almost satisfied. You see, I had to sacrifice my Enums for this performance gain, and I really want my Enums back. So that is basically what I did. I merged the improvements sans removing the Enums, which resulted in 0.65 seconds. Not quite as good, but workable. It still irks me that the Enums are slower, though, so I take another look at the profiler graph:

Profiler graph after changing all Board.__getitem__ calls to Board.get and fixing Move.from_san, but while keeping the Enums

What is evident from this graph is that attacked and is_check are responsible for most calls to Enum.__hash__. So it seems obvious to check there first. What am I looking for? Anything that looks like dict[Enum] or Enum in set. And sure enough, I find a line that looks like this:

to_check = {  # type: Dict[Type, Set[Type]]
    _ROOK: {_ROOK, _QUEEN, _KING},
    # ... entries for the other piece types ...
}

for type_, targets in to_check.items():

Ah-ha! A dictionary of sets, sure to be the culprit. You need a hash to put a key in a dictionary, and a hash to put a value in a set. So I do the sensible thing, turn the dictionary into a global constant (66aa89c9), and run the benchmark again. 0.62 seconds… Not quite the improvement I had hoped for, perhaps even negligible.

I do find some other variables that can be turned into global constants, but they don’t yield much, either. Not wanting to give up hope just yet, I turn to pdb. I call ‘list(Game().moves())’ in the interactive interpreter, set a breakpoint at Enum.__hash__, and spam ‘c’ and ‘w’ to get the traceback of each call to Enum.__hash__.

It turns out: Piece(type, side) shows up reasonably often. Why does it show up? This is not a hash operation. Except … it is. To improve performance and memory usage, I memoised Piece.__new__ with lru_cache, which is basically a fancy dictionary that remembers parameters as keys and return values as values. Removing the cache would send performance down the drain, but perhaps we can go hunting for places where pieces are instantiated.
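A minimal sketch of how such memoisation can work (my stand-in, not the actual En Pyssant implementation): the cache key is the argument tuple, so every construction call hashes its arguments, even on a cache hit.

```python
from functools import lru_cache

class Piece:
    """Illustrative stand-in for the memoised Piece(type, side)."""

    @staticmethod
    @lru_cache(maxsize=None)
    def _cached(type_, side):
        # Only runs on a cache miss; afterwards the same object is
        # returned for the same arguments.
        piece = object.__new__(Piece)
        piece.type = type_
        piece.side = side
        return piece

    def __new__(cls, type_, side):
        # Every Piece(type, side) call performs a cache lookup, which
        # means hashing (type_, side): cheap, but not free in a hot loop.
        return cls._cached(type_, side)
```

Because the same arguments always yield the very same object, equality checks on cached pieces degenerate into identity checks, which is part of why removing the cache would hurt.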

Sure enough, there is one in attacked, and one in is_check. They both look a bit like this:

for square, piece in position.board.all_pieces():
    if piece == Piece(_KING, side):

This is easily fixed. Just instantiate the piece outside of the loop (d66497be), and run the benchmark again. Sure enough, the timing is now improved to 0.51 seconds. Almost as fast as the benchmark without Enums! As a matter of fact, what would happen if we merged these improvements with the Enumless branch (136245f4)? We get 0.35 seconds! That’s all well and good, but I don’t want to give up my Enums just yet.
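In miniature, with tuples standing in for the real Piece and Board objects, the fix looks like this:

```python
# Stand-in for the Type member.
_KING = 'king'

def find_king(all_pieces, side):
    # Before: Piece(_KING, side) was re-evaluated (one cache lookup, one
    # hash of the arguments) on every iteration.
    # After: build the piece once, outside the loop, and compare cheaply.
    king = (_KING, side)  # stand-in for Piece(_KING, side)
    for square, piece in all_pieces:
        if piece == king:
            return square
    return None

pieces = [('e2', ('pawn', 'white')), ('e1', ('king', 'white'))]
```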

So we look at the graph again, and we notice that attacked still performs a lot of calls to Enum.__hash__, and so does Board.get.

In attacked, the offending line is Square.in_bounds(path). This function is also memoised with lru_cache and would be extremely slow without the cache. But perhaps it is possible to turn path (which is a sequence of Direction Enums) into something that is more easily hashed. And sure enough, upon turning path into a sequence of (int, int) tuples (cfe128dc), performance improves to 0.46 seconds, and the number of calls from attacked to Enum.__hash__ is greatly reduced.
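A hypothetical sketch of the tuple-based approach (the real Square.in_bounds differs; the names and zero-based coordinate convention here are my assumptions): tuples of ints hash through a fast C-level routine instead of Enum.__hash__, which makes the lru_cache lookups cheaper.

```python
from functools import lru_cache

# Directions as plain (file_delta, rank_delta) tuples instead of Enums.
NORTH = (0, 1)
NORTH_EAST = (1, 1)

@lru_cache(maxsize=None)
def in_bounds(square, path):
    """Does following *path* (a tuple of direction tuples) from *square*
    stay on the 8x8 board? Both arguments must be hashable for the cache.
    """
    file_, rank = square
    for d_file, d_rank in path:
        file_ += d_file
        rank += d_rank
        if not (0 <= file_ < 8 and 0 <= rank < 8):
            return False
    return True
```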

The last function that incurs a heavy penalty to Enum.__hash__, then, is Board.get. Or more specifically, BitBoard.get. BitBoard has a couple of integers as internal representation rather than storing the pieces. Thus when BitBoard.get is called, it creates a Piece(type, side) on-the-fly, which is memoised and therefore requires hashing, causing a slowdown. The solution, therefore, is to use a board that internally stores the pieces, avoiding unnecessary Piece.__new__ calls. ListBoard is that board.
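A toy sketch of the idea behind ListBoard (not the real class): pieces are stored directly, so get() is a plain index with no on-the-fly Piece construction, and therefore no hashing.

```python
class ListBoard:
    """Stand-in sketch: stores piece objects directly, unlike a bitboard,
    which must reconstruct a Piece from its integers on every get()."""

    def __init__(self, pieces):
        # *pieces*: 64 entries, None for empty squares.
        self._pieces = list(pieces)

    def get(self, index):
        # A plain list index; no Piece.__new__, no cache lookup, no hash.
        return self._pieces[index]

squares = [None] * 64
squares[4] = ('king', 'white')  # stand-in for a real Piece object
board = ListBoard(squares)
```

The trade-off is memory and update cost versus lookup cost; for this workload, cheap lookups win.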

Using ListBoard, our performance increases to 0.21 seconds, and the calls to Enum.__hash__ are just short of 7% of cumulative time:

Final profiler graph of ListBoard, having applied all optimisations except removing the Enums

Meanwhile, on the Enumless branch, ListBoard’s performance is 0.18 seconds. This is extremely close: close enough for me to keep my precious Enums.

edit 1 (2018-04-08): Adding a simple tweak to the way castling moves are parsed (f07478f7) revs up the performance to 0.11 seconds. That’s 100x faster than what we started with!

edit 2 (2018-04-08): Changing the interpreter to PyPy 3 instead of CPython 3.6 and allowing the interpreter to warm up its JIT results in a performance of 0.05 seconds. That is 200x faster than what we started with.