Python fundamentals

How it’s made: Python

Author

Karsten Naert

Published

November 15, 2025

Introduction

Python is, strictly speaking, a language specification. The most common implementation is CPython, which ships as the familiar python.exe executable on Windows.

When you run python my_program.py, CPython processes your code through four distinct stages:

  1. Tokenization - Breaking source code into meaningful chunks
  2. AST (Abstract Syntax Tree) - Organizing tokens into a logical structure
  3. Compilation - Converting the AST into bytecode
  4. Execution - Running the bytecode

What makes Python interesting is that all of these stages are accessible to you through built-in modules. We’ll explore each stage hands-on.

Version matters

The internals we discuss are specific to CPython 3.13/3.14. Other Python implementations (PyPy, GraalPy) work differently under the hood. Even between CPython versions, bytecode and internal representations can change significantly.
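
If you want to guard experiments against the wrong interpreter, here's a minimal check:

import sys

print(sys.implementation.name)   # 'cpython' on the reference implementation
print(sys.version_info[:2])      # e.g. (3, 13)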

Tokenization

The first step is breaking your source code into tokens - sequences of characters that have meaning in Python. Think of it like breaking a sentence into words, except for code.

import tokenize
from io import StringIO

text = "print('Hello World')"
s = StringIO(text)

for tok in tokenize.generate_tokens(s.readline):
    print(tok.string)
print
(
'Hello World'
)

Our simple print statement consists of 6 tokens; the last two (NEWLINE and ENDMARKER) have empty strings, which is why they showed up as blank lines above. Let's make the boundaries visible:

text = "print('Hello World')"
s = StringIO(text)
'|'.join(tok.string for tok in tokenize.generate_tokens(s.readline))
"print|(|'Hello World'|)||"

Notice that 'Hello World' (including quotes) is treated as a single token. Each token also has a type:

text = "abc  +  xyz"
s = StringIO(text)

for tok in tokenize.generate_tokens(s.readline):
    print(f"{tok.type:2d} {tokenize.tok_name[tok.type]:10s} {tok.string!r}")
 1 NAME       'abc'
55 OP         '+'
 1 NAME       'xyz'
 4 NEWLINE    ''
 0 ENDMARKER  ''

Here’s a more complex example showing how Python handles function definitions:

text = """
def f(x):
    return 2 * x
"""
s = StringIO(text)

for tok in tokenize.generate_tokens(s.readline):
    print(f"{tokenize.tok_name[tok.type]:10s} {tok.string!r}")
NL         '\n'
NAME       'def'
NAME       'f'
OP         '('
NAME       'x'
OP         ')'
OP         ':'
NEWLINE    '\n'
INDENT     '    '
NAME       'return'
NUMBER     '2'
OP         '*'
NAME       'x'
NEWLINE    '\n'
DEDENT     ''
ENDMARKER  ''

Notice the INDENT and DEDENT tokens - Python’s whitespace sensitivity is baked in at the tokenization level.

Try tokenizing code with syntax errors:

text = "print('unclosed string"
s = StringIO(text)
list(tokenize.generate_tokens(s.readline))

What happens? On recent CPython versions, the tokenizer raises an error for the unterminated string. But it only catches lexical errors - nonsense like def def def tokenizes without complaint, because the tokenizer doesn't understand grammar or semantics yet.
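
A minimal sketch of handling this (the exact exception and message depend on the CPython version; on recent versions an unterminated string raises tokenize.TokenError):

import tokenize
from io import StringIO

text = "print('unclosed string"
try:
    list(tokenize.generate_tokens(StringIO(text).readline))
except tokenize.TokenError as exc:
    print("tokenizer gave up:", exc)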

Practical uses

Tokenization is useful beyond Python’s internal workings:

  • Code analysis: Find all variable names, detect naming conventions
  • Syntax highlighting: Colorize code in editors
  • Code formatting tools: Tools like Black use tokenization to understand code structure

Quick tool: Variable name finder

def find_names(code):
    """Find all identifiers in Python code (keywords excluded)"""
    import keyword  # keywords like 'def' arrive as NAME tokens, so filter them out
    s = StringIO(code)
    names = {tok.string for tok in tokenize.generate_tokens(s.readline)
             if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)}
    return names

code = """
def calculate(x, y):
    result = x + y
    return result
"""

print(find_names(code))
{'y', 'result', 'x', 'calculate'}

Aside: Tokenization in LLMs

Large Language Models also use “tokens”, but they’re different. Python tokenization splits code based on syntax rules. LLM tokenization (like GPT’s) splits text into subword units based on frequency. Similar name, completely different purpose.

The Abstract Syntax Tree (AST)

Tokens tell us what pieces we have. The AST tells us how they fit together.

import ast

code = "3 + 5"
tree = ast.parse(code)
print(ast.dump(tree, indent='  '))
Module(
  body=[
    Expr(
      value=BinOp(
        left=Constant(value=3),
        op=Add(),
        right=Constant(value=5)))])

That’s a lot of structure for 3 + 5! Let’s break it down:

  • The root is a Module (every Python file is a module)
  • Inside is an Expr (expression statement)
  • The expression is a BinOp (binary operation)
  • It has a left operand (3), an operator (+), and a right operand (5)

Here’s a more interesting example:

code = "x = 5 + 6  # test"
tree = ast.parse(code)
print(ast.dump(tree, indent='  '))
Module(
  body=[
    Assign(
      targets=[
        Name(id='x', ctx=Store())],
      value=BinOp(
        left=Constant(value=5),
        op=Add(),
        right=Constant(value=6)))])

The comment disappeared - the AST only captures code structure, not formatting or comments.

You can go from AST back to source code (since Python 3.9):

print(ast.unparse(tree))
x = 5 + 6

But you won’t get your original code back - just equivalent code. The comment is gone, and whitespace may differ.

Unpacking assignments

Even simple-looking code can have complex ASTs:

code = 'x, *stuff, y = L'
tree = ast.parse(code)
print(ast.dump(tree, indent='  '))
Module(
  body=[
    Assign(
      targets=[
        Tuple(
          elts=[
            Name(id='x', ctx=Store()),
            Starred(
              value=Name(id='stuff', ctx=Store()),
              ctx=Store()),
            Name(id='y', ctx=Store())],
          ctx=Store())],
      value=Name(id='L', ctx=Load()))])

The AST reveals how Python interprets the unpacking: x and y are regular targets, stuff is a Starred target that collects the rest.

Parse these statements and explore their AST structure:

statements = [
    "x += 1",
    "[i for i in range(10)]",
    "lambda x: x + 1",
    "def f(a, *args, **kwargs): pass"
]

for stmt in statements:
    tree = ast.parse(stmt)
    print(f"\n{stmt}")
    print(ast.dump(tree, indent='  '))

Can you identify the key node types?
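
A small helper for answering that question - it reuses the statements list above and counts node types with ast.walk, which yields every node in a tree:

from collections import Counter

for stmt in statements:
    node_counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(stmt)))
    print(f"{stmt}: {dict(node_counts)}")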

Transforming code

The AST can be modified programmatically. Python provides ast.NodeTransformer for this:

class AddToMul(ast.NodeTransformer):
    """Convert all additions to multiplications"""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # recurse first, so nested additions are transformed too
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node

code = "x = 3 + 4 + 5"
tree = ast.parse(code)
print("Original:", ast.unparse(tree))

transformed = AddToMul().visit(tree)  # note: visit() mutates the tree in place
print("Modified:", ast.unparse(transformed))
Original: x = 3 + 4 + 5
Modified: x = 3 * 4 * 5

This is how refactoring and code-rewriting tools work - they parse, transform, and regenerate code.

Design Patterns: The Visitor Pattern

What we’ve just seen is an example of the Visitor pattern - a design pattern that separates an operation from the object structure it operates on. The NodeTransformer is the visitor that “visits” each node in the AST tree, the visit_BinOp method defines what happens when we encounter a binary-operation node, and nodes without a matching visit_* method are handled generically.

This pattern allows us to add new operations on the AST without modifying the AST node classes themselves. We’ll explore design patterns in much greater detail later in the course, but it’s important to recognize them when we encounter them in real-world code.
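
The read-only counterpart is ast.NodeVisitor. A minimal sketch that collects every name in a piece of code:

class NameCollector(ast.NodeVisitor):
    """Collect the id of every Name node encountered while walking the tree."""
    def __init__(self):
        self.names = []

    def visit_Name(self, node):
        self.names.append(node.id)
        self.generic_visit(node)

collector = NameCollector()
collector.visit(ast.parse("result = x + y"))
print(collector.names)  # ['result', 'x', 'y']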

CST: When formatting matters

The AST is “abstract” - it discards formatting details like comments, whitespace, and exact syntax choices. For tools that need to preserve these (like code formatters), there’s an alternative: the Concrete Syntax Tree.

The external library libCST provides this. With libCST, you can parse a file and write it back byte-for-byte identical, then make targeted changes while preserving formatting.
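
A minimal sketch of that round-trip guarantee (requires pip install libcst; not part of the standard library):

import libcst as cst

source = "x = 1  # a comment the AST would throw away\n"
module = cst.parse_module(source)
assert module.code == source  # byte-for-byte identical, comment included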

Project idea: Consider building a code modernization tool that automatically updates deprecated API calls across a large codebase. For example, you might want to:

  • Replace all os.path.join() calls with pathlib.Path operations
  • Update old-style string formatting (%s) to f-strings
  • Migrate from unittest assertions to pytest style

Using a CST preserves all comments, docstrings, and code style choices while making these targeted transformations - particularly valuable on legacy codebases where existing formatting and documentation matter. The tool could scan a project, identify patterns to modernize, and apply transformations while keeping the code’s original structure intact; with the AST alone this would be impossible, since it discards all formatting information.

Bytecode: What Python actually runs

The AST still isn’t executable. Python compiles it into bytecode - a low-level instruction set for the Python virtual machine.

code = "print(3 + 4)"
code_object = compile(code, '<example>', 'eval')

The compile() function takes three arguments:

  • The source code (or an AST)
  • A filename to report in tracebacks (a placeholder like <example> for code that doesn’t come from a file)
  • A mode: 'eval' for a single expression, 'exec' for statements ('single' also exists, for interactive use)

The result is a code object containing bytecode:

print(code_object.co_code)
b'\x95\x00\\\x00"\x00S\x005\x01\x00\x00\x00\x00\x00\x00$\x00'

These bytes are what Python executes. Let’s make them readable with the dis module:

import dis

dis.dis(code_object)
  0           RESUME                   0

  1           LOAD_NAME                0 (print)
              PUSH_NULL
              LOAD_CONST               0 (7)
              CALL                     1
              RETURN_VALUE

Each line is an instruction:

  • RESUME - An internal marker (a near no-op) that supports debugging and tracing
  • LOAD_NAME - Load a variable (here, print)
  • LOAD_CONST - Load a constant (here, 7, the result of 3 + 4)
  • CALL - Call a function
  • RETURN_VALUE - Return the result

Notice that 3 + 4 was pre-computed to 7! This is constant folding: the compiler evaluates constant expressions at compile time.

Constants and names

Code objects store constants and names separately:

code1 = compile('a = 10; print(3 + a)', '<ex1>', 'exec')
code2 = compile('a = 11; print(4 + a)', '<ex2>', 'exec')

print("Same bytecode?", code1.co_code == code2.co_code)
print()
print("Code 1 constants:", code1.co_consts)
print("Code 2 constants:", code2.co_consts)
print()
print("Code 1 names:", code1.co_names)
print("Code 2 names:", code2.co_names)
Same bytecode? True

Code 1 constants: (10, 3, None)
Code 2 constants: (11, 4, None)

Code 1 names: ('a', 'print')
Code 2 names: ('a', 'print')

The bytecode is identical! Only the constants differ. This separation makes the bytecode more compact and flexible.

The .pyc files

When you import a module, Python saves the compiled bytecode to a .pyc file in the __pycache__ directory. This speeds up subsequent imports - Python can skip tokenization, parsing, and compilation.
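
You can ask Python where the cache file for a source file would live, and compile one explicitly without importing it (the filename below is illustrative):

import importlib.util
import py_compile

print(importlib.util.cache_from_source("mymodule.py"))
# e.g. __pycache__/mymodule.cpython-313.pyc

# py_compile.compile("mymodule.py")  # writes the .pyc for an existing file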

Performance implications

Creating .pyc files can take time for large codebases. The first import is slow, but subsequent imports are much faster. This is why:

  • Your app might start slowly the first time after code changes
  • .pyc files should generally be in .gitignore (they’re machine-generated)
  • Deployment systems sometimes pre-compile to speed up cold starts
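
That pre-compilation step is typically done with the standard compileall module:

python -m compileall my_project/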

You can control .pyc generation with command-line flags:

python -B my_script.py

The -B flag prevents Python from writing .pyc files. Useful when you don’t have write permissions or want to avoid clutter during development.

Or use environment variables:

set PYTHONDONTWRITEBYTECODE=1
python my_script.py

To customize where .pyc files go:

set PYTHONPYCACHEPREFIX=C:\temp\pycache
python my_script.py

Bytecode versioning

Bytecode changes between Python minor versions (3.13 vs 3.14), but not between patch versions (3.13.1 vs 3.13.2). That’s why .pyc files include the Python version in their name:

__pycache__/mymodule.cpython-313.pyc

If you run the same code with Python 3.14, you’ll get a new file:

__pycache__/mymodule.cpython-314.pyc
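
The compatibility check is driven by a per-version magic number stored in each .pyc header, which you can inspect:

import importlib.util

print(importlib.util.MAGIC_NUMBER)  # changes whenever the bytecode format does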

Exploring bytecode

You can examine individual instructions:

bytecode = dis.Bytecode('a = 11; print(4 + a)')

for instr in bytecode:
    print(f"{instr.opname:20s} {instr.argval}")
RESUME               0
LOAD_CONST           11
STORE_NAME           a
LOAD_NAME            print
PUSH_NULL            None
LOAD_CONST           4
LOAD_NAME            a
BINARY_OP            0
CALL                 1
POP_TOP              None
RETURN_CONST         None

Or compare different Python constructs:

dis.dis("x = 5")
print()
dis.dis("x += 5")
  0           RESUME                   0

  1           LOAD_CONST               0 (5)
              STORE_NAME               0 (x)
              RETURN_CONST             1 (None)

  0           RESUME                   0

  1           LOAD_NAME                0 (x)
              LOAD_CONST               0 (5)
              BINARY_OP               13 (+=)
              STORE_NAME               0 (x)
              RETURN_CONST             1 (None)

+= is not just syntactic sugar for x = x + 5 - it compiles to an in-place BINARY_OP (argument 13, +=), which lets mutable types implement augmented assignment via __iadd__.
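
The difference is observable with mutable types such as lists:

nums = [1, 2]
alias = nums
nums += [3]        # in-place: extends the existing list
print(alias)       # [1, 2, 3] - the alias sees the change

nums = [1, 2]
alias = nums
nums = nums + [3]  # builds a brand-new list
print(alias)       # [1, 2] - the alias still points at the old list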

Exercise

Compare the bytecode of these equivalent operations:

# List comprehension
dis.dis("[x*2 for x in range(10)]")

# Generator expression  
dis.dis("(x*2 for x in range(10))")

# Map function
dis.dis("list(map(lambda x: x*2, range(10)))")

Which is most complex? Can you guess which is fastest?
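
A quick, unscientific way to check your guess (timings vary by machine and Python version):

import timeit

print(timeit.timeit("[x*2 for x in range(10)]", number=100_000))
print(timeit.timeit("list(x*2 for x in range(10))", number=100_000))
print(timeit.timeit("list(map(lambda x: x*2, range(10)))", number=100_000))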

Visualizing with Godbolt

Godbolt Compiler Explorer (godbolt.org) lets you see bytecode interactively. Select “Python” as the language, write code on the left, and see the disassembly on the right.

You can:

  • Compare different Python versions side by side
  • Hover over code to highlight corresponding bytecode
  • See how optimizations change bytecode

Try it with this example:

def add(a, b):
    return a + b

def add_constant(x):
    return x + 42

You’ll see that add_constant pre-computes less than you might expect - the addition still happens at runtime.

Using Godbolt effectively

  1. Start simple - complex code generates lots of bytecode
  2. Compare Python versions to see optimizations
  3. Look for patterns in how Python handles common constructs
  4. Remember: fewer instructions ≠ faster code (but it’s often correlated)

Performance and the future

Understanding Python’s compilation pipeline helps explain performance characteristics:

  • Startup time: Includes tokenization, parsing, and compilation
  • Import time: Saved by .pyc files on subsequent runs
  • Runtime: Dominated by bytecode execution

The JIT revolution

Python 3.13 introduced experimental JIT (Just-In-Time) compilation support. See PEP 744 and the official Python 3.13 documentation for details.

Traditional Python:

Source → Tokens → AST → Bytecode → Interpreter

With JIT:

Source → Tokens → AST → Bytecode → [JIT Compiler] → Machine Code

The JIT compiler can:

  • Detect hot code paths (frequently executed code)
  • Compile bytecode to native machine code
  • Optimize based on runtime behavior

This is still experimental in Python 3.13/3.14, but it represents a major shift in how Python executes code. Future versions may enable the JIT by default, potentially improving performance for CPU-bound code - though as of Python 3.14 the measured gains are still small, at least according to this blog post.

JIT Availability

The JIT compiler is only available when Python is built with the --enable-experimental-jit configuration option. To use the JIT, you’ll need to build Python from source with this flag enabled, or use a distribution that includes JIT support. In Python 3.14, a sys._jit submodule was added for inspecting JIT status at runtime.
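
A minimal sketch for probing JIT status on a 3.14 build (sys._jit is a private interface and may change):

import sys

if hasattr(sys, "_jit"):
    print("JIT available:", sys._jit.is_available())
    print("JIT enabled: ", sys._jit.is_enabled())
else:
    print("This build predates sys._jit (added in Python 3.14)")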

Enabling JIT (Python 3.13+)

If you have a JIT-enabled Python build:

set PYTHON_JIT=1
python my_script.py

Practical implications

Why should you care about Python’s internals?

Debugging: Error messages reference these stages

  File "<example>", line 1
    print(3 + 4
          ^
SyntaxError: '(' was never closed

The tokenizer caught this before we even got to the AST.

Performance: Understanding bytecode helps optimize

  • List comprehensions generate cleaner bytecode than loops
  • Local variables are faster than globals (different bytecode instructions - see the sketch below)
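
A minimal sketch of the local-vs-global difference:

import dis

g = 1

def use_global():
    return g + g

def use_local():
    loc = 1
    return loc + loc

dis.dis(use_global)  # loads g with LOAD_GLOBAL each time
dis.dis(use_local)   # loads loc with the cheaper LOAD_FAST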

Tooling: Modern Python tools work at these levels

  • Black (formatter): Works with tokens and AST
  • MyPy (type checker): Analyzes AST
  • Coverage.py: Tracks bytecode execution

Code generation: You can write code that writes code

  • Generate optimized functions at runtime
  • Create DSLs (Domain-Specific Languages)
  • Build advanced decorators and metaprogramming tools

Summary

Python’s journey from source to execution:

  1. Tokenization: Source code → Tokens
    • Breaks code into meaningful pieces
    • Catches basic syntax errors
  2. AST: Tokens → Tree structure
    • Represents code logic
    • Enables code analysis and transformation
  3. Compilation: AST → Bytecode
    • Generates platform-independent instructions
    • Cached in .pyc files
  4. Execution: Bytecode → Results
    • Interpreted by Python VM
    • (Future: JIT compilation to machine code)

Each stage is accessible through Python’s standard library. Experiment, explore, and demystify what happens when you hit “run”.

Write a small Python script, then trace it through all stages:

code = """
def greet(name):
    return f"Hello, {name}!"

print(greet("World"))
"""

import ast
import dis
import tokenize
from io import StringIO

# Tokenize
tokens = list(tokenize.generate_tokens(StringIO(code).readline))
print(f"Token count: {len(tokens)}")

# Parse to AST  
tree = ast.parse(code)
print(f"AST nodes: {len(list(ast.walk(tree)))}")

# Compile to bytecode
bytecode = compile(tree, '<example>', 'exec')
print(f"Bytecode length: {len(bytecode.co_code)} bytes")

# Disassemble
dis.dis(bytecode)

# Execute
exec(bytecode)

Additional resources