Python fundamentals

How it’s made: Python

Author

Karsten Naert

Published

November 15, 2025

Introduction

Python is, strictly speaking, a language specification. The most common implementation is CPython, which ships as the familiar python.exe executable on Windows.

When you run python my_program.py, CPython processes your code through four distinct stages:

  1. Tokenization - Breaking source code into meaningful chunks
  2. AST (Abstract Syntax Tree) - Organizing tokens into a logical structure
  3. Compilation - Converting the AST into bytecode
  4. Execution - Running the bytecode

What makes Python interesting is that all of these stages are accessible to you through built-in modules. We’ll explore each stage hands-on.

Version matters

The internals we discuss are specific to CPython 3.13/3.14. Other Python implementations (PyPy, GraalPy) work differently under the hood. Even between CPython versions, bytecode and internal representations can change significantly.
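
If you want to guard experiments against the wrong interpreter, here's a minimal check:

import sys

print(sys.implementation.name)   # 'cpython' on the reference implementation
print(sys.version_info[:2])      # e.g. (3, 13)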

Tokenization

The first step is breaking your source code into tokens - sequences of characters that have meaning in Python. Think of it like breaking a sentence into words, except for code.

import tokenize
from io import StringIO

text = "print('Hello World')"
s = StringIO(text)

for tok in tokenize.generate_tokens(s.readline):
    print(tok.string)
print
(
'Hello World'
)

Our simple print statement consists of 6 tokens; the last two (NEWLINE and ENDMARKER) have empty strings, which is why they showed up as blank lines above. Let's make the boundaries visible:

text = "print('Hello World')"
s = StringIO(text)
'|'.join(tok.string for tok in tokenize.generate_tokens(s.readline))
"print|(|'Hello World'|)||"

Notice that 'Hello World' (including quotes) is treated as a single token. Each token also has a type:

text = "abc  +  xyz"
s = StringIO(text)

for tok in tokenize.generate_tokens(s.readline):
    print(f"{tok.type:2d} {tokenize.tok_name[tok.type]:10s} {tok.string!r}")
 1 NAME       'abc'
55 OP         '+'
 1 NAME       'xyz'
 4 NEWLINE    ''
 0 ENDMARKER  ''

Here’s a more complex example showing how Python handles function definitions:

text = """
def f(x):
    return 2 * x
"""
s = StringIO(text)

for tok in tokenize.generate_tokens(s.readline):
    print(f"{tokenize.tok_name[tok.type]:10s} {tok.string!r}")
NL         '\n'
NAME       'def'
NAME       'f'
OP         '('
NAME       'x'
OP         ')'
OP         ':'
NEWLINE    '\n'
INDENT     '    '
NAME       'return'
NUMBER     '2'
OP         '*'
NAME       'x'
NEWLINE    '\n'
DEDENT     ''
ENDMARKER  ''

Notice the INDENT and DEDENT tokens - Python’s whitespace sensitivity is baked in at the tokenization level.

Try tokenizing code with syntax errors:

text = "print('unclosed string"
s = StringIO(text)
list(tokenize.generate_tokens(s.readline))

What happens? On recent CPython versions, the tokenizer raises an error for the unterminated string. But it only catches lexical errors - nonsense like def def def tokenizes without complaint, because the tokenizer doesn't understand grammar or semantics yet.
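
A minimal sketch of handling this (the exact exception and message depend on the CPython version; on recent versions an unterminated string raises tokenize.TokenError):

import tokenize
from io import StringIO

text = "print('unclosed string"
try:
    list(tokenize.generate_tokens(StringIO(text).readline))
except tokenize.TokenError as exc:
    print("tokenizer gave up:", exc)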

Practical uses

Tokenization is useful beyond Python’s internal workings:

  • Code analysis: Find all variable names, detect naming conventions
  • Syntax highlighting: Colorize code in editors
  • Code formatting tools: Tools like Black use tokenization to understand code structure

Quick tool: Variable name finder

def find_names(code):
    """Find all identifiers in Python code (keywords excluded)"""
    import keyword  # keywords like 'def' arrive as NAME tokens, so filter them out
    s = StringIO(code)
    names = {tok.string for tok in tokenize.generate_tokens(s.readline)
             if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string)}
    return names

code = """
def calculate(x, y):
    result = x + y
    return result
"""

print(find_names(code))
{'y', 'result', 'x', 'calculate'}

Aside: Tokenization in LLMs

Large Language Models also use “tokens”, but they’re different. Python tokenization splits code based on syntax rules. LLM tokenization (like GPT’s) splits text into subword units based on frequency. Similar name, completely different purpose.

The Abstract Syntax Tree (AST)

Tokens tell us what pieces we have. The AST tells us how they fit together.

import ast

code = "3 + 5"
tree = ast.parse(code)
print(ast.dump(tree, indent='  '))
Module(
  body=[
    Expr(
      value=BinOp(
        left=Constant(value=3),
        op=Add(),
        right=Constant(value=5)))])

That’s a lot of structure for 3 + 5! Let’s break it down:

  • The root is a Module (every Python file is a module)
  • Inside is an Expr (expression statement)
  • The expression is a BinOp (binary operation)
  • It has a left operand (3), an operator (+), and a right operand (5)

Here’s a more interesting example:

code = "x = 5 + 6  # test"
tree = ast.parse(code)
print(ast.dump(tree, indent='  '))
Module(
  body=[
    Assign(
      targets=[
        Name(id='x', ctx=Store())],
      value=BinOp(
        left=Constant(value=5),
        op=Add(),
        right=Constant(value=6)))])

The comment disappeared - the AST only captures code structure, not formatting or comments.

You can go from AST back to source code (since Python 3.9):

print(ast.unparse(tree))
x = 5 + 6

But you won’t get your original code back - just equivalent code. The comment is gone, and whitespace may differ.

Unpacking assignments

Even simple-looking code can have complex ASTs:

code = 'x, *stuff, y = L'
tree = ast.parse(code)
print(ast.dump(tree, indent='  '))
Module(
  body=[
    Assign(
      targets=[
        Tuple(
          elts=[
            Name(id='x', ctx=Store()),
            Starred(
              value=Name(id='stuff', ctx=Store()),
              ctx=Store()),
            Name(id='y', ctx=Store())],
          ctx=Store())],
      value=Name(id='L', ctx=Load()))])

The AST reveals how Python interprets the unpacking: x and y are regular targets, stuff is a Starred target that collects the rest.

Parse these statements and explore their AST structure:

statements = [
    "x += 1",
    "[i for i in range(10)]",
    "lambda x: x + 1",
    "def f(a, *args, **kwargs): pass"
]

for stmt in statements:
    tree = ast.parse(stmt)
    print(f"\n{stmt}")
    print(ast.dump(tree, indent='  '))

Can you identify the key node types?
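
A small helper for answering that question - it reuses the statements list above and counts node types with ast.walk, which yields every node in a tree:

from collections import Counter

for stmt in statements:
    node_counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(stmt)))
    print(f"{stmt}: {dict(node_counts)}")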

Transforming code

The AST can be modified programmatically. Python provides ast.NodeTransformer for this:

class AddToMul(ast.NodeTransformer):
    """Convert all additions to multiplications"""
    def visit_BinOp(self, node):
        self.generic_visit(node)  # recurse first, so nested additions are transformed too
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node

code = "x = 3 + 4 + 5"
tree = ast.parse(code)
print("Original:", ast.unparse(tree))

transformed = AddToMul().visit(tree)  # note: visit() mutates the tree in place
print("Modified:", ast.unparse(transformed))
Original: x = 3 + 4 + 5
Modified: x = 3 * 4 * 5

This is how refactoring and code-rewriting tools work - they parse, transform, and regenerate code.

Design Patterns: The Visitor Pattern

What we’ve just seen is an example of the Visitor pattern - a design pattern that separates an operation from the object structure it operates on. The NodeTransformer is the visitor that “visits” each node in the AST tree, the visit_BinOp method defines what happens when we encounter a binary-operation node, and nodes without a matching visit_* method are handled generically.

This pattern allows us to add new operations on the AST without modifying the AST node classes themselves. We’ll explore design patterns in much greater detail later in the course, but it’s important to recognize them when we encounter them in real-world code.
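
The read-only counterpart is ast.NodeVisitor. A minimal sketch that collects every name in a piece of code:

class NameCollector(ast.NodeVisitor):
    """Collect the id of every Name node encountered while walking the tree."""
    def __init__(self):
        self.names = []

    def visit_Name(self, node):
        self.names.append(node.id)
        self.generic_visit(node)

collector = NameCollector()
collector.visit(ast.parse("result = x + y"))
print(collector.names)  # ['result', 'x', 'y']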

CST: When formatting matters

The AST is “abstract” - it discards formatting details like comments, whitespace, and exact syntax choices. For tools that need to preserve these (like code formatters), there’s an alternative: the Concrete Syntax Tree.

The external library libCST provides this. With libCST, you can parse a file and write it back byte-for-byte identical, then make targeted changes while preserving formatting.
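
A minimal sketch of that round-trip guarantee (requires pip install libcst; not part of the standard library):

import libcst as cst

source = "x = 1  # a comment the AST would throw away\n"
module = cst.parse_module(source)
assert module.code == source  # byte-for-byte identical, comment included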

Project idea: Consider building a code modernization tool that automatically updates deprecated API calls across a large codebase. For example, you might want to:

  • Replace all os.path.join() calls with pathlib.Path operations
  • Update old-style string formatting (%s) to f-strings
  • Migrate from unittest assertions to pytest style

Using a CST preserves all comments, docstrings, and code style choices while making these targeted transformations - particularly valuable on legacy codebases where existing formatting and documentation matter. The tool could scan a project, identify patterns to modernize, and apply transformations while keeping the code’s original structure intact; with the AST alone this would be impossible, since it discards all formatting information.

Bytecode: What Python actually runs

The AST still isn’t executable. Python compiles it into bytecode - a low-level instruction set for the Python virtual machine.

code = "print(3 + 4)"
code_object = compile(code, '<example>', 'eval')

The compile() function takes three arguments:

  • The source code (or an AST)
  • A filename to report in tracebacks (a placeholder like <example> for code that doesn’t come from a file)
  • A mode: 'eval' for a single expression, 'exec' for statements ('single' also exists, for interactive use)

The result is a code object containing bytecode:

print(code_object.co_code)
b'\x95\x00\\\x00"\x00S\x005\x01\x00\x00\x00\x00\x00\x00$\x00'

These bytes are what Python executes. Let’s make them readable with the dis module:

import dis

dis.dis(code_object)
  0           RESUME                   0

  1           LOAD_NAME                0 (print)
              PUSH_NULL
              LOAD_CONST               0 (7)
              CALL                     1
              RETURN_VALUE

Each line is an instruction:

  • RESUME - An internal marker (a near no-op) that supports debugging and tracing
  • LOAD_NAME - Load a variable (here, print)
  • LOAD_CONST - Load a constant (here, 7, the result of 3 + 4)
  • CALL - Call a function
  • RETURN_VALUE - Return the result

Notice that 3 + 4 was pre-computed to 7! This is constant folding: the compiler evaluates constant expressions at compile time.

Constants and names

Code objects store constants and names separately:

code1 = compile('a = 10; print(3 + a)', '<ex1>', 'exec')
code2 = compile('a = 11; print(4 + a)', '<ex2>', 'exec')

print("Same bytecode?", code1.co_code == code2.co_code)
print()
print("Code 1 constants:", code1.co_consts)
print("Code 2 constants:", code2.co_consts)
print()
print("Code 1 names:", code1.co_names)
print("Code 2 names:", code2.co_names)
Same bytecode? True

Code 1 constants: (10, 3, None)
Code 2 constants: (11, 4, None)

Code 1 names: ('a', 'print')
Code 2 names: ('a', 'print')

The bytecode is identical! Only the constants differ. This separation makes the bytecode more compact and flexible.

The .pyc files

When you import a module, Python saves the compiled bytecode to a .pyc file in the __pycache__ directory. This speeds up subsequent imports - Python can skip tokenization, parsing, and compilation.
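
You can ask Python where the cache file for a source file would live, and compile one explicitly without importing it (the filename below is illustrative):

import importlib.util
import py_compile

print(importlib.util.cache_from_source("mymodule.py"))
# e.g. __pycache__/mymodule.cpython-313.pyc

# py_compile.compile("mymodule.py")  # writes the .pyc for an existing file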

Performance implications

Creating .pyc files can take time for large codebases. The first import is slow, but subsequent imports are much faster. This is why:

  • Your app might start slowly the first time after code changes
  • .pyc files should generally be in .gitignore (they’re machine-generated)
  • Deployment systems sometimes pre-compile to speed up cold starts
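
That pre-compilation step is typically done with the standard compileall module:

python -m compileall my_project/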

You can control .pyc generation with command-line flags:

python -B my_script.py

The -B flag prevents Python from writing .pyc files. Useful when you don’t have write permissions or want to avoid clutter during development.

Or use environment variables:

set PYTHONDONTWRITEBYTECODE=1
python my_script.py

To customize where .pyc files go:

set PYTHONPYCACHEPREFIX=C:\temp\pycache
python my_script.py

Bytecode versioning

Bytecode changes between Python minor versions (3.13 vs 3.14), but not between patch versions (3.13.1 vs 3.13.2). That’s why .pyc files include the Python version in their name:

__pycache__/mymodule.cpython-313.pyc

If you run the same code with Python 3.14, you’ll get a new file:

__pycache__/mymodule.cpython-314.pyc
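
The compatibility check is driven by a per-version magic number stored in each .pyc header, which you can inspect:

import importlib.util

print(importlib.util.MAGIC_NUMBER)  # changes whenever the bytecode format does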

Exploring bytecode

You can examine individual instructions:

bytecode = dis.Bytecode('a = 11; print(4 + a)')

for instr in bytecode:
    print(f"{instr.opname:20s} {instr.argval}")
RESUME               0
LOAD_CONST           11
STORE_NAME           a
LOAD_NAME            print
PUSH_NULL            None
LOAD_CONST           4
LOAD_NAME            a
BINARY_OP            0
CALL                 1
POP_TOP              None
RETURN_CONST         None

Or compare different Python constructs:

dis.dis("x = 5")
print()
dis.dis("x += 5")
  0           RESUME                   0

  1           LOAD_CONST               0 (5)
              STORE_NAME               0 (x)
              RETURN_CONST             1 (None)

  0           RESUME                   0

  1           LOAD_NAME                0 (x)
              LOAD_CONST               0 (5)
              BINARY_OP               13 (+=)
              STORE_NAME               0 (x)
              RETURN_CONST             1 (None)

+= is not just syntactic sugar for x = x + 5 - it compiles to an in-place BINARY_OP (argument 13, +=), which lets mutable types implement augmented assignment via __iadd__.
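
The difference is observable with mutable types such as lists:

nums = [1, 2]
alias = nums
nums += [3]        # in-place: extends the existing list
print(alias)       # [1, 2, 3] - the alias sees the change

nums = [1, 2]
alias = nums
nums = nums + [3]  # builds a brand-new list
print(alias)       # [1, 2] - the alias still points at the old list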

Exercise

Compare the bytecode of these equivalent operations:

# List comprehension
dis.dis("[x*2 for x in range(10)]")

# Generator expression  
dis.dis("(x*2 for x in range(10))")

# Map function
dis.dis("list(map(lambda x: x*2, range(10)))")

Which is most complex? Can you guess which is fastest?
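
A quick, unscientific way to check your guess (timings vary by machine and Python version):

import timeit

print(timeit.timeit("[x*2 for x in range(10)]", number=100_000))
print(timeit.timeit("list(x*2 for x in range(10))", number=100_000))
print(timeit.timeit("list(map(lambda x: x*2, range(10)))", number=100_000))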

Visualizing with Godbolt

Godbolt Compiler Explorer (godbolt.org) lets you see bytecode interactively. Select “Python” as the language, write code on the left, and see the disassembly on the right.

You can:

  • Compare different Python versions side by side
  • Hover over code to highlight corresponding bytecode
  • See how optimizations change bytecode

Try it with this example:

def add(a, b):
    return a + b

def add_constant(x):
    return x + 42

You’ll see that add_constant pre-computes less than you might expect - the addition still happens at runtime.

Using Godbolt effectively

  1. Start simple - complex code generates lots of bytecode
  2. Compare Python versions to see optimizations
  3. Look for patterns in how Python handles common constructs
  4. Remember: fewer instructions ≠ faster code (but it’s often correlated)

Performance and the future

Understanding Python’s compilation pipeline helps explain performance characteristics:

  • Startup time: Includes tokenization, parsing, and compilation
  • Import time: Saved by .pyc files on subsequent runs
  • Runtime: Dominated by bytecode execution

The JIT revolution

Python 3.13 introduced experimental JIT (Just-In-Time) compilation support. See PEP 744 and the official Python 3.13 documentation for details.

Traditional Python:

Source → Tokens → AST → Bytecode → Interpreter

With JIT:

Source → Tokens → AST → Bytecode → [JIT Compiler] → Machine Code

The JIT compiler can:

  • Detect hot code paths (frequently executed code)
  • Compile bytecode to native machine code
  • Optimize based on runtime behavior

This is still experimental in Python 3.13/3.14, but it represents a major shift in how Python executes code. Future versions may enable the JIT by default, potentially improving performance for CPU-bound code - though as of Python 3.14 the measured gains are still small, at least according to this blog post.

JIT Availability

The JIT compiler is only available when Python is built with the --enable-experimental-jit configuration option. To use the JIT, you’ll need to build Python from source with this flag enabled, or use a distribution that includes JIT support. In Python 3.14, a sys._jit submodule was added for inspecting JIT status at runtime.
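
A minimal sketch for probing JIT status on a 3.14 build (sys._jit is a private interface and may change):

import sys

if hasattr(sys, "_jit"):
    print("JIT available:", sys._jit.is_available())
    print("JIT enabled: ", sys._jit.is_enabled())
else:
    print("This build predates sys._jit (added in Python 3.14)")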

Enabling JIT (Python 3.13+)

If you have a JIT-enabled Python build:

set PYTHON_JIT=1
python my_script.py

Practical implications

Why should you care about Python’s internals?

Debugging: Error messages reference these stages

  File "<example>", line 1
    print(3 + 4
          ^
SyntaxError: '(' was never closed

The tokenizer caught this before we even got to the AST.

Performance: Understanding bytecode helps optimize

  • List comprehensions generate cleaner bytecode than loops
  • Local variables are faster than globals (different bytecode instructions - see the sketch below)
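
A minimal sketch of the local-vs-global difference:

import dis

g = 1

def use_global():
    return g + g

def use_local():
    loc = 1
    return loc + loc

dis.dis(use_global)  # loads g with LOAD_GLOBAL each time
dis.dis(use_local)   # loads loc with the cheaper LOAD_FAST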

Tooling: Modern Python tools work at these levels

  • Black (formatter): Works with tokens and AST
  • MyPy (type checker): Analyzes AST
  • Coverage.py: Tracks bytecode execution

Code generation: You can write code that writes code

  • Generate optimized functions at runtime
  • Create DSLs (Domain-Specific Languages)
  • Build advanced decorators and metaprogramming tools

Summary

Python’s journey from source to execution:

  1. Tokenization: Source code → Tokens
    • Breaks code into meaningful pieces
    • Catches basic syntax errors
  2. AST: Tokens → Tree structure
    • Represents code logic
    • Enables code analysis and transformation
  3. Compilation: AST → Bytecode
    • Generates platform-independent instructions
    • Cached in .pyc files
  4. Execution: Bytecode → Results
    • Interpreted by Python VM
    • (Future: JIT compilation to machine code)

Each stage is accessible through Python’s standard library. Experiment, explore, and demystify what happens when you hit “run”.

Write a small Python script, then trace it through all stages:

code = """
def greet(name):
    return f"Hello, {name}!"

print(greet("World"))
"""

import ast
import dis
import tokenize
from io import StringIO

# Tokenize
tokens = list(tokenize.generate_tokens(StringIO(code).readline))
print(f"Token count: {len(tokens)}")

# Parse to AST  
tree = ast.parse(code)
print(f"AST nodes: {len(list(ast.walk(tree)))}")

# Compile to bytecode
bytecode = compile(tree, '<example>', 'exec')
print(f"Bytecode length: {len(bytecode.co_code)} bytes")

# Disassemble
dis.dis(bytecode)

# Execute
exec(bytecode)

Additional resources