Python variables, references and mutability
There are many articles like it, but this one is mine
Summary
The box analogy that most languages use to explain variables doesn't fit Python very well. Python variables are quite dumb: they have no type, no permission, they are little more than stickers you put on objects, like the ones you proudly put on your laptop.
In fact, you can't even choose to copy an object from the variable perspective: all variables are references, and when you "copy" a variable into another one, you are really just adding one more sticker to the original object.
Objects, on the other hand, are quite sophisticated. They have a type, but above all, some of them can be modified, such as lists, dicts and sets, while others cannot be modified, like strings, numbers or tuples.
Add a pinch of weird initialization behavior, and you get a few interesting corner cases when you mix references with mutability.
Variables are very dumb
In most languages, the typical analogy for variables is putting something in a box.
It doesn't work like that in Python.
Python variables are dumb, very dumb. They don't really contain anything. They can't. They don't have a shape, or a size. They don't have a permission or type.
When you do:
name = "Daenerys Targaryen"
You are not creating a container name to put Daenerys Targaryen in.
Even the phrase "assigning 'Daenerys Targaryen' to the variable 'name'" is not great, because it seems to say you take Daenerys Targaryen, and do something to name with it.
But it's the other way around.
name is more like a sticker you put on Daenerys Targaryen.
And like your laptop, Daenerys Targaryen can have as many stickers as you want:
>>> name = "Daenerys Targaryen"
>>> title = name
>>> aka = name
Now Daenerys Targaryen has 3 stickers on it: name, title and aka.
You might say, but if I do:
name: str = "Daenerys Targaryen"
Then it does have a type!
But no, annotations are just executable documentation. They have no effect on the program. They are meant for 3rd party tools and introspection.
This will work perfectly:
>>> name: str = "Daenerys Targaryen"
>>> name = 1
>>> print(name + 1)
2
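The same holds for function annotations. Here is a small sketch; the function name shout is made up for the illustration. The annotations are stored as metadata on the function, but nothing checks them when the function is called.

```python
def shout(text: str) -> str:
    # The annotations say "str", but Python never checks them at runtime.
    return text * 2

# The annotations are kept as plain metadata on the function object.
print(shout.__annotations__)

# Passing an int "violates" the annotation, yet the call works fine.
doubled = shout(21)
print(doubled)  # 42
```

Only a third-party type checker would complain about the last call; the interpreter itself does not care.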
And what about _? Doesn't it make something private? Like:
>>> _password = "admin123"
Again no, it's just a convention for Python developers to communicate an intent. Even the __ in classes is not really private:
>>> class MysteriousStranger:
...     def __init__(self):
...         self.__secret = "sepreh evah I"
...
>>> man = MysteriousStranger()
>>> man.__dict__['_MysteriousStranger__secret']
'sepreh evah I'
Python variables are as basic as they get. They are references to stuff.
References
Again, many programming languages can pass things either by reference or by value. "By value" is a copy. "By reference" passes the information of where to get the thing you are talking about.
In Python, there is no "by value". Everything is "by reference".
You can get a unique number that represents the object behind that reference using the id() function:
>>> name = "Daenerys Targaryen"
>>> title = name
>>> id(title)
140606631565472
>>> id(name)
140606631565472
The number is the same because the two variables are really references to the same object under the hood. If you create a new string, though:
>>> clone = "Daenerys Targaryen"
>>> id(clone)
140606631565568
The new number is different.
The is keyword checks identity: it returns True if two variables reference the same object:
>>> name is title
True
>>> clone is title
False
That's why, when comparing things, you should usually use == and not is, since needing to check the reference itself is a very niche use case. You usually want to know if the objects are equivalent:
>>> name == title
True
>>> clone == title
True
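A short sketch of the same distinction with lists, plus the one everyday case where is is the right tool: comparing with None, which is a singleton. The variable names are only illustrative.

```python
first = ["Drogon", "Rhaegal", "Viserion"]
second = ["Drogon", "Rhaegal", "Viserion"]  # a separate list with equal content
alias = first                                # a second sticker on the first list

print(first == second)  # True: same content
print(first is second)  # False: two distinct objects
print(first is alias)   # True: one object, two names

# None is a singleton, so identity is the idiomatic check for it.
result = None
print(result is None)   # True
```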
Because of the way Python manages memory, you cannot delete an object yourself. You can only remove a reference to it.
E.G., here I remove one reference, but the object is still here:
>>> name
'Daenerys Targaryen'
>>> title
'Daenerys Targaryen'
>>> id(name)
140606631565472
>>> id(title)
140606631565472
>>> del name # deleting the 'name' variable
>>> title # the object is still referenced here
'Daenerys Targaryen'
>>> id(title)
140606631565472
To delete an object in Python, you have to remove all references to it. Then Python's memory management (reference counting in CPython, plus a garbage collector for reference cycles) will realize, "oh, there is no more reference to that object, I can delete it".
We can visualize it using the __del__ method, since it's called automatically when an object is about to be deleted:
>>> class ImmortalButNotForLong:
...     def __del__(self):
...         print("So young...")
...
>>> first_reference = ImmortalButNotForLong()
>>> print(first_reference)
<__main__.ImmortalButNotForLong object at 0x7fe188567840>
>>> second_reference = first_reference
>>> del first_reference
>>> print(second_reference)
<__main__.ImmortalButNotForLong object at 0x7fe188567840>
>>> del second_reference
So young...
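It's only when we remove the second reference that the object is really deleted by Python, and the message is printed.
Here the process is immediate, because CPython counts references and frees an object as soon as the count drops to zero. Other implementations, such as PyPy, and objects caught in reference cycles, may only be cleaned up some time after the last reference is removed.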
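Another way to watch an object disappear, without __del__, is a weak reference: a reference that does not keep the object alive. This is a sketch using the standard weakref module; the class name Dragon is made up, and the explicit gc.collect() calls are only there for implementations that do not free objects right away.

```python
import gc
import weakref

class Dragon:
    """A throwaway class; instances of plain classes support weak references."""

first_reference = Dragon()
second_reference = first_reference

# A weak reference lets us look at the object without keeping it alive.
watcher = weakref.ref(first_reference)

del first_reference
gc.collect()
alive_after_first_del = watcher() is not None
print(alive_after_first_del)   # True: second_reference still holds the object

del second_reference
gc.collect()  # not needed on CPython, but useful on e.g. PyPy
alive_after_second_del = watcher() is not None
print(alive_after_second_del)  # False: no strong reference is left
```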
Note that deleting a variable is not the only way to remove a reference. You can also reassign the variable, and you get the same result:
>>> first_reference = ImmortalButNotForLong()
>>> first_reference = 1
So young...
That's because you are moving the sticker from one object to the other. The variable now references another object, and the original one doesn't have any reference pointing to it anymore.
Some operations will sneakily do this:
>>> title = "Daenerys Targaryen"
>>> id(title)
140606594352224
>>> title += ", the Unburnt"
>>> print(title)
Daenerys Targaryen, the Unburnt
>>> id(title)
140606594234096
The number is different! So the variable now points to a new object.
But how? Didn't we just modify the original string?
Nope, because strings in Python cannot be modified. They are immutable.
Mutability
Python doesn't have rules on whether you can modify a variable or not. All variables can be modified, no matter what they reference. There is no private or const.
Instead, there are rules on whether objects can be modified or not.
Some objects can be modified, like lists, dictionaries, sets, or instances of your own classes (at least by default). They are said to be mutable.
Some objects cannot be modified, like strings, numbers or tuples. They are said to be immutable.
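A quick sketch of what "cannot be modified" means in practice: an in-place change on an immutable object raises an exception, while the same kind of change on a list succeeds. The variable names are only illustrative.

```python
titles_list = ["Daenerys Targaryen", "the Unburnt"]
titles_tuple = ("Daenerys Targaryen", "the Unburnt")
name = "Daenerys Targaryen"

# Lists are mutable: item assignment changes the object in place.
titles_list[1] = "Mother of Dragons"
print(titles_list)

# Tuples and strings are immutable: item assignment is refused.
errors = []
for immutable in (titles_tuple, name):
    try:
        immutable[0] = "X"
    except TypeError as error:
        errors.append(type(immutable).__name__)
        print(f"{type(immutable).__name__}: {error}")
```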
E.G., if you add an element to a list, you get the same list:
>>> titles = ['Daenerys Targaryen', 'the Unburnt']
>>> id(titles)
140606594312352
>>> titles += ["Queen of Meereen"]
>>> titles
['Daenerys Targaryen', 'the Unburnt', 'Queen of Meereen']
>>> id(titles)
140606594312352
After the operation, id() returns the same value, because titles references the same list object.
This is not true for strings:
>>> titles = 'Daenerys Targaryen, the Unburnt'
>>> id(titles)
140606594235552
>>> titles += ", Queen of Meereen"
>>> titles
'Daenerys Targaryen, the Unburnt, Queen of Meereen'
>>> id(titles)
140606594243264
id() returns a new number, because titles references a new string object.
How can it be?
Well, strings cannot be modified in Python. Any operation that appears to change them returns a new object. If you do "Hello".upper(), you suddenly have two objects: "Hello" and "HELLO". It's the same here:
>>> titles = 'Daenerys Targaryen, the Unburnt'
Creates one string object. Then:
>>> titles += ", Queen of Meereen"
Creates another, completely new object AND also moves the titles sticker from the first object to the second. So after a while, the original object will be deleted by Python (if nothing else references it), giving the illusion we have modified the string.
This interaction has an interesting property: it means if you have several references to the same object, some operations will have a totally different effect depending on whether the object is mutable or not.
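The same contrast shows up when objects are passed to functions, since arguments are just more stickers. In this sketch, the helper add_title is hypothetical; it runs the very same += on whatever it receives.

```python
def add_title(titles, extra):
    # For a list, += extends the caller's object in place.
    # For a string, += builds a new object and rebinds the local name only.
    titles += extra
    return titles

list_titles = ["Daenerys Targaryen"]
str_titles = "Daenerys Targaryen"

new_list = add_title(list_titles, ["the Unburnt"])
new_str = add_title(str_titles, ", the Unburnt")

print(list_titles)              # the caller's list has changed
print(new_list is list_titles)  # True: the same object came back
print(str_titles)               # the caller's string is untouched
print(new_str is str_titles)    # False: a new string was created
```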
Reference playing with mutability
This behavior is an endless source of surprises, also known as bugs, for beginners.
Indeed, because beginners think with boxes instead of stickers, they might think two variables "contain" different things. They then get shocked when they have two stickers on the same object and they use one to modify it.
With strings, which are immutable, they get this:
>>> titles = 'Daenerys Targaryen, the Unburnt'
>>> names = titles
>>> titles += ", Queen of Meereen"
>>> print(names)
Daenerys Targaryen, the Unburnt
>>> print(titles)
Daenerys Targaryen, the Unburnt, Queen of Meereen
Any operation on strings produces a new object, so there are eventually two strings, each with its own variable, its own sticker.
But with lists, which are mutable, they get this:
>>> titles = ['Daenerys Targaryen', 'the Unburnt']
>>>
>>> names = titles
>>> titles += ["Queen of Meereen"]
>>> print(names)
['Daenerys Targaryen', 'the Unburnt', 'Queen of Meereen']
>>> print(titles)
['Daenerys Targaryen', 'the Unburnt', 'Queen of Meereen']
It feels like the modification they applied has propagated everywhere! But of course, it's not true, there is no "everywhere". There was only one list, and two stickers on this single list.
It's not that the addition has been applied to the two variables, it's that the addition is not applied to a variable, but the underlying object. The variables just lead to this object. Here there are two variables, but behind the scenes, they really are leading to the same object. So if you modify the object, it doesn't matter which variable you look at, you will see the one and only list object there is, which is the modified list.
This example is simple, because we created the variables, and hence the references, manually ourselves. But some operations in Python create references to the same object behind your back.
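When you do want two independent objects, you have to ask for a copy explicitly. A minimal sketch, using list.copy() for a shallow copy and copy.deepcopy() from the standard library for nested structures (the accounts data is made up for the illustration):

```python
import copy

titles = ["Daenerys Targaryen", "the Unburnt"]
names = titles.copy()          # a new list holding the same references
titles += ["Queen of Meereen"]
print(names)                   # unchanged: ['Daenerys Targaryen', 'the Unburnt']

# A shallow copy only duplicates the outer container.
accounts = [[100], [200]]
shallow = accounts.copy()
deep = copy.deepcopy(accounts)
accounts[0].append(-300)
print(shallow[0])              # [100, -300]: the inner list is still shared
print(deep[0])                 # [100]: fully independent
```

A shallow copy is enough when the elements are immutable; deepcopy() is the heavier tool for containers of mutable objects.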
The mutability trap
Indeed, since Python doesn't pass anything "by value", containers don't really contain values, but references to objects. So a list of numbers doesn't really contain the numbers, it contains only references to those numbers. A list is like a huge series of variables, really.
Let's store numbers in a list ourselves:
>>> start = 0
>>> scores = [start, start, start, start]
>>> scores
[0, 0, 0, 0]
It's the same object, 0, repeated 4 times. Look at the id():
>>> id(scores[0])
140606633034112
>>> id(scores[1])
140606633034112
>>> id(scores[2])
140606633034112
>>> id(scores[3])
140606633034112
Now, if I add to one of these numbers, the behavior is quite intuitive:
>>> scores[2] += 1
>>> scores
[0, 0, 1, 0]
That's because numbers are immutable. They cannot be modified. So I just created a new object, and replaced one of the references in the list with it.
>>> id(scores[0])
140606633034112
>>> id(scores[1])
140606633034112
>>> id(scores[2]) # <--- a new object
140606633034176
>>> id(scores[3])
140606633034112
But if I do the same with a list of lists:
>>> operations = []
>>> accounts = [operations, operations, operations, operations]
>>> accounts
[[], [], [], []]
Then the behavior is not intuitive:
>>> accounts[2] += [-300]
>>> accounts
[[-300], [-300], [-300], [-300]]
Again, it's not that I modified "all the accounts" at once. There is no such thing as "all the accounts". There was only one account to begin with, that we referenced 4 times.
The solution is to create 4 separate lists:
>>> accounts = [[], [], [], []]
>>> accounts[2] += [-300]
>>> accounts
[[], [], [-300], []]
If you think this trap is obvious, you have to remember Python has plenty of corner cases where this behavior lurks.
E.G., you will see this shortcut in tutorials:
>>> scores = [0] * 4
>>> scores
[0, 0, 0, 0]
>>> scores[2] += 1
>>> scores
[0, 0, 1, 0]
It's very handy, but it's also tricky, as it really creates 4 references to the same object, not 4 objects:
>>> operations = [[]] * 4
>>> operations
[[], [], [], []]
>>> operations[2] += [-300]
>>> operations
[[-300], [-300], [-300], [-300]]
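A common way around this is a list comprehension, which evaluates the [] expression once per element instead of repeating one reference. A minimal sketch comparing the two (the names shared and separate are only illustrative):

```python
# [[]] * 4 repeats one reference; the comprehension evaluates [] four times.
shared = [[]] * 4
separate = [[] for _ in range(4)]

shared[2] += [-300]
separate[2] += [-300]

print(shared)    # [[-300], [-300], [-300], [-300]]
print(separate)  # [[], [], [-300], []]
print(len({id(item) for item in shared}))    # 1 distinct object
print(len({id(item) for item in separate}))  # 4 distinct objects
```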
More insidious is the famous "mutable default argument" trap, which you probably read about a lot already: default values in functions are initialized once and for all.
Let's create a very stupid and buggy function:
def biggest_operation(account, reference_account=[]):
    reference_account.append(max(account))
    return max(reference_account)
If I use it twice, I will get a very weird bug:
>>> biggest_operation([5, 10, 20])
20
>>> biggest_operation([5, 10])
20
Wait, what?
But things get clearer if I display reference_account:
def biggest_operation(account, reference_account=[]):
    print(f"Reference account starts with {reference_account}")
    reference_account.append(max(account))
    return max(reference_account)
We now can see that the second time, reference_account is not empty:
>>> biggest_operation([5, 10, 20])
Reference account starts with []
20
>>> biggest_operation([5, 10])
Reference account starts with [20]
20
That's again because reference_account is a reference to a list. But this reference is created only once, and not when you think. It's created when the def statement runs, i.e. when the function is defined, not every time you call the function. Every time you call the function without that argument, reference_account gives you the same reference to the same list object created at definition time. But that's probably not what you wanted. You probably wanted a new empty list every time.
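You can see where that list lives: default values are stored on the function object itself, in its __defaults__ attribute. A sketch reusing the buggy function from above:

```python
def biggest_operation(account, reference_account=[]):
    reference_account.append(max(account))
    return max(reference_account)

# The default list was created once, when the def statement ran.
print(biggest_operation.__defaults__)  # ([],)

biggest_operation([5, 10, 20])
biggest_operation([5, 10])

# Every call without the argument appended to that same list.
print(biggest_operation.__defaults__)  # ([20, 10],)
```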
There is no elegant solution to this; the typical fix is:
def biggest_operation(account, reference_account=None):
    # Create a fresh list on each call unless the caller provided one.
    if reference_account is None:
        reference_account = []
    reference_account.append(max(account))
    return max(reference_account)
And it's the same problem with class attributes:
class Account:
    operation = []  # <- this is a class attribute, not an instance one!
Here, if you create 2 accounts, they will share the same list. Probably not what you want, either.
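A sketch of the difference, with the usual fix of creating the list in __init__ so each instance gets its own. The class name SharedAccount and the attribute name operations are made up for the comparison.

```python
class SharedAccount:
    operations = []  # class attribute: one list for every instance

class Account:
    def __init__(self):
        self.operations = []  # instance attribute: a new list per instance

a, b = SharedAccount(), SharedAccount()
a.operations.append(-300)
print(b.operations)  # [-300]: both instances see the class-level list

c, d = Account(), Account()
c.operations.append(-300)
print(d.operations)  # []: each instance has its own list
```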
Beware of the cache
Some implementations cache stuff, which can muddy the water even more.
For one, CPython will cache small integers, for performance reasons:
>>> id(257) # this is not cached
140606631651008
>>> id(257)
140606631651072
>>> id(256) # this is cached
140606633050560
>>> id(256)
140606633050560
As you can see, typing 256 will not create a new object. It will reuse an existing one.
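To observe this from a script rather than a shell, it helps to build the integers at runtime, because literals inside one file may be merged by the compiler. This sketch relies on a CPython implementation detail (the small integer cache, historically covering -5 to 256), so the is results may differ on other implementations or versions.

```python
# int("...") builds the number at runtime, avoiding compile-time constant sharing.
small_a, small_b = int("256"), int("256")
big_a, big_b = int("257"), int("257")

# Values are always equal, whatever the implementation does with caching.
print(small_a == small_b, big_a == big_b)  # True True

# Identity depends on the cache; do not rely on it in real code.
print(small_a is small_b)  # typically True on CPython: 256 comes from the cache
print(big_a is big_b)      # typically False on CPython: two separate objects
```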
This gets trickier, because it will change depending on the Python implementation, whether you are in a shell or not, and which shell.
E.G., here is what ipython 8.8 using Python 3.10 does:
>>> id('''Princess of Dragonstone
... The Unburnt
... Queen of Meereen
... Queen of the Andals, the Rhoynar and the First Men
... Protector of the Seven Kingdoms
... Khaleesi of the Great Grass Sea
... Breaker of Shackles
... Mother of Dragons''')
140370331239904
>>> id('''Princess of Dragonstone
... The Unburnt
... Queen of Meereen
... Queen of the Andals, the Rhoynar and the First Men
... Protector of the Seven Kingdoms
... Khaleesi of the Great Grass Sea
... Breaker of Shackles
... Mother of Dragons''')
140370331239040
And here is what Python 3.10 default shell does:
>>> id('''Princess of Dragonstone
... The Unburnt
... Queen of Meereen
... Queen of the Andals, the Rhoynar and the First Men
... Protector of the Seven Kingdoms
... Khaleesi of the Great Grass Sea
... Breaker of Shackles
... Mother of Dragons''')
139785940529888
>>> id('''Princess of Dragonstone
... The Unburnt
... Queen of Meereen
... Queen of the Andals, the Rhoynar and the First Men
... Protector of the Seven Kingdoms
... Khaleesi of the Great Grass Sea
... Breaker of Shackles
... Mother of Dragons''')
139785940529888
Most of the time, I advise you not to think too much about it. But if suddenly you see some very strange behavior, that's a good first thing to investigate.
So well explained! There are many articles like it, but this is the best one I ever read on this topic!
May I translate this article for my French students and publish the translated version on my website (with a link to the original content, of course)?
In any case, thank you
Nice.
One trap I've hit several times is: I think I copied an object, like a=b. But all Python does is make a & b point to the same object, so modifying b also changes a!
It's very painful to debug when this goes wrong.
The solution, I found, is to always use deepcopy() when you want to create a new object. But this isn't intuitive, and searching online doesn't give you the simple answer (unless you already know what the problem is)