Iterators are objects that allow you to traverse all the elements of a collection, regardless of its specific implementation. If you have ever used a loop to run through the values in a container, you have used an iterator.
Why are iterators a good idea?
Code using iterators avoids intermediate variables, tends to be shorter, runs lazily, consumes less memory, often runs faster, composes well, and is more beautiful. In short: iterators are more elegant.
Iterators
An iterator is a stateful helper object that produces the next value each time you call next() on it. Any object that has a __next__() method is therefore an iterator. Every time you ask it for the next value, it knows how to compute it, because it keeps track of the current state of the iteration.
So an iterator is a value factory: each time you ask it for "the next" value, it holds enough internal state to compute it. This also means iterators can represent infinite sequences without allocating resources for every possible element up front, which saves memory.
Several of Python's built-in objects support iteration, and you must have seen some of these before: lists, tuples, strings, dictionaries and even files. There are also many built-in iterators: all of the itertools functions, for example, return iterators.
Some produce infinite sequences:
>>> from itertools import count
>>> counter = count(start=13)
>>> next(counter)
13
>>> next(counter)
14
Some produce infinite sequences from finite sequences:
>>> from itertools import cycle
>>> colors = cycle(['red', 'white', 'blue'])
>>> next(colors)
'red'
>>> next(colors)
'white'
>>> next(colors)
'blue'
>>> next(colors)
'red'
Some produce finite sequences from infinite sequences:
>>> from itertools import islice
>>> colors = cycle(['red', 'white', 'blue']) # infinite
>>> limited = islice(colors, 0, 4) # finite
>>> for x in limited: # so safe to use for-loop on
... print(x)
red
white
blue
red
Let's build an iterator producing the Fibonacci numbers:
>>> class fib:
... def __init__(self):
... self.prev = 0
... self.curr = 1
...
... def __iter__(self):
... return self
...
... def __next__(self):
... value = self.curr
... self.curr += self.prev
... self.prev = value
... return value
...
>>> f = fib()
>>> list(islice(f, 0, 10))
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
Note that this class is both an iterable (because it sports an __iter__() method), and its own iterator (because it has a __next__() method).
The state inside this iterator is kept entirely in the prev and curr instance variables, which carry over between calls. Every call to next() does two important things:
- Modify its state for the next next() call;
- Produce the result for the current call.
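The fib iterator above never stops producing values, but most iterators are finite. A finite iterator signals that it is done by raising StopIteration, as this quick illustration with a plain list iterator shows:
>>> it = iter([1, 2])
>>> next(it)
1
>>> next(it)
2
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration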
Central idea: a lazy factory
From the outside, the iterator is like a lazy factory that is idle until you ask it for a value, which is when it starts to buzz and produce a single value, after which it turns idle again.
Iterables
"An iterable is any object, not necessarily a data structure that can return an iterator". Its main purpose is to return all of its elements. Iterables can represent finite as well as infinite source of data. An iterable will directly or indirectly define two methods: the __iter__() method, which must return the iterator object and the __next()__ method with the help of the iterator it calls.
Examples are open files, open sockets, etc. Where containers are typically finite, an iterable may just as well represent an infinite source of data.
An iterable is any object, not necessarily a data structure, that can return an iterator (with the purpose of returning all of its elements). That sounds a bit awkward, but there is an important difference between an iterable and an iterator. Take a look at this example:
>>> x = [1, 2, 3]
>>> y = iter(x)
>>> z = iter(x)
>>> next(y)
1
>>> next(y)
2
>>> next(z)
1
>>> type(x)
<class 'list'>
>>> type(y)
<class 'list_iterator'>
Here, x is the iterable, while y and z are two individual instances of an iterator, producing values from the iterable x. Both y and z hold state, as you can see from the example. In this example, x is a data structure (a list), but that is not a requirement.
NOTE:
Often, for pragmatic reasons, iterable classes will implement both __iter__() and __next__() in the same class, and have __iter__() return self, which makes the class both an iterable and its own iterator. It is perfectly fine to return a different object as the iterator, though.
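As a minimal sketch of that alternative design (the class names here are invented for illustration), an iterable can hand out a fresh, independent iterator object from every __iter__() call:

class Countdown:
    """An iterable; each iter() call returns a fresh iterator."""
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)

class CountdownIterator:
    """A separate iterator object holding the iteration state."""
    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

c = Countdown(3)
print(list(c))  # [3, 2, 1]
print(list(c))  # [3, 2, 1] again - every iter() call starts over

Because the state lives in the iterator, not the iterable, the same Countdown object can be looped over any number of times.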
Finally, when you write:
x = [1, 2, 3]
for elem in x:
...
This is what actually happens:
When you disassemble this Python code, you can see an explicit call to GET_ITER, which is essentially like invoking iter(x). The FOR_ITER instruction then does the equivalent of calling next() repeatedly to get every element; this does not show as separate bytecode instructions because it's optimized for speed in the interpreter.
>>> import dis
>>> x = [1, 2, 3]
>>> dis.dis('for _ in x: pass')
1 0 SETUP_LOOP 14 (to 17)
3 LOAD_NAME 0 (x)
6 GET_ITER
>> 7 FOR_ITER 6 (to 16)
10 STORE_NAME 1 (_)
13 JUMP_ABSOLUTE 7
>> 16 POP_BLOCK
>> 17 LOAD_CONST 0 (None)
20 RETURN_VALUE
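Put differently, the for loop above behaves roughly like this manual equivalent (a sketch of what the interpreter does, not literal CPython source):

x = [1, 2, 3]
it = iter(x)              # GET_ITER: obtain an iterator from the iterable
while True:
    try:
        elem = next(it)   # FOR_ITER: fetch the next element
    except StopIteration:
        break             # iterator exhausted; leave the loop
    # the loop body would run here, with elem bound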
There is a major difference between what an iterable is and what an iterator is. Here is another example:
a_set = {1, 2, 3}
b_iterator = iter(a_set)
next(b_iterator)
print(type(a_set))
print(type(b_iterator))
Output:
<type 'set'>
<type 'setiterator'>
In the example, a_set is an iterable (a set) whereas b_iterator is an iterator. They are both different data types in Python.
Wondering how an iterator works internally to produce the next value when asked? Let's build an iterator that returns a series of numbers:

class Series(object):
    def __init__(self, low, high):
        self.current = low
        self.high = high

    def __iter__(self):
        return self

    def __next__(self):
        if self.current > self.high:
            raise StopIteration
        else:
            self.current += 1
            return self.current - 1

n_list = Series(1, 10)
print(list(n_list))
__iter__() returns the iterator object itself and the __next__() method returns the next value from the iterator. If there are no more items to return, it raises a StopIteration exception.
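One subtlety worth illustrating: because Series is its own iterator, a single instance is exhausted after one pass. A second list() call on the same object returns an empty list, since self.current is already past self.high:

n_list = Series(1, 3)
print(list(n_list))  # [1, 2, 3]
print(list(n_list))  # [] - the iterator's state is already used up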
Containers
Containers are data structures holding elements, and that support membership tests. They are data structures that live in memory, and typically hold all their values in memory, too. In Python, some well known examples are:
- list, deque, …
- set, frozenset, …
- dict, defaultdict, OrderedDict, Counter, …
- tuple, namedtuple, …
- str
Containers are easy to grasp, because you can think of them as real life containers: a box, a cupboard, a house, a ship, etc.
Technically, an object is a container when it can be asked whether it contains a certain element. You can perform such membership tests on lists, sets, or tuples alike:
>>> assert 1 in [1, 2, 3] # lists
>>> assert 4 not in [1, 2, 3]
>>> assert 1 in {1, 2, 3} # sets
>>> assert 4 not in {1, 2, 3}
>>> assert 1 in (1, 2, 3) # tuples
>>> assert 4 not in (1, 2, 3)
Dict membership will check the keys:
>>> d = {1: 'foo', 2: 'bar', 3: 'qux'}
>>> assert 1 in d
>>> assert 4 not in d
>>> assert 'foo' not in d # 'foo' is not a _key_ in the dict
Finally, you can ask a string if it "contains" a substring:
>>> s = 'foobar'
>>> assert 'b' in s
>>> assert 'x' not in s
>>> assert 'foo' in s # a string "contains" all its substrings
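Membership testing is driven by the __contains__() method, so your own classes can support the in operator too. Here is a small sketch (the Inventory class is invented for illustration):

class Inventory:
    def __init__(self, items):
        self.items = set(items)

    def __contains__(self, item):
        # called when you write `item in inventory`
        return item in self.items

inv = Inventory(['sword', 'shield'])
assert 'sword' in inv
assert 'bow' not in inv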
Containers are iterables: lists, sets, dictionaries, tuples and strings are all containers. But there are other iterables as well, like open files and open sockets. The same membership tests, written as a small script:

if 1 in [1, 2, 3]:
    print('List')
if 4 not in {1, 2, 3}:
    print('Set')
if 'apple' in 'pineapple':
    print('String')  # a string contains all its substrings

Output:
List
Set
String
Generators
The generator is the elegant brother of the iterator: it lets you write iterators in a much easier syntax, without writing classes with __iter__() and __next__() methods.
Let's be explicit:
- Any generator also is an iterator (not vice versa!);
- Any generator, therefore, is a factory that lazily produces values.
Here is the same Fibonacci sequence factory, but written as a generator:
>>> def fib():
... prev, curr = 0, 1
... while True:
... yield curr
... prev, curr = curr, prev + curr
...
>>> f = fib()
>>> list(islice(f, 0, 10))
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
Here is another example, the generator equivalent of the Series class above:

def series_generator(low, high):
    while low <= high:
        yield low
        low += 1

n_list = []
for num in series_generator(1, 10):
    n_list.append(num)
print(n_list)
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The magic word with generators is yield. There is no return statement in the function series_generator; the return value of calling the function is a generator. Inside the while loop, when execution reaches the yield statement, the value of low is handed to the caller and the generator's state is suspended. On the next call to next(), the generator resumes right after the yield, increases low by one, continues with the while loop, and comes to the yield statement again.
yield essentially replaces a function's return statement, but it provides a result to the caller without destroying the local variables, so the next resumption can work with those same values again. Unlike a normal function, which starts with a fresh set of variables on every call, a generator resumes execution where it left off.
Tip: the lazy factory is the concept behind both generators and iterators: they are idle until you ask them for a value, and only then do they get to work and produce a single value, after which they turn idle again. This is a good approach for working with lots of data. If you do not need all the data at once, there is no reason to load it all into memory; a generator or iterator can hand you one piece at a time.
Types of Generators
Generators can be of two different types in Python: generator functions and generator expressions.
A generator function is a function in whose body the keyword yield appears. You have already seen an example of this with the series_generator function: the mere appearance of yield is enough to make a function a generator function.
Generator expressions are the generator equivalent of a list comprehension, and they can be especially useful in simple cases. Just like a list comprehension returns a list, a generator expression returns a generator.
Let's see what this means:
squares = (x * x for x in range(1, 10))
print(type(squares))
print(list(squares))

Output:
<class 'generator'>
[1, 4, 9, 16, 25, 36, 49, 64, 81]
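One caveat worth noting: unlike a list, a generator can only be consumed once. After list(squares) has drained it, a second pass yields nothing:

squares = (x * x for x in range(1, 10))
print(list(squares))  # [1, 4, 9, 16, 25, 36, 49, 64, 81]
print(list(squares))  # [] - the generator is already exhausted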
Generators are powerful: they are more memory- and CPU-efficient, and they let you write code with fewer intermediate variables and data structures.
Where can you insert generators in your code? Tip: find places in your code where you do the following:
def some_function():
    result = []
    for ... in ...:
        result.append(x)
    return result

And replace it with:

def iterate_over():
    for ... in ...:
        yield x
Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop.
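As a concrete (made-up) instance of that refactoring, here is a function that squares numbers, first written with an accumulator list and then rewritten as a generator:

def squares_list(n):
    result = []
    for x in range(n):
        result.append(x * x)
    return result

def squares_gen(n):
    for x in range(n):
        yield x * x

print(squares_list(5))        # [0, 1, 4, 9, 16]
print(list(squares_gen(5)))   # [0, 1, 4, 9, 16] - same values, produced lazily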
Where to Use Generators?
Example 1: Reading Large Files
A common use case of generators is to work with data streams or large files, like CSV files. These text files separate data into columns by using commas. This format is a common way to share data. Now, what if you want to count the number of rows in a CSV file?
csv_gen = csv_reader("some_csv.txt")
row_count = 0
for row in csv_gen:
    row_count += 1
print(f"Row count is {row_count}")
- Looking at this example, you might expect csv_gen to be a list. To populate this list, csv_reader() opens a file and loads its contents into csv_gen. Then, the program iterates over the list and increments row_count for each row.
- This is a reasonable explanation, but would this design still work if the file is very large? What if the file is larger than the memory you have available? To answer this question, let’s assume that csv_reader() just opens the file and reads it into an array:
def csv_reader(file_name):
    file = open(file_name)
    result = file.read().split("\n")
    return result
- This function opens a given file and uses file.read() along with .split() to add each line as a separate element to a list. If you were to use this version of csv_reader() in the row counting code block you saw further up, then you’d get the following output:
Traceback (most recent call last):
File "ex1_naive.py", line 22, in <module>
main()
File "ex1_naive.py", line 13, in main
csv_gen = csv_reader("file.txt")
File "ex1_naive.py", line 6, in csv_reader
result = file.read().split("\n")
MemoryError
- In this case, open() returns a file object that you can lazily iterate through line by line. However, file.read().split() loads everything into memory at once, causing the MemoryError.
- Before that happens, you’ll probably notice your computer slow to a crawl. You might even need to kill the program with a KeyboardInterrupt. So, how can you handle these huge data files? Take a look at a new definition of csv_reader():
def csv_reader(file_name):
    for row in open(file_name, "r"):
        yield row
- In this version, you open the file, iterate through it, and yield a row. This code should produce the following output, with no memory errors:
Output:
Row count is 64186394
- What’s happening here? Well, you’ve essentially turned csv_reader() into a generator function. This version opens a file, loops through each line, and yields each row, instead of returning it.
- You can also define a generator expression (also called a generator comprehension), which has a very similar syntax to list comprehensions. In this way, you can use the generator without calling a function:
csv_gen = (row for row in open(file_name))
- This is a more succinct way to create the generator csv_gen. You'll learn more about the Python yield statement soon. For now, just remember this key difference, illustrated in the sketch after this list:
- Using yield will result in a generator object.
- Using return will result in the first line of the file only.
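To make that difference concrete, here is a sketch (the function names are invented) of the return-based variant next to the yield-based one; return exits the function on the first row, while yield suspends and resumes:

def first_row_only(file_name):
    for row in open(file_name, "r"):
        return row    # exits immediately - caller gets only the first row

def all_rows(file_name):
    for row in open(file_name, "r"):
        yield row     # suspends here and resumes on the next next() call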
Example 2: Generating an Infinite Sequence
- In Python, to get a finite sequence, you call range() and evaluate it in a list context:
>>> a = range(5)
>>> list(a)
[0, 1, 2, 3, 4]
- Generating an infinite sequence, however, will require the use of a generator, since your computer memory is finite:
def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1
- This code block is short and sweet. First, you initialize the variable num and start an infinite loop. Then, you immediately yield num so that you can capture the initial state. This mimics the action of range().
- After yield, you increment num by 1. If you try this with a for loop, then you’ll see that it really does seem infinite:
>>> for i in infinite_sequence():
... print(i, end=" ")
...
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39 40 41 42
[...]
6157818 6157819 6157820 6157821 6157822 6157823 6157824 6157825 6157826 6157827
6157828 6157829 6157830 6157831 6157832 6157833 6157834 6157835 6157836 6157837
6157838 6157839 6157840 6157841 6157842
KeyboardInterrupt
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
- The program will continue to execute until you stop it manually.
- Instead of using a for loop, you can also call next() on the generator object directly. This is especially useful for testing a generator in the console:
>>> gen = infinite_sequence()
>>> next(gen)
0
>>> next(gen)
1
>>> next(gen)
2
>>> next(gen)
3
- Here, you have a generator called gen, which you manually iterate over by repeatedly calling next(). This works as a great sanity check to make sure your generators are producing the output you expect.
Note: When you use next(), Python calls .__next__() on the object you pass in as a parameter. There are some special effects that this parameterization allows, but most of them go beyond the scope of this article. Experiment with changing the parameter you pass to next() and see what happens!
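One such effect is worth knowing right away: next() accepts an optional second argument, a default that is returned instead of raising StopIteration once the iterator is exhausted:
>>> it = iter([1])
>>> next(it, 'done')
1
>>> next(it, 'done')
'done'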
Example 3: Detecting Palindromes
- You can use infinite sequences in many ways, but one practical use for them is in building palindrome detectors. A palindrome detector will locate all sequences of letters or numbers that are palindromes. These are words or numbers that are read the same forward and backward, like 121. First, define your numeric palindrome detector:
def is_palindrome(num):
    # Skip single-digit inputs
    if num // 10 == 0:
        return False
    temp = num
    reversed_num = 0
    while temp != 0:
        reversed_num = (reversed_num * 10) + (temp % 10)
        temp = temp // 10
    if num == reversed_num:
        return num
    else:
        return False
- Don’t worry too much about understanding the underlying math in this code. Just note that the function takes an input number, reverses it, and checks to see if the reversed number is the same as the original. Now you can use your infinite sequence generator to get a running list of all numeric palindromes:
>>> for i in infinite_sequence():
... pal = is_palindrome(i)
... if pal:
... print(pal)
...
11
22
33
[...]
99799
99899
99999
100001
101101
102201
KeyboardInterrupt
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "<stdin>", line 5, in is_palindrome
- In this case, the only numbers that are printed to the console are those that read the same forward and backward.
Note: In practice, you’re unlikely to write your own infinite sequence generator. The itertools module provides a very efficient infinite sequence generator with itertools.count().
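For instance, the palindrome loop above could use count() together with islice() to grab just the first few palindromes (a drop-in sketch reusing the is_palindrome() function from above):

from itertools import count, islice

palindromes = (i for i in count() if is_palindrome(i))
print(list(islice(palindromes, 5)))  # [11, 22, 33, 44, 55]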
Understanding Generators
Generator functions look and act just like regular functions, but with one defining characteristic. Generator functions use the Python yield keyword instead of return. Recall the generator function you wrote earlier:
def infinite_sequence():
    num = 0
    while True:
        yield num
        num += 1
This looks like a typical function definition, except for the Python yield statement and the code that follows it. yield indicates where a value is sent back to the caller, but unlike return, you don’t exit the function afterward.
Instead, the state of the function is remembered. That way, when next() is called on a generator object (either explicitly or implicitly within a for loop), the previously yielded variable num is incremented, and then yielded again. Since generator functions look like other functions and act very similarly to them, you can assume that generator expressions are very similar to other comprehensions available in Python.
Building Generators With Generator Expressions
Like list comprehensions, generator expressions allow you to quickly create a generator object in just a few lines of code. They’re also useful in the same cases where list comprehensions are used, with an added benefit: you can create them without building and holding the entire object in memory before iteration. In other words, you’ll have no memory penalty when you use generator expressions. Take this example of squaring some numbers:
>>> nums_squared_lc = [num**2 for num in range(5)]
>>> nums_squared_gc = (num**2 for num in range(5))
Both nums_squared_lc and nums_squared_gc look basically the same, but there’s one key difference. Can you spot it? Take a look at what happens when you inspect each of these objects:
>>> nums_squared_lc
[0, 1, 4, 9, 16]
>>> nums_squared_gc
<generator object <genexpr> at 0x107fbbc78>
The first object used brackets to build a list, while the second created a generator expression by using parentheses. The output confirms that you’ve created a generator object and that it is distinct from a list.
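You can observe the memory difference directly with sys.getsizeof(); the exact numbers vary by Python version and platform, but the list grows with its contents while the generator object stays tiny:
>>> import sys
>>> sys.getsizeof([num**2 for num in range(10000)])
87624
>>> sys.getsizeof((num**2 for num in range(10000)))
120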