[python] Data Preprocessing with Pandas 2
The efficient way to loop through data frames with Pandas
First, create a DataFrame.
    import pandas as pd

    member = ['라이언', '무지', '콘', '프로도', '제이지', '네오', '어피치']
    weight = ['30', '25', '5', '20', '25', '15', '20']
    age = ['5', '4', '10', '3', '7', '6', '11']

    kakao_friends = pd.DataFrame()
    kakao_friends['member'] = member
    kakao_friends['weight'] = weight
    kakao_friends['age'] = age
| | member | weight | age |
|---|---|---|---|
| 0 | 라이언 | 30 | 5 |
| 1 | 무지 | 25 | 4 |
| 2 | 콘 | 5 | 10 |
| 3 | 프로도 | 20 | 3 |
| 4 | 제이지 | 25 | 7 |
| 5 | 네오 | 15 | 6 |
| 6 | 어피치 | 20 | 11 |
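Note that `weight` and `age` were created from lists of strings, so the values in the table are text, not numbers. The sketch below assumes these columns are meant to be integers and converts them with `astype` before doing arithmetic; the names `numeric`, `total_weight`, and `oldest` are illustrative and do not appear in the original post.

```python
import pandas as pd

# Same data as above: weight and age are lists of strings, not numbers
kakao_friends = pd.DataFrame({
    'member': ['라이언', '무지', '콘', '프로도', '제이지', '네오', '어피치'],
    'weight': ['30', '25', '5', '20', '25', '15', '20'],
    'age': ['5', '4', '10', '3', '7', '6', '11'],
})

# Assumption: the two numeric-looking columns are meant to be integers
numeric = kakao_friends.astype({'weight': int, 'age': int})

# Arithmetic and comparisons now behave numerically
total_weight = numeric['weight'].sum()
oldest = numeric.loc[numeric['age'].idxmax(), 'member']
print(total_weight, oldest)
```

The rest of this post keeps the original string columns, so the printed output below is unaffected by this conversion.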
1. iterrows
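The first variable, `idx`, receives the row's index label. The second variable, `member`, receives the entire row as a Series (all columns, not only the `member` column), and each row is printed in turn.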
    # Iterate over the rows of kakao_friends
    for idx, member in kakao_friends.iterrows():
        print(idx, member)
    -------------------------
    0 member 라이언
    weight 30
    age 5
    Name: 0, dtype: object
    1 member 무지
    weight 25
    age 4
    Name: 1, dtype: object
    2 member 콘
    weight 5
    age 10
    Name: 2, dtype: object
    3 member 프로도
    weight 20
    age 3
    Name: 3, dtype: object
    4 member 제이지
    weight 25
    age 7
    Name: 4, dtype: object
    5 member 네오
    weight 15
    age 6
    Name: 5, dtype: object
    6 member 어피치
    weight 20
    age 11
    Name: 6, dtype: object
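The result is the same if you write the loop as shown below. The name of the first variable does not matter: it always receives the index label, and the second variable always receives the row.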
    # 1
    for member, row in kakao_friends.iterrows():
        print(member, row)

    # 2
    for weight, row in kakao_friends.iterrows():
        print(weight, row)

    # 3
    for age, row in kakao_friends.iterrows():
        print(age, row)
    # 4: You can also print only specific columns of each row.
    for index, row in kakao_friends.iterrows():
        print(row["member"], row["age"])
    ----------------
    라이언 5
    무지 4
    콘 10
    프로도 3
    제이지 7
    네오 6
    어피치 11
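To see why the variable name in the loops above makes no difference, it may help to use a frame whose index is not the default 0, 1, 2. The following is a minimal sketch with a hypothetical two-row frame and made-up index labels `'a'` and `'b'`; the loop variable is deliberately named `member`, yet it still receives the index label.

```python
import pandas as pd

# Hypothetical frame with string index labels instead of 0, 1, 2, ...
df = pd.DataFrame(
    {'member': ['라이언', '무지'], 'age': ['5', '4']},
    index=['a', 'b'],
)

seen = []
# The first variable is named "member", but iterrows still puts the index label in it
for member, row in df.iterrows():
    seen.append((member, row['member']))

print(seen)
```

Each pair holds the index label first and the actual `member` value second, which is why a neutral name such as `idx` or `index` is usually less confusing.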
2. itertuples
    for row in kakao_friends.itertuples(index=True):
        # getattr(row, "member") is equivalent to row.member
        print(getattr(row, "member"), getattr(row, "age"))

    -----------
    라이언 5
    무지 4
    콘 10
    프로도 3
    제이지 7
    네오 6
    어피치 11
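The rows yielded by `itertuples` are namedtuples, so the fields can also be read as plain attributes. The sketch below uses a shortened three-row copy of the frame and shows the `index` and `name` parameters of `itertuples`; the variable names `first` and `pairs` are illustrative.

```python
import pandas as pd

kakao_friends = pd.DataFrame({
    'member': ['라이언', '무지', '콘'],
    'age': ['5', '4', '10'],
})

# Default: each row is a namedtuple called "Pandas" whose first field is Index
first = next(kakao_friends.itertuples())
print(first.Index, first.member, first.age)

# index=False drops the Index field; name=None yields plain tuples
pairs = list(kakao_friends.itertuples(index=False, name=None))
print(pairs)
```

Plain tuples are convenient when the values are unpacked immediately, for example `for member, age in kakao_friends.itertuples(index=False, name=None)`.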
- `itertuples()` is supposed to be faster than `iterrows()`.
- There are caveats to keep in mind when using these iterators. According to the docs (pandas 0.21.1 at the moment):
  - iterrows: `dtype` might not match from row to row. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).
  - iterrows: Do not modify rows. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect. Use DataFrame.apply() instead: `new_df = df.apply(lambda x: x * 2)`
  - itertuples: The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
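The caveats quoted above can be checked with a small experiment. This is a sketch under stated assumptions, not part of the original post: the frames `df` and `odd` and their column names (`qty`, `ratio`, `my col`, `_hidden`, `ok`) are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: one integer column and one float column
df = pd.DataFrame({'qty': [1, 2], 'ratio': [0.5, 1.5]})

# iterrows packs each row into a single Series, so the integer is upcast to float
_, first_row = next(df.iterrows())
print(first_row.dtype)

# itertuples keeps each value's per-column type
first_tuple = next(df.itertuples(index=False))
print(type(first_tuple.qty).__name__, type(first_tuple.ratio).__name__)

# Writing to the row yielded by iterrows changes a copy here, not df
for _, row in df.iterrows():
    row['ratio'] = 0.0
print(df['ratio'].tolist())

# The alternative quoted from the docs: build a new frame with apply
doubled = df.apply(lambda x: x * 2)
print(doubled)

# Column names that are not valid identifiers become positional names
odd = pd.DataFrame({'my col': [1], '_hidden': [2], 'ok': [3]})
fields = next(odd.itertuples(index=False))._fields
print(fields)
```

In this run the original `ratio` column is unchanged after the loop, which is the "writing to it will have no effect" case the docs warn about.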
3. Using zip and enumerate
    for index, (member, age) in enumerate(zip(kakao_friends['member'], kakao_friends['age'])):
        print(index, member, age)
    -----------------
    0 라이언 5
    1 무지 4
    2 콘 10
    3 프로도 3
    4 제이지 7
    5 네오 6
    6 어피치 11
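One detail worth keeping in mind: `enumerate` counts positions, which only coincides with the index when the frame uses the default 0, 1, 2, ... labels. The sketch below assumes a hypothetical frame with index labels 10, 20, 30 and compares `enumerate` with zipping `df.index` directly.

```python
import pandas as pd

df = pd.DataFrame(
    {'member': ['라이언', '무지', '콘'], 'age': ['5', '4', '10']},
    index=[10, 20, 30],  # hypothetical non-default index labels
)

# enumerate counts positions: 0, 1, 2
by_position = [(i, m) for i, (m, a) in enumerate(zip(df['member'], df['age']))]

# zipping df.index yields the actual labels: 10, 20, 30
by_label = [(idx, m) for idx, m, a in zip(df.index, df['member'], df['age'])]

print(by_position)
print(by_label)
```

If the real index labels are needed inside the loop, zipping `df.index` alongside the columns is one way to get them.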
An approach that is not recommended
    for i in range(len(kakao_friends['member'])):
        print(kakao_friends['member'][i])
    -------------------
    라이언
    무지
    콘
    프로도
    제이지
    네오
    어피치
Timing both versions with `%%timeit` in a notebook shows that indexing by position is slower than iterating over the column directly:

    %%timeit
    for i in range(len(kakao_friends['member'])):
        print(kakao_friends['member'][i])
    843 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    %%timeit
    for i in kakao_friends['member']:
        print(i)
    742 µs ± 8.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
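The timings above include the cost of `print`, and `%%timeit` only works inside IPython or Jupyter. As a rough sketch of how the three iteration styles from this post could be compared in a plain script, the code below uses the standard `timeit` module on a hypothetical 1,000-row frame. The frame, the function names, and the repeat count are assumptions, and the numbers will differ from machine to machine.

```python
import timeit

import pandas as pd

# Hypothetical larger frame so the differences are measurable
df = pd.DataFrame({
    'member': [f'm{i}' for i in range(1000)],
    'age': list(range(1000)),
})


def with_iterrows():
    # One Series is built per row
    return [(row['member'], row['age']) for _, row in df.iterrows()]


def with_itertuples():
    # One namedtuple is built per row
    return [(row.member, row.age) for row in df.itertuples(index=False)]


def with_zip():
    # Columns are iterated in parallel, no per-row object beyond the tuple
    return list(zip(df['member'], df['age']))


for fn in (with_iterrows, with_itertuples, with_zip):
    seconds = timeit.timeit(fn, number=10)
    print(f'{fn.__name__}: {seconds / 10 * 1e3:.2f} ms per loop')
```

All three functions return the same list of pairs, so the comparison measures only the iteration overhead.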
Crude looping in Pandas, or That Thing You Should Never Ever Do
To start, let’s quickly review the fundamentals of Pandas data structures. The basic Pandas structures come in two flavors: a DataFrame and a Series. A DataFrame is a two-dimensional array with labeled axes. In other words, a DataFrame is a matrix of rows and columns that have labels: column names for columns, and index labels for rows. A single column or row in a Pandas DataFrame is a Pandas Series, a one-dimensional array with axis labels.
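A quick way to confirm the relationship described above is to pull one column and one row out of a small frame and inspect their types. This is a minimal sketch with a made-up two-row frame.

```python
import pandas as pd

df = pd.DataFrame({'member': ['라이언', '무지'], 'age': [5, 4]})

column = df['member']  # one column
row = df.loc[0]        # one row, selected by its index label

print(type(column).__name__, type(row).__name__)
print(list(column.index))  # a column is labeled by the row index
print(list(row.index))     # a row is labeled by the column names
```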
Just about every Pandas beginner I’ve ever worked with (including yours truly) has, at some point, attempted to apply a custom function by looping over DataFrame rows one at a time. The advantage of this approach is that it is consistent with the way one would interact with other iterable Python objects; for example, the way one might loop through a list or a tuple. Conversely, the downside is that a crude loop, in Pandas, is the slowest way to get anything done. Unlike the approaches we will discuss below, crude looping in Pandas does not take advantage of any built-in optimizations, making it extremely inefficient (and often much less readable) by comparison.
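To make the contrast concrete, here is a small sketch of a crude loop next to a whole-column (vectorized) operation. The calculation, weight divided by age, is made up purely for illustration, and whole-column arithmetic is only one commonly used alternative; this sketch does not claim to reproduce the specific approaches the text goes on to discuss.

```python
import pandas as pd

df = pd.DataFrame({'weight': [30, 25, 5], 'age': [5, 4, 10]})

# Crude loop: index into the frame one row at a time
looped = []
for i in range(len(df)):
    looped.append(df['weight'].iloc[i] / df['age'].iloc[i])

# Vectorized: a single operation over whole columns
vectorized = df['weight'] / df['age']

print(looped)
print(vectorized.tolist())
```

Both versions produce the same values; the vectorized form hands the whole computation to Pandas instead of running Python code once per row.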