[python] Data Preprocessing with Pandas 2

The efficient way to loop through data frames with Pandas

First, create a DataFrame.

import pandas as pd

# Note: weight and age are stored as strings, not numbers
member = ['라이언', '무지', '콘', '프로도', '제이지', '네오', '어피치']
weight = ['30', '25', '5', '20', '25', '15', '20']
age = ['5', '4', '10', '3', '7', '6', '11']

# Build the DataFrame one column at a time
kakao_friends = pd.DataFrame()
kakao_friends['member'] = member
kakao_friends['weight'] = weight
kakao_friends['age'] = age
-------------------------
  member weight age
0 라이언 30 5
1 무지 25 4
2 콘 5 10
3 프로도 20 3
4 제이지 25 7
5 네오 15 6
6 어피치 20 11

1. iterrows

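The first variable, idx, receives the index label. The second variable (named member here) receives the entire row as a Series, not just the member column. The loop visits the rows one at a time and prints each one.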

# Iterate over the rows of kakao_friends
for idx, member in kakao_friends.iterrows():
    print(idx, member)
-------------------------
0 member    라이언
weight     30
age         5
Name: 0, dtype: object
1 member    무지
weight    25
age        4
Name: 1, dtype: object
2 member     콘
weight     5
age       10
Name: 2, dtype: object
3 member    프로도
weight     20
age         3
Name: 3, dtype: object
4 member    제이지
weight     25
age         7
Name: 4, dtype: object
5 member    네오
weight    15
age        6
Name: 5, dtype: object
6 member    어피치
weight     20
age        11
Name: 6, dtype: object
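
To make the shape of each iteration concrete, here is a minimal sketch showing that iterrows() yields an (index, Series) pair on every step. The small frame `df` is a hypothetical two-row subset of the data above, used only for illustration.

```python
import pandas as pd

# Hypothetical two-row subset of the data above, for illustration only
df = pd.DataFrame({
    'member': ['라이언', '무지'],
    'weight': ['30', '25'],
    'age': ['5', '4'],
})

# Each item produced by iterrows() is a pair: (index label, row as a Series)
pairs = list(df.iterrows())
first_idx, first_row = pairs[0]

print(type(first_row).__name__)
print(first_idx, first_row['member'], first_row['age'])

# The Series is indexed by the column names of the DataFrame
print(list(first_row.index))
```

Because the row arrives as a Series keyed by column name, a value is read with `first_row['member']` no matter what the loop variable is called.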

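The result is the same with any of the following. The names of the loop variables are arbitrary: the first always receives the index and the second always receives the row.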

# 1
for member, row in kakao_friends.iterrows():
    print(member, row)

# 2
for weight, row in kakao_friends.iterrows():
    print(weight, row)

# 3
for age, row in kakao_friends.iterrows():
    print(age, row)

# 4: You can also print selected columns of each row.
for index, row in kakao_friends.iterrows():
    print(row["member"], row["age"])
----------------
라이언 5
무지 4
콘 10
프로도 3
제이지 7
네오 6
어피치 11

2. itertuples

for row in kakao_friends.itertuples(index=True):
    print(getattr(row, "member"), getattr(row, "age"))

-----------
라이언 5
무지 4
콘 10
프로도 3
제이지 7
네오 6
어피치 11
  • itertuples() is generally faster than iterrows().

  • There are some caveats to keep in mind when using these iterators. The official documentation states the following.

    According to the docs (pandas 0.21.1 at the moment):

    • iterrows: dtype might not match from row to row

    Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).

    • iterrows: Do not modify rows

    You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

    Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2)
    
    • itertuples:

    The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
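
The caveats quoted above can be observed directly. The following is a minimal sketch, not taken from the original post: the frame `df` and its column names (`qty`, `ratio`, `body weight`) are hypothetical and chosen only to trigger each behavior.

```python
import pandas as pd

# Hypothetical frame for illustration: an int column, a float column,
# and a column whose name is not a valid Python identifier
df = pd.DataFrame({'qty': [1, 2], 'ratio': [0.5, 1.5], 'body weight': [30.0, 25.0]})

# iterrows: the row is a single Series, so the int value is upcast to float
_, row = next(df.iterrows())
print(type(row['qty']))

# itertuples: each value keeps the type of its own column
tup = next(df.itertuples(index=True))
print(type(tup.qty))

# 'body weight' is not a valid identifier, so it gets a positional name
print(tup._fields)

# Writing to the row from iterrows does not change the DataFrame here,
# because the row is a copy of the data
for _, r in df.iterrows():
    r['qty'] = 99
print(df['qty'].tolist())
```

When a column name is renamed to a positional name, the value is still reachable by position (for example `tup[3]`), which is one reason to keep column names valid identifiers if you plan to use itertuples().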

3. Using zip and enumerate

for index, (member, age) in enumerate(zip(kakao_friends['member'], kakao_friends['age'])):
    print(index, member, age)
-----------------
0 라이언 5
1 무지 4
2 콘 10
3 프로도 3
4 제이지 7
5 네오 6
6 어피치 11

Not recommended

# Look up each value by position inside a range(len(...)) loop
for i in range(len(kakao_friends['member'])):
    print(kakao_friends['member'][i])
-------------------
라이언
무지
콘
프로도
제이지
네오
어피치

%%timeit
# Positional lookup on every iteration
for i in range(len(kakao_friends['member'])):
    print(kakao_friends['member'][i])
-------------------
843 µs ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
# Iterating over the column directly
for i in kakao_friends['member']:
    print(i)
-------------------
742 µs ± 8.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
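
The %%timeit cells above only run inside IPython or Jupyter, and the print calls add overhead of their own to both measurements. As a rough sketch for a plain Python script, the standard timeit module can compare the same loops without printing. The frame `df`, its size, and the helper function names are assumptions made for illustration, and the timings will differ from machine to machine.

```python
import timeit

import pandas as pd

# Hypothetical frame, larger than the example above so differences are visible
df = pd.DataFrame({'member': ['a', 'b', 'c'] * 100, 'age': [1, 2, 3] * 100})


def by_position():
    # Look up every value through the Series on each iteration
    return [df['member'][i] for i in range(len(df['member']))]


def direct():
    # Iterate over the column itself
    return [m for m in df['member']]


def with_itertuples():
    # Iterate over rows as namedtuples
    return [row.member for row in df.itertuples(index=False)]


for fn in (by_position, direct, with_itertuples):
    seconds = timeit.timeit(fn, number=20)
    print(f'{fn.__name__}: {seconds:.4f} s for 20 runs')
```

All three functions build the same list, so any difference in the reported times comes from how the values are reached rather than from what is computed.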

Crude looping in Pandas, or That Thing You Should Never Ever Do

To start, let’s quickly review the fundamentals of Pandas data structures. The basic Pandas structures come in two flavors: a DataFrame and a Series. A DataFrame is a two-dimensional array with labeled axes. In other words, a DataFrame is a matrix of rows and columns that have labels: column names for columns, and index labels for rows. A single column or row in a Pandas DataFrame is a Pandas Series, a one-dimensional array with axis labels.

Just about every Pandas beginner I’ve ever worked with (including yours truly) has, at some point, attempted to apply a custom function by looping over DataFrame rows one at a time. The advantage of this approach is that it is consistent with the way one would interact with other iterable Python objects; for example, the way one might loop through a list or a tuple. Conversely, the downside is that a crude loop, in Pandas, is the slowest way to get anything done. Unlike the approaches we will discuss below, crude looping in Pandas does not take advantage of any built-in optimizations, making it extremely inefficient (and often much less readable) by comparison.
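
As a small illustration of the contrast described above, the sketch below computes the same quantity twice: once with a crude row-by-row loop and once with a single column-wise operation. The frame `df` and the weight-per-age calculation are made up for this example. They are not from the original text.

```python
import pandas as pd

# Hypothetical numeric frame for illustration
df = pd.DataFrame({'weight': [30, 25, 5], 'age': [5, 4, 10]})

# Crude loop: index into the frame one row at a time
ratios_loop = []
for i in range(len(df)):
    ratios_loop.append(df['weight'].iloc[i] / df['age'].iloc[i])

# Column-wise alternative: one operation over whole columns
ratios_vec = df['weight'] / df['age']

print(ratios_loop)
print(ratios_vec.tolist())
```

Both versions produce the same values. The loop pays the cost of a Python-level lookup per row, while the column-wise version hands the whole calculation to Pandas at once.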