[Kaggle] Frequently Used Functions on Kaggle

HTG 2022. 12. 27. 11:44

EDA

<DataFrame>.describe()

Reports summary statistics (count, mean, std, min, max, and quartiles) for the numeric features.

<DataFrame>.describe(include="all")

Summarizes every feature, not only the numeric ones.
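
To make the difference between the two calls concrete, here is a minimal sketch. The toy DataFrame and its column names (`Age`, `Sex`) are made up for illustration and do not come from any particular dataset.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# Default: only numeric columns are summarized.
numeric_summary = df.describe()

# include="all": non-numeric columns are added, with rows such as
# unique, top, and freq that apply only to them.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the second table, statistics that do not apply to a column (for example, the mean of `Sex`) show up as NaN.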

<DataFrame>.info()

Shows the non-null count, data type, and similar details for each feature.

<DataFrame>[<Col1>].value_counts()

Shows how often each value occurs in Col1.
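
A small sketch of what this looks like in practice. The `Embarked` column and its values are hypothetical. Note that nulls are left out of the counts unless you pass `dropna=False`.

```python
import pandas as pd

# Hypothetical toy column with one missing value.
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q", "S", None]})

# Frequencies sorted from most to least common; nulls are excluded by default.
counts = df["Embarked"].value_counts()

# Pass dropna=False to count the missing values as well.
counts_with_null = df["Embarked"].value_counts(dropna=False)

print(counts)
print(counts_with_null)
```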

<DataFrame>.isnull()

Marks the cells that are null. The result is a DataFrame of the same shape as the input, so it is too large to read directly; append sum() to get the number of nulls in each column instead.
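
A minimal sketch of the isnull().sum() pattern, using a made-up DataFrame (the column names are assumptions for illustration). Taking mean() instead of sum() gives the fraction of missing values, which is a small addition of mine and not part of the original note.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing cells.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a True/False DataFrame; sum() counts the True cells per column.
null_counts = df.isnull().sum()

# mean() of the same mask gives the share of missing values per column.
null_ratio = df.isnull().mean()

print(null_counts)
print(null_ratio)
```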

import missingno as msno

msno.matrix(df=<DataFrame>, figsize=(8,8), color=(0.8, 0.5, 0.2))
msno.bar(df=<DataFrame>, figsize=(8,8), color=(0.8, 0.5, 0.2))

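Plots the null values: matrix shows where the missing cells are in each column, and bar shows how many non-null values each column has.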

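<DF>[<Col1>] = <DF>[<Col1>].fillna(<value>)

Fills the null values in Col1 with value. Assign the result back to the column; calling fillna(..., inplace=True) on a selected column may not update the original DataFrame in recent pandas versions.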
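
Here is a short sketch of filling nulls in one column. Using the column median as the fill value is my own assumption for the example; the toy data is also made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing ages.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})

# Assumption for this example: fill with the median of the known values.
median_age = df["Age"].median()

# fillna returns a new Series, so assign it back to the column.
df["Age"] = df["Age"].fillna(median_age)

print(df)
```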

<DataFrame>[[<Col1>,<Col2>]].groupby([<Col1>], as_index=True).count()

Counts the non-null Col2 entries for each value of Col1.

<DataFrame>[[<Col1>,<Col2>]].groupby([<Col1>], as_index=True).sum()

Sums Col2 for each value of Col1. When Col2 is a 0/1 (True/False) column, this is the number of True entries per group.

pd.crosstab(<DF>[<Col1>], <DF>[<Col2>], margins=True)

Shows the information above in a single table.
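
The three calls above can be compared side by side with a small sketch. The `Sex` and `Survived` columns are hypothetical, Titanic-style names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: Survived is a 0/1 flag.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 0],
})

# Number of rows per group (non-null Survived values).
counts = df[["Sex", "Survived"]].groupby(["Sex"], as_index=True).count()

# Number of survivors per group, since summing a 0/1 column counts the 1s.
survivors = df[["Sex", "Survived"]].groupby(["Sex"], as_index=True).sum()

# crosstab shows both at once; margins=True adds the "All" totals.
table = pd.crosstab(df["Sex"], df["Survived"], margins=True)

print(counts)
print(survivors)
print(table)
```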

from sklearn.preprocessing import LabelEncoder

le_encoder = LabelEncoder()
<DF>[<Col1>] = le_encoder.fit_transform(<DF>[<Col1>])

Label encoding: converts string values into integer category codes.
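
A minimal sketch of label encoding on a made-up `Embarked` column (the name and values are assumptions). The codes follow the sorted order of the class labels, and the fitted encoder can map them back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column of string categories.
df = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

le_encoder = LabelEncoder()

# fit_transform learns the sorted classes (C, Q, S) and maps them to 0, 1, 2.
df["Embarked"] = le_encoder.fit_transform(df["Embarked"])

# inverse_transform recovers the original strings from the codes.
restored = le_encoder.inverse_transform(df["Embarked"])

print(df)
print(list(le_encoder.classes_))
```

Keep in mind that the integer codes imply an order (0 < 1 < 2) that the original categories may not have.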

pd.get_dummies(<DF>, columns=[<Col1>,<Col2>,...])

One-hot encoding.
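
A short sketch of what get_dummies produces. The DataFrame and column names are hypothetical. Each listed column is replaced by one indicator column per category, named `<column>_<value>`.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column.
df = pd.DataFrame({
    "Sex": ["male", "female"],
    "Embarked": ["S", "C"],
    "Fare": [7.25, 71.28],
})

# Only the listed columns are expanded; Fare is left untouched.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"])

print(encoded)
```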

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
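<DF>[<Col1>] = mms.fit_transform(<DF>[[<Col1>]])  # note: two pairs of brackets

Applies min-max scaling, which by default rescales the column to the range [0, 1]. The double brackets are needed because the scaler expects 2D input.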
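
A minimal sketch of min-max scaling a single column. The `Fare` values are made up, and the `.ravel()` call is my own addition to flatten the scaler's 2D output before assigning it back.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column.
df = pd.DataFrame({"Fare": [0.0, 10.0, 20.0, 40.0]})

mms = MinMaxScaler()

# df[["Fare"]] (double brackets) is a one-column DataFrame, i.e. 2D input.
# The result is a 2D array of shape (4, 1); ravel() flattens it to 1D.
df["Fare"] = mms.fit_transform(df[["Fare"]]).ravel()

print(df)
```

Each value is mapped to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1.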
