Python [Big Data] Data Preprocessing (2)


oort2 2023. 3. 6. 16:55

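# Data preprocessing: the process of transforming raw data into the form you need

1. Handling missing values: cases where a value is absent.

❤️2. Handling duplicate data

❤️3. Handling erroneous data (handled much like missing values, though the aim differs slightly: the value is present but invalid)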

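    2. Handling duplicate data
       - duplicated(): finds duplicate rows. The first occurrence is marked False; each later identical row is marked True.
       - drop_duplicates(): removes duplicate rows, keeping one row from each set of duplicates.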

import pandas as pd

df = pd.DataFrame({"c1": ['a', 'a', 'b', 'a', 'b'],
                   "c2": [1, 1, 1, 2, 2],
                   "c3": [1, 1, 2, 2, 2]})
df_dup = df.duplicated()  # compares whole rows
df_dup   # 0 False / 1 True / 2 False / 3 False / 4 False
col_dup = df["c1"].duplicated()  # checks for duplicates using only column c1
col_dup  # 0 False / 1 True / 2 False / 3 True / 4 True
df2 = df.drop_duplicates()  # drops the fully duplicated row (index 1)
df2
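Both methods also accept `subset` and `keep` arguments, which the example above leaves at their defaults. The following is a minimal sketch on the same sample frame; the variable names `all_dups` and `by_cols` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"c1": ['a', 'a', 'b', 'a', 'b'],
                   "c2": [1, 1, 1, 2, 2],
                   "c3": [1, 1, 2, 2, 2]})

# keep=False flags every member of a duplicate group, including the first one
all_dups = df.duplicated(keep=False)

# Compare only c2 and c3, and keep the last row of each duplicate group
by_cols = df.drop_duplicates(subset=["c2", "c3"], keep="last")

print(all_dups.tolist())
print(by_cols)
```

With `keep=False`, rows 0 and 1 are both flagged, which helps when you want to inspect every copy before deciding which one to drop.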

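3. Handling erroneous data

Handling "?" values: replace them with the missing value (np.nan)
replace(error_string, np.nan, inplace=True)

pandas skips missing values (NaN) in numeric calculations, but a "?" string leaves the column as object dtype, so numeric operations and conversions on it raise an error.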
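The difference between the two cases can be checked on a small made-up Series. This is only a sketch: the values and the names `with_nan`, `with_qmark`, and `failed` are hypothetical and are not taken from the mpg data.

```python
import numpy as np
import pandas as pd

# A missing value is skipped by default when computing the mean
with_nan = pd.Series([130.0, np.nan, 150.0])
mean_with_nan = with_nan.mean()  # average of 130.0 and 150.0

# A "?" placeholder keeps the column as object dtype and cannot be cast to float
with_qmark = pd.Series(["130.0", "?", "150.0"], dtype=object)
try:
    with_qmark.astype("float")
    failed = False
except ValueError:
    failed = True

print(mean_with_nan, failed)
```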

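import numpy as np

# mpg: a DataFrame loaded beforehand that has a "horsepower" column
# Assign the result back; inplace=True on a selected column is not reliable in recent pandas
mpg["horsepower"] = mpg["horsepower"].replace("?", np.nan)  # replace "?" with a missing value
mpg[mpg["horsepower"].isnull()]  # look up the rows where horsepower is missing
mpg.dropna(subset=["horsepower"], axis=0, inplace=True)  # drop the rows where horsepower is missing
mpg.info()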

+ Converting the column to a float dtype
astype(dtype): converts every element to the given dtype

mpg["horsepower"].head()  # check the dtype; at this point it is object
mpg["horsepower"] = mpg["horsepower"].astype("float")
mpg.info()  # horsepower is now float
mpg["horsepower"].describe()
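As an alternative to the replace-then-astype steps above, `pd.to_numeric` with `errors="coerce"` turns unparseable strings such as "?" into NaN and converts the rest to numbers in one call. This is a different route from the one used in this post. The sketch below uses a small made-up frame, `mpg_demo`, rather than the real mpg data.

```python
import pandas as pd

# Hypothetical stand-in for the mpg data, with one "?" placeholder
mpg_demo = pd.DataFrame({"horsepower": ["130.0", "?", "150.0"],
                         "mpg": [18.0, 25.0, 15.0]})

# Unparseable strings become NaN; the rest become floats
mpg_demo["horsepower"] = pd.to_numeric(mpg_demo["horsepower"], errors="coerce")

# Drop the rows where horsepower could not be parsed
mpg_demo = mpg_demo.dropna(subset=["horsepower"])

print(mpg_demo.dtypes)
print(mpg_demo)
```

Note that `errors="coerce"` silently converts any unparseable value, not only "?", so check how many NaN values appear before dropping rows.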

