
๋ฐ์ดํ„ฐ์ „์ฒ˜๋ฆฌ 4

[Python] Pandas Dataframe: removing duplicates and checking distinct values

df.drop_duplicates() can remove duplicates across the whole df, but it can also remove duplicates within specific columns. The data above has many duplicated values in the pert_iname column, and df.drop_duplicates() can be used here to check how many distinct values there are. The original 13553 rows come down to 6798 once duplicates are excluded. Alternatively, df.value_counts() finds the distinct values and also shows how many times each one occurs.
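The two approaches in this excerpt can be sketched roughly as follows. The dataframe is a small hypothetical stand-in: only the column name pert_iname comes from the post, while the values and the dose column are invented for illustration. The nunique() call is an addition, not something the excerpt mentions.

```python
import pandas as pd

# Hypothetical stand-in data; only the column name `pert_iname` is from the post.
df = pd.DataFrame({
    "pert_iname": ["aspirin", "caffeine", "aspirin", "ibuprofen", "caffeine", "aspirin"],
    "dose": [10, 5, 10, 20, 5, 30],
})

# Drop rows that are identical across every column.
full_dedup = df.drop_duplicates()

# Drop rows that repeat a value in one column only (the first occurrence is kept).
by_column = df.drop_duplicates(subset="pert_iname")

# Count the distinct values in the column directly.
n_distinct = df["pert_iname"].nunique()

# List each distinct value together with how often it occurs.
counts = df["pert_iname"].value_counts()

print(len(df), len(full_dedup), len(by_column), n_distinct)
print(counts)
```

Passing subset restricts the comparison to the named column, which is what "removing duplicates within a column" amounts to. Without it, two rows only count as duplicates when every column matches.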

[Data Analysis] The data analysis process and the importance of preprocessing

๋ฐ์ดํ„ฐ ๋ถ„์„ ๊ณผ์ •(Data Analysis Process) 1. Goal Definition ๊ฐ๊ด€์ , ๊ตฌ์ฒด์ ์œผ๋กœ ๋ถ„์„ ๋Œ€์ƒ ์ •์˜(=๋ฌธ์ œ ์ •์˜) ํ•ด๋‹น ๋„๋ฉ”์ธ์— ๋Œ€ํ•œ ์ดํ•ด ํ•ด๋‹น ํ”„๋กœ์ ํŠธ์— ๋Œ€ํ•œ ์ดํ•ด 2. Data Searching & Collecting ๋ฌธ์ œ ์ •์˜ ํ›„ ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ๊ฒ€์ƒ‰ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ ๋ฐ ๋ฐ์ดํ„ฐ ํŒŒ์•… 3. Data Preparation ๋ฐ์ดํ„ฐ์˜ noise๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  ์›ํ•˜๋Š” ํ˜•ํƒœ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜ํ•˜๋Š” Data preprocessing(๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •)ํฌํ•จ ์ตœ์ข… ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•œ ๋ฐ์ดํ„ฐ ์ค€๋น„ ๋‹จ๊ณ„ ๊ด€๋ จ ๋ฐ์ดํ„ฐ๋ผ๋ฆฌ ๊ด€๊ณ„ ์„ค์ • ๋ฐ ๋ฐ์ดํ„ฐ ์ดํ•ด, ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ 4. Modeling ์–ด๋–ป๊ฒŒ ๋ชจ๋ธ ์„ค๊ณ„ํ• ์ง€ ๊ตฌ์„ฑ R, Python ๋“ฑ ์ด์šฉํ•ด ๋จธ์‹ ๋Ÿฌ๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋“ฑ ๋‹ค์–‘ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ ์šฉ 5. Evaluatio..

[Python] Pandas Dataframe basics (merge, concat, merging and concatenating by row and by column)

Example dataframes
import pandas as pd
left = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})
1. Merging two dataframes on a key
pd.merge(left, right, on='id')
2. Two dataframes m..
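As a rough sketch of what the title describes, the snippet below runs the key-based merge from the excerpt on the same two example dataframes, then adds row-wise and column-wise concat calls. The excerpt is cut off before its second step, so the concat calls are an assumption based on the post title and not necessarily what the author wrote.

```python
import pandas as pd

left = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
    "subject_id": ["sub1", "sub2", "sub4", "sub6", "sub5"],
})
right = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
    "subject_id": ["sub2", "sub4", "sub3", "sub6", "sub5"],
})

# Merge on the shared key; overlapping non-key columns get _x / _y suffixes by default.
merged = pd.merge(left, right, on="id")

# Concatenate by row: stack the frames vertically and renumber the index.
rows = pd.concat([left, right], ignore_index=True)

# Concatenate by column: place the frames side by side, aligned on the index.
cols = pd.concat([left, right], axis=1)

print(merged.columns.tolist())
print(rows.shape, cols.shape)
```

The difference to keep in mind is that merge matches rows by the values in a key column, while concat only lines frames up by index or stacks them. The column-wise result here keeps duplicate column names.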

[Python] Pandas Explode, Pandas Dataframe, column split: building Pandas library skills through biodata processing

(Written from actual code output, using transcription factor binding site data)
Adding a new column to a Dataframe
Splitting a Dataframe column and saving the parts as other columns # using df[column_name].str.split()
Splitting a Dataframe column's data and saving it back
Using the Pandas explode method (turning the elements of lists stored in a dataframe column into rows)
import pandas as pd
f1 = pd.read_csv('test.txt', delimiter='\t', names=['1', '2', '3', '4', '5', '6', '7', '8', '9']) # columns named 1~9, tab-delimited..
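The operations listed in this excerpt can be sketched on a tiny hypothetical table. The post's test.txt file and its nine columns are not reproduced here, so the column names (region, factors, source, chrom, span) and all values below are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the binding-site table used in the post.
df = pd.DataFrame({
    "region": ["chr1:100-200", "chr2:300-400"],
    "factors": ["CTCF,MYC", "GATA1"],
})

# Add a new column with a constant value.
df["source"] = "example"

# Split one column and save the pieces as separate new columns.
# expand=True makes str.split return a DataFrame instead of a Series of lists.
df[["chrom", "span"]] = df["region"].str.split(":", expand=True)

# Split a column and store the resulting lists back in the same column.
df["factors"] = df["factors"].str.split(",")

# explode turns each list element into its own row;
# the other columns and the original index value are repeated.
exploded = df.explode("factors")

print(exploded)
```

Because explode repeats the original index, calling reset_index(drop=True) afterwards is a common follow-up when unique row labels are needed.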
