728x90

pandas 10

[Python] ํŒ๋‹ค์Šค ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„(Pandas DataFrame) sys:1: DtypeWarning: Columns have mixed types.Specify dtype option on import or set low_memory=False. ๋ฌด์‹œํ•˜๊ธฐ

column์— NaN๊ฐ’์ด๋‚˜ ์—ฌ๋Ÿฌ type์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์„ž์—ฌ ์žˆ์œผ๋ฉด ์ด์™€ ๊ฐ™์€ ๊ฒฝ๊ณ ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. ์ด ๋•Œ ๊ฒฝ๊ณ  ๋ฉ”์‹œ์ง€๊ฐ€ ์•Œ๋ ค์ฃผ๋Š” ๋Œ€๋กœ dtype option์œผ๋กœ ํƒ€์ž…์„ ๋ช…์‹œํ•ด์ฃผ๊ฑฐ๋‚˜ low_memory = False๋กœ ์ง€์ •ํ•ด ์ฃผ๋ฉด ๊ฒฝ๊ณ  ๋ฉ”์‹œ์ง€๊ฐ€ ์ถœ๋ ฅ๋˜์ง€ ์•Š๋Š”๋‹ค. pd.read_csv('[ํŒŒ์ผ๋ช…].txt', delimiter = '\t', low_memory=False)

[Python] ํŒ๋‹ค์Šค ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„(pandas dataframe) SettingWithCopyWarning ํ•ด๊ฒฐ

๊ธฐ์กด ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ผ๋ถ€๋ฅผ ๋ณต์‚ฌํ•˜๊ฑฐ๋‚˜ ์ธ๋ฑ์‹ฑ ํ›„ ๊ฐ’์„ ์ˆ˜์ •ํ•  ๋•Œ ์ข…์ข… ๋ฐœ์ƒํ•œ๋‹ค. ๊ธฐ์กด ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๊ฐ€์ ธ์™€(๋ณต์‚ฌ) ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๋งŒ๋“ค ๋•Œ ์›๋ณธ์„ ์ˆ˜์ •ํ•  ์ง€ ๋ณต์‚ฌ๋ณธ์„ ์ˆ˜์ •ํ•  ์ง€ ๋ชฐ๋ผ์„œ ๋ฐœ์ƒํ•˜๋Š” ์˜ค๋ฅ˜๋ผ๊ณ  ํ•œ๋‹ค. ๋‘ ๊ฐ€์ง€ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•์ด ์žˆ๋Š”๋ฐ, ํ•˜๋‚˜๋Š” ๊ฒฝ๊ณ ๋ฅผ ๋ฌด์‹œํ•˜๋Š” ๊ฒƒ์ด๊ณ  ํ•˜๋‚˜๋Š” copy๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ! 1. ๊ฒฝ๊ณ  ๋ฌด์‹œ pd.set_option์„ ์‚ฌ์šฉํ•œ ๊ฒฝ๊ณ ๋ฌธ ์ œ๊ฑฐ # SettingWithCopyError --> ์˜ค๋ฅ˜ raise ๋กœ ์ฝ”๋“œ ์‹คํ–‰ X pd.set_option('mode.chained_assignment', 'raise') # SettingWithCopyWarning --> ์‹คํ–‰์€ ๋˜์ง€๋งŒ ๊ฒฝ๊ณ ๋ฌธ ๋œธ pd.set_option('mode.chained_assignment', 'warn') # err..

[Python] ํŒ๋‹ค์Šค concat, append, join, merge ์ฐจ์ด

Pandas concat vs append vs join vs merge Concat gives the flexibility to join based on the axis( all rows or all columns) Append is the specific case(axis=0, join='outer') of concat Join is based on the indexes (set by set_index) on how variable =['left','right','inner','couter'] Merge is based on any particular column each of the two dataframes, this columns are variables on like 'left_on', 'ri..

[Python] Pandas Dataframe ์ค‘๋ณต ์ œ๊ฑฐ, distinctํ•œ ๊ฐ’ ํ™•์ธ

df.drop_duplicates() df ์ „์ฒด์˜ ์ค‘๋ณต ์ œ๊ฑฐ๋„ ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์—ด ๋ผ๋ฆฌ ์ค‘๋ณต ์ œ๊ฑฐ๋„ ๊ฐ€๋Šฅํ•˜๋‹ค. ์œ„์˜ ๋ฐ์ดํ„ฐ๋Š” pert_iname์ด๋ผ๋Š” ์—ด์— ์ค‘๋ณต๋œ ๋ฐ์ดํ„ฐ๋“ค์ด ๋งŽ์ด ์žˆ๋Š”๋ฐ, ์—ฌ๊ธฐ์„œ df.drop_duplicates()๋กœ distinctํ•œ ๊ฐ’์€ ๋ช‡ ๊ฐœ์ธ์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ์›๋ž˜ 13553๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์ค‘๋ณต๊ฐ’์„ ์ œ์™ธํ•˜๋ฉด 6798๊ฐœ๋ผ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์œผ๋กœ df.value_counts() ๋ฅผ ์ด์šฉํ•˜๋ฉด distinctํ•œ ๊ฐ’์„ ์ฐพ์•„์ฃผ๋ฉด์„œ ๋ช‡ ๊ฐœ๊ฐ€ ์ค‘๋ณต๋˜์–ด์žˆ๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

[Python] Pandas dataframe ๊ฒฐํ•ฉ, ์กฐ์ธ, ๋ณ‘ํ•ฉ(Join, Merge)

Join 1. ์˜ˆ์‹œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ƒ์„ฑ import pandas as pd df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']}) other = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'B': ['B0', 'B1', 'B2']}) 2. ์—ด์˜ Index ์ง€์ •ํ•ด์„œ Join df.join(other, lsuffix = '_caller', rsuffix = '_other') 3. Key๋ฅผ index๋กœ ์ง€์ •ํ•ด Join df.set_index('key').join(other.set_index('key')) 4. join ๋ฉ”์†Œ๋“œ์˜ parameter ..

[Python] Pandas Dataframe ์—ด์— ์–ด๋–ค ๋ฐ์ดํ„ฐ ์žˆ๋Š”์ง€ value ํ™•์ธ, ๋ฐ์ดํ„ฐ ๋ณ„๋กœ ๊ฐœ์ˆ˜ ์„ธ๊ธฐ, ์ค‘๋ณต๊ฐ’ ํ™•์ธ, ์œ ์ผํ•œ(์œ ๋‹ˆํฌํ•œ) ๊ฐ’ ์ฐพ๊ธฐ

df.unique() ์œ„์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ cell_line์— ๋Œ€ํ•œ ๋‚ด ์˜ˆ์‹œ ๋ฐ์ดํ„ฐ์ด๋‹ค. ์ด์ œ ์ด cell_lien ๋ฐ์ดํ„ฐ์—์„œ ์œ ๋‹ˆํฌํ•œ ๊ฐ’์„ ์ฐพ์•„๋ณผ ๊ฒƒ์ด๋‹ค. 1. ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ์ค‘๋ณต์„ ์ œ๊ฑฐํ•˜์ง€ ์•Š๊ณ  ๊ฐ’ ํ™•์ธ cell_line = f['epigenomes_with_experimental_evidence'].values # values = df[์ปฌ๋Ÿผ๋ช…].values ์œ„์˜ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜๋ฉด arrayํ˜•ํƒœ๋กœ ๋ชจ๋“  ๊ฐ’์ด ์ถœ๋ ฅ๋œ๋‹ค. 2. ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ๊ฐ ์š”์†Œ๋ณ„๋กœ ๋ช‡๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธ f['epigenomes_with_experimental_evidence'].value_counts() # df[์ปฌ๋Ÿผ๋ช…].value_counts() 3. ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ์ค‘๋ณต์„ ์ œ๊ฑฐํ•˜๊ณ  ์œ ๋‹ˆํฌํ•œ ๊ฐ’ ํ™•์ธ f['epigenome..

[Python] Pandas Dataframe ๊ธฐ๋ณธ(merge, concat, concat ํ–‰, ์—ด ๊ธฐ์ค€์œผ๋กœ ๋ณ‘ํ•ฉ, ์—ฐ๊ฒฐ)

์˜ˆ์‹œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ import pandas as pd left = pd.DataFrame({ 'id':[1,2,3,4,5], 'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'subject_id':['sub1','sub2','sub4','sub6','sub5']}) right = pd.DataFrame( {'id':[1,2,3,4,5], 'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 'subject_id':['sub2','sub4','sub3','sub6','sub5']}) 1. ๋‘ ๊ฐœ์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ Key ๊ธฐ์ค€์œผ๋กœ ํ•ฉ์น˜๊ธฐ pd.merge(left,right,on='id') 2. ๋‘ ๊ฐœ์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ m..

[Python] ํŒŒ์ด์ฌ multiprocessing package๋กœ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ, ์—ฐ์‚ฐ ์†๋„ ๊ฐœ์„ 

์ตœ๊ทผ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ๋กœ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๊ฐ€ ๋งŽ์ด ์ค‘์š”ํ•ด์กŒ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ํŒŒ์ด์ฌ์—๋Š” ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ๋ฅผ ์ œ๊ณตํ•˜๋Š” ํŒจํ‚ค์ง€์ธ multiprocessing์ด ์žˆ๋‹ค. ์ด๋ฒˆ ํฌ์ŠคํŒ…์€ multiprocessing ํŒจํ‚ค์ง€๋ฅผ ์ด์šฉํ•ด cpu ์ฝ”์–ด ์ˆ˜๋งŒํผ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ณผ์ •์„ ๋ณด์—ฌ์ค„ ์˜ˆ์ •์ด๋‹ค. 1. CPU์— ์žˆ๋Š” ์ฝ”์–ด์˜ ์ˆ˜๋ฅผ multiprocessing.cpu_count()๋ฅผ ์ด์šฉํ•ด ํ™•์ธ import multiprocessing as mp num_cores = mp.cpu_count() # cpu ์ฝ”์–ด ์ˆ˜ ๋ฐ˜ํ™˜ 2. Dataframe multiprocessing ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ, ํ•œ ์ค„์”ฉ ์ฒ˜๋ฆฌ def parallel_dataframe(df, func, num_cores): df_split = np.array_split(df, num_cor..

728x90