
[Python] Pandas DataFrame basics (loading/saving a DataFrame, counting rows, concatenating DataFrames, listing columns, checking a column's values with pd.Series value_counts)

탱젤 2021. 1. 13. 21:55
  • Loading files with pandas

- Loading a CSV file

import pandas as pd

df = pd.read_csv('filename.csv')  # a CSV file loads with the default settings

- Loading a tab-separated txt file

import pandas as pd

df = pd.read_csv('filename.txt', delimiter='\t')

# Loads a tab-separated txt file (the tsv format works the same way)

- Loading a space-separated file

import pandas as pd

df = pd.read_csv('filename.ext', delimiter=' ')

# Loads a file whose columns are separated by a single space
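
As a quick check of the three loaders above, here is a minimal sketch that parses in-memory text instead of files. The sample data and the names `csv_text`, `tsv_text`, and `space_text` are made up for illustration and are not from the original post.

```python
import io

import pandas as pd

# Hypothetical sample data, held in memory so the sketch runs without any files.
csv_text = "name,score\nkim,90\nlee,85\n"
tsv_text = "name\tscore\nkim\t90\nlee\t85\n"
space_text = "name score\nkim 90\nlee 85\n"

df_csv = pd.read_csv(io.StringIO(csv_text))                      # comma is the default separator
df_tsv = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")      # tab-separated
df_space = pd.read_csv(io.StringIO(space_text), delimiter=" ")   # single-space-separated

# Check whether the three calls produced the same table.
print(df_csv.equals(df_tsv) and df_tsv.equals(df_space))
```

In `read_csv`, `delimiter` is an alias for `sep`, so either name works. If the columns are separated by a varying number of spaces, `sep=r"\s+"` is usually the safer choice.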

 

 

  • Counting the rows of a DataFrame
print(len(df.index))

print(df.shape[0])

print(len(df))

Any one of the three above will do.

※ Note

shape[0]: number of rows / shape[1]: number of columns
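
A small sketch to confirm that the three expressions agree, using a made-up DataFrame (the column names `a` and `b` are only for illustration):

```python
import pandas as pd

# Hypothetical 3-row, 2-column DataFrame used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple

# The three row-count expressions agree.
assert len(df.index) == df.shape[0] == len(df) == n_rows
print(n_rows, n_cols)  # prints: 3 2
```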

 

 

  • Concatenating two DataFrames
import pandas as pd

pd.concat([df1, df2])

- Unlike a column-wise or row-wise merge, this code simply stacks the two DataFrames one after the other

- If one DataFrame lacks a column that the other has, the missing values are filled with NaN

- Handy when the two DataFrames have different columns
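
The NaN behavior described above can be seen in a minimal sketch. The frames and the column names `id`, `x`, and `y` are hypothetical.

```python
import pandas as pd

# Hypothetical frames: both have "id", but "x" and "y" each exist in only one of them.
df1 = pd.DataFrame({"id": [1, 2], "x": [10, 20]})
df2 = pd.DataFrame({"id": [3], "y": ["a"]})

# Rows are stacked; the column set is the union of both frames,
# and cells with no source value become NaN.
combined = pd.concat([df1, df2])
print(combined)

# By default the original row labels are kept (0, 1, 0 here);
# ignore_index=True renumbers them from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(list(renumbered.index))
```

Note that `x` becomes a float column after concatenation, because NaN cannot be stored in an integer column by default.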

 

  • Saving a DataFrame as csv or txt
df.to_csv('filename.csv')  # plain save as CSV

df.to_csv('filename.txt', sep='\t')  # save as a tab-separated txt file

df.to_csv('filename.ext', index=False)  # save without the index
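
To see what the `index` and `sep` options change, here is a sketch that writes to in-memory buffers rather than files. The DataFrame and the buffer names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical DataFrame used only for illustration.
df = pd.DataFrame({"name": ["kim", "lee"], "score": [90, 85]})

with_index = io.StringIO()
df.to_csv(with_index)  # default: the index is written as an unnamed first column

without_index = io.StringIO()
df.to_csv(without_index, index=False, sep="\t")  # tab-separated, index left out

print(with_index.getvalue().splitlines()[0])     # header line of the default output
print(without_index.getvalue().splitlines()[0])  # header line of the tab-separated output
```

A file saved with the index gains an extra `Unnamed: 0` column when read back with `pd.read_csv`, unless `index_col=0` is passed. That is the usual reason to save with `index=False`.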

 

  • Checking the values in a DataFrame column
import pandas as pd
f1 = pd.read_csv('9606.protein.actions.detailed.v9.1.txt', sep = '\t')

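# Selecting a single column already returns a pd.Series, so no conversion is needed
temp = f1['action']

# Count how many times each distinct value appears in the column
print(temp.value_counts())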

- Uses pd.Series.value_counts()

[Image: structure of f1]

[Image: output of print(temp.value_counts())]

Checks the values in the action column
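
Since the original file is not reproduced here, the following sketch uses a made-up stand-in for `f1`. The `action` values below are hypothetical and are not taken from the real data.

```python
import pandas as pd

# Hypothetical stand-in for f1; the real file's contents are not reproduced here.
f1 = pd.DataFrame(
    {"action": ["binding", "activation", "binding", "inhibition", "binding"]}
)

# Series mapping each unique value to its number of occurrences,
# sorted from most to least frequent.
counts = f1["action"].value_counts()
print(counts)

# normalize=True returns fractions of the total instead of raw counts.
print(f1["action"].value_counts(normalize=True))
```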

 

  • Viewing the column names of a DataFrame
column_name = list(df.columns)
# view the column names as a list

Use dataframe.columns to view the column names
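
A short sketch of what `columns` returns, using a made-up DataFrame (the column names `id` and `action` are only for illustration):

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration.
df = pd.DataFrame({"id": [1, 2], "action": ["binding", "activation"]})

print(df.columns)               # an Index object, not a plain list
column_name = list(df.columns)  # convert to a regular Python list
print(column_name)
print("action" in df.columns)   # membership tests work on the Index directly
```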

 
