
Parallel Processing (2)

[SPARK] Installing PySpark on Windows

※ These steps assume Python (version 3 or later) is already installed. Contents: install Java, install Spark, install winutils, install pyspark, verify the installation. 1. Install Java: Spark 3.0.1 supports Java 11, so open the URL below and choose the JDK 11 download in the middle of the page. You must create an Oracle account first. www.oracle.com/java/technologies/javase-downloads.html Click the Windows version, then click Next through the installer. Open Control Panel - System and Security - System, then click Advanced system settings and Environment Variables. Under the system variables, edit Path and add %JAVA_HOME%\bin. Add a JAVA_HOME system variable: C:\Program Files\Jav..

[Python] Parallel processing with Python's multiprocessing package to speed up computation

Parallel processing has become much more important recently as datasets grow large, and Python provides the multiprocessing package for it. This post walks through processing data in parallel across as many processes as there are CPU cores. 1. Check the number of CPU cores with multiprocessing.cpu_count() import multiprocessing as mp num_cores = mp.cpu_count() # returns the number of CPU cores 2. DataFrame multiprocessing: process a DataFrame in parallel, row by row def parallel_dataframe(df, func, num_cores): df_split = np.array_split(df, num_cor..
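The excerpt above is cut off, so the following is only a minimal sketch of the general split, map, and join pattern it describes, not the post's own code. To stay within the standard library it works on a plain list instead of a DataFrame. The names split_chunks, square_chunk, and parallel_apply are hypothetical and chosen for illustration. With pandas, the split step would typically be np.array_split and the join step pd.concat.

```python
import multiprocessing as mp


def split_chunks(items, n_chunks):
    """Split a list into n_chunks nearly equal parts (earlier parts get the extra items)."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def square_chunk(chunk):
    """Hypothetical per-chunk worker: handles the rows of one chunk one at a time."""
    return [x * x for x in chunk]


def parallel_apply(items, func, num_cores):
    """Split items, run func on each chunk in a worker process, then join the results."""
    chunks = split_chunks(items, num_cores)
    with mp.Pool(num_cores) as pool:
        # pool.map returns results in the same order as the input chunks
        results = pool.map(func, chunks)
    return [row for part in results for row in part]


if __name__ == "__main__":
    # The guard is required on platforms that start workers by re-importing this module
    num_cores = mp.cpu_count()
    print(parallel_apply(list(range(10)), square_chunk, num_cores))
```

The worker function has to be defined at module level so it can be pickled and sent to the worker processes. On Windows the `if __name__ == "__main__":` guard is needed because each worker re-imports the script.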
