Opensource SW #06 | Machine Learning

Introduction to ML

📌 Machine Learning

Machine learning (ML) is the study of computer algorithms that improve automatically through experience and data. More specifically, it is the science of developing the algorithms and statistical models that computer systems use to perform tasks without explicit instructions, relying on patterns and inference instead.

ML์€ ์ธ๊ณต์ง€๋Šฅ(AI)์˜ ํ•˜์œ„ ๋ถ„์•ผ๋กœ ๊ฐ„์ฃผ๋  ์ˆ˜ ์žˆ๋‹ค.


📌 Feature

Machine learning models take as input data made up of values called features. A feature is each attribute that describes the data: data in a form that a machine, rather than a person, can readily understand.

For example, input data has to be supplied in the form below for ML to understand it. Each column (attribute), such as the Name, Age, and Ticket columns, is a feature.

img
Feature
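To make this concrete, here is a minimal pandas sketch of a Titanic-style table; the passenger values below are made up for illustration. Each row is one sample and each column is one feature.

```python
import pandas as pd

# A hypothetical Titanic-style table (values made up for illustration).
# Each row is one sample (a passenger); each column (Name, Age, Ticket,
# Fare) is one feature describing that sample.
df = pd.DataFrame({
    "Name":   ["Braund", "Cumings", "Heikkinen"],
    "Age":    [22, 38, 26],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
    "Fare":   [7.25, 71.28, 7.92],
})

print(list(df.columns))  # the feature names
print(df.shape)          # (3 samples, 4 features)
```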

📌 Training & Inference

img
Training & Inference

ML์˜ ์™ธ๋ถ€์ ์ธ ์ผ์€ ํฌ๊ฒŒ 2๊ฐ€์ง€ ๊ณผ์ •์œผ๋กœ ๋‚˜๋‰œ๋‹ค.

  • Training (the learning phase): data is fed into the ML model over and over while its parameters are adjusted, training it to produce the correct answers.
  • Inference (the reasoning phase): based on the previously trained model, the ML system predicts and produces results for unseen data on its own (e.g. judging from the output that an image is a dog)

If the model has been trained sufficiently through the training phase, the phase in which it is then shipped as a service is inference. In other words, inference is the phase, after the ML model has learned enough, in which it infers results for the data people ask about and returns them.
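The two phases map directly onto scikit-learn's API. A minimal sketch, using a logistic regression on toy values chosen only for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Training phase: feed (feature, label) pairs into the model so its
# parameters are adjusted toward producing the right answers.
X_train = [[0.0], [1.0], [2.0], [3.0]]  # one feature per sample (toy values)
y_train = [0, 0, 1, 1]                  # the labels ("answers")

model = LogisticRegression()
model.fit(X_train, y_train)             # training

# Inference phase: the trained model predicts labels for unseen data.
print(model.predict([[0.2], [2.8]]))
```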


๐Ÿ“Œ ML์˜ ๋‚ด๋ถ€ Tasks

img
Supervised, unsupervised, and reinforcement learning

Tasks inside ML are broadly classified into three categories.

  • ¹Supervised Learning
  • ²Unsupervised Learning
  • ³Reinforcement Learning

Label์€ ๊ฐ ๋ฐ์ดํ„ฐ๋ฅผ ์ธํ’‹์œผ๋กœ ๋„ฃ์—ˆ์„ ๋•Œ ๋„์ถœ๋˜๋Š” ๊ฐ ์˜ˆ์ธก๊ฐ’( = ์ •๋‹ต = ๊ฒฐ๊ณผ๊ฐ’)์„ ๋งํ•œ๋‹ค.


📌 ¹Supervised Learning

img
An ML model that learns from input data together with label data
  1. Triangle image data (labeled data) is fed in, and at the same time the label information is also given to the machine as input: "this one's type (label) is triangle!"
  2. The ML model keeps learning to predict and output whether a given input is a triangle, a circle, a square, and so on.
  3. When test data is fed in, a human tells the ML model whether the output it produced is right or wrong, so the model keeps learning from humans until it consistently finds the correct answer.

Ultimately, once the ML model has finished learning to produce the correct result for a given input, it can be put to use in a service.


📌 ²Unsupervised Learning

img
Unsupervised Learning

์ง€๋„ํ•™์Šต๊ณผ ์ฐจ์ด์ ์„ ๋น„๊ตํ•ด๋ณด๋ฉด

  • ์ง€๋„ํ•™์Šต์˜ Input : ๋ฐ์ดํ„ฐ + Label + ํ•™์Šต๋ชฉํ‘œ(training objectives)
  • ๋น„์ง€๋„ํ•™์Šต์˜ Input : ๋ฐ์ดํ„ฐ + ํ•™์Šต๋ชฉํ‘œ(training objectives)

Typical unsupervised learning methods include dimensionality reduction and clustering.
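Both of these can be sketched on the Iris data introduced below; note that only the data is used, never the labels. PCA and KMeans are one common choice each for dimensionality reduction and clustering.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # data only; no labels are used

# Dimensionality reduction: compress the 4 features down to 2.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                     # (150, 2)

# Clustering: group the samples into 3 clusters without any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)
print(kmeans.labels_[:10])            # cluster id assigned to each sample
```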


📌 ³Reinforcement Learning

img
Reinforcement Learning

It is a way of learning by taking various actions, going through trial and error, and building up experience of the environment.
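scikit-learn does not cover reinforcement learning, but the trial-and-error idea can be sketched with a minimal ε-greedy bandit. The two win probabilities below are an assumed, hypothetical environment hidden from the agent.

```python
import random

random.seed(0)

# A 2-armed bandit where arm 1 pays off more often. The agent accumulates
# "experience" (estimated values per arm) purely by acting and observing.
true_win_prob = [0.3, 0.8]        # hidden from the agent (assumed environment)
value = [0.0, 0.0]                # estimated reward for each arm
count = [0, 0]
epsilon = 0.1                     # exploration rate

for _ in range(2000):
    # Explore with probability epsilon, otherwise exploit the best arm so far.
    arm = random.randrange(2) if random.random() < epsilon else value.index(max(value))
    reward = 1 if random.random() < true_win_prob[arm] else 0
    count[arm] += 1
    value[arm] += (reward - value[arm]) / count[arm]   # running average

print(value)   # the estimates move toward the true win probabilities
```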

scikit-learn

📌 Getting started with the Iris dataset

img
Iris dataset

Among the many available datasets, let's run classification on the Iris dataset.

Iris dataset์€ 4๊ฐ€์ง€์˜ feature(SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm)๋ฅผ ๋ณด์œ ํ•˜๊ณ , 3๊ฐ€์ง€์˜ ํด๋ž˜์Šค(versicolor, setosa, virginica)๋ฅผ ๊ฐ€์ง„๋‹ค.

from sklearn.datasets import load_iris
dataset = load_iris()

The load_iris function defined in sklearn.datasets loads the dataset.


📌 Dataset Properties

print(dataset['data'].shape)
print(dataset['data'][:3])
# (150, 4)
# [[5.1 3.5 1.4 0.2]
#  [4.9 3.  1.4 0.2]
#  [4.7 3.2 1.3 0.2]]
  • Iris dataset์„ ํฌํ•จํ•ด์„œ sklearn์— ์žˆ๋Š” ๋Œ€๋ถ€๋ถ„์˜ ML ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์€ 2์ฐจ์› ํ–‰๋ ฌ์˜ ํ˜•ํƒœ๋ฅผ ์ง€๋‹Œ๋‹ค. ์ด๋•Œ ํ–‰๋ ฌ์˜ ํ–‰์€ sample ๋ฐ์ดํ„ฐ์˜ ๊ฐœ์ˆ˜์ด๊ณ , ์—ด์€ feature ์˜ ๊ฐœ์ˆ˜์ด๋‹ค. ์ฆ‰ โ€œ(sample ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜, feature ๊ฐœ์ˆ˜)โ€ ์˜ ํ–‰๋ ฌ ํ˜•ํƒœ๋ฅผ ๋ณด์œ ํ•˜๊ณ  ์žˆ๋‹ค.
  • 150๊ฐœ์˜ sample ๊ฝƒ ๋ฐ์ดํ„ฐ์™€ 4๊ฐ€์ง€ ์ข…๋ฅ˜์˜ feature ๋ฅผ ๋ณด์œ ํ•˜๊ณ  ์žˆ๋‹ค.
  • shape: ํ•ด๋‹น dataset ์˜ ํ–‰๋ ฌ์˜ ์‚ฌ์ด์ฆˆ๋ฅผ ๊ฐ€์ง„ tupleํ˜•ํƒœ์˜ ๋ณ€์ˆ˜์ด๋‹ค. (150, 4) ๋Š” Iris dataset๊ฐ€ 150ํ–‰ 4์—ด์งœ๋ฆฌ ํฌ๊ธฐ๋ฅผ ๊ฐ€์ง„ ํ–‰๋ ฌ์ž„์„ ์˜๋ฏธํ•œ๋‹ค.
print(dataset['target'].shape)
# (150,)
print(dataset.target)
#[0 0 0 0 0 0 0 0 0 0 ......]
print(dataset.target_names)
#['setosa' 'versicolor' 'virginica']
  • target: holds, for every data item in the dataset, the label it belongs to, encoded as a number.

    => If 0 is printed, that item belongs to label class 0.

  • target_names: stores the dataset's label classes.
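Putting the two together, each numeric label can be mapped back to its class name by indexing target_names with target:

```python
from sklearn.datasets import load_iris

dataset = load_iris()

# target holds numeric labels; target_names maps them back to class names.
print(dataset.target[:3])                        # [0 0 0]
print(dataset.target_names[dataset.target[:3]])  # numeric -> name lookup
```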

📌 Model Methods

  • fit(param1, param2): transforms the given parameter data appropriately and trains the ML model on it. This is how a dataset is turned into a trained ML model.

    • ์ง€๋„ํ•™์Šต์˜ ๊ฒฝ์šฐ, data์™€ label 2๊ฐœ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฐ์ดํ„ฐ(์ธํ’‹ ๊ฐ’)๋กœ ๋„˜๊ฒจ์ค˜์•ผํ•œ๋‹ค.

      classifier.fit(X, y)

    • ๋น„์ง€๋„ํ•™์Šต์˜ ๊ฒฝ์šฐ, data 1๊ฐœ๋งŒ ์ธํ’‹์œผ๋กœ ๋„˜๊ฒจ์ฃผ๋ฉด ๋œ๋‹ค.

      clustering_model.fit(X)

  • predict(): given an input, the ML model produces the appropriate output for it. Once training has finished via fit, call predict when serving the model to generate outputs for new input data.

📌 The overall ML process

img
The ML process
  1. Get Data: obtain the input data.
  2. Clean, Prepare, Manipulate Data: split the data into training data and test data (used for evaluating performance).
  3. Train Model: train the model.
  4. Test Data: test the trained ML model.
  5. Improve: keep repeating steps 1~3 until a better ML model comes out.
  6. Final Model: once the improvement loop is done, the final ML model is produced.

📌 Split the dataset

Separating the data in a dataset into training data and test data (used for evaluating performance) is called splitting the dataset.

The test dataset must not leak into the training process; information that does leak is called data leakage.

The ratio between the two datasets is adjustable; the training dataset is usually larger than the test dataset.

  • train_test_split(X, y): a function that splits the data into training data and test data

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.1)
    
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)
    
    #(135, 4)
    #(15, 4)
    #(135,)
    #(15,)
    • test_size: when test_size is 0.1, the test data makes up 10% and the training data 90%. (The default is 0.25.)

📌 Transformation

๊ฐ feature๋“ค์˜ ์Šค์ผ€์ผ ๊ฐ’์ด ๋งŽ์ด ์ฐจ์ด๊ฐ€ ๋‚ ์ˆ˜์žˆ๋‹ค.

ex) ์–ด๋–ค dataset์˜ feature1์˜ data๋“ค์˜ ๊ฐ’๋“ค์€ 0.1, 0.3, 0.6๊ณผ ๊ฐ™์€ ๊ฐ’๋“ค์ธ๋ฐ, feature2์˜ data๊ฐ’๋“ค์€ 10000, 500๋ฐฑ๋งŒ ๊ณผ ๊ฐ™์ด ์ฐจ์ด๊ฐ€ ์‹ฌํ•˜๋ฉด ML์ด ํ•™์Šตํ•˜๊ธฐ๊ฐ€ ๊ณค๋ž€ํ•ด์ง„๋‹ค.

๋”ฐ๋ผ์„œ ๊ฐ feature๋“ค์˜ ์Šค์ผ€์ผ์„ ๋งž์ถฐ์ฃผ๋Š” ๊ฒƒ์ด ํ•„์š”ํ•˜๋‹ค. ์ด๋•Œ ์‚ฌ์šฉํ•˜๋Š” ํ•จ์ˆ˜๊ฐ€ StandardScaler์ด๋‹ค.

์ฆ‰ dataset์„ splitํ•˜๊ณ  ML์— ๊ฐ’์„ ๋„ฃ์–ด์ฃผ๊ธฐ์ „์— ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ˜•์‹œํ‚ค๋Š”(์Šค์ผ€์ผ์„ ๋ฐ”๊ฟ”์ฃผ๋Š”) ์œ„์™€ ๊ฐ™์€ ๊ณผ์ •์„ Transformation์ด๋ผ๊ณ ํ•œ๋‹ค.

z = (x - u) / s ์‹์„ ํ†ตํ•ด ์Šค์ผ€์ผ์„ ์žฌ์กฐ์ •ํ•œ๋‹ค.

  • StandardScaler(): aligns the scales of the features

    from sklearn.preprocessing import StandardScaler
    print(X_train[:3])
    print(StandardScaler().fit(X_train[:3]).transform(X_train[:3]))
    # [[5.1 3.4 1.5 0.2]
    # [5.9 3.2 4.8 1.8]
    # [5.7 2.8 4.1 1.3]]
    # [[-1.37281295  1.06904497 -1.38526662 -1.3466376 ]
    #  [ 0.98058068  0.26726124  0.93916381  1.0473848 ]
    #  [ 0.39223227 -1.33630621  0.44610281  0.2992528 ]]

    The fit method (different from the model-training fit seen earlier) computes the mean u and the standard deviation s, and transform then computes (x - u) / s.
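As a sanity check, the formula z = (x - u) / s can be applied by hand with NumPy and compared against StandardScaler; a minimal sketch on toy values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

u = X.mean(axis=0)            # per-feature mean
s = X.std(axis=0)             # per-feature standard deviation
z_manual = (X - u) / s        # the formula applied by hand

z_scaler = StandardScaler().fit_transform(X)
print(np.allclose(z_manual, z_scaler))   # True: the same rescaling
```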


📌 Pipeline

Once the data has gone through the transform step, it naturally has to be fed into the ML model for training; a Pipeline makes it easy to wire the transformed data into the model.

A Pipeline object is created with the make_pipeline function, and it can process the whole sequence, from the transformation step through testing, in one go.

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), RandomForestClassifier())
pipe.fit(X_train, y_train)
print(accuracy_score(pipe.predict(X_test), y_test))
# 0.9333333333333333
  • make_pipeline์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์›ํ•˜๋Š” ์ˆœ์„œ๋Œ€๋กœ ๊ฐ์ฒด๋ฅผ ๋„ฃ์–ด์ฃผ๋ฉด ๋œ๋‹ค.
  • StandardScaler(): ์ง์ „์— ์‚ดํŽด๋ดค๋“ฏ์ด Transformation๋ฅผ ํ•ด์ฃผ๋Š” ๊ฐ์ฒด๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ํ•จ์ˆ˜์ด๋‹ค. fit๊ณผ transformํ˜ธ์ถœ ์—†์ด ๋„˜๊ธฐ๊ธฐ๋งŒ ํ•˜๋ฉด ๋œ๋‹ค.
  • RandomForsetClassifier(): ์ด ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ๋ฐ˜ํ™˜๋˜๋Š” ๊ฐ์ฒด๋Š” ๋ถ„๋ฅ˜ ML๋ชจ๋ธ ์ค‘์— ํ•˜๋‚˜์ด๋‹ค.
  • accuracy_score(predict, label) : ์ •ํ™•๋„๋ฅผ ๊ณ„์‚ฐํ•ด์ฃผ๋Š” ํ•จ์ˆ˜

📌 Confusion matrix

img
Confusion matrix

A confusion matrix lays out, in matrix form, how many of the overall predictions were correct, so the result is easy to read at a glance.

ํ•œ ์ถ•์€ label ์ˆ˜์น˜๋กœ, ํ•œ ์ถ•์€ predict (์–ผ๋งˆ๋‚˜ ๋งž์•˜๋Š”์ง€)๋ฅผ ํ‘œํ˜„ํ•œ๋‹ค.

  • confusion_matrix(label, predict)

    from sklearn.metrics import confusion_matrix
    print(confusion_matrix(y_test, pipe.predict(X_test)))
    #[[5 0 0]
    # [0 7 0]
    # [0 1 2]]

📌 Hyperparameters for improving performance

๋Œ€๋ถ€๋ถ„์˜ ML ๋ชจ๋ธ์—๋Š” ์ ํ•ฉ ๋ชจ๋ธ์˜ ์ตœ์ข… ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๋งŽ์€ ๋งค๊ฐœ ๋ณ€์ˆ˜๊ฐ€ ์žˆ๋‹ค. RandomForestClassifier๋งŒ ํ•ด๋„ ๋งค์šฐ ๋งŽ์€ ๋งค๊ฐœ ๋ณ€์ˆ˜๊ฐ€ ์žˆ๋Š”๋ฐ, ์ด๋ ‡๊ฒŒ ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ์‚ฌ๋žŒ์ด ์ง์ ‘ ๊ฐ’์„ ์ง€์ •ํ•ด์ค˜์•ผํ•˜๋Š” ๊ฐ’๋“ค์„ hyperparameters๋ผ๊ณ ํ•œ๋‹ค.

Randomized search, the easiest way to find hyperparameters, works by specifying ranges and letting the search run within them; here two options are tuned: [n_estimators, max_depth].

  • RandomizedSearchCV(): draws a random value from each given range and tries only the specified number of combinations, training the RandomForestClassifier within the intervals given for hyperparameters such as max_depth.

    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import randint
    
    param_dists = {'n_estimators': randint(1, 5),
                  'max_depth': randint(5, 10)}
    search = RandomizedSearchCV(estimator = RandomForestClassifier(),
                               n_iter = 5,
                               param_distributions=param_dists)
    search.fit(X_train, y_train)
    
    search.best_params_
    search.score(X_test, y_test)

    n_iter is the total number of iterations.


📌 Training a model from an external CSV

<data์˜ ์ข…๋ฅ˜ 2๊ฐ€์ง€>

  • Structured Data: data that can be represented as a table ex) timeline data
  • Unstructured Data: data that cannot be represented as a table ex) images, video, documents, audio, etc.

ML algorithms are good at analyzing structured data, i.e. table-shaped data, and pandas is used to handle it.

  • read_csv(path): loads CSV data.

    img
    Loading data from a CSV file with pandas
  • groupby(column): groups rows by the distinct values of a given column.

    print(data_df.groupby('target').size())
    #target
    #0    499
    #1    526
    #dtype: int64
  • drop(columns=?, axis=?): returns the result with the given column removed. axis selects the axis: 0 drops rows, 1 drops columns.

    X = data_df.drop(columns="target", axis=1)
    y = data_df["target"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2)
    print(X_train.shape, X_test.shape)
    print(y_train.shape, y_test.shape)
    # (768, 13) (257, 13)
    # (768,) (257,)
    
    rf_cls = RandomForestClassifier()
    rf_cls.fit(X_train, y_train)
    print(accuracy_score(rf_cls.predict(X_test), y_test))
    print(confusion_matrix(y_test, rf_cls.predict(X_test)))

    Since RandomForestClassifier is a classification model, you can see that fit and predict work the same way as before.


Models to train

  • Decision tree: using decision criteria, it keeps branching at forks in the form of a binary tree.
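A decision tree can be trained on the Iris data with the same fit/score interface seen above; a minimal sketch, where max_depth=3 is an arbitrary choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=2)

# Each internal node asks a yes/no question about one feature
# (e.g. "is petal length <= 2.45?"); a sample follows one branch
# until it reaches a leaf, which gives the predicted class.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))   # accuracy on the held-out test data
```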