Predicting Bitcoin's Future Price with an LSTM

Feb 6, 2020

This post continues the series on using LSTM networks for prediction problems. This time the task is to predict the future price of the cryptocurrency Bitcoin.

If you missed it, have a look at the earlier post: Sentiment analysis of comments with an LSTM network in Keras.

The code is adapted from work by @Abhinav Sagar.

Loading the machine learning libraries

These are all the usual key libraries: Keras, pandas, seaborn, numpy, matplotlib… From Keras I take the LSTM layer.

%tensorflow_version 2.x
import json
import requests
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, LSTM
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import mean_absolute_error
# sklearn's train_test_split is not imported: a chronological split with the same name is defined below
%matplotlib inline

Crawling data from the cryptocurrency comparison site CryptoCompare

I pull the daily price history from CryptoCompare with the requests library. Crawling is straightforward because CryptoCompare exposes an API that saves us the hard work. You only need to change these parameters:

fsym=BTC: replace BTC with the symbol of another coin;
tsym=CAD: the currency to convert into, such as USD, VND….

endpoint = 'https://min-api.cryptocompare.com/data/histoday'
res = requests.get(endpoint + '?fsym=BTC&tsym=CAD&limit=500')
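
Switching to another coin or quote currency only changes the query string. Here is a minimal sketch, assuming a hypothetical helper `build_histoday_url` (it is not part of the original code) that assembles the same URL from parameters:

```python
from urllib.parse import urlencode

ENDPOINT = 'https://min-api.cryptocompare.com/data/histoday'

def build_histoday_url(fsym='BTC', tsym='CAD', limit=500):
    """Hypothetical helper: build the histoday URL for a coin/currency pair."""
    query = urlencode({'fsym': fsym, 'tsym': tsym, 'limit': limit})
    return f'{ENDPOINT}?{query}'

# The same request as in the article, then Ethereum priced in US dollars.
btc_cad_url = build_histoday_url()
eth_usd_url = build_histoday_url(fsym='ETH', tsym='USD')
print(btc_cad_url)
print(eth_usd_url)
```

Either URL can then be passed to requests.get in place of the hand-written string.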

Note: the values fetched from CryptoCompare are for study and research reference only, not for real-world financial use.

If you look closely, the request returns JSON, and the Data field holds the information we care about. It provides several important values: the timestamp (time), the closing price (close), the day's high (high), the day's low (low) and the opening price (open).

Next I load the Data field into a DataFrame with pandas. Note that the variable to predict is the close column, Bitcoin's closing price.

hist = pd.DataFrame(json.loads(res.content)['Data'])
hist = hist.set_index('time')
hist.index = pd.to_datetime(hist.index, unit='s')
target_col = 'close'
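
To make the reshaping concrete, here is a small sketch that applies the same three steps to a hand-written payload. The numbers in `sample_payload` are made up for illustration and only mimic the fields described above; they are not real CryptoCompare output.

```python
import pandas as pd

# Made-up payload with the same field names as the API's Data entries.
sample_payload = {'Data': [
    {'time': 1580774400, 'open': 100.0, 'high': 110.0, 'low': 95.0, 'close': 105.0},
    {'time': 1580860800, 'open': 105.0, 'high': 112.0, 'low': 101.0, 'close': 108.0},
]}

sample_hist = pd.DataFrame(sample_payload['Data'])
sample_hist = sample_hist.set_index('time')
# Unix timestamps in seconds become a DatetimeIndex.
sample_hist.index = pd.to_datetime(sample_hist.index, unit='s')
print(sample_hist)
```

Each row ends up indexed by its date, with one column per price field.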

Splitting the dataset

We split the collected data into two sets: train and test.

def train_test_split(df, test_size=0.2):
    split_row = len(df) - int(test_size * len(df))
    train_data = df.iloc[:split_row]
    test_data = df.iloc[split_row:]
    return train_data, test_data

Here the most recent 20% of the data is held out for testing.

train, test = train_test_split(hist, test_size=0.2)
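
Because this is a time series, the split is chronological rather than shuffled. A quick sketch on a toy frame shows which rows land where; `chronological_split` repeats the logic of train_test_split above under a different name purely to keep the example standalone.

```python
import pandas as pd

def chronological_split(df, test_size=0.2):
    """Same logic as the article's train_test_split: the last rows form the test set."""
    split_row = len(df) - int(test_size * len(df))
    return df.iloc[:split_row], df.iloc[split_row:]

toy = pd.DataFrame({'close': range(10)})
toy_train, toy_test = chronological_split(toy, test_size=0.2)
print(len(toy_train), len(toy_test))
print(toy_test['close'].tolist())
```

The test set is always the tail of the series, so the model is evaluated on dates it has not seen.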

To see the result more clearly, we can write a helper function that plots it.

def line_plot(line1, line2, label1=None, label2=None, title='', lw=2):
    fig, ax = plt.subplots(1, figsize=(13, 7))
    ax.plot(line1, label=label1, linewidth=lw)
    ax.plot(line2, label=label2, linewidth=lw)
    ax.set_ylabel('price [CAD]', fontsize=14)
    ax.set_title(title, fontsize=16)
    ax.legend(loc='best', fontsize=16);

Then call it to see the result:

line_plot(train[target_col], test[target_col], 'training', 'test', title='')

The orange-yellow segment at the end is the region used to test how the model performs. Strictly speaking this is the held-out test set, though you could also call it a validation set.

Data preprocessing

We build two functions to normalise the data, plus a helper that cuts it into sliding windows.

def normalise_zero_base(df):
    return df / df.iloc[0] - 1

def normalise_min_max(df):
    return (df - df.min()) / (df.max() - df.min())

def extract_window_data(df, window_len=5, zero_base=True):
    window_data = []
    for idx in range(len(df) - window_len):
        tmp = df[idx: (idx + window_len)].copy()
        if zero_base:
            tmp = normalise_zero_base(tmp)
        window_data.append(tmp.values)
    return np.array(window_data)
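
A small worked example may help here. The sketch below applies both normalisations to a toy price column (made-up numbers) and then cuts the frame into windows the same way extract_window_data does, without the per-window normalisation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'close': [100.0, 110.0, 90.0, 120.0]})

# Zero-base: every value becomes its relative change from the first row.
zero_based = toy / toy.iloc[0] - 1

# Min-max: values are rescaled into the range [0, 1].
min_max = (toy - toy.min()) / (toy.max() - toy.min())

print(zero_based['close'].round(3).tolist())
print(min_max['close'].round(3).tolist())

# Sliding windows: one array of shape (window_len, n_columns) per start index.
window_len = 2
windows = np.array([toy[i:i + window_len].values
                    for i in range(len(toy) - window_len)])
print(windows.shape)
```

The window array has the (samples, timesteps, features) layout that a Keras LSTM layer expects.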

Next we split the data into model inputs and targets, using the extract_window_data function built above.

def prepare_data(df, target_col, window_len=10, zero_base=True, test_size=0.2):
    train_data, test_data = train_test_split(df, test_size=test_size)
    X_train = extract_window_data(train_data, window_len, zero_base)
    X_test = extract_window_data(test_data, window_len, zero_base)
    y_train = train_data[target_col][window_len:].values
    y_test = test_data[target_col][window_len:].values
    if zero_base:
        y_train = y_train / train_data[target_col][:-window_len].values - 1
        y_test = y_test / test_data[target_col][:-window_len].values - 1

    return train_data, test_data, X_train, X_test, y_train, y_test
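
The slicing in prepare_data is easy to misread, so here is a sketch of how windows and targets line up, using a made-up price series. Each window of window_len prices is paired with the price right after it, expressed as a relative change from the first price of that window.

```python
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0, 110.0, 108.0, 112.0])  # made-up values
window_len = 3

n_windows = len(prices) - window_len
# Same arithmetic as the zero_base branch of prepare_data.
targets = prices[window_len:] / prices[:-window_len] - 1

for i in range(n_windows):
    window = prices[i:i + window_len]
    print(window, '->', prices[i + window_len], f'(target {targets[i]:+.3f})')
```

There are as many targets as windows, which is why both X and y lose window_len rows compared with the raw data.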

Building the LSTM model

In Keras we can write a function that builds this model.

def build_lstm_model(input_data, output_size, neurons=100, activ_func='linear',
                     dropout=0.2, loss='mse', optimizer='adam'):
    model = Sequential()
    model.add(LSTM(neurons, input_shape=(input_data.shape[1], input_data.shape[2])))
    model.add(Dropout(dropout))
    model.add(Dense(units=output_size))
    model.add(Activation(activ_func))

    model.compile(loss=loss, optimizer=optimizer)
    return model
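
To get a feel for the size of this network, the sketch below counts its trainable weights with the usual formula for a standard LSTM layer (four gates, each with input weights, recurrent weights and a bias). The value n_features = 6 is an assumption about how many columns the DataFrame has; calling model.summary() on the real model would show the actual figure.

```python
def lstm_param_count(units, n_features):
    """Weights of a standard LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (n_features + units) + units)

def dense_param_count(n_inputs, n_outputs):
    """Weights of a fully connected layer: one weight per input/output pair plus biases."""
    return n_inputs * n_outputs + n_outputs

n_features = 6  # assumption: close, high, low, open and two volume columns
lstm_weights = lstm_param_count(100, n_features)
dense_weights = dense_param_count(100, 1)
print(lstm_weights, dense_weights, lstm_weights + dense_weights)
```

Dropout and Activation add no weights, so nearly all of the capacity sits in the single LSTM layer.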

We set the input parameters:

np.random.seed(42)
window_len = 5
test_size = 0.2
zero_base = True
lstm_neurons = 100
epochs = 20
batch_size = 32
loss = 'mse'
dropout = 0.2
optimizer = 'adam'

Then we use prepare_data to produce the inputs for the LSTM model:

train, test, X_train, X_test, y_train, y_test = prepare_data(
    hist, target_col, window_len=window_len, zero_base=zero_base, test_size=test_size)

Finally, we build and train the LSTM model:

model = build_lstm_model(
    X_train, output_size=1, neurons=lstm_neurons, dropout=dropout, loss=loss,
    optimizer=optimizer)

history = model.fit(
    X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=1, shuffle=True)

Evaluating the model

To assess the model's accuracy, we use mean_absolute_error to measure its error on the test set.

targets = test[target_col][window_len:]
preds = model.predict(X_test).squeeze()
mean_absolute_error(y_test, preds)
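
For reference, the mean absolute error is just the average of the absolute differences between targets and predictions. A minimal sketch with made-up numbers, computed by hand with numpy:

```python
import numpy as np

y_true = np.array([0.01, -0.02, 0.03])  # made-up normalised targets
y_pred = np.array([0.02, -0.01, 0.01])  # made-up model outputs

# Average absolute difference between each target and its prediction.
mae = np.mean(np.abs(y_true - y_pred))
print(round(mae, 4))
```

Since the targets are zero-base normalised, the error is a relative change rather than an amount in CAD.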

The prediction error is only about 2%. It is measured on the zero-base normalised values, so it is relative to the first price of each window. Let's plot the predictions against the actual values to see how they compare.

preds = test[target_col].values[:-window_len] * (preds + 1)
preds = pd.Series(index=targets.index, data=preds)
line_plot(targets, preds, 'actual', 'prediction', lw=3)
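
The first line of the snippet above undoes the zero-base normalisation so that predictions are back in price units. The round trip can be checked on a made-up series:

```python
import numpy as np

window_len = 2
prices = np.array([200.0, 210.0, 220.0, 231.0, 242.0])  # made-up values

base = prices[:-window_len]                  # first price of each window
normalised = prices[window_len:] / base - 1  # what the model is trained to output

# Invert the normalisation: multiply the base price by (1 + relative change).
restored = base * (normalised + 1)
print(np.allclose(restored, prices[window_len:]))
```

With real predictions in place of `normalised`, the same multiplication gives the predicted prices that are plotted.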

WOW! As you can see on the chart, the yellow prediction line tracks the blue actual line quite closely.

Summary

This post was written on 6 February 2020. You can use this model to keep predicting the following days and see how Bitcoin's price moves.
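
One way to do that, sketched under assumptions: take the last window_len rows, normalise them exactly as in training, and reshape them into a single-sample batch. The helper `latest_window` and the demo numbers are hypothetical, and the final prediction step is left as a comment because it needs the trained model.

```python
import numpy as np
import pandas as pd

def latest_window(df, window_len=5):
    """Hypothetical helper: build one zero-base normalised input from the newest rows."""
    window = df.iloc[-window_len:]
    normalised = window / window.iloc[0] - 1
    # Add a leading batch axis: (1, window_len, n_features).
    return normalised.values[np.newaxis, ...], window.iloc[0]

demo = pd.DataFrame({
    'close': [100.0, 102.0, 104.0, 103.0, 105.0, 107.0],  # made-up values
    'open': [99.0, 100.0, 102.0, 104.0, 103.0, 105.0],
})
X_next, base = latest_window(demo, window_len=5)
print(X_next.shape)

# With the trained model, the next close would then be estimated as:
# next_close = base['close'] * (model.predict(X_next).squeeze() + 1)
```

The columns of the input frame have to match those used for training, in the same order.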