@upidea
upidea / gist:4e036f3749bff630574743981dd7fa84
Created August 16, 2022 03:32
Bash snippet: assemble multiple strings into a single arguments variable for clickhouse-client.
#!/bin/bash
export PATH=/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
# Build the clickhouse-client connection arguments from environment variables,
# falling back to a local default when nothing is configured.
if [[ $ENGINE_CORE_CLICKHOUSE_USER != '' ]]; then
    CLARG="--host ${ENGINE_CORE_CLICKHOUSE_HOST} --port ${ENGINE_CORE_CLICKHOUSE_PORT} --user ${ENGINE_CORE_CLICKHOUSE_USER} --password ${ENGINE_CORE_CLICKHOUSE_PASSWORD}"
elif [[ $ENGINE_CORE_CLICKHOUSE_HOST != '' ]]; then
    CLARG="--host ${ENGINE_CORE_CLICKHOUSE_HOST} --port ${ENGINE_CORE_CLICKHOUSE_PORT}"
else
    CLARG="--host 127.0.0.1 --port 9000"
fi
@upidea
upidea / tfidf
Created March 3, 2020 01:22
tfidf
# tf-idf (term frequency - inverse document frequency)
# Commonly used to mine keywords from a document.
# Within one document, a high value means the term is highly discriminative there:
# it recurs in that document but appears rarely across the whole corpus (inverse document frequency).
# A value that is high across the whole corpus has no special meaning and is not suited to cross-document comparison.
# Term-frequency vectorization
from sklearn.feature_extraction.text import CountVectorizer
# The token_pattern parameter controls how text is split into tokens,
# e.g. r"(?u)\b\w+\b" to keep single-character tokens (the sklearn default is r"(?u)\b\w\w+\b").
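A minimal sketch of the idea above using sklearn's TfidfVectorizer; the toy corpus and the token_pattern choice are illustrative assumptions, not part of the original gist.
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, assumed for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Keep single-character tokens too; the sklearn default pattern drops them.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf = vectorizer.fit_transform(corpus)

# Highest-weighted term per document: frequent locally, rare corpus-wide.
terms = vectorizer.get_feature_names_out()  # sklearn >= 1.0
for row in tfidf.toarray():
    print(terms[row.argmax()], round(row.max(), 3))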
# Format data: render `data` (a bytes-like object of `length` bytes) as text,
# one little-endian 16-bit word sampled every 16 bytes, 256 bytes per output line.
out = "\n".join([
    " ".join([
        f"{data[i+1]:02x}{data[i]:02x}"
        for i in range(line, min(line + 256, length), 16)
    ])
    for line in range(0, length, 256)
])
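A runnable version of the same expression; the random `data` buffer below is an assumption, since the gist defines `data` and `length` elsewhere.
import os

data = os.urandom(1024)   # assumed sample input
length = len(data)

out = "\n".join(
    " ".join(
        f"{data[i + 1]:02x}{data[i]:02x}"
        for i in range(line, min(line + 256, length), 16)
    )
    for line in range(0, length, 256)
)
print(out)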
import mmap
import struct

# struct.unpack always returns a tuple, so this prints (1148112800,)
print(struct.unpack('<i', b'\xa0\xcf\x6e\x44'))  # little-endian: 1148112800
print(struct.unpack('>i', b'\x95\x6b\x31\x93'))  # big-endian: -1788137069

# Memory-map the file 'tmp' read-only and search it for a byte pattern.
with open('tmp', 'rb', 0) as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    pos = s.find(b'\x64\x65')
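Continuing the snippet, a sketch that reads a value at the found offset; the file name, the marker bytes, and the assumption that a little-endian 32-bit int follows the marker are all illustrative.
import mmap
import struct

with open('tmp', 'rb', 0) as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    pos = s.find(b'\x64\x65')              # offset of the marker, or -1
    if pos != -1 and pos + 6 <= len(s):
        # Unpack the little-endian 32-bit int right after the 2-byte marker.
        (value,) = struct.unpack_from('<i', s, pos + 2)
        print(hex(pos), value)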
@upidea
upidea / to_category.py
Created January 29, 2019 03:20
One-hot encoding function in numpy
import numpy

def dense_to_one_hot(labels_dense, num_classes):
    """Convert class labels from scalars to one-hot vectors."""
    num_labels = labels_dense.shape[0]
    index_offset = numpy.arange(num_labels) * num_classes
    labels_one_hot = numpy.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    return labels_one_hot
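A quick usage check; the labels below are made up for illustration.
labels = numpy.array([0, 2, 1])
print(dense_to_one_hot(labels, num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]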
Looks like this is a piece of code.
# Shuffle and split the data directly with numpy
# (imports assumed: these helpers come from Keras; `sequences`,
# MAX_SEQUENCE_LENGTH and `labels` are defined earlier in the gist)
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import numpy as np

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))
print('Shape of Data Tensor:', data.shape)
print('Shape of Label Tensor:', labels.shape)
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
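The listing cuts off after the shuffle; a sketch of the usual continuation, with VALIDATION_SPLIT as an assumed hold-out fraction.
VALIDATION_SPLIT = 0.2   # assumed

data = data[indices]
labels = labels[indices]
num_validation = int(VALIDATION_SPLIT * data.shape[0])

x_train, y_train = data[:-num_validation], labels[:-num_validation]
x_val, y_val = data[-num_validation:], labels[-num_validation:]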
@upidea
upidea / keras_precision_recall.py
Created January 4, 2019 02:19
Callback for Keras fit that calculates precision and recall.
import tensorflow as tf

class Metrics(tf.keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.confusion = []
        self.precision = []
        self.recall = []
        self.f1s = []
        self.kappa = []
        self.auc = []

    def on_epoch_end(self, epoch, logs={}):
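The listing truncates at on_epoch_end; a sketch of a typical body follows. Passing the validation set in explicitly is an assumption here, since newer tf.keras no longer exposes self.validation_data to callbacks.
import numpy as np
import tensorflow as tf
from sklearn import metrics

class Metrics(tf.keras.callbacks.Callback):
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val, self.y_val = x_val, y_val

    def on_train_begin(self, logs=None):
        self.precision, self.recall = [], []

    def on_epoch_end(self, epoch, logs=None):
        # Predicted and true class indices on the validation set.
        y_pred = np.argmax(self.model.predict(self.x_val), axis=1)
        y_true = np.argmax(self.y_val, axis=1)
        self.precision.append(metrics.precision_score(y_true, y_pred, average='macro'))
        self.recall.append(metrics.recall_score(y_true, y_pred, average='macro'))
        print(f" - val_precision: {self.precision[-1]:.4f}"
              f" - val_recall: {self.recall[-1]:.4f}")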
@upidea
upidea / everydayenglish.py
Created January 3, 2019 07:39
Generate Anki flashcards from scraped web content.
import os
import re
import requests
import json
import time
import datetime
import genanki
my_model = genanki.Model(
    201901021920,
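The model definition is cut off in the listing; a self-contained sketch of how it might be completed with genanki follows. The model name, fields, templates, deck ID, and note contents are illustrative assumptions; only the model ID 201901021920 comes from the gist.
import genanki

my_model = genanki.Model(
    201901021920,
    'Everyday English',              # model name assumed
    fields=[{'name': 'Front'}, {'name': 'Back'}],
    templates=[{
        'name': 'Card 1',
        'qfmt': '{{Front}}',
        'afmt': '{{FrontSide}}<hr id="answer">{{Back}}',
    }])

my_deck = genanki.Deck(2019010219, 'Everyday English')   # deck ID assumed
my_deck.add_note(genanki.Note(model=my_model,
                              fields=['phrase of the day', 'its meaning']))
genanki.Package(my_deck).write_to_file('everydayenglish.apkg')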
-- Oracle to_char vs. Hive/Spark date_format (Java SimpleDateFormat patterns,
-- where MM is month and mm is minutes):
select to_char('2018-04-26 22:23:40', 'yyyyMMdd');
select date_format('2018-04-26 22:23:40', 'yyyyMMdd');
select date_format('2018-04-26 22:23:40', 'yyyy-MM-dd HH:mm:ss');
select to_char('2018-04-26 22:23:40', 'yyyy-MM-dd hh24:mi:ss');
select date_format(to_unix_timestamp(nvl('2018-04-26 22:23:40', '')), 'yyyyMMdd');
-- Pattern case fixed below: mm would mean minutes here, MM is month.
select from_unixtime(unix_timestamp('20171205 22:23:40', 'yyyyMMdd HH:mm:ss'), 'yyyy-MM-dd HH:mm:ss');
@upidea
upidea / SparkGibbsLDA.scala
Created July 1, 2018 02:17 — forked from waleking/SparkGibbsLDA.scala
We implement Gibbs sampling for LDA in Spark. This version performs much better than the alpha version and can now handle 3,196,204 words, 100 topics, and 1,000 sampling iterations on a server in 161.7 minutes. To fix the long-running collect() step of the alpha version, we use cache() at lines 261 and 262. We also solve a pile o…
package topic
import spark.broadcast._
import spark.SparkContext
import spark.SparkContext._
import spark.RDD
import spark.storage.StorageLevel
import scala.util.Random
import scala.math.{ sqrt, log, pow, abs, exp, min, max }
import scala.collection.mutable.HashMap