Bayesian Topic Modelling

Lima-Madrid, 4 Dec 2020

Welcome!

III CONFERENCIA INTERNACIONAL DE PROCESOS ESTOCÁSTICOS, FENÓMENOS ALEATORIOS Y SUS APLICACIONES

“Statistics, mathematics and computing for scientific and industrial development”

Statistics, data, analytics - Shaping the future

Statistics, data, analytics - Challenge, Responsibility

“What great results! That means I can cut costs by replacing the survey analysis team's service with this algorithm.”

Bayesian Topic Modelling - More than a title

Discovering topics in documents: Clustering

Convergence of Big Data and Artificial Intelligence

UC3M: Master in Statistics for Data Science

A day at the Wanda Metropolitano …

Studying and applying statistics since …

More on my website

Two sectors - Two true stories

Parcel delivery
Would you like to make any other comment about your company's services?

Energy
Why are you not fully satisfied with your company?

Impact on the Service

Topics Mentioned

Words Used

Frequent Expressions

“What concrete recommendations do I give the Branch Managers? What should they change?”

Cost Reduction

Topics and Sentiments

Frequent Expressions

Customer View

“I can replace the survey analysis team's service with a cognitive service”

Our Challenge: Car Insurance

Data: Describing the Claim

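AL DAR MARCHA ATRÁS GOLPEÉ A OTRO VEHÍCULO ESTACIONADO (While reversing I hit another parked vehicle)
AL ENTRAR A UN APARCAMIENTO REALIZO GIRO A LA IZQUIERDA Y ME GOLPEO CONTRA UN BOLARDO (Entering a car park I turn left and hit a bollard)
AL SALIR DEL GARAJE HE ROZADO TODO EN LATERAL LADO CONDUCTOR CON UNA COLUMNA (Leaving the garage I scraped the whole driver-side flank against a column)
VOY CIRCULANDO POR LA VIA Y AL ESTORNUDAR PIERDO EL CONTROL DEL VEHÍCULO Y ME GOLPEO CONTRA LA MEDIANA (I am driving along the road and, on sneezing, I lose control of the vehicle and hit the central reservation)
VOY CIRCULANDO POR LA VÍA FRENO EN EL PASO DE CEBRA EL VEHÍCULO CONTRARIO NO FRENA A TIEMPO Y ME GOLPEA (I am driving along the road, I brake at the zebra crossing, the other vehicle does not brake in time and hits me)
CIRCULANDO POR LA VÍA SE CRUZA UN PEATÓN SIN MIRAR Y LE ATROPELLO (Driving along the road, a pedestrian crosses without looking and I run them over)
TRASERA DEL COCHE (Rear of the car)

CAYÓ UNA GRANIZADA TREMENDA Y ME HA ABOLLADO LA CHAPA DEL COCHE Y LOS EMBELLECEDORES DE LAS PUERTAS (There was a tremendous hailstorm and it dented the car's bodywork and the door trims)
DEBIDO A LAS LLUVIAS TORRENCIALES SE INUNDA MI VEHÍCULO (Due to the torrential rains my vehicle is flooded)
ENCUENTRO QUE HAN ROBADO EMBELLECEDORES DE LAS CUATRO LLANTAS (I find that the hubcaps of all four wheels have been stolen)
ROTURA DE CRISTAL POR ROBO (Glass broken in a theft)
ARAÑAZOS EN EL LATERAL IZQUIERDO POR VANDALISMO (Scratches on the left side from vandalism)
ESTOY CIRCULANDO POR LA VIA CUANDO GOLPEO A UN JABALI (I am driving along the road when I hit a wild boar)
CHOQUE FRONTAL CON UNA MACETA (Head-on collision with a flowerpot)
ERROR AL ECHAR GASOLINA AL COCHE EN VEZ DE DIESEL (Mistakenly filled the car with petrol instead of diesel)
EN UNA EMPRESA DE RECOGIDA DE COCHE EN EL AEROPUERTO (At a car pick-up company at the airport)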

Data: Most Used Words

Data: The Usual Problems

Documents with relatively short texts: 14 words per text

Many words shared between texts on different subjects

Human writing: spelling, synonyms, non-native speakers, etc.

Data: Expressions (Bigram Count Network)
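A bigram count network can be built from counts of adjacent word pairs. The sketch below is only an illustration of that counting step: the function name `bigram_counts` and the two toy strings are assumptions, not the pipeline used for the talk.

```python
from collections import Counter

def bigram_counts(documents):
    """Count adjacent word pairs (bigrams) across whitespace-tokenized documents."""
    counts = Counter()
    for text in documents:
        tokens = text.lower().split()
        # zip pairs each token with its successor
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Toy input (illustrative only); each counted pair would become a weighted edge.
docs = [
    "rotura de cristal por robo",
    "rotura de cristal por granizada",
]
edges = bigram_counts(docs)
```

Each key of `edges` is a (word, next word) pair and each value is the edge weight that a network plot could use.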

Model: Latent class model (LCM)

Clustering of high-dimensional categorical data.
LCM: mixture models that assign the set of multivariate categorical observations to a latent class $ z $. Within each $ z $ the observed variables are statistically independent.
LCM estimates the class probability $ \lambda $ and the probability of observing a particular response for a question conditioned on the latent class.

$$ \begin{array}{rcl} \lambda | \alpha & \sim & \textrm{Dirichlet}(\alpha) \textrm{ or } \textrm{Dirichlet Process}(\alpha) \\ z_i | \lambda & \sim & \textrm{Multinomial}(\lambda) \\ U_{j,k} | \beta & \sim & \textrm{Dirichlet}(\beta) \\ X_{i,j} | U_{j,k}, z_i = k & \sim & \textrm{Multinomial}(U_{j,k}) \end{array} $$

$ X $ is the observed variable. There is one $ U_k $ for each cluster ($ 1 \ldots K $) and one cluster assignment for each text ($ 1 \ldots I $). $ \lambda $ gives the relative size of each cluster.
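To make the generative story concrete, here is a minimal simulation sketch of the simple (finite Dirichlet) LCM using only the Python standard library. The function names, the symmetric scalar `alpha` and `beta`, and the toy sizes are assumptions for illustration; it is not the MixDir implementation.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_lcm(I, J, K, R, alpha, beta, seed=0):
    """Simulate I documents with J categorical variables from a K-class LCM."""
    rng = random.Random(seed)
    lam = sample_dirichlet([alpha] * K, rng)          # class proportions lambda
    # U[j][k] is the length-R response distribution of variable j in class k
    U = [[sample_dirichlet([beta] * R, rng) for _ in range(K)] for _ in range(J)]
    z, X = [], []
    for _ in range(I):
        k = rng.choices(range(K), weights=lam)[0]     # latent class z_i
        z.append(k)
        # given the class, the J responses are drawn independently
        X.append([rng.choices(range(R), weights=U[j][k])[0] for j in range(J)])
    return lam, U, z, X

lam, U, z, X = sample_lcm(I=6, J=4, K=3, R=2, alpha=1.0, beta=0.5)
```

With `R=2` each entry of `X` can be read as the absence (0) or presence (1) of word `j` in document `i`.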

Model: Objective

Group the $ I $ documents into $ K $ classes.
Identify "patterns" in the use of the $ J $ words.
Analyse the presence or absence ($ R $ responses) of each word in the document.
$ K $ known or unknown.
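The presence/absence encoding can be sketched as a binary document-by-word matrix. The helper name `presence_matrix`, the three-word vocabulary and the two toy documents are hypothetical; whitespace tokenization is also an assumption.

```python
def presence_matrix(documents, vocabulary):
    """Return one row per document: 1 if the word occurs in it, 0 otherwise."""
    rows = []
    for text in documents:
        tokens = set(text.lower().split())
        rows.append([1 if word in tokens else 0 for word in vocabulary])
    return rows

vocab = ["golpeo", "columna", "robo"]            # hypothetical dictionary (J = 3)
docs = ["golpeo columna al salir",               # toy documents (I = 2)
        "rotura de cristal por robo"]
X = presence_matrix(docs, vocab)
```

Here every word has $ R = 2 $ possible responses, so `X` is the kind of categorical table an LCM clusters.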

Model: who is who

$ \alpha, \beta $ are hyperparameters that govern the sparsity of the model.
$ \lambda | \alpha $ gives the relative sizes of the classes, in either the simple or the nonparametric LCM.
$ z $ contains the latent class assignment of each individual.
$ U $ is a 3-way tensor of size $ J \times K \times R $. It contains the probability of response $ r $ (whether or not the document contains the word) to question $ j $ (the word) for an individual (a document) in class $ k $.
$ X_{i,j} | U_{j,k}, z_i = k $ specifies that the response of an individual $ i $ belonging to class $ k $ to question $ j $ is drawn from a Multinomial distribution with probability vector $ U_{j,k} $.

Model: Bayesian inference (Joint Distribution)

Simple LCM

$$ p(\lambda ,z,U,X | \alpha, \beta) = p( \lambda |\alpha) \prod_{i=1}^{I}p(z_i | \lambda) \prod_{j=1}^{J}\prod_{k=1}^{K} p(U_{j,k} | \beta)\prod_{i=1}^{I} \prod_{j=1}^{J} \prod_{k=1}^{K} p(X_{i,j}|U_{j,k})^{I(z_i=k)}$$
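As a worked consequence of this factorization, and conditioning on $ \lambda $ and $ U $ as if they were known (an assumption made only for this illustration, since the full Bayesian treatment also integrates over them), the probability that document $ i $ belongs to class $ k $ follows from Bayes' rule:

$$ p(z_i = k | X_i, \lambda, U) = \frac{\lambda_k \prod_{j=1}^{J} p(X_{i,j}|U_{j,k})}{\sum_{k'=1}^{K} \lambda_{k'} \prod_{j=1}^{J} p(X_{i,j}|U_{j,k'})} $$

These per-document class probabilities are what a soft clustering reports.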

Nonparametric LCM

$$ p(\lambda ,z,U,X | \alpha, \beta) = \prod_{k=1}^{K_{max}-1}p( \nu_k|\alpha) \prod_{i=1}^{I}p(z_i | \lambda) \prod_{j=1}^{J}\prod_{k=1}^{K} p(U_{j,k} | \beta)\prod_{i=1}^{I} \prod_{j=1}^{J} \prod_{k=1}^{K} p(X_{i,j}|U_{j,k})^{I(z_i=k)}$$
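The product over $ K_{max}-1 $ variables $ \nu_k $ suggests a truncated stick-breaking representation of the Dirichlet process, in which the class weights $ \lambda $ are a deterministic function of the $ \nu_k $. The sketch below assumes that standard construction; the function name and the example fractions are illustrative, not taken from the talk.

```python
def stick_breaking(nu):
    """Map K_max - 1 stick fractions nu_k in (0, 1) to K_max class weights."""
    weights, remaining = [], 1.0
    for v in nu:
        weights.append(v * remaining)   # break off a fraction v of what is left
        remaining *= 1.0 - v
    weights.append(remaining)           # the last class takes the rest of the stick
    return weights

# Example with K_max = 5: four fractions give five weights that sum to 1.
lam = stick_breaking([0.5, 0.5, 0.5, 0.5])
```

Fractions close to 1 concentrate the mass in the first few classes, which is how a prior on the $ \nu_k $ can discourage new classes.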

MixDir - Scalable Variational Bayes

C. Ahlmann-Eltze and C. Yau, MixDir: Scalable Bayesian Clustering for High-Dimensional Categorical Data, 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 2018, pp. 526-539.

R package
Uses variational inference = scalable = big data.
It can:
- handle missing data
- infer a reasonable number of latent classes
- cluster datasets with more than 70,000 observations and 60 features
- propagate uncertainty and produce a soft clustering

Let's get to work

Prepare the data: cleaning, spelling, tokenization, etc. (12k+ documents)
Build a 400-word dictionary by tf-idf
Select 4000 documents
Run MixDir with different configurations:
- Infer the number of latent classes $ k=(5,50,100,200) \times 5 $
- Assess the consistency of the clustering
- Assess the effect of the prior distribution
Select a model and analyse the results
- Assess the size of each class and the probability of belonging to it
- Identify the most important words
- Make predictions
Compare with LDA
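The dictionary step can be sketched as ranking words by a tf-idf score and keeping the top ones. The slides do not say how the scores were aggregated across documents, so summing tf-idf over the corpus is an assumption here, as are the function name and the toy documents.

```python
import math
from collections import Counter

def tfidf_dictionary(documents, size):
    """Return the `size` words with the highest tf-idf score summed over documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # document frequency: in how many documents each word appears
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] += tf * idf
    # highest score first; ties broken alphabetically for reproducibility
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:size]

docs = ["golpeo una columna", "golpeo un bolardo", "golpeo otro coche"]
dictionary = tfidf_dictionary(docs, size=3)
```

A word that appears in every document, such as "golpeo" in this toy corpus, gets an idf of zero and drops out of the dictionary.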

Shall we look at the results?

On the Number of Classes: Consistency

$ ARI $ = Adjusted Rand Index, used to compare clustering results. For each document the model estimates the probability of belonging to each class (soft clustering). To compute the $ ARI $, each document was assigned to its most probable class.
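The two steps described above, hard assignment followed by the Adjusted Rand Index, can be sketched with the standard library alone. The function names are illustrative, and the sketch assumes at least two documents.

```python
from collections import Counter
from math import comb

def hard_assign(probabilities):
    """Assign each document to the class with the highest probability."""
    return [max(range(len(p)), key=p.__getitem__) for p in probabilities]

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two hard clusterings of the same documents."""
    n = len(labels_a)
    # pairs of documents grouped together in both, in A only, and in B only
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)      # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                  # degenerate case, e.g. one single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

run_1 = hard_assign([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]])
run_2 = hard_assign([[0.2, 0.8], [0.4, 0.6], [0.7, 0.3], [0.6, 0.4]])
score = adjusted_rand_index(run_1, run_2)
```

The index ignores how the classes are named: the two runs above use swapped labels but describe the same partition.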

On the Number of Classes: Hyperparameter

The alluvial plot shows how the grouping changes as the penalty on creating new classes increases ($ \alpha_1 $ in the Dirichlet process prior).

As a reference, we use $ K_{max} = 5 $. Small classes merge as the value of the parameter grows.

Classes Identified

We select k=5 latent classes.
lambda is the vector of class probabilities.
The method produces probabilistic assignments of individuals to the latent classes.
Top: number of documents per class.
Left: clustering heatmap

Length and Class

Length (number of words, excluding stopwords) of the documents in each class.
Length is not used directly in the model.
Unsupervised clustering can help uncover interesting underlying structures. One must be careful not to over-interpret the data …

Class B

Word	Probability	Word	Probability
parcial	0.981	entra	0.938
estrecha	0.978	galibo	0.938
sufro	0.974	indicativo	0.938
total	0.974	medidas	0.938
descubro	0.962	plegar	0.938
descubre	0.962	quiera	0.938
pilar	0.962	verifico	0.938
contacto	0.958	ayuntamiento	0.938
ascensor	0.938	rayado	0.911
chipiona	0.938	intento	0.887

Predictive features for each class: the words that maximize the probability of class $ k $ $$ \arg\max_{X_j=r} p(z=k|X_j=r) $$
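Under the model's notation, and assuming the remaining words are treated as unobserved so that they drop out of the product (an assumption, not something the slides state), the probability being maximized follows from Bayes' rule with the class sizes $ \lambda $ and the response probabilities $ U $:

$$ p(z=k|X_j=r) = \frac{\lambda_k U_{j,k,r}}{\sum_{k'=1}^{K} \lambda_{k'} U_{j,k',r}} $$

A value close to 1 means that observing response $ r $ for word $ j $ is almost enough on its own to place a document in class $ k $.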

Class D

Word	Probability	Word	Probability
semaforo	0.999	señal	0.987
stop	0.999	incorpora	0.986
frena	0.997	parada	0.985
rojo	0.997	perpendicular	0.984
respeta	0.994	distancia	0.984
verde	0.993	parado	0.983
viene	0.991	parados	0.982
sale	0.990	paro	0.981
carriles	0.988	ambar	0.980
preferencia	0.987	interseccion	0.979


Class E

Word	Probability	Word	Probability
lunas	0.999	comprobado	0.968
percance	0.996	limpiaparabrisas	0.968
robar	0.984	sufrido	0.964
rotura	0.983	cuelan	0.958
inundaciones	0.981	ello	0.958
bolso	0.976	capot	0.958
forzada	0.974	chinazo	0.958
raiz	0.968	granizada	0.953
abolladuras	0.968	abriendo	0.938
sustraido	0.968	aparco	0.938


Most influential words

The loss of information is measured with the Jensen-Shannon divergence: how much the quality of the clustering degrades if the word is removed.
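For reference, the Jensen-Shannon divergence between two discrete distributions can be sketched as below. How it is applied per word (which two distributions are compared when a word is removed) is not spelled out in the slides, so only the divergence itself is shown; the function names are illustrative.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

same = js_divergence([0.2, 0.8], [0.2, 0.8])       # identical distributions
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # no overlap at all
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), which makes words easy to rank.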

Relationship between words

Relationship between the most influential words

Labelling

Contains	A	B	C	D	E
rotura + lunas	0.000	0.000	0.000	0.000	1.000
granizo	0.001	0.001	0.205	0.050	0.744
granizos	0.029	0.029	0.029	0.029	0.886
daños + granizos	0.029	0.029	0.029	0.029	0.886
daños + granizada	0.001	0.043	0.001	0.001	0.953
motocicleta	0.112	0.288	0.051	0.260	0.289
golpeo + motocicleta + estacionamiento	0.938	0.000	0.000	0.062	0.000
salir + garaje + encuentro	0.002	0.996	0.000	0.001	0.000
stop	0.000	0.000	0.000	0.999	0.000
ceda + peaton + paro	0.007	0.000	0.000	0.993	0.000

Contains	A	B	C	D	E
columna	0.467	0.523	0.000	0.000	0.010
golpe	0.112	0.288	0.051	0.260	0.289
golpeo + columna	1.000	0.000	0.000	0.000	0.000
columna + roce	0.218	0.782	0.000	0.000	0.000
intento	0.050	0.887	0.000	0.005	0.058
estacionamiento	0.569	0.295	0.001	0.135	0.001
tentativa + robo	0.000	0.974	0.000	0.000	0.026
intento + robo	0.003	0.984	0.000	0.000	0.012
retrovisores + robo	0.000	0.795	0.000	0.000	0.204
rotura	0.016	0.000	0.000	0.000	0.983
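One way to read these tables is as class probabilities for a new text that contains the listed words. The sketch below assumes that only the listed words are conditioned on, that all other words are treated as unobserved, and that words outside the dictionary are ignored. The function name, the two-class `lam` and the `word_prob` values are hypothetical, not the fitted model.

```python
def class_posterior(included_words, lam, word_prob):
    """p(z = k | listed words present), treating every other word as unobserved."""
    scores = list(lam)                               # start from the class sizes
    for word in included_words:
        if word in word_prob:                        # unknown words are skipped
            # multiply by p(word present | class k) for each class
            scores = [s * p for s, p in zip(scores, word_prob[word])]
    total = sum(scores)
    return [s / total for s in scores]

lam = [0.5, 0.5]                                     # hypothetical class sizes
word_prob = {                                        # hypothetical p(word present | class)
    "stop": [0.01, 0.30],
    "robo": [0.20, 0.02],
}
with_stop = class_posterior(["stop"], lam, word_prob)
unknown = class_posterior(["motocicleta"], lam, word_prob)
```

In this sketch a word the model has never seen leaves the class sizes unchanged, while a discriminative word moves almost all the probability to one class.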

Goal achieved

Next steps?

The truth is …

“There’s a huge difference between building a Jupyter notebook model in the lab and deploying a production system that generates business value.”

Andrew Ng. Stanford University. Coursera & deeplearning.ai co-founder

“Kaggle is to real-life machine learning as chess is to war. Intellectually challenging and great mental exercise, but you don’t know, man! You weren’t there!”

Lukas Vermeer. Director of Experimentation. Booking.com

Many more applications …

NLP will Shape the Enterprise Industry in 2021. Analytics Insight, November 27, 2020.

" … MarketsandMarkets predicts that the NLP market size will grow from USD 10.2 billion in 2019 to USD 26.4 billion by 2024. "


Let's Keep in Touch


		@RavinesRomy
		https://github.com/RavinesRomy
		https://ravinesromy.org
		Madrid, Spain

References

David B. Dunson and Chuanhua Xing. Nonparametric Bayes modeling of multivariate categorical data. Journal of the American Statistical Association, 104(487):1042–1051, 2009. doi: 10.1198/jasa.2009.tm08439.
C. Ahlmann-Eltze and C. Yau. MixDir: Scalable Bayesian clustering for high-dimensional categorical data. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), pages 526–539, 2018. doi: 10.1109/DSAA.2018.00068.

Photos from Unsplash:
Erik Mclean, Serj Sakharovskiy, Inma Santiago, Noah Buscher, Kelly Sikkema, Clint Patterson
Photos from Pixabay:
Hans Braxmeier, Marcel Langthim, Alexandre C. Fukugava, Devanath, ChrisFiedler, Dirk Wouters, Jeon Sang-O