Machine Learning in Network Security and Gene

Machine Learning Research and Big Data Analytics
A Centre of Excellence under FAST, MHRD
Dhruba K Bhattacharyya, FIETE
Professor, CSE
Tezpur University
Group Members
• Prof Dhruba K Bhattacharyya (PI)
• Prof Shyamanta M Hazarika (Co-PI)
• Prof Utpal Sharma (Co-PI)
• Prof Nityananda Sarma (Co-PI)
• Dr Swarnajyoti Patra (Member)
• Dr B Borah (Member)
• Dr Sanjib Deka (Member)
• Dr Siddharta S Satapathy (Member)
• Dr Rajib Goswami (Member)
• Mr Debojit Boro (Member)
• Ms Sanghamitra Nath (Member)
Thrust Areas
• Machine Learning
• Network Security
• Natural Language Processing
• Robotics
• Bio-informatics
• Cognitive Radio Networking
• Multi-spectral and Hyper-spectral Satellite Data Processing
Summary of Achievements
Sl No | Publications/facilities created | 2013 | 2014 | 2015 | 2016 | Remarks
01 | Books Authored | -- | -- | 01 | 01 | DKB and J Kalita (USA)
02 | Journal Papers | 15 | 18 | 16 | 12 |
03 | Conference Papers | 03 | 07 | 05 | 08 |
04 | Book Chapters | 02 | 02 | -- | 02 |
05 | Software Tools Developed | 01 | 03 | 02 | 03 |
06 | Laboratories Established | -- | -- | 05 | 01 | Net Security, NLP, Robotics, CRN, Bioinformatics; HPCC Lab in association with CDAC, Pune (01 cluster and 03 Param Shavak)
07 | Workshops Organized | 01 | -- | -- | 01 |
08 | PhDs Awarded | -- | 03 | 04 | 02 | Partly supported
Some Achievements :
Network Security
• Development of a tool called TUCANNON+ to (i) capture traffic, (ii) launch DDoS attacks of all types, and (iii) monitor packet and flow traffic.
• Development of a test-bed for attack traffic simulation, capturing, monitoring and validating defense methods.
• Development of defense methods for both low-rate and high-rate DDoS attacks using statistical and information theoretic measures.
• Development of an effective correlation measure to discriminate DDoS attack traffic from legitimate traffic.
• Development of a real-time defense implemented on hardware to detect both low-rate and high-rate DDoS attacks.
• Development of an effective defense to counter XSS attacks.
Network Security Test-bed & TUCANNON+
The tool has two components: a server program and a client program.
The server program provides a user interface through which one can
specify parameters such as protocol, SIP type, attack pattern and
attack strength (in terms of threads). The client program in turn
generates the attack traffic based on these specifications.
R C Baishya, N Hoque and D.K. Bhattacharyya. DDoS Attack Detection Using Unique Source IP Deviation. In
the Journal of Network Security, November 2016 (in press).
TUCANNON+: Network Traffic Monitoring Tool (TUMONITOR)
The tool allows the user to observe a set of selected features, viz. packet count per interval, protocol-specific packets per
interval, TCP-flag-specific packets per interval, and number of unique source IP addresses per interval. The user can also
monitor the value of an arithmetic expression involving a subset of these features. A researcher can use the tool to
understand traffic under different conditions. TUMONITOR is not an IDS; however, a network administrator can use it
to keep an eye on the traffic passing through the monitoring point.
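The per-interval features listed above can be illustrated with a small sketch. This is not TUMONITOR's code: the packet tuple layout, the feature names and the helper functions are assumptions made only for illustration.

```python
from collections import defaultdict

def interval_features(packets, interval=1.0):
    """Aggregate features per time interval.

    packets: iterable of (timestamp, src_ip, protocol, tcp_flags) tuples
    (a hypothetical layout, not the tool's actual capture format).
    """
    buckets = defaultdict(lambda: {"pkts": 0, "tcp": 0, "udp": 0, "syn": 0, "src": set()})
    for ts, src_ip, proto, flags in packets:
        b = buckets[int(ts // interval)]
        b["pkts"] += 1                 # packet count per interval
        if proto in ("tcp", "udp"):
            b[proto] += 1              # protocol-specific packet count
        if "S" in flags:
            b["syn"] += 1              # packets with the SYN flag set
        b["src"].add(src_ip)           # collect distinct source addresses
    return {
        k: {"pkts": b["pkts"], "tcp": b["tcp"], "udp": b["udp"],
            "syn": b["syn"], "uniq_src": len(b["src"])}
        for k, b in sorted(buckets.items())
    }

def monitor_expression(features, expr):
    """Evaluate a user-supplied arithmetic expression of the features per interval."""
    return {k: expr(f) for k, f in features.items()}

packets = [
    (0.1, "10.0.0.1", "tcp", "S"),
    (0.4, "10.0.0.2", "tcp", "S"),
    (0.9, "10.0.0.1", "udp", ""),
    (1.2, "10.0.0.3", "tcp", "A"),
]
feats = interval_features(packets)
syn_ratio = monitor_expression(feats, lambda f: f["syn"] / f["pkts"])
```

Here `syn_ratio` plays the role of a monitored expression over a subset of the features (SYN packets divided by all packets in each interval).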
D K Bhattacharyya and Jugal Kalita, DDoS Attacks: Evolution, Detection, Prevention, Reaction, and Tolerance, CRC
Press, Taylor & Francis Group, May, 2016
SSM Based TCP Targeted LRDDoS Attack Detection Method
Self-Similarity
Definition: The scale invariance property of an object or process that,
at some time scale, looks just like an appropriately scaled version of
itself measured over a different time scale.
Self-Similarity Matrix (SSM)
Definition: Let V = (v1, v2, …, vn) be an ordered sequence of feature vectors
extracted from a data series, where each vector vi describes the relevant
features of the data series in a given local interval. The self-similarity
matrix S is formed by computing the similarity of every pair of feature vectors:
S(j,k) = s(vj,vk), where j,k ∈ {1, … , n}
and s(vj,vk) is a function measuring the similarity of the two vectors.
An SSM-based TCP-targeted LRDDoS attack detection method measures network
traffic self-similarity across multiple time scales, over a subset of
relevant features. The method has been evaluated on a real-life low-rate
dataset for multiple scenarios, and the results confirm its efficacy.
Features used:
• Average packets per network flow (f1)
• Number of packets per interval or sample (f2)
• Number of network flows (f3)
• Server outflow performance (f4)
Similarity measure used: Euclidean distance
$$d(v_i, v_j) = \sqrt{(f_1^i - f_1^j)^2 + \dots + (f_4^i - f_4^j)^2}$$
where $v_i = (f_1^i, f_2^i, f_3^i, f_4^i)$ and $v_j = (f_1^j, f_2^j, f_3^j, f_4^j)$.
Figure: Fractal structures
Figure: Self-similarity matrix S for M traffic samples
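As a rough illustration of the definitions above, the sketch below builds a self-similarity matrix from four-feature vectors (f1 to f4) using Euclidean distance and summarizes it by the standard deviation of its entries. The sample values are invented, and using the population standard deviation of all matrix entries is an assumption; the slides do not show the exact statistic or the decision rule.

```python
import math
from statistics import pstdev

def euclidean(vi, vj):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vi, vj)))

def self_similarity_matrix(vectors):
    """S[j][k] = d(v_j, v_k) for every pair of sample feature vectors."""
    return [[euclidean(vj, vk) for vk in vectors] for vj in vectors]

def ssm_spread(vectors):
    """Standard deviation over all SSM entries (assumed summary statistic)."""
    s = self_similarity_matrix(vectors)
    return pstdev([x for row in s for x in row])

# Hypothetical (f1, f2, f3, f4) samples: three similar intervals, then one outlier.
normal = [(10, 100, 10, 50), (11, 101, 10, 51), (10, 99, 10, 50)]
with_outlier = normal + [(3, 300, 100, 5)]

S = self_similarity_matrix(normal)
```

With these made-up numbers, `ssm_spread(with_outlier)` is much larger than `ssm_spread(normal)`, which is the kind of loss of self-similarity the method looks for.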
Flowchart of the SSM detection procedure: the incoming traffic data series is sampled into N samples, a total matrix size M is set, and a seed pointer sptr is set to 1. Starting from scale count m = 3, the features of each sample i are computed (if not already computed) and an m × m SSM S is constructed from sample sptr to sample i, ignoring any rejected sample. The standard deviation σm is calculated and the indicator I is computed using Equation 4. Depending on I, matrix S is either indicated as self-similar or an anomaly is alarmed as an LRDDoS attack. The procedure repeats while m ≤ M and i ≤ N, until the end of the N samples.
Figure: Performance of SSM corresponding to matrix size M (x-axis: Matrix Size M; y-axis: Standard Deviation (σ); series: 2, 3 and 4 attackers)
Figure: SSM periodic LRDDoS attack detection for 4 attackers (x-axis: Time (in seconds); y-axis: Standard Deviation (σ))
Boro, D., Haloi, M. and Bhattacharyya, D.K., “A Self-Similarity Based TCP Targeted Low-Rate DDoS (LRDDoS) Attack Detection Method”, Security
and Communication Networks, Wiley, 2016 [minor revision].
Cross-Site Scripting (XSS) Attack Detection
• Introduced a client-server architecture for XSS attack detection that balances the load between client and server.
• Presented an attribute clustering method, supported by rank aggregation, to detect confounded JavaScript.
• The unsupervised method shows high detection accuracy with an optimal feature subset.
Figure 1: XSS attack detection architecture
Table 1: Results of attribute clustering in terms of true positive rate, false positive rate and accuracy
S Goswami, N Hoque and D.K. Bhattacharyya. An Unsupervised Method for Detection of XSS Attack. In the Journal of
Network Security, November, 2016 (in press)
LTDS-An Effective Low-rate TCP DDoS Attack Defense
LTDS is a DDoS defense solution capable of detecting low-rate TCP DDoS attacks with high detection accuracy. The core
of our method is to observe, at every interval, the amount of traffic transmitted without a two-way ACK exchange between
the communicating IPs. In a TCP DDoS attack the victim does not send ACKs. Hence, under an attack, we can observe a
significant hike in the amount of traffic transmitted without a two-way ACK exchange. We use a non-parametric change
point modeling technique to detect such a change in the network traffic.
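The slide does not say which change point technique LTDS uses. As one possible illustration, the sketch below applies a one-sided non-parametric CUSUM to a per-interval count of traffic sent without a two-way ACK exchange. The baseline, slack and threshold values are hypothetical tuning parameters.

```python
def cusum_change_point(series, baseline, slack, threshold):
    """Return the index of the first interval where the one-sided CUSUM
    statistic exceeds the threshold, or None if no change is detected."""
    g = 0.0
    for t, x in enumerate(series):
        # Accumulate only the excess above (baseline + slack); never go below zero.
        g = max(0.0, g + (x - baseline - slack))
        if g > threshold:
            return t
    return None

# Hypothetical per-interval volume of traffic without two-way ACK exchange.
unacked = [5, 6, 4, 5, 7, 5, 30, 35, 40, 38]
alarm_at = cusum_change_point(unacked, baseline=5.0, slack=3.0, threshold=40.0)
```

In this made-up series the statistic stays at zero during the first six intervals and crosses the threshold at the second interval of the surge, so a single noisy interval would not trigger the alarm.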
FFSc: Low-rate and High-rate DDoS Attack Detection
It is very difficult to identify a low-rate DDoS attack because the behavior of low-rate network traffic is very similar to that of normal traffic. For effective identification
of low-rate and high-rate DDoS attacks, a feature-feature score (FFSc) is computed
for each network traffic sample. A normal profile that stores the mean, maximum and
minimum FFSc values is generated from an analysis of normal network traffic.
During analysis of captured traffic, FFSc is computed for each unknown sample. If the
deviation of FFSc between normal and captured traffic is greater than a
threshold value, an attack alarm is generated.
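The profile-and-threshold step described above can be sketched as follows. The FFSc formula itself is not given on the slide, so the scores are treated as already computed; measuring deviation from the profile mean is an assumption, and all numbers are invented.

```python
from statistics import mean

def build_profile(normal_scores):
    """Normal profile: mean, minimum and maximum of the FFSc values."""
    return {"mean": mean(normal_scores),
            "min": min(normal_scores),
            "max": max(normal_scores)}

def is_attack(score, profile, threshold):
    """Raise an alarm when the score deviates from the normal mean
    by more than the threshold (assumed deviation rule)."""
    return abs(score - profile["mean"]) > threshold

# Hypothetical FFSc values observed on normal traffic.
profile = build_profile([0.42, 0.40, 0.45, 0.41])
alarm_low = is_attack(0.43, profile, threshold=0.05)
alarm_high = is_attack(0.80, profile, threshold=0.05)
```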
Performance analysis of the proposed method
N Hoque, D K Bhattacharyya and J K Kalita, FFSc: A Novel Measure for Low-rate and High-rate DDoS Attack
Detection, Security and Communication Networks, 9(13) 2032-2041, Wiley, 2016





High Performance Computing for BDA
Major computing tools for handling big data:
• Parallel computing
• Distributed computing
• Application specific hardware
Parallel Computing
Multiple processor cores perform similar or dissimilar tasks simultaneously.
Parallel computing technologies:
• Cluster and supercomputer
• General purpose graphics processing unit (GPGPU)
We are working on developing efficient deep learning systems using GPUs.
GPU cores: simpler architecture than CPU cores, energy efficient
• Consume fewer IC resources, so a huge number of cores can be put into a single chip
• Suitable for stream processing of graphs with a large number of vertices
• NVIDIA GPU currently being used in our lab: 384 cores and 6 Gbps bandwidth
Distributed Computing
• Computing components are located on networked computers, and they communicate and coordinate via message passing
• We categorize distributed computing architectures into three classes, along with their generic architectures:
1. MapReduce architecture
2. Fault tolerant graph architecture
3. Streaming graph architecture
H Kashyap, HA Ahmed, N Hoque, S Roy, DK Bhattacharyya, Big Data
Analytics in Bioinformatics: Architectures, Techniques, Tools and
Issues, Network Modeling Analysis in Health Informatics and
Bioinformatics 5 (1), 28
A Hardware Solution for DDoS Defense
Application specific hardware
• Advantages:
– Optimized application specific datapath requires fewer computation cycles (compared to the general datapath in CPUs)
• Disadvantages:
– High development cost and time (compared to software development)
• Types:
– Non-configurable: Application Specific IC (ASIC)
– Configurable: Complex Programmable Logic Devices (CPLDs) and Field Programmable Gate Arrays (FPGAs)
• Developed a hardware module to detect Distributed Denial of Service (DDoS) attacks in real time
• Implementation considers a Xilinx Virtex-5 FPGA device
• The FPGA design implements our proposed VERC measure for DDoS attack detection
• Resource requirements:
– Slices: 750/7200 (10% of the available)
– Block RAM: 0 (0% of the available)
– DSP Slices: 3/48 (6% of the available)
• Performance of the detection module:
– Maximum frequency: 118 MHz
– Time required to classify the traffic instances of a 1 second window as either attack or normal: 354 ns
NaHiD Correlation Measure for DDoS Attack Detection
A real-time DDoS detection solution demands that a minimum
number of features be used during traffic analysis, whereas
correlation measures such as Spearman, Pearson and Kendall
often fail to provide high detection accuracy over a small number of
features. An effective measure called NaHiD is designed for
network anomaly detection, towards DDoS detection. For any two
objects X and Y of n dimensions, the proposed correlation measure
considers the standard deviation and mean of the two objects.
The measure is implemented on both software and hardware
(FPGA) platforms. From normal network traffic a normal
profile is generated and from captured traffic s
Figure 1: Performance analysis on CAIDA
Figure 2: Performance analysis on DARPA 2000
• N Hoque, H Kashyap and D K Bhattacharyya, “A Real-Time DDoS Attack Detection Method using FPGA”, IEEE
Transactions on Network and Service Management, November, 2016 (under review)
Some Achievements:
Bioinformatics
• Development of a robust correlation measure to support identification of co-expressed patterns
that show shifted, scaled and shifted-and-scaled correlations.
• Development of a robust biclustering technique to identify co-expressed gene patterns with high
biological relevance.
• Development of a robust PPI Complex finding method using unsupervised machine learning
approach.
• Development of efficient methods to extract Co-expressed Network Modules using both
traditional and soft-computing approach and rank the modules against a given disease query.
• Development of a triclustering method to identify co-expressed gene patterns over Gene-Sample-Time space.
SSSim Measure
Introduced an effective shifting-and-scaling correlation measure named
SSSim (Shifting and Scaling Similarity), which can detect highly
correlated gene pairs in any gene expression data.
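To make "shifting" and "scaling" patterns concrete, the sketch below derives shifted, scaled and shifted-and-scaled copies of a made-up expression profile. The SSSim formula is not shown on the slide, so plain Pearson correlation is used here purely as a stand-in to show that such pairs are perfectly correlated; it is not SSSim.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient (stand-in measure, not SSSim)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profile of one gene across five conditions.
gene = [1.0, 3.0, 2.0, 5.0, 4.0]
shifted = [g + 10.0 for g in gene]                  # shifting pattern
scaled = [2.5 * g for g in gene]                    # scaling pattern
shifted_and_scaled = [2.5 * g + 10.0 for g in gene] # shifting-and-scaling pattern

correlations = [pearson(gene, other) for other in (shifted, scaled, shifted_and_scaled)]
```

All three derived profiles differ from `gene` in absolute values, which is why a distance-based measure would miss them while a correlation-style measure does not.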
SSSim in ICS Biclustering
Introduced the ICS (Intensive Correlation Search) biclustering
algorithm, which uses SSSim to extract biologically significant
biclusters from a gene expression dataset.
The technique performs satisfactorily on a number of
benchmark gene expression datasets when evaluated in terms of
functional categories in the Gene Ontology database.
Some p-values on Yeast Sporulation dataset
Comparison of ICS with iBBiG on a subset of the Yeast dataset
Ahmed, H A, Mahanta, P, Bhattacharyya, D K and Kalita, J K, "Shifting-and-Scaling Correlation
Based Biclustering Algorithm" IEEE/ACM Transactions on Computational Biology and
Bioinformatics, 6 (2014): 1239-1252.
Core and Peripheral connectivity based Cluster Analysis
over PPI Network (CPCA)
CPCA exploits the core-periphery structural features of complexes. A complex consists of a dense core region with some
proteins weakly connected to it, often called the periphery. CPCA uses two connectivity criterion functions to identify the
core and the periphery. To locate the initial node of a cluster, a measure called the DNQ (Degree-based Neighborhood
Qualification) index is introduced.
CPCA performs well when compared with well-known
counterparts in terms of sensitivity, PPV, precision, recall
and accuracy.
Comparison using Co-localization score
Comparison using MIPS gold standard
Ahmed, H A, D K Bhattacharyya, and J K Kalita. "Core and Peripheral connectivity based Cluster Analysis over PPI
Network" in Elsevier’s Computational Biology and Chemistry (CBAC), 59, 32-41, 2015.
FUMET: A Fuzzy Network Module Extraction Technique for CEN
A soft thresholding co-expression network construction technique based on fuzzy logic, which
can handle both positive and negative correlations among genes and can handle
membership of a single gene to multiple network modules.
P Mahanta, H A Ahmed, D K Bhattacharyya and A Ghosh, FUMET: A Fuzzy Network Module
Extraction Technique for Gene Expression Data, Journal of Biosciences, vol 39, no 2, June
2014, Springer.
GeCON: Reconstruction of Gene CEN
Gene pairs showing negative or positive co-regulation under a given number of conditions are used
to construct a gene co-expression network with signed edges that reflect up- and down-regulation
between pairs of genes. Most existing techniques lack computational efficiency. A
fast correlogram matrix is used to capture the support of each gene pair to construct the network.
• S Roy, D K Bhattacharyya & J K Kalita, “Reconstruction of Gene Co-expression Network from Microarray Data Using
Local Expression Patterns”, BMC Bioinformatics , Vol. 15, S10, 2014.
CoBi: Polynomial Time Co-regulated Biclustering
A novel expression pattern-based polynomial time biclustering technique for grouping both
positively and negatively regulated genes together as co-regulated genes from microarray
expression data in a deterministic way.
• Roy, S., Bhattacharyya, D. K. and Kalita, J. K. CoBi: Pattern Based Co-Regulated Biclustering of Gene
Expression Data, Pattern Recognition Letters, Elsevier, 34(04), 1669-1678, 2013.
Tricluster Analysis In GST Microarray Data
Developed a triclustering method to find groups of co-expressed genes
over sample and time domains using the SSSim measure.
Triclustering results are better in terms of biological
significance than those of pre-existing algorithms, namely
TRICLUSTER and ICSM.
Developed a shared memory shared nothing architecture to
parallelize our THD-Tricluster and to reduce the
execution time.
T Kakati, H A Ahmed, D.K. Bhattacharyya, J K Kalita. THD-Tricluster: An Effective TriCluster
Algorithm with Shifting-and-Scaling Patterns in Elsevier’s CBAC, November, 2016 (under review).
CEN Module Extraction in Finding Disease Related Genes
The work considers the important issue of analyzing a CEN using both gene expression similarity and
semantic similarity. It considers not only the highly co-expressed genes, but also genes
with lower expression similarity yet high semantic similarity, which are termed border genes. The
border genes obtained are found to be involved in biological pathways related to a
neurodegenerative disease, Alzheimer’s disease.
T Kakati, H J Kashyap, D K Bhattacharyya, THD-Module Extractor: An Application for CEN Module Extraction and
Interesting Gene Identification for Alzheimer’s Disease, in Nature’s Scientific Reports, 2016 (under minor revision)
Some Achievements:
Multi-spectral and Hyper-spectral
Data Analysis
• To develop efficient supervised and semi-supervised classification methods to identify
objects of interest from multi-spectral and hyper-spectral satellite data.
• To develop ensemble classification approach for accurate classification of objects over
hyper-spectral satellite data.
• To identify an optimal subset of relevant features for classification of satellite data.
Classification of Hyperspectral satellite data using Object Based Image Classification
(OBIC) Technique
Classification of Hyperion data
Measures reported: P, ROC, OA, KIA
MLC: 70, 0.66, 0.68, 0.88
ANN: 78, 0.76, 0.77, 0.89
SVM: 80, 0.77, 0.78, 0.89
OBIC: 88, 0.85, 0.87, 0.95
1. Chutia and Bhattacharyya (2014): Effective feature extraction approach for fused images of
Cartosat-I and Landsat ETM+ satellite sensors. Applied Geomatics (Springer), 6(3), 181-195
2. Chutia and Bhattacharyya (2014): OBCsvmFS: Object-Based Classification supported by Support
Vector Machine Feature Selection approach for hyperspectral data. Journal of Geomatics, 8 (1),
12-19
An Effective Ensemble Classification Approach using Random Forests and
Correlation-based Feature Selection (CFS) Technique
Classification of QuickBird image (a part of Shillong city)
Chutia and Bhattacharyya (2015): An Effective Ensemble Classification Framework using Random
Forests and Correlation-Based Feature Selection Technique, IEEE J. of Remote Sensing Letters,
November, 2016 (under minor review)
Some Achievements:
Big Data Mining
• To develop efficient supervised and unsupervised feature selection methods for
accurate classification of real-life data.
• To develop an integrated classifier that operates over an optimal subset of features and
ensures best possible classification accuracy.
• To develop efficient supervised and unsupervised incremental feature selection
methods for accurate classification of real-life data.
• Multi-objective optimization for selection of views over large data warehouses.
MIFS-ND: Mutual Information-based Feature Selection Method
MIFS-ND is used to select an optimal subset of features from a large dataset.
Using feature-class and feature-feature mutual information, the method
selects relevant features and removes redundancy.
To select high-ranked features, the NSGA-II optimization technique is used.
Classification accuracy of MIFS-ND is high on many real-life datasets.
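The interplay of feature-class relevance and feature-feature redundancy can be sketched as below. This is a simplified greedy stand-in (relevance minus average redundancy), not MIFS-ND itself, which according to the slide uses NSGA-II to pick high-ranked features. Feature names and data are invented, and features are assumed to be discrete.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def greedy_select(features, labels, k):
    """Pick k features, each maximizing feature-class MI minus the
    average feature-feature MI with the already selected features."""
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], labels)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 1, 1, 1, 1, 1],  # informative
    "f2": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of f1 (redundant)
    "f3": [0, 1, 0, 0, 0, 1, 1, 1],  # weaker, but mostly new information
}
chosen = greedy_select(features, labels, k=2)
```

With this toy data the duplicate `f2` is skipped in favour of `f3`, even though `f2` is individually more relevant to the class.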
N Hoque, D K Bhattacharyya and J K Kalita, MIFS-ND: A Mutual Information-based Feature Selection
Method, Expert Systems with Applications 41(2014), 6371-6385.
IFS-KNN: An Incremental Feature Selection for Classification using KNN+
IFS-KNN selects an optimal subset of features dynamically from
high dimensional datasets. During feature selection, a dynamic
profile is created for every new class of instance, and only the
high-weightage features are selected.
The traditional KNN gives equal priority to all features during
nearest neighbor computation. Hence, a noisy value of a feature may
yield unpredictable behavior in KNN.
The KNN+ classifier does not consider all the features during nearest
neighbor computation.
Performance is evaluated on gene expression, network and text
categorization datasets using DT, RF, NB, KNN and SVM
classifiers. KNN+ performs better than traditional KNN.
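The idea of ignoring low-weight features during neighbor search can be sketched as follows. The actual KNN+ procedure and the way feature weights are derived are not given on the slide; the weights, the cut-off `min_weight` and the data below are hypothetical.

```python
import math
from collections import Counter

def knn_plus_predict(train, labels, query, weights, k=3, min_weight=0.5):
    """Majority vote among the k nearest neighbors, with distances computed
    only over features whose weight is at least min_weight."""
    keep = [i for i, w in enumerate(weights) if w >= min_weight]

    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in keep))

    nearest = sorted(range(len(train)), key=lambda j: dist(train[j], query))[:k]
    return Counter(labels[j] for j in nearest).most_common(1)[0][0]

# Feature 0 separates the classes; feature 1 is noise on a much larger scale.
train = [(1.0, 10.0), (1.2, 5.0), (0.9, 50.0),
         (5.0, 88.0), (5.2, 85.0), (4.9, 45.0)]
labels = ["a", "a", "a", "b", "b", "b"]
query = (1.1, 88.0)

with_selection = knn_plus_predict(train, labels, query, weights=[0.9, 0.1])
all_features = knn_plus_predict(train, labels, query, weights=[1.0, 1.0])
```

In this toy case the noisy second feature pulls plain KNN towards class "b", while restricting the distance to the high-weight feature yields "a".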
N Hoque, H A Ahmed, D K Bhattacharyya and J K Kalita, IFS-KNN: An Incremental Feature Selection Method, in the
Journal of Machine Learning, Elsevier, 2016 (in press)
Multi-Objective Optimization in Selection of Views to Materialize
in Data Warehouses
• Materialized views in a data warehouse are a promising solution to speed up the analytical
processing of huge volumes of historical data for running decision support applications.
• The view selection problem is NP-hard.
• With the advent of Big Data and the MapReduce programming paradigm, we investigate the view
selection problem for materializing in a Big Data framework.
• The Forma analysis based multi-objective DE for binary encoded data has been modified and
applied in designing a view selection and recommendation system for materializing in a Hadoop
Distributed File System (HDFS) data warehouse framework by promoting diversity of solutions in
the solution vector space.
• The popular elitist multi-objective GA termed NSGA-II and the Archived Multi-objective
Simulated Annealing (AMOSA) algorithm are customized for materialized view
selection in a MapReduce based distributed file system framework for comparative performance
analysis.
Goswami, R., Bhattacharyya, D.K., Dutta, M. and Kalita J.K. : Approaches and Issues in View Selection for
Materializing in Data Warehouse, International Journal of Business Information Systems, Vol. 21, No. 1, pp.
17–47, 2016, DOI: 10.1504/IJBIS.2016.073379.
Some Achievements:
Cognitive Radio Networks
• To develop an efficient opportunity prediction scheme at MAC-layer sensing for ad-hoc
Cognitive Radio Networks.
• To develop a cooperative spectrum sensing technique in CRNs using Coalitional
Game Theory.
• To maximize network throughput through joint routing and channel allocation in multi-hop cognitive radio networks.
• To analyze empirically the effectiveness of classification methods in Spectrum Sensing
in CRNs.
Opportunity Prediction at MAC-Layer Sensing for Ad-hoc CRNs
In this work, two important issues of MAC-layer sensing have been investigated
for underlay mode cognitive radio networks: (a) estimation and
modeling of the licensed channel usage pattern of primary users (PUs), while
tolerating interference from secondary users (SUs), and (b) use of the learnt
channel usage patterns for discovery of opportunities by the SUs. A Hidden
Markov Model based channel usage pattern of PUs is proposed for use by the SUs
to predict spectrum opportunities. The proposed model uses an estimated
interference power constraint (IPC) in determining the interference due to the
presence of SUs, to protect the PUs from harmful interference. A distributed
MAC protocol for data dissemination (DMDD) in underlay mode CRNs is also
proposed, which utilizes the proposed channel usage model.
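A minimal sketch of HMM-based opportunity prediction is given below, assuming a two-state channel (idle or busy) and noisy sensing observations. It applies the standard forward (filtering) recursion and then a one-step prediction. The probabilities are invented, and the sketch leaves out the interference power constraint used in the proposed model.

```python
IDLE, BUSY = 0, 1  # hidden channel states; observations use the same coding

def forward_filter(observations, start, trans, emit):
    """Return P(state | all observations so far) via the forward recursion."""
    states = range(len(start))
    belief = [start[s] * emit[s][observations[0]] for s in states]
    total = sum(belief)
    belief = [b / total for b in belief]
    for obs in observations[1:]:
        predicted = [sum(belief[p] * trans[p][s] for p in states) for s in states]
        belief = [predicted[s] * emit[s][obs] for s in states]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief

def predict_idle(observations, start, trans, emit):
    """Probability that the channel is idle in the next slot."""
    belief = forward_filter(observations, start, trans, emit)
    return sum(belief[p] * trans[p][IDLE] for p in range(len(belief)))

# Hypothetical model parameters.
start = [0.5, 0.5]
trans = [[0.9, 0.1],    # idle -> idle / busy
         [0.2, 0.8]]    # busy -> idle / busy
emit = [[0.95, 0.05],   # true idle: sensed idle / sensed busy
        [0.10, 0.90]]   # true busy: sensed idle / sensed busy

p_after_idle = predict_idle([IDLE, IDLE, IDLE], start, trans, emit)
p_after_busy = predict_idle([BUSY, BUSY, BUSY], start, trans, emit)
```

An SU could rank channels by such a predicted idle probability before attempting transmission; how DMDD actually ranks channels is not recoverable from the slide.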
Figure 4: HMM representing licensed channel observation sequence
Training the HMM for channel
Channel Ranking in DMDD
Performance analysis of the proposed DMDD using designed channel model
Figure 11: % of messages received w.r.t. number of channels, compared to SURF
Figure 12: % of messages received under different PU activity, compared to SURF
Deka, S. K., Sarma, N., 2016. Opportunity Prediction at MAC-Layer Sensing for Ad-hoc
Cognitive Radio Networks. Journal of Network and Computer Applications. (under review).
Cooperative Spectrum Sensing in CRNs
To overcome the issues of individual spectrum sensing, cooperative spectrum
sensing (CSS) has emerged as a prominent solution that exploits the spatial
diversity of Secondary Users (SUs) to make a global decision about the
presence of a Primary User (PU) in a licensed band. Considering the reliability
of SUs may prove to be an important factor during cooperation among
the SUs. In this work, we propose a distributed cooperative spectrum
sensing scheme using a coalitional game theoretic model for Cognitive Radio
Networks, which improves the sensing performance in terms of
detection probability. The utility function of the game is formulated by
considering the trade-off between gain and cost during coalition formation.
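The gain-versus-cost trade-off can be illustrated with a small sketch. The slide does not reproduce the utility function, so the form below is an assumption: gain is the coalition's detection probability under OR-rule fusion, and cost is its false alarm probability under the same rule, scaled by a hypothetical weight.

```python
import math

def or_rule(probabilities):
    """Probability that at least one coalition member reports a detection."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)

def coalition_utility(detect_probs, false_alarm_probs, cost_weight=1.0):
    """Assumed utility: cooperative detection gain minus weighted false alarm cost."""
    return or_rule(detect_probs) - cost_weight * or_rule(false_alarm_probs)

# Hypothetical per-SU detection and false alarm probabilities.
alone = coalition_utility([0.6], [0.05])
paired = coalition_utility([0.6, 0.7], [0.05, 0.05])
```

With these numbers, joining forces raises the utility, but each added member also raises the coalition's false alarm probability, which is what limits coalition size in formulations of this kind.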
Algorithm for Proposed DCSS scheme
Utility function for the proposed game theoretic CSS model
Performance analysis of the proposed method
J.Gupta, P.Chauhan, M. Nath, M. Manvithasree, S.K. Deka and N. Sarma, Coalitional Game Theory based Cooperative Spectrum Sensing in
CRNs, ACM 18th International Conference on Distributed Computing and Networking, ICDCN -2017.(in press)
Network Throughput Maximization through Joint Routing and Channel Allocation
in Multi-hop Cognitive Radio Network
Existing spectrum sharing approaches only maximize throughput/utilization by
optimally allocating channel resources. However, resource allocation alone
leads only to a sub-optimal result. In order to maximize spectrum utilization,
spectrum sharing demands a cross-layer design of routing and resource
allocation to efficiently allocate resources to CR nodes in a multi-hop CRN.
The main contributions of the paper are:
• Defining joint routing and channel allocation as an optimization problem
with the objective of maximizing network throughput
• An Integer Linear Programming (ILP) formulation to solve the optimization
problem
• Implementation of the formulation using CPLEX
ILP formulation:
We present an Integer Linear Programming (ILP) formulation for the
optimization problem. We introduce two decision variables:
Variable 1:
Variable 2:
Performance analysis:
Test Case I: Flows that need to be scheduled:
Flow | Source | Destination | Demand (Mbps)
F1 | 3 | 4 | 2
F2 | 2 | 6 | 2
F3 | 6 | 4 | 2
F4 | 1 | 5 | 2
With the given input, the schedule obtained from the solver over time slots T1 to T5 uses the following links (channel in brackets):
F1: 3->5[c1], 5->4[c2]
F2: 2->3[c1], 3->6[c1]
F3: 6->3[c1], 3->5[c2], 5->4[c2]
F4: 1->2[c1], 2->5[c2]
All 4 flows are scheduled, achieving a maximum throughput of 8 Mbps.
The computational time taken to solve the problem is 0.13 sec (or
26.49 ticks).
Z Ahmed and N Sarma, Network Throughput Maximization through Joint Routing and Channel Allocation in Multi-hop Cognitive
Radio Network, accepted for publication in proc. of IEEE Sponsored Intl Conference on Applications and Innovations in Mobile Computing
(AIMoC) 2016, February 10-12, 2016, Kolkata, India.
Applying Classification Methods for Spectrum Sensing in
Cognitive Radio Networks – An Empirical Study
In a low SNR environment (fading channels), where the noise level
is high, a low-amplitude signal cannot be distinguished even
though it is present in a cognitive radio network. This work
exploits the signal power and SNR features collected on a test
bed to take a good spectrum decision in such conditions by
employing supervised learning. The conventional energy detection
method may cause misdetection of the signal, as it fails in a low
SNR environment. Here, every supervised learning model is built
not only on the received signal power but also on the SNR
feature, so that even if there is a low-power signal in a highly
noisy environment, the classifier can still detect the signal
with a priori knowledge. Our empirical study shows that
supervised learning gives high classification accuracy by
detecting low-amplitude signals in a noisy environment.
Figure 1 : Showing the Experimental Setup of the CRN Test Bed
Comprising of USRP 1
For collecting real-time experimental data, GNU Radio is used, and the
existing sample Python scripts usrp_spectrum_sense.py and
benchmark_tx.py are modified as sensing.py and transmission.py for sensing
and transmission respectively. Other program parameters in the
transmission.py script are: sampling rate = 1 mega samples per second,
modulation used = GMSK, sub-channel bandwidth = 6.25 kHz. So the number of
FFT bins collected in a particular channel = 160 (1 MS/6.25e3). The receiver,
which is tuned to the center frequency of the channel, can sweep only 8 MHz
of channel bandwidth due to the USRP1 daughter board constraint. Out of the
total 160 bins, 75 percent is taken and 25 percent is discarded from the lower
and upper cut-off frequencies of the channel (12.5 percent each). The program
senses the power level from bin 20 to bin 140.
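The bin arithmetic in this paragraph can be checked with a few lines. The variable names are illustrative only and do not come from sensing.py or transmission.py.

```python
sample_rate_hz = 1_000_000      # 1 mega sample per second
sub_channel_bw_hz = 6_250       # 6.25 kHz sub-channel bandwidth

# Number of FFT bins collected in one channel.
fft_bins = sample_rate_hz // sub_channel_bw_hz

# 12.5 percent of the bins are discarded at each edge of the channel.
guard_bins = int(fft_bins * 0.125)
first_bin = guard_bins
last_bin = fft_bins - guard_bins

# Fraction of bins actually used for sensing.
used_fraction = (last_bin - first_bin) / fft_bins
```

This reproduces the figures quoted above: 160 bins in total, sensing from bin 20 to bin 140, with 75 percent of the bins retained.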
The sensing data was captured with the power and SNR features, once with active
transmission and once with no transmission. All the sensing data are
labeled as “Occupied” or “Free” with respect to the known occupied
and free channels respectively.
Figure 2 : Performance of Classifiers with different Number of
Testing Samples and Average F1 measure
N.Basumatary, N.Sarma, B.Nath Applying Classification Methods in Spectrum Sensing in Cognitive Radio Networks:
An Empirical Study In ETAEERE 2016 (Springer Conference), (in press).
Some Achievements:
Natural Language Processing
• To compute and analyze VOT (Voice Onset Time) values for the stops of the Assamese
language and its dialectal variants to provide a better understanding of the
phonological differences that exist among the different dialectal variants of a
language.
• To develop a speech corpus.
• Computational Modeling of Morphology and Syntax of Manipuri – a resource poor
Tibeto-Burman language.
• Computational Modeling of Morphology and Syntax of Assamese – a resource poor
inflectional language.
Speech and Natural Language Processing
Incorporating Dialectal Features in Synthesized Speech
Objective 1: Compute and analyze VOT (Voice Onset Time)
values for the stops of the Assamese language and its
dialectal variants to provide a better understanding of the
phonological differences that exist among the different
dialectal variants of a language which may prove to be useful
for dialect translation and synthesis.
Tasks: Compute and analyze the VOT values for the stops of
the Assamese language and its dialectal variants (Nalbaria
variety) and find out their position in the standard VOT
continuum.
Subtask 1: Development of Speech Corpus
List of words having the voiced/voiceless plosives in word
initial position followed by vowel sounds ‘a’,‘e’,‘i’,‘o’ and ‘u’ is
prepared and recorded from 4 speakers (2 speaking the AIR
variety & 2 speaking the Nalbaria variety) at a sampling rate of
44.1kHz and 16 bit resolution in a noise free environment.
Subtask 2: VOT measurement
PRAAT speech analysis software is used to generate the
waveform and spectrogram for each word utterance
containing the plosive in word-initial position. On each
waveform 2 points in time are located: the onset of burst
release marked by the onset of low amplitude, aperiodic
noise and the onset of voicing marked by the onset of high
amplitude periodic energy.
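Given the two time points located on the waveform, VOT is the onset of voicing minus the onset of the burst release. The sketch below computes it in milliseconds and assigns a coarse category. The 30 ms boundary between short lag and long lag is an assumed illustrative cut-off, not a value taken from the study, and the time stamps are invented.

```python
def voice_onset_time_ms(burst_onset_s, voicing_onset_s):
    """VOT in milliseconds: negative when voicing starts before the burst."""
    return (voicing_onset_s - burst_onset_s) * 1000.0

def vot_category(vot_ms, short_lag_max_ms=30.0):
    """Coarse position on the VOT continuum (boundary value is an assumption)."""
    if vot_ms < 0:
        return "lead"
    if vot_ms <= short_lag_max_ms:
        return "short lag"
    return "long lag"

# Hypothetical (burst onset, voicing onset) pairs in seconds, as marked in PRAAT.
examples = [(0.100, 0.020), (0.100, 0.112), (0.100, 0.175)]
categories = [vot_category(voice_onset_time_ms(b, v)) for b, v in examples]
```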
Classification of Assamese Stop Consonants
Sanghamitra Nath,Himangshu Sarma, and Utpal Sharma. A preliminary study on the VOT patterns of the Assamese language
and its Nalbaria variety. In Computational Linguistics and Intelligent Text Processing, pages 542-552. Springer, 2014
Results:
The VOT (lead) range for both varieties of Assamese is similar to the standard ranges, although the maximum value is much larger.
The VOT (short lag) range for the AIR variety falls into the standard range, but the maximum value for the Nalbaria variety is larger.
The range for the aspirated stops needs to be extended on both ends.
VOT for the voiced stops in the Nalbaria variety has both positive and negative values.
Stops in the AIR variety are much more aspirated than the stops in the Nalbaria variety.
Conclusions:
VOT values for the two varieties of Assamese under study show differences which can be used for dialect identification/recognition.
It is likely that VOT will also make a substantial difference in the synthesis of the Assamese dialects. Experiments on speech synthesis
with varying VOT values are yet to be carried out.
Objective 2:
The formant structure of vowels and diphthongs is important for
distinguishing vowel/diphthong sounds from each other.
Furthermore, accurate estimation of segmental duration is crucial for
natural-sounding text-to-speech synthesis. Therefore, analyzing the
vowels and diphthongs with respect to formants and segmental
duration may reveal information that helps in dialect recognition
and synthesis.
Tasks:
Development of Speech Corpus:
A list of words having the vowels and diphthongs in word initial,
medial and end positions is prepared and recorded from 4 speakers (2
speaking the AIR variety & 2 speaking the Nalbaria variety) at a
sampling rate of 44.1 kHz and 16-bit resolution in a noise-free
environment using a Sony recorder.
Measurement of formants and vowel duration:
A PRAAT script extracts formants F1 and F2 at 25%, 50% and
75% of the vowel duration and at 20%, 40%, 60% and 80% of
the diphthong duration, as well as the duration of the vowel and
diphthong segments. The Euclidean distance between the
nucleus and the offglide of a diphthong is calculated and
recorded in an Excel sheet.
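The two computations above can be sketched as follows: the time points at which formants are sampled, and the Euclidean distance between nucleus and offglide in the F1-F2 plane. This is an illustration in Python rather than the PRAAT script itself; the function names and the formant values in the example are assumptions, and the slides do not say which sample points are treated as nucleus and offglide, so both are passed in explicitly.

```python
import math

VOWEL_FRACTIONS = (0.25, 0.50, 0.75)            # sampling points for vowels
DIPHTHONG_FRACTIONS = (0.20, 0.40, 0.60, 0.80)  # sampling points for diphthongs


def sample_times(start_s, end_s, fractions):
    """Return the absolute times (s) at the given fractions of a segment."""
    duration = end_s - start_s
    return [start_s + f * duration for f in fractions]


def f1f2_distance(nucleus, offglide):
    """Euclidean distance (Hz) between two (F1, F2) points."""
    return math.hypot(offglide[0] - nucleus[0], offglide[1] - nucleus[1])


# Hypothetical (F1, F2) values in Hz, for illustration only.
nucleus = (700.0, 1200.0)
offglide = (400.0, 1600.0)
print(sample_times(0.0, 1.0, DIPHTHONG_FRACTIONS))
print(f1f2_distance(nucleus, offglide))
```

A larger distance means the formants travel further between nucleus and offglide.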
Observations:
In the Nalbaria variety, the /a/ is closer to the /aa/, i.e.,
the backness of /a/ is less than that of the AIR variety, while
the /u/ is more central and /o/ is more back. In the AIR
variety, /u/ is more back and /o/ is more central.
S Nath and U Sharma. An analysis of the vowels and diphthongs of the Assamese language and its Nalbaria variety. In Computing and
Communication Systems (I3CS), 2015 International Conference on, 2015.
Observations (contd.):
In almost all cases the distance (between nucleus and offglide) is much larger in the AIR diphthongs, making them more prominent.
The dynamic F1F2 plot of most diphthongs in AIR almost reaches the target vowel, while the dynamic F1F2 plot
of most diphthongs in Nalbaria lies somewhere between the vowel sounds /i/ and /aa/.
Duration of vowels in Nalbaria is much smaller than the duration of vowels in AIR.
Duration of vowels in Nalbaria is much smaller than the duration of vowels in AIR.
Computational Modelling of Morphology and Syntax of Assamese – a resource poor
inflectional language
A. Stemming of Words
Objective- Automatic identification of the stem of words occurring in texts.
◦ Experimented with Assamese, Bengali, Bishnupriya Manipuri and Bodo
◦ Developed a rule-based approach to remove suffixes from words. Used a dictionary of frequent words to reduce over-stemming and
under-stemming.
◦ To deal with problems due to the large number of single-letter suffixes, proposed an HMM-based hybrid approach.
◦ Obtained accuracy of 94% for Assamese and Bengali using the hybrid approach, which is an improvement over existing methods.
◦ Obtained accuracy of 87% and 82% for Bishnupriya Manipuri and Bodo, respectively.
Ours is the first reported work on these two languages.
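The rule-based part of the approach above can be illustrated with a toy longest-suffix stripper that consults a list of frequent words before removing anything. This is only a sketch under stated assumptions: the suffix list and word list are English placeholders rather than the Assamese resources used in the work, treating the dictionary as an exception list is an assumed simplification, and the HMM-based component is not shown.

```python
def stem(word, suffixes, frequent_words, min_stem_len=2):
    """Strip the longest matching suffix unless the word is a known frequent word.

    frequent_words acts as an exception list (an assumed use of the dictionary).
    min_stem_len is a hypothetical guard against stripping too much.
    """
    if word in frequent_words:
        return word
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# Placeholder resources, not real Assamese data.
SUFFIXES = {"s", "es", "ing"}
FREQUENT_WORDS = {"king", "bus"}

for w in ("walking", "boxes", "king", "bus"):
    print(w, "->", stem(w, SUFFIXES, FREQUENT_WORDS))
```

Single-letter suffixes such as "s" are the ambiguous ones: without the word list, "bus" would be cut to "bu".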
Navanath Saharia, Utpal Sharma, and Jugal Kalita. Stemming resource-poor Indian languages. ACM Transactions on Asian Language
Information Processing (TALIP), vol. 13, no. 3, article 14, pp. 14.1-14.26 (26 pages), September 2014.
DOI: http://dx.doi.org/10.1145/2629670
B. Parsing of Assamese Sentences
Objective- Recognising the syntactic structure of Assamese sentences.
◦ Experimented with Assamese, a morphologically rich, inflectional and resource-poor Indian language.
◦ Developed a hierarchical Part-of-speech tagset suitable for Assamese.
◦ Part-of-speech tagging for Assamese using a rule-based approach that is augmented with a dictionary.
◦ Part-of-speech tagging for Assamese using an HMM based approach.
◦ Identified multi-word units in texts.
◦ Explored three dependency parsing models for Assamese, viz. Link grammar parsing, Malt parsing, and MST parsing.
◦ Developed an Assamese TreeBank-a repository to store the parsed sentences.
Navanath Saharia. Computational Morphology and Syntax for a Resource-Poor Inflectional Language. PhD
Thesis, Tezpur University, 2014.
Computational Modeling of Syntax of Manipuri- a resource-poor Tibeto-Burman Language
Objective- Syntax modeling and development of an effective parser for Manipuri
◦ Collected a raw corpus of about 16 million words from Manipuri newspapers available in the public domain. This is in addition to
a corpus of about 1.4 million words obtained from the Technology Development for Indian Languages (TDIL) Programme, DeitY, MC & IT,
Govt. of India, under research license.
◦ Developed transliteration software for conversion of the collected newspaper articles into Unicode (UTF-8) format.
◦ Studied the syntactic structure of Manipuri and identified candidate frameworks for the syntax model to be developed (CFG, TAG, etc.).
◦ Identified implementation issues for Manipuri parsers.
Some Achievements:
Bio-mimetic and Cognitive Robotics
• Development of a combined diagrammatic reasoning and
qualitative spatio-temporal reasoning framework to detect motion events in video.
• To extend CORE9 for human activity recognition.
• Intent recognition in a generalized framework for collaboration.
Real-Time EMG-based Prosthetic Hand Control Design
Mantoo Kaibarta, Nayan M. Kakoty and Shyamanta M. Hazarika
Abstract
A real-time EMG-based control design for a five-fingered
prosthetic hand is being developed to assist people
with upper-limb injury or disability. The EMG is captured
non-invasively from the surface of the subject's hand muscle.
Although a number of EMG-based prosthetic hands exist,
we focus on a design with fewer electrodes (channels),
low cost and high efficiency.
Findings
The experimental results show that, because the embedded
system operates in real time, the system has strong
potential for portable applications.
Objective
Design a classifier for classifying the six grasps shown in
Figure 1 that works in real time.
Figure 2: Basic Block Diagram of the Complete System (blocks: Subject, Data Acquisition, Data Processing, Prosthetic Hand, Motor Driver)
Processing the EMG Signal in Real Time
The samples are collected at a 1 kHz sampling rate for
100 ms. The dsPIC33FJ128GP802 microcontroller from
Microchip is chosen for the EMG signal processing. It
comes in a 28-pin package and has a 3.3 V DC power supply,
a 16-bit data path, 128 KB of ROM and 16 KB of SRAM.
The EMG signal pattern changes with the subject, the
location of the electrode on the hand muscle and the
environmental conditions. Our goal is to implement a
robust system.
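The processing window described above can be sketched on a PC as follows. At 1 kHz, a 100 ms window holds 100 samples. The poster names a difference filter and a summation feature but gives no formulas, so the first-order difference and the sum of absolute values used here are assumptions, as are all function names. This is not the dsPIC firmware.

```python
SAMPLE_RATE_HZ = 1000  # from the poster: 1 kHz sampling rate
WINDOW_MS = 100        # from the poster: 100 ms collection window


def samples_per_window(rate_hz=SAMPLE_RATE_HZ, window_ms=WINDOW_MS):
    """Number of samples collected in one window."""
    return rate_hz * window_ms // 1000


def difference_filter(samples):
    """First-order difference y[n] = x[n] - x[n-1] (assumed filter form)."""
    return [b - a for a, b in zip(samples, samples[1:])]


def summation_feature(samples):
    """Sum of absolute sample values over the window (assumed feature form)."""
    return sum(abs(s) for s in samples)


# Hypothetical ADC readings, for illustration only.
raw = [1, 4, 2, 7]
filtered = difference_filter(raw)
print(samples_per_window(), filtered, summation_feature(filtered))
```

Both steps use only integer additions and subtractions, which suits a 16-bit microcontroller.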
Figure 4: Prosthetic Hand
Conclusion and Future Work
We are able to detect and process the EMG signal with
reasonable accuracy in real time with one channel.
Our focus is on implementing classification of six
grasp types using an SVM with two channels of 16-bit
EMG data. For filtering and extracting a better EMG
signal pattern we will use wavelet techniques.
Our aim is to make the circuit board as small
as possible so that it fits within the prosthetic hand.
Figure 1: Types of Grasp
Figure 3: Summation of Entire Sample (Testing on PC). Panels: 1. EMG Signal; 2. Applying Summation Feature
References
1. Kakoty, N. M. and Hazarika, S. M. (2011) “Recognition of
Grasp Types through Principal Components of DWT
based EMG Features”, 12th International Conference on
Rehabilitation Robotics, Zurich, Switzerland, June 2011.
2. P.R.S. Sanches, A.F. Muller, L. Carro, A.A. Susin, P.
Nohama, “Analog reconfigurable techniques for EMG
signal processing”, Sociedade Brasileira de Engenharia
Biomedica, v.23, n.2, p. 153-157, April 2007.
Tezpur University, Tezpur, Assam - 784028
Real-Time EMG-based Prosthetic Hand Control Design
Mantoo Kaibarta, Asst. Prof. Nayan M. Kakoty, Prof. Shyamanta M. Hazarika
Placement of Electrodes
Figure 3 - Placement of Electrodes (labels: Flexor Digitorum Profundus (blue); Electrode 2; Channel 1; Extensor Digitorum Muscle (purple); Electrode 1)
1. Raw EMG Signal
2. Filtered Signal (Difference Filter)
3. Applying Summation Feature
Figure 3.4 – Processing of EMG Signal (Testing on PC)
Development of Cluster Facilities to Support Big Data Analytics
The facilities are intended to support the following
objectives:
1. Generate DDoS attack centric alert dataset using
multiple defense sensors to validate alert correlation
methods.
2. Develop an Alert Correlation Analyzer using Granger
Causality over very large alert datasets.
3. Use the Theano or PyCUDA platform for classification
of big data using Deep Learning with Alternate Dropping.
4. Extract network modules from voluminous gene
expression data using a multi-objective approach, towards
disease gene identification.
5. Develop an unsupervised differential analysis method
to analyze disease genes in progression.
