
Thursday 16 December 2021

12 Factor App Principles

 

The Twelve-Factor App is a methodology for building software-as-a-service applications. These best practices are designed to enable applications to be built with portability and resilience when deployed to the web.

1.      Code base

Use version control and keep one repository per application; shared code can be pulled in as a Git submodule or a Maven module.

2.      Dependencies:

Don't check JARs into Git. Declare dependencies explicitly and resolve them through Maven from a repository manager such as Artifactory or GCP Artifact Registry.

3.      Config:

Configuration should be strictly separated from code. Anything that varies from environment to environment must be moved out into configuration and managed via environment variables.

-        Database connections and credentials, system integration endpoints

-        Credentials to external services such as Amazon S3 or Twitter or any other external apps

-        Application-specific information like IP Addresses, ports, and hostnames, etc.

Principle: could you make your app open source at any time without compromising any credentials?
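
As a minimal sketch (the variable names below are hypothetical, not part of the original post), configuration can be read from environment variables at startup instead of being hard-coded:

public class AppConfig {
    // Hypothetical variable names; the values come from the deployment environment, not from code.
    public static final String DB_URL = requireEnv("DATABASE_URL");
    public static final String S3_ACCESS_KEY = requireEnv("S3_ACCESS_KEY");

    private static String requireEnv(String name) {
        String value = System.getenv(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("Missing required environment variable: " + name);
        }
        return value;
    }
}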

4.      Backing service:

Databases, message brokers, and any other external systems the app communicates with are treated as backing services. Treat backing services as attached resources: for example, a messaging service or a PostgreSQL database referenced by a single URL that can be swapped out without any code change.

5.      Build, release, run:

Build stage: transforms the code into an executable bundle/build package.

Release stage: takes the build package from the build stage, combines it with the configuration of the deployment environment, and makes the application ready to run.

Run stage: runs the app in the execution environment.

Strictly separate these stages, so each can be triggered with a single command. You can use CI/CD tools to automate the build and deployment process, and Docker images make it easy to keep the build, release, and run stages cleanly separated.

6.      Processes:

Execute the app as one or more stateless processes

As per 12-factor principles, the application should not keep data in process memory; it must be persisted to a backing store and read from there. Where state is concerned, your application should store it in a database instead of in the memory of the process.

Avoid sticky sessions; using them is a violation of 12-factor app principles. If you need to store session information, you can choose Redis, Memcached, or any other cache provider based on your requirements.
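
For illustration only, here is a rough sketch of keeping session data in Redis via the Jedis client rather than in process memory (the host, key format, and expiry below are assumptions, not from the original post):

import redis.clients.jedis.Jedis;

public class SessionStore {
    private final Jedis jedis = new Jedis("localhost", 6379); // host/port are placeholders

    // Store session data externally with a 30-minute expiry instead of in process memory.
    public void save(String sessionId, String payload) {
        jedis.setex("session:" + sessionId, 1800, payload);
    }

    public String load(String sessionId) {
        return jedis.get("session:" + sessionId);
    }
}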

7.      Port binding :

Export services via port binding. The web app exports HTTP as a service by binding to a port and listening for requests coming in on that port. Spring Boot is one example of this: by default it ships with an embedded Tomcat, Jetty, or Undertow.
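
A minimal sketch of a self-contained Spring Boot app that binds its own port (the class name is arbitrary; the port can be overridden via the server.port property):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        // Starts the embedded server (Tomcat by default) and binds it to port 8080,
        // unless overridden, e.g. with -Dserver.port=9090 or the SERVER_PORT environment variable.
        SpringApplication.run(DemoApplication.class, args);
    }
}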

8.      Concurrency:

By adopting containerization, applications can be scaled horizontally as demand requires.

9.      Disposability

Maximize robustness with fast startup and graceful shutdown. Docker containers can be started or stopped instantly. Storing request, state, or session data in queues or other backing services ensures that a request is handled seamlessly in the event of a container crash.
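
As an illustrative sketch (not from the original post), a JVM shutdown hook can be used to release resources gracefully when the container receives SIGTERM:

public class Worker {
    public static void main(String[] args) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // Runs when the process receives SIGTERM (e.g. on `docker stop`):
            // finish in-flight work, close connections, flush buffers.
            System.out.println("Shutting down gracefully...");
        }));

        // ... main processing loop would go here ...
    }
}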

10.  Dev/prod parity

Keep development, staging, and production as similar as possible. This reduces the risk of bugs that only show up in a specific environment.

11.  Logs

Treat logs as event streams

Observability is a first-class citizen. It can be achieved using APM tools (New Relic and similar) or log-aggregation tools such as the ELK stack and Splunk.

12.  Admin processes

Run admin/management tasks as one-off processes. Any needed admin tasks should be kept in source control and packaged with the application.

The twelve-factor principles advocate keeping such administrative tasks as part of the application codebase in the repository. That way, one-off scripts follow the same process defined for the rest of your codebase.

Ensure one-off scripts are automated so that you don't need to worry about executing them manually before releasing the build. The twelve-factor principles also suggest using the execution environment's built-in tooling to run those scripts on production servers.
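
For example, a one-off admin task can be shipped as its own entry point in the same codebase and run once in the execution environment (the class name and the DATABASE_URL variable below are made up for illustration):

public class MigrateUsersTask {
    public static void main(String[] args) {
        // One-off process: runs, performs its task, and exits.
        // It reads the same environment-based configuration as the main app,
        // so it runs against the same backing services.
        System.out.println("Running one-off migration against " + System.getenv("DATABASE_URL"));
        // ... perform the migration here, then exit ...
    }
}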

Ref: https://12factor.net/

Friday 18 June 2021

Continuous Integration Vs Continuous Delivery Vs Continuous Deployment



Continuous integration: developers merge their work frequently and every change is automatically built and tested. Continuous delivery extends this so that the built, tested code is always ready to be released, with the actual release triggered manually. Continuous deployment goes one step further and automatically deploys every change that passes the automated tests.


Friday 30 April 2021

Publish maven project to google cloud (GCP) Artifact Registry

  • Enable the Artifact Registry API.
  • Install and initialize the Cloud SDK.
  • Create the repository. You might not see the Maven format option; it was still in alpha at the time and disabled by default, so you have to fill in a form to enable it, and approval can take some time. Let's assume your repository name is "quickstart-maven-repo" and the location you selected is "us-central1".

  • Now go to your command prompt and log in with gcloud
$gcloud auth login
$gcloud config set project <myProject>
  • Set the default repository by running:
$gcloud config set artifacts/repository quickstart-maven-repo
  • Set the default location:
$gcloud config set artifacts/location us-central1

  • Create a service account from the Google Cloud console, then grant it write access to the repository with the command below

$gcloud artifacts repositories add-iam-policy-binding quickstart-maven-repo --location=us-central1 --member='serviceAccount:ACCOUNT' --role='roles/artifactregistry.writer'

Where ACCOUNT is the ID of your service account in the format USERNAME@PROJECT-ID.iam.gserviceaccount.com

  • Download the service account key

 $gcloud iam service-accounts keys create mykey.json --iam-account=USERNAME@PROJECT-ID.iam.gserviceaccount.com

$export GOOGLE_APPLICATION_CREDENTIALS=mykey.json 
where mykey.json is the key file generated in the previous step.

  • Now it's time to configure the Maven project
  • Choose a Maven project that you want to use and go to the root directory of the project.
  • Run the following command to print the settings for the default quickstart-maven-repo repository.
$gcloud artifacts print-settings mvn

  • The output should look like below
<distributionManagement>
  <snapshotRepository>
    <id>artifact-registry</id>
    <url>artifactregistry://us-central1-maven.pkg.dev/PROJECT/quickstart-maven-repo</url>
  </snapshotRepository>
  <repository>
    <id>artifact-registry</id>
    <url>artifactregistry://us-central1-maven.pkg.dev/PROJECT/quickstart-maven-repo</url>
  </repository>
</distributionManagement>

<repositories>
  <repository>
    <id>artifact-registry</id>
    <url>artifactregistry://us-central1-maven.pkg.dev/PROJECT/quickstart-maven-repo</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<build>
  <extensions>
    <extension>
      <groupId>com.google.cloud.artifactregistry</groupId>
      <artifactId>artifactregistry-maven-wagon</artifactId>
      <version>2.1.1</version>
    </extension>
  </extensions>
</build>

Add the output to the pom.xml
  • Run the below command to publish to the repo
$mvn clean deploy
and see the magic: go to the Google Cloud console, open Artifact Registry, and you should see your published JAR
  • For a dependent project to access the published JAR, add the below settings to its pom.xml
<repositories>
  <repository>
    <id>artifact-registry</id>
    <url>artifactregistry://us-central1-maven.pkg.dev/PROJECT/quickstart-maven-repo</url>
    <releases>
      <enabled>true</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>

<build>
  <extensions>
    <extension>
      <groupId>com.google.cloud.artifactregistry</groupId>
      <artifactId>artifactregistry-maven-wagon</artifactId>
      <version>2.1.1</version>
    </extension>
  </extensions>
</build>

and then run the command
$mvn clean compile

The project should compile and download all the dependencies.


Tuesday 9 February 2021

Mongodb replica with docker compose

Clone the repo to get the docker-compose.yml and related files

After cloning the repo you will see docker-compose-standalone.yml, docker-compose.yml, setup_replica.sh, and resource/mongod-keyfile, which are used in the commands below.

To run a standalone server:
$docker-compose -f docker-compose-standalone.yml up -d

To run replicated servers:

$chmod 400 resource/mongod-keyfile

$./setup_replica.sh

$docker exec -it mongodb1 bash

$mongo -u root -p admin

Enjoy!

Monday 1 February 2021

Kafka basics

 What is Kafka?

Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. (Wikipedia).

Apache Kafka is a publish-subscribe based durable messaging system. A messaging system sends messages between processes, applications, and servers.

Topics:

A topic is a category/feed name to which records are stored and published. A topic is a particular stream of data, similar to a table name in a database.

Partitions:

Kafka topics are divided into a number of partitions, which contain records in an unchangeable sequence. Each record in a partition is assigned and identified by its unique offset. A topic can have multiple partition logs, which allows multiple consumers to read from a topic in parallel. Each message in a partition gets an incremental id called the offset.
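
As an illustrative sketch (the topic name, partition count, replication factor, and broker address are assumptions, not from the original notes), the partition count is fixed when a topic is created, for example via the Java AdminClient:

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions allow up to 3 consumers in a group to read in parallel;
            // replication factor 2 keeps a copy of each partition on a second broker.
            admin.createTopics(List.of(new NewTopic("orders", 3, (short) 2))).all().get();
        }
    }
}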

Offset:

  • Offsets are like indexes in an array.
  • Order is guaranteed only within a partition (not across partitions)
  • Data is kept only for a limited time (the default is one week)
  • Data is assigned randomly to a partition unless a key is provided

Broker & Cluster:

A cluster is a collection of brokers, and brokers are the Kafka servers. Every Kafka broker is also called a "bootstrap server": you only need to connect to one broker and you will be connected to the entire cluster.

Leader:

    • At any time, only one broker can be the leader for a given partition
    • Only that leader can receive and serve data for the partition
    • The other brokers will synchronize the data
    • Therefore each partition has one leader and multiple ISRs (in-sync replicas)


Replicas:

Replicas are nothing but backups of a partition. If the replication factor of a topic is set to 4, then Kafka will keep four identical copies of each partition and place them across the cluster to make them available for all its operations. Replicas are not used to read or write data; they exist to prevent data loss.


Producers:

Producers write data to topics.

Message Key:

Producers can choose to send a key with the message.
    • key=null: data is sent round robin (Broker0, then Broker1, then Broker2)
    • key!=null: all messages for that key will always go to the same partition
A key is sent when you need message ordering for a specific field.

Key Hashing:
    • By default the producer uses the "murmur2" algorithm
    • Formula: targetPartition = Utils.abs(Utils.murmur2(record.key())) % numPartitions
    • Adding/removing partitions on a topic will completely alter the formula
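
A minimal producer sketch along these lines (the topic name, key, and broker address are placeholders): every message sent with the same key is routed to the same partition.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("customer-42") is hashed with murmur2 to pick the partition,
            // so ordering is preserved for this customer's events.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            producer.flush();
        }
    }
}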

        Acknowledgement:
Producers can choose to receive acknowledgment of data writes.
    • acks = 0: the producer won't wait for acknowledgment (possible data loss)
    • acks = 1: the producer will wait for the leader's acknowledgment (limited data loss)
      • The leader's response is requested, but replication is not guaranteed
      • If the ack is not received, the producer may retry
      • If the leader broker goes offline before the replicas have replicated the data, the data is lost
    • acks = all: leader + replicas acknowledgment (no data loss)
      • acks=all must be used in conjunction with min.insync.replicas
      • min.insync.replicas can be set at the broker or topic level (the topic setting overrides the broker setting)
      • min.insync.replicas=2 implies that at least 2 brokers that are ISRs (including the leader) must respond that they have the data
      • That means with replication.factor=3, min.insync.replicas=2 and acks=all, you can only tolerate 1 broker going down; otherwise the producer will receive an exception on send
        enable.idempotence=true (producer level) + min.insync.replicas=2 (broker/topic level)
        implies acks=all, retries=MAX_INT, max.in.flight.requests.per.connection=5 (default)
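
As a rough sketch of the "safe producer" settings above (the broker address is a placeholder; whether these values fit depends on your latency requirements):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class SafeProducerConfig {
    static Properties safeProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for leader + in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // implies acks=all, retries=MAX_INT,
                                                                     // max.in.flight.requests.per.connection=5
        // Broker/topic side (not a producer property): min.insync.replicas=2.
        // With replication.factor=3 this tolerates one broker being down.
        return props;
    }
}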

    Compression:

    • Producers usually send text-based data, e.g. JSON, which is large in size
    • In this case, it is important to apply compression at the producer
    • Compression is enabled at the producer level and doesn't require any change at the broker or in the consumer
        Compression Type:
      • "compression.type" can be 'none' (default), 'gzip', 'lz4', 'snappy'
      • Compression is more effective on bigger batches of data
      • Always use compression if you have high throughput
      • Consider tweaking linger.ms and batch.size to get bigger batches and therefore more compression and higher throughput
    • By default, Kafka tries to send records as soon as possible
    • It will have up to 5 requests in flight, meaning up to 5 messages individually sent at the same time
    • After this, if more messages have to be sent while others are in flight, Kafka is smart and will start batching them while they wait, to send them all at once
            linger.ms:
            Number of milliseconds a producer is willing to wait before sending a batch out (default 0)
    • By introducing some lag (for example linger.ms=5) we increase the chances of messages being sent together in a batch
    • By introducing this small delay, we increase the throughput, compression, and efficiency of the producer
    • If a batch is full (batch.size) before the end of the linger.ms period, it will be sent to Kafka right away!
            batch.size:
            Maximum number of bytes that will be included in a batch. The default is 16KB.
    • Increasing the batch size to 32KB or 64KB can help increase the compression, throughput, and efficiency
    • Any message that is bigger than the batch size will not be batched
    • A batch is allocated per partition, so make sure you don't set it to a number that is too high, otherwise it will waste memory
    • You can monitor the average batch size metric using Kafka producer metrics
            Advantages of compression:
      • Much smaller producer request size
      • Lower latency
      • Better throughput
      • Messages stored on disk in the broker are smaller
            Disadvantages:
      • Producers must commit some CPU cycles to compression
      • Consumers must commit some CPU cycles to decompression
           Note: If the producer produces faster than the broker can take, the records will buffer in buffer.memory and drain back down when throughput to the broker recovers. max.block.ms=60000 is the time .send() will block before throwing an exception, i.e. the exception is thrown when:
      • The producer has filled up its buffer
      • The broker is not accepting any new data
      • 60 seconds have elapsed
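
And a sketch of the batching/compression settings just described (the values are illustrative only; measure against your own throughput and latency before adopting them):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class HighThroughputProducerConfig {
    static Properties highThroughputProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");              // 'none', 'gzip' and 'lz4' are also valid
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");                          // wait up to 5 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32 * 1024)); // 32KB batches instead of the 16KB default
        return props;
    }
}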

Consumers:

Consumers read data from topics.
  • Kafka stores the offsets at which a consumer group has been reading
  • They are stored in a Kafka topic named __consumer_offsets
  • When a consumer in a group has processed data received from Kafka, it should be committing the offsets
  • If a consumer dies, it will be able to read back from where it left off

    Delivery Semantics:

    • At most once: offsets are committed as soon as the message batch is received. If the processing goes wrong, the message will be lost.
    • At least once (usually preferred): offsets are committed after the message is processed. If the processing goes wrong, the message will be read again. This can result in duplicate processing of messages, so make sure your processing is idempotent.
    • Exactly once: can be achieved for Kafka-to-Kafka workflows using the Kafka Streams API. For Kafka-to-external-system workflows, use an idempotent consumer.

        There are two ways to make consumer record processing idempotent (unique):
    1. Kafka generic id: build a unique id from the record coordinates, e.g. String id = record.topic()+"-"+record.partition()+"-"+record.offset();
    2. Application-supplied unique value: use a unique value supplied by the producer inside the record itself.
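
An illustrative consumer sketch (group id, topic, and broker address are placeholders) that derives the generic id from topic-partition-offset and commits offsets only after processing the batch, i.e. at-least-once behavior:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IdempotentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-app");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we commit after processing
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Unique id from topic-partition-offset, usable as a de-duplication key downstream.
                    String id = record.topic() + "-" + record.partition() + "-" + record.offset();
                    System.out.println("processing " + id + ": " + record.value());
                }
                consumer.commitSync(); // at-least-once: commit only after the batch is processed
            }
        }
    }
}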

        Consumer offset commit strategy:
    • enable.auto.commit=true & synchronous processing of batches: offsets are committed automatically for you at a regular interval (auto.commit.interval.ms=5000 by default) every time you call .poll(). If you don't use synchronous processing, you end up with "at-most-once" behavior because offsets are committed before your data is processed.
    • enable.auto.commit=false & synchronous processing of batches: you control when you commit offsets and what the condition for committing them is.
        Consumer offset reset strategy:
    • auto.offset.reset=latest // will read from the end of the log
    • auto.offset.reset=earliest // will read from the start of the log
    • auto.offset.reset=none // will throw an exception if no offset is found
If a consumer hasn't read new data in 7 days, its offsets can be lost; this is controlled by the broker setting offsets.retention.minutes.
To replay data for a consumer group:
    • Take all the consumers of that group down
    • Use the kafka-consumer-groups command to set the offsets to what you want
    • Restart the consumers

        Poll Behavior:

                fetch.min.bytes (default 1):
      • Controls how much data you want to pull at least on each request
      • Helps improve throughput and decrease the number of requests
      • At the cost of latency
                max.poll.records (default 500):
      • Controls how many records to receive per poll request
      • Increase it if your messages are very small and you have a lot of available RAM
      • Good to monitor how many records are polled per request
                Considerations:
      • Set a proper data retention period & offset retention period
      • Ensure the auto offset reset behavior is the one you expect/want
      • Use the replay capability in case of unexpected behavior

Zookeeper: 

  • Manages brokers (keeps a list of them)
  • It helps in performing leader election for partitions.
  • It sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, topic deleted, etc.)
  • Kafka cannot run without Zookeeper.
  • By design it operates with an odd number of servers.
  • It has a leader (the leader handles writes from the brokers); the rest of the servers are followers (they handle reads).
  • Zookeeper does not store consumer offsets (with Kafka > v0.10); they are stored in Kafka itself.

Kafka Guarantees:

  • Messages are appended to a topic-partition in the order they are sent.
  • Consumers read messages in the order stored in a topic-partition.
  • With a replication factor of N, producers and consumers can tolerate up to N-1 brokers being down.
  • As long as the number of partitions remains constant for a topic, the same key will always go to the same partition.



Monday 11 January 2021

Linux Awk scripting cheatsheet

 What is awk? 

It’s a full scripting language, as well as a complete text manipulation toolkit for the command line.

Awk is used to transform data files and produce formatted reports.

The way it works:
  • Scans a file line by line
  • Splits each input line into fields
  • Compares input lines/fields to patterns
  • Performs actions on matched lines
In the terminal, if you type awk and hit enter, you should see the below output, which shows the parameters it accepts and the format of the command.

/$ awk
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options:          GNU long options: (standard)
        -f progfile             --file=progfile
        -F fs                   --field-separator=fs
        -v var=val              --assign=var=val
Short options:          GNU long options: (extensions)
        -b                      --characters-as-bytes
        -c                      --traditional
        -C                      --copyright
        -d[file]                --dump-variables[=file]
        -e 'program-text'       --source='program-text'
        -E file                 --exec=file
        -g                      --gen-pot
        -h                      --help
        -L [fatal]              --lint[=fatal]
        -n                      --non-decimal-data
        -N                      --use-lc-numeric
        -O                      --optimize
        -p[file]                --profile[=file]
        -P                      --posix
        -r                      --re-interval
        -S                      --sandbox
        -t                      --lint-old
        -V                      --version

To report bugs, see node `Bugs' in `gawk.info', which is
section `Reporting Problems and Bugs' in the printed version.

gawk is a pattern scanning and processing language.
By default it reads standard input and writes standard output.

Examples:
        gawk '{ sum += $1 }; END { print sum }' file
        gawk -F: '{ print $1 }' /etc/passwd

Create a file in any directory you choose with the following contents
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
D,CD,BCD,98
F,GH,ABC,XYZ,LF

$awk -F, '{ print }' file // -F specifies the field separator; here the separator is ,
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
D,CD,BCD,98
F,GH,ABC,XYZ,LF

$awk -F',' '{ print $1}' file
A
B
C
D
F

$0: Represents the entire line of text.
$1: Represents the first field.
$2: Represents the second field.
$7: Represents the seventh field.
$45: Represents the 45th field.

$awk -F',' '{ print $1, $3}' file
A  ABC
B  CBA
C  ACB
D  BCD
F  ABC

The OFS (output field separator) variable puts a separator between output fields
$awk -F','  'OFS="/" { print $1, $3}' file
A/ ABC
B/ CBA
C/ ACB
D/ BCD
F/ ABC

Replacing all the values of column 2
$awk -F',' '{$2="1";print }' file
A 1  ABC  ABCD
B 1  CBA  C200
C 1  ACB  100b
D 1  BCD  98
F 1  ABC  XYZ  LF

Replacing all the values of column 2 and putting quotes around them
$awk -F, '{$2="\"1\"";print }' file
A "1"  ABC  ABCD
B "1"  CBA  C200
C "1"  ACB  100b
D "1"  BCD  98
F "1"  ABC  XYZ  LF

Number of fields (NF) per row after splitting by ,
$awk -F, '{ print NF }' file
4
4
4
4
5

A BEGIN rule is executed once before any text processing starts. In fact, it’s executed before awk even reads any text. An END rule is executed after all processing has completed. You can have multiple BEGIN and END rules, and they’ll execute in order.
$awk  -F',' 'BEGIN {print "Hello world"} { print $0}' file
Hello world
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
D,CD,BCD,98
F,GH,ABC,XYZ,LF


$awk 'END { print NR } { print }' file
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
D,CD,BCD,98
F,GH,ABC,XYZ,LF
5

To print each line along with its row number (NR)
$awk -F, '{ print NR ", " $0 }' file
1,A,AB,ABC,ABCD
2,B,BA,CBA,C200
3,C,AC,ACB,100b
4,D,CD,BCD,98
5,F,GH,ABC,XYZ,LF

Conditions and regular expressions

$awk -F, '$4 > 90 { print }' file
D,CD,BCD,98

$awk -F, '$3 ~ /A/ { print $0 }' file
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b
F,GH,ABC,XYZ,LF

$awk -F, '$3 ~ /^A/ { print $0 }' file
A,AB,ABC,ABCD
C,AC,ACB,100b
F,GH,ABC,XYZ,LF

for loops in awk:
$awk 'BEGIN { for(i=1;i<=6;i++) print "square of", i, "is",i*i; }'
square of 1 is 1
square of 2 is 4
square of 3 is 9
square of 4 is 16
square of 5 is 25
square of 6 is 36

$awk -F, 'length($4) > 3' file
A,AB,ABC,ABCD
B,BA,CBA,C200
C,AC,ACB,100b

awk if conditions
$awk -F, '{ if($4 == "ABCD") print $0;}' file
A,AB,ABC,ABCD